Synthetica has quit [Quit: Connection closed for inactivity]
drakonis1 has quit [Quit: WeeChat 2.6]
drakonis has joined #nixos-dev
cjpbirkbeck has joined #nixos-dev
phreedom has quit [Remote host closed the connection]
drakonis has quit [Quit: WeeChat 2.6]
phreedom has joined #nixos-dev
orivej has quit [Ping timeout: 240 seconds]
justan0theruser is now known as justanotheruser
cjpbirkbeck has quit [Quit: Quitting now.]
justanotheruser has quit [Ping timeout: 240 seconds]
justanotheruser has joined #nixos-dev
orivej has joined #nixos-dev
Jackneill has joined #nixos-dev
ddima_ has joined #nixos-dev
psyanticy has joined #nixos-dev
__monty__ has joined #nixos-dev
Mic92 has joined #nixos-dev
edwtjo has joined #nixos-dev
edwtjo has joined #nixos-dev
edwtjo has quit [Changing host]
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 240 seconds]
drakonis_ has joined #nixos-dev
drakonis1 has joined #nixos-dev
drakonis has quit [Read error: Connection reset by peer]
drakonis1 has quit [Read error: Connection reset by peer]
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 246 seconds]
drakonis_ has joined #nixos-dev
drakonis has quit [Read error: Connection reset by peer]
drakonis has joined #nixos-dev
drakonis_ has quit [Read error: Connection reset by peer]
drakonis_ has joined #nixos-dev
ckauhaus has joined #nixos-dev
drakonis has quit [Ping timeout: 245 seconds]
drakonis has joined #nixos-dev
drakonis1 has joined #nixos-dev
drakonis_ has quit [Ping timeout: 250 seconds]
<__red__>
the only /clear
drakonis has quit [Ping timeout: 245 seconds]
drakonis has joined #nixos-dev
drakonis1 has quit [Ping timeout: 276 seconds]
drakonis1 has joined #nixos-dev
cptchaos83 has quit [Remote host closed the connection]
xwvvvvwx has quit [Ping timeout: 252 seconds]
cptchaos83 has joined #nixos-dev
xwvvvvwx has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis1 has quit [Ping timeout: 276 seconds]
drakonis has quit [Ping timeout: 250 seconds]
drakonis_ has quit [Ping timeout: 250 seconds]
drakonis1 has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 245 seconds]
eraserhd2 is now known as eraserhd
phreedom_ has joined #nixos-dev
phreedom has quit [Ping timeout: 260 seconds]
drakonis1 has quit [Ping timeout: 246 seconds]
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 240 seconds]
drakonis_ has quit [Read error: Connection reset by peer]
drakonis_ has joined #nixos-dev
drakonis_ has quit [Ping timeout: 245 seconds]
drakonis has joined #nixos-dev
<thoughtpolice>
Here's a weird question: has anybody ever seen a +5GB .nar file uh, in the wild? Like ever?
<gchristensen>
yeah
<gchristensen>
libguestfs' appliance is huge as a nar
<thoughtpolice>
I think it'd be impossibly expensive to query our S3 for that info so I figured I'd ask
<gchristensen>
hydra will not publish such a large nar
<thoughtpolice>
Oh, I mean downloaded from the cache. But in theory any .nar file will do
<thoughtpolice>
I see.
<gchristensen>
it fails the build if the result is too large
<thoughtpolice>
Do you know what that limit is by chance? Or should I just dig in the code?
<samueldr>
2.2 GiB IIRC
<thoughtpolice>
For background: Fastly caps the size of objects the cache can serve. The default cap is 2GB; with our settings it's 5GB. Finally, we do have a way to enable arbitrarily large objects, but with a few minor downsides
<thoughtpolice>
(None of those downsides are relevant in our case, at least)
<thoughtpolice>
So I was just curious while going over my TODO list
<thoughtpolice>
(Anything that goes beyond the 2gb/5gb limit gets `503`'d in response, which is obviously very unfriendly)
<thoughtpolice>
samueldr: Thanks! Good to know
<samueldr>
I was wrong, 2.0 GiB
<samueldr>
the 2.2 is in GB :)
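(Unit check: 2.0 GiB is 2³¹ bytes ≈ 2.15 GB, so the two figures are the same limit quoted in different units.)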
<thoughtpolice>
So, in theory yes it could happen in the wild for *somebody* (e.g. I just `nix copy --all` to S3), but in practice no, it cannot happen for cache.nixos.org
<thoughtpolice>
That's basically my question! Good to know
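As an aside, the S3 scan thoughtpolice calls expensive is conceptually just a paginated bucket listing; a minimal sketch with boto3 follows. The bucket name, prefix, and credentials are assumptions (none are given in the conversation), and the expense in question is the sheer number of list pages on a bucket this size, not the code's complexity.

```python
# Sketch: find .nar objects over the 5 GiB Fastly ceiling in an S3 bucket.
# Assumes AWS credentials in the environment; "nix-cache" and the "nar/"
# prefix are assumed names. Listing is paginated (1000 keys per page), so a
# huge bucket means a huge number of requests -- the cost alluded to above.
import boto3

LIMIT = 5 * 1024**3  # 5 GiB, the configured Fastly object-size ceiling

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="nix-cache", Prefix="nar/"):
    for obj in page.get("Contents", []):
        if obj["Size"] > LIMIT:
            print(obj["Key"], obj["Size"])
```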
Jackneill has quit [Remote host closed the connection]
<thoughtpolice>
gchristensen: Cool! I've also been playing with Grafana for the new cache logs a bit, so hopefully I can cook up some dashboards to go along with it. Haven't looked at prom for monitoring yet.
<gchristensen>
right on
<gchristensen>
grafana is nice, prometheus is nice
<gchristensen>
there is quite a lot of data in our prometheus, I encourage people curious about the infrastructure to become familiar with how to query it
<thoughtpolice>
That same basic query (slightly tweaked) works fine in the Grafana ClickHouse plugin, so we can track latencies by the DCs a user hits. :)
<thoughtpolice>
(Though I don't have enough data to show a very fancy Grafana panel yet...)
<LnL>
prometheus is nice, but a little weird to get used to in the beginning
<{^_^}>
tilpner: 4 days, 3 hours ago <gchristensen> to ping me :)
<tilpner>
Huh
<tilpner>
I definitely spoke today already
<gchristensen>
tilpner: I told you I asked {^_^} to let you know :P
<tilpner>
infinisil: Is ,tell per-channel?
<thoughtpolice>
(It's actually more like "What are the avg, p95, p99 latencies, broken down by DC, with request count, over the last week", which is pretty specific)
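A rough sketch of what that query might look like against ClickHouse's HTTP interface follows; the table and column names (fastly_logs, datacenter, ttfb_ms) are invented for illustration, since the real log schema isn't shown in this conversation.

```python
# Sketch: "avg, p95, p99 latency by DC, with request count, over the last
# week" via ClickHouse's HTTP interface (port 8123 accepts SQL in the body).
# Schema names below are hypothetical.
import requests

SQL = """
SELECT
    datacenter,
    count() AS requests,
    avg(ttfb_ms) AS avg_ms,
    quantile(0.95)(ttfb_ms) AS p95_ms,
    quantile(0.99)(ttfb_ms) AS p99_ms
FROM fastly_logs
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY datacenter
ORDER BY requests DESC
"""

resp = requests.post("http://localhost:8123/", data=SQL)
resp.raise_for_status()
print(resp.text)
```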
<thoughtpolice>
I'll look that over. For some background, I'm still investigating a way to purge stale 404s from the cache when an upload occurs. The trick is just making sure you don't suddenly, like, upload 20,000 objects in a 2min interval and then try to purge 20,000 things really fast and rate-limit yourself somewhere.
<thoughtpolice>
So knowing what the rate of upload/file churn is, is really useful.
<thoughtpolice>
(It's possible it's too irregular or tedious to figure this out, so I might abandon the whole idea and we can just keep reasonably low TTLs for 404s, but the knowledge helps a lot!)
<thoughtpolice>
gchristensen++
<{^_^}>
gchristensen's karma got increased to 177
<thoughtpolice>
gchristensen: Is there a way to get the total over an interval as well? A sum or something?
<thoughtpolice>
oh the _total one I think?
<thoughtpolice>
_seconds_total
drakonis_ has joined #nixos-dev
<thoughtpolice>
Yes, so I think the rate of uploads per second will certainly hit some of our hourly rate limits in the long run if we want to purge-on-upload. That's unfortunate.
<thoughtpolice>
At ~4 per second over a 1hr interval, that's actually 14x over our default purge API limit, I think. :)
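(Checking the arithmetic: ~4 purges/second sustained for an hour is 4 × 3600 = 14,400 requests, and 14,400 / 1,000 ≈ 14.4x the hourly limit quoted just below.)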
<gchristensen>
one sec thoughtpolice
<gchristensen>
thoughtpolice: what is the purge limit?
ixxie has joined #nixos-dev
<thoughtpolice>
1000/hr, IIRC. I believe purges are counted in that, not separately.
<thoughtpolice>
Mmmm, yeah, if I'm reading it right, that's rough. I think unless I beg someone internally to get our API limits increased, it's probably not going to work without some tricks, and a ~70k/hour burst is a pretty big ask.
<gchristensen>
:)
noonien has joined #nixos-dev
<thoughtpolice>
In theory, if we purge by surrogate keys we can do 256 at once, which would be about ~274 requests. That's viable.
<thoughtpolice>
I'll have to think about it. Super duper useful, though! Thank you gchristensen
<gchristensen>
note we wouldn't really need to invalidate all of them
<gchristensen>
we'd only need to invalidate narinfo files
<thoughtpolice>
That's true...
<gchristensen>
but also I doubt fastly would like us invalidating thousands of paths an hour
<thoughtpolice>
I think of it more like a challenge personally. Plus I have the really good begging angle of "I work on this and we need help :(" which is great.
<gchristensen>
I think a better route though is to negatively cache 404s for less time
<gchristensen>
under the assumption that few users will share the same set of 404ing paths
<thoughtpolice>
I thought about that too, I think it's worth doing in any case.
<thoughtpolice>
Yeah, even without extra-long TTLs, we can still do it without hurting the origin as much as before.
<{^_^}>
curl/curl#3750 (by TvdW, 32 weeks ago, closed): "Error in the HTTP2 framing layer" after 1000 requests
<thoughtpolice>
The more distressing one is https://github.com/NixOS/nix/issues/2733, but interestingly the backtrace in that ticket also has curl 7.64 in it. That HTTP/2 fix (which has also popped up in our issues) however only went into curl 7.65 and later.
<thoughtpolice>
Mmm, looking at the history of it all, probably not. That fix landed much earlier this year, before more recent reports like some cachix bugs
drakonis_ has joined #nixos-dev
red[evilred] has quit [Remote host closed the connection]
bridge[evilred] has quit [Remote host closed the connection]
bridge[evilred] has joined #nixos-dev
red[evilred] has joined #nixos-dev
drakonis_ has quit [Ping timeout: 240 seconds]
<thoughtpolice>
gchristensen: Word on the street is that with batching, purging that many things should be fine! Not the most efficient though. And the narinfo bit is a good insight.
<thoughtpolice>
I think if we made the s3 copy code a little smarter, we could also make it substantially more efficient. In brief: the copy routine just needs to attach a few pieces of metadata to the things it uploads. Then we could for instance purge any 404s for the narinfo, the nar file, logs, and debug info all in one swoop.
<thoughtpolice>
That would effectively negate the cost of uploading multiple files per nar
<thoughtpolice>
Actually I take that back, that's probably not easily doable...
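The batching mentioned above is roughly this shape: a sketch against Fastly's batch surrogate-key purge endpoint, which takes up to 256 keys per request (matching the ~274-request estimate earlier). The service ID and token are environment-variable placeholders, and the key scheme itself is whatever the CDN config would attach.

```python
# Sketch: batch-purging by surrogate key, 256 keys per request as discussed
# above. FASTLY_SERVICE_ID / FASTLY_API_TOKEN are placeholders.
import os
import requests

API = "https://api.fastly.com"
SERVICE_ID = os.environ["FASTLY_SERVICE_ID"]  # placeholder
TOKEN = os.environ["FASTLY_API_TOKEN"]        # placeholder

def purge_surrogate_keys(keys):
    # Fastly's batch purge accepts up to 256 surrogate keys at once, so
    # ~70,000 keys comes out to the ~274 requests mentioned earlier.
    for i in range(0, len(keys), 256):
        batch = keys[i:i + 256]
        resp = requests.post(
            f"{API}/service/{SERVICE_ID}/purge",
            headers={"Fastly-Key": TOKEN, "Accept": "application/json"},
            json={"surrogate_keys": batch},
        )
        resp.raise_for_status()
```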
drakonis_ has joined #nixos-dev
ixxie has quit [Ping timeout: 265 seconds]
<__red__>
gchristensen: I dropped you a brief question via /msg if you have a second :-) <4
<__red__>
err
<__red__>
<3
justan0theruser has joined #nixos-dev
justanotheruser has quit [Ping timeout: 240 seconds]
ckauhaus has quit [Quit: WeeChat 2.6]
Jackneill has joined #nixos-dev
<worldofpeace1>
noo, I've created a mass rebuild on master.
<gchristensen>
did you realize eval hadn't finished?
<gchristensen>
just let it go imo, we'll be ok
<worldofpeace1>
lol, it did I think but I guess I failed to notice
<gchristensen>
I wish we could make required checks only apply to PRs
<samueldr>
I, too, would
<gchristensen>
worldofpeace1: at 1,000 rebuilds I wouldn't have pushed a revert -- just would have realised the mistake and tried to not do it next time :)
<gchristensen>
the alert fired (:47) and resolved itself (:53) within a few minutes
<gchristensen>
alerts should be actionable and specific, and a flapping alert is harmful because it trains responders to ignore them
<eyJhb>
Ah, makes sense! Thanks for explaining it gchristensen :)
<gchristensen>
yep :)
<eyJhb>
Conflicted between waiting for NixOS to build unstable or going to bed...
<gchristensen>
https://status.nixos.org/prometheus/graph?g0.range_input=30m&g0.expr=predict_linear(node_filesystem_avail_bytes%7Bmountpoint%3D%22%2F%22%2Cinstance%3D%22bd949fc7.packethost.net%22%7D%5B1h%5D%2C%204%20*%203600)%20%3C%200&g0.tab=0&g1.range_input=1h&g1.expr=node_filesystem_avail_bytes%7Bmountpoint%3D%22%2F%22%2Cinstance%3D%22bd949fc7.packethost.net%22%7D&g1.tab=0 the linear prediction has a hard time with this second graph
<eyJhb>
Wish my browser worked atm. so I could open the links...
<gchristensen>
it looks like this -\_ :)
<eyJhb>
Ohh...
<tilpner>
gchristensen: Set `for: x` to some longer x
<eyJhb>
Think I will head off to bed, my computer is useless atm. Night :)
<tilpner>
That way the expr needs to trigger for that period of time before being considered firing
<tilpner>
But that might be useless if these are long bursts of high write speeds
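For anyone unfamiliar, tilpner is describing the `for:` field of a Prometheus alerting rule: the expression must stay true for the whole duration before the alert fires, which filters out brief dips like the -\_ shape above. The same debounce semantics, sketched as a standalone poller against the status.nixos.org query API; only the predict_linear expression is taken from the link above, and the 10-minute hold is arbitrary.

```python
# Sketch of "for:" semantics: only fire once the expression has been true
# continuously for HOLD seconds. The expression is from the graph link above.
import time
import requests

PROM = "https://status.nixos.org/prometheus"
EXPR = (
    'predict_linear(node_filesystem_avail_bytes{mountpoint="/",'
    'instance="bd949fc7.packethost.net"}[1h], 4 * 3600) < 0'
)
HOLD = 10 * 60  # the "for:" duration: condition must hold this long

pending_since = None
while True:
    result = requests.get(
        f"{PROM}/api/v1/query", params={"query": EXPR}
    ).json()["data"]["result"]
    if result:  # a filter expression returns samples only while true
        pending_since = pending_since or time.time()
        if time.time() - pending_since >= HOLD:
            print("FIRING: disk predicted to fill within 4h")
    else:       # condition cleared before the hold elapsed: reset the timer
        pending_since = None
    time.sleep(60)
```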
red[evilred] has quit [Quit: Idle timeout reached: 10800s]