samueldr changed the topic of #nixos-dev to: #nixos-dev NixOS Development (#nixos for questions) | NixOS 19.09 is released! https://discourse.nixos.org/t/nixos-19-09-release/4306 | https://hydra.nixos.org/jobset/nixos/trunk-combined https://channels.nix.gsc.io/graph.html | https://r13y.com | 19.09 RMs: disasm, sphalerite | https://logs.nix.samueldr.com/nixos-dev
drakonis has joined #nixos-dev
drakonis1 has joined #nixos-dev
__monty__ has quit [Quit: leaving]
orivej has quit [Ping timeout: 265 seconds]
orivej has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 246 seconds]
Synthetica has quit [Quit: Connection closed for inactivity]
drakonis1 has quit [Quit: WeeChat 2.6]
drakonis has joined #nixos-dev
cjpbirkbeck has joined #nixos-dev
phreedom has quit [Remote host closed the connection]
drakonis has quit [Quit: WeeChat 2.6]
phreedom has joined #nixos-dev
orivej has quit [Ping timeout: 240 seconds]
justan0theruser is now known as justanotheruser
cjpbirkbeck has quit [Quit: Quitting now.]
justanotheruser has quit [Ping timeout: 240 seconds]
justanotheruser has joined #nixos-dev
orivej has joined #nixos-dev
Jackneill has joined #nixos-dev
ddima_ has joined #nixos-dev
psyanticy has joined #nixos-dev
__monty__ has joined #nixos-dev
Mic92 has joined #nixos-dev
edwtjo has joined #nixos-dev
edwtjo has joined #nixos-dev
edwtjo has quit [Changing host]
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 240 seconds]
drakonis_ has joined #nixos-dev
drakonis1 has joined #nixos-dev
drakonis has quit [Read error: Connection reset by peer]
drakonis1 has quit [Read error: Connection reset by peer]
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 246 seconds]
drakonis_ has joined #nixos-dev
drakonis has quit [Read error: Connection reset by peer]
drakonis has joined #nixos-dev
drakonis_ has quit [Read error: Connection reset by peer]
drakonis_ has joined #nixos-dev
ckauhaus has joined #nixos-dev
drakonis has quit [Ping timeout: 245 seconds]
drakonis has joined #nixos-dev
drakonis1 has joined #nixos-dev
drakonis_ has quit [Ping timeout: 250 seconds]
<__red__> the only /clear
drakonis has quit [Ping timeout: 245 seconds]
drakonis has joined #nixos-dev
drakonis1 has quit [Ping timeout: 276 seconds]
drakonis1 has joined #nixos-dev
cptchaos83 has quit [Remote host closed the connection]
xwvvvvwx has quit [Ping timeout: 252 seconds]
cptchaos83 has joined #nixos-dev
xwvvvvwx has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis1 has quit [Ping timeout: 276 seconds]
drakonis has quit [Ping timeout: 250 seconds]
drakonis_ has quit [Ping timeout: 250 seconds]
drakonis1 has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 245 seconds]
eraserhd2 is now known as eraserhd
phreedom_ has joined #nixos-dev
phreedom has quit [Ping timeout: 260 seconds]
drakonis1 has quit [Ping timeout: 246 seconds]
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 240 seconds]
drakonis_ has quit [Read error: Connection reset by peer]
drakonis_ has joined #nixos-dev
drakonis_ has quit [Ping timeout: 245 seconds]
drakonis has joined #nixos-dev
<thoughtpolice> Here's a weird question: has anybody ever seen a +5GB .nar file uh, in the wild? Like ever?
<gchristensen> yeah
<gchristensen> libguestfs' appliance is huge as a nar
<thoughtpolice> I think it'd be impossibly expensive to query our S3 for that info so I figured I'd ask
<gchristensen> hydra will not publish such a large nar
<thoughtpolice> Oh, I mean downloaded from the cache. But in theory any .nar file will do
<thoughtpolice> I see.
<gchristensen> it fails the build if the result is too large
<thoughtpolice> Do you know what that limit is by chance? Or should I just dig in the code?
<samueldr> 2.2 GiB IIRC
<thoughtpolice> For background: Fastly has an upper limit on the size of objects the cache can serve. By default it's 2GB; with our settings, it's 5GB. Finally, we do have a way to enable arbitrarily large objects, but with a few minor downsides
<thoughtpolice> (None of those downsides are relevant in our case, at least)
<samueldr> >> maxOutputSize(config->getIntOption("max_output_size", 2ULL << 30))
<thoughtpolice> So I was just curious while going over my TODO list
<thoughtpolice> (Anything that goes beyond the 2gb/5gb limit gets `503`'d in response, which is obviously very unfriendly)
<thoughtpolice> samueldr: Thanks! Good to know
<samueldr> I was wrong, 2.0 GiB
<samueldr> the 2.2 is in GB :)
<thoughtpolice> So, in theory yes it could happen in the wild for *somebody* (e.g. I just `nix copy --all` to S3), but in practice no, it cannot happen for cache.nixos.org
<thoughtpolice> That's basically my question! Good to know
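A quick sanity check on the quoted default (simple arithmetic on the source line samueldr pasted, nothing more):
    2ULL << 30  =  2 × 2^30 bytes  =  2,147,483,648 bytes  =  2.0 GiB  ≈  2.1 GB
which is why the limit reads as 2.0 in binary GiB units and a little over 2 in decimal GB.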
Jackneill has quit [Remote host closed the connection]
<gchristensen> I just set up {^_^} to forward prometheus alerts to this channel, if it becomes even a little bit annoying I'll turn it off: https://github.com/NixOS/nixos-org-configurations/commit/8e28e502a23dca75146014773a6481d1496c9a10
<gchristensen> (and I'll redirect it to another channel) ... (thank you tilpner for writing this code)
<samueldr> what kinds of alerts might be seen?
<gchristensen> the message will look something like this: Notice({^_^}): firing: ThisAlertIsATest: https://status.nixos.org/prometheus/alerts and it could be any of the alerts you see on this page: https://status.nixos.org/prometheus/alerts
<gchristensen> additional alerts can be made of course, based on data available here: https://status.nixos.org/prometheus/graph by sending a PR to this list: https://github.com/NixOS/nixos-org-configurations/blob/8e28e502a23dca75146014773a6481d1496c9a10/delft/eris.nix#L101
<samueldr> great, thanks gchristensen++ and tilpner++
<{^_^}> gchristensen's karma got increased to 176, tilpner's karma got increased to 55
<samueldr> those kinds of alerts should help us be more proactive when they happen
<gchristensen> here are some alerts which have triggered since these alerts have been configured: https://status.nixos.org/prometheus/graph?g0.range_input=2d&g0.expr=ALERTS&g0.tab=0
<thoughtpolice> gchristensen: Cool! I've also been playing with Grafana for the new cache logs a bit, so hopefully I can cook up some dashboards to go along with it. Haven't looked at prom for monitoring yet.
<gchristensen> right on
<gchristensen> grafana is nice, prometheus is nice
<thoughtpolice> Speaking of that, here's a fun SQL query: https://usercontent.irccloud-cdn.com/file/LIdIWgKY/image.png
<gchristensen> there is quite a lot of data in our prometheus, I encourage people curious about the infrastructure to become familiar with how to query it
<thoughtpolice> That same basic query (slightly tweaked) works fine in the Grafana ClickHouse plugin, so we can track latencies by DCs that a user hits. :)
<thoughtpolice> (Though I don't have enough data to show a very fancy Grafana panel yet...)
<LnL> prometheus is nice, but a little weird to get used to in beginning
<gchristensen> for example on the 14th Hydra built 110,000 builds, totalling a bit over 1TB of build results: https://status.nixos.org/prometheus/graph?g0.range_input=3d&g0.end_input=2019-11-16%2005%3A16&g0.expr=sum(increase(hydra_steps_done_total%5B24h%5D))&g0.tab=0&g1.range_input=3d&g1.end_input=2019-11-16%2005%3A16&g1.expr=sum(increase(hydra_store_nar_write_bytes_total%5B24h%5D))&g1.tab=0
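URL-decoded for readability, the two expressions behind that link are (taken straight from the query string, not new queries):
    sum(increase(hydra_steps_done_total[24h]))
    sum(increase(hydra_store_nar_write_bytes_total[24h]))
The first counts build steps completed over the trailing 24 hours, the second totals NAR bytes written in the same window.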
<tilpner> \o/
<{^_^}> tilpner: 4 days, 3 hours ago <gchristensen> to ping me :)
<tilpner> Huh
<tilpner> I definitely spoke today already
<gchristensen> tilpner: I told you I asked {^_^} to let you know :P
<tilpner> infinisil: Is ,tell per-channel?
<thoughtpolice> (It's actually more like "What are the avg, p95, p99 latencies, broken down by DC, with request count, over the last week", which is pretty specific)
<tilpner> (I think it is)
<thoughtpolice> Interesting, a nice Prom metric endpoint for ClickHouse: https://github.com/percona-lab/clickhouse_exporter
<thoughtpolice> I'll have to see if I can set that up.
<tilpner> gchristensen: node_systemd_unit_state{state="failed"}==1 is useful (but perhaps too trigger-happy)
<samueldr> neat, we should set up an "appliance" Raspberry Pi disk image that could be used to show a dashboard on a display
<gchristensen> that'd be cool :)
<tilpner> gchristensen: Make sure to set repeat_interval to something higher than 1m
<gchristensen> oh?
<tilpner> That was just useful for testing, not necessarily a recommendation
<gchristensen> oh I see
<tilpner> It doesn't actually send alerts every minute
<tilpner> But it's still safer, wouldn't want to spam this channel just because you're sleeping
<tilpner> :)
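For context, the Alertmanager knob tilpner is pointing at looks roughly like this (a sketch only; the receiver name is hypothetical and the 4h value is illustrative, not taken from the real nixos-org configuration):
    route:
      receiver: irc-nixos-dev    # hypothetical receiver that posts through {^_^}
      repeat_interval: 4h        # wait at least this long before re-notifying for a still-firing alert
With a very short repeat_interval (1m was used while testing), a long-lived alert could be re-announced to the channel much more often than anyone wants.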
<thoughtpolice> gchristensen: Okay, here's an interesting question I have that I wonder if Prom can answer?
<gchristensen> shoot
<thoughtpolice> "How many files were uploaded into S3"
<thoughtpolice> I bet we can't track that yet because of `nix copy`, but is there anything close?
<gchristensen> https://status.nixos.org/prometheus/graph?g0.range_input=1d&g0.end_input=2019-11-16%2005%3A16&g0.expr=rate(hydra_store_s3_put_total%5B24h%5D)&g0.tab=0&g1.range_input=3d&g1.end_input=2019-11-16%2005%3A16&g1.expr=sum(increase(hydra_store_nar_write_bytes_total%5B24h%5D))&g1.tab=0 this graph shows the number of uploads per second
<thoughtpolice> Amazing! In that chart, the data comes from Hydra, so I assume "upload" means "upload one object into the cache?", as in a .nar file?
<gchristensen> bytes written per second: https://status.nixos.org/prometheus/graph?g0.range_input=1d&g0.end_input=2019-11-16%2005%3A16&g0.expr=rate(hydra_store_s3_put_bytes_total%5B5m%5D)&g0.tab=0&g1.range_input=3d&g1.end_input=2019-11-16%2005%3A16&g1.expr=sum(increase(hydra_store_nar_write_bytes_total%5B24h%5D))&g1.tab=0 how many seconds per second hydra spends writing to s3:
<gchristensen> https://status.nixos.org/prometheus/graph?g0.range_input=1d&g0.end_input=2019-11-16%2005%3A16&g0.expr=rate(hydra_store_s3_put_seconds_total%5B5m%5D)&g0.tab=0&g1.range_input=3d&g1.end_input=2019-11-16%2005%3A16&g1.expr=sum(increase(hydra_store_nar_write_bytes_total%5B24h%5D))&g1.tab=0
<thoughtpolice> If so, then I suppose ((that number) * 3) might be a good approximation: one for the .nar file, one for the .ls, one for the .narinfo
<thoughtpolice> Well and the log/debug files. So 4x minimum.
<gchristensen> I don't know for certain, but I think that number is the number of objects and not nars
<thoughtpolice> Oh!! That's even better then.
<thoughtpolice> I'll look that over. For some background, I'm still investigating a way to purge stale 404s from the cache when an upload occurs. The trick is just making sure you don't suddenly like, upload 20,000 objects in a 2min interval and then try to purge 20,000 things really fast and rate limit yourself somewhere.
<thoughtpolice> So knowing what the rate of upload/file churn is, is really useful.
<thoughtpolice> (It's possible it's too irregular or tedious to figure this out, so I might abandon the whole idea and we can just keep reasonably low TTLs for 404s, but the knowledge helps a lot!)
<thoughtpolice> gchristensen++
<{^_^}> gchristensen's karma got increased to 177
<thoughtpolice> gchristensen: Is there a way to get the total over an interval as well? A sum or something?
<thoughtpolice> oh the _total one I think?
<thoughtpolice> _seconds_total
drakonis_ has joined #nixos-dev
<thoughtpolice> Yes, so I think the rate of uploads per second will certainly hit some of our hourly rate limits in the long run if we want to purge-on-upload. That's unfortunate.
<thoughtpolice> At ~4 per second over a 1hr interval, that's actually about 14x our default purge API limit, I think. :)
<gchristensen> one sec thoughtpolice
<gchristensen> thoughtpolice: what is the purge limit?
ixxie has joined #nixos-dev
<thoughtpolice> 1000/hr, IIRC. I believe purges are counted in that, not separately.
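Putting the two figures just quoted together (rough arithmetic only, not measured data):
    ~4 uploads/second × 3600 s  ≈  14,400 objects/hour
    14,400 / 1,000 purge requests/hour  ≈  14×
which is where the "14x over our default purge API limit" estimate above comes from.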
<gchristensen> thoughtpolice:
<gchristensen> https://status.nixos.org/prometheus/graph?g0.range_input=2w&g0.expr=increase(hydra_store_s3_put_seconds_total%5B1h%5D)&g0.tab=0&g1.range_input=3d&g1.end_input=2019-11-16%2005%3A16&g1.expr=sum(increase(hydra_store_nar_write_bytes_total%5B24h%5D))&g1.tab=0
<gchristensen> oops
<gchristensen> increase(hydra_store_s3_put_seconds_total[1h]) shows how many are done per hour
<thoughtpolice> wow 40k in one hour. Whoo. Okay so the idea might be unworkable then. Good to know.
<gchristensen> ermm
<gchristensen> I'm not doing a good job sending you the right links.
<gchristensen> https://status.nixos.org/prometheus/graph?g0.range_input=2w&g0.expr=increase(hydra_store_s3_put_total%5B1h%5D)&g0.tab=0 let's try this one
<thoughtpolice> Mmmm, yeah, if I'm reading it right, then that's rough. I think unless I beg someone internally to get our API limits increased it's probably not going to work without some tricks, and a ~70k/hour burst is a pretty big ask.
<gchristensen> :)
noonien has joined #nixos-dev
<thoughtpolice> In theory if we could purge by surrogates we can do 256 at once. Which would be about ~274 requests, which is viable.
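Spelling that batching estimate out (again just arithmetic on the numbers in this conversation; 256 surrogate keys per purge request is the figure thoughtpolice cites):
    ~70,000 objects/hour ÷ 256 keys/request  ≈  274 purge requests/hour
which stays comfortably under a 1,000-requests/hour limit.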
<thoughtpolice> I'll have to think about it. Super duper useful, though! Thank you gchristensen
<gchristensen> note we wouldn't really need to invalidate all of them
<gchristensen> we'd only need to invalidate narinfo files
<thoughtpolice> That's true...
<gchristensen> but also I doubt fastly would like us invalidating thousands of paths an hour
<thoughtpolice> I think of it more like a challenge personally. Plus I have the really good begging angle of "I work on this and we need help :(" which is great.
<gchristensen> I think a better route though is to negatively cache 404s for less time
<gchristensen> under the assumption that few users will share the same set of 404ing paths
<thoughtpolice> I thought about that too, I think it's worth doing in any case.
<thoughtpolice> Yeah, even without extra-long TTLs, we can still do it now without hurting the origin too much, unlike before.
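For illustration, shortening the negative cache on the Fastly side could look something like this in a VCL snippet of type "fetch" (a sketch only; the 2-minute TTL is made up, not the value actually used for cache.nixos.org):
    if (beresp.status == 404) {
      # keep misses only briefly, so a path that gets uploaded later becomes visible quickly
      set beresp.ttl = 120s;
    }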
aminechikhaoui has quit [Quit: The Lounge - https://thelounge.github.io]
aminechikhaoui has joined #nixos-dev
<thoughtpolice> I asked internally if I could purge "just a few URLs, only 70,000 or so". We'll see what happens.
<gchristensen> lol
<gchristensen> "...per hour"
ixxie has quit [Read error: Connection timed out]
ixxie has joined #nixos-dev
red[evilred] has joined #nixos-dev
<red[evilred]> So, I'm reviewing a module as opposed to a package
<red[evilred]> is there a specific way to test that, other than hub pr checkout and add it to my configuration.nix
<red[evilred]> ?
<thoughtpolice> If there's a test you can run that.
<thoughtpolice> Or, well, you can ask borg to run it. But if you're reviewing it yes, trying it out yourself is always nice.
<samueldr> instead of your own configuration.nix, a fresh one you build-vm for
drakonis has quit [Read error: Connection reset by peer]
drakonis_ has quit [Read error: Connection reset by peer]
psyanticy has quit [Quit: Connection closed for inactivity]
<red[evilred]> the module is in a branch that's 48,000 commits behind master
<red[evilred]> and when I try to test it, I get all kinds of "missing" stuff
<red[evilred]> so - asking them to bring it current to master
<red[evilred]> is that reasonable?
<thoughtpolice> Hmmmm, interesting. I wonder if this is related to our HTTP/2 issues: https://github.com/curl/curl/issues/3750
<{^_^}> curl/curl#3750 (by TvdW, 32 weeks ago, closed): "Error in the HTTP2 framing layer" after 1000 requests
<thoughtpolice> The more distressing one is https://github.com/NixOS/nix/issues/2733, but interestingly the backtrace in that ticket also has curl 7.64 in it. That HTTP/2 fix (which has also popped up in our issues) however only went into curl 7.65 and later.
<{^_^}> nix#2733 (by noonien, 35 weeks ago, open): nix-channel segfault: Fatal error: glibc detected an invalid stdio handle
<thoughtpolice> Mmm, looking at the history of it all, probably not. That fix was much earlier this year before more recent reports like some cachix bugs
drakonis_ has joined #nixos-dev
red[evilred] has quit [Remote host closed the connection]
bridge[evilred] has quit [Remote host closed the connection]
bridge[evilred] has joined #nixos-dev
red[evilred] has joined #nixos-dev
drakonis_ has quit [Ping timeout: 240 seconds]
<thoughtpolice> gchristensen: Word on the street is that with batching, purging that many things should be fine! Not the most efficient, though. And the narinfo bit is a good insight.
<thoughtpolice> I think if we made the s3 copy code a little smarter, we could also make it substantially more efficient. In brief: the copy routine just needs to attach a few pieces of metadata to the things it uploads. Then we could for instance purge any 404s for the narinfo, the nar file, logs, and debug info all in one swoop.
<thoughtpolice> That would effectively negate the cost of uploading multiple files per nar
<thoughtpolice> Actually I take that back, that's probably not easily doable...
drakonis_ has joined #nixos-dev
ixxie has quit [Ping timeout: 265 seconds]
<__red__> gchristensen: I dropped you a brief question via /msg if you have a second :-) <4
<__red__> err
<__red__> <3
justan0theruser has joined #nixos-dev
justanotheruser has quit [Ping timeout: 240 seconds]
ckauhaus has quit [Quit: WeeChat 2.6]
Jackneill has joined #nixos-dev
<worldofpeace1> noo, I've created a mass rebuild on master.
<worldofpeace1> silly jackaudio #73779
<{^_^}> https://github.com/NixOS/nixpkgs/pull/73779 (by worldofpeace, 3 hours ago, merged): jack2: 1.9.13 -> 1.9.14, fix build arm
<gchristensen> did you realize eval hadn't finished?
<gchristensen> just let it go imo, we'll be ok
<worldofpeace1> lol, it did I think but I guess I failed to notice
<gchristensen> I wish we could make required checks only apply to PRs
<samueldr> I, too, would
<gchristensen> worldofpeace1: at 1,000 rebuilds I wouldn't have pushed a revert -- just would have realised the mistake and tried to not do it next time :)
<gchristensen> especially with the queue in such good shape https://status.nixos.org/grafana/d/MJw9PcAiz/hydra-jobs?refresh=30s&orgId=1
<worldofpeace1> gchristensen: the thing is... I kinda do this often 🤣
<worldofpeace1> I will check the queue next time though
Jackneill has quit [Remote host closed the connection]
__monty__ has quit [Quit: leaving]
<{^_^}> firing: RootPartitionNoFreeSpace4HrsAway: https://status.nixos.org/prometheus/alerts
<samueldr> that's what it looks like
<eyJhb> It looks mystical samueldr, what does it mean?
<eyJhb> That it will run out of space in 4 hours?
<samueldr> yeah, if everything continues working with the same rate, AFAIUI
<gchristensen> hrm, a bit annoying -- it flapped
<{^_^}> resolved: RootPartitionNoFreeSpace4HrsAway: https://status.nixos.org/prometheus/alerts
<eyJhb> gchristensen: "it flapped"?
<gchristensen> the alert fired (:47) and resolved itself (:53) within a few minutes
<gchristensen> alerts should be actionable and specific, and a flapping alert is harmful because it trains responders to ignore them
<eyJhb> Ah, makes sense! Thanks for explaining it gchristensen :)
<gchristensen> yep :)
<eyJhb> Conflicted between waiting for NixOS to build unstable, or go to bed..
<gchristensen> https://status.nixos.org/prometheus/graph?g0.range_input=30m&g0.expr=predict_linear(node_filesystem_avail_bytes%7Bmountpoint%3D%22%2F%22%2Cinstance%3D%22bd949fc7.packethost.net%22%7D%5B1h%5D%2C%204%20*%203600)%20%3C%200&g0.tab=0&g1.range_input=1h&g1.expr=node_filesystem_avail_bytes%7Bmountpoint%3D%22%2F%22%2Cinstance%3D%22bd949fc7.packethost.net%22%7D&g1.tab=0 the linear prediction has a hard time with this second graph
<eyJhb> Wish my browser worked atm. so I could open the links...
<gchristensen> it looks like this -\_ :)
<eyJhb> Ohh...
<tilpner> gchristensen: Set for = x; to some longer x
<eyJhb> Think I will head off to bed, my computer is useless atm. Night :)
<tilpner> That way the expr needs to trigger for that period of time before being considered firing
<tilpner> But that might be useless if these are long bursts of high write speeds
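In plain Prometheus rule form, tilpner's suggestion amounts to something like this (a sketch; the expression is the one from gchristensen's link with the instance selector dropped, and the 30m hold time is only an example, not what the real rules in nixos-org-configurations use):
    - alert: RootPartitionNoFreeSpace4HrsAway
      expr: 'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0'
      for: 30m    # the expression must stay true this long before the alert fires
As tilpner notes, though, if the bursts of writes last long enough the alert will still fire, so this is a mitigation for flapping rather than a complete fix.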
red[evilred] has quit [Quit: Idle timeout reached: 10800s]
drakonis_ has quit [Ping timeout: 246 seconds]