samueldr changed the topic of #nixos-infra to: NixOS infrastructure | logs: https://logs.nix.samueldr.com/nixos-infra/
cole-h has quit [Ping timeout: 240 seconds]
<sterni> This jobset can be cleaned up, the PR it was testing has since been abandoned https://hydra.nixos.org/jobset/nixpkgs/pr-91557
<lukegb> update-nixos-unstable.service is broken again :'(
<lukegb> oh, is this the nix bug again
<lukegb> oh no, it's different, this is a segfault
<gchristensen> Wed 2021-04-28 12:24:39 CEST 27973 497 100 11 missing /nix/store/gq1nqzaf210jibq7dgm57gka8d5d42gr-nixos-channel-native-programs/bin/generate-programs-index
<gchristensen> hm
<gchristensen> wrong day, and missing :)
* lukegb attempts to figure out how to run mirror-nixos-branch locally
<lukegb> aha I have figured out the magic incantation for generate-programs-index, let's see if it segfaults
<gchristensen> fingers crossed I think :)
<lukegb> Segmentation fault (core dumped)
<lukegb> bingo
<gchristensen> nice
<lukegb> aaand it truncated it
<gchristensen> great
<lukegb> I should probably just run it under gdb
<lukegb> this is possibly JSON that's so malformed that it's breaking nlohmann json
<lukegb> seems to have crashed on "/nix/store/rm19p00n8hvd63k9dd3yfbigc7rw8kqs-tela-icon-theme-2021-01-21"
cole-h has joined #nixos-infra
<lukegb> oh, hmm, that's concerning
<lukegb> https://hydra.nixos.org/build/142182308 this build seems to have generated corrupted brotli too
<lukegb> are all the hydra machines running nix < faa31f4 or > 8d651a1f?
<gchristensen> almost definitely not
<gchristensen> most of them run Nix stable
<lukegb> if they're running nix stable then they should be fine
<lukegb> I don't really know how things get from hydra builders to the s3 store
<gchristensen> builders -NAR->hydra-queue-runner->s3
<gchristensen> the machines that don't get rebuilt daily almost definitely run Nix unstable
<lukegb> is hydra-queue-runner built against a "good" version of nix
<lukegb> do we have a procedure for obliterating things from the cache?
<gchristensen> we don't, no
<gchristensen> and almost certainly yes, the queue runner is built against a good Nix
<gchristensen> but I'll check
<lukegb> hrm, interesting
<gchristensen> wait, actually, maybe not
<gchristensen> I forgot that flakes bring new interesting changes around unified nix versions
<lukegb> basically if we delete rm19p00n8hvd63k9dd3yfbigc7rw8kqs.ls from the cache, that should unstick us, at the cost of losing some metadata about the store path
<lukegb> OTOH that file is broken anyway
<lukegb> rm19p[...] corresponds to /nix/store/rm19p00n8hvd63k9dd3yfbigc7rw8kqs-tela-icon-theme-2021-01-21 (the hydra build I linked earlier)
<gchristensen> which machine did it run on?
<lukegb> packet builder; 2718a894.packethost.net
<lukegb> but I'm not sure at what point the .ls is generated - it might be in hydra-queue-runner
<lukegb> inside BinaryCacheStore::addToStoreCommon; I don't think it gets copied around, so I think it'll always get regenerated when adding to a new store?
<lukegb> gchristensen: thoughts about deleting the broken .ls file?
<lukegb> otherwise I'll just bump it out of the cache by making a pointless change to the derivation to change the hash
<gchristensen> lukegb: https://github.com/NixOS/hydra/blob/master/flake.lock#L12 does this fall within the range?
<lukegb> I think that's new enough to be > 8d651a1f
<lukegb> hrm
<lukegb> and that was 10 days ago too
<lukegb> (I assume it was deployed at that point as well?)
<gchristensen> almost definitely
<lukegb> hmm, so I can see: commit bumping nix on the 22nd, restarted probably on the 25th (looking at hydra_uptime_total prometheus graphs), and then https://hydra.nixos.org/build/142182308 on the 27th
<lukegb> although... does that make any sense? hydra.nixos.org's footer says 0.1.20210429.18d2716 but there wasn't a restart according to hydra_uptime_total after the 29th
<gchristensen> ummmm
<gchristensen> you know what
<gchristensen> sigh
<gchristensen> Active: active (running) since Sun 2021-04-25 17:37:25 CEST; 6 days ago
<lukegb> which should be fine, right? which nix is in its closure?
<gchristensen> /nix/store/p9ajdjlnparcgkkxssxg8qy4gad8awiq-nix-2.4pre20210422_d9864be
<lukegb> which should be safe
<gchristensen> yeah
<lukegb> bah, maybe there's still a bug somewhere with compression
<lukegb> I should write some tests
<lukegb> but first, lunch
<gchristensen> yes, tests would be very helpful
<lukegb> sorry for dragging you into this a lot 😓
<gchristensen> no worries :)
<gchristensen> I'm in it, whether you put me there or not :D
<lukegb> I can definitely reproduce the problem at b60b0d62d6a65ad8051a24cf4d4e6c50d27abf6a, but not at d9864be4b757468d33bc49edddce5e4f04ef4b90
<lukegb> rm19p00n8hvd63k9dd3yfbigc7rw8kqs.ls is a 4.7M file, which decompresses to 4.3GB of JSON
<gchristensen> holy shit lol
<gchristensen> what is in there?
<lukegb> (it's not _valid_ JSON mind you)
<lukegb> it's the NAR archive list thingy
<gchristensen> of what?
<lukegb> ah, so https://cache.nixos.org/rm19p00n8hvd63k9dd3yfbigc7rw8kqs.ls is "/nix/store/rm19p00n8hvd63k9dd3yfbigc7rw8kqs-tela-icon-theme-2021-01-21"'s archive manifest
<lukegb> so it's just a json tree of the files in the NAR
<lukegb> it's only about 77M of actual valid JSON, the rest is broken
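For readers who have not looked inside a `.ls` file, here is a minimal Python sketch of walking such a listing after it has been decompressed. The layout is an assumption, not something stated in the chat: a top-level `root` node, directories carrying an `entries` map, and regular files carrying a `size`. The `walk_listing` helper and the sample data are hypothetical and exist only for illustration.

```python
import json


def walk_listing(node, prefix=""):
    """Yield (path, size) for every regular file below a listing node.

    Assumes the node layout described above; symlinks and unknown
    node types are skipped in this sketch.
    """
    kind = node.get("type")
    if kind == "directory":
        for name, child in sorted(node.get("entries", {}).items()):
            yield from walk_listing(child, f"{prefix}/{name}")
    elif kind == "regular":
        yield prefix, node.get("size", 0)


# Made-up sample data, not taken from any real store path.
SAMPLE = json.loads("""
{"version": 1,
 "root": {"type": "directory", "entries": {
   "share": {"type": "directory", "entries": {
     "icon.svg": {"type": "regular", "size": 120, "narOffset": 400}}},
   "README": {"type": "regular", "size": 30, "narOffset": 200}}}}
""")

files = list(walk_listing(SAMPLE["root"]))
total_size = sum(size for _, size in files)
```

An icon theme with many thousands of small files would produce a large tree of this shape, which is consistent with the roughly 77M of valid JSON mentioned above.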
<gchristensen> incredible
<lukegb> we have a lot of these broken files in the cache
<lukegb> (my argument is we should delete or fix them, at least the ones referenced in the current latest evals for each jobset)
<gchristensen> we could probably delete them, but it is definitely not common practice
<lukegb> yeah.
<gchristensen> so I'm a bit anxious about doing so
<gchristensen> I'd rather not accidentally something
<lukegb> understandable
<lukegb> I merged https://github.com/NixOS/nixpkgs/pull/121519 which will change tela-icon-theme's output hash, I think, so once today's eval runs we can see if we're still uploading broken files
<lukegb> shower thoughts: it would be interesting to have hydra automatically decide when to run evals based on how busy the workers are (and some configuration metric) rather than on a fixed time interval
<gchristensen> like, find work to do?
<gchristensen> there is a "one-at-a-time" jobset type which only queues evaluation if the prev finished
<lukegb> hmm, I guess. it'd be interesting to try to consider global state though
<lukegb> it gets a bit tricky when you mash up the fact that we have separate pools of resources though
<lukegb> scheduling is Hard (tm)
<gchristensen> yes it is :)
<lukegb> surprise! https://hydra.nixos.org/build/142485633 seems to have generated a broken manifest again
<lukegb> so we conclusively know that we're still generating broken things
<lukegb> (I'm testing with, effectively: `curl -s https://cache.nixos.org/ap80xr55d3alhs67ysi8pyqw2r7qa70i.ls | nix-shell -p brotli --run "brotli -d" | jq . >/dev/null`)
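The same check can be sketched in Python for batch use over many listings. This is an illustrative sketch, not the tooling used in the chat: `decompress_with_brotli` shells out to the same `brotli -d` command shown above, `is_valid_listing` is a hypothetical helper, and the requirement that a listing be a JSON object with a `root` key is an assumption about the format.

```python
import json
import subprocess


def decompress_with_brotli(compressed: bytes) -> bytes:
    """Decompress bytes by piping them through `brotli -d`, as in the chat.

    Assumes a `brotli` binary is on PATH; raises CalledProcessError
    if brotli rejects the stream.
    """
    result = subprocess.run(
        ["brotli", "-d"], input=compressed, capture_output=True, check=True
    )
    return result.stdout


def is_valid_listing(raw: bytes) -> bool:
    """Return True if raw parses as JSON and looks like a listing.

    The "object with a root key" test is an assumption about the format.
    Note that json.loads holds the whole document in memory, so a
    multi-gigabyte broken file needs a streaming parser instead.
    """
    try:
        parsed = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and bad UTF-8
        return False
    return isinstance(parsed, dict) and "root" in parsed
```

A truncated or corrupted stream fails in one of two places: `brotli -d` exits non-zero, or the decompressed bytes do not parse, which is the case `jq .` catches in the one-liner.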
<gchristensen> I'm not sure how to debug this further :(
<gchristensen> maybe a "ping eelco" thing?
<lukegb> maybe
<lukegb> gchristensen: can you double-check the nix daemon version on the machine running the hydra queue runner as well?
<gchristensen> yup
<gchristensen> but note hydra skips the daemon
<lukegb> yeah, I'm just grasping at straws
<gchristensen> ExecStart=@/nix/store/p9ajdjlnparcgkkxssxg8qy4gad8awiq-nix-2.4pre20210422_d9864be/bin/nix-daemon
<lukegb> which should be safe
<lukegb> bah.
<lukegb> we don't have something set which would cause the queue runner to only reload on update and not restart, right?
<lukegb> although it's restarted since the update anyway, so
<lukegb> cursed machine
<gchristensen> it does not, indeed, restart on deploy by default
<gchristensen> hey wait
* lukegb waits
<lukegb> we should put the queue runner version and queue runner nix version in /queue-runner-status
<gchristensen> this is FASCINATING.
<lukegb> oh?
<gchristensen> TIL: the running process doesn't necessarily match the config shown in systemd status
<lukegb> right yeah, that's mostly what I'd expect
<lukegb> well, ish
<lukegb> :P
<gchristensen> sigh
<gchristensen> this is the wrong version isn't it
<gchristensen> not restarting the queue runner strikes again
<gchristensen> May 02 21:50:09 ceres systemd[1]: hydra-queue-runner.service: Consumed 1month 2w 13h 41min 18.485s CPU time, received 3.9T IP traffic, sent 1.3T IP traffic.
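One way to catch this kind of mismatch is to compare the store path in the unit's configured `ExecStart` with the store path of the executable the main process is actually running. The Python sketch below is an assumption about how such a check could look, not what was run on ceres: `store_path_of`, `is_stale` and `check_unit` are hypothetical helpers, and the check only covers the executable itself, not every library in its closure.

```python
import os
import re
import subprocess

# A store path: /nix/store/<32-character hash>-<name>
STORE_PATH = re.compile(r"/nix/store/[0-9a-z]{32}-[^/\s;]+")


def store_path_of(text):
    """Return the first /nix/store/<hash>-<name> prefix in text, or None."""
    match = STORE_PATH.search(text)
    return match.group(0) if match else None


def is_stale(configured_exec_start, running_exe):
    """True if the running binary comes from a different store path
    than the one the unit is currently configured to start."""
    return store_path_of(configured_exec_start) != store_path_of(running_exe)


def check_unit(unit):
    """Compare a unit's configured ExecStart with its live main process.

    Needs systemd and permission to read /proc/<pid>/exe, which
    usually means running as root.
    """
    def show(prop):
        return subprocess.run(
            ["systemctl", "show", f"--property={prop}", "--value", unit],
            check=True, capture_output=True, text=True,
        ).stdout.strip()

    pid = show("MainPID")
    running_exe = os.readlink(f"/proc/{pid}/exe")
    return is_stale(show("ExecStart"), running_exe)
```

Calling `check_unit("hydra-queue-runner.service")` on the queue runner host would return True if the process predates the deployed unit, under the assumptions above.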
<lukegb> [lukegb@totoro:~/Projects/nix]$ git merge-base --is-ancestor faa31f4 9b9e703df41d75949272059f9b8bc8b763e91fce; echo $?
<lukegb> 0
<lukegb> [lukegb@totoro:~/Projects/nix]$ git merge-base --is-ancestor 8d651a1f 9b9e703df41d75949272059f9b8bc8b763e91fce; echo $?
<lukegb> 1
<lukegb> yup
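The two `git merge-base --is-ancestor` calls above can be combined into a single check. The Python sketch below assumes, following the earlier "nix < faa31f4 or > 8d651a1f" question, that faa31f4 is where the problem starts and 8d651a1f is where it ends. The constant and function names are made up for illustration.

```python
import subprocess

# Assumed boundaries, taken from the question earlier in the log.
BUG_INTRODUCED = "faa31f4"
BUG_FIXED = "8d651a1f"


def is_ancestor(repo, ancestor, commit):
    """True if `ancestor` is reachable from `commit` in the given checkout."""
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", ancestor, commit]
    )
    # Exit status 0 means "is an ancestor", 1 means "is not";
    # anything else is an error such as an unknown revision.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"git merge-base failed for {ancestor}..{commit}")
    return result.returncode == 0


def in_affected_range(contains_bug, contains_fix):
    """A commit is affected if it has the bad change but not the fix."""
    return contains_bug and not contains_fix


def commit_is_affected(repo, commit):
    return in_affected_range(
        is_ancestor(repo, BUG_INTRODUCED, commit),
        is_ancestor(repo, BUG_FIXED, commit),
    )
```

With the exit codes shown above (0 for faa31f4, 1 for 8d651a1f), `in_affected_range(True, False)` is True, matching the "yup".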
<gchristensen> can you submit another bogus PR?
<lukegb> I hope we end up rebuilding the world before 21.05 branches off
<lukegb> or we're going to be plagued with broken .ls files and mysteriously incomplete binary indexes
<lukegb> yeah, will do
<gchristensen> I'm sure some rebuild-the-world crisis vuln will drop
<lukegb> well, if the binutils change gets merged that'll probably help I guess
<lukegb> my prayer has been answered https://github.com/NixOS/nixpkgs/pull/121527
<gchristensen> nice
<andi-> We can also add a few more comments to stdenv, last time I checked there were some fixmes in there :-)