samueldr changed the topic of #nixos-infra to: NixOS infrastructure | logs:
cole-h has quit [Ping timeout: 240 seconds]
<sterni> This jobset can be cleaned up, the PR it was testing has since been abandoned
<lukegb> update-nixos-unstable.service is broken again :'(
<lukegb> oh, is this the nix bug again
<lukegb> oh no, it's different, this is a segfault
<gchristensen> Wed 2021-04-28 12:24:39 CEST 27973 497 100 11 missing /nix/store/gq1nqzaf210jibq7dgm57gka8d5d42gr-nixos-channel-native-programs/bin/generate-programs-index
<gchristensen> hm
<gchristensen> wrong day, and missing :)
* lukegb attempts to figure out how to run mirror-nixos-branch locally
<lukegb> aha I have figured out the magic incantation for generate-programs-index, let's see if it segfaults
<gchristensen> fingers crossed I think :)
<lukegb> Segmentation fault (core dumped)
<lukegb> bingo
<gchristensen> nice
<lukegb> aaand it truncated it
<gchristensen> great
<lukegb> I should probably just run it under gdb
<lukegb> this is possibly JSON that's so malformed that it's breaking nlohmann json
<lukegb> seems to have crashed on "/nix/store/rm19p00n8hvd63k9dd3yfbigc7rw8kqs-tela-icon-theme-2021-01-21"
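[editor's note: the hypothesis above — a JSON document so malformed it crashes the C++ parser — can be illustrated with a hypothetical Python stand-in; a well-behaved parser should reject a truncated listing with an error rather than segfault. The sample string below is an invented, minimal approximation of a cut-off NAR listing, not the actual file contents.]

```python
import json

# Hypothetical stand-in for the truncated .ls payload: a NAR listing
# cut off mid-object, like the one generate-programs-index choked on.
truncated = '{"version": 1, "root": {"type": "directory", "entries": {'

try:
    json.loads(truncated)
    parsed = True
except json.JSONDecodeError as err:
    # err.pos points at the offset where the document broke off
    parsed = False

print(parsed)  # → False: the parser errors out cleanly instead of crashing
```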
cole-h has joined #nixos-infra
<lukegb> oh, hmm, that's concerning
<lukegb> this build seems to have generated corrupted brotli too
<lukegb> are all the hydra machines running nix < faa31f4 or > 8d651a1f?
<gchristensen> almost definitely not
<gchristensen> most of them run Nix stable
<lukegb> if they're running nix stable then they should be fine
<lukegb> I don't really know how things get from hydra builders to the s3 store
<gchristensen> builders -NAR->hydra-queue-runner->s3
<gchristensen> the machines that don't get rebuilt daily almost definitely run Nix unstable
<lukegb> is hydra-queue-runner built against a "good" version of nix
<lukegb> do we have a procedure for obliterating things from the cache?
<gchristensen> we don't, no
<gchristensen> and almost certainly yes
<gchristensen> but I'll check
<lukegb> hrm, interesting
<gchristensen> wait, actually, maybe not
<gchristensen> I forgot that flakes bring new interesting changes around unified nix versions
<lukegb> basically if we delete from the cache, that should unstick us, at the cost of losing some metadata about the store path
<lukegb> OTOH that file is broken anyway
<lukegb> rm19p[...] corresponds to /nix/store/rm19p00n8hvd63k9dd3yfbigc7rw8kqs-tela-icon-theme-2021-01-21 (the hydra build I linked earlier)
<gchristensen> which machine did it run on?
<lukegb> packet builder;
<lukegb> but I'm not sure at what point the .ls is generated - it might be in hydra-queue-runner
<lukegb> inside BinaryCacheStore::addToStoreCommon; I don't think it gets copied around, so I think it'll always get regenerated when adding to a new store?
<lukegb> gchristensen: thoughts about deleting the broken .ls file?
<lukegb> otherwise I'll just bump it out of the cache by making a pointless change to the derivation to change the hash
<gchristensen> lukegb: does this fall within the range?
<lukegb> I think that's new enough to be > 8d651a1f
<lukegb> hrm
<lukegb> and that was 10 days ago too
<lukegb> (I assume it was deployed at that point as well?)
<gchristensen> almost definitely
<lukegb> hmm, so I can see: commit bumping nix on the 22nd, restarted probably on the 25th (looking at hydra_uptime_total prometheus graphs), and then on the 27th
<lukegb> although... does that make any sense? the footer says 0.1.20210429.18d2716 but there wasn't a restart according to hydra_uptime_total after the 29th
<gchristensen> ummmm
<gchristensen> you know what
<gchristensen> sigh
<gchristensen> Active: active (running) since Sun 2021-04-25 17:37:25 CEST; 6 days ago
<lukegb> which should be fine, right? which nix is in its closure?
<gchristensen> /nix/store/p9ajdjlnparcgkkxssxg8qy4gad8awiq-nix-2.4pre20210422_d9864be
<lukegb> which should be safe
<gchristensen> yeah
<lukegb> bah, maybe there's still a bug somewhere with compression
<lukegb> I should write some tests
<lukegb> but first, lunch
<gchristensen> yes, tests would be very helpful
<lukegb> sorry for dragging you into this a lot 😓
<gchristensen> no worries :)
<gchristensen> I'm in it, whether you put me there or not :D
<lukegb> I can definitely reproduce the problem at b60b0d62d6a65ad8051a24cf4d4e6c50d27abf6a, but not at d9864be4b757468d33bc49edddce5e4f04ef4b90
<lukegb> it's a 4.7M file, which decompresses to 4.3GB of JSON
<gchristensen> holy shit lol
<gchristensen> what is in there?
<lukegb> (it's not _valid_ JSON mind you)
<lukegb> it's the NAR archive list thingy
<gchristensen> of what?
<lukegb> ah, so it's "/nix/store/rm19p00n8hvd63k9dd3yfbigc7rw8kqs-tela-icon-theme-2021-01-21"'s archive manifest
<lukegb> so it's just a json tree of the files in the NAR
<lukegb> it's only about 77M of actual valid JSON, the rest is broken
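[editor's note: the shape of the corruption described above — a valid JSON document followed by a long broken tail — can be measured with Python's stdlib. This is a hedged sketch on a toy string, not the real 4.3GB payload; `raw_decode` parses one leading JSON value and reports where it ends, so everything after that offset is the junk tail.]

```python
import json

def valid_json_prefix(text: str) -> int:
    """Return the offset where the leading JSON value ends,
    or -1 if the text doesn't even start with valid JSON."""
    try:
        _, end = json.JSONDecoder().raw_decode(text)
        return end
    except json.JSONDecodeError:
        return -1

# Toy stand-in for the corrupted .ls contents: a complete listing
# followed by garbage bytes appended by the broken compressor path.
blob = '{"version": 1, "root": {"type": "regular", "size": 42}}' + "\x00garbage" * 3

print(valid_json_prefix(blob))  # offset where the junk tail begins
```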
<gchristensen> incredible
<lukegb> we have a lot of these broken files in the cache
<lukegb> (my argument is we should delete or fix them, at least the ones referenced in the current latest evals for each jobset)
<gchristensen> we could probably delete them, but it is definitely not common practice
<lukegb> yeah.
<gchristensen> so I'm a bit anxious about doing so
<gchristensen> I'd rather not accidentally something
<lukegb> understandable
<lukegb> I merged a change which will change tela-icon-theme's output hash, I think, so once today's eval runs we can see if we're still uploading broken files
<lukegb> shower thoughts: it would be interesting to have hydra automatically decide when to run evals based on how busy the workers are (and some configuration metric) rather than on a fixed time interval
<gchristensen> like, find work to do?
<gchristensen> there is a "one-at-a-time" jobset type which only queues evaluation if the prev finished
<lukegb> hmm, I guess. it'd be interesting to try to consider global state though
<lukegb> it gets a bit tricky when you mash up the fact that we have separate pools of resources though
<lukegb> scheduling is Hard (tm)
<gchristensen> yes it is :)
<lukegb> surprise! it seems to have generated a broken manifest again
<lukegb> so we conclusively know that we're still generating broken things
<lukegb> (I'm testing with, effectively: `curl -s | nix-shell -p brotli --run "brotli -d" | jq . >/dev/null`)
<gchristensen> I'm not sure how to debug this further :(
<gchristensen> maybe a "ping eelco" thing?
<lukegb> maybe
<lukegb> gchristensen: can you double-check the nix daemon version on the machine running the hydra queue runner as well?
<gchristensen> yup
<gchristensen> but note hydra skips the daemon
<lukegb> yeah, I'm just grasping at straws
<gchristensen> ExecStart=@/nix/store/p9ajdjlnparcgkkxssxg8qy4gad8awiq-nix-2.4pre20210422_d9864be/bin/nix-daemon
<lukegb> which should be safe
<lukegb> bah.
<lukegb> we don't have something set which would cause the queue runner to only reload on update and not restart, right?
<lukegb> although it's restarted since the update anyway, so
<lukegb> cursed machine
<gchristensen> it does not, indeed, restart on deploy by default
<gchristensen> hey wait
* lukegb waits
<lukegb> we should put the queue runner version and queue runner nix version in /queue-runner-status
<gchristensen> this is FASCINATING.
<lukegb> oh?
<gchristensen> til: the running process is not necessarily matching the config in systemd status
<lukegb> right yeah, that's mostly what I'd expect
<lukegb> well, ish
<lukegb> :P
<gchristensen> sigh
<gchristensen> this is the wrong version isn't it
<gchristensen> not restarting the queue runner strikes again
<gchristensen> May 02 21:50:09 ceres systemd[1]: hydra-queue-runner.service: Consumed 1month 2w 13h 41min 18.485s CPU time, received 3.9T IP traffic, sent 1.3T IP traffic.
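[editor's note: the gotcha found above — a long-running service still executing a pre-deploy binary while `systemctl status` shows the new `ExecStart=` — can be checked directly on Linux by resolving `/proc/<pid>/exe`. A hedged sketch, demonstrated on the current process since we don't have the queue runner's PID here:]

```python
import os

def running_exe(pid: int) -> str:
    """Resolve the binary a live process is *actually* executing (Linux-only).

    If the service wasn't restarted after a deploy, this can differ from
    the ExecStart= path shown in the current unit file.
    """
    return os.readlink(f"/proc/{pid}/exe")

# In the real check you'd compare running_exe(main_pid) against the
# /nix/store path in ExecStart=; here we just inspect our own process.
print(running_exe(os.getpid()))
```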
<lukegb> [lukegb@totoro:~/Projects/nix]$ git merge-base --is-ancestor faa31f4 9b9e703df41d75949272059f9b8bc8b763e91fce; echo $?
<lukegb> 0
<lukegb> [lukegb@totoro:~/Projects/nix]$ git merge-base --is-ancestor 8d651a1f 9b9e703df41d75949272059f9b8bc8b763e91fce; echo $?
<lukegb> 1
<lukegb> yup
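[editor's note: the two commands above use `git merge-base --is-ancestor A B`, which exits 0 iff A is an ancestor of B — so the deployed commit contains the bug-introducing commit (faa31f4) but not the fix (8d651a1f). A toy sketch of that check on an invented commit graph; the hashes are labels only, and the graph shape is an assumption for illustration:]

```python
# Toy commit graph: each commit maps to its parents. Invented topology
# echoing the log: the deployed commit descends from the bad commit,
# while the fix landed elsewhere and isn't in its ancestry.
parents = {
    "9b9e703": ["faa31f4"],   # deployed commit
    "faa31f4": ["base"],      # commit that introduced the bug
    "8d651a1f": ["faa31f4"],  # commit that fixed it
    "base": [],
}

def is_ancestor(a: str, b: str) -> bool:
    """Walk b's ancestry (including b itself) looking for a,
    mirroring what `git merge-base --is-ancestor a b` answers."""
    stack, seen = [b], set()
    while stack:
        c = stack.pop()
        if c == a:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return False

print(is_ancestor("faa31f4", "9b9e703"))   # True  → exit status 0
print(is_ancestor("8d651a1f", "9b9e703"))  # False → exit status 1: still broken
```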
<gchristensen> can you submit another bogus PR?
<lukegb> I hope we end up rebuilding the world before 21.05 branches off
<lukegb> or we're going to be plagued with broken .ls files and mysteriously incomplete binary indexes
<lukegb> yeah, will do
<gchristensen> I'm sure some rebuild-the-world crisis vuln will drop
<lukegb> well, if the binutils change gets merged that'll probably help I guess
<lukegb> my prayer has been answered
<gchristensen> nice
<andi-> We can also add a few more comments to stdenv, last time I checked there were some fixmes in there :-)
supersandro2000 has quit [Killed ( (Nickname regained by services))]
supersandro2000 has joined #nixos-infra