#nixos-infra on 2021-05-01

2020-03-23 19:08 samueldr changed the topic of #nixos-infra to: NixOS infrastructure | logs: https://logs.nix.samueldr.com/nixos-infra/

02:21 <lukegb> samueldr: https://github.com/NixOS/nixpkgs/pull/121352 :/

02:24 lukegb has joined #nixos-infra

02:31 <samueldr> lukegb: what I don't get is, here

02:31 <samueldr> https://hydra.nixos.org/build/142182234#tabs-summary

02:31 <samueldr> >> Disk image size: 3128950784 bytes

02:31 <lukegb> Also: why did it succeed after being rerun a few times

02:31 <samueldr> >> 1705409024 Apr 30 22:31 nixos-amazon-image-21.05pre285770.7859f8a9d6e-aarch64-linux.vhd

02:31 <lukegb> I wish Hydra kept logs from failed runs

02:32 <samueldr> how do you fit 3128950784 bytes in 1705409024?

02:32 <lukegb> If a bunch of them are null?

02:32 <samueldr> that wouldn't matter unless it was a sparse FS

02:32 <samueldr> and AFAIK NARs don't deal with sparse

02:32 <lukegb> Doesn't qemu-img conversion automatically sparsify

02:33 <samueldr> dunno, I usually don't deal with qemu-img

02:34 <lukegb> I _think_ it does

02:34 <samueldr> "logical_bytes": "3129090048"

02:34 <lukegb> if it finds at least one 4k chunk of null

02:34 <samueldr> https://github.com/NixOS/nixpkgs/pull/121352/files#diff-15d957b64e4bcfadbe6317cb441ec661eacac936af6e918f289cfffbb46444edR89

02:35 <samueldr> ah, but it's not a sparse file, as in sparse in the FS, but a file that itself handles sparseness, I see
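
[editor's note] The point above is that the .vhd format itself can leave out empty regions, so a 3128950784-byte disk image can come out as a much smaller file without the filesystem or the NAR being sparse. A minimal sketch of the idea follows, assuming a fixed 4 KiB chunk size (the figure lukegb mentions from memory). The function name is hypothetical and this is not qemu-img's actual algorithm, only an illustration of skipping all-zero chunks.

```python
def stored_bytes_if_zero_chunks_skipped(data: bytes, chunk_size: int = 4096) -> int:
    """Count the bytes that remain if every all-zero chunk is left out.

    Hypothetical helper for illustration only; real image formats add
    headers and allocation tables and may use other block sizes.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    zero_chunk = bytes(chunk_size)
    stored = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # The final chunk may be short, so compare against zeros of equal length.
        if chunk != zero_chunk[:len(chunk)]:
            stored += len(chunk)
    return stored


# One chunk of data, two empty chunks, then a short tail of data.
image = b"\x01" * 4096 + bytes(8192) + b"\x02" * 100
print(len(image), stored_bytes_if_zero_chunks_skipped(image))
```

Under this model the output size depends on how many chunks are entirely zero, not just on how much data the image holds, which is why leftover non-zero garbage in free space would inflate the result.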

02:36 <samueldr> lukegb: what do you say we first try a last ditch effort: chuck an `ls -l $out/` and `cat $out/nix-support/image-info.json`

02:36 <samueldr> so we get more info about the issue

02:36 <samueldr> right now the main issue we have is we have no idea what's going on

02:36 <lukegb> how's this as a counteroffer

02:37 <lukegb> https://github.com/NixOS/nixpkgs/pull/121352/files <- make the channel-blocking one statically sized so at least we don't block it?

02:37 <lukegb> by $out/ do you mean the outpath of the amazonImage?

02:37 <samueldr> we'd need a failed run's info to know what's going on I assume

02:38 <samueldr> yes

02:38 <samueldr> so it gets in the logs

02:38 <lukegb> I have the outpath dled locally

02:38 <samueldr> but not one of the failed!

02:38 <lukegb> oh, of a failed run

02:38 <lukegb> obviously, yeah, d'oh

02:38 <samueldr> :)

02:38 <samueldr> nix-store --realise got me the out paths of successful runs too

02:38 <samueldr> when does it get transformed into a vhd?

02:39 <lukegb> err, one of the last steps

02:39 <lukegb> https://sourcegraph.com/github.com/NixOS/nixpkgs/-/blob/nixos/lib/make-disk-image.nix#L334

02:39 <samueldr> I see https://github.com/NixOS/nixpkgs/blob/070f37edd656c82accb8ced82fbdcf0b334f820a/nixos/lib/make-disk-image.nix#L334

02:39 <samueldr> yeah

02:39 <samueldr> is there a verbose flag?

02:40 <samueldr> I see trace, probably too much

02:40 <samueldr> the last thing in the log is the VM powering down

02:40 <samueldr> that's not a lot of info from qemu-img!

02:41 <samueldr> I hadn't realised before... a few moments ago... that it wasn't just a raw image

02:41 <lukegb> aaaaaaaaaah

02:41 <samueldr> a?

02:41 <lukegb> nah, nothing useful

02:45 <lukegb> yeah, honestly I'm a little stumped

02:46 <lukegb> we could get someone to add us a new hydra jobset on a separate branch so we can play with things?

02:47 samueldr has left #nixos-infra [#nixos-infra]

02:47 samueldr has joined #nixos-infra

02:47 <lukegb> wb

02:47 <samueldr> (client had a broken nick list from the netsplit the other day)

02:47 <samueldr> maybe gchristensen might tomorrow

02:47 <samueldr> or have other input about that

02:48 <samueldr> so yeah, reverting that for now might be the right option

02:48 <lukegb> hmm, do any of the other images have this problem actually

02:49 <samueldr> uh, I thought I asked, but maybe I just thought

02:49 <lukegb> we only build the ova for x86_64-linux

02:49 <samueldr> was it only the aarch64 amazon image that caused issues?

02:50 <lukegb> I don't _think_ the x86_64-linux amazon image had the same problem

02:50 <lukegb> but I'm not sure if that's due to anything inherent about aarch64/x86_64 or just "luck" due to the package set size

02:52 <lukegb> x86_64-linux fetches 1038.75 MiB of stuff; aarch64-linux fetches 1099.04 MiB of stuff

02:52 <lukegb> so it might just be luck

02:54 <samueldr> I don't know

02:54 <samueldr> seems too consistent to only be luck

02:54 <lukegb> right but the fact that _sometimes_ retrying it will eventually make it work

02:55 <lukegb> is a little suspicious?

02:55 <samueldr> yes

02:55 <samueldr> https://hydra.nixos.org/build/142424262 if this builds, I don't think it's luck that it builds on x86_64

02:56 <samueldr> I would assume *something* either in aarch64-linux or on the builder configs on aarch64 make it weird

02:56 <lukegb> hmm, why?

02:56 <samueldr> uh

02:56 <samueldr> why is machine "n/a" for the failed runs?

02:57 <samueldr> right, they do show the machine on the log page though

02:57 <samueldr> I guess that's normal then

02:57 <samueldr> why? because it seems too consistent that it fails mainly for aarch64, but never for x86_64

02:57 <samueldr> out of 5 total builds

02:58 <lukegb> right, but the closure size for x86_64 and aarch64 is a little smaller

02:58 <lukegb> *is a little smaller than aarch64

02:58 <samueldr> closure size doesn't matter for the output size

02:58 <lukegb> it does, because we're cramming the closure into the output, no

02:58 <lukegb> (as in, we put the closure into the disk image)

02:58 <samueldr> with the newfound knowledge of qemu-img muddying the waters, I kind of assume something is going wrong

02:58 <samueldr> to go from 1.6GiB to 2.2GiB++

02:58 <samueldr> and really **wrong**

02:59 <samueldr> oh

02:59 <samueldr> let's run that aarch64 drv on the community builder

02:59 <lukegb> the other thing we could try is enabling discard on the VM builder and running fstrim

02:59 <lukegb> but I don't think that'll save that much space

03:00 <samueldr> I don't think that's it either

03:00 <lukegb> I don't think I have access to the community builder, heh

03:01 <lukegb> do you have any theories as to what it _is_?

03:02 <samueldr> not yet

03:02 <samueldr> only conjecture

03:02 <samueldr> something on aarch64 here is causing weird to happen

03:03 <samueldr> hehehe the reproduce scripts don't work on aarch64????

03:04 <lukegb> the reproduce scripts don't work fullstop iirc

03:04 <samueldr> hadn't tried them yet

03:04 <lukegb> aah

03:04 <lukegb> yeah, they've never worked for me

03:04 <samueldr> tries to execute an x86_64 bash AFAICT

03:04 <lukegb> I've just been doing nix-build by hand which is fine and all but means the commit hash and commit count is wrong, but _eh_

03:05 <samueldr> didn't investigate further

03:11 <samueldr> [samueldr@aarch64:~/nixpkgs]$ nix-build ./nixos/release-combined.nix -A nixos.amazonImage.aarch64-linux

03:11 <samueldr> Segmentation fault (core dumped)

03:11 <samueldr> that uh

03:11 <samueldr> that wasn't on my bingo card

03:13 <samueldr> with nix-shell -p nix it evals

03:27 <samueldr> testing with everyone's best friend: builtins.currentTime!

03:29 <samueldr> -r--r--r-- 1 root root 2351489536 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd

03:30 <samueldr> something's really odd!

03:34 <samueldr> -r--r--r-- 1 root root 2200457728 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd

03:34 <samueldr> same image, only difference is a builtins.currentTime in the drv to force a rebuild

03:35 <samueldr> and that currentTime is in the script that builds the image, AFAIK it's not part of its closure

03:42 <samueldr> I don't know if your discard hint is basically the same as the "old" trick with VM disk images

03:42 <samueldr> I don't know if on a raw image it would work

03:43 <samueldr> apparently e2fsck -E discard src_fs can do it, maybe?

03:44 <samueldr> not sure it zeroes out

03:46 <samueldr> >> Ted Ts'o says that he uses compress-rootfs to maintain the VM root filesystem that he uses to test upstream ext4 changes.

03:46 <samueldr> https://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/compress-rootfs

03:46 <samueldr> I guess it does

04:24 <samueldr> just started running in a loop on aarch64

04:24 <samueldr> -r--r--r-- 1 root root 2320024576 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd

04:24 <samueldr> -r--r--r-- 1 root root 1697018368 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd

04:26 <samueldr> so it is extremely irreproducible AFAICT
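
[editor's note] To put a number on "extremely irreproducible", the four .vhd sizes pasted from the local aarch64 builds so far can be tallied. This is a rough sketch: the 2 GiB cut-off between "normal" and "oversized" is an assumption chosen for illustration, not a limit stated in the channel, and the function name is hypothetical.

```python
GIB = 1024 ** 3

# Sizes of the aarch64 .vhd outputs pasted in the channel at 03:29, 03:34 and 04:24.
observed_sizes = [2351489536, 2200457728, 2320024576, 1697018368]


def fraction_oversized(sizes, threshold_bytes):
    """Return the share of builds whose output exceeds threshold_bytes."""
    if not sizes:
        raise ValueError("need at least one size")
    return sum(1 for size in sizes if size > threshold_bytes) / len(sizes)


# Assumed cut-off: anything above 2 GiB counts as oversized.
share = fraction_oversized(observed_sizes, 2 * GIB)
spread = max(observed_sizes) - min(observed_sizes)
print(f"oversized: {share:.0%}, spread: {spread} bytes")
```

For builds of the same image the spread is over 600 MB, far more than the few bytes of difference reported later for x86_64.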

04:26 <samueldr> I'm even now running that fsck "discard" before

04:26 <samueldr> doesn't change the results I observe

04:27 <samueldr> at least the good news is that it's not a hydra issue

04:43 <samueldr> built the same thing far fewer times on x86_64, but every time the result is good, 2786402304 reduced to ~1575353344

04:43 <samueldr> so it's not like on aarch64 the ratio is off base
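
[editor's note] The ratio argument can be checked with the figures from the log. This sketch assumes the Hydra-reported disk image size (3128950784 bytes) is also the raw size of the local aarch64 builds, which the channel does not confirm, and the x86_64 output size is the approximate value samueldr gave. The function name is hypothetical.

```python
def stored_ratio(raw_bytes: int, vhd_bytes: int) -> float:
    """Return the size of the .vhd as a fraction of the raw disk image."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return vhd_bytes / raw_bytes


# x86_64 local build: 2786402304 raw, ~1575353344 as .vhd.
x86_64_good = stored_ratio(2786402304, 1575353344)
# aarch64 Hydra build that succeeded: 3128950784 raw, 1705409024 as .vhd.
aarch64_good = stored_ratio(3128950784, 1705409024)
# One oversized aarch64 local build, assuming the same raw size as on Hydra.
aarch64_bad = stored_ratio(3128950784, 2351489536)

for label, ratio in [("x86_64 good", x86_64_good),
                     ("aarch64 good", aarch64_good),
                     ("aarch64 oversized", aarch64_bad)]:
    print(f"{label}: {ratio:.2f}")
```

The good aarch64 build shrinks about as much as the x86_64 one, so a healthy aarch64 image is not inherently less compressible. The oversized builds are the outliers.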

05:04 aminechikhaoui8 has joined #nixos-infra

05:07 eyJhbV2 has joined #nixos-infra

05:13 eyJhb has quit [*.net *.split]

05:13 aminechikhaoui has quit [*.net *.split]

05:13 asymmetric has quit [*.net *.split]

05:13 V has quit [*.net *.split]

05:13 eyJhbV2 is now known as eyJhb

05:13 aminechikhaoui8 is now known as aminechikhaoui

05:13 eyJhb has joined #nixos-infra

05:13 eyJhb has quit [Changing host]

05:23 V has joined #nixos-infra

05:23 asymmetric has joined #nixos-infra

07:25 cole-h has quit [Ping timeout: 260 seconds]

09:30 asymmetric has quit [*.net *.split]

09:30 V has quit [*.net *.split]

09:30 aminechikhaoui has quit [*.net *.split]

09:31 hexa- has quit [*.net *.split]

09:31 NinjaTrappeur has quit [*.net *.split]

09:31 endocrimes has quit [*.net *.split]

09:31 mcint has quit [*.net *.split]

09:31 tazjin has quit [*.net *.split]

09:31 nh2[m] has quit [*.net *.split]

09:31 pie_ has quit [*.net *.split]

09:31 MichaelRaskin has quit [*.net *.split]

09:31 JJJollyjim has quit [*.net *.split]

09:31 lukegb has quit [*.net *.split]

09:31 supersandro2000 has quit [*.net *.split]

09:31 thefloweringash has quit [*.net *.split]

09:31 Ericson2314 has quit [*.net *.split]

09:31 domenkozar[m] has quit [*.net *.split]

09:31 sterni has quit [*.net *.split]

09:31 colemickens has quit [*.net *.split]

09:31 aristid has quit [*.net *.split]

09:31 gchristensen has quit [*.net *.split]

09:31 roberth has quit [*.net *.split]

09:31 garbas[m] has quit [*.net *.split]

09:31 XgF has quit [*.net *.split]

09:31 ryantm has quit [*.net *.split]

09:31 andi- has quit [*.net *.split]

09:31 zimbatm[m] has quit [*.net *.split]

09:31 flokli has quit [*.net *.split]

09:31 qyliss has quit [*.net *.split]

09:32 ikwildrpepper has quit [*.net *.split]

09:32 niksnut has quit [*.net *.split]

09:32 samueldr has quit [*.net *.split]

09:44 domenkozar[m] has joined #nixos-infra

09:44 JJJollyjim has joined #nixos-infra

09:44 aristid has joined #nixos-infra

09:44 andi- has joined #nixos-infra

09:44 hexa- has joined #nixos-infra

09:44 gchristensen has joined #nixos-infra

09:44 sterni has joined #nixos-infra

09:44 eyJhb has joined #nixos-infra

09:44 supersandro2000 has joined #nixos-infra

09:44 pie_ has joined #nixos-infra

09:44 qyliss has joined #nixos-infra

09:44 garbas[m] has joined #nixos-infra

09:44 endocrimes has joined #nixos-infra

09:44 XgF has joined #nixos-infra

09:44 ikwildrpepper has joined #nixos-infra

09:44 roberth has joined #nixos-infra

09:44 mcint has joined #nixos-infra

11:06 <lukegb> samueldr: are you running e2fsck inside or outside the VM?

11:06 <lukegb> if you're doing it inside the VM, it might not actually do anything unless you change the qemu flags to turn discard support on for the disk

11:56 qyliss has quit [Quit: bye]

11:59 qyliss has joined #nixos-infra

13:37 supersandro2000 has quit [Quit: The Lounge - https://thelounge.chat]

13:48 supersandro2000 has joined #nixos-infra

14:44 <gchristensen> I can do stuff, what do you want me to do? :)

15:33 cole-h has joined #nixos-infra

15:49 cole-h has quit [Quit: Goodbye]

15:59 cole-h has joined #nixos-infra

16:10 <gchristensen> ikwildrpepper: it looks like this new mac is a 128G hard disk, is that possible?

16:14 <samueldr> lukegb: inside

16:14 <samueldr> outside would be hard because it is a partition

16:14 <lukegb> yeah, try setting discard=unmap?

16:15 <samueldr> same build

16:15 <samueldr> https://gist.github.com/samueldr/67534945d56489a6747acdbd4223a5be

16:15 <samueldr> ran 20 times

16:15 <samueldr> only difference is a comment in the script with #${toString builtins.currentTime}

16:15 <samueldr> we have three builds that look more "normal" comparing with x86_64 equivalent builds

16:16 <gchristensen> btw, did we need a jobset for something?

16:16 <samueldr> so something *definitely* is amiss on aarch64-linux

16:16 <samueldr> I don't think so now

16:16 <gchristensen> ok

16:16 <samueldr> turns out it's reproducible on the community box

16:16 <gchristensen> cool

16:16 <gchristensen> let me know :)

16:20 <lukegb> samueldr: does the same thing happen with x86_64?

16:21 <samueldr> I didn't run in loop

16:21 <samueldr> but out of 5 local builds

16:21 <samueldr> looks basically the same, few bytes difference in the result

16:21 <samueldr> not half a gigabyte

16:22 <samueldr> so I'm really thinking that something on aarch64 acts just different enough to sometimes cause... weirdness?

16:24 <samueldr> that inconsistency is troubling

16:29 <gchristensen> very

16:29 <samueldr> restarted with discard param

16:29 <samueldr> also, 4 out of 20 times, that's around 80% of the time doing the "weird" thing

16:29 <samueldr> at the very least it's easier to get a good feeling that things are going right

16:30 <gchristensen> maybe diffoscope has something interesting to say?

16:31 <samueldr> not sure how to actually run it, and the only thing that differs is the final disk image, we don't keep the intermediary raw disk image

16:33 <gchristensen> diffoscope a b :)

16:33 <gchristensen> 1s

16:33 <samueldr> I would hazard a guess that the raw disk image is maybe where the differences could be interesting

16:33 <samueldr> but then it's unreproducible ext4

16:34 <gchristensen> diffoscope unpacks filesystems

16:34 <samueldr> does it work with filesystem structures?

16:35 <gchristensen> I think this works: diffoscope --html ./out.html - patha pathb

16:35 <gchristensen> and I think yes

16:36 <qyliss> the answer to "does it ...?" with diffoscope tends to be yes, ime :P

16:37 <gchristensen> it is truly remarkable software

16:38 <samueldr> two builds in with the umap param, no changes

16:53 <samueldr> welp, the community machine reset itself

16:53 <samueldr> but with what I saw I'm pretty confident that there were no changes

16:55 <gchristensen> womp womp

16:55 <gchristensen> what's the umap change?

16:55 <samueldr> hm?

16:55 <samueldr> ah

16:55 <samueldr> unmap

16:55 <samueldr> [12:14:47] <lukegb> yeah, try setting discard=unmap?

16:55 <gchristensen> what's the unmap change? :D

16:55 <samueldr> qemu setting

16:55 <gchristensen> ah!

16:55 <gchristensen> iiinteresting

16:56 <samueldr> no change from it

16:56 <samueldr> so it doesn't look like anything related to "discarding" or "zeroing" a disk really

16:56 <lukegb> no changes in size, or no changes from the size changing

16:56 <lukegb> as in, the size was still changing a lot?

16:56 <samueldr> same wrong behaviour

16:56 <lukegb> ah, boo

16:56 <gchristensen> oh, dang

16:56 <gchristensen> I thought by no change you meant ... stable :)

16:57 <gchristensen> in the wanted way.

16:57 <samueldr> the issue is stable enough to not go away

16:58 <samueldr> I wonder if fallocate can somehow allocate garbage data? but it shouldn't, no?

16:59 <samueldr> though there is a --zero-range parameter

16:59 <samueldr> or are we using truncate?

17:00 <samueldr> truncate

17:00 <lukegb> surely there's no way the FS would just... give us uninitialized bytes

17:00 <samueldr> [if a file is] extended [it] reads as zero bytes.

17:00 <samueldr> lukegb: by design

17:00 <samueldr> well

17:00 <samueldr> no

17:01 <samueldr> I mean

17:01 <samueldr> I don't know?

17:02 <samueldr> yeah, without -z it's all null bytes on my laptop

17:02 <samueldr> that could have been "fun"

17:03 <samueldr> and a fresh truncate on the community server also shows me it is zeroes
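
[editor's note] The check samueldr describes, that a file extended with truncate reads back as zero bytes, can be reproduced with a short script. This is a sketch using Python's file API rather than the truncate(1) command used in the channel. The function name is hypothetical, and it only tests whichever filesystem backs the temporary directory.

```python
import tempfile


def extended_region_is_zero(extra_bytes: int) -> bool:
    """Extend a small temporary file and check that the new region reads as zeros."""
    payload = b"data"
    with tempfile.TemporaryFile() as handle:
        handle.write(payload)
        # Growing the file via truncate() adds bytes without writing them explicitly.
        handle.truncate(len(payload) + extra_bytes)
        handle.seek(len(payload))
        tail = handle.read()
    return len(tail) == extra_bytes and tail == bytes(extra_bytes)


print(extended_region_is_zero(1024 * 1024))
```

A True result on the build machine would match what samueldr saw on both the laptop and the community server, pointing away from truncate as the source of stray non-zero data.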

18:47 andi- has quit [Ping timeout: 250 seconds]

18:55 andi- has joined #nixos-infra

23:26 supersandro2000 is now known as Guest38188

23:26 supersandro2000 has joined #nixos-infra

23:29 Guest38188 has quit [Ping timeout: 240 seconds]