samueldr changed the topic of #nixos-infra to: NixOS infrastructure | logs:
lukegb has joined #nixos-infra
<samueldr> lukegb: what I don't get is, here
<samueldr> >> Disk image size: 3128950784 bytes
<lukegb> Also: why did it succeed after being rerun a few times
<samueldr> >> 1705409024 Apr 30 22:31 nixos-amazon-image-21.05pre285770.7859f8a9d6e-aarch64-linux.vhd
<lukegb> I wish Hydra kept logs from failed runs
<samueldr> how do you fit 3128950784 bytes in 1705409024?
<lukegb> If a bunch of them are null?
<samueldr> that wouldn't matter unless it was a sparse FS
<samueldr> and AFAIK NARs don't deal with sparse
<lukegb> Doesn't qemu-img conversion automatically sparsify
<samueldr> dunno, I usually don't deal with qemu-img
<lukegb> I _think_ it does
<samueldr> "logical_bytes": "3129090048"
<lukegb> if it finds at least one 4k chunk of null
<samueldr> ah, but it's not a sparse file, as in sparse in the FS, but a file format that itself handles sparseness, I see
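[Editor's note] The point above is that the sparseness lives inside the image format, not in the filesystem holding the file. A minimal sketch of that idea follows, assuming a format that simply leaves out chunks that are entirely zero. The 4 KiB chunk size is the figure lukegb mentions. The `packed_size` helper and the toy image are hypothetical and are not qemu-img's actual algorithm or the VHD layout.

```python
CHUNK = 4096  # granularity mentioned in the discussion; a real format may use another block size


def packed_size(data: bytes, chunk: int = CHUNK) -> int:
    """Bytes needed if only chunks containing non-zero data are stored."""
    zero = bytes(chunk)
    stored = 0
    for offset in range(0, len(data), chunk):
        block = data[offset:offset + chunk]
        # Compare against an all-zero block of the same length (the last block may be short).
        if block != zero[:len(block)]:
            stored += len(block)
    return stored


# Toy "disk image": 16 chunks, only 3 of which hold any data.
image = bytearray(16 * CHUNK)
image[0:4] = b"boot"
image[5 * CHUNK:5 * CHUNK + 4] = b"data"
image[15 * CHUNK + 100:15 * CHUNK + 104] = b"tail"

logical = len(image)
stored = packed_size(bytes(image))
print(f"logical bytes: {logical}, stored bytes: {stored}")
```

Under this assumption the output file can be much smaller than the logical disk size even on a filesystem (or in a NAR) with no sparse-file support, which is how 3128950784 logical bytes could fit in a 1705409024-byte .vhd.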
<samueldr> lukegb: what do you say we first try a last ditch effort: chuck an `ls -l $out/` and `cat $out/nix-support/image-info.json`
<samueldr> so we get more info about the issue
<samueldr> right now the main issue we have is we have no idea what's going on
<lukegb> how's this as a counteroffer
<lukegb> <- make the channel-blocking one statically sized so at least we don't block it?
<lukegb> by $out/ do you mean the outpath of the amazonImage?
<samueldr> we'd need a failed run's info to know what's going on I assume
<samueldr> yes
<samueldr> so it gets in the logs
<lukegb> I have the outpath dled locally
<samueldr> but not one of the failed!
<lukegb> oh, of a failed run
<lukegb> obviously, yeah, d'oh
<samueldr> :)
<samueldr> nix-store --realise got me the out paths of successful runs too
<samueldr> when does it get transformed into a vhd?
<lukegb> err, one of the last steps
<samueldr> yeah
<samueldr> is there a verbose flag?
<samueldr> I see trace, probably too much
<samueldr> the last thing in the log is the VM powering down
<samueldr> that's not a lot of info from qemu-img!
<samueldr> I hadn't realised before... a few moments ago... that it wasn't just a raw image
<lukegb> aaaaaaaaaah
<samueldr> a?
<lukegb> nah, nothing useful
<lukegb> yeah, honestly I'm a little stumped
<lukegb> we could get someone to add us a new hydra jobset on a separate branch so we can play with things?
samueldr has left #nixos-infra [#nixos-infra]
samueldr has joined #nixos-infra
<lukegb> wb
<samueldr> (client had a broken nick list from the netsplit the other day)
<samueldr> maybe gchristensen might be able to tomorrow
<samueldr> or have other input about that
<samueldr> so yeah, reverting that for now might be the right option
<lukegb> hmm, do any of the other images have this problem actually
<samueldr> uh, I thought I asked, but maybe I just thought
<lukegb> we only build the ova for x86_64-linux
<samueldr> was it only the aarch64 amazon image that caused issues?
<lukegb> I don't _think_ the x86_64-linux amazon image had the same problem
<lukegb> but I'm not sure if that's due to anything inherent about aarch64/x86_64 or just "luck" due to the package set size
<lukegb> x86_64-linux fetches 1038.75 MiB of stuff; aarch64-linux fetches 1099.04 MiB of stuff
<lukegb> so it might just be luck
<samueldr> I don't know
<samueldr> seems too consistent to only be luck
<lukegb> right but the fact that _sometimes_ retrying it will eventually make it work
<lukegb> is a little suspicious?
<samueldr> yes
<samueldr> if this builds, I don't think it's luck that it builds on x86_64
<samueldr> I would assume *something* either in aarch64-linux or in the builder configs on aarch64 makes it weird
<lukegb> hmm, why?
<samueldr> uh
<samueldr> why is machine "n/a" for the failed runs?
<samueldr> right, they do show the machine on the log page though
<samueldr> I guess that's normal then
<samueldr> why? because it seems too consistent that it fails mainly for aarch64, but never for x86_64
<samueldr> out of 5 total builds
<lukegb> right, but the closure size for x86_64 is a little smaller than for aarch64
<samueldr> closure size doesn't matter for the output size
<lukegb> it does, because we're cramming the closure into the output, no
<lukegb> (as in, we put the closure into the disk image)
<samueldr> with the newfound knowledge of qemu-img muddying the waters, I kind of assume something is going wrong
<samueldr> to go from 1.6GiB to 2.2GiB++
<samueldr> and really **wrong**
<samueldr> oh
<samueldr> let's run that aarch64 drv on the community builder
<lukegb> the other thing we could try is enabling discard on the VM builder and running fstrim
<lukegb> but I don't think that'll save that much space
<samueldr> I don't think that's it either
<lukegb> I don't think I have access to the community builder, heh
<lukegb> do you have any theories as to what it _is_?
<samueldr> not yet
<samueldr> only conjecture
<samueldr> something on aarch64 here is causing weird to happen
<samueldr> hehehe the reproduce scripts don't work on aarch64????
<lukegb> the reproduce scripts don't work fullstop iirc
<samueldr> hadn't tried them yet
<lukegb> aah
<lukegb> yeah, they've never worked for me
<samueldr> tries to execute an x86_64 bash AFAICT
<lukegb> I've just been doing nix-build by hand, which is fine and all but means the commit hash and commit count are wrong, but _eh_
<samueldr> didn't investigate further
<samueldr> [samueldr@aarch64:~/nixpkgs]$ nix-build ./nixos/release-combined.nix -A nixos.amazonImage.aarch64-linux
<samueldr> Segmentation fault (core dumped)
<samueldr> that uh
<samueldr> that wasn't on my bingo card
<samueldr> with nix-shell -p nix it evals
<samueldr> testing with everyone's best friend: builtins.currentTime!
<samueldr> -r--r--r-- 1 root root 2351489536 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd
<samueldr> something's really odd!
<samueldr> -r--r--r-- 1 root root 2200457728 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd
<samueldr> same image, only difference is a builtins.currentTime in the drv to force a rebuild
<samueldr> and that currentTime is in the script that builds the image, AFAIK it's not part of its closure
<samueldr> I don't know if your discard hint is basically the same as the "old" trick with VM disk images
<samueldr> I don't know if on a raw image it would work
<samueldr> apparently e2fsck -E discard src_fs can do it, maybe?
<samueldr> not sure it zeroes out
<samueldr> >> Ted Ts'o says that he uses compress-rootfs to maintain the VM root filesystem that he uses to test upstream ext4 changes.
<samueldr> I guess it does
<samueldr> just started running in a loop on aarch64
<samueldr> -r--r--r-- 1 root root 2320024576 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd
<samueldr> -r--r--r-- 1 root root 1697018368 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd
<samueldr> so it is extremely irreproducible AFAICT
<samueldr> I'm even now running that fsck "discard" before
<samueldr> doesn't change the results I observe
<samueldr> at least the good news is that it's not a hydra issue
<samueldr> built the same thing far fewer times on x86_64, but every time the result is good, 2786402304 reduced to ~1575353344
<samueldr> so it's not like on aarch64 the ratio is off base
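[Editor's note] The comparison being made here is between the ratio of .vhd size to raw image size on each architecture. A rough sketch of that arithmetic follows, using only the byte counts quoted in the log. It assumes the raw size reported by Hydra (3128950784) also applies to the local aarch64 builds, which the log does not confirm. The `classify` helper and its 0.05 tolerance are hypothetical choices for illustration.

```python
# Sizes in bytes, copied from the log above.
X86_64 = {"raw": 2786402304, "vhd": 1575353344}  # the vhd figure is approximate ("~") in the log
AARCH64_RAW = 3128950784  # "Disk image size" from the Hydra log (assumed to hold for local builds)
AARCH64_VHDS = [1705409024, 2351489536, 2200457728, 2320024576, 1697018368]


def ratio(vhd: int, raw: int) -> float:
    """Fraction of the raw image size that the converted image occupies."""
    return vhd / raw


baseline = ratio(X86_64["vhd"], X86_64["raw"])


def classify(vhd: int, raw: int = AARCH64_RAW, tolerance: float = 0.05) -> str:
    """Label a build 'normal' if its ratio is within `tolerance` of the x86_64 baseline."""
    return "normal" if abs(ratio(vhd, raw) - baseline) <= tolerance else "weird"


print(f"x86_64 baseline ratio: {baseline:.3f}")
for size in AARCH64_VHDS:
    print(f"{size:>11}  ratio={ratio(size, AARCH64_RAW):.3f}  {classify(size)}")
```

With these numbers the two smaller aarch64 images land close to the x86_64 ratio, while the three larger ones sit well above it, which matches the "sometimes it works" pattern described in the channel.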
<lukegb> samueldr: are you running e2fsck inside or outside the VM?
<lukegb> if you're doing it inside the VM, it might not actually do anything unless you change the qemu flags to turn discard support on for the disk
qyliss has quit [Quit: bye]
qyliss has joined #nixos-infra
supersandro2000 has quit [Quit: The Lounge -]
supersandro2000 has joined #nixos-infra
<gchristensen> I can do stuff, what do you want me to do? :)
cole-h has joined #nixos-infra
cole-h has quit [Quit: Goodbye]
cole-h has joined #nixos-infra
<gchristensen> ikwildrpepper: it looks like this new mac is a 128G hard disk, is that possible?
<samueldr> lukegb: inside
<samueldr> outside would be hard because it is a partition
<lukegb> yeah, try setting discard=unmap?
<samueldr> same build
<samueldr> ran 20 times
<samueldr> only difference is a comment in the script with #${toString builtins.currentTime}
<samueldr> we have three builds that look more "normal" comparing with x86_64 equivalent builds
<gchristensen> btw, did we need a jobset for something?
<samueldr> so something *definitely* is amiss on aarch64-linux
<samueldr> I don't think so now
<gchristensen> ok
<samueldr> turns out it's reproducible on the community box
<gchristensen> cool
<gchristensen> let me know :)
<lukegb> samueldr: does the same thing happen with x86_64?
<samueldr> I didn't run in loop
<samueldr> but out of 5 local builds
<samueldr> looks basically the same, few bytes difference in the result
<samueldr> not half a gigabyte
<samueldr> so I'm really thinking that something on aarch64 acts just different enough to sometimes cause... weirdness?
<samueldr> that inconsistency is troubling
<gchristensen> very
<samueldr> restarted with discard param
<samueldr> also, only 4 out of 20 builds came out normal, so that's around 80% of the time doing the "weird" thing
<samueldr> at the very least it's easier to get a good feeling that things are going right
<gchristensen> maybe diffoscope has something interesting to say?
<samueldr> not sure how to actually run it, and the only thing that differs is the final disk image, we don't keep the intermediary raw disk image
<gchristensen> diffoscope a b :)
<gchristensen> 1s
<samueldr> I would hazard a guess that the raw disk image is maybe where the differences could be interesting
<samueldr> but then it's unreproducible ext4
<gchristensen> diffoscope unpacks filesystems
<samueldr> does it work with filesystem structures?
<gchristensen> I think this works: diffoscope --html ./out.html - patha pathb
<gchristensen> and I think yes
<qyliss> the answer to "does it ...?" with diffoscope tends to be yes, ime :P
<gchristensen> it is truly remarkable software
<samueldr> two builds in with the umap param, no changes
<samueldr> welp, the community machine reset itself
<samueldr> but with what I saw I'm pretty confident that there were no changes
<gchristensen> womp womp
<gchristensen> what's the umap change?
<samueldr> hm?
<samueldr> ah
<samueldr> unmap
<samueldr> [12:14:47] <lukegb> yeah, try setting discard=unmap?
<gchristensen> what's the unmap change? :D
<samueldr> qemu setting
<gchristensen> ah!
<gchristensen> iiinteresting
<samueldr> no change from it
<samueldr> so it doesn't look like anything related to "discarding" or "zeroing" a disk really
<lukegb> no changes in size, or no changes from the size changing
<lukegb> as in, the size was still changing a lot?
<samueldr> same wrong behaviour
<lukegb> ah, boo
<gchristensen> oh, dang
<gchristensen> I thought by no change you meant ... stable :)
<gchristensen> in the wanted way.
<samueldr> the issue is stable enough to not go away
<samueldr> I wonder if fallocate can somehow allocate garbage data? but it shouldn't, no?
<samueldr> though there is a --zero-range parameter
<samueldr> or are we using truncate?
<samueldr> truncate
<lukegb> surely there's no way the FS would just... give us uninitialized bytes
<samueldr> [if a file is] extended [it] reads as zero bytes.
<samueldr> lukegb: by design
<samueldr> well
<samueldr> no
<samueldr> I mean
<samueldr> I don't know?
<samueldr> yeah, without -z it's all null bytes on my laptop
<samueldr> that could have been "fun"
<samueldr> and a fresh truncate on the community server also shows me it is zeroes
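[Editor's note] The check described above (extend a file with truncate, then confirm the new region reads back as zero bytes) can be sketched as follows. This is only an illustration of the same experiment on whatever filesystem holds the temporary directory. The file name `disk.img` and the 1 MiB size are hypothetical.

```python
import os
import tempfile


def extended_region_is_zero(size: int = 1024 * 1024) -> bool:
    """Create an empty file, extend it with truncate, and check every byte reads as zero."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "disk.img")  # hypothetical file name
        with open(path, "wb") as f:
            f.truncate(size)  # comparable to `truncate -s <size> disk.img`
        with open(path, "rb") as f:
            data = f.read()
    # Every byte must be present and equal to 0.
    return len(data) == size and data.count(0) == size


print(extended_region_is_zero())
```

A result of True here only says that this particular machine and filesystem zero-fill the extended region, which is the same limited conclusion reached on the laptop and the community server.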
andi- has quit [Ping timeout: 250 seconds]
andi- has joined #nixos-infra
supersandro2000 is now known as Guest38188
supersandro2000 has joined #nixos-infra
Guest38188 has quit [Ping timeout: 240 seconds]