lukegb has joined #nixos-infra
<samueldr> lukegb: what I don't get is, here
<samueldr> >> Disk image size: 3128950784 bytes
<lukegb> Also: why did it succeed after being rerun a few times
<samueldr> >> 1705409024 Apr 30 22:31 nixos-amazon-image-21.05pre285770.7859f8a9d6e-aarch64-linux.vhd
<lukegb> I wish Hydra kept logs from failed runs
<samueldr> how do you fit 3128950784 bytes in 1705409024?
<lukegb> If a bunch of them are null?
<samueldr> that wouldn't matter unless it was a sparse FS
<samueldr> and AFAIK NARs don't deal with sparse
<lukegb> Doesn't qemu-img conversion automatically sparsify
<samueldr> dunno, I usually don't deal with qemu-img
<lukegb> I _think_ it does
<samueldr> "logical_bytes": "3129090048"
<lukegb> if it finds at least one 4k chunk of null
<samueldr> ah, but it's not a sparse file, as in sparse in the FS, but a file that itself handles sparseness, I see
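(A minimal sketch of how one could check this, assuming the .vhd is in the build output: VPC/vhd is a format that carries its own sparseness, so `qemu-img info` can report a virtual disk size larger than the file itself.)

    # apparent size vs blocks actually allocated on disk
    du --apparent-size -h result/nixos-amazon-image-*.vhd
    du -h result/nixos-amazon-image-*.vhd
    # virtual size vs file size, as qemu-img sees it
    qemu-img info result/nixos-amazon-image-*.vhd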
<samueldr> lukegb: what do you say we first try a last ditch effort: chuck an `ls -l $out/` and `cat $out/nix-support/image-info.json`
<samueldr> so we get more info about the issue
<samueldr> right now the main issue is that we have no idea what's going on
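(A sketch of that last-ditch logging, as lines appended at the end of the image derivation's build script so they land in the Hydra log; the exact placement in the script is assumed.)

    # dump what actually got produced, so failed runs leave evidence behind
    ls -l $out/
    cat $out/nix-support/image-info.json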
<lukegb> how's this as a counteroffer
<lukegb> by $out/ do you mean the outpath of the amazonImage?
<samueldr> we'd need a failed run's info to know what's going on I assume
<samueldr> so it gets in the logs
<lukegb> I have the outpath dled locally
<samueldr> but not one of the failed!
<lukegb> oh, of a failed run
<lukegb> obviously, yeah, d'oh
<samueldr> nix-store --realise got me the out paths of successful runs too
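(For context, a sketch of getting at an output that way; the attribute path matches the nix-build invocation quoted later, but using release-combined.nix here is an assumption. Since --realise substitutes from the cache when it can, it only ever yields successful outputs.)

    # instantiate the derivation, then build or substitute its output
    drv=$(nix-instantiate ./nixos/release-combined.nix -A nixos.amazonImage.aarch64-linux)
    nix-store --realise "$drv"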
<samueldr> when does it get transformed into a vhd?
<lukegb> err, one of the last steps
<samueldr> is there a verbose flag?
<samueldr> I see trace, probably too much
<samueldr> the last thing in the log is the VM powering down
<samueldr> that's not a lot of info from qemu-img!
<samueldr> I hadn't realised before... a few moments ago... that it wasn't just a raw image
<lukegb> aaaaaaaaaah
<lukegb> nah, nothing useful
<lukegb> yeah, honestly I'm a little stumped
<lukegb> we could get someone to add us a new hydra jobset on a separate branch so we can play with things?
samueldr has left #nixos-infra [#nixos-infra]
samueldr has joined #nixos-infra
<samueldr> (client had a broken nick list from the netsplit the other day)
<samueldr> maybe a gchristensen might tomorrow
<samueldr> or have other input about that
<samueldr> so yeah, reverting that for now might be the right option
<lukegb> hmm, do any of the other images have this problem actually
<samueldr> uh, I thought I asked, but maybe I just thought
<lukegb> we only build the ova for x86_64-linux
<samueldr> was it only the aarch64 amazon image that caused issues?
<lukegb> I don't _think_ the x86_64-linux amazon image had the same problem
<lukegb> but I'm not sure if that's due to anything inherent about aarch64/x86_64 or just "luck" due to the package set size
<lukegb> x86_64-linux fetches 1038.75 MiB of stuff; aarch64-linux fetches 1099.04 MiB of stuff
<lukegb> so it might just be luck
<samueldr> I don't know
<samueldr> seems too consistent to only be luck
<lukegb> right but the fact that _sometimes_ retrying it will eventually make it work
<lukegb> is a little suspicious?
<samueldr> I would assume *something* either in aarch64-linux or in the builder configs on aarch64 makes it weird
<samueldr> why is machine "n/a" for the failed runs?
<samueldr> right, they do show the machine on the log page though
<samueldr> I guess that's normal then
<samueldr> why? because it seems too consistent that it fails mainly for aarch64, but never for x86_64
<samueldr> out of 5 total builds
<lukegb> right, but the closure size for x86_64 and aarch64 is a little smaller
<lukegb> *is a little smaller than aarch64
<samueldr> closure size doesn't matter for the output size
<lukegb> it does, because we're cramming the closure into the output, no
<lukegb> (as in, we put the closure into the disk image)
<samueldr> with the newfound knowledge of qemu-img muddying the waters, I kind of assume something is going wrong
<samueldr> to go from 1.6GiB to 2.2GiB++
<samueldr> and really **wrong**
<samueldr> let's run that aarch64 drv on the community builder
<lukegb> the other thing we could try is enabling discard on the VM builder and running fstrim
<lukegb> but I don't think that'll save that much space
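(A sketch of that idea: inside the build VM, after the target filesystem is populated, one would trim it; this only reaches the backing image if the virtual disk advertises discard support. The mount point is assumed.)

    # inside the guest: ask the FS to discard its unused blocks
    fstrim -v /mnt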
<samueldr> I don't think that's it either
<lukegb> I don't think I have access to the community builder, heh
<lukegb> do you have any theories as to what it _is_?
<samueldr> only conjecture
<samueldr> something on aarch64 here is causing weird to happen
<samueldr> hehehe the reproduce scripts don't work on aarch64????
<lukegb> the reproduce scripts don't work fullstop iirc
<samueldr> hadn't tried them yet
<lukegb> yeah, they've never worked for me
<samueldr> tries to execute an x86_64 bash AFAICT
<lukegb> I've just been doing nix-build by hand, which is fine and all but means the commit hash and commit count are wrong, but _eh_
<samueldr> didn't investigate further
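(One quick way to confirm that observation, assuming the reproduce script hard-codes a Nix store path to bash in its shebang; the script name here is hypothetical.)

    script=./reproduce-nixos.amazonImage.sh   # hypothetical file name
    interp=$(head -n1 "$script" | sed 's/^#! *//; s/ .*$//')
    file "$interp"   # expect it to report x86-64 rather than aarch64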
<samueldr> [samueldr@aarch64:~/nixpkgs]$ nix-build ./nixos/release-combined.nix -A nixos.amazonImage.aarch64-linux
<samueldr> Segmentation fault (core dumped)
<samueldr> that wasn't on my bingo card
<samueldr> with nix-shell -p nix it evals
<samueldr> testing with everyone's best friend: builtins.currentTime!
<samueldr> -r--r--r-- 1 root root 2351489536 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd
<samueldr> something's really odd!
<samueldr> -r--r--r-- 1 root root 2200457728 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd
<samueldr> same image, only difference is a builtins.currentTime in the drv to force a rebuild
<samueldr> and that currentTime is in the script that builds the image, AFAIK it's not part of its closure
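(A sketch of that cache-buster, as a hypothetical one-line change to the Nix expression that assembles the image-building script; the timestamp only lands in a shell comment, so the derivation hash changes on every eval while the image contents should not.)

    # hypothetical placement inside the image derivation:
    buildCommand = ''
      # ${toString builtins.currentTime}
      # ... unchanged image-building steps ...
    '';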
<samueldr> I don't know if your discard hint is basically the same as the "old" trick with VM disk images
<samueldr> I don't know if on a raw image it would work
<samueldr> apparently e2fsck -E discard src_fs can do it, maybe?
<samueldr> not sure it zeroes out
<samueldr> >> Ted T'so says that he uses compress-rootfs to maintain the VM root filesystem that he uses to test upstream ext4 changes.
<samueldr> I guess it does
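(A sketch of that step; `-E discard` asks e2fsck to discard unused blocks after the check, which on a regular file punches holes rather than writing zeroes, hence the doubt. Since the filesystem sits in a partition inside the raw image, a loop device does the offset handling; the image name is assumed.)

    # expose the partition, then check + discard its unused blocks
    loopdev=$(losetup --find --show --partscan nixos.img)
    e2fsck -f -E discard "${loopdev}p1"
    losetup -d "$loopdev"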
<samueldr> just started running in a loop on aarch64
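(A sketch of such a loop, assuming the currentTime cache-buster above is in place so every iteration is a fresh build rather than a cache hit.)

    for i in $(seq 20); do
      out=$(nix-build ./nixos/release-combined.nix -A nixos.amazonImage.aarch64-linux --no-out-link)
      ls -l "$out"/*.vhd
    done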
<samueldr> -r--r--r-- 1 root root 2320024576 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd
<samueldr> -r--r--r-- 1 root root 1697018368 Jan 1 1970 nixos-amazon-image-21.05pre56789.gfedcba-aarch64-linux.vhd
<samueldr> so it is extremely irreproducible AFAICT
<samueldr> I'm even now running that fsck "discard" before
<samueldr> doesn't change the results I observe
<samueldr> at least the good news is that it's not a hydra issue
<samueldr> built the same thing far fewer times on x86_64, but every time the result is good, 2786402304 reduced to ~1575353344
<samueldr> so it's not like on aarch64 the ratio is off base
<lukegb> samueldr: are you running e2fsck inside or outside the VM?
<lukegb> if you're doing it inside the VM, it might not actually do anything unless you change the qemu flags to turn discard support on for the disk
<gchristensen> I can do stuff, what do you want me to do? :)
<gchristensen> ikwildrpepper: it looks like this new mac has a 128G hard disk, is that possible?
<samueldr> lukegb: inside
<samueldr> outside would be hard because it is a partition
<lukegb> yeah, try setting discard=unmap?
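(For reference, the qemu knob lukegb means; this is a generic example of exposing discard on a virtio disk, not the builder's actual invocation, so the machine type, file name, and bus are assumptions.)

    qemu-system-aarch64 -machine virt -m 2048 \
      -drive file=nixos.img,format=raw,if=none,id=disk0,discard=unmap \
      -device virtio-blk-pci,drive=disk0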
<samueldr> same build
<samueldr> ran 20 times
<samueldr> only difference is a comment in the script with #${toString builtins.currentTime}
<samueldr> we have three builds that look more "normal" compared with x86_64 equivalent builds
<gchristensen> btw, did we need a jobset for something?
<samueldr> so something *definitely* is amiss on aarch64-linux
<samueldr> I don't think so now
<samueldr> turns out it's reproducible on the community box
<gchristensen> cool
<gchristensen> let me know :)
<lukegb> samueldr: does the same thing happen with x86_64?
<samueldr> I didn't run it in a loop
<samueldr> but out of 5 local builds
<samueldr> looks basically the same, few bytes difference in the result
<samueldr> not half a gigabyte
<samueldr> so I'm really thinking that something on aarch64 acts just different enough to sometimes cause... weirdness?
<samueldr> that inconsistency is troubling
<gchristensen> very
<samueldr> restarted with discard param
<samueldr> also, only 4 out of 20 were good; that's around 80% of the time doing the "weird" thing
<samueldr> at the very least it's easier to get a good feeling that things are going right
<gchristensen> maybe diffoscope has something interesting to say?
<samueldr> not sure how to actually run it, and the only thing that differs is the final disk image, we don't keep the intermediary raw disk image
<gchristensen> diffoscope a b :)
<samueldr> I would hazard a guess that the raw disk image is maybe where the differences could be interesting
<samueldr> but then it's unreproducible ext4
<gchristensen> diffoscope unpacks filesystems
<samueldr> does it work with filesystem structures?
<gchristensen> I think this works: diffoscope --html ./out.html patha pathb
<gchristensen> and I think yes
<qyliss> the answer to "does it ...?" with diffoscope tends to be yes, ime :P
<gchristensen> it is truly remarkable software
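(Putting that together, a sketch of comparing a "good" and a "bad" build of the same image; the file names are assumed.)

    # recurses into the vhd and the filesystem inside it, writing an HTML report
    diffoscope --html ./out.html good.vhd bad.vhd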
<samueldr> two builds in with the umap param, no changes
<samueldr> welp, the community machine reset itself
<samueldr> but with what I saw I'm pretty confident that there were no changes
<gchristensen> womp womp
<gchristensen> whats the umap change?
<samueldr> [12:14:47] <lukegb> yeah, try setting discard=unmap?
<gchristensen> whats the unmap change? :D
<samueldr> qemu setting
<gchristensen> iiinteresting
<samueldr> no change from it
<samueldr> so it doesn't look like anything related to "discarding" or "zeroing" a disk really
<lukegb> no changes in size, or no changes from the size changing
<lukegb> as in, the size was still changing a lot?
<samueldr> same wrong behaviour
<gchristensen> oh, dang
<gchristensen> I thought by no change you meant ... stable :)
<gchristensen> in the wanted way.
<samueldr> the issue is stable enough to not go away
<samueldr> I wonder if fallocate can somehow allocate garbage data? but it shouldn't, no?
<samueldr> though there is a --zero-range parameter
<samueldr> or are we using truncate?
<samueldr> truncate
<lukegb> surely there's no way the FS would just... give us uninitialized bytes
<samueldr> [if a file is] extended [it] reads as zero bytes.
<samueldr> lukegb: by design
<samueldr> I don't know?
<samueldr> yeah, without -z it's all null bytes on my laptop
<samueldr> that could have been "fun"
<samueldr> and a fresh truncate on the community server also shows me it is zeroes
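(A sketch of that check; hexdump collapses runs of identical bytes into a lone `*` line, so an all-zero file prints almost nothing.)

    truncate -s 64M test.img   # extend a fresh file without writing any data
    hexdump test.img | head    # a row of zeros then "*" means it reads back as all zeroes
    rm test.img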