<gchristensen> stripping FHS paths in `./net/bluetooth/bnep/Makefile'...
<gchristensen> ^ doing one of these a second, this is going to be a slow kernel compile :)
<samueldr> oof
<samueldr> more than 10 hours if we look at the failing hydra builds :/
<gchristensen> ...oh
<samueldr> (if it's as loaded as it is while it fails?)
<samueldr> (I have no idea what's going on there)
<gchristensen> ok
<gchristensen> want to play twitch-debugs-performance-problems?
<samueldr> no idea if I'd be of help for real :/
<samueldr> (I don't mean to devalue myself, but I think you're probably more used to it!)
<gchristensen> I'm not sure I am :')
<samueldr> are the hydra hitting the disks?
<samueldr> hydra builders*
<samueldr> if it's all in memory it's weird?
<gchristensen> it does hit the disk
<samueldr> spinning rust I guess
<gchristensen> I'd tell you, but I'm waiting for `nix-shell -p sysstat -p iotop` :P
<gchristensen> ok ... I pinged a friend of mine who is good at linux performance. let's see what they say. initial guess is: smells like contention on a file descriptor
<samueldr> I know next to nothing about how hydra allocates tasks: is there a value you could, like, cut in half and see if it helps? (where it would help in a really bad case) (and I hate to suggest doing things like that in prod)
<samueldr> but if e.g. it's tasked to do 64 builds at once, setting it to 32 might increase the throughput while you look at the real issue?
<samueldr> (don't know how bad it is in reality)
<gchristensen> `ls` takes 10s
<gchristensen> (1) that is possible (2) the system is extremely underloaded, actually, not sure why it is whacking out
<samueldr> from experience, once I/O gets in the mix it's hard to diagnose?
<samueldr> (or lack thereof?)
<gchristensen> slow to diagnose
<gchristensen> my friend here was the database performance guy for a long time
<samueldr> ah, the one who hates all the devs ;)
<gchristensen> stripping FHS paths in `./tools/testing/selftests/powerpc/pmu/Makefile'...
<gchristensen> ^ I am that far :)
<samueldr> during that time
<samueldr> the community builder built the whole kernel, and an iso image
<gchristensen> cool......
<samueldr> in less than 20 minutes
<samueldr> :/
<gchristensen> ............
<samueldr> not bragging, showing how much disparity there is
<samueldr> IIRC, the 96 core machine was a bit faster for the kernel
<gchristensen> part of that is the community builder is waaaay better hardware
<samueldr> there's so many trivial small builds
<gchristensen> but also, something funky is going on here
<samueldr> but yeah, as I said, the kernel was mighty fast on the 96 core machine
<samueldr> compare those two builds
<gchristensen> hrm
<samueldr> I'm pretty sure no huge changes happened between those
<gchristensen> I'm running fstrim to see if that magically helps?
<samueldr> just like that, I have a dumb thing in my mind: there are two arm builders for hydra, right?
<gchristensen> it has been running a Long Time without any output
<gchristensen> yea
<samueldr> are both of them used?
<gchristensen> should be
<samueldr> what would happen if you used the same builder twice to hand out builds on hydra?
<samueldr> (long shot implausible scenario)
<gchristensen> I don't understand the idea
<gchristensen> 02:09 <mason> Oh, I didn't miss much. I'd guess TRIM. Might not be, but without it the drive has to do a ton of reallocation when it decides it's full, which without TRIM could be unrelated to actual disk-full from the filesystem's perspective.
<samueldr> two machines, A and B, instead of configuring hydra to use A and B, it is accidentally configured to use A and A
<gchristensen> ah
<samueldr> but the, there's no CPU/memory use :/
<samueldr> but then*
<gchristensen> yeah :/
<samueldr> looking at the kernel builds (69), there's a creep in the time it takes to build, but nothing that slow
<samueldr> seems that the sample set I'm looking at all built on packet-t2a-1
<gchristensen> t2a-2 is very responsive
<gchristensen> let's take t2a-1 out of hydra for now
<samueldr> packet-t2a-1 (aarch64-linux) (380 steps done, 1937.4 s/step)
<gchristensen> taken out now
<samueldr> packet-t2a-2 (aarch64-linux) (2455 steps done, 266.3 s/step)
<gchristensen> ...lol
<samueldr> I'm hoping for a recovery of the hydra-built aarch64 sd images
<samueldr> I'm that selfish :)
<gchristensen> :D
<gchristensen> I wonder if I can ctrl-c fstrim
<gchristensen> it is taking hours longer than I expected
<samueldr> are they the first hydra machines with SSD?
<gchristensen> I... have no idea
<samueldr> if not, any difference in configuration that would make the others not behave that badly?
globin has quit [Ping timeout: 252 seconds]
sphalerite has quit [Ping timeout: 260 seconds]
makefu has quit [Ping timeout: 268 seconds]
sphalerite has joined #nixos-aarch64
globin has joined #nixos-aarch64
makefu has joined #nixos-aarch64
worldofpeace has quit [Quit: worldofpeace]
orivej has quit [Ping timeout: 240 seconds]
orivej has joined #nixos-aarch64
orivej has quit [Ping timeout: 250 seconds]
Acou_Bass has quit [Ping timeout: 245 seconds]
Acou_Bass has joined #nixos-aarch64
<gchristensen> date ; fstrim -v /; date
<gchristensen> Thu Dec 13 02:57:38 UTC 2018
<gchristensen> Thu Dec 13 11:40:46 UTC 2018
<gchristensen> /: 53.8 GiB (57720791040 bytes) trimmed
<gchristensen> oh dang
sonnenbloom has quit [Remote host closed the connection]
<makefu> i always assumed fstrim runs at startup ... apparently not
<makefu> /: 11.9 GiB (12711575552 bytes) trimmed
<makefu> trim only took ~10s however
<makefu> /home: 127.2 GiB (136571559936 bytes) trimmed
<makefu> geez
<samueldr> hm, forgot -v, took 28s here on my ~ drive
<gchristensen> my other Packet machine of the same type took <3min
<samueldr> yeah, it smells like the SSD is toast?
<makefu> seems there is `services.fstrim.enable`, i think i will just enable that on my laptop
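A minimal sketch of the option makefu is pointing at, assuming the NixOS `services.fstrim` module; the interval shown is illustrative:

```nix
{
  # Run fstrim periodically from a systemd timer (not at startup).
  services.fstrim = {
    enable = true;
    interval = "weekly";  # systemd.time(7) calendar expression
  };
}
```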
<gchristensen> ok, incoming packet-aarch64-3 -- just going to dump the old machine
orivej has joined #nixos-aarch64
orivej has quit [Ping timeout: 268 seconds]
<gchristensen> rebooting the machine
<gchristensen> (sorry for the late notice)
<gchristensen> damn, I messed it up
<gchristensen> ok rebooting the node again
<gchristensen> building this image took a lot of time, I had to compile llvm and go and the kernel and and and ...
<gchristensen> hopefully we see real movement on aarch64 with that bad one out.
<samueldr> I want to look into a discrete channel, even if only in the nixpkgs-channels repository, for aarch64 instead of tracking nixos-18.09, which the aarch64 builds may or may not have caught up to :/
<gchristensen> I'm afraid of channel proliferation. it makes for a bad user experience
<samueldr> yeah, but what's the solution?
<samueldr> (yeah, not-channels, something else)
<gchristensen> having x, x-darwin, and x-aarch64 means nobody is on the same thing and you can't use nixops
<samueldr> right now the experience is worse since there's no known good point :/
<gchristensen> yeah
<gchristensen> we could get more hardware and make it a blocker for the regular eval
<gchristensen> /nix/store/5d0g32knkimxiwx6p3n9qs61l3lz8phz-post-device-commands: exec: line 8: /nix/store/qi248i3lh00wz6wm3196hd0yxprvjdh2-post-devices.sh: Permission denied
<gchristensen> it would be super cool if I could not make this mistake :)
<gchristensen> [grahamc@aarch64:~]$ df -h /nix/.rw-store
<gchristensen> Filesystem Size Used Avail Use% Mounted on
<gchristensen> /dev/disk/by-label/scratch-space 434G 73M 412G 1% /nix/.rw-store
<gchristensen> blessed are the spacemakers
<vielmetti> waves hello to the assembled crowd
<gchristensen> hello!
<gchristensen> vielmetti: today is the first time our community aarch64 builder actually touches the hard disk :P
<gchristensen> I've set t2a-2/3 to have 36 jobs each instead of 18, since their load has stayed so low at 18
<gchristensen> last adjustment was on 2018-11-20
orivej has joined #nixos-aarch64
orivej has quit [Ping timeout: 268 seconds]
<samueldr> (finished 6m ago)
<gchristensen> :|
<gchristensen> so I am _NOT_ a "performance person"
<gchristensen> so, if we can find someone who *is* it would be very helpful
<samueldr> :\ though not sure if related to yesterday's issue
<samueldr> it's the good ol' 14400-second timeout at the 4h mark
<samueldr> and it is still building!
<gchristensen> I just restarted it :)
<samueldr> this explains it hah
<gchristensen> ok let's try an experiment
<gchristensen> let's set packet-t2a-2 as big-parallel, with 2 simultaneous jobs and 45 cores each
<gchristensen> and packet-t2a-3 as not big-parallel, with 96 jobs and 1 core each
<gchristensen> samueldr: sound like a worthy thing to experiment with?
<samueldr> sounds like it would
<gchristensen> ok
<gchristensen> I'm deploying, and will reboot the nodes to force hydra to restart jobs / redistribute the work
<samueldr> no idea how it all meshes together in hydra exactly though
<gchristensen> can you ask that differently? I'll explain
<gchristensen> (but not sure what to explain :))
<samueldr> I just don't know exactly how those numbers affect the builds :)
<gchristensen> ok
<samueldr> so at face value with the names it sounds good
<gchristensen> hydra has a list of machines, and each machine has a number of jobs to run at once and a list of features
<gchristensen> on the machine side, I can configure how many cores a job can access
<gchristensen> taking away big-parallel from the -3 machine means it won't get kernels, I think
<gchristensen> and I added big-parallel as a required feature, so the one with 2 simultaneous builds will only get declared-large builds
<gchristensen> is this expanding your knowledge, or repeating something you know?
<samueldr> sounds like what I assumed :)
<gchristensen> is there a gap?
<samueldr> I don't think so, other than the hydra-specific bits, but without playing with an instance it's hard to know more :)
<gchristensen> ok
<gchristensen> but if you have questions, ask away
<gchristensen> it is a good use of my time to teach and explain it
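For reference, a minimal sketch of roughly how such a split is expressed as remote-builder configuration; the hostnames, user, key path, and exact numbers below are illustrative assumptions, not the real deployment:

```nix
{
  # nix.buildMachines generates /etc/nix/machines, which the queue
  # runner reads; each entry is one builder with a job cap and features.
  nix.buildMachines = [
    {
      hostName = "packet-t2a-2";               # hypothetical name
      system = "aarch64-linux";
      sshUser = "root";
      sshKey = "/etc/nix/buildkey";            # hypothetical path
      maxJobs = 2;                             # two simultaneous builds
      supportedFeatures = [ "big-parallel" ];
      mandatoryFeatures = [ "big-parallel" ];  # only declared-large builds
    }
    {
      hostName = "packet-t2a-3";
      system = "aarch64-linux";
      sshUser = "root";
      sshKey = "/etc/nix/buildkey";
      maxJobs = 96;            # many small jobs at once
      supportedFeatures = [ ]; # no big-parallel => no kernels land here
    }
  ];
}
```

The cores-per-job cap gchristensen mentions lives on the builder itself (e.g. `nix.buildCores = 1;` on the -3 machine), not in this list.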
<gchristensen> interesting, chromium is not "big-parallel"
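And on the job side, a derivation opts in to big-parallel scheduling via `requiredSystemFeatures`; a sketch with a made-up package:

```nix
# Sketch only: how a build requests a big-parallel machine.
stdenv.mkDerivation {
  name = "example-1.0";
  src = ./.;
  # Hydra will only schedule this build on machines that
  # advertise the big-parallel feature:
  requiredSystemFeatures = [ "big-parallel" ];
}
```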
Thra11 has joined #nixos-aarch64
<gchristensen> https://screenshotscdn.firefoxusercontent.com/images/5bfd4944-9420-4be9-941d-ff457fd86a23.png the angle of the jobs for aarch64 has changed, that is nice
<gchristensen> we might have done it ...
<gchristensen> samueldr: !!
<samueldr> got something big built?
<gchristensen> a kernel is well into the INSTALL phase within 24 minutes
<samueldr> well
<samueldr> there's hope then!
* gchristensen refreshes rapidly from excitement
<gchristensen> 32min
<samueldr> that's more like it
worldofpeace has joined #nixos-aarch64
<gchristensen> we've done 800 aarch64 builds in the last hour
<gchristensen> so thats good
<samueldr> I restarted the latest build for the sd image
<samueldr> (moments ago)
<gchristensen> link?
<samueldr> currently doing a dependency, the kernel
<gchristensen> marked as not big-parallel, wonder how it'll do
<samueldr> I believe that as long as its dependencies are built, everything should be fine; it's not CPU-bound like the squashfs step in the iso image
<gchristensen> ah cool
<gchristensen> I wonder if the squashfs job is big-parallel
<samueldr> we'll figure it out soon, whenever the PR is merged
<gchristensen> it would be very nice to have 3-4 of those ARM servers I was mentioning
<samueldr> I have no idea what I'm blabbering about: but would there be a way to set them up so that if there are no big-parallel builds, they take on multiple smaller tasks?
<gchristensen> no
<gchristensen> well
<gchristensen> ok
<gchristensen> so I set it to take big-parallel, but also accept regular jobs
<gchristensen> but there is no way to say, here, you can either do 2 big parallel, or 45 regular
<samueldr> right
<samueldr> could it be configured 1 big parallel and 30 regular?
<samueldr> (if that even makes sense)
<gchristensen> mmmmmaybe
<gchristensen> so, yes, I could probably add the instance to the list twice
<gchristensen> under a second hostname
<gchristensen> but the 30 regular jobs would be able to access the same # of cores as the big-parallel jobs
<samueldr> right, so "not without containerizing or vm or other dastardly tricks"
<gchristensen> it is possible that, on average, that would work pretty well -- just assuming non-big-parallel builds will behave well
<samueldr> yeah, just theory-crafting for whenever the situation re-stabilizes
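In that theory-crafting spirit, the same box could be listed twice under an alias, with the caveat gchristensen raised that both lanes would share the builder's single cores setting; everything named here is hypothetical:

```nix
{
  nix.buildMachines = [
    {
      hostName = "packet-t2a-2";  # big-parallel lane
      system = "aarch64-linux";
      maxJobs = 1;
      supportedFeatures = [ "big-parallel" ];
      mandatoryFeatures = [ "big-parallel" ];
    }
    {
      hostName = "t2a-2-small";  # hypothetical alias resolving to the same host
      system = "aarch64-linux";
      maxJobs = 30;              # regular jobs, but they would still see the
                                 # same per-job core count as the big lane
    }
  ];
}
```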
<gchristensen> :)
<gchristensen> *channeling MichaelRaskin*
<gchristensen> I appreciate your optimism that it will stabilize! :)
<samueldr> it was pretty stable at being bad beforehand :)
<gchristensen> back in 10
<gchristensen> (back)
<gchristensen> https://hydra.nixos.org/build/85853984 this standard kernel compiled in 40min
<samueldr> yeah, that's the one queued by the sd_image :)
<gchristensen> perfect
<gchristensen> perfect!
orivej has joined #nixos-aarch64
<gchristensen> samueldr: bringing up another aarch64 node as a spot market machine to help chew through the queue
<samueldr> nice
<gchristensen> ps: yay for bootloader roll-backs, because I definitely deployed the wrong config to this host :D
<gchristensen> https://screenshotscdn.firefoxusercontent.com/images/1fd9b174-4af4-41b2-a768-02198157d1ed.png -- t2a-3 is set to do 85 parallel builds with one core each, t2a-2 is set to do two parallel builds with 45 cores each
<samueldr> nice to see work happening
<samueldr> ooh, all deps are built I think for the sd image, it's scheduled
<gchristensen> woop woop
worldofpeace has quit [Read error: Connection reset by peer]
worldofpeace_ has joined #nixos-aarch64
<gchristensen> :OOO
<gchristensen> goooooo sd-image!
<samueldr> I guess it's going to be the moment to rewrite the wiki page
<gchristensen> what the heck happened, why did it go back to being scheduled
<samueldr> you, tell us :/
<gchristensen> :|
<gchristensen> this isn't fair!
<gchristensen> ??? it just restarted
<gchristensen> receiving outputs again
<gchristensen> go puppy go
<gchristensen> 3 minutes to build, 14+ to copy ... :|
<gchristensen> samueldr: !
<samueldr> yay!
<gchristensen> 🧁
<samueldr> glad to see there's a build product
Thra11 has quit [Ping timeout: 272 seconds]
<gchristensen> samueldr: as of now there are fewer aarch64 jobs than darwin jobs
<gchristensen> by about 90 :)
<THFKA4> sweet, new images
<samueldr> chromium is building, and by the size of the log I'm pretty sure the build is going fine
orivej has quit [Ping timeout: 244 seconds]