<samueldr>
(I don't mean to devalue myself, but I think you're probably more used to it!)
<gchristensen>
I'm not sure I am :')
<samueldr>
are the hydra hitting the disks?
<samueldr>
hydra builders*
<samueldr>
if it's all in memory it's weird?
<gchristensen>
it does hit the disk
<samueldr>
spinning rust I guess
<gchristensen>
I'd tell you but I'm waiting for nix-shell -p sysstat -p iotop :P
<gchristensen>
ok ... I pinged a friend of mine who is good at linux performance. let's see what they say. initial guess is: smells like contention on a file descriptor
<samueldr>
I know next to nothing about how hydra allocates tasks: is there a value you could, like, cut in half and see if it helps? (where it would help in a really bad case) (and I hate to suggest doing things like that in prod)
<samueldr>
but if e.g. it's tasked to do 64 builds at once, setting it to 32 might increase the throughput while looking at the real issue?
<samueldr>
(don't know how bad it is in reality)
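(An aside on what that knob might look like: on a NixOS-managed Hydra the per-builder concurrency usually comes from the build machines list, e.g. generated from nix.buildMachines. The entry below is purely illustrative; the host name, key path, and numbers are assumptions.)

    # Hypothetical builder entry; halving maxJobs is the "cut it in half" experiment.
    nix.buildMachines = [
      {
        hostName = "packet-t2a-1";
        system = "aarch64-linux";
        sshUser = "root";
        sshKey = "/etc/nix/builder_key";
        maxJobs = 32;                           # e.g. was 64
        speedFactor = 2;
        supportedFeatures = [ "big-parallel" ];
      }
    ];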
<gchristensen>
`ls` takes 10s
<gchristensen>
(1) that is possible (2) the system is extremely underloaded, actually, not sure why it is whacking out
<samueldr>
from experience, once I/O gets in the mix it's hard to diagnose?
<samueldr>
(or lack thereof?)
<gchristensen>
slow to diagnose
<gchristensen>
my friend here was the database performance guy for a long time
<samueldr>
ah, the one who hates all the devs ;)
<gchristensen>
stripping FHS paths in `./tools/testing/selftests/powerpc/pmu/Makefile'...
<gchristensen>
^ I am that far :)
<samueldr>
during that time
<samueldr>
the community builder built the whole kernel, and an iso image
<gchristensen>
cool......
<samueldr>
in less than 20 minutes
<samueldr>
:/
<gchristensen>
............
<samueldr>
not bragging, showing how much disparity there is
<samueldr>
IIRC, the 96 core machine was a bit faster for the kernel
<gchristensen>
part of that is the community builder is waaaay better hardware
<samueldr>
there's so many trivial small builds
<gchristensen>
but also, something funky is going on here
<samueldr>
but yeah, as I said, the kernel was mighty fast on the 96 core machine
<samueldr>
I'm pretty sure no huge changes happened between those
<gchristensen>
I'm running fstrim to see if that magically helps?
<samueldr>
just like that, I have a dumb thing in my mind: there are two arm builders for hydra, right?
<gchristensen>
it has been running a Long Time without any output
<gchristensen>
yea
<samueldr>
are both of them used?
<gchristensen>
should be
<samueldr>
what would happen if you used the same builder twice when handing out builds on hydra?
<samueldr>
(long shot implausible scenario)
<gchristensen>
I don't understand the idea
<gchristensen>
02:09 <mason> Oh, I didn't miss much. I'd guess TRIM. Might not be, but without it the drive has to do a ton of reallocation when it decides it's full, which without TRIM could be unrelated to actual disk-full from the filesystem's perspective.
<samueldr>
two machines, A and B, instead of configuring hydra to use A and B, it is accidentally configured to use A and A
<gchristensen>
ah
<samueldr>
but the, there's no CPU/memory use :/
<samueldr>
but then*
<gchristensen>
yeah :/
<samueldr>
looking at the kernel builds (69) there's a creep upward in the time it takes to build, but nothing that slow
<samueldr>
seems that the sample set I'm looking at all built on packet-t2a-1
<gchristensen>
t2a-2 is very responsive
<gchristensen>
let's take t2a-1 out of hydra for now
<samueldr>
hm, forgot -v, took 28s here on my ~ drive
<gchristensen>
my other Packet machine of the same type took <3min
<samueldr>
yeah, it smells like toast SSD?
<makefu>
seems there is services.fstrim.enable, i think i will just enable that on my laptop
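(A minimal sketch of that option as it would go into configuration.nix; the interval value is just an example. The one-off equivalent run by hand on a mounted filesystem is fstrim -v <mountpoint>.)

    # Periodic TRIM via the fstrim systemd timer (values are examples)
    services.fstrim.enable = true;
    services.fstrim.interval = "weekly";   # systemd.time(7) calendar expression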
<gchristensen>
ok, incoming packet-aarch64-3 -- just going to dump the old machine
<gchristensen>
rebooting the machine
<gchristensen>
(sorry for the late notice)
<gchristensen>
damn, I messed it up
<gchristensen>
ok rebooting the node again
<gchristensen>
building this image took a lot of time, I had to compile llvm and go and the kernel and and and ...
<gchristensen>
hopefully we see real movement on aarch64 with that bad one out.
<samueldr>
I want to look into a discrete channel, even if only in the nixpkgs-channels repository, for aarch64 instead of tracking nixos-18.09 which the aarch64 builds may or may not be caught up to :/
<gchristensen>
I'm afraid of channel proliferation. it makes for a bad user experience
<samueldr>
yeah, but what's the solution?
<samueldr>
(yeah, not-channels, something else)
<gchristensen>
having x, x-darwin, and x-aarch64 means nobody is on the same thing and you can't use nixops
<samueldr>
right now the experience is worse since there's no known good point :/
<gchristensen>
yeah
<gchristensen>
we could get more hardware and make it a blocker for the regular eval
<gchristensen>
/nix/store/5d0g32knkimxiwx6p3n9qs61l3lz8phz-post-device-commands: exec: line 8: /nix/store/qi248i3lh00wz6wm3196hd0yxprvjdh2-post-devices.sh: Permission denied
<gchristensen>
it would be super cool if I could not make this mistake :)
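(Purely a guess at the failure mode: a store-path script exec'd without its execute bit, e.g. one generated with pkgs.writeText instead of pkgs.writeScript, fails with exactly that "Permission denied". A sketch of the distinction, with hypothetical names:)

    # Not executable: exec'ing this gives "Permission denied"
    bad = pkgs.writeText "post-devices.sh" ''
      #!${pkgs.stdenv.shell}
      echo hello
    '';

    # Same contents, but marked executable in the store
    good = pkgs.writeScript "post-devices.sh" ''
      #!${pkgs.stdenv.shell}
      echo hello
    '';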
<samueldr>
currently doing a dependency, the kernel
<gchristensen>
marked as not big-parallel, wonder how it'll do
<samueldr>
I believe that as long as its dependencies are built, everything should be fine; it's not CPU-bound like the squashfs step in the iso image
<gchristensen>
ah cool
<gchristensen>
I wonder if the squashfs job is big-parallel
<samueldr>
we'll figure it out soon, whenever the PR is merged
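(For context on the "marked as big-parallel" mechanism: a derivation opts in through requiredSystemFeatures, and the queue runner only dispatches it to builders advertising that feature. A minimal sketch, not the actual squashfs expression:)

    stdenv.mkDerivation {
      name = "example-heavy-build";
      # Only schedule this on machines that offer the big-parallel feature
      requiredSystemFeatures = [ "big-parallel" ];
      # ...
    }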
<gchristensen>
it would be very nice to have 3-4 of those ARM servers I was mentioning
<samueldr>
I have no idea what I'm blabbering about: but would there be a way to set them up so if there's no big-parallel builds, they take on multiple smaller tasks?
<gchristensen>
no
<gchristensen>
well
<gchristensen>
ok
<gchristensen>
so I set it to take big-parallel, but also accept regular jobs
<gchristensen>
but there is no way to say, here, you can either do 2 big parallel, or 45 regular
<samueldr>
right
<samueldr>
could it be configured 1 big parallel and 30 regular?
<samueldr>
(if that even makes sense)
<gchristensen>
mmmmmaybe
<gchristensen>
so, yes, I could probably add the instance to the list twice
<gchristensen>
under a second hostname
<gchristensen>
but the 30 regular jobs would be able to access the same # of cores as the big-parallel jobs
<samueldr>
right, so "not without containerizing or vm or other dastardly tricks"
<gchristensen>
it is possible that, on average, that would work pretty well -- just assuming non-big-parallel will behave well
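(A sketch of the "add the instance to the list twice under a second hostname" idea in nix.buildMachines form; the names, key path, and numbers are illustrative, and as noted above both entries still share the same physical cores.)

    nix.buildMachines = [
      {
        hostName = "packet-t2a-2";             # entry 1: big-parallel jobs only
        system = "aarch64-linux";
        sshUser = "root";
        sshKey = "/etc/nix/builder_key";
        maxJobs = 1;
        mandatoryFeatures = [ "big-parallel" ];
      }
      {
        hostName = "packet-t2a-2-small";       # entry 2: same box, via an ssh alias
        system = "aarch64-linux";
        sshUser = "root";
        sshKey = "/etc/nix/builder_key";
        maxJobs = 30;                          # regular jobs
      }
    ];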