<samueldr>
we might want to figure out an alerting solution for hydra again; the tested job apparently vanished from nixos-unstable for 2-3 days and wasn't really noticed :/ (in addition to tarball failing on 18.09 earlier)
<domenkozar>
yeah :)
<samueldr>
ofborg#342 might help in the future; there was a small sliver of nixos not under watch of ofborg AFAICT
<samueldr>
domenkozar: changing to ghc822 was right, right?
<domenkozar>
:)
<samueldr>
indeed, I guess you know if you touched its code :)
<domenkozar>
so all that's missing is something that reads the api and alerts
<domenkozar>
but maybe it's easier to add to ofborg
<domenkozar>
822 is not ideal, but it will do
<domenkozar>
:)
<samueldr>
I don't know if it makes sense under ofborg to add alerting and stuff that's not a reaction to a repo event; considering it's (I don't say this authoritatively) a first-responder CI
<domenkozar>
I guess it's up to whoever implements it
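As an aside, a minimal sketch of the kind of watcher being discussed might look like the Python below. It assumes Hydra's JSON API serves the latest build of a job at /job/<project>/<jobset>/<job>/latest when requested with Accept: application/json and that the response carries a timestamp field; the endpoint, field names, staleness threshold and alert hook are all illustrative assumptions, not a description of any existing tool.

    # Hypothetical watcher for a channel-blocking job such as
    # nixos/trunk-combined/tested. Assumes Hydra answers
    # /job/<project>/<jobset>/<job>/latest with JSON when asked for it;
    # field names and the alert hook are illustrative assumptions.
    import json
    import time
    import urllib.request

    HYDRA = "https://hydra.nixos.org"
    JOB = "nixos/trunk-combined/tested"
    MAX_AGE = 3 * 24 * 3600  # alert if the newest build is ~3 days old

    def latest_build():
        req = urllib.request.Request(
            f"{HYDRA}/job/{JOB}/latest",
            headers={"Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def alert(message):
        # Placeholder: mail a list, ping an IRC bot, page someone...
        print("ALERT:", message)

    while True:
        build = latest_build()
        age = time.time() - build.get("timestamp", 0)
        if age > MAX_AGE:
            alert(f"{JOB}: newest build is {age / 86400:.1f} days old")
        time.sleep(3600)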
<samueldr>
I see that at one point, failures for *:tarball would likely be sent to the nix-commits mailing list, since it's listed as the maintainer of the tarball job
<domenkozar>
I'm quite biased here and believe alerting should be commit based
<domenkozar>
maintainers and commit authors differ a lot in nixpkgs
<domenkozar>
sure we batch commits, so everyone in the batch should get it
<domenkozar>
if I got the email for breakage, I'd fix it right away
<domenkozar>
but currently I'd just make peti angry for getting the email.
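To make the commit-based idea concrete, here is a rough, hypothetical sketch: given the nixpkgs revisions of the last good and the first failing evaluation, collect the authors of every commit in that batch and notify each of them. The repository path, the revision placeholders and the notification step are assumptions; only the git invocation is standard.

    # Hypothetical commit-based alerting: everyone whose commit landed in the
    # batch between the last good eval and the failing one gets notified.
    # Assumes a local nixpkgs checkout; the notify step is a placeholder.
    import subprocess

    def batch_authors(repo, good_rev, bad_rev):
        out = subprocess.run(
            ["git", "-C", repo, "log", "--format=%ae", f"{good_rev}..{bad_rev}"],
            check=True, capture_output=True, text=True,
        ).stdout
        return sorted(set(out.split()))

    # LAST_GOOD_REV / FIRST_BAD_REV would come from the two evals' metadata.
    for addr in batch_authors("/path/to/nixpkgs", "LAST_GOOD_REV", "FIRST_BAD_REV"):
        print("would notify:", addr)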
<samueldr>
in addition to that, a firehose for the generic things, like the tested job, might be nice to subscribe to
<domenkozar>
yeah that's for release managers as second defense :)
<sphalerite>
gchristensen: why did ofborg request my review on #58215?
<simpson>
Ah, yeah, from Hydra. Nice. I wonder which size is the right one to aim at. Looks like there was a time in 2017 when we could have fit on a miniature CD, even.
<samueldr>
:( I've restarted the eval of nixos:trunk-combined three times already, twice with exit code 99 and no further error message, and once with "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS"... thinking that memory alloc issues will need to be tackled :/ (in addition to slimming the eval)
<samueldr>
what can I do to help?
* samueldr
re-schedules the eval
<gchristensen>
not sure
<gchristensen>
any commits which are obviously correlated?
<samueldr>
no idea, not sure how to correlate
<gchristensen>
me either
<samueldr>
it seems that we've lately been on the cusp so it might be accretion
<gchristensen>
maybe we can get more RAM
<samueldr>
that would solve one part, but the MAXHINCR/MAX_HEAP_SECTS issue is a bit more fatal; and this is something that precludes "full" aarch64 support for now
<gchristensen>
samueldr: is there any chance my merging of your PR broke this?
<samueldr>
unlikely
<samueldr>
if you mean the hydra PR
<gchristensen>
yea
<samueldr>
this (unless there's some undefined behaviour which I doubt) will only change the behaviour when evaluator_initial_heap_size was set to a lower value than the initial max heap size
<samueldr>
this is intended to keep the memory usage low by dropping the child process when memory is higher...
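For readers unfamiliar with the pattern being described here, a generic sketch of "drop the child process when its memory gets too high" follows; it is emphatically not Hydra's code, just an illustration of the technique, with a placeholder worker command, an arbitrary limit, and Linux-only /proc reading.

    # Generic "restart the worker when it gets too big" pattern, as an
    # illustration only; the command and limit are placeholders.
    import subprocess
    import time

    MAX_RSS_KB = 4 * 1024 * 1024  # arbitrary 4 GiB limit for illustration

    def rss_kb(pid):
        # VmRSS is reported in kB in /proc/<pid>/status (Linux-specific).
        try:
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        return int(line.split()[1])
        except FileNotFoundError:
            pass
        return 0

    while True:
        child = subprocess.Popen(["some-eval-worker"])  # placeholder command
        killed_for_memory = False
        while child.poll() is None:
            if rss_kb(child.pid) > MAX_RSS_KB:
                child.terminate()  # drop the bloated child...
                child.wait()
                killed_for_memory = True
                break  # ...and start a fresh one
            time.sleep(5)
        if not killed_for_memory:
            break  # worker finished (or failed) on its own; stop restarting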
<srhb>
So are we getting an actual oom kill on the server, causing exit 99?
<samueldr>
sometimes
* gchristensen
is available to do remote hands
<srhb>
Welp..
* gchristensen
doesn't even charge remote hands rates
<samueldr>
this "isn't as much of an issue" than the max heap things
<srhb>
Right, but last time it didn't die with max heap sects, I think.
<srhb>
Your gist did..
<samueldr>
... but AFAICT (I might be wrong) what happens is that when the eval job is restarted, the GC heaps are still in a pretty fragmented state
<samueldr>
srhb: exactly, right now there are two issues that are a bit cusps-y
* srhb
nods
<samueldr>
(1) memory usage in evals is hitting the memory limits of the hydra server... but that's not as much of an issue; the SQL server and everything else running there are using a lot of memory
<samueldr>
this is "free" to fix through allocating more memory to the machine (not actually free)
<samueldr>
(2) "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS" might be closer to normal evals than we like
<samueldr>
(and has already stopped aarch64-linux from being made a supported system)
<srhb>
2 I understand. The non-heap-sects crash I don't.
<srhb>
(Unless, as said, it's really an oom kill)
<samueldr>
I asked gchristensen about a couple (2) of those evals and he confirmed they were OOM'd
<gchristensen>
I did? :)
<gchristensen>
I did
<gchristensen>
[24235165.786280] Out of memory: Kill process 10401 (hydra-eval-jobs) score 485 or sacrifice child
<gchristensen>
[24235165.787004] Killed process 10401 (hydra-eval-jobs) total-vm:18451332kB, anon-rss:16338648kB, file-rss:0kB, shmem-rss:0kB
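Tying the two failure modes in this exchange together: a wrapper around the eval could, in principle, tell them apart, since the Boehm GC's "Too many heap sections" abort shows up on the evaluator's stderr while an OOM kill only shows up in the kernel log (as in the lines pasted above). The sketch below is a hedged illustration; the invocation is a placeholder and reading the kernel log may require elevated privileges.

    # Hypothetical classifier for a failed eval run: GC heap-section abort
    # vs. kernel OOM kill. The command line is a placeholder.
    import subprocess

    def classify_eval_failure(cmd):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return "ok"
        if "Too many heap sections" in proc.stderr:
            return "gc-heap-sections"  # MAXHINCR / MAX_HEAP_SECTS limit hit
        kernel = subprocess.run(
            ["journalctl", "-k", "-n", "200"],
            capture_output=True, text=True,
        ).stdout
        if "Out of memory: Kill process" in kernel and "hydra-eval-jobs" in kernel:
            return "oom-killed"
        return f"unknown (exit code {proc.returncode})"

    print(classify_eval_failure(["hydra-eval-jobs", "./release.nix"]))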