ekleog changed the topic of #nixos-dev to: NixOS Development (#nixos for questions) | https://hydra.nixos.org/jobset/nixos/trunk-combined https://channels.nix.gsc.io/graph.html https://r13y.com | 18.09 release managers: vcunat and samueldr | https://logs.nix.samueldr.com/nixos-dev
LnL has quit [Ping timeout: 252 seconds]
<samueldr> we might want to figure out an alerting solution for hydra again; the `tested` job apparently vanished from nixos-unstable for 2-3 days and wasn't really noticed :/ (in addition to tarball failing on 18.09 earlier)
<domenkozar> yeah :)
<samueldr> ofborg#342 might help in the future; there was a small sliver of nixos not under watch of ofborg AFAICT
<{^_^}> https://github.com/NixOS/ofborg/pull/342 (by samueldr, 1 minute ago, open): nixpkgs: Tests nixos' `tested` job
<domenkozar> I used to have a todo to get howoldis to return json api
<samueldr> it already does
<domenkozar> then it could alert on IRC when commit <> channel have been out of sync for x days
<samueldr> (your todo could predate that addition I guess)
<samueldr> domenkozar: changing to ghc822 was right, right?
<domenkozar> :)
<samueldr> indeed, I guess you know if you touched its code :)
<domenkozar> so all that's missing is something that reads the api and alerts
<domenkozar> but maybe it's easier to add to ofborg
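(A rough sketch of the alerting idea above: poll a channel-age JSON API and complain once a channel has lagged for more than a few days. The endpoint URL and the .name/.time fields below are placeholders for illustration, not the actual howoldis API.)

    # sketch: alert when a channel has not advanced for more than N days
    CHANNELS_API="https://example.org/api/channels"   # hypothetical endpoint
    MAX_AGE_DAYS=3
    now=$(date +%s)
    curl -s "$CHANNELS_API" | jq -r '.[] | "\(.name) \(.time)"' | while read -r name time; do
      age_days=$(( (now - time) / 86400 ))
      if [ "$age_days" -gt "$MAX_AGE_DAYS" ]; then
        echo "ALERT: $name has not updated in $age_days days"   # hook up IRC/email here
      fi
    done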
<domenkozar> 822 is not ideal, but it will do
<domenkozar> :)
<samueldr> I don't know if it makes sense under ofborg to add alerting and stuff that's not a reaction to a repo event; considering it's (I don't say this authoritatively) a first-responder CI
<domenkozar> I guess it's up to whoever implements it
<samueldr> I see that at one point, failures for *:tarball would likely be sent to the nix-commits mailing list, since it's listed as the maintainer of the tarball job
<domenkozar> I'm quite biased here and believe alerting should be commit based
<domenkozar> maintainers and commit authors differ a lot in nixpkgs
<domenkozar> sure we batch commits, so everyone in the batch should get it
<domenkozar> if I got the email for breakage, I'd fix it right away
<domenkozar> but currently I'd just make peti angry for getting the email.
<samueldr> in addition to that, a firehose for the generic things, like the tested job, might be nice to subscribe to
<domenkozar> yeah that's for release managers as second defense :)
teto has quit [Ping timeout: 246 seconds]
clever has quit [Ping timeout: 250 seconds]
clever has joined #nixos-dev
ajs124 has left #nixos-dev [#nixos-dev]
clever has quit [Ping timeout: 250 seconds]
clever has joined #nixos-dev
clever has quit [Ping timeout: 250 seconds]
drakonis has quit [Quit: WeeChat 2.3]
jtojnar has quit [Read error: Connection reset by peer]
jtojnar has joined #nixos-dev
jtojnar has quit [Read error: Connection reset by peer]
jtojnar has joined #nixos-dev
Zer000 has quit [Ping timeout: 250 seconds]
drakonis has joined #nixos-dev
drakonis1 has quit [Ping timeout: 246 seconds]
MichaelRaskin has quit [Quit: MichaelRaskin]
teto has joined #nixos-dev
clever has joined #nixos-dev
<sphalerite> gchristensen: why did ofborg request my review on #58215?
<{^_^}> https://github.com/NixOS/nixpkgs/pull/58215 (by primeos, 13 hours ago, open): iputils: 20180629 -> 20190324
johanot has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 250 seconds]
teto has quit [Ping timeout: 250 seconds]
teto has joined #nixos-dev
__Sander__ has joined #nixos-dev
orivej_ has quit [Ping timeout: 246 seconds]
orivej has joined #nixos-dev
orivej has quit [Ping timeout: 255 seconds]
<gchristensen> sphalerite: because you're a maintainer
ajs124 has joined #nixos-dev
<primeos> sphalerite: I could update that PR to also remove you from meta.maintainers if you don't want to get these requests in the future ;)
init_6 has joined #nixos-dev
jtojnar has quit [Read error: Connection reset by peer]
jtojnar has joined #nixos-dev
<sphalerite> gchristensen: whoops, I forgot about that ;)
<sphalerite> :') even
jtojnar has quit [Quit: jtojnar]
jtojnar has joined #nixos-dev
<gchristensen> I was getting ready to spelunk through debugging code
<gchristensen> I was delighted it was so simple
orivej has joined #nixos-dev
<gchristensen> srhb: 'round?
johanot has quit [Quit: WeeChat 2.4]
ckauhaus has joined #nixos-dev
aszlig has quit [Quit: Kerneling down for reboot NOW.]
aszlig has joined #nixos-dev
<gchristensen> samueldr: it looks like master needs to be merged to staging due to -darwin eval problems. should we just ... do that?
<samueldr> I don't know, I think so?
genesis has quit [Quit: Leaving]
<samueldr> isn't that done, usually, when staging diverges for a while?
<gchristensen> I don't know :|
<gchristensen> I've never followed staging. I suspect it is time for some more clarity on the staging process and who is in charge
<samueldr> same
<gchristensen> cool, cool :)
init_6 has quit []
<sphalerite> FRidh seems to do a lot of staging stuff, maybe he can provide enlightenment
orivej has quit [Ping timeout: 250 seconds]
jtojnar_ has joined #nixos-dev
jtojnar has quit [Ping timeout: 250 seconds]
jtojnar_ is now known as jtojnar
jtojnar has quit [Ping timeout: 250 seconds]
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 240 seconds]
genesis has joined #nixos-dev
genesis_ has joined #nixos-dev
genesis_ is now known as genesis
genesis is now known as Guest54103
<samueldr> at ~ 03:23 UTC, an eval OOM'd for trunk-combined
<samueldr> nixos:trunk-combined*
drakonis1 has joined #nixos-dev
<srhb> gchristensen: pong
<gchristensen> I was going to ask you to cherry-pick your fixup commit to staging, but probably better to merge master in
<srhb> gchristensen: Wasn't master just merged in?
<{^_^}> #58106 (by kalbasit, 3 days ago, merged): Merge master into staging
<kalbasit> I merged master into staging 3 days ago and fixed the bug that caused eval to fail
<srhb> kalbasit: Which bug was that?
<gchristensen> srhb: staging doesn't evaluate due to the same darwin problem
<gchristensen> presently
<kalbasit> srhb: a PR I merged about two weeks ago got broken when a more recent PR was merged
<srhb> Got it..
<srhb> gchristensen: Hmm, I think cherry pick is probably the least disruptive right now..
<srhb> But dunno.
<kalbasit> gchristensen: I agree! We really should define the staging process and team
<kalbasit> RFC maybe?
<kalbasit> srhb: which fixup commit is gchristensen talking about?
<srhb> I think the darwin-tested set.
<srhb> Actually, peti already fixed that in master.
<srhb> So next merge is probably better.
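(For reference, the "merge master into staging" step being discussed is roughly the following; a sketch of one common way to do it, with the cherry-pick alternative as a comment. It can equally go through a PR, as in #58106.)

    # in a nixpkgs checkout with push access
    git fetch origin
    git checkout staging
    git merge origin/master        # or: git cherry-pick <fixup-commit> for a minimal fix
    git push origin staging        # alternatively, open a PR against staging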
Zer000 has joined #nixos-dev
orivej has joined #nixos-dev
Jackneill has quit [Ping timeout: 244 seconds]
Jackneill has joined #nixos-dev
__Sander__ has quit [Quit: Konversation terminated!]
LnL7 has joined #nixos-dev
drakonis1 has quit [Quit: WeeChat 2.3]
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 250 seconds]
<gchristensen> iso_minimal grew by 30 paths
<gchristensen> I don't know if this is a thing we want to / should care about, but in case somebody does
<simpson> Export and graph it?
<simpson> Definitely something worth knowing about. I suppose that there's a hard size limit on the closure, too?
<gchristensen> probably should be
<simpson> Ah, yeah, from Hydra. Nice. I wonder which size is the right one to aim at. Looks like there was a time in 2017 when we could have fit on a miniature CD, even.
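(If someone does want to track this, one rough way to measure the iso_minimal closure locally; a sketch using the attribute name from nixos/release.nix.)

    # build the minimal ISO and inspect its closure
    nix-build nixos/release.nix -A iso_minimal.x86_64-linux
    nix-store -q --requisites ./result | wc -l     # number of store paths in the closure
    nix path-info -S -h ./result                   # total closure size, human-readable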
hedning_ has joined #nixos-dev
hedning_ has quit [Client Quit]
orivej has quit [Ping timeout: 250 seconds]
drakonis has joined #nixos-dev
srk has quit [Ping timeout: 268 seconds]
drakonis_ has quit [Ping timeout: 250 seconds]
srk has joined #nixos-dev
drakonis1 has joined #nixos-dev
ajs124 has left #nixos-dev [#nixos-dev]
ajs124 has joined #nixos-dev
drakonis1 has quit [Quit: WeeChat 2.3]
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 250 seconds]
drakonis_ has quit [Ping timeout: 240 seconds]
<samueldr> :( I've restarted the eval of nixos:trunk-combined three times already: twice exit code 99 without further error message, and once with "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS"... thinking that memory alloc issues will need to be tackled :/ (in addition to slimming the eval)
<samueldr> what can I do to help?
* samueldr re-schedules for eval
<gchristensen> not sure
<gchristensen> any commits which are obviously correlated?
<samueldr> no idea, not sure how to correlate
<gchristensen> me either
<samueldr> it seems that we've lately been on the cusp so it might be accretion
<gchristensen> maybe we can get more RAM
<samueldr> that would solve one part, but the MAXHINCR/MAX_HEAP_SECTS is a bit more fatal; and this is something that precludes "full" aarch64 support for now
<gchristensen> samueldr: is there any chance my merging of your PR broke this?
<samueldr> unlikely
<samueldr> if you mean the hydra PR
<gchristensen> yea
<samueldr> this (unless there's some undefined behaviour which I doubt) will only change the behaviour when evaluator_initial_heap_size was set to a lower value than the initial max heap size
<gchristensen> right.
<srhb> Isn't 99 essentially "instantiate" dying?
<srhb> Like, all the error messages we see should be recoverable, right?
orivej has joined #nixos-dev
<samueldr> srhb: just to be sure, which error messages?
<srhb> "because heap size is at ... "
<samueldr> right
<samueldr> this is as designed in the hydra evaluator
<srhb> Yeah..
<samueldr> this is intended to keep memory usage low by dropping the child process when its memory usage gets too high...
<srhb> So are we getting an actual oom kill on the server, causing exit 99?
<samueldr> sometimes
* gchristensen is available to do remote hands
<srhb> Welp..
* gchristensen doesn't even charge remote hands rates
<samueldr> this "isn't as much of an issue" than the max heap things
<srhb> Right, but last time it didn't die with max heap sects, I think.
<srhb> Your gist did..
<samueldr> ... but AFAICT (I might be wrong) what happens is that when the eval job is restarted, the GC heaps are still in a pretty fragmented state
<samueldr> srhb: exactly, right now there are two issues that are a bit cusps-y
* srhb nods
<samueldr> (1) memory usage in evals is hitting the memory limits of the hydra server... but that's not as much of an issue; the SQL server and everything else running there are using a lot of memory
<samueldr> this is "free" to fix by allocating more memory to the machine (not actually free)
<samueldr> (2) "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS" might be closer to normal evals than we like
<samueldr> (and has already stopped aarch64-linux from being made a supported system)
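(The MAX_HEAP_SECTS limit is a compile-time constant of the Boehm GC that Nix links against; bdwgc's --enable-large-config configure switch raises it. A sketch, assuming nixpkgs' boehmgc exposes that switch as an enableLargeConfig override argument; the exact flag name in a given checkout is an assumption here.)

    # rebuild the Boehm GC with the larger heap-section limits, then rebuild Nix against it
    nix-build -E 'with import <nixpkgs> {}; boehmgc.override { enableLargeConfig = true; }'
    # the underlying configure flag on bdwgc itself is --enable-large-config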
<srhb> 2 I understand. The non-heap-sects crash I don't.
<srhb> (Unless, as said, it's really an oom kill)
<samueldr> I asked gchristensen about a couple (2) of those evals and he confirmed they OOM'd
<gchristensen> I did? :)
<gchristensen> I did
<gchristensen> [24235165.786280] Out of memory: Kill process 10401 (hydra-eval-jobs) score 485 or sacrifice child
<gchristensen> [24235165.787004] Killed process 10401 (hydra-eval-jobs) total-vm:18451332kB, anon-rss:16338648kB, file-rss:0kB, shmem-rss:0kB
<srhb> OK, that makes sense then.
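(For anyone reproducing this, kernel OOM kills like the two lines pasted above can be confirmed straight from the kernel log; a sketch.)

    # look for OOM-killer activity on the hydra machine
    journalctl -k | grep -iE 'out of memory|killed process'
    # or, without journald: dmesg | grep -i 'killed process'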
<gchristensen> maybe we just have too many builders
<srhb> I seem to be able to eval in 3 minutes using approximately 10 GiB
<srhb> so 15 GiB per evaluator to be safe, currently.
<srhb> (I could probably gc more aggressively with a smaller heap, but what's the point)
<srhb> Of course, my postgres takes no memory at all, because it's an empty hydra really...
<samueldr> srhb: it needs to go through <hydra/src/hydra-eval-jobs/hydra-eval-jobs>
<srhb> It is.
<gchristensen> bummer, I was hoping https://github.com/NixOS/nixos-org-configurations/blob/master/delft/hydra.nix#L66 would be a magic trick here
<samueldr> that, for a master eval, takes 3GiB here? (from a `time` output)
<srhb> samueldr: Hmm, I let the ordinary hydra machinery do the thing, instead of calling eval jobs directly.
<srhb> But really, 3 GiB? o_o
<samueldr> I need to recheck what GNU time's memory unit is
<samueldr> but it's max mem: 3419300 of those
* srhb nods
<samueldr> (and it's one of those helpful gnu tools with everything in info pages, so I don't know how to use their documentation)
<samueldr> >> %M maximum resident set size in KB\n
<samueldr> which is 3.2GiB unless I gunked up the works in /1024/1024
<srhb> Nope, sounds right.
<samueldr> for anyone playing at home: `time env -i GC_INITIAL_HEAP_SIZE=4G ~/tmp/hydra/hydra/src/hydra-eval-jobs/hydra-eval-jobs -I /nix/store -I $PWD ./nixos/release.nix > /dev/null`
<samueldr> the GC_INITIAL_HEAP_SIZE would be wrong and instead I should look into setting evaluator_initial_heap_size = 10000000000
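(The evaluator_initial_heap_size knob mentioned here lives in hydra.conf; a minimal sketch of what that would look like, using the 10 GB figure floated above.)

    # hydra.conf (excerpt, sketch)
    evaluator_initial_heap_size = 10000000000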
* srhb will have to rejoin the investigation tomorrow, it's late
<srhb> o/
<samueldr> 'night!
<gchristensen> 'night, srhb
<gchristensen> thanks :)
jtojnar has joined #nixos-dev
<gchristensen> I stopped the queue runner after I got the famous 503 or whatever the dexter page is
<gchristensen> ok started both queue runner and evaluator again
drakonis_ has joined #nixos-dev