<samueldr>
we might want to figure out an alerting solution for hydra again; the tested job apparently vanished from nixos-unstable for 2-3 days and wasn't really noticed :/ (in addition to tarball failing on 18.09 earlier)
<domenkozar>
yeah :)
<samueldr>
ofborg#342 might help in the future; there was a small sliver of nixos not under watch of ofborg AFAICT
<samueldr>
domenkozar: changing to ghc822 was right, right?
<domenkozar>
:)
<samueldr>
indeed, I guess you know if you touched its code :)
<domenkozar>
so all that's missing is something that reads the api and alerts
<domenkozar>
but maybe it's easier to add to ofborg
<domenkozar>
822 is not ideal, but it will do
<domenkozar>
:)
<samueldr>
I don't know if it makes sense under ofborg to add alerting and stuff that's not a reaction to a repo event; considering it's (I don't say this authoritatively) a first-responder CI
<domenkozar>
I guess it's up to whoever implements it
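As an aside, a minimal sketch of the kind of watcher being discussed might look like the Python below. It assumes Hydra's JSON API serves the latest build of a job at /job/<project>/<jobset>/<job>/latest when requested with Accept: application/json and that the response carries a timestamp field; the endpoint, field names, staleness threshold and alert hook are all illustrative assumptions, not a description of any existing tool.

    # Hypothetical watcher for a channel-blocking job such as
    # nixos/trunk-combined/tested. Assumes Hydra answers
    # /job/<project>/<jobset>/<job>/latest with JSON when asked for it;
    # field names and the alert hook are illustrative assumptions.
    import json
    import time
    import urllib.request

    HYDRA = "https://hydra.nixos.org"
    JOB = "nixos/trunk-combined/tested"
    MAX_AGE = 3 * 24 * 3600  # alert if the newest build is ~3 days old

    def latest_build():
        req = urllib.request.Request(
            f"{HYDRA}/job/{JOB}/latest",
            headers={"Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def alert(message):
        # Placeholder: mail a list, ping an IRC bot, page someone...
        print("ALERT:", message)

    while True:
        build = latest_build()
        age = time.time() - build.get("timestamp", 0)
        if age > MAX_AGE:
            alert(f"{JOB}: newest build is {age / 86400:.1f} days old")
        time.sleep(3600)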
<samueldr>
I see that at one point, failures for *:tarball would likely be sent to the nix-commits mailing list, since it's listed as the maintainer of the tarball job
<domenkozar>
I'm quite biased here and believe alerting should be commit based
<domenkozar>
maintainers and commit authors differ a lot in nixpkgs
<domenkozar>
sure we batch commits, so everyone in the batch should get it
<domenkozar>
if I got the email for breakage, I'd fix it right away
<domenkozar>
but currently I'd just make peti angry for getting the email.
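To make the commit-based idea concrete, here is a rough, hypothetical sketch: given the nixpkgs revisions of the last good and the first failing evaluation, collect the authors of every commit in that batch and notify each of them. The repository path, the revision placeholders and the notification step are assumptions; only the git invocation is standard.

    # Hypothetical commit-based alerting: everyone whose commit landed in the
    # batch between the last good eval and the failing one gets notified.
    # Assumes a local nixpkgs checkout; the notify step is a placeholder.
    import subprocess

    def batch_authors(repo, good_rev, bad_rev):
        out = subprocess.run(
            ["git", "-C", repo, "log", "--format=%ae", f"{good_rev}..{bad_rev}"],
            check=True, capture_output=True, text=True,
        ).stdout
        return sorted(set(out.split()))

    # LAST_GOOD_REV / FIRST_BAD_REV would come from the two evals' metadata.
    for addr in batch_authors("/path/to/nixpkgs", "LAST_GOOD_REV", "FIRST_BAD_REV"):
        print("would notify:", addr)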
<samueldr>
in addition to that, a firehose for the generic things, like the tested job, might be nice to subscribe to
<domenkozar>
yeah that's for release managers as second defense :)
<sphalerite>
gchristensen: why did ofborg request my review on #58215?
<simpson>
Ah, yeah, from Hydra. Nice. I wonder which size is the right one to aim at. Looks like there was a time in 2017 when we could have fit on a miniature CD, even.
<samueldr>
:( I've restarted the eval of nixos:trunk-combined three times already, twice with exit code 99 and no further error message, and once with "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS"... thinking that memory alloc issues will need to be tackled :/ (in addition to slimming the eval)
<samueldr>
what can I do to help?
* samueldr
re-schedules the eval
<gchristensen>
not sure
<gchristensen>
any commits which are obviously correlated?
<samueldr>
no idea, not sure how to correlate
<gchristensen>
me either
<samueldr>
it seems that we've lately been on the cusp so it might be accretion
<gchristensen>
maybe we can get more RAM
<samueldr>
that would solve one part, but the MAXHINCR/MAX_HEAP_SECTS issue is a bit more fatal; and this is something that precludes "full" aarch64 support for now
<gchristensen>
samueldr: is there any chance my merging of your PR broke this?
<samueldr>
unlikely
<samueldr>
if you mean the hydra PR
<gchristensen>
yea
<samueldr>
this (unless there's some undefined behaviour which I doubt) will only change the behaviour when evaluator_initial_heap_size was set to a lower value than the initial max heap size
<samueldr>
this is intended to keep the memory usage low by dropping the child process when memory is higher...
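For readers unfamiliar with the pattern being described here, a generic sketch of "drop the child process when its memory gets too high" follows; it is emphatically not Hydra's code, just an illustration of the technique, with a placeholder worker command, an arbitrary limit, and Linux-only /proc reading.

    # Generic "restart the worker when it gets too big" pattern, as an
    # illustration only; the command and limit are placeholders.
    import subprocess
    import time

    MAX_RSS_KB = 4 * 1024 * 1024  # arbitrary 4 GiB limit for illustration

    def rss_kb(pid):
        # VmRSS is reported in kB in /proc/<pid>/status (Linux-specific).
        try:
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        return int(line.split()[1])
        except FileNotFoundError:
            pass
        return 0

    while True:
        child = subprocess.Popen(["some-eval-worker"])  # placeholder command
        killed_for_memory = False
        while child.poll() is None:
            if rss_kb(child.pid) > MAX_RSS_KB:
                child.terminate()  # drop the bloated child...
                child.wait()
                killed_for_memory = True
                break  # ...and start a fresh one
            time.sleep(5)
        if not killed_for_memory:
            break  # worker finished (or failed) on its own; stop restarting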
<srhb>
So are we getting an actual oom kill on the server, causing exit 99?
<samueldr>
sometimes
* gchristensen
is available to do remote hands
<srhb>
Welp..
* gchristensen
doesn't even charge remote hands rates
<samueldr>
this "isn't as much of an issue" than the max heap things
<srhb>
Right, but last time it didn't die with max heap sects, I think.
<srhb>
Your gist did..
<samueldr>
... but AFAICT (I might be wrong) what happens is that when the eval job is restarted, the GC heaps are still in a pretty fragmented state
<samueldr>
srhb: exactly, right now there are two issues that are a bit cusps-y
* srhb
nods
<samueldr>
(1) memory usage in evals is hitting the memory limits of the hydra server... but that's not as much of an issue; the SQL server and everything else running there are using a lot of memory
<samueldr>
this is "free" to fix through allocating more memory to the machine (not actually free)
<samueldr>
(2) "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS" might be closer to normal evals than we like
<samueldr>
(and has already stopped aarch64-linux from being made a supported system)
<srhb>
2 I understand. The non-heap-sects crash I don't.
<srhb>
(Unless, as said, it's really an oom kill)
<samueldr>
I asked gchristensen about a couple (2) of those evals and he confirmed they were OOM'd
<gchristensen>
I did? :)
<gchristensen>
I did
<gchristensen>
[24235165.786280] Out of memory: Kill process 10401 (hydra-eval-jobs) score 485 or sacrifice child
<gchristensen>
[24235165.787004] Killed process 10401 (hydra-eval-jobs) total-vm:18451332kB, anon-rss:16338648kB, file-rss:0kB, shmem-rss:0kB
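Tying the two failure modes in this exchange together: a wrapper around the eval could, in principle, tell them apart, since the Boehm GC's "Too many heap sections" abort shows up on the evaluator's stderr while an OOM kill only shows up in the kernel log (as in the lines pasted above). The sketch below is a hedged illustration; the invocation is a placeholder and reading the kernel log may require elevated privileges.

    # Hypothetical classifier for a failed eval run: GC heap-section abort
    # vs. kernel OOM kill. The command line is a placeholder.
    import subprocess

    def classify_eval_failure(cmd):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return "ok"
        if "Too many heap sections" in proc.stderr:
            return "gc-heap-sections"  # MAXHINCR / MAX_HEAP_SECTS limit hit
        kernel = subprocess.run(
            ["journalctl", "-k", "-n", "200"],
            capture_output=True, text=True,
        ).stdout
        if "Out of memory: Kill process" in kernel and "hydra-eval-jobs" in kernel:
            return "oom-killed"
        return f"unknown (exit code {proc.returncode})"

    print(classify_eval_failure(["hydra-eval-jobs", "./release.nix"]))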