gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell." || https://logs.nix.samueldr.com/nixos-borg
ekleog has quit [Quit: back soon]
ekleog has joined #nixos-borg
<ekleog> is ofborg having issues? I've got a PR that hasn't completed eval in 1hr20min, which is longer than I'm used to (though maybe I'm misremembering?)
<MichaelRaskin> It is slower than at its best; I think there may be a problem with a crashed evaluator?
<ekleog> hm'k, thanks :)
<LnL> yeah, the average rate is still fine but it's pretty noticeable for the busy period
<LnL> haven't been paying much attention to the dashboards so thought it was still fine
cole-h has joined #nixos-borg
<cole-h> Once the pending evals are basically clear, I'm gonna scale ofborg down to 1 evaluator and then back up to 3.
sphalerite has quit [Quit: boot, boot, boot, boot, reboot the outdated server]
sphalerite has joined #nixos-borg
<LnL> ah, was also about to do that
<cole-h> :D
<LnL> does it make much of a difference to wait?
<cole-h> If something goes wrong, we'll only have 1 evaluator working on stuff, no?
<LnL> I guess
<cole-h> But maybe it's better to do it now while it's a manageable 27 queued, since there seems to be more activity right now
<LnL> I still have some time to take a look if it blows up now
<cole-h> Alright, let's get this show on the road, then.
<cole-h> cc gchristensen: https://i.imgur.com/1WQWKaK.png Looks like it failed to delete the spot eval machines.
<LnL> ah right the destroy issue
<LnL> well, we'll just end up with 4 instead of 3; not a huge deal, I think
<cole-h> Kinda weird how we're at 0 evaluators right now...
<cole-h> s/Kinda/Really/
<LnL> I see some logging
<LnL> yeah back at 1
<cole-h> Oh nice
<LnL> and the new ones need a full deploy first
<cole-h> Stress ⏬
<cole-h> Yep
<cole-h> Provisioning on packet takes a loooooooooong time lol
<LnL> it's actual hardware provisioning so not surprising it takes a bit longer
<cole-h> Totally fair
<cole-h> I guess it's actually impressive that it only takes a matter of minutes as opposed to hours or days
<cole-h> (At least, when I was testing it out)
<cole-h> OK, we have 2...
<MichaelRaskin> Back to no worse than before the intervention
<cole-h> :D
<cole-h> Well, it's kinda weird considering this is still in the dry-run phase
<LnL> guess that's the advantage to not destroying the original hosts
<cole-h> Or rather, not successfully destroying them :D
<cole-h> OK, actual deploy: now!
<cole-h> 4 builders!
<LnL> there we go :)
<cole-h> And this time we didn't lose any aarch builders :D
<cole-h> (We were at 14 before)
<LnL> this also means that my hacky workarounds fixed the full deploy pipeline
<cole-h> <3 LnL
<{^_^}> LnL's karma got increased to 0b1010100
<cole-h> 3 evaluators!!!!
<cole-h> LnL++
<{^_^}> LnL's karma got increased to 85
<LnL> I only watched :)
<cole-h> But you fixed the pipeline :^)
<cole-h> As far as I'm concerned, you're the superhero here. I only clicked 6 buttons ;)
<cole-h> 4 evaluators :O
<cole-h> hehe
<cole-h> Kinda sad it didn't dump the build queue...
<cole-h> Next problem to resolve is the fact that core-0 and eval-1 don't show up in loki anymore
<cole-h> (May just need a service restarted or something, though)
<LnL> I don't think core was ever included
<cole-h> Was it not?
<cole-h> I could swear I used to be able to select core-0 from the `nodename` log label
<cole-h> And also it seems like all of the ofborg services (except for -builder and -evaluator) disappeared from the `unit` log label list.
<LnL> strange
<cole-h> So maybe prometheus/loki just needs to be restarted on core-0?
<LnL> maybe, I totally forgot how loki works
<LnL> there shouldn't be anything wrong with prometheus, it's the server that pulls metric data and all the targets are up
<cole-h> Technology is wacky 💫
NinjaTrappeur has quit [Quit: WeeChat 2.9]
NinjaTrappeur has joined #nixos-borg
<LnL> to clarify, prometheus and loki are totally separate things
<LnL> so it's just loki
<cole-h> Right
<ekleog> hmm, were some evaluations dropped? https://github.com/NixOS/nixpkgs/pull/99564 still shows as waiting for eval, but the dashboard looks green and quite a few -eval- jobs appear to have completed successfully
<{^_^}> nixpkgs#99564 (by Ekleog, 6 hours ago, open): matrix-synapse module: fix documentation and add release notes
<cole-h> LGTM? The "wait for ofborg" thing is unrelated to us.
<cole-h> (Well, unrelated in that none of us were involved in it/control it)
<ekleog> hmm ok, well, I guess I'll just force-push and force a test rerun
<hexa-> LnL: logcli?
<hexa-> well, loki and promtail have a push-based relationship
<hexa-> so check the promtail instances on the evaluators/builders
<hexa-> and make sure loki is reachable from them