#nixos-borg on 2020-10-04

2018-04-19 20:36 gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell." || https://logs.nix.samueldr.com/nixos-borg

04:59 orivej has quit [Ping timeout: 260 seconds]

05:18 cole-h has joined #nixos-borg

08:26 cole-h has quit [Ping timeout: 258 seconds]

10:36 orivej has joined #nixos-borg

12:40 orivej has quit [Ping timeout: 246 seconds]

13:31 orivej has joined #nixos-borg

13:53 orivej has quit [Ping timeout: 256 seconds]

15:38 orivej has joined #nixos-borg

15:51 ekleog has quit [Quit: back soon]

15:53 ekleog has joined #nixos-borg

16:16 <ekleog> is ofborg having issues? I've got a PR that hasn't completed eval in 1hr20min, more than I'm used to (though I may be misremembering?)

16:20 <MichaelRaskin> It is slower than at the best of times, I think there may be a problem with a crashed evaluator?

16:26 <ekleog> hm'k, thanks :)

16:27 <MichaelRaskin> https://monitoring.nix.ci/d/000000002/ofborg?orgId=1&refresh=10s

16:39 <LnL> yeah, the average rate is still fine but it's pretty noticeable for the busy period

16:40 <LnL> haven't been paying much attention to the dashboards so thought it was still fine

17:10 cole-h has joined #nixos-borg

17:16 <cole-h> Once the pending evals are basically clear, I'm gonna scale ofborg down to 1 evaluator and then back up to 3.

17:18 sphalerite has quit [Quit: boot, boot, boot, boot, reboot the outdated server]

17:22 sphalerite has joined #nixos-borg

17:29 <LnL> ah, was also about to do that

17:30 <cole-h> :D

17:31 <LnL> does it make much of a difference to wait?

17:34 <cole-h> If something goes wrong, we'll only have 1 evaluator working on stuff, no?

17:35 <LnL> I guess

17:40 <cole-h> But maybe it's better to do it now while it's a manageable 27 queued, since there seems to be more activity right now

17:46 <LnL> I still have some time to take a look if it blows up now

18:03 <cole-h> Alright, let's get this show on the road, then.

18:09 <cole-h> cc gchristensen: https://i.imgur.com/1WQWKaK.png Looks like it failed to delete the spot eval machines.

18:12 <LnL> ah right the destroy issue

18:12 <LnL> well, we'll just end up with 4 instead of 3, not a huge deal I think

18:15 <cole-h> Kinda weird how we're at 0 evaluators right now...

18:16 <cole-h> s/Kinda/Really/

18:17 <LnL> I see some logging

18:18 <LnL> yeah back at 1

18:18 <cole-h> Oh nice

18:18 <LnL> and the new ones need a full deploy first

18:19 <cole-h> Stress ⏬

18:19 <cole-h> Yep

18:19 <cole-h> Provisioning on packet takes a loooooooooong time lol

18:20 <LnL> it's actual hardware provisioning so not surprising it takes a bit longer

18:20 <cole-h> Totally fair

18:20 <cole-h> I guess it's actually impressive it only takes a matter of minutes as opposed to hours or days

18:20 <cole-h> (At least, when I was testing it out)

18:22 <cole-h> OK, we have 2...

18:23 <MichaelRaskin> Back to no worse than before the intervention

18:23 <cole-h> :D

18:23 <cole-h> Well, it's kinda weird considering this is still in the dry-run phase

18:23 <LnL> guess that's the advantage to not destroying the original hosts

18:24 <cole-h> Or rather, not successfully destroying them :D

18:25 <cole-h> OK, actual deploy: now!

18:28 <cole-h> 4 builders!

18:28 <LnL> there we go :)

18:29 <cole-h> And this time we didn't lose any aarch builders :D

18:29 <cole-h> (We were at 14 before)

18:30 <LnL> this also means that my hacky workarounds fixed the full deploy pipeline

18:30 <cole-h> <3 LnL

18:30 <{^_^}> LnL's karma got increased to 0b1010100

18:30 <cole-h> 3 evaluators!!!!

18:30 <cole-h> LnL++

18:30 <{^_^}> LnL's karma got increased to 85

18:31 <LnL> I only watched :)

18:31 <cole-h> But you fixed the pipeline :^)

18:31 <cole-h> As far as I'm concerned, you're the superhero here. I only clicked 6 buttons ;)

18:35 <cole-h> 4 evaluators :O

18:35 <cole-h> hehe

18:36 <cole-h> Kinda sad it didn't dump the build queue...

18:46 <cole-h> Next problem to resolve is the fact that core-0 and eval-1 don't show up in loki anymore

18:47 <cole-h> (May just need a service restarted or something, though)

18:49 <LnL> I don't think core was ever included

18:49 <cole-h> Was it not?

18:50 <cole-h> I could swear I used to be able to select core-0 from the `nodename` log label

18:56 <cole-h> And also it seems like all of the ofborg services (except for -builder and -evaluator) disappeared from the `unit` log label list.

19:11 <LnL> strange

19:21 <cole-h> So maybe prometheus/loki just needs to be restarted on core-0?

20:10 <LnL> maybe, I totally forgot how loki works

20:11 <LnL> there shouldn't be anything wrong with prometheus, it's the server that pulls metric data and all the targets are up

20:17 <cole-h> Technology is whacky 💫

20:34 NinjaTrappeur has quit [Quit: WeeChat 2.9]

20:46 NinjaTrappeur has joined #nixos-borg

21:04 <LnL> to clarify, prometheus and loki are totally separate things

21:04 <LnL> so it's just loki

21:05 <cole-h> Right

21:38 <ekleog> hmm were some evaluations dropped? https://github.com/NixOS/nixpkgs/pull/99564 still shows as waiting for eval but the dashboard looks green and quite a few -eval- jobs appear to have completed successfully

21:38 <{^_^}> nixpkgs#99564 (by Ekleog, 6 hours ago, open): matrix-synapse module: fix documentation and add release notes

21:44 <cole-h> LGTM? The "wait for ofborg" thing is unrelated to us.

21:45 <cole-h> (Well, unrelated in that none of us were involved in it/control it)

22:12 <ekleog> hmm ok, well, I guess I'll just force-push and force a test rerun

22:47 <hexa-> LnL: logcli?

22:47 <hexa-> well loki and promtail have a push-based relationship

22:47 <hexa-> so check the promtail instances on the evaluators/builders

22:48 <hexa-> and make sure loki is reachable from them
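
The reachability check hexa- suggests could be sketched roughly as below, to be run from an evaluator or builder. This is only a minimal sketch under assumptions that are not from the log: that Loki is served over plain HTTP on its default port 3100, and that the hostname passed in (here `core-0`, a guess based on the discussion above) is where Loki actually runs. It uses Loki's `/ready` HTTP endpoint; the helper names `ready_url` and `loki_reachable` are hypothetical.

```python
import http.client
import urllib.request


def ready_url(host: str, port: int = 3100) -> str:
    """Build the URL of Loki's readiness endpoint.

    Port 3100 is Loki's default HTTP port; the actual deployment
    may use a different port or sit behind a proxy (assumption).
    """
    return f"http://{host}:{port}/ready"


def loki_reachable(host: str, port: int = 3100, timeout: float = 3.0) -> bool:
    """Return True if Loki answers its /ready endpoint with HTTP 200.

    Any connection failure, timeout, or non-200 response counts as
    "not reachable", which is what matters for promtail being able
    to push logs from this machine.
    """
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and refused connections are all OSError.
        return False


if __name__ == "__main__":
    # "core-0" is an assumed Loki host, taken from the discussion above.
    host = "core-0"
    state = "reachable" if loki_reachable(host) else "NOT reachable"
    print(f"loki on {host}: {state}")
```

If this reports that Loki is not reachable from a machine whose logs went missing, the problem is likely on the network or Loki side; if it is reachable, the promtail service on that machine is the next thing to look at.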