#nixos-borg on 2020-05-01

2018-04-19 20:36 gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell." || https://logs.nix.samueldr.com/nixos-borg

00:47 cole-h has quit [Quit: Goodbye]

00:49 cole-h has joined #nixos-borg

05:58 cole-h has quit [Quit: Goodbye]

07:05 orivej has quit [Ping timeout: 256 seconds]

12:31 <{^_^}> [ofborg] @LnL7 opened pull request #479 → remove static lifetimes from easylapin → https://git.io/JfOdx

13:24 orivej has joined #nixos-borg

14:53 cole-h has joined #nixos-borg

18:25 <{^_^}> [ofborg] @LnL7 opened pull request #480 → tracing logging → https://git.io/Jf3Iz

18:33 <gchristensen> LnL: when you mentioned the thing about deploying yesterday, I stopped deploying

18:33 <gchristensen> I'm happy to merge these PRs, but I'd like to deploy them around the same time -- let me know what I should do :)

18:35 <LnL> yeah, the debugging stuff I mentioned doesn't have to block that

18:36 cust0dian has joined #nixos-borg

18:36 <cole-h> I like how the switch to tracing appears to be mostly changing the imports from log -> tracing

18:36 <LnL> but I think you'll have to at least reboot the aarch server soon

18:37 <LnL> I don't have enough permissions to look at the stuck builders or restart them

18:38 <LnL> cole-h: yeah, it's really implemented as a superset in a lot of ways

18:39 <cust0dian> borg seems to have hung on my PR: https://github.com/NixOS/nixpkgs/pull/86334 — is this a minor issue and I should just push something else there to trigger another run or do I need to raise an issue?

18:39 <{^_^}> nixpkgs#86334 (by cust0dian, 2 days ago, open): tmuxinator: 1.1.4 -> 2.0.0

18:40 <LnL> one thing that's neat that I already noticed is that it also handles logging of libraries (like lapin) by default

18:41 <cole-h> cust0dian: Usually it's fine to just `@ofborg eval` and restart the eval. I think I know the issue but haven't been able to find time to actually look into fixing it.

18:41 <cust0dian> gotcha, thanks!

18:42 <cole-h> Unless you see a big purple label that says `ofborg-internal-error` -- then, come find me.

18:51 <LnL> why is clippy yelling at me? :/

18:52 <cole-h> LnL++ Thanks for adding a picture -- I was gonna ask for one :P

18:52 <{^_^}> LnL's karma got increased to 46

18:55 <cole-h> LnL: I think clippy wants you to split `head_waiter`'s closure to a separate function and `thread::spawn()` that

18:55 <cole-h> idk about the others

19:04 <cole-h> I smile every time I see "lmao I got a job?" show up in the logs

19:05 <gchristensen> lol

19:12 <cole-h> LnL: jk, splitting that closure did nothing

19:17 <LnL> urgh yeah

19:17 <cole-h> LnL: It appears to be related to the `info!` macro. Commenting out lines 134-5 gets rid of that one error

19:18 <LnL> btw I don't get these with latest

19:18 <cole-h> Latest what, clippy?

19:18 <LnL> oh, sounds like the heuristic might be counting the expanded macro then

19:18 <cole-h> Yep, just found that too

19:18 <LnL> yeah

19:19 <LnL> I generally work in a ~nixpkgs-unstable shell

19:25 <cole-h> LnL: What are your clippy, cargo, and rustc versions when this doesn't happen?

19:25 <cole-h> 0.0.212, 1.43.0, and 1.43.0, respectively?

19:26 <LnL> 1.42

19:40 <LnL> also posted some output of the json formatter

19:43 <cole-h> Now the question is how it looks in Loki

19:44 <LnL> yeah I don't know how that integrates

19:44 <cole-h> It might be useful to remove the date "segment", since Loki does that automagically

19:46 <LnL> btw, I really don't have much of an opinion on the library

19:46 <LnL> if either of you think slog or whatever is the better option I can also try that out

19:49 <cole-h> <3

19:50 <LnL> cole-h: there's a without_time so that looks straightforward if we don't want it

19:58 <LnL> https://github.com/grafana/loki/blob/master/docs/clients/promtail/stages/json.md

20:28 <gchristensen> we can easily reboot, btw, several people in -aarch64 can reboot

20:28 <gchristensen> btw please feel free to merge and deploy without me :)

20:30 <LnL> alright, no reason not to then

20:31 <cole-h> I don't really have an opinion one way or the other re: slog vs tracing.

20:32 <cole-h> Tracing looks good for now, so I say stick with it, assuming gchristensen doesn't have another opinion.

20:32 <gchristensen> makeitso.bmp

20:35 <LnL> the span is a bit magical since it threads the context through the entire call stack and not just the local scope

20:35 <cole-h> btw, what's the reason we use openssl 1.0.2u? I gathered it's because of one of our libraries, I think, but not much more than that

20:36 <LnL> hmm another travis failure

20:36 <gchristensen> newer versions dropped something that the old amqp library used

20:36 <LnL> oh!

20:37 <cole-h> Then, maybe we can drop openssl 1.0.2u now that we use lapin? :o

20:37 <LnL> so that might also resolve with switching?

20:37 <gchristensen> yea

20:40 <LnL> ok let's deploy the qos change now then and see if my builder gets stuck again

20:40 <gchristensen> I'm here if you need me :)

20:41 <cole-h> LnL: "We will watch your career with great interest..."

20:42 <cole-h> Wow, getting fancy -- naming your deploy

20:44 <LnL> the default names are kind of useless since it's the infra repo

20:44 <cole-h> Yeah :P

20:45 <LnL> gchristensen: btw one thing I've noticed is that nginx seems to restart every time

20:45 <gchristensen> ...huh

20:45 <LnL> that couldn't be related to the drops we're seeing right?

20:46 <gchristensen> I wouldn't think so

20:46 <LnL> it's a different port so...

20:48 <LnL> oh, mine restarted this time

20:48 <LnL> builder 56462 ofborg 5u IPv4 0x821242dabf05bf6b 0t0 TCP 10.0.2.15:52622->core-0.ewr1.nix.ci:5671 (ESTABLISHED)

20:49 <LnL> https://gist.github.com/LnL7/072fe336484cef7e04963dcc441ef8e0

20:51 <LnL> so that's a good sign, although none of the aarch builders disappeared either

20:52 <cole-h> What's that InvalidChannelState error?

20:53 <LnL> something happened to the rabbitmq channel, the queue getting emptied out is probably related

20:57 <LnL> my guess is either rabbitmq is restarting _somehow_ or one of the services forcibly recreates queues somehow

20:58 <LnL> oh, also see this a few times before the successful restart

20:59 <LnL> Error: IOError(Os { code: 22, kind: InvalidInput, message: "Invalid argument" })

20:59 <cole-h> Any context?

21:02 <LnL> nope, I highly suspect that's https://github.com/NixOS/ofborg/blob/released/ofborg/src/bin/builder.rs#L31

21:17 <LnL> yeah, this is when connecting to a garbage host https://gist.github.com/LnL7/a3fa6ffd1b1f766a2dd41158f9afffab

21:33 <cole-h> Oh. Probably the aarch64 one(s)?

21:33 <LnL> that's what is causing them to stall yes

21:35 <LnL> https://gist.github.com/LnL7/833ce471b4f6bf66272050bce1c8c7e0

21:38 <LnL> even more interesting!

21:39 <LnL> Error: IOError(Custom { kind: Other, error: Ssl(Error { code: ErrorCode(5), cause: Some(Io(Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })) }, X509VerifyResult { code: 0, error: "ok" }) })

21:40 <cole-h> Huh

21:43 <LnL> https://github.com/ofborg/infrastructure/blob/98cfeb744bdb8004098148ed7b2f37f5e461b691/nixops/modules/rabbitmq/default.nix#L33-L35

21:43 <LnL> nixops won't know about that

21:43 <cole-h> Is that why the queue gets dropped???

21:43 <cole-h> Well, that's certainly one way to do it...

21:43 <LnL> if this actually happens then yes

21:44 <LnL> at least until https://github.com/NixOS/ofborg/pull/478 that is

21:44 <{^_^}> #478 (by LnL7, 1 day ago, merged): make messages persistent

21:45 <cole-h> Right -- so it should no longer happen, but it was/might have been because of that

21:45 <LnL> having it just restart otherwise shouldn't be a big deal

21:46 <LnL> assuming heartbeats, etc. get properly handled by the clients, which I'm hoping lapin will mostly do

21:47 <cole-h> Fingers crossed.

21:48 <cole-h> LnL: Thanks for all you do for ofborg (and the rest of the Nix ecosystem) :)

21:49 <cole-h> Some day I'll have my own borg builder to test these kinds of things on, but for now I'm just doing relatively trivial changes.

21:50 <LnL> well it's been a while since I did anything for it :)

21:51 <LnL> also hugging people in pyjamas isn't really socially accepted at the moment so I have more free time

21:51 <cole-h> Hahaha

21:53 <cole-h> # of version bumps I've done: 0; # of version bumps LnL has done: 1

21:53 <cole-h> Doing only slightly more than me ;)

21:53 <LnL> heh

21:56 <gchristensen> LnL: wtf that is terrible!

21:56 <gchristensen> lol!

21:56 <LnL> :D

21:58 <LnL> mind checking if the uptime of rabbitmq is <3h

21:58 <LnL> oh hold on, I think I can do that now

21:59 <cole-h> 2020-05-01 20:47:20.875 [info] <0.9734.9> RabbitMQ is asked to stop...

21:59 <LnL> yeah... pretty confident that's it then :)

22:00 <gchristensen> Active: active (running) since Fri 2020-05-01 20:47:41 UTC; 1h 12min ago

22:00 <cole-h> gchristensen: ICYMI: https://github.com/ofborg/infrastructure/blob/98cfeb744bdb8004098148ed7b2f37f5e461b691/nixops/modules/rabbitmq/default.nix#L33-L35

22:00 <cole-h> lol

22:00 <cole-h> "Gee, I wonder why rabbitmq is dropping its queue"

22:01 <gchristensen> so, those messages should definitely be persistent

22:01 <gchristensen> except not log messages

22:05 <LnL> yeah, regardless of this, not losing stuff when a restart or reboot is needed is much better

22:05 <LnL> still don't get why logs didn't disappear tho

22:07 NinjaTrappeur has quit [*.net *.split]

22:07 qyliss has quit [*.net *.split]

22:07 {^_^} has quit [*.net *.split]

22:08 <gchristensen> yeah

22:10 qyliss has joined #nixos-borg

22:13 <LnL> how does that part actually work?

22:14 <gchristensen> persisting to disk?

22:15 <LnL> oh! it writes files

22:15 <gchristensen> oh are you looking via the log viewer?

22:15 <LnL> no wonder it's persistent

22:19 <LnL> I thought the log viewer talked to amqp directly and history just rotated because it's a bounded queue

22:25 <gchristensen> ah

22:25 <gchristensen> log viewer does talk to amqp directly, too :)

22:25 <gchristensen> it first connects to rabbitmq, then tries to fetch history off disk

22:26 <qyliss> a/go

22:26 <qyliss> aaaa I keep doing that

22:27 <LnL> ok that makes sense then

22:27 <LnL> so log messages were not persistent either, anything published but not persisted yet would disappear

22:28 <LnL> that's just a really small window

22:29 <gchristensen> yeah

22:29 <gchristensen> and they should not be persisted, either

22:30 <gchristensen> log messages can easily be 1k+/s

22:31 <LnL> ah, could you double check that then

22:35 <gchristensen> all in-memory

22:40 <LnL> hmm so the logs are not durable which means the client might be sending delivery_mode 2 but that's being ignored?

22:40 <gchristensen> hmm

22:40 <gchristensen> not sure

22:40 <gchristensen> the queue created by the receivers is not durable

22:40 <gchristensen> so the messages can't be persistent

22:41 <gchristensen> because a disconnect is instant death

22:42 <gchristensen> I've been on hold 30 minutes to buy a pizza

22:42 <LnL> the exchange is durable tho

22:42 <gchristensen> yeah but the exchange doesn't hold anything

22:42 <LnL> whoa

22:43 <gchristensen> the exchange is just a map

22:43 <LnL> so that doesn't really do anything?

22:45 <infinisil> {^_^} is gone :(

22:48 <LnL> since 20:47 by any chance?

22:49 <infinisil> Not sure what timezone you're in, but {^_^} left this channel about 41 minutes ago

22:50 <LnL> that's utc, but sounds like a different thing then

23:16 <gchristensen> there was a netsplit infinisil

23:17 <gchristensen> oh

23:17 <gchristensen> also

23:18 <gchristensen> it needed restarting :)

23:18 {^_^} has joined #nixos-borg

23:52 <infinisil> ahh