#nixos-borg on 2020-05-04

2018-04-19 20:36 gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell." || https://logs.nix.samueldr.com/nixos-borg

00:03 <cole-h> >> Internal error writing commit status: Error(Codec(Error("expected value",

00:03 <cole-h> GitHub pls

01:28 orivej has quit [Ping timeout: 256 seconds]

01:28 orivej_ has joined #nixos-borg

01:57 <cole-h> gchristensen: It looks like a few things died ~hr ago. eval filter, comment filter, comment poster, and log collector all have backtraces

02:23 <gchristensen> thanks cole-h

02:23 <cole-h> :)

02:43 hmpffff_ has joined #nixos-borg

02:46 hmpffff has quit [Ping timeout: 260 seconds]

03:09 LnL has quit [Ping timeout: 260 seconds]

03:10 LnL has joined #nixos-borg

05:51 <cole-h> gchristensen: comment filter seems dead again, even though it has good logs after a backtrace... Just posted something and nothing showed up in logs

06:03 <LnL> morning

06:04 <cole-h> I guess I never realized you were EU, LnL

06:05 <cole-h> I'm just about to go to sleep :P

06:08 <LnL> yeah

06:09 <cole-h> Well, good night. See you in a handful of hours :)

06:12 cole-h has quit [Quit: Goodbye]

07:38 orivej_ has quit [Ping timeout: 260 seconds]

10:21 hmpffff has joined #nixos-borg

10:24 hmpffff_ has quit [Ping timeout: 240 seconds]

11:56 hmpffff_ has joined #nixos-borg

11:58 hmpffff has quit [Ping timeout: 260 seconds]

14:17 orivej has joined #nixos-borg

15:34 cole-h has joined #nixos-borg

15:37 <cole-h> I'm actually an idiot lol

15:38 <cole-h> "comment filter seems dead again" -> because I posted `@ofborg build` and not `@ofborg build list of attrs`

15:38 <cole-h> Derp

15:39 <LnL> FYI either my connection was _extremely_ unstable yesterday or something fishy is going on

15:41 <cole-h> re: all the stalled darwin alerts?

15:42 <LnL> yeah check the graph

15:42 <cole-h> I don't need to, I'll just go through my notification history hehe

15:42 <LnL> gchristensen: you happen to know the semantics of amqp heartbeats?

15:44 <LnL> one thing I noticed that looks slightly suspicious is that a build which publishes a lot of logs seems to receive inconsistent heartbeats

15:46 <LnL> so I'm wondering if it's possible for the channel to get congested so the heartbeat can't get through in time

15:49 <cole-h> Look at the log message collector logs

15:49 <cole-h> "WARN:amqp::session: Error dispatching packet to channel 1: Full! Blocking until there is space."

15:50 <cole-h> Maybe related?

15:56 <gchristensen> "Any traffic (e.g. protocol operations, published messages, acknowledgements) counts for a valid heartbeat. Clients may choose to send heartbeat frames regardless of whether there was any other traffic on the connection but some only do it when necessary."

15:56 <LnL> ok so it's not a mandatory separate thing
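
A minimal Python sketch of the semantics quoted at 15:56, where any traffic on the connection counts as a heartbeat. The class name, the `kind` labels, and the 60-second timeout are assumptions made for illustration; this is not ofborg's code or the API of any AMQP client library. It also does not model channel congestion, so it cannot confirm or rule out the theory raised at 15:46.

```python
import time


class HeartbeatMonitor:
    """Tracks liveness of one connection: any frame resets the timer."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # heartbeat timeout in seconds (assumed value)
        self.clock = clock           # injectable clock so the sketch can be tested
        self.last_traffic = clock()

    def on_frame(self, kind):
        # 'kind' is a hypothetical label such as "publish", "ack" or "heartbeat".
        # It is deliberately ignored: every kind of traffic counts.
        self.last_traffic = self.clock()

    def is_alive(self):
        # The peer is considered alive while the silence is within the timeout.
        return self.clock() - self.last_traffic <= self.timeout_s


# Usage with a fake clock: a published message at t=50 keeps the connection
# alive at t=100 even though no dedicated heartbeat frame was ever sent.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=60, clock=lambda: now[0])
now[0] = 50.0
monitor.on_frame("publish")
now[0] = 100.0
alive_after_publish = monitor.is_alive()
now[0] = 200.0
alive_after_silence = monitor.is_alive()
```

If frames are queued on the sending side and never reach the peer in time, the peer's timer would still expire under this model, which is the scenario asked about at 15:46.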

15:57 <gchristensen> note the log collector does not do many protocol operations

15:58 <cole-h> OK, so probably unrelated x)

15:58 <LnL> but it sure looked like that staging died while sending logs each time

15:59 <LnL> https://gist.github.com/LnL7/54bcc9b75b4434e38987e85351ff311c

16:01 <LnL> I did also get at least one ConnectionAborted

18:53 <LnL> this might be nice to get a bit more rabbitmq specific data https://gist.github.com/LnL7/7666af126198abad8959f74f4274dadb

19:01 <cole-h> Oh, using the node exporter we already have for systemd stuff, huh? Cool.

19:01 <cole-h> Out of curiosity, do you have an example query that you might use?

19:09 <LnL> rabbitmq_connection_received_bytes

19:09 <LnL> contains a bunch of good stuff

19:10 <LnL> and since our clients / queues are relatively static there shouldn't be a dimensionality problem for any of these

19:14 <LnL> also other interesting stuff like number of ack vs nack builds

19:33 <cole-h> Definitely SGTM.

20:54 <cole-h> Especially if you get around to making a dashboard for that... ;^)

20:59 <LnL> needs a user, etc. to talk to rabbitmq so it's a bit more difficult to setup

21:07 <gchristensen> hrm probably should gc or something on these machines

21:08 <cole-h> Or optimise-store, if that isn't automated already?

21:08 <cole-h> Looks like there's ~free space, but not free inodes

21:08 <gchristensen> it isn't

21:09 <cole-h> I mean obviously gc'ing would be good too

21:09 <LnL> I started a gc on eval-2

21:09 <cole-h> Yeah, I see that thing skyrocket in inodes + disk free
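
The "free space, but not free inodes" situation can be checked by looking at both counters of the same filesystem. The following is a minimal sketch using Python's `os.statvfs` (Unix only); the function name and the default path are assumptions, and on the eval machines one would presumably pass `/nix/store` instead.

```python
import os


def disk_and_inode_usage(path="/"):
    """Return (fraction of blocks used, fraction of inodes used) for the
    filesystem that holds `path`. The inode value is None when the
    filesystem does not report an inode count."""
    st = os.statvfs(path)
    # f_bavail / f_favail are the blocks / inodes available to unprivileged users.
    blocks_used = 1 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1 - st.f_favail / st.f_files if st.f_files else None
    return blocks_used, inodes_used


blocks, inodes = disk_and_inode_usage("/")
print(f"blocks used: {blocks:.0%}")
print("inodes used: n/a" if inodes is None else f"inodes used: {inodes:.0%}")
```

A filesystem can report plenty of free blocks and still fail writes with "no space left on device" once the inode figure reaches 100%. `df -i` shows the same inode numbers from the shell.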

21:09 <LnL> gchristensen: is it intentional I can only access one of the hosts?

21:09 <cole-h> Yeah, so you can only bring down at most 1 machine >:)

21:10 <LnL> lol

21:10 <LnL> I'm sure I can do better than that

21:11 <gchristensen> no :)

21:17 <LnL> hmm, this staging build has been running for 2h

21:17 <LnL> has build-timeout become a trusted option or something?

21:19 <gchristensen> I don't think so

21:28 <LnL> yeah no

21:28 <LnL> maybe nix changed to count per build rather than absolute?

21:32 <LnL> also do you want to take a look at the tracing stuff or is that good to go

21:38 <gchristensen> lgtm :)

21:39 <cole-h> Merging, then

21:39 <{^_^}> [ofborg] @cole-h merged pull request #480 → tracing logging → https://git.io/Jf3Iz

21:39 <{^_^}> [ofborg] @cole-h pushed 11 commits to released: https://git.io/JfGlg

21:40 <LnL> yay

21:40 <cole-h> Next is infra#16, but I've heard that LnL can merge that himself ;^)

21:40 <gchristensen> w00t!

21:40 <LnL> ah right

22:03 <LnL> very nice :D

22:03 <LnL> https://monitoring.nix.ci/explore?orgId=1&left=%5B%22now-15m%22,%22now%22,%22Loki%22,%7B%22expr%22:%22%7Bpr%3D%5C%2286848%5C%22%7D%22%7D,%7B%22mode%22:%22Logs%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D
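
The `left=` parameter of the link above is percent-encoded JSON, so the Loki query it carries can be read without opening Grafana. A minimal sketch using only the Python standard library follows; the function name is an assumption, and treating element 3 of the array as the query is read off the decoded values of this particular link rather than taken from Grafana documentation.

```python
import json
from urllib.parse import parse_qs, urlsplit

# The URL from the message above, split across lines only for readability.
URL = (
    "https://monitoring.nix.ci/explore?orgId=1&left="
    "%5B%22now-15m%22,%22now%22,%22Loki%22,"
    "%7B%22expr%22:%22%7Bpr%3D%5C%2286848%5C%22%7D%22%7D,"
    "%7B%22mode%22:%22Logs%22%7D,"
    "%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D"
)


def explore_state(url):
    """Decode the JSON array stored in the `left` query parameter."""
    params = parse_qs(urlsplit(url).query)  # parse_qs percent-decodes the values
    return json.loads(params["left"][0])


state = explore_state(URL)
print(state[:3])          # ['now-15m', 'now', 'Loki']
print(state[3]["expr"])   # {pr="86848"}
```

So the link shows the last 15 minutes of Loki log lines whose `pr` label is `86848`.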

22:05 <cole-h> Hot

22:05 <cole-h> Next thing I would like to see is if it's possible to set up coloring based on the `level` field

22:06 <cole-h> Right now, warning, info, etc all show up as gray

22:06 <LnL> oh hmm that should be working

22:06 <cole-h> Look at the "description is too long" messages -- level WARN, but grey

22:06 <cole-h> (Like how I used gray in one sentence and then grey in the next? :D)

22:10 <LnL> hmm don't find anything with that

22:10 <LnL> {unit=~"ofborg.*service"} |~ "description too long"

22:11 <LnL> oh you mean the nix-instantiate output?

22:12 MichaelRaskin has quit [Ping timeout: 256 seconds]

22:12 <cole-h> Yeah, those lines

22:12 <cole-h> (sorry I should have written it literally)

22:12 <cole-h> "description is over 140 char"
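
Why the 22:10 query found nothing can be illustrated by emulating its two parts in Python. This is a sketch under stated assumptions: the unit name and the log line are hypothetical, and Python's `re` stands in for the regex engine Loki uses, which behaves the same for patterns this simple. The `=~` label matcher is fully anchored, while the `|~` line filter matches anywhere in the line.

```python
import re


def logql_match(labels, line, unit_regex, line_regex):
    """Emulate `{unit=~"<unit_regex>"} |~ "<line_regex>"` for one log line."""
    # Label matcher: the regex must match the whole label value.
    unit_ok = re.fullmatch(unit_regex, labels.get("unit", "")) is not None
    # Line filter: the regex may match any substring of the line.
    line_ok = re.search(line_regex, line) is not None
    return unit_ok and line_ok


# Hypothetical unit name and log line, for illustration only.
labels = {"unit": "ofborg-evaluator.service"}
line = "warning: description is over 140 char"

paraphrased = logql_match(labels, line, "ofborg.*service", "description too long")
literal = logql_match(labels, line, "ofborg.*service", "description is over")
print(paraphrased, literal)  # False True
```

The paraphrased filter text never occurs in the line, so only a filter built from the literal wording matches.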

22:13 <cole-h> Also gchristensen (or LnL if he's given you access to the other machines): if you could optimise or gc the eval nodes, that would be great :) I'm getting red x's because "no space left on device" lol

22:14 <cole-h> https://gist.github.com/GrahamcOfBorg/59cefc03e1a619ef9bb9446928985a4b from #86488

22:14 <cole-h> nixpkgs#86488

22:14 <{^_^}> https://github.com/NixOS/nixpkgs/pull/86488 (by cole-h, 3 days ago, open): nixos/doas: init

22:15 <cole-h> (Probably spot-eval-1 first -- that has 0 inodes lol)

22:15 <LnL> yeah weird none of the log levels are recognised

22:19 <LnL> https://files.daiderd.com/store/l19svhq9fimam307lxl9z6bxfk8fxkl4-loki.png

22:19 <LnL> ^ they are blue/green/red over here

22:20 <cole-h> Interesting.

22:20 <cole-h> ty for the gc/optimise

22:20 <LnL> only the output of commands doesn't have a log level, but that's not structured so to be expected

22:21 <cole-h> Yeah

22:23 <LnL> -/nix/store/jbw12lhrwv55nymsjpc5jl485ijmv6df-grafana-loki-1.1.0-bin/bin/promtail

22:23 <LnL> +/nix/store/238rz17hj87grmfg6pzjm4alymk11fz5-grafana-loki-1.3.0-bin/bin/promtail

22:23 <LnL> I bet that fixes itself when we update the loki service

22:23 <cole-h> OK, cool

22:24 <LnL> wonder why I have a different version tho

22:24 <LnL> I didn't upgrade yet

22:28 MichaelRaskin has joined #nixos-borg

22:28 <LnL> also re garbage collect, I noticed some indirect result roots getting freed which seems kind of unexpected

22:28 <gchristensen> packet-spot-eval-1...........> 37150 store paths deleted, 222653.01 MiB freed

22:29 <gchristensen> packet-spot-eval-3...........> 43941 store paths deleted, 227193.83 MiB freed

22:29 <gchristensen> packet-spot-eval-1...........> deleting '/nix/store/pn18yzimrbqjw2bsb2wsl7ja5i3ag1xy-loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong.drv'

22:29 <LnL> euhm...

22:30 <infinisil> Hehe

22:30 <LnL> https://github.com/NixOS/nix/pull/3542

22:30 <{^_^}> nix#3542 (by mkenigs, 1 week ago, merged): Set GCROOT to store path to prevent garbage collection

22:30 <infinisil> I think that's my doing with https://github.com/NixOS/nixpkgs/pull/83241

22:30 <{^_^}> nixpkgs#83241 (by Infinisil, 6 weeks ago, merged): lib/strings: Add `sanitizeDerivationName` function

22:31 <LnL> has somebody been bad?

22:31 <infinisil> The loo..oong is by me I mean

22:31 <infinisil> not a problem though, it's just long :)

22:32 <LnL> hmm no that's not the one I was thinking of

22:32 <infinisil> (thought so)

22:33 <LnL> https://github.com/NixOS/nix/pull/3541

22:33 <{^_^}> nix#3541 (by alyssais, 1 week ago, merged): Fix long paths permanently breaking GC

22:34 <cole-h> Well, considering spot-eval-1 is back up to ~90%, I say we're OK for now :)

23:07 <cole-h> omg

23:08 <cole-h> LnL++

23:08 <{^_^}> LnL's karma got increased to 49

23:08 <cole-h> LnL++

23:08 <{^_^}> LnL's karma got increased to 50

23:08 <LnL> hm?

23:08 <cole-h> https://i.imgur.com/ApAZVB9.png Useless backtraces are now grouped with the message that generated it

23:12 <cole-h> Now I don't have to wonder if the error is benign or not -- it's literally right there

23:15 <LnL> nice, didn't even realise that :)

23:16 <LnL> you can also ask loki fo rintext around a log line btw

23:16 <LnL> context*

23:21 <cole-h> Yeah, but it usually doesn't go far enough when there's like 3 backtraces rolled into one

23:21 <cole-h> It only lets me do ~20 lines both directions