#nixos-borg on 2020-04-22

2018-04-19 20:36 gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell." || https://logs.nix.samueldr.com/nixos-borg

05:45 cole-h has quit [Quit: Goodbye]

06:20 hmpffff has joined #nixos-borg

06:25 orivej has joined #nixos-borg

09:48 hmpffff_ has joined #nixos-borg

09:51 hmpffff has quit [Ping timeout: 272 seconds]

09:55 hmpffff has joined #nixos-borg

09:56 hmpffff_ has quit [Ping timeout: 240 seconds]

10:54 <qyliss> Do these logs load for anybody? https://logs.nix.ci/?key=nixos/nixpkgs.85731&attempt_id=6f5c3608-2c60-41c6-8a7e-dfac49339ceb

11:03 <gchristensen> qyliss: sorry qyliss, looks like the log collector died. I restarted it and things should be collected now

11:03 <qyliss> Does that mean I should build again?

11:04 <gchristensen> unfortunately, yeah, I'm sorry

11:04 <qyliss> np

11:04 <qyliss> <3 gchristensen

11:04 <{^_^}> gchristensen's karma got increased to 274

11:05 <gchristensen> there is an ugly problem where some of the workers can half crash, and the crashed thread doesn't take down the whole process. I bet there is a way to fix that ...

11:11 <LnL> I did some basic testing with lapin and that seems to reconnect/die properly at first glance

11:11 <gchristensen> nice

11:11 <gchristensen> lets RiiL:)

11:12 <LnL> but I bet adding a panic in the right place would also fix it

11:12 <gchristensen> I think we need to add a panic handler

11:12 <gchristensen> https://stackoverflow.com/a/36031130/637129

11:13 <LnL> ah if a thread panics it doesn't bring everything down?

11:14 <gchristensen> unfortunately not
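
The behavior discussed above can be sketched in plain Rust. This is a minimal illustration, not ofborg's code: the function names `worker_panic_is_contained` and `install_exit_on_panic_hook` are hypothetical. It shows that a panic in a spawned thread only surfaces as an `Err` from `join()`, and one possible shape for a panic hook that ends the whole process.

```rust
use std::panic;
use std::process;
use std::thread;

/// Returns true if a spawned thread panicked while the caller kept running.
fn worker_panic_is_contained() -> bool {
    let handle = thread::spawn(|| {
        panic!("simulated worker failure");
    });
    // join() reports the panic as Err; the calling thread is unaffected.
    handle.join().is_err()
}

/// Install a hook that terminates the whole process after any thread panics.
/// (Hypothetical helper name; the approach is a generic std-only sketch.)
fn install_exit_on_panic_hook() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Print the usual panic message first, then exit with a failure code.
        default_hook(info);
        process::exit(1);
    }));
}

fn main() {
    assert!(worker_panic_is_contained());
    println!("main thread survived the worker panic");

    // Calling the function below would make any later panic fatal for the
    // whole process. It is only referenced here so the demo keeps running.
    let _ = install_exit_on_panic_hook;
}
```

With a hook like this installed at startup, a supervisor such as systemd would see the process exit and could restart it, rather than the process lingering in a half-crashed state.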

11:15 <LnL> right

11:15 <LnL> something something supervisors

12:59 <gchristensen> I think I have a hacky fix to propagate the failure, but it is a bit annoying to test :)

12:59 <gchristensen> iptables -t filter -I INPUT -s 147.75.199.209 -j DROP

15:24 <LnL> couldn't the heartbeat close the session/channels somehow?

15:45 hmpffff has quit [Quit: Bye…]

16:09 cole-h has joined #nixos-borg

16:10 <LnL> gchristensen: https://github.com/grahamc/rust-amqp/blob/f9aec2f40aef69a459f26003ce47048f8e2a08d1/src/session.rs#L132

16:12 <LnL> there's a sender for the session and each channel which means that dropping the one in the heartbeat doesn't close the receiver
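
A small sketch of the point about multiple senders, assuming the channels behave like `std::sync::mpsc` (the names `session_tx` and `heartbeat_tx` are made up for illustration and are not taken from rust-amqp): the receiver only reports a disconnect once every clone of the sender is gone.

```rust
use std::sync::mpsc::{channel, TryRecvError};

/// Returns what the receiver observes after dropping one sender clone,
/// and then after dropping the last remaining sender.
fn receiver_states() -> (TryRecvError, TryRecvError) {
    let (session_tx, rx) = channel::<()>();
    // Hypothetical stand-in for the sender held by the heartbeat thread.
    let heartbeat_tx = session_tx.clone();

    // Dropping only the heartbeat's sender leaves the channel open.
    drop(heartbeat_tx);
    let after_heartbeat_drop = rx.try_recv().unwrap_err();

    // Only when the last sender is dropped does the receiver disconnect.
    drop(session_tx);
    let after_all_dropped = rx.try_recv().unwrap_err();

    (after_heartbeat_drop, after_all_dropped)
}

fn main() {
    let (first, second) = receiver_states();
    assert_eq!(first, TryRecvError::Empty);
    assert_eq!(second, TryRecvError::Disconnected);
    println!("first: {:?}, second: {:?}", first, second);
}
```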

16:13 <gchristensen> yeah

16:13 <gchristensen> so I experimented with a patch to do this ... let me push it

16:14 <gchristensen> look at the die-on-heartbeat branch

16:16 <LnL> yeah, not very pretty but I think that would do the trick

16:23 <cole-h> "let sender = send_sender" heh

16:24 <LnL> might be better to reverse the condition and check for TryRecvError::Disconnected
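
One way the suggested check could look, as a sketch only: the die-on-heartbeat branch is not shown in this log, so the loop below is a generic `std::sync::mpsc` pattern with a hypothetical function name, `run_until_disconnected`. The worker keeps polling while the channel is merely empty and returns cleanly once it sees `TryRecvError::Disconnected`, instead of panicking.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

/// Poll the channel until every sender is gone; return how many messages
/// were handled before the disconnect was observed.
fn run_until_disconnected(rx: Receiver<u32>) -> u32 {
    let mut handled = 0;
    loop {
        match rx.try_recv() {
            Ok(_) => handled += 1,
            // Nothing queued yet: keep polling after a short pause.
            Err(TryRecvError::Empty) => thread::sleep(Duration::from_millis(5)),
            // All senders dropped: shut down with a return, not a panic.
            Err(TryRecvError::Disconnected) => return handled,
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    let worker = thread::spawn(move || run_until_disconnected(rx));

    tx.send(1).unwrap();
    tx.send(2).unwrap();
    // Dropping the only sender is what lets the worker observe Disconnected.
    drop(tx);

    assert_eq!(worker.join().unwrap(), 2);
}
```

Messages already queued are still delivered before the disconnect is reported, so a final message sent just before the sender is dropped is not lost in this pattern.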

16:24 <gchristensen> that would work for me :)

16:25 <LnL> currently the heartbeat panics so the tombstone message might not get sent

16:25 <gchristensen> I think I fixed the panic by deleting an unwrap and replacing it with a return

16:27 <LnL> ah, missed that, was mostly looking at the last commit

16:31 <cole-h> Do we (I) have any way to know when the log collector dies? (re: qyl*ss's question earlier) Seems like it's been happening a decent amount recently (could just be my new-ness having not noticed before though)

16:32 <gchristensen> you can see it manifest as a panic in the logs, on the core-0 machine

16:32 <gchristensen> this branch I showed LnL is maybe going to fix it

16:32 <gchristensen> LnL: think I should merge and update ofborg to try it?

16:33 <LnL> yeah sounds good

16:33 <LnL> assuming the logcollector thing is the same as what I've seen

16:33 <gchristensen> yeah I think it is

16:34 <gchristensen> cole-h: want to update the dependencies on ofborg, do the carnix thing, and send a PR?

16:34 <cole-h> Oh, is this the SendError(..) thing we're dealing with here?

16:34 <gchristensen> yeah

16:34 <cole-h> And sure

16:34 <gchristensen> cool

16:37 <cole-h> At the end of it all, at the very least I'll know how to bump dependencies in ofborg! :D

16:37 <gchristensen> :)

16:41 <{^_^}> [ofborg] @cole-h opened pull request #464 → Bump amqp → https://git.io/JfkAT

16:42 <cole-h> Uh, nice, 6 (local) test failures

16:43 <cole-h> Oh, maybe because of that matching line stuff again

16:47 <cole-h> Oh, it was because I have `experimental-features` in my nix.conf and I was running it with `nix`

16:47 <cole-h> Heh

16:47 <cole-h> btw, checkPhase timed out on fetching again...

16:55 <cole-h> I broke travis by force-pushing, thinking it would make it rerun like it does with ofborg... RIP.

16:55 <LnL> hm, doesn't that work?

16:55 <cole-h> If it does, it isn't right now :(

17:00 <LnL> I tried out github actions for nix-darwin a while back, pretty easy to switch but can't really say if it's more stable

17:50 <LnL> gchristensen: have you had a chance to take a look at the hydra export yet?

17:51 <gchristensen> yeah I got pretty far and then got stumped on some annoyances w.r.t. mounting devices

17:52 <gchristensen> can continue tonight :)

17:53 <LnL> ah, cool

17:54 <gchristensen> I had a prearranged date to do this https://boinc.bakerlab.org/rosetta/hosts_user.php?sort=rpc_time&rev=0&show_all=0&userid=2145901

17:55 <cole-h> btw gchristensen if you could manually trigger CI on ofborg#464 , that would be swell ^^

17:55 <{^_^}> https://github.com/NixOS/ofborg/pull/464 (by cole-h, 1 hour ago, open): Bump amqp

17:56 <gchristensen> huh

17:56 <gchristensen> seems travis is just bad

17:56 <gchristensen> apparently travis is trying hard to push everybody off of it

17:56 <cole-h> I tried force-pushing to restart the checkPhase (because HTTP timed out... lol) and it just died :D

17:57 <LnL> oh the build status is just gone

17:58 <gchristensen> lol CI passed but it just never wrote statuses

18:11 <gchristensen> lol travis is so broken

18:14 <cole-h> :D

18:19 * gchristensen puts on a tinfoil hat

18:19 <gchristensen> you know,

18:19 <gchristensen> breaking all the other CI systems is a great way to get people on to GitHub Actions

18:20 <cole-h> That actually doesn't sound too farfetched...

18:27 <srk> https://jamescooke.info/travis-hitting-githubs-api-limits-for-open-source-projects.html

18:27 <srk> interesting

18:30 <MichaelRaskin> Hmmm.

18:30 <MichaelRaskin> Should we add a GitHub action that runs nixpkgs-review on every commit?

18:31 <cole-h> Assuming you mean against nixpkgs PRs/commits

18:31 <MichaelRaskin> Yes

18:31 <LnL> heh

18:32 <MichaelRaskin> That's like the last thing ofborg doesn't dare do for Nixpkgs QA yet!

18:32 <gchristensen> heh

18:37 <LnL> world-class CI/CD

18:37 <LnL> sounds great, here's 80k builds

18:37 <cole-h> Heh

18:38 <MichaelRaskin> Make the world-class CI «make world»-class!

18:44 <MichaelRaskin> Actually… hmm… how much has to be done by people with GH org admin access?

18:45 <gchristensen> it's no good

18:45 <gchristensen> we only get like 2k minutes of time

18:46 <gchristensen> unless we can do something meaningful in 1min20s per PR

18:47 <cole-h> Most of that would probably be taken up by getting the nix binaries available using the cachix GH action, or whatever

18:48 <MichaelRaskin> Ah

20:24 <{^_^}> [ofborg] @grahamc pushed 2 commits to released: https://git.io/JfIJK

20:24 <{^_^}> [ofborg] @grahamc merged pull request #464 → Bump amqp → https://git.io/JfkAT

20:26 <gchristensen> pushing out

20:31 <gchristensen> ehh this might not be good

20:34 <LnL> something not happy?

20:36 <gchristensen> yeah, a lot of panics and reconnects going on https://monitoring.nix.ci/explore?orgId=1&left=%5B%22now-30m%22,%22now%22,%22Loki%22,%7B%22expr%22:%22%7Bjob%3D%5C%22systemd-journal%5C%22%7D%22%7D,%7B%22mode%22:%22Logs%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D

20:37 <LnL> ah there's the thing, couldn't find it

20:38 <LnL> hmm isn't that the php part?

20:38 <gchristensen> oh there are some of those, ignore those

20:38 <gchristensen> hmm maybe it has settled down

20:40 <LnL> looks like a few aarch builders disappeared for some reason

20:52 <LnL> "Error consuming IoError(UnexpectedEof)"

20:52 <LnL> looks like stopping just doesn't handle stuff nicely so it freaks out a bit

20:53 <gchristensen> aye

20:53 <LnL> searching with {unit=~"ofborg-.*service"} really helps :)

20:53 <gchristensen> :D

21:18 <cole-h> Just got back. Did the change not work as expected?

21:45 {`-`} has joined #nixos-borg