gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell." || https://logs.nix.samueldr.com/nixos-borg
jtojnar has joined #nixos-borg
jtojnar has quit [Quit: jtojnar]
jtojnar has joined #nixos-borg
orivej_ has quit [Ping timeout: 246 seconds]
MichaelRaskin has quit [Quit: MichaelRaskin]
orivej has joined #nixos-borg
jtojnar has quit [Ping timeout: 255 seconds]
orivej has quit [Ping timeout: 244 seconds]
jtojnar has joined #nixos-borg
{^_^} has quit [Remote host closed the connection]
gchristensen has quit [Quit: WeeChat 2.2]
grahamc has joined #nixos-borg
grahamc is now known as gchristensen
gchristensen is now known as {^_^}
{^_^} is now known as gchristensen
{^_^} has joined #nixos-borg
<infinisil> Hmm, I have a bunch of these lines in my logs:
<infinisil> Apr 23 16:05:26 protos nixbot[26406]: Exception: ConnectionClosedException Abnormal "Could not connect to any of the provided brokers: [((\"events.nix.gsc.io\",5671),HostCannotConnect \"events.nix.gsc.io\" [Network.Socket.connect: <socket: 3>: does not exist (Connection refused),Network.Socket.connect: <socket: 3>: does not exist (Connection refused)])]"
<gchristensen> what TZ are those logs in?
<infinisil> That was about 5:30 hours ago
<infinisil> > CEST
<gchristensen> yeah
<{^_^}> "The time in CEST is currently 21:39:51 (UTC +2)"
<gchristensen> I have CEST in my task bar :)
<gchristensen> since so many of my NixOS friends are in CEST
<infinisil> Hehe neat
<gchristensen> so around 1600 CEST I rebooted the events.nix.gsc.io host
<infinisil> Oh I see
<gchristensen> (uptime of 5:10hrs)
<infinisil> Hmm so how should I handle that
<infinisil> Exponential backoff? Or just retry every minute or so
<infinisil> How long was it down?
<gchristensen> every minute is fine
<gchristensen> hmm just a few minutes I think
<gchristensen> let's find out
<infinisil> It might be only 15 seconds
<infinisil> Apr 23 16:05:37 protos nixbot[26406]: AMQP connection opened
<infinisil> Apr 23 16:17:44 protos nixbot[26406]: AMQP connection closed
<infinisil> This is also weird
<infinisil> Man, why is this so complicated
<gchristensen> it halted at 14:29:50 and the kernel started again at 14:30:24 (UTC)
<gchristensen> the ofborg workers just die and rely on systemd restarting them, which it does every 30s
<infinisil> > UTC
<{^_^}> "The time in UTC is currently 19:44:12 (UTC 0)"
<infinisil> Hmm
<infinisil> My logs say it opened a connection again at 16:05, but it never opened a channel, which it's supposed to do right after opening a connection
<infinisil> I guess I should add some timeouts
<infinisil> To retry
<infinisil> Or something
<gchristensen> ofborg's model is if amqp gets weird, fail the process
<gchristensen> and let systemd take care of respawn
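The fail-fast model gchristensen describes can be sketched roughly as below. This is a minimal illustration in Python, not ofborg's actual code: `run_worker`, `connect`, and `consume` are hypothetical names, and the real workers may detect AMQP trouble differently.

```python
import sys


def run_worker(connect, consume):
    """Run one worker lifetime and return a process exit code.

    `connect` and `consume` are hypothetical callables standing in for
    "open the AMQP connection" and "process messages until done".
    Any exception is treated as "AMQP got weird": log it and give up,
    leaving the restart to whatever supervises the process.
    """
    try:
        connection = connect()
        consume(connection)
    except Exception as exc:
        print(f"AMQP trouble, exiting so the supervisor can restart us: {exc}",
              file=sys.stderr)
        return 1
    return 0
```

A real entry point would call `sys.exit(run_worker(...))`. On the systemd side, a unit with `Restart=always` and `RestartSec=30` would give the 30s respawn mentioned above; the exact settings of the ofborg units are not shown in this log, so treat those values as an assumption.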
<infinisil> Hmm I implemented exponential backoff with a maximum of 1 minute, along with a 5 second timeout for trying to connect
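What infinisil describes (exponential backoff capped at 1 minute, with a 5 second connect timeout) could look roughly like this. It is a sketch in Python over a plain TCP socket, not nixbot's actual implementation: the 1 second starting delay, the doubling factor, and the names `backoff_delays` and `connect_with_retry` are assumptions, and a real client would open an AMQP connection (over TLS on port 5671) rather than a bare socket.

```python
import socket
import time


def backoff_delays(base=1.0, cap=60.0):
    """Yield base, 2*base, 4*base, ... seconds, never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def connect_with_retry(host, port, connect_timeout=5.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Keep trying to connect until it succeeds.

    Each attempt is abandoned after `connect_timeout` seconds. After a
    failed attempt we wait for the next backoff delay before retrying.
    `connect` and `sleep` are injectable so the loop can be exercised
    without a network or real waiting.
    """
    for delay in backoff_delays():
        try:
            return connect((host, port), timeout=connect_timeout)
        except OSError as exc:  # includes refused connections and timeouts
            print(f"connect to {host}:{port} failed ({exc}); retrying in {delay:g}s")
            sleep(delay)
```

With these defaults the waits between attempts are 1, 2, 4, 8, 16, 32, 60, 60, ... seconds, so an outage of well under a minute like the reboot above is recovered from within a few attempts. A call such as `connect_with_retry("events.nix.gsc.io", 5671)` would block until the broker host accepts connections again.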