#nixos-borg on 2018-08-12

2018-04-19 20:36 gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell." || https://logs.nix.samueldr.com/nixos-borg

00:14 <andi-> :/

00:15 <gchristensen> rabbitmq 3.7.0 outputs a "your files don't exist, stupid" error

00:15 <andi-> that's motivating!

00:20 <gchristensen> I might have gotten it! time to destroy the server and start over.

00:51 <gchristensen> boot #2 was almost perfect... few more patches, and let's see about boot #3 - happening now.

00:54 <gchristensen> yaaas

00:55 <andi-> :+1:

04:54 orivej has quit [Ping timeout: 256 seconds]

09:35 timokau has quit [Quit: WeeChat 2.2]

09:44 timokau has joined #nixos-borg

10:21 orivej has joined #nixos-borg

12:32 <LnL> "doesn't work and would be bad to deploy" :'D

12:32 <gchristensen> :D

12:33 <gchristensen> it'd be cool if I could make the nginx enableACME opt enable the dns01 challenge code instead of simp_le

12:46 orivej has quit [Ping timeout: 272 seconds]

12:50 orivej has joined #nixos-borg

13:22 <gchristensen> hrm

13:22 <gchristensen> I think it would be best to just bite the bullet and take some down-time for a deploy

13:23 <LnL> for the rabbitmq deployment?

13:23 <gchristensen> otherwise it becomes a bit of a mess of handling partial states and this isn't so critical that 20min of down-time will make many people mad.

13:23 <gchristensen> for upgrading to 18.03

13:23 orivej has quit [Ping timeout: 256 seconds]

13:24 <timokau[m]> Yes I think half an hour downtime is perfectly acceptable

13:24 <gchristensen> from there, it'll be easier to add a second and third rabbitmq node to the cluster, and can then do rolling deploys via buildkite

13:25 <LnL> yeah, that's fine

13:25 <gchristensen> LnL: fwiw if you want to deploy anything, go ahead -- I haven't done anything to make buildkite dangerous, like done unpushed deploys to nodes

13:26 <gchristensen> when it is time to do this I'll make sure you know when and when it is done

13:26 <LnL> unless there's some trivial way to make the webhook put stuff in a local redis queue and flush it afterwards

13:26 <LnL> but that's probably not worth the effort

13:27 <gchristensen> probably not, once we have 3 rmq nodes it'll automatically find an up one

13:27 <gchristensen> it is just this first one which is painful
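
A minimal sketch of the buffering idea LnL raises above: the webhook holds payloads locally while the broker is unreachable and flushes them in order afterwards. An in-memory `deque` stands in for the local redis queue mentioned in the log, so nothing here survives a restart. `BufferingPublisher` and the `publish` callable are hypothetical names, not part of ofborg.

```python
from collections import deque
from typing import Callable, Deque


class BufferingPublisher:
    """Hold webhook payloads locally while the broker is down, then flush in order."""

    def __init__(self, publish: Callable[[bytes], None]) -> None:
        # Hypothetical callable: sends one payload to the broker and
        # raises ConnectionError when the broker is unreachable.
        self._publish = publish
        self._pending: Deque[bytes] = deque()

    def submit(self, payload: bytes) -> None:
        """Queue a payload and immediately try to deliver everything pending."""
        self._pending.append(payload)
        self.flush()

    def flush(self) -> int:
        """Deliver buffered payloads oldest-first; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._publish(self._pending[0])
            except ConnectionError:
                break  # broker still down; keep the payload queued
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def backlog(self) -> int:
        """Number of payloads still waiting for delivery."""
        return len(self._pending)
```

A payload is only removed from the buffer after `publish` returns, so a failure partway through a flush leaves the remaining payloads in their original order.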

13:28 <LnL> what about upgrading a cluster, they don't need to eg. run the same erlang version?

13:30 <gchristensen> "Please note that a full cluster stop is required for feature version upgrades." going from 3.6 -> 3.7 will require that

13:30 <gchristensen> https://www.rabbitmq.com/upgrade.html#rabbitmq-version-upgradability

13:31 <gchristensen> in cases like that we can split off one node on the old version and upgrade the rest of the cluster, drain the old cluster of work, and then upgrade it in to the new one

13:31 <gchristensen> care and effort will be required, but that type of upgrade doesn't happen often

13:32 <LnL> ah right, there's nothing persistent in there

13:33 <gchristensen> yea

13:33 <LnL> hmm no, what about the logs?

13:33 <LnL> forgot how that works

13:33 orivej has joined #nixos-borg

13:33 <gchristensen> we could queue jobs in the new cluster with nobody reading from it until the old cluster is totally done, then switch over

13:34 <gchristensen> so logs will be collected and new stuff just won't be started on until the old stuff is done. not ideal, but workable

13:34 <gchristensen> I think there are several options

13:34 <gchristensen> anyway, gotta go

13:35 <gchristensen> back in a few hours

15:51 orivej has quit [Ping timeout: 256 seconds]

17:13 orivej has joined #nixos-borg

17:58 orivej has quit [Ping timeout: 248 seconds]

18:32 jtojnar has quit [Read error: Connection reset by peer]

18:34 jtojnar has joined #nixos-borg

20:06 <gchristensen> hmmm 4:00

20:07 <gchristensen> 16:00*

20:08 <gchristensen> I'm not sure there is an ideal time to take down ofborg for an upgrade

20:14 <LnL> there's a graph that should indicate if stuff slows down during certain hours

20:21 <gchristensen> LnL: is there a way to take this query, avg(ofborg_queue_evaluator_waiting + ofborg_queue_evaluator_in_progress), and make it return the data for 24hrs ago?

20:21 <gchristensen> so I can overlay the last 7 days of data

20:21 <LnL> yeah
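
The log does not record how LnL did this. Assuming the dashboard is backed by Prometheus (the metric names suggest it, but the log does not say so), one way is PromQL's `offset` modifier, which has to be attached to each metric selector separately. The sketch below generates one query per day so they can be overlaid on a single graph; `shifted_query` is a hypothetical helper name.

```python
from typing import List

# Metric names taken from the query quoted in the log.
METRICS = (
    "ofborg_queue_evaluator_waiting",
    "ofborg_queue_evaluator_in_progress",
)


def shifted_query(days_ago: int) -> str:
    """Return the evaluator-load query shifted back by `days_ago` days."""
    if days_ago < 0:
        raise ValueError("days_ago must not be negative")
    # `offset` binds to a single selector, so it is repeated for each metric.
    suffix = f" offset {days_ago}d" if days_ago else ""
    return "avg(" + " + ".join(metric + suffix for metric in METRICS) + ")"


# Current data plus each of the previous seven days, one query per series.
queries: List[str] = [shifted_query(days) for days in range(8)]
```

Each generated string would go into its own query row of the graph panel, giving eight overlaid series.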

20:41 <LnL> question

20:41 <gchristensen> answer

20:41 <LnL> how's the current migration different from a major upgrade?

20:42 <gchristensen> I guess it is the same thing :D

20:42 <MichaelRaskin> Everything is the same if you look long enough

20:44 <LnL> so crazy thought, do the migration thing if you don't want downtime

20:54 <LnL> 1. new cluster + new webhook, 2. change dns/proxy, 3. wait, 4. whatever?

20:55 orivej has joined #nixos-borg

20:55 <LnL> I'm probably missing something :p

20:59 <gchristensen> yeah that is probably the way to go :P

21:00 <gchristensen> sort of "ripping the band-aid" off on seeing if terraform is setting up the rabbitmq settings properly :)

21:19 gchristensen has quit [Ping timeout: 268 seconds]

21:23 gchristensen has joined #nixos-borg

21:25 <gchristensen> woo

21:25 <LnL> is things a "long" deploy?

21:25 <gchristensen> the pages?

21:25 <gchristensen> what things?

21:25 <LnL> the alert

21:26 <gchristensen> ewr1 seemed to have a long network blip

21:26 <LnL> or were you testing something

21:26 <gchristensen> whatever took out ofborg took out my network connection too (I just lost my irc connection and came back)

21:26 <gchristensen> the page should be clearing now

21:27 <gchristensen> jacob.packet [5:26 PM]

21:27 <gchristensen> yup, looks like we did a bit of fat-fingering on our, and then quickly corrected. sorry for the excitement! please let me know if any issue persists

21:28 <gchristensen> LnL: I'm assuming you're learning the downside to pushover right now

21:29 <LnL> oh, well guess the alerts/metrics work well :D

21:29 <gchristensen> yay :)

21:30 <LnL> if this happens too frequently we can change the timing window

21:31 <gchristensen> LnL: are you still receiving FIRING notices?

21:31 <LnL> no

21:31 <gchristensen> why am I :/

21:32 <LnL> did you acknowledge it?

21:33 <LnL> pushover seems to snooze/repeat emergency alerts until you do

21:33 <gchristensen> oooh cool

21:33 <LnL> yeah, noticed that last time

21:34 <LnL> probably isn't smart enough to understand the resolved event tho :/
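
The repeat-until-acknowledged behaviour described above matches Pushover's emergency priority (`priority=2`), which requires a `retry` interval of at least 30 seconds and an `expire` window of at most 10800 seconds. A minimal sketch of building such a request body follows. It only constructs the form fields and sends nothing; `emergency_payload` and its defaults are illustrative assumptions, not the configuration used for ofborg.

```python
from typing import Dict, Union

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"
EMERGENCY = 2  # Pushover priority that repeats until the user acknowledges


def emergency_payload(
    token: str,
    user: str,
    message: str,
    retry: int = 60,
    expire: int = 3600,
) -> Dict[str, Union[str, int]]:
    """Build the form fields for an emergency-priority Pushover message."""
    if retry < 30:
        raise ValueError("retry must be at least 30 seconds")
    if not 0 < expire <= 10800:
        raise ValueError("expire must be between 1 and 10800 seconds")
    return {
        "token": token,        # application API token
        "user": user,          # recipient user or group key
        "message": message,
        "priority": EMERGENCY,
        "retry": retry,        # seconds between repeated notifications
        "expire": expire,      # seconds after which repeats stop
    }
```

The resulting dictionary would be sent as a form-encoded POST to `PUSHOVER_URL`. Shortening `expire` is one way to limit how long an unacknowledged alert keeps repeating after the underlying problem has resolved.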

21:48 <gchristensen> LnL: I think your idea is totally the right one and anything between now and doing it is just nerves :P

22:39 <gchristensen> ah, right, the reason I was thinking about a nice cutover this time is it means making the acme-dns-01 code really nice up front, and I was thinking I'd get the upgrade out of the way first.

22:39 <gchristensen> but that is fine, and probably the right way to go about it.

22:40 <gchristensen> was thinking about a not nice cutover this time*

23:59 timokau has quit [Quit: WeeChat 2.2]