<gchristensen>
I might have gotten it! time to destroy the server and start over.
<gchristensen>
boot #2 was almost perfect... a few more patches, and let's see about boot #3 - happening now.
<gchristensen>
yaaas
<andi->
:+1:
orivej has quit [Ping timeout: 256 seconds]
timokau has quit [Quit: WeeChat 2.2]
timokau has joined #nixos-borg
orivej has joined #nixos-borg
<LnL>
"doesn't work and would be bad to deploy" :'D
<gchristensen>
:D
<gchristensen>
it'd be cool if I could make the nginx enableACME opt enable the dns-01 challenge code instead of simp_le
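(For reference, a sketch of what that could look like. Later NixOS releases moved the ACME module to lego, which supports the DNS-01 challenge via a dnsProvider option; the exact option names vary by release, and the host name and Route 53 provider below are placeholder assumptions, not ofborg's actual config.)

    # Hypothetical NixOS snippet: DNS-01 via the lego-based ACME module
    # instead of the simp_le/HTTP-01 path. Names are illustrative only.
    security.acme.certs."example.com" = {
      dnsProvider = "route53";
      credentialsFile = "/var/lib/secrets/acme-route53.env";
    };
    services.nginx.virtualHosts."example.com" = {
      forceSSL = true;
      useACMEHost = "example.com";
    };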
orivej has quit [Ping timeout: 272 seconds]
orivej has joined #nixos-borg
<gchristensen>
hrm
<gchristensen>
I think it would be best to just bite the bullet and take some down-time for a deploy
<LnL>
for the rabbitmq deployment?
<gchristensen>
otherwise it becomes a bit of a mess of handling partial states and this isn't so critical that 20min of down-time will make many people mad.
<gchristensen>
for upgrading to 18.03
orivej has quit [Ping timeout: 256 seconds]
<timokau[m]>
Yes I think half an hour downtime is perfectly acceptable
<gchristensen>
from there, it'll be easier to add a second and third rabbitmq node to the cluster, and we can then do rolling deploys via buildkite
<LnL>
yeah, that's fine
<gchristensen>
LnL: fwiw if you want to deploy anything, go ahead -- I haven't done anything to make buildkite dangerous, like doing unpushed deploys to nodes
<gchristensen>
when it is time to do this I'll make sure you know when it starts and when it is done
<LnL>
unless there's some trivial way to make the webhook put stuff in a local redis queue and flush it afterwards
<LnL>
but that's probably not worth the effort
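(For reference, the shape of that idea: while RabbitMQ is down the webhook handler pushes payloads onto a local Redis list, and a flush step pops them back out and republishes them once the broker is reachable again. A rough sketch only; the list name and the publish hook are made up for illustration, not ofborg's real code.)

    # Sketch of "buffer webhooks in local redis, flush afterwards".
    import json
    import redis

    r = redis.Redis()
    BUFFER_KEY = "webhook-buffer"  # hypothetical list name

    def buffer_webhook(payload: dict) -> None:
        # Called by the webhook endpoint while the broker is unavailable.
        r.rpush(BUFFER_KEY, json.dumps(payload))

    def flush_buffer(publish) -> None:
        # Called once after the deploy; `publish` stands in for the normal
        # RabbitMQ publishing path.
        while (raw := r.lpop(BUFFER_KEY)) is not None:
            publish(json.loads(raw))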
<gchristensen>
probably not, once we have 3 rmq nodes it'll automatically find an up one
<gchristensen>
it is just this first one which is painful
<LnL>
what about upgrading a cluster, don't they need to e.g. run the same erlang version?
<gchristensen>
"Please note that a full cluster stop is required for feature version upgrades." going from 3.6 -> 3.7 will require that
<gchristensen>
in cases like that we can split off one node on the old version and upgrade the rest of the cluster, drain the old cluster of work, and then upgrade it into the new one
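(The node-shuffling part of that is ordinary rabbitmqctl clustering; a sketch under assumed node names -- rabbit@old1 for the split-off old-version node, rabbit@new1 for the upgraded cluster -- with the actual package upgrades left to the deploy tooling.)

    # 1. Split one node off the old cluster so it can keep draining work:
    rabbitmqctl -n rabbit@old1 stop_app
    rabbitmqctl -n rabbit@old1 reset        # leaves the old cluster
    rabbitmqctl -n rabbit@old1 start_app    # now a standalone old-version broker
    # 2. Upgrade the remaining nodes to the new feature version (full stop of
    #    that cluster, per the note above), then bring them back up.
    # 3. Once the old node has drained, upgrade it and join it to the new cluster:
    rabbitmqctl -n rabbit@old1 stop_app
    rabbitmqctl -n rabbit@old1 reset
    rabbitmqctl -n rabbit@old1 join_cluster rabbit@new1
    rabbitmqctl -n rabbit@old1 start_app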
<gchristensen>
care and effort will be required, but that type of upgrade doesn't happen often
<LnL>
ah right, there's nothing persistent in there
<gchristensen>
yea
<LnL>
hmm no, what about the logs?
<LnL>
forgot how that works
orivej has joined #nixos-borg
<gchristensen>
we could queue jobs in the new cluster with nobody reading from it until the old cluster is totally done, then switch over
<gchristensen>
so logs will be collected and new stuff just won't be started on until the old stuff is done. not ideal, but workable
<gchristensen>
I think there are several options
<gchristensen>
anyway, gotta go
<gchristensen>
back in a few hours
orivej has quit [Ping timeout: 256 seconds]
orivej has joined #nixos-borg
orivej has quit [Ping timeout: 248 seconds]
jtojnar has quit [Read error: Connection reset by peer]
jtojnar has joined #nixos-borg
<gchristensen>
hmmm 4:00
<gchristensen>
16:00*
<gchristensen>
I'm not sure there is an ideal time to take down ofborg for an upgrade
<LnL>
there's a graph that should indicate if stuff slows down during certain hours
<gchristensen>
LnL: is there a way to take this query, avg(ofborg_queue_evaluator_waiting + ofborg_queue_evaluator_in_progress), and make it return the data for 24hrs ago?
<gchristensen>
so I can overlay the last 7 days of data
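(PromQL's offset modifier does that; it attaches to each selector rather than to the whole expression, so the 24-hours-ago version of that query is:)

    avg(
        ofborg_queue_evaluator_waiting offset 24h
      + ofborg_queue_evaluator_in_progress offset 24h
    )

(Overlaying the last 7 days then just means repeating the expression as separate queries with offset 24h, 48h, ... up to 144h.)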
<LnL>
yeah
<LnL>
question
<gchristensen>
answer
<LnL>
how's the current migration different from a major upgrade?
<gchristensen>
I guess it is the same thing :D
<MichaelRaskin>
Everything is the same if you look long enough
<LnL>
so crazy thought, do the migration thing if you don't want downtime
<LnL>
1. new cluster + new webhook, 2. change dns/proxy, 3. wait, 4. whatever?
orivej has joined #nixos-borg
<LnL>
I'm probably missing something :p
<gchristensen>
yeah that is probably the way to go :P
<gchristensen>
sort of "ripping the band-aid" off on seeing if terraform is setting up the rabbitmq settings properly :)
gchristensen has quit [Ping timeout: 268 seconds]
gchristensen has joined #nixos-borg
<gchristensen>
woo
<LnL>
is things a "long" deploy?
<gchristensen>
the pages?
<gchristensen>
what things?
<LnL>
the alert
<gchristensen>
ewr1 seemed to have a long network blip
<LnL>
or were you testing something
<gchristensen>
whatever took out ofborg took out my network connection too (I just lost my irc connection and came back)
<gchristensen>
the page should be clearing now
<gchristensen>
jacob.packet [5:26 PM]
<gchristensen>
yup, looks like we did a bit of fat-fingering on our, and then quickly corrected. sorry for the excitement! please let me know if any issue persists
<gchristensen>
LnL: I'm assuming you're learning the downside to pushover right now
<LnL>
oh, well guess the alerts/metrics work well :D
<gchristensen>
yay :)
<LnL>
if this happens too frequently we can change the timing window
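(That window is the "for" clause on the Prometheus alerting rule: the condition has to hold for that long before the alert fires and pages. A hedged sketch, since the real rule name, expression, and threshold here are guesses:)

    # Hypothetical alerting rule; only the "for" duration is the point.
    groups:
      - name: ofborg
        rules:
          - alert: OfborgQueueStalled
            expr: avg(ofborg_queue_evaluator_waiting) > 100
            for: 10m
            labels:
              severity: page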
<gchristensen>
LnL: are you still receiving FIRING notices?
<LnL>
no
<gchristensen>
why am I :/
<LnL>
did you acknowledge it?
<LnL>
pushover seems to snooze/repeat emergency alerts until you do
<gchristensen>
oooh cool
<LnL>
yeah, noticed that last time
<LnL>
probably isn't smart enough to understand the resolved event tho :/
<gchristensen>
LnL: I think your idea is totally the right one and anything between now and doing it is just nerves :P
<gchristensen>
ah, right, the reason I was thinking about a nice cutover this time is it means making the acme-dns-01 code really nice up front, and I was thinking I'd get the upgrade out of the way first.
<gchristensen>
but that is fine, and probably the right way to go about it.
<gchristensen>
was thinking about a not nice cutover this time*