gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell. || https://logs.nix.samueldr.com/nixos-borg
orivej has quit [Ping timeout: 264 seconds]
orivej_ has joined #nixos-borg
orivej_ has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
orivej has quit [Ping timeout: 256 seconds]
orivej has joined #nixos-borg
orivej has quit [Ping timeout: 264 seconds]
orivej has joined #nixos-borg
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
orivej has quit [Read error: Connection reset by peer]
orivej_ has joined #nixos-borg
orivej_ has quit [Ping timeout: 264 seconds]
orivej has joined #nixos-borg
orivej has quit [Ping timeout: 265 seconds]
orivej has joined #nixos-borg
orivej has quit [Ping timeout: 256 seconds]
orivej_ has joined #nixos-borg
orivej_ has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
orivej has quit [Ping timeout: 256 seconds]
orivej_ has joined #nixos-borg
orivej_ has quit [Ping timeout: 240 seconds]
hmpffff has joined #nixos-borg
hmpffff has quit [Client Quit]
hmpffff has joined #nixos-borg
hmpffff has quit [Quit: nchrrrr…]
hmpffff has joined #nixos-borg
hmpffff has quit [Quit: nchrrrr…]
hmpffff has joined #nixos-borg
hmpffff has quit [Client Quit]
hmpffff has joined #nixos-borg
hmpffff has quit [Quit: nchrrrr…]
hmpffff has joined #nixos-borg
orivej has joined #nixos-borg
orivej has quit [Ping timeout: 240 seconds]
orivej has joined #nixos-borg
hmpffff has quit [Quit: nchrrrr…]
orivej has quit [Ping timeout: 256 seconds]
orivej has joined #nixos-borg
hmpffff has joined #nixos-borg
orivej_ has joined #nixos-borg
orivej has quit [Ping timeout: 272 seconds]
hmpffff_ has joined #nixos-borg
orivej_ has quit [Ping timeout: 246 seconds]
hmpffff has quit [Ping timeout: 260 seconds]
orivej has joined #nixos-borg
hmpffff_ has quit [Quit: nchrrrr…]
orivej has quit [Ping timeout: 246 seconds]
orivej has joined #nixos-borg
hmpffff has joined #nixos-borg
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej_ has joined #nixos-borg
orivej_ has quit [Ping timeout: 256 seconds]
orivej has joined #nixos-borg
orivej has quit [Ping timeout: 246 seconds]
orivej has joined #nixos-borg
hmpffff has quit [Quit: nchrrrr…]
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
orivej has quit [Read error: Connection reset by peer]
orivej_ has joined #nixos-borg
orivej_ has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
hmpffff has joined #nixos-borg
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
<LnL> oh, there was a connection timeout which recovered
<gchristensen> nice!
<cole-h> So, close_wait was fixed? :o
orivej has quit [Ping timeout: 246 seconds]
<LnL> not going to say that, but I'm not sure I've seen that error before
<cole-h> omg
<cole-h> I'm seeing backtraces for hubcaps now, instead of "no info" or whatever
<cole-h> :OOO
orivej has joined #nixos-borg
<LnL> oh?
<cole-h> https://monitoring.nix.ci/explore?orgId=1&left=%5B%22now-2d%22,%22now%22,%22Loki%22,%7B%22expr%22:%22%7Bunit%3D%5C%22ofborg-evaluator.service%5C%22,level%3D%5C%22ERROR%5C%22%7D%22%7D,%7B%22mode%22:%22Logs%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D
<cole-h> Maybe it was carnix's fault :o
<LnL> kind of
<LnL> I ran a full cargo update, which didn't work before because carnix barfed on one of the issues I ran into with the packet exporter
<LnL> oh that was a github timeout, not rabbitmq
evanjs has quit [Ping timeout: 256 seconds]
<cole-h> I kinda wanna get a TV or some huge monitor so I can have every borg service's logs all on one monitor lol
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #nixos-borg
<LnL> we'll see when the aarch builder updates, seems to pretty consistently run into the problem
<cole-h> True. I forgot aarch hasn't been redeployed yet
<LnL> btw should we keep the 6 nodes?
<gchristensen> do feel free to push the button
<LnL> feels kind of a waste to me
<cole-h> LnL: lol, I was just about to ask that
<gchristensen> up to y'all :)
<cole-h> I mean, it's donated, right? So it's not like we're losing money...
<gchristensen> good to be kind, though
<LnL> I mean we know we can upscale now easily
<cole-h> Don't really know how packet works, but would scaling back down free up 3 machines for others?
<gchristensen> it would
<cole-h> If so, I'd say it's a good idea, then
<gchristensen> they're 3, real, physical machines that cost like $1.7/h
<cole-h> I didn't know if they were baremetal or VMs or what
<cole-h> So yeah, might as well scale back down
<gchristensen> that said since we're on the spot market, they can shut us off any time
<gchristensen> and also our bids, while we don't pay, do raise the price in the spot market
<LnL> heh
<cole-h> I wonder if we could use `builtins.getEnv "OFBORG_PACKET_HOSTS"` or something and scale based on that at deploy time, instead of needing to commit every time?
<gchristensen> sure
<cole-h> So if we ever need more hosts for one deploy, just set an env in buildkite
<gchristensen> nice idea
<LnL> that's a good idea
<cole-h> :D
<cole-h> I feel lucid today, boys
<gchristensen> lol
<cole-h> Kinda wish we had toInt as a builtin
<gchristensen> > builtins.fromJSON "1"
<{^_^}> 1
<gchristensen> lol
<cole-h> Wasn't there some issue with that recently?
<cole-h> It was in nixFlakes or nixUnstable or something, and that didn't work (maybe)
<cole-h> Oh, guess not. Works fine for me.
<cole-h> Then, I'll do that :P
<infinisil> cole-h: lib.toInt encapsulates the fromJSON thing
<cole-h> Yeah, I saw.
<infinisil> Ah
<infinisil> I missed the "as a builtin" part
<cole-h> I was only worried because I thought I had taken part in a discussion where builtins.fromJSON wasn't working as expected for toInt
<cole-h> But maybe that was fixed
<cole-h> Another idea I had was adding some sort of rollback env var so we could say `OFBORG_IS_BROKEN=1` and have that actually rollback to the previous deploy or something
<cole-h> Though this would require enabling rollback in our nixops network and I don't know how well that would work
<gchristensen> don't get fancy
<cole-h> :P
<LnL> also seems a bit strange, should probably be a separate thing
<cole-h> Well, I don't know where the buildkite config is, so that's the best I could come up with
<LnL> :)
<cole-h> :P
orivej has quit [Ping timeout: 258 seconds]
orivej has joined #nixos-borg
<cole-h> Nice, this seems to work well. Y'all fine with unspecified value being 3 hosts?
<cole-h> (well-ish)
<LnL> probably not all that important tho, just don't leave large cargo updates laying around without deploying like this guy
<cole-h> Hahaha
<LnL> yeah 3 seems good
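A minimal sketch of the "default to 3 hosts" behaviour agreed on above, seen from the deploy-script side. The idea in the channel was to read the variable inside Nix with `builtins.getEnv`; this shell version only illustrates the same fallback and adds an integer check. The helper name `packet_host_count` is hypothetical and is not taken from the ofborg infrastructure repository.

```shell
# Hypothetical helper: resolve how many Packet hosts a deploy should use.
# Falls back to 3 when OFBORG_PACKET_HOSTS is unset or empty.
packet_host_count() {
  count="${OFBORG_PACKET_HOSTS:-3}"
  case "$count" in
    *[!0-9]*)
      # Reject anything that is not a plain non-negative integer.
      echo "OFBORG_PACKET_HOSTS must be a non-negative integer, got: $count" >&2
      return 1
      ;;
  esac
  echo "$count"
}
```

With this shape, a one-off larger deploy would only need the variable set in the CI environment for that run, and leaving it unset keeps the agreed default.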
<cole-h> omg I almost just had a heart attack. As I was committing, sway froze and I thought my whole system went down (meaning I would have needed to restart my rsync probably from the beginning)
<{^_^}> ofborg/infrastructure#30 (by cole-h, 8 seconds ago, open): variable amount of hosts without committing
<LnL> why clippy, why
evanjs has joined #nixos-borg
<cole-h> :D
orivej has quit [Ping timeout: 246 seconds]
orivej has joined #nixos-borg
hmpffff has quit [Quit: nchrrrr…]
hmpffff has joined #nixos-borg
<LnL> oh when did ofborg become a rust repo again?
<cole-h> When you dropped crates-io.nix or whatever
<cole-h> ~2k lines of Nix lol
<infinisil> Btw you can tell github to ignore certain files for the language thing, here's an example: https://github.com/Infinisil/all-hies/blob/master/.gitattributes
<LnL> oh we can fix it?
<LnL> *.rs linguist-generated=true
<cole-h> lol
<{^_^}> [ofborg] @LnL7 opened pull request #511 → move lib-tests to a build → https://git.io/JfoXO
<LnL> cole-h: should we try to downscale or merge that another time?
<cole-h> I'm good with downscaling now.
<cole-h> I wanna try downscaling to three but using the env var to make sure it works
<cole-h> Anything else we wanna get in?
<LnL> still have to add the infra changes for the metrics so that's not totally ready yet
<LnL> but maybe #488?
<{^_^}> https://github.com/NixOS/ofborg/pull/488 (by LnL7, 1 week ago, open): don't request maintainer reviews if many files changed
<cole-h> Should that be a success status, or can we set a neutral one?
<LnL> eval was still successful, we just don't ping anybody
<cole-h> Oh, got it.
<{^_^}> [ofborg] @cole-h merged pull request #488 → don't request maintainer reviews if many files changed → https://git.io/JfRPZ
<{^_^}> [ofborg] @cole-h pushed commit from @LnL7 to released « don't request maintainer reviews if many files changed »: https://git.io/JfoXx
<cole-h> btw, you decided on what to do with the packet exporter?
<LnL> will probably put it somewhere else
<cole-h> Alright. Anything else that should go in? I'm ready to send it out.
<LnL> go ahead
<LnL> hmm, don't think that did anything
<cole-h> ?
<cole-h> I only see packet-spot-eval-{1,2,3}
<LnL> resource ‘packet-spot-eval-4’ is obsolete
<cole-h> 4 5 6 are gone from the actual deploy
<cole-h> Yeah, the hetzner builder is also there x)
<cole-h> Because it hasn't been deleted from the nixops state file
<cole-h> Well, I only did the dry-run just now
<LnL> that's only configuration I think
<LnL> dry-run creates hosts
<cole-h> I only saw {1,2,3} in the deploy log, so packet should reclaim {4,5,6} afterwards, right?
<LnL> have a feeling it won't
<gchristensen> does the deploy use -k ?
<cole-h> Don't think so
<gchristensen> dry-run won't destroy, but if the actual deploy uses -k it will
<cole-h> nixops deploy --check --allow-recreate
<gchristensen> add a -k? :P
<cole-h> lol
<LnL> could we blacklist hosts?
<gchristensen> what for?
<cole-h> gchristensen: add --kill-obsolete to actual deploy, dry-activate, or both?
<LnL> -k 'not core'
<gchristensen> ah
<LnL> or is that not a thing because it's setup differently?
<gchristensen> you can --exclude but it won't deploy to it
<cole-h> Then maybe `nixops deploy --check --allow-recreate --kill-obsolete --exclude core-0` followed by what we already have, or something?
<gchristensen> sure
<LnL> ah, that should work
<cole-h> Or maybe it should go `--kill-obsolete --exclude core-0` and then `--include core-0` so we don't deploy twice to the evaluators we keep
<cole-h> idk if that really matters though
<LnL> I'm probably being overly paranoid
<cole-h> Nah
<cole-h> Better safe than sorry
<gchristensen> +1
<gchristensen> btw core-0 is our web1
<cole-h> lol
<LnL> yeah, I have experience with "testing backups"
<cole-h> Me too
* cole-h looks at 18% transfer with ~100 hours left
<LnL> I... accidentally the entire cluster :p
<cole-h> looool
<gchristensen> lol
<gchristensen> classic
<gchristensen> one time LnL I was migrating a tooonnnn of data to Ceph
<cole-h> Now do that a few more times until it doesn't hurt :D
<gchristensen> and kernel panicked all 200 servers in the company
<gchristensen> including all the fileservers
<gchristensen> which did not 100% recover
<cole-h> Huh, I guess github changed IPs or something recently
<cole-h> Warning: Permanently added the RSA host key for IP address '192.30.255.112' to the list of known hosts.
<{^_^}> ofborg/infrastructure#31 (by cole-h, 8 seconds ago, open): kill obsolete machines
<cole-h> I did `--kill-obsolete --exclude core-0` and `--include core-0`
<LnL> gchristensen: we bootstrapped our testing system using the tool itself which is nice, until you recreate a deployment with the wrong settings :p
<gchristensen> hehehe
<cole-h> gchristensen: Would appreciate if you could give a quick once-over of those 2 lines I changed ;^)
<gchristensen> 2 lines are sometimes the scariest
<cole-h> Hopefully these ones aren't :P
<gchristensen> maybe add a comment as to the gymnastics
<cole-h> Hmm, maybe core-0 should be deployed first, followed by the evaluators...
<cole-h> (ack on commentary)
<LnL> doesn't this also give us a workaround if the hosts disappear?
<cole-h> How so?
<cole-h> (Not disputing, just not immediately obvious to me)
<LnL> OFBORG_PACKET_HOSTS=0 + regular deploy
<cole-h> Oh, I see
<cole-h> (btw switching order -- I feel like we should deploy core-0 first, and then evaluators; just gut feeling)
<LnL> don't think it really matters
<cole-h> Probably not
<cole-h> But it makes me feel all warm and fuzzy inside :^)
<cole-h> Here we go. Keep your hands and feet inside the vehicle at all times.
<cole-h> Ahhh, -k asks for confirmation
<cole-h> Assuming `--confirm` will fix that
<cole-h> LnL: Wanna add `--confirm` to `--kill-obsolete`? :^)
<cole-h> Else, I can send a PR for that too hehe
<cole-h> (Or we can revert this and have it be a manual thing...)
<LnL> should be ok, given the exclude
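The two-phase deploy discussed above can be sketched roughly as follows, assuming the flags quoted in the channel (`--check`, `--allow-recreate`, `--kill-obsolete`, `--confirm`, `--include`, `--exclude`) are accepted in this position by the nixops version in use. The `run` and `deploy_ofborg` names and the `DRY_RUN` switch are hypothetical additions for illustration; the actual buildkite script is not shown in this log.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

deploy_ofborg() {
  # Phase 1: deploy only core-0, without --kill-obsolete, so the core
  # host is never a candidate for destruction.
  run nixops deploy --check --allow-recreate --include core-0 || return 1

  # Phase 2: deploy everything except core-0 and destroy resources that
  # are no longer in the network definition. --confirm answers the
  # confirmation prompt that --kill-obsolete would otherwise raise.
  run nixops deploy --check --allow-recreate --kill-obsolete --confirm --exclude core-0
}
```

Running `DRY_RUN=1 deploy_ofborg` prints both commands, which gives a cheap way to review the ordering before letting it touch real machines.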
orivej_ has joined #nixos-borg
orivej has quit [Ping timeout: 272 seconds]
<{^_^}> ofborg/infrastructure#32 (by cole-h, 10 seconds ago, open): confirm kill-obsolete
<cole-h> Round 3 :D
<cole-h> Nice
<cole-h> eval-{4,5,6} gone from the exporter, {1,2,3} remain
<cole-h> Thanks guys.
<LnL> hmm, "Assuming it's been destroyed already"
<LnL> and curl 147.75.77.57:9100/metrics
<cole-h> Hmmm, yeah, 6 evaluators on the dash too
<cole-h> :(
<LnL> they should be gone there tho, prometheus stops checking them
<cole-h> I don't like that "should" word
<LnL> nevermind https://monitoring.nix.ci/explore?orgId=1&left=%5B%22now-1h%22,%22now%22,%22prometheus%22,%7B%22expr%22:%22node_boot_time_seconds%7Binstance%3D~%5C%22packet.*%5C%22%7D%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D
orivej_ has quit [Ping timeout: 246 seconds]
<cole-h> Huh
orivej has joined #nixos-borg
<cole-h> Does packet have to "reclaim" those machines themselves then? Or do they stay up until someone needs it from the spot market?
<LnL> they were removed from the configuration so prometheus isn't looking at them anymore
<LnL> but I think they were not destroyed
<cole-h> Seems like it, especially seeing that that `curl` worked
<cole-h> And the fact there are still 6 evaluators/builders lol
orivej has quit [Ping timeout: 246 seconds]
orivej has joined #nixos-borg
hmpffff has quit [Quit: nchrrrr…]