ChanServ changed the topic of #nixus to: Nixus is an experimental deployment tool for NixOS systems - https://github.com/Infinisil/nixus - https://logs.nix.samueldr.com/nixus/
<bqv> oh
<bqv> that was my stickler
<bqv> i cant have my custom specialArgs
<bqv> so my entire config is invalid, for now
<infinisil> bqv: Does it need to be specialArgs or would _module.args work too?
<bqv> i'm not entirely sure how the latter works, but presumably?
<infinisil> Because _module.args should be preferred whenever possible
<infinisil> Since it can be assigned in modules themselves
<infinisil> (so it doesn't need to be treated specially)
<bqv> ah
<bqv> hm
<bqv> oh
<bqv> it did a thing
<bqv> deploying my test to a spare machine...
<bqv> well snap
<bqv> this works
<bqv> i mean, i don't see how it could fail at this stage
<infinisil> :o
<bqv> this did stuff and required no changes
<bqv> obviously will require changes to make actual decent use of nixus
<bqv> but it's further than i got last time
<bqv> only thing is, i have to build with --impure
<bqv> because something uses builtins.currentSystem
<bqv> the thing is in nixpkgs, but i'm not sure what part of nixus calls that so i don't know how to replace it
<infinisil> bqv: Got logs for that?
<bqv> infinisil: http://ix.io/2tQw
<infinisil> bqv: Sounds like you need to specify `nixpkgs.localSystem = <the system>` in configuration
<bqv> infinisil: no dice
<infinisil> bqv: Same error?
<bqv> yep
<infinisil> bqv: *exactly* the same?
<infinisil> Oh I think I see the problem
<bqv> not a single character differs
<infinisil> Alright lemme just quickly break compatibility for everybody that uses nixus
<bqv> :D
<infinisil> And change the main call to `import nixus { deploySystem = <the system>; } { ... }`
<bqv> 9| pkgsModule = nixpkgs: { lib, config, ... }: {
<bqv> 10| config.nixpkgs.system = lib.mkDefault builtins.currentSystem;
<bqv> | ^
<bqv> 11| # Not using nixpkgs.pkgs because that would apply the overlays again
<bqv> attribute 'currentSystem' missing
<bqv> ...is that me or you?
<bqv> if i try and set nixpkgs.system or config.nixpkgs.system, i get followup errors
<bqv> set it in nodes.*.configuration, that is
<bqv> i don't know if that's even where it's coming from
<infinisil> Oh yeah that's me
<infinisil> Hold on
<infinisil> Well I'm not testing it with pure eval
<infinisil> :P
<bqv> yeah fair :D
<infinisil> That impurity should be gone now
<infinisil> (now all systems are assumed to be deploySystem or currentSystem unless otherwise specified)
<bqv> infinisil: it works! and it just full on deployed that machine
<infinisil> :D
<bqv> wait
<bqv> lmao
<bqv> "this shouldn't occur" occurred
<bqv> is it because i ssh'd as not root
<bqv> can i make it sudo
<infinisil> logs?
<infinisil> It should sudo on the target host
<bqv> [phi] Triggering system switcher...
<bqv> [phi] Warning: Permanently added '10.0.0.4' (ED25519) to the list of known hosts.
<bqv> [phi] Trying to confirm success...
<bqv> [phi] Warning: Permanently added '10.0.0.4' (ED25519) to the list of known hosts.
<bqv> [phi] This shouldn't occur!
<bqv> [phi] Finished
<infinisil> Hm, not sure where that error is from
<bqv> oh
<bqv> i think the machine just rebooted
<bqv> what the
<bqv> yeah it's not even responding to ssh now, i hope it boots ok, it's headless...
<infinisil> Oh lol
<bqv> hm
<bqv> it booted ok, but it's not the new system
<bqv> how can i debug what just happened?
<infinisil> Nixus reboots automatically if it couldn't rollback successfully after a failure
<bqv> oh
<infinisil> (and before the new system is activated)
<infinisil> bqv: Logs are in /var/lib/system-switcher
<infinisil> (on the target machine)
<bqv> oh
<bqv> oh that's fine, it's my stuff not yours
<bqv> my systems soft-fail activation all the time :p
<bqv> ..hmm
<infinisil> bqv: systemd units?
<bqv> yeah
<infinisil> There's ignoreFailingSystemdUnits if you don't care about fixing the underlying problem :P
<bqv> yes!
<bqv> where's that?
<infinisil> `defaults.ignoreFailingSystemdUnits = true`
<infinisil> Or `nodes.<node>.ignoreFailingSystemdUnits`
<bqv> alright, take 2
<bqv> lol, got a "this shouldn't occur" again
<bqv> benign?
<infinisil> Never seen that myself
<infinisil> Well, not in that context at least lol
<bqv> well, it worked
<bqv> so yeah, benign
<bqv> wtf
<bqv> it just rebooted again
<bqv> oh
<bqv> there was another activationscript that failed
<infinisil> The logging isn't great I know
<infinisil> Should really be streamed to the terminal you deploy from
<bqv> yeah
<bqv> hmm
<infinisil> bqv: What's failing?
<bqv> one of my custom ones
<bqv> made it unfailable now
<infinisil> bqv: Does it always fail or just with nixus?
<bqv> always, probably
<bqv> failed again... waiting for reboot
<infinisil> Reboot should only occur if the switch to the new system failed (and no success confirmation in time), plus the old system fails to activate too (e.g. failing activation script)
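The reboot rule described here can be written out as a small decision function. This is only a sketch of the rule as stated in the chat, not code taken from Nixus; `decide_action` and its two arguments are hypothetical names.

```shell
#!/bin/sh
# Hypothetical sketch of the rule described in the chat, not Nixus code.
# $1: "ok" if the new system activated and success was confirmed in time
# $2: "ok" if re-activating the old system (the rollback) worked
decide_action() {
  new_system=$1
  old_system=$2
  if [ "$new_system" = ok ]; then
    echo keep-new-system
  elif [ "$old_system" = ok ]; then
    echo rolled-back
  else
    # Neither system could be activated: reboot as a last resort.
    echo reboot
  fi
}

decide_action ok ok
decide_action failed ok
decide_action failed failed
```

The three calls print `keep-new-system`, `rolled-back` and `reboot`, one per line.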
<infinisil> bqv: So my guess is that there's still something that makes the new system activation fail
<bqv> yeah
<bqv> infinisil: is there a time limit?
<infinisil> Yea
<infinisil> Might have to increase it for slow machines
<bqv> that'll be it
<bqv> gotcha
<drakonis> ah the new nix flakes cli is much better
<drakonis> far less janky to work with
<bqv> yup
<bqv> infinisil: why is success < switch :p
<bqv> that makes no sense to me
<bqv> it'll rollback before switch times out
<infinisil> bqv: Oh those are not very related
<infinisil> switchTimeout is the timeout for the `$system/bin/switch-to-configuration` command
<infinisil> Which does all the activation script stuff
<infinisil> successTimeout is after that's done, and the target machine is expecting the success confirmation
<infinisil> Which should be pretty fast
<bqv> oh, ok, so i can just up the former
<infinisil> (because it's just an SSH into the machine)
<infinisil> bqv: switchTimeout if your activation scripts are timing out yeah
<infinisil> Maybe the defaults should be higher
<bqv> i've set it to 120
<bqv> ...fail?!
<infinisil> I also had to increase the successTimeout for a harddisk machine, because the shell prompt was so slow to load lol
<bqv> this is weird though, it failed in way less than 20s
<bqv> maybe i'll just bump it to see
<infinisil> The logs aren't telling you the problem?
<bqv> maybe it did this time, i'll find out once it finishes booting
<bqv> last time it was the timeout, but i thought setting switchTimeout to 2 minutes would fix that
<bqv> maybe it is just oldMachineProblems
<bqv> yeah, success not in time
<bqv> but hang on
<bqv> success includes switch
<bqv> because ssh won't be started until near the end of switch
<bqv> (in this case, at least)
<bqv> i will try with 2 mins on both
<bqv> it failed instantly again
<bqv> infinisil: this is a bug, i think
<infinisil> bqv: Logs?
<bqv> the activation was successful (ignoring systemd)
<bqv> oh, actually maybe it didn't fail
<bqv> i saw the "this shouldn't occur" and assumed that meant problems
<bqv> but it's not rebooted this time
<bqv> but yeah, the nixus script on the deployer finished way before the activation on the target
<bqv> and had [phi] This shouldn't occur!
<bqv> i feel like this'll be some expectation that sshd is running at all times, but in this activation it wasn't, for a period?
<infinisil> I don't think it relies on sshd running all the time
<bqv> oh nevermind, it rebooted
<bqv> just way later this time
<bqv> what is going on..
<infinisil> No idea
<infinisil> But I wanna blame bash
<bqv> doodoodoo, let's break some nix rules...
<bqv> $status is unknown
<infinisil> bqv: context?
<infinisil> Oh
<infinisil> Lol
<infinisil> I searched for "this shouldn't occur" in nixus
<infinisil> But didn't pass -i
<bqv> hah
<bqv> is it possible ssh is taking longer than 5?
<bqv> not entirely sure how that's meant to work
<infinisil> Hm that could be it
<infinisil> bqv: I'd try increasing that timeout too then
<infinisil> Wait but since $status is unknown, this means that the command finished
<infinisil> So that can't be it
<bqv> oh, ok
<infinisil> bqv: This is the part that's run there: https://github.com/Infinisil/nixus/blob/master/scripts/switch#L66-L76
<infinisil> The `cat "system-$id/status"` outputs unknown
<infinisil> Wait does it
<infinisil> Yea
<bqv> brl.
<bqv> nope
<bqv> all of the statuses for every id are failure
<infinisil> Huh, what does this line even do: [ ! -p "system-$id/confirm" ]
<bqv> i'm not fluent in bash :p
<bqv> wait what
<bqv> lol
<bqv> i don't have to be fluent to see that's ..weird
<infinisil> Oh I guess it just checks that that file doesn't exist
<infinisil> And exits if it does
<infinisil> I think
<infinisil> But why would I code that
<infinisil> infinisil: Comment your code please
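For reference, `-p` in `[` is true only when the path exists and is a named pipe (FIFO), so `[ ! -p ... ]` succeeds both when the path is missing and when it exists as something other than a FIFO. The sketch below shows just that behaviour. The file names inside the temporary directory are made up for the demonstration, and it says nothing about why the Nixus script performs the check.

```shell
#!/bin/sh
# -p is true only if the path exists AND is a named pipe (FIFO).
dir=$(mktemp -d)
mkfifo "$dir/confirm"      # a named pipe
: > "$dir/regular"         # an ordinary empty file

is_fifo() { if [ -p "$1" ]; then echo yes; else echo no; fi; }

fifo_result=$(is_fifo "$dir/confirm")
regular_result=$(is_fifo "$dir/regular")
missing_result=$(is_fifo "$dir/missing")   # this path was never created
echo "$fifo_result $regular_result $missing_result"
rm -rf "$dir"
```

This prints `yes no no`.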
<bqv> oh
<bqv> does status get written to multiple times?
<bqv> cause if so i guess it could have been unknown at some point during activation
<infinisil> Yeah
<infinisil> It does
<infinisil> It's unknown until the activation finished (then success or failure depending on the result)
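Since the status stays `unknown` until activation has finished, anything reading it has to keep polling until it changes or a deadline passes. A minimal sketch of such a loop follows, assuming a plain status file. `wait_for_status` is a hypothetical helper, not the loop from the Nixus switch script.

```shell
#!/bin/sh
# Hypothetical helper: poll a status file until it leaves "unknown".
# $1: status file, $2: maximum number of one-second polls
wait_for_status() {
  file=$1
  tries=$2
  while [ "$tries" -gt 0 ]; do
    # A missing or momentarily empty file is treated like "unknown".
    status=$(cat "$file" 2>/dev/null || echo unknown)
    status=${status:-unknown}
    if [ "$status" != unknown ]; then
      echo "$status"
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo unknown
  return 1
}

# Demo: a background job flips the status after about a second.
dir=$(mktemp -d)
echo unknown > "$dir/status"
( sleep 1; echo success > "$dir/status" ) &
result=$(wait_for_status "$dir/status" 5)
wait
echo "$result"
rm -rf "$dir"
```

The demo prints `success`. Reading `unknown` once is therefore not an error by itself; it only becomes one if the loop gives up while the file still says `unknown`.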
<bqv> so the issue is, $active is breaking out of the loop, so the issue probably is the timeout
<bqv> nah, instafail again...
<infinisil> Lemme do some small changes
<bqv> shouldn't that be while [ "$active" != 0 ] && [ ! "$status" -eq "unknown" ]; do
<bqv> or whatever that should be
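A note on the operator in that suggestion: `-eq` compares integers, so with a word like `unknown` the test reports an error and returns a non-zero status (this is the behaviour in bash and dash). String comparison uses `=` and `!=`. The sketch below only demonstrates the two operators and makes no claim about what the right loop condition for the Nixus script is.

```shell
#!/bin/sh
status=unknown

# -eq compares integers; with a non-numeric operand `[` reports an error
# and returns a non-zero status, so the else branch is taken.
if [ "$status" -eq "unknown" ] 2>/dev/null; then
  eq_result=true
else
  eq_result=false
fi

# = and != compare strings.
if [ "$status" != "unknown" ]; then
  ne_result=true
else
  ne_result=false
fi

echo "$eq_result $ne_result"
```

This prints `false false`: the `-eq` test fails because of the error, while `!=` is false for the intended reason, namely that the two strings are equal.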
<infinisil> That might be it
<infinisil> Although not entirely
<bqv> and uh, prevstatus/prevactive aren't even used
<infinisil> Heh
<infinisil> There's definitely a logic error somewhere
<infinisil> But I'm not sure how I managed to avoid this all this time
<bqv> :D
<infinisil> And what's special about your config that triggers it
<infinisil> bqv: Is that failure reproducible?
<infinisil> Like, all the time?
<infinisil> Because from looking at the code I feel like it should be racey and not trigger all the time
<bqv> it's a very slow pc
<bqv> but yes, it's failed every time
<bqv> this looks very wrong to me
<bqv> why this -> active=$?
<bqv> isn't that checking the return code
<infinisil> It is
<bqv> which isn't even relevant
<bqv> because timeout and because ssh
<infinisil> Should all propagate
<bqv> oh is that what batchmode is
<infinisil> Even without that
<infinisil> Ohh
<infinisil> timeout doesn't preserve the exit status when it timed out
<infinisil> `man timeout`
<infinisil> --preserve-status
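What the man page describes can be checked locally. The sketch below assumes GNU coreutils `timeout`; other implementations may use different exit codes.

```shell
#!/bin/sh
# GNU coreutils timeout: exit status 124 when the time limit is hit.
plain_status=0
timeout 1 sleep 5 || plain_status=$?

# With --preserve-status, timeout returns the command's own status instead.
# sleep is killed by SIGTERM (signal 15), so that is 128 + 15 = 143.
preserved_status=0
timeout --preserve-status 1 sleep 5 || preserved_status=$?

# When the command finishes in time, its status is passed through either way.
in_time_status=0
timeout 5 false || in_time_status=$?

echo "$plain_status $preserved_status $in_time_status"
```

This prints `124 143 1`. A caller that uses the exit code to carry state therefore sees 124 or 143 on a timeout, never the value the remote command would have returned.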
<bqv> (i just tried `ssh 10.0.0.1 false || echo false`, it output nothing?)
<bqv> also ha
<bqv> i don't think that's it either though
<infinisil> That outputs false for me
<bqv> the timeout was 30, it failed in way less than that
<bqv> oh, it doesn't preserve it on not-timeout either
<bqv> no..
<bqv> [phi] Trying to confirm success...
<bqv> [phi] Tue 11 Aug 03:46:22 BST 2020
<bqv> [phi] Warning: Permanently added '10.0.0.4' (ED25519) to the list of known hosts.
<bqv> [phi] Tue 11 Aug 03:46:24 BST 2020
<bqv> [phi] $status=unknown
<bqv> [phi] This shouldn't occur!
<bqv> [phi] Finished
<bqv> this is with --preserve-status, and with a timeout of 30;
<infinisil> bqv: And that happens in less than 30 seconds?
<bqv> look at the datestamps
<bqv> they're 2 seconds apart
<bqv> the first date is right after trying to confirm success
<bqv> the last date is just before case $status
<bqv> that entire block fails in 2 seconds
<infinisil> Hm
<infinisil> But like, the active script needs to exit with 0 for this to occur
<infinisil> And it needs to output "unknown"
<infinisil> Ugh this is a mess, I should rewrite this in haskell already
<bqv> on multiple machines i have it that "ssh $machine false" is a success
<bqv> the manpage agrees with you, but my experience doesn't
<infinisil> Well mainly the `set -x` I guess
<infinisil> bqv: `ssh localhost false; echo $?`
<infinisil> What about that
<bqv> success
<infinisil> The hell
<bqv> all i can think of is that it's due to my shell
<infinisil> What's special about it?
<bqv> well it's not bash
<infinisil> I guess that would explain the problem
<infinisil> what is it then
<bqv> xonsh
<infinisil> Oh
<infinisil> Yeah I think that could explain it
<bqv> can you make ssh not use the shell?
<infinisil> I wanted to do this anyways eventually
<infinisil> Let's see how easy it is
<infinisil> Actually I'm not sure if that's possible
<bqv> exec
<bqv> that fixes it
<infinisil> exec where
<bqv> ssh localhost exec false
<bqv> that's a failure, for me
<bqv> so... if i stick an exec before the switch script
<infinisil> Hmm
<bqv> seems way more promising at least
<bqv> it's doing stuff
<bqv> yeah, you need to add an exec, cause it's harmless for those who use bash but fixes it if they don't :)
<bqv> i have a successful system switch, finally
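Why `exec` is harmless for bash users can be seen without ssh. The sketch below uses a local `sh -c` as a stand-in for the remote login shell, which is an assumption made for the demonstration. The xonsh behaviour reported in the chat is not reproduced here.

```shell
#!/bin/sh
# Without exec, the shell started by `sh -c` is responsible for passing the
# exit status of `false` on to the caller.
child_status=0
sh -c 'false' || child_status=$?

# With exec, the shell is replaced by `false`, so the caller sees the exit
# status of `false` itself, with no shell left in between.
exec_status=0
sh -c 'exec false' || exec_status=$?

echo "$child_status $exec_status"
```

With a POSIX shell this prints `1 1`, so adding `exec` changes nothing there. It only matters for a login shell that does not forward the status of the command it ran, which is what the chat reports for xonsh.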
<infinisil> Damn
<infinisil> Alternatively I could not use the exit code to transmit state
<infinisil> But at this point, an exec hack on top doesn't matter much
<bqv> seems reasonable to do that, because connection refused is also an "active" state that you slurp up with that
<infinisil> It's not
<infinisil> Well
<infinisil> In your shell it is
<bqv> rd
<bqv> oh
<bqv> i dunno, at any rate exec makes it work so it'd be cool if you could pop that on the pure branch too :p
<infinisil> Yeah why not
<infinisil> pure is merged into master btw
<bqv> oh
<infinisil> :)
<bqv> awesome, will move back
<infinisil> bqv: Pushed the exec change to master
<bqv> <3
<infinisil> bqv: Also just pushed: ability to add lib overlays: https://github.com/Infinisil/nixus/commit/69462cdd201a938a2f3e72840bbd7d3c96a795a1
<bqv> ooh
<bqv> neat
<infinisil> bqv: Thanks for the error reporting and debugging!
<bqv> np!
<bqv> infinisil: oh, please bump this timeout, or at least make it retry? https://github.com/Infinisil/nixus/blob/69462cdd201a938a2f3e72840bbd7d3c96a795a1/modules/deploy.nix#L174
<bqv> (not a major issue, just, i've had to rerun the deploy script a few times because slow to start disks)
<infinisil> bqv: PR? I need to sleep now :)
<bqv> can do
<infinisil> Eh I'll just increment it myself real quick
<infinisil> bqv: How many seconds is enough for your system?
<bqv> i'd been setting 30, but even 15 would probably be constant success
<infinisil> bqv: pushed to master
<bqv> ty!
<infinisil> Won't hurt to have this a bit higher anyways
<infinisil> Np :)
<bqv> next step, start using it for all my hosts, but i'll do that tomorrow
<infinisil> Nice
<bqv> infinisil: I get a few errors from nix-copy-closure, I think. Invalid operations on the daemon
<bqv> Might be fixed by using nix copy instead
<bqv> I might need to make a PR
<infinisil> bqv: The problem might be that the current system's nix-copy-closure is used, which might not be the same Nix version as the target host
<infinisil> But I'm not sure if any other version can be used because nix-copy-closure needs to have access to the deploy host's nix store
<bqv> infinisil: yeah, the alternative is to check the nix version of the deployer/target, thats what I was gonna try
<bqv> nix copy is backwards compatible
<bqv> So as long as it's present it can be used
<bqv> oh, nevermind, nix copy works on old nix too
<bqv> infinisil: http://ix.io/2tUs this is the kind of stuff i'm seeing. using `nix copy` instead doesn't help.
<bqv> just getting random "invalid operation"s
<infinisil> Never seen this before
<infinisil> bqv: What are you currently using to deploy your machines? How does that copy the closure?
<bqv> i'm currently using my nixos-rebuild wrapper, git and/or nix copy
<bqv> nix{-copy-closure, copy} is indeed temperamental, but this seems ...persistent
malook has joined #nixus
eyJhb has joined #nixus
<eyJhb> Why is there so many nix channels?!
<eyJhb> :o
<eyJhb> I am in 8 atm... :| But end goals, if any?
<infinisil> eyJhb: I guess one goal is to be the best deployment tool there can be :P
<eyJhb> As long as it does not play the pokemon theme song
<eyJhb> But I am thinking, do we have any good tools atm. to do huge 1000+ deployments?
<infinisil> Like, one goal is that you can be sure that you never lose access to a remote machine
<infinisil> With automatic rollback and stuff
<infinisil> eyJhb: I haven't thought about huge deployments, but if possible I'd love to make nixus support that too
<infinisil> And actually some ideas I have would play well into that
malook has quit [Quit: malook]
<infinisil> eyJhb: Oh another goal I have in mind is to have many Nixus modules that change config on multiple machines
<infinisil> For things that require changes on more than one machine
<infinisil> The recent ssh.nix module is the first example of this: https://github.com/Infinisil/nixus/blob/master/modules/ssh.nix
<eyJhb> Does none of the others have automatic rollback?
<eyJhb> Yeah, I saw that but I am unsure what it does vs. just normally configuring stuff?
<infinisil> I can't remember any others having rollback, might not have looked close enough though
<infinisil> eyJhb: It essentially allows you to say "I need to be able to `ssh user@host`" and the module sets your authorized key on the target host and known_hosts on the local one
<infinisil> If you just did the authorized key bit (which works in normal NixOS), you still have to configure known_hosts so you don't get an SSH warning when you connect at first
<infinisil> eyJhb: This is just the start of it though. I want to make a backup module (for znapzend) next, which uses the SSH module to make sure that the backups work (they need SSH access to the backup machine)
<infinisil> Or a VPN module that allows you to do `vpn.<id> = { server = "host1"; clients = [ "host2" "host3" ]; }`
<infinisil> And it takes care of configuring everything on all involved hosts
<eyJhb> Seems advanced. but basically modules that work at a higher level
<infinisil> Yea
<infinisil> Also, I want to have inter-host dependencies
<infinisil> So you can e.g. wait for another host to start its DNS server first before doing another thing on a different host
<eyJhb> Sounds nice! But in general, it seems like switching from X to Y to Z, involves some basic changes and the config stays the same pretty much
<eyJhb> Not Nixus related directly
<eyJhb> Also, is it free time or work?
<infinisil> Not sure what you mean
<infinisil> I guess small config changes only change a small part in the end
<eyJhb> For switching between nixus, morph, etc. the base config for each host is the same I assume
<infinisil> eyJhb: Free time (there was part that was sponsored by Niteo though, but that's no more unless they plan to actually use this)
<eyJhb> What are they currently using, if not Nixus?
<infinisil> eyJhb: Ah yeah, mostly. Switching from Nixus -> something else isn't possible with the multi-host modules though
<infinisil> (pretty sure Nixus is the only tool that allows multi-host modules for now)
<infinisil> eyJhb: Nothing yet, just slowly testing the waters
<bqv> I'm somewhat using it
<bqv> I just have a nix bug
<bqv> So I can't deploy remote hosts successfully consistently
<infinisil> bqv: Oh it's a Nix bug?
<bqv> infinisil: I can reproduce it with plain nix-copy-closure
<bqv> So it must be
<bqv> See #nixos
<infinisil> Ah but nix copy works?
<bqv> No that fails too
<bqv> Nothing seems to want to copy, at the moment
<infinisil> Oof
<infinisil> Well at least it's not a Nixus problem then :P
<bqv> Yeah, heh
<bqv> So with that sorted I'll see no reason not to use nixus
<infinisil> Well, other than it being experimental and stuff lol
<infinisil> There's also some annoying things like Ctrl-C not working and logs not being nice
<infinisil> Nothing permanent of course, just bash and distributed system problems
<bqv> pfft, light work
drakonis has quit [Quit: ZNC 1.8.1 - https://znc.in]
drakonis has joined #nixus
<drakonis> 00p00
<drakonis> oops
<drakonis> cleaning my phone