00:01
<
bqv >
that was my stickler
00:01
<
bqv >
i cant have my custom specialArgs
00:01
<
bqv >
so my entire config is invalid, for now
00:05
<
infinisil >
bqv: Does it need to be specialArgs or would _module.args work too?
00:05
<
bqv >
i'm not entirely sure how the latter works, but presumably?
00:07
<
infinisil >
Because _module.args should be preferred whenever possible
00:07
<
infinisil >
Since it can be assigned in modules themselves
00:07
<
infinisil >
(so it doesn't need to be treated specially)
00:45
<
bqv >
it did a thing
00:47
<
bqv >
deploying my test to a spare machine...
00:50
<
bqv >
i mean, i don't see how it could fail at this stage
00:54
<
bqv >
this did stuff and required no changes
00:55
<
bqv >
obviously will require changes to make actual decent use of nixus
00:55
<
bqv >
but it's further than i got last time
00:55
<
bqv >
only thing is, i have to build with --impure
00:55
<
bqv >
because something uses builtins.currentSystem
00:55
<
bqv >
the thing is in nixpkgs, but i'm not sure what part of nixus calls that so i don't know how to replace it
00:56
<
infinisil >
bqv: Got logs for that?
00:59
<
infinisil >
bqv: Sounds like you need to specify `nixpkgs.localSystem = <the system>` in configuration
01:00
<
bqv >
infinisil: no dice
01:00
<
infinisil >
bqv: Same error?
01:01
<
infinisil >
bqv: *exactly* the same?
01:01
<
infinisil >
Oh I think I see the problem
01:02
<
bqv >
not a single character differs
01:03
<
infinisil >
Alright lemme just quickly break compatibility for everybody that uses nixus
01:09
<
infinisil >
And change the main call to `import nixus { deploySystem = <the system>; } { ... }`
01:12
<
bqv >
9| pkgsModule = nixpkgs: { lib, config, ... }: {
01:12
<
bqv >
10| config.nixpkgs.system = lib.mkDefault builtins.currentSystem;
01:12
<
bqv >
11| # Not using nixpkgs.pkgs because that would apply the overlays again
01:12
<
bqv >
attribute 'currentSystem' missing
01:12
<
bqv >
...is that me or you?
01:14
<
bqv >
if i try and set nixpkgs.system or config.nixpkgs.system, i get followup errors
01:14
<
bqv >
set it in nodes.*.configuration, that is
01:14
<
bqv >
i don't know if that's even where it's coming from
01:15
<
infinisil >
Oh yeah that's me
01:15
<
infinisil >
Hold on
01:15
<
infinisil >
Well I'm not testing it with pure eval
01:22
<
infinisil >
That impurity should be gone now
01:23
<
infinisil >
(now all systems are assumed to be deploySystem or currentSystem unless otherwise specified)
01:24
<
bqv >
infinisil: it works! and it just full on deployed that machine
01:25
<
bqv >
"this shouldn't occur" occured
01:25
<
bqv >
is it because i ssh'd as not root
01:25
<
bqv >
can i make it sudo
01:25
<
infinisil >
It should sudo on the target host
01:25
<
bqv >
[phi] Triggering system switcher...
01:25
<
bqv >
[phi] Warning: Permanently added '10.0.0.4' (ED25519) to the list of known hosts.
01:25
<
bqv >
[phi] Trying to confirm success...
01:25
<
bqv >
[phi] Warning: Permanently added '10.0.0.4' (ED25519) to the list of known hosts.
01:26
<
bqv >
[phi] This shouldn't occur!
01:26
<
bqv >
[phi] Finished
01:26
<
infinisil >
Hm, not sure where that error is from
01:26
<
bqv >
i think the machine just rebooted
01:27
<
bqv >
yeah it's not even responding to ssh now, i hope it boots ok, it's headless...
01:28
<
bqv >
it booted ok, but it's not the new system
01:28
<
bqv >
how can i debug what just happened?
01:28
<
infinisil >
Nixus reboots automatically if it couldn't rollback successfully after a failure
01:28
<
infinisil >
(and before the new system is activated)
01:29
<
infinisil >
bqv: Logs are in /var/lib/system-switcher
01:29
<
infinisil >
(on the target machine)
01:30
<
bqv >
oh that's fine, it's my stuff not yours
01:30
<
bqv >
my systems soft-fail activation all the time :p
01:31
<
infinisil >
bqv: systemd units?
01:31
<
infinisil >
There's ignoreFailingSystemdUnits if you don't care about fixing the underlying problem :P
01:31
<
bqv >
where's that?
01:31
<
infinisil >
`defaults.ignoreFailingSystemdUnits = true`
01:32
<
infinisil >
Or `nodes.<node>.ignoreFailingSystemdUnits`
01:32
<
bqv >
alright, take 2
01:33
<
bqv >
lol, got a "this shouldn't occur" again
01:33
<
infinisil >
Never seen that myself
01:33
<
infinisil >
Well, not in that context at least lol
01:34
<
bqv >
well, it worked
01:34
<
bqv >
so yeah, benign
01:35
<
bqv >
it just rebooted again
01:35
<
bqv >
there was another activation script that failed
01:35
<
infinisil >
The logging isn't great I know
01:35
<
infinisil >
Should really be streamed to the terminal you deploy from
01:40
<
infinisil >
bqv: What's failing?
01:41
<
bqv >
one of my custom ones
01:41
<
bqv >
made it unfailable now
01:41
<
infinisil >
bqv: Does it always fail or just with nixus?
01:41
<
bqv >
always, probably
01:43
<
bqv >
failed again... waiting for reboot
01:44
<
infinisil >
Reboot should only occur if the switch to the new system failed (and no success confirmation in time), plus the old system fails to activate too (e.g. failing activation script)
01:44
<
infinisil >
bqv: So my guess is that there's still something that makes the new system activation fail
01:46
<
bqv >
infinisil: is there a time limit?
01:47
<
infinisil >
Might have to increase it for slow machines
01:47
<
bqv >
that'll be it
01:49
<
drakonis >
ah the new nix flakes cli is much better
01:50
<
drakonis >
far less janky to work with
01:51
<
bqv >
infinisil: why is success < switch :p
01:51
<
bqv >
that makes no sense to me
01:51
<
bqv >
it'll rollback before switch times out
01:52
<
infinisil >
bqv: Oh those are not very related
01:52
<
infinisil >
switchTimeout is the timeout for the `$system/bin/switch-to-configuration` command
01:52
<
infinisil >
Which does all the activation script stuff
01:52
<
infinisil >
successTimeout is after that's done, and the target machine is expecting the success confirmation
01:53
<
infinisil >
Which should be pretty fast
01:53
<
bqv >
oh, ok, so i can just up the former
01:53
<
infinisil >
(because it's just an SSH into the machine)
01:53
<
infinisil >
bqv: switchTimeout if your activation scripts are timing out yeah
01:54
<
infinisil >
Maybe the defaults should be higher
01:54
<
bqv >
i've set it to 120
01:54
<
infinisil >
I also had to increase the successTimeout for a harddisk machine, because the shell prompt was so slow to load lol
01:55
<
bqv >
this is weird though, it failed in way less than 20s
01:55
<
bqv >
maybe i'll just bump it to see
01:57
<
infinisil >
The logs aren't telling you the problem?
01:57
<
bqv >
maybe it did this time, i'll find out once it finishes booting
01:57
<
bqv >
last time it was the timeout, but i thought setting switchTimeout to 2 minutes would fix that
01:59
<
bqv >
maybe it is just oldMachineProblems
02:00
<
bqv >
yeah, success not in time
02:00
<
bqv >
success includes switch
02:00
<
bqv >
because ssh won't be started until near the end of switch
02:01
<
bqv >
(in this case, at least)
02:01
<
bqv >
i will try with 2 mins on both
02:02
<
bqv >
it failed instantly again
02:02
<
bqv >
infinisil: this is a bug, i think
02:03
<
infinisil >
bqv: Logs?
02:03
<
bqv >
the activation was successful (ignoring systemd)
02:04
<
bqv >
oh, actually maybe it didn't fail
02:04
<
bqv >
i saw the "this shouldn't occur" and assumed that meant problems
02:04
<
bqv >
but it's not rebooted this time
02:05
<
bqv >
but yeah, the nixus script on the deployer finished way before the activation on the target
02:05
<
bqv >
and had [phi] This shouldn't occur!
02:06
<
bqv >
i feel like this'll be some expectation that sshd is running at all times, but in this activation it wasn't, for a period?
02:07
<
infinisil >
I don't think it relies on sshd running all the time
02:07
<
bqv >
oh nevermind, it rebooted
02:07
<
bqv >
just way later this time
02:07
<
bqv >
what is going on..
02:08
<
infinisil >
No idea
02:08
<
infinisil >
But I wanna blame bash
02:13
<
bqv >
doodoodoo, let's break some nix rules...
02:13
<
bqv >
$status is unknown
02:15
<
infinisil >
bqv: context?
02:18
<
infinisil >
I searched for "this shouldn't occur" in nixus
02:19
<
infinisil >
But didn't pass -i
02:20
<
bqv >
is it possible ssh is taking longer than 5?
02:20
<
bqv >
not entirely sure how that's meant to work
02:21
<
infinisil >
Hm that could be it
02:21
<
infinisil >
bqv: I'd try increasing that timeout too then
02:21
<
infinisil >
Wait but since $status is unknown, this means that the command finished
02:22
<
infinisil >
So that can't be it
02:23
<
infinisil >
The `cat "system-$id/status"` outputs unknown
02:23
<
infinisil >
Wait does it
02:24
<
bqv >
all of the statuses for every id are failure
02:24
<
infinisil >
Huh, what does this line even do: [ ! -p "system-$id/confirm" ]
02:24
<
bqv >
i'm not fluent in bash :p
02:25
<
bqv >
i don't have to be fluent to see that's ..weird
02:25
<
infinisil >
Oh I guess it just checks that that file doesn't exist
02:26
<
infinisil >
And exits if it does
02:26
<
infinisil >
I think
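For reference, `[ -p FILE ]` is true only when FILE exists and is a named pipe (FIFO), so `[ ! -p "system-$id/confirm" ]` succeeds both when the path is missing and when it exists as something other than a FIFO. A minimal sketch in a throwaway directory; the `confirm` name mirrors the line quoted above, everything else is made up for illustration and is not Nixus's actual code:

```shell
#!/usr/bin/env bash
# Demonstrates what `[ -p ... ]` tests, using a temporary directory.
tmp=$(mktemp -d)

# is_fifo PATH: prints "yes" if PATH is a named pipe, "no" otherwise.
is_fifo() {
  if [ -p "$1" ]; then echo yes; else echo no; fi
}

missing=$(is_fifo "$tmp/confirm")   # path does not exist yet

touch "$tmp/regular"
regular=$(is_fifo "$tmp/regular")   # regular file, not a FIFO

mkfifo "$tmp/confirm"
fifo=$(is_fifo "$tmp/confirm")      # named pipe

echo "missing=$missing regular=$regular fifo=$fifo"
rm -rf "$tmp"
```

Only the last case reports `yes`, so the negated test is closer to "there is no confirmation pipe here" than to "this file doesn't exist".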
02:26
<
infinisil >
But why would I code that
02:26
<
infinisil >
infinisil: Comment your code please
02:29
<
bqv >
does status get written to multiple times?
02:29
<
bqv >
cause if so i guess it could have been unknown at some point during activation
02:30
<
infinisil >
It does
02:30
<
infinisil >
It's unknown until the activation finished (then success or failure depending on the result)
02:31
<
bqv >
so the issue is, $active is breaking out of the loop, so the issue probably is the timeout
02:32
<
bqv >
nah, instafail again...
02:33
<
infinisil >
Lemme do some small changes
02:34
<
bqv >
shouldn't that be while [ "$active" != 0 ] && [ ! "$status" -eq "unknown" ]; do
02:34
<
bqv >
or whatever that should be
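A note on the condition suggested above: inside `[ ]`, `-eq` compares integers, so a string such as `unknown` needs `=` or `!=`. A hypothetical sketch of a poll-until-decided loop, assuming the status file holds `unknown`, `success` or `failure` as described in this conversation; the function name, file layout and deadline handling are assumptions, not Nixus's real script:

```shell
#!/usr/bin/env bash
# wait_for_status FILE TIMEOUT_SECONDS (hypothetical helper)
# Polls FILE until it holds something other than "unknown" or the deadline
# passes. Prints the last status read; returns 0 only for "success".
wait_for_status() {
  local file=$1
  local timeout=$2
  local status=unknown
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$status" = "unknown" ] && [ "$(date +%s)" -lt "$deadline" ]; do
    # A missing, unreadable or empty file is treated like "unknown".
    status=$(cat "$file" 2>/dev/null)
    [ -n "$status" ] || status=unknown
    if [ "$status" = "unknown" ]; then sleep 1; fi
  done
  echo "$status"
  [ "$status" = "success" ]
}

tmp=$(mktemp -d)

# Case 1: the status flips to "success" after about a second.
echo unknown > "$tmp/status"
( sleep 1; echo success > "$tmp/status" ) &
result=$(wait_for_status "$tmp/status" 10)
wait

# Case 2: the status never changes, so the deadline ends the loop.
echo unknown > "$tmp/stuck"
timed_out=$(wait_for_status "$tmp/stuck" 1)

echo "result=$result timed_out=$timed_out"
rm -rf "$tmp"
```

Leaving the loop with the status still `unknown` is then an ordinary timeout case that the caller can report as such.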
02:35
<
infinisil >
That might be it
02:35
<
infinisil >
Although not entirely
02:35
<
bqv >
and uh, prevstatus/prevactive aren't even used
02:36
<
infinisil >
There's definitely a logic error somewhere
02:36
<
infinisil >
But I'm not sure how I managed to avoid this all this time
02:36
<
infinisil >
And what's special about your config that triggers it
02:38
<
infinisil >
bqv: Is that failure reproducible?
02:38
<
infinisil >
Like, all the time?
02:38
<
infinisil >
Because from looking at the code I feel like it should be racy and not trigger all the time
02:39
<
bqv >
it's a very slow pc
02:39
<
bqv >
but yes, it's failed every time
02:40
<
bqv >
this looks very wrong to me
02:40
<
bqv >
why this -> active=$?
02:40
<
bqv >
isn't that checking the return code
02:41
<
bqv >
which isn't even relevant
02:41
<
bqv >
because timeout and because ssh
02:41
<
infinisil >
Should all propagate
02:41
<
bqv >
oh is that what batchmode is
02:41
<
infinisil >
Even without that
02:42
<
infinisil >
timeout doesn't preserve the exit status when it timed out
02:42
<
infinisil >
`man timeout`
02:42
<
infinisil >
--preserve-status
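Concretely, GNU `timeout` exits with 124 when the limit is hit, and reports the command's own status in that case only if `--preserve-status` is given. When the command finishes on its own before the limit, its exit status is passed through either way. A small local sketch, with `sleep` and `false` standing in for the real SSH call:

```shell
#!/usr/bin/env bash
# Exit statuses reported by GNU coreutils `timeout`.

timeout 1 sleep 5
timed_out=$?        # 124: the limit was hit

timeout --preserve-status 1 sleep 5
preserved=$?        # 143: sleep was killed by SIGTERM (128 + 15)

timeout 5 false
passed_through=$?   # 1: finished in time, status passed through

echo "timed_out=$timed_out preserved=$preserved passed_through=$passed_through"
```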
02:42
<
bqv >
(i just tried `ssh 10.0.0.1 false || echo false`, it output nothing?)
02:43
<
bqv >
i don't think that's it either though
02:43
<
infinisil >
That outputs false for me
02:43
<
bqv >
the timeout was 30, it failed in way less than that
02:44
<
bqv >
oh, it doesn't preserve it on not-timeout either
02:47
<
bqv >
[phi] Trying to confirm success...
02:47
<
bqv >
[phi] Tue 11 Aug 03:46:22 BST 2020
02:47
<
bqv >
[phi] Warning: Permanently added '10.0.0.4' (ED25519) to the list of known hosts.
02:47
<
bqv >
[phi] Tue 11 Aug 03:46:24 BST 2020
02:47
<
bqv >
[phi] $status=unknown
02:47
<
bqv >
[phi] This shouldn't occur!
02:47
<
bqv >
[phi] Finished
02:47
<
bqv >
this is with --preserve-status, and with a timeout of 30;
02:48
<
infinisil >
bqv: And that happens in less than 30 seconds?
02:48
<
bqv >
look at the datestamps
02:48
<
bqv >
they're 2 seconds apart
02:48
<
bqv >
the first date is right after trying to confirm success
02:48
<
bqv >
the last date is just before case $status
02:49
<
bqv >
that entire block fails in 2 seconds
02:51
<
infinisil >
But like, the active script needs to exit with 0 for this to occur
02:51
<
infinisil >
And it needs to output "unknown"
02:53
<
infinisil >
Ugh this is a mess, I should rewrite this in haskell already
02:54
<
bqv >
on multiple machines i have it that "ssh $machine false" is a success
02:55
<
bqv >
the manpage agrees with you, but my experience doesn't
02:55
<
infinisil >
Well mainly the `set -x` I guess
02:56
<
infinisil >
bqv: `ssh localhost false; echo $?`
02:56
<
infinisil >
What about that
02:56
<
infinisil >
The hell
02:57
<
bqv >
all i can think of is that it's due to my shell
02:57
<
infinisil >
What's special about it?
02:57
<
bqv >
well it's not bash
02:57
<
infinisil >
I guess that would explain the problem
02:57
<
infinisil >
what is it then
02:57
<
infinisil >
Yeah I think that could explain it
02:58
<
bqv >
can you make ssh not use the shell?
02:58
<
infinisil >
I wanted to do this anyways eventually
02:58
<
infinisil >
Let's see how easy it is
02:58
<
infinisil >
Actually I'm not sure if that's possible
02:59
<
bqv >
that fixes it
02:59
<
infinisil >
exec where
02:59
<
bqv >
ssh localhost exec false
02:59
<
bqv >
that's a failure, for me
03:00
<
bqv >
so... if i stick an exec before the switch script
03:00
<
bqv >
seems way more promising at least
03:00
<
bqv >
it's doing stuff
03:01
<
bqv >
yeah, you need to add an exec, cause it's harmless for those who use bash but fixes it if they don't :)
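Background on why `exec` can matter here: sshd does not run the remote command directly but hands the command string to the account's login shell as `<shell> -c '<command>'`, so the exit status `ssh` returns is whatever that shell reports. A rough way to check a given shell locally, without SSH, assuming the shell accepts `-c`; the function name is made up for this sketch:

```shell
#!/usr/bin/env bash
# shell_status SHELL COMMAND (hypothetical helper): runs COMMAND the way sshd
# would for an account whose login shell is SHELL, then prints the exit
# status that comes back.
shell_status() {
  "$1" -c "$2"
  echo $?
}

plain=$(shell_status /bin/sh 'false')        # status as reported by the shell
execed=$(shell_status /bin/sh 'exec false')  # shell replaced by `false` itself

echo "plain=$plain execed=$execed"
```

With a POSIX `sh` both values are 1. Swapping `/bin/sh` for the login shell in question shows whether the plain form loses the status, which is the case the `exec` prefix works around.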
03:01
<
bqv >
i have a successful system switch, finally
03:02
<
infinisil >
Alternatively I could not use the exit code to transmit state
03:02
<
infinisil >
But at this point, an exec hack on top doesn't matter much
03:03
<
bqv >
seems reasonable to do that, because connection refused is also an "active" state that you slurp up with that
03:03
<
infinisil >
It's not
03:03
<
infinisil >
In your shell it is
03:04
<
bqv >
i dunno, at any rate exec makes it work so it'd be cool if you could pop that on the pure branch too :p
03:04
<
infinisil >
Yeah why not
03:04
<
infinisil >
pure is merged into master btw
03:06
<
bqv >
awesome, will move back
03:08
<
infinisil >
bqv: Pushed the exec change to master
03:18
<
infinisil >
bqv: Thanks for the error reporting and debugging!
03:29
<
bqv >
(not a major issue, just, i've had to rerun the deploy script a few times because slow to start disks)
03:34
<
infinisil >
bqv: PR? I need to sleep now :)
03:35
<
infinisil >
Eh I'll just increment it myself real quick
03:35
<
infinisil >
bqv: How many seconds is enough for your system?
03:36
<
bqv >
i'd been setting 30, but even 15 would probably be constant success
03:38
<
infinisil >
bqv: pushed to master
03:38
<
infinisil >
Won't hurt to have this a bit higher anyways
03:43
<
bqv >
next step, start using it for all my hosts, but i'll do that tomorrow
06:29
<
bqv >
infinisil: I get a few errors from nix-copy-closure, I think. Invalid operations on the daemon
06:29
<
bqv >
Might be fixed by using nix copy instead
13:19
<
bqv >
I might need to make a PR
13:43
<
infinisil >
bqv: The problem might be that the current system's nix-copy-closure is used, which might not be the same Nix version as the target host
13:43
<
infinisil >
But I'm not sure if any other version can be used because nix-copy-closure needs to have access to the deploy host's nix store
14:32
<
bqv >
infinisil: yeah, the alternative is to check the nix version of the deployer/target, that's what I was gonna try
14:33
<
bqv >
nix copy is backwards compatible
14:33
<
bqv >
So as long as it's present it can be used
15:53
<
bqv >
oh, nevermind, nix copy works on old nix too
16:52
<
bqv >
infinisil: http://ix.io/2tUs this is the kind of stuff i'm seeing. using `nix copy` instead doesn't help.
16:52
<
bqv >
just getting random "invalid operation"s
17:05
<
infinisil >
Never seen this before
17:06
<
infinisil >
bqv: What are you currently using to deploy your machines? How does that copy the closure?
17:07
<
bqv >
i'm currently using my nixos-rebuild wrapper, git and/or nix copy
17:07
<
bqv >
nix{-copy-closure, copy} is indeed temperamental, but this seems ...persistent
19:44
malook has joined #nixus
19:44
eyJhb has joined #nixus
19:44
<
eyJhb >
Why are there so many nix channels?!
19:44
<
eyJhb >
I am in 8 atm... :| But end goals, if any?
19:45
<
infinisil >
eyJhb: I guess one goal is to be the best deployment tool there can be :P
19:45
<
eyJhb >
As long as it does not play the pokemon theme song
19:46
<
eyJhb >
But I am thinking, do we have any good tools atm. to do huge 1000+ deployments?
19:46
<
infinisil >
Like, one goal is that you can be sure that you never lose access to a remote machine
19:46
<
infinisil >
With automatic rollback and stuff
19:46
<
infinisil >
eyJhb: I haven't thought about huge deployments, but if possible I'd love to make nixus support that too
19:47
<
infinisil >
And actually some ideas I have would play well into that
19:48
malook has quit [Quit: malook]
19:49
<
infinisil >
eyJhb: Oh another goal I have in mind is to have many Nixus modules that change config on multiple machines
19:49
<
infinisil >
For things that require changes on more than one machine
19:49
<
eyJhb >
Do none of the others have automatic rollback?
19:50
<
eyJhb >
Yeah, I saw that but I am unsure what it does vs. just normally configuring stuff?
19:50
<
infinisil >
I can't remember any others having rollback, might not have looked close enough though
19:51
<
infinisil >
eyJhb: It essentially allows you to say "I need to be able to `ssh user@host`" and the module sets your authorized key on the target host and known_hosts on the local one
19:51
<
infinisil >
If you just did the authorized key bit (which works in normal NixOS), you still have to configure known_hosts so you don't get an SSH warning when you connect at first
19:52
<
infinisil >
eyJhb: This is just the start of it though. I want to make a backup module (for znapzend) next, which uses the SSH module to make sure that the backups work (they need SSH access to the backup machine)
19:53
<
infinisil >
Or a VPN module that allows you to do `vpn.<id> = { server = "host1"; clients = [ "host2" "host3" ]; }`
19:54
<
infinisil >
And it takes care of configuring everything on all involved hosts
19:54
<
eyJhb >
Seems advanced. but basically modules that work at a higher level
19:54
<
infinisil >
Also, I want to have inter-host dependencies
19:55
<
infinisil >
So you can e.g. wait for another host to start its DNS server first before doing another thing on a different host
19:55
<
eyJhb >
Sounds nice! But in general, it seems like switching from X to Y to Z, involves some basic changes and the config stays the same pretty much
19:56
<
eyJhb >
Not Nixus related directly
19:56
<
eyJhb >
Also, is it free time or work?
19:56
<
infinisil >
Not sure what you mean
19:56
<
infinisil >
I guess small config changes only change a small part in the end
19:57
<
eyJhb >
For switching between nixus, morph, etc. the base config for each host is the same I assume
19:57
<
infinisil >
eyJhb: Free time (there was part that was sponsored by Niteo though, but that's no more unless they plan to actually use this)
19:58
<
eyJhb >
What are they currently using, if not Nixus?
19:58
<
infinisil >
eyJhb: Ah yeah, mostly. Switching from Nixus -> something else isn't possible with the multi-host modules though
19:58
<
infinisil >
(pretty sure Nixus is the only tool that allows multi-host modules for now)
19:59
<
infinisil >
eyJhb: Nothing yet, just slowly testing the waters
20:03
<
bqv >
I'm somewhat using it
20:03
<
bqv >
I just have a nix bug
20:04
<
bqv >
So I can't deploy remote hosts successfully consistently
20:04
<
infinisil >
bqv: Oh it's a Nix bug?
20:05
<
bqv >
infinisil: I can reproduce it with plain nix-copy-closure
20:05
<
bqv >
So it must be
20:05
<
infinisil >
Ah but nix copy works?
20:05
<
bqv >
No that fails too
20:06
<
bqv >
Nothing seems to want to copy, at the moment
20:07
<
infinisil >
Well at least it's not a Nixus problem then :P
20:08
<
bqv >
So with that sorted I'll see no reason not to use nixus
20:09
<
infinisil >
Well, other than it being experimental and stuff lol
20:09
<
infinisil >
There's also some annoying things like Ctrl-C not working and logs not being nice
20:10
<
infinisil >
Nothing permanent of course, just bash and distributed system problems
20:10
<
bqv >
pfft, light work
21:29
drakonis has joined #nixus
22:23
<
drakonis >
cleaning my phone