samueldr changed the topic of #nixops to: NixOps related talk | logs:
lordcirth_ has joined #nixops
<adisbladis> I made a draft PR for nixops2Unstable (with the plugin infra)
<{^_^}> #83548 (by adisbladis, 36 minutes ago, open): nixops2Unstable: init at 1.8.0
bhipple has quit [Ping timeout: 256 seconds]
bhipple has joined #nixops
tg has quit [Quit: Leaving]
allgreed has quit [Quit: Ping timeout (120 seconds)]
allgreed has joined #nixops
jbgi_ has joined #nixops
dhess has joined #nixops
<dhess> Hi, just updated my NixOps to the latest commit (cca0b74). Now I'm getting this on a deploy:
<dhess> bouncer.....> could not connect to ‘root@bouncer’, retrying in 1 seconds...
<dhess> for every machine in my network.
<dhess> I can ssh into them just fine without NixOps.
<gchristensen> ruh roh :)
<gchristensen> thanks for testing, can you try and bisect a bit?
<dhess> Probably
<gchristensen> are you using any plugins?
<dhess> I don't think so.
<dhess> It's somewhere after e1cc7ab9972762fc1c6381b65db75a6dce7685f4 I believe
<gchristensen> looking ...
<gchristensen> oh boy, that is a lot of changes from there to now
<dhess> yeah give me a sec, I'm trying to figure out how to override NixOps in my wrapper script
<gchristensen> thank you :D
<dhess> I should be able to bisect this reasonably easily
<dhess> whoa
<dhess> many many ssh processes now running
<dhess> ok they died :)
<dhess> overlays to the rescue
<dhess> this is so much easier in Nixpkgs than anything else.
<gchristensen> oh?
<dhess> oh it's not fixed yet, I just mean it's a lot easier to test this in nixpkgs than anything else.
<dhess> :)
<gchristensen> ahh yeah
<gchristensen> :D
<dhess> e1cc7ab9972762fc1c6381b65db75a6dce7685f4 is good
<dhess> I'll start bisecting
<gchristensen> _phew_
<gchristensen> thanks!
jbgi_ has quit [Ping timeout: 264 seconds]
<dhess> ok I got a different error actually
<dhess> during a bisect
<gchristensen> oh?
<dhess> on f2a5b743f4de55fbd884430cd6dfe152f38f90f0
<dhess> but I'm gonna keep going for now
<gchristensen> oh man
<gchristensen> dhess: you need to skip past all the `migrate: ` commits
<gchristensen> I probably should have squished those
<dhess> oh ugh
<dhess> this is the *one* time when I think squashed commits are ok :)
<gchristensen> yeah...
<gchristensen> it made reviewing much easier, but yeah.
<dhess> I'm gonna predict this is the one
<dhess> it's the one I suspected when I looked at the commits
<gchristensen> whichun?
<dhess> I was wrong :)
<dhess> (not telling to protect the names of the innocent)
<dhess> cca0b749542d255056b4c13029a4855e455af862
<dhess> that's the one
<gchristensen> huh
<gchristensen> ...really
<gchristensen> but ...
<dhess> I think it's the stdout stuff, probably not actually ssh
<dhess> error: Multiple exceptions (11):
<dhess> * bouncer: process.stdin was None
<gchristensen> one sec, gotta watch a machine boot
<dhess> that was at the end. I thought it was incidental. Turns out it was probably the real G
<dhess> /stdout/stdin/
<gchristensen> oh no
<dhess> btw, I hate None. Or null. Or nil. Or whatever it's called in each language. Probably the #1 all-time cause of bugs.
<gchristensen> +1111111
<gchristensen> this commit is supposed to be fixing that, of course :P
<dhess> lol
<gchristensen> dhess: I don't see the origin of the problem here ... are you able to send a PR?
<adisbladis> Huh
<dhess> gchristensen: you mean figure out the fix?
<gchristensen> I'm just not sure what changed here to break it
<dhess> well as stdin was never previously checked, from what I can see there, perhaps stdin == None is not an error condition
<gchristensen> ooh
<adisbladis> Probably will fix it ?
<dhess> yeah I would guess that stdin is None if you're not sending any input
<adisbladis> dhess: Looking at `passed_stdin`, I find this very confusing.
<adisbladis> The type signature would indicate it's never valid to be None.
<dhess> I've never used Mypy so I don't know how robust its type system is.
<adisbladis> dhess: Even looking at the assignments it doesn't make sense that it would be None.
<dhess> are you looking at the devnull case?
<adisbladis> Yeah
<dhess> I assume devnull is None, then.
<adisbladis> devnull = open(os.devnull, "r+")
<dhess> does opening os.devnull in python return None?
<adisbladis> Nope
<gchristensen> I'd expect stdin to be None if we don't explicitly tell popen we're going to use it
<dhess> ok but when python tries to read from /dev/null the stdin value it will return is surely None
<gchristensen> I wonder what typeshed has to say about the type sig
<adisbladis> gchristensen: Optional[IO[str]]
<gchristensen> m
<gchristensen> hm
<adisbladis> But that's true for all stdin/stdout/stderr
<dhess> process.stdin is supposed to be the Text value of stdin, I assume?
<dhess> not the fd?
<adisbladis> To the documentation mobile!
<dhess> I'm pretty sure it's the string. You're assigning its value to a string, if I'm reading the Mypy annotation correctly
<dhess> so that will be None, because it's the result of reading from /dev/null
<adisbladis> dhess: No, it's supposed to be a file descriptor.
<dhess> ok
<adisbladis> This is very confusing.
<dhess> Popen.stdin
<dhess> If the stdin argument was PIPE, this attribute is a writeable stream object as returned by open(). If the encoding or errors arguments were specified or the universal_newlines argument was True, the stream is a text stream, otherwise it is a byte stream. If the stdin argument was not PIPE, this attribute is None.
<dhess> there you go
<dhess> it wasn't PIPE
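The documentation excerpt dhess quotes can be checked with a minimal sketch. The helper name `stdin_attr` and the trivial child command are illustrative assumptions, not NixOps code; the only thing shown is how `Popen.stdin` depends on the `stdin` argument.

```python
import os
import subprocess
import sys

# A child that exits immediately; sys.executable avoids relying on an external binary.
CHILD = [sys.executable, "-c", "pass"]


def stdin_attr(stdin_arg):
    """Start the child with the given stdin argument and return Popen.stdin."""
    proc = subprocess.Popen(CHILD, stdin=stdin_arg)
    handle = proc.stdin
    if handle is not None:
        # Only a PIPE gives us a stream that we own and must close.
        handle.close()
    proc.wait()
    return handle


# Passing an already-open file (the devnull case from the discussion):
with open(os.devnull, "r+") as devnull:
    from_file = stdin_attr(devnull)

from_pipe = stdin_attr(subprocess.PIPE)   # a writable stream object
from_default = stdin_attr(None)           # child inherits the parent's stdin

print(from_file is None, from_pipe is None, from_default is None)
```

So a `None` value for `process.stdin` is the normal result whenever something other than `subprocess.PIPE` was passed, which matches the devnull case discussed above.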
<adisbladis> Ok, then I think the correct fix is already what I proposed.
<dhess> Isn't it the case that process.stdin can only be None there if Popen is broken?
<gchristensen> I don't think so...
<adisbladis> It seems that it can be None when what's passed is an fd (which can be fully consumed/exhausted)
<adisbladis> That makes sense
<adisbladis> But in the case of subprocess.PIPE (-1) it returns an fd for you to write into
<adisbladis> So the fix I pastebined should do the trick as we only check for None when we actually want to use that FD
<adisbladis> This _may_ also do the trick:
<adisbladis> Though I'm not sure
<adisbladis> dhess: Seems to have done the trick :)
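The pastebinned fix is not part of this log, so the following is only a sketch of the idea adisbladis describes: treat `process.stdin is None` as an error only when there is input to send. The function name `run_with_optional_input` and the child command are hypothetical and do not reflect the actual NixOps change.

```python
import subprocess
import sys
from typing import List, Optional

# Illustrative child: upper-cases whatever arrives on stdin.
ECHO_UPPER = [sys.executable, "-c",
              "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def run_with_optional_input(argv: List[str],
                            stdin_string: Optional[str] = None) -> str:
    """Run argv, feeding stdin_string to the child only if one was given."""
    # Ask for a pipe only when we intend to write to it.
    stdin_arg = subprocess.PIPE if stdin_string is not None else subprocess.DEVNULL
    process = subprocess.Popen(argv, stdin=stdin_arg,
                               stdout=subprocess.PIPE, text=True)

    if stdin_string is not None and process.stdin is None:
        # This is the only situation where a missing stdin stream is a bug.
        raise RuntimeError("process.stdin was None although a pipe was requested")

    # communicate() writes the input (if any), closes stdin, and drains stdout,
    # which avoids the write/read deadlock that manual piping can cause.
    output, _ = process.communicate(stdin_string)
    return output


print(run_with_optional_input(ECHO_UPPER, "hello"))
print(repr(run_with_optional_input(ECHO_UPPER)))
```

With no input the check is skipped entirely, so a `None` stream no longer raises.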
<gchristensen> nice
<adisbladis> I wonder if we still need that deadlock check?
<gchristensen> I don't think so, I think stringio just fixes it
<gchristensen> I answered a question on SO about this once ... let me find it
<adisbladis> gchristensen: That was my gut feeling too
<gchristensen> Apparently a cStringIO.StringIO object doesn't quack close enough to a file duck to suit subprocess.Popen. How do I work around this?
<dhess> Is that committed? If so I'll test it
<adisbladis> gchristensen: hrm indeed
<gchristensen> adisbladis: do you think it is feasible to have a test for this?
<gchristensen> not trying to add work, but if it is reasonably easy, might be good :)
<adisbladis> gchristensen: I've added a test that would have caught this :)
<gchristensen> cool!
<gchristensen> thanks :D
<gchristensen> dhess: can you test master?
<gchristensen> I meant to ask that before I merged it, but here we are
<dhess> Is it possible that NixOps has gotten a lot (I mean a lot) slower with all of these Mypy commits?
<dhess> It's taking about 2 or 3 minutes just to get to the "building all machine configurations" step
<dhess> whereas it used to take, maybe 30 seconds?
<gchristensen> no, not possible
<adisbladis> dhess: Mypy is not checking things at runtime.
<gchristensen> python completely ignores types at runtime
<gchristensen> yeah
<adisbladis> (for better or worse)
<gchristensen> (both?)
<dhess> ok
<adisbladis> gchristensen: Heh, I was debating whether I should say `or` or `and` ;)
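The point that type annotations cost nothing at runtime can be illustrated with a small sketch. The function `double` is a made-up example: Mypy would reject the call below, but the interpreter runs it without any check.

```python
from typing import get_type_hints


def double(value: int) -> int:
    # The annotations are stored as metadata; nothing enforces them at call time.
    return value * 2


# A static checker would flag this call, yet it runs without error.
result = double("ab")

# The annotations remain available for tools that want to inspect them.
hints = get_type_hints(double)

print(result)
print(hints)
```

Because the interpreter never validates the annotations, adding them does not slow down a deploy.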
<dhess> so no change in how NixOps evaluates the deployment expression or anything like that?
<gchristensen> no, is it possible Nix had to write many many drv's this time?
<adisbladis> Any time spent in nixops is completely dwarfed by the nix eval (usually)
<dhess> Nope. I made no changes to my configs.
<dhess> This is just a NixOps update
<gchristensen> hrm
<gchristensen> dhess: if you run it again?
<adisbladis> gchristensen: Is it possible for our tests to upload artifacts anywhere?
<gchristensen> hrm I have no idea
<gchristensen> I would like to say yes :) what kind of artifacts?
<gchristensen> they can post github comments, maybe they can post comments with files
<adisbladis> gchristensen: Detailed coverage report
<gchristensen> that would be great
<gchristensen> dhess: thanks for the report
allgreed has left #nixops ["The Lounge -"]
abathur has quit [Ping timeout: 264 seconds]
<dhess> gchristensen: thank you for the incredibly fast fix!
<dhess> I'm wondering if it's the YubiKey/gpg-agent, though.
<gchristensen> my thinking is we can merge often and revert often too :P
<dhess> My network stack was actually still working. I was able to file the issue while I was waiting for NixOps to time out (which it never did -- I had to reboot). But something was locking up `ps`, Activity Monitor, and other processes.
<gchristensen> wow.
<dhess> gchristensen: do you know why NixOps would occasionally seem like it's using my personal SSH keys, rather than the ones it generates?
<dhess> It definitely prompts me for the PIN that unlocks my YubiKey, after gpg-agent drops its cache.
<gchristensen> nixops doesn't say ONLY use the key it knows about
<gchristensen> and then ssh just does ... whatever ... order
<dhess> ahh I thought it did. Well, that would almost certainly explain it.
<dhess> The YubiKey probably locks up dealing with all the requests.
<gchristensen> I wish it did, I have enough SSH keys that it breaks SSH for nixops
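One way to make ssh offer only a specific key is OpenSSH's `IdentitiesOnly` option combined with `-i`. The sketch below just builds such a command line; the helper name `ssh_argv` and the key path are hypothetical, and this is not how NixOps constructs its ssh invocation.

```python
from typing import List


def ssh_argv(host: str, key_path: str, command: List[str]) -> List[str]:
    """Build an ssh command line that offers only the given identity file."""
    return [
        "ssh",
        "-i", key_path,               # the key we want ssh to use
        "-o", "IdentitiesOnly=yes",   # ignore other identities offered by an agent
        host,
        *command,
    ]


# Hypothetical key path; the host name is taken from the log above.
argv = ssh_argv("root@bouncer", "/path/to/generated_key", ["true"])
print(argv)
```

With `IdentitiesOnly=yes`, ssh should not try every key held by an agent such as gpg-agent, which is the behaviour that triggers the YubiKey PIN prompts described above.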
<dhess> all kinds of other weird shit also happened. My desktop images switched back to the default, for example.
<dhess> but only on some of my (virtual) desktops.
<gchristensen> wow what
<dhess> yeah this was a very strange issue. After a reboot, my Mac wanted to file a report that it had rebooted due to a kernel crash. Took the machine about 3-4 minutes to reboot.
<dhess> etc.
jbgi_ has joined #nixops
<dhess> anyway it seems like a super useful feature, but I suspect that anybody using a hardware device to manage their SSH private keys is going to run into this issue.
<dhess> that'll teach me not to upgrade my NixOps :)
<dhess> I only did it to be ready for your S3 state stuff.
<gchristensen> no way I'm glad you did
<gchristensen> I hope you continue to ... :)
<dhess> will do, of course, NixOps is great.
<dhess> it sounds like stateless deployments are also coming?
<gchristensen> yeap :) just need to iron out some kinks, like not generating an SSH key on each deploy
<dhess> that'll be amazing. At least half of my deployments could run that way.
<gchristensen> nice
abathur has joined #nixops
jbgi_ has quit [Ping timeout: 256 seconds]
<dhess> gchristensen: anyway, pretty sure that explains why I was seeing that extremely slow performance, even when I was just testing with 1 or 2 hosts at a time.
<gchristensen> I believe that :)
<dhess> gchristensen: did the context manager stuff make it possible for multiple SSHes to happen in parallel, even just one per host at a time?
<dhess> because I'm *pretty* sure that NixOps is still slower than it used to be for me during a deploy, even after you reverted that parallel key copy commit.
<gchristensen> no, that didn't change any of that -- it has always been able to SSH to many at once
<gchristensen> open_deployment should be renamed to break plugins more thoroughly
jbgi_ has joined #nixops
johnny101 has quit [Quit: Konversation terminated!]
bhipple has quit [Ping timeout: 264 seconds]
bhipple has joined #nixops
jbgi_ has quit [Ping timeout: 264 seconds]