sphalerite changed the topic of #nixos-dev to: NixOS Development (#nixos for questions) | NixOS 19.03 released! https://discourse.nixos.org/t/nixos-19-03-release/2652 | https://hydra.nixos.org/jobset/nixos/trunk-combined https://channels.nix.gsc.io/graph.html https://r13y.com | 19.03 RMs: samueldr,sphalerite | https://logs.nix.samueldr.com/nixos-dev
<clever> andi-: nixops will also do send-keys for you, after deploy, start, and reboot
<clever> the main use for manual send-keys is when nixops isn't aware of a remote reboot and loss of key material
<gchristensen> andi-: I do
<andi-> gchristensen, clever: mostly interested in your experience with it.. I am looking at it in the context of the systemd bump and dropping a patch that causes a bit of pain for the generic systemd services. I built a libvirt test environment and whenever I want to unlock the disk it takes >30s because systemd_pam runs into a udev timeout. This happens to me on 19.03 and unstable..
<andi-> Is that also the case on your machines?
<andi-> s/udev/dbus/
<gchristensen> ouch
<gchristensen> I can't think of anything like that happening, which makes me think it doesn't happen
<andi-> is basically my configuration
<andi-> that issues also gives some context of the impact of our nixops-enabling systemd patch.. Many related PRs and issues exist.
<gchristensen> what is the hack supporting in nixops? I've not used send-keys in such a way that it had much of anything to do with systemd I think
<gchristensen> early disk decrypting I guess
<andi-> not really early but decrypting during boot (after SSHd has started)
<clever> andi-: i'm not doing anything with systemd/nixops, and no luks, but my nixops is hanging at deploy
<clever> [root@nas:~]# systemctl list-jobs
<clever> 1893709 sys-subsystem-net-devices-enp3s0.device start running
<clever> this appears to be the cause
<andi-> that is probably a completely different issue
<clever> May 27 03:18:45 nas systemd[1]: sys-subsystem-net-devices-enp3s0.device: Job sys-subsystem-net-devices-enp3s0.device/start failed with result 'timeout'.
<clever> yeah
<gchristensen> andi-: ah ... I don't do that
<gchristensen> interesting.
<andi-> the note was removed but the defect is still there :/
<gchristensen> oooOOooo
<andi-> It disappears if you do a `nixos-rebuild switch` but not if you boot the machine.
<gchristensen> ack.
<andi-> Now I am at a point - after a few weeks of dealing with things like these - where I would go for breaking `nixops send-keys` and requiring the `_netdev` option to be set. It fixes issues for everyone (racy ones) and only requires a few additional lines for a few people.
<gchristensen> okay I'm not against that
<gchristensen> what is the breakage users will see?
<andi-> that is the hard part... I had a bit of a heated debate about it with flokli an hour ago.. It breaks local-fs.target if _netdev isn't set for those mount points. I am not for that breakage but he convinced me that at some point we might have to do it.. There is no way we can traverse all the layers of indirection to automagically add the option :/
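For context, the `_netdev` fix being debated amounts to one extra mount option on the affected filesystem. A minimal NixOS sketch (device name and mount point are hypothetical):

```nix
{
  # Hypothetical LUKS-backed filesystem unlocked via `nixops send-keys`.
  # "_netdev" makes systemd order the mount after the network and pull it
  # in via remote-fs.target instead of local-fs.target, so boot does not
  # hang waiting for a device that only appears once keys are sent.
  fileSystems."/data" = {
    device = "/dev/mapper/crypted-data";
    fsType = "ext4";
    options = [ "_netdev" ];
  };
}
```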
<gchristensen> right
<andi-> We could probably add safeguards to nixops to warn if no FS with the option was found etc.. Still not perfect.
<gchristensen> that would warn almost every user, right?
<andi-> probably
<gchristensen> oof
<andi-> every user that uses autoluks that is
<gchristensen> is there any way we can start warning users now?
<gchristensen> warning in such a way that we know 100% they're using this feature
<samueldr> and backport said runtime warning to at least current stable
<andi-> I'd combine the nixpkgs change with a change to nixops that fails (unless opted out?) when we see autoLuks being used without _netdev. Also a big fat warning in the release notes of 19.09... In general I'd like to find a better way to handle this. I'll spend most of the night/tomorrow reading through systemd docs one more time..
<gchristensen> I think having an extremely well targeted alert would be a great start
<andi-> A way to mark any .mount unit as "_netdev" by only knowing the .device unit is something we need.
<andi-> Yes, definitely.
<gchristensen> though: what happens if they miss it?
<gchristensen> what is the risk
<andi-> Rollback, open an issue, pop up on IRC (screaming at us) and we point them to the alert we tried to get out?
<flokli> Well, rolling back needs to be done manually. They might not be able to ssh to machines, as sshd is waiting for local-fs.target
<andi-> it does fail quickly (a few minutes)
<andi-> but then it is stuck in a rescue shell
<flokli> So no ssh
<flokli> I also thought about having some deployment option that needs to be set, and cancelling deployment otherwise. But UX is pretty bad
<gchristensen> that would be very ugly on AWS where there is no console
<andi-> is what you end up in
<flokli> The longer-run solution would be to move the autoluks part to crypttab and add somewhat transitive propagation of the _netdev attribute from the crypt device to mountpoints natively into systemd... But that's a very long stretch
<andi-> (year+)
<gchristensen> I feel very uncomfortable with the possibility of losing data
<gchristensen> hmm I guess it wouldn't be lost, but the recovery would be extremely annoying
<flokli> Well, you manually need to boot an older generation, fix your nixops config, then redeploy
<gchristensen> right, but that isn't really possible on AWS / other hosts without console access
<andi-> (once you figured what went wrong)
<flokli> So maybe enforcing some new deployment option to be set, to kinda ensure the nixops user read the changelog and updated his fstab config?
<flokli> It's ugly, but better than the possible breakage?
<andi-> how does nixpkgs figure out if nixops is being used? Checking for `deployment` configuration(s)?
<Shados> gchristensen: There's literally always a recovery approach possible. If you've got no console, typically you do still have the ability to boot from an ISO, in which case you can just edit the grub menu from there to rollback.
<Shados> But it would be a pain for someone, somewhere, no doubt.
<gchristensen> on AWS I believe the root device would need to be snapshotted, mounted on another machine, edited, snapshotted, and booted from -- destroying the original
<flokli> The thing is, detecting whether all mountpoints have that option set or not is (almost) impossible to 'inspect' from nix, as it can be arbitrarily nested
<andi-> you can check if any have it set.. any other assumptions are easily false (iscsi multipath, raid1, …)
<andi-> making an `autoLuks.foo.mountPoint = "/foo";` mandatory would work around the issue
<andi-> the user must then ensure not to screw it up or provide the wrong ones.. Not very nice but not the worst.
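A sketch of what the proposed option might look like; only `mountPoint` is the actual proposal from the discussion, and the surrounding `autoLuks` attribute names are hypothetical, not verified against the nixops source:

```nix
{
  # Hypothetical shape: making mountPoint mandatory on the autoLuks
  # device would let nixops inject "_netdev" into the matching
  # fileSystems entry by itself, for the trivial (non-nested) cases.
  deployment.autoLuks.foo = {
    device = "/dev/xvdf";   # placeholder device
    mountPoint = "/foo";    # the proposed new, mandatory attribute
  };
}
```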
<gchristensen> this sounds promising
<andi-> I am thinking about workloads where it is not a filesystem that is being used on top of that encrypted device... database logs (oracle did raw disks), ceph, … Would they be affected as well? Probably not.
<gchristensen> I care about these less, as the diagnostics and recovery is much simpler
<gchristensen> (I mean, I care about them plenty -- but am less concerned)
<andi-> At least not at first.. Maybe we figure something out later. Keeping systems bootable is the most important bit.
<gchristensen> +1
<andi-> Guess I'll be working on another NixOps PR then.. introducing the _netdev option when the mountPoint is set and not explicitly set to some "I know what I am doing" value..
<Shados> andi-: flokli: What is it that prevents adding or checking _netdev on the mount points from nixops?
<andi-> Shados: For the trivial cases (filesystem directly on top of the luks device) we could probably do that. For the more interesting cases (lvm on luks, raid on luks, …) we cannot follow the indirections that might be resolvable during runtime.
<andi-> e.g. you might be creating a raid1 device from multiple encrypted volumes. Knowing which part/disk/… uuid the device might have combined with another device isn't really something we usually know.
<Shados> I see what you mean
<andi-> Generally speaking it wouldn't work for any nixos-generated config since those use uuids and not device paths.
Synthetica has quit [Quit: Connection closed for inactivity]
disasm| has quit [Quit: WeeChat 2.0]
drakonis has joined #nixos-dev
disasm has joined #nixos-dev
<Shados> Is there a standard way of getting a Nix list of arbitrary strings (e.g. including ones containing spaces) into a bash array? I am guessing not, given the etc-builder doesn't do this, but confirmation would be nice
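One approach, not confirmed in the log but based on `lib.escapeShellArgs`, which has been in nixpkgs for a long time: escape every element, then splice the result into a bash array literal inside the builder script. The list contents here are just examples:

```nix
# Sketch: xs is an arbitrary Nix list of strings, including ones with
# spaces and quotes; escapeShellArgs single-quotes each element, so bash
# word-splitting reconstructs the original list as array elements.
let
  lib = import <nixpkgs/lib>;
  xs = [ "plain" "with space" "single'quote" ];
in ''
  arr=(${lib.escapeShellArgs xs})
  echo "''${#arr[@]}"   # prints 3
''
```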
<Shados> And on a related note: I can't find the list of characters that are illegal in store paths in the Nix manual; is it in there?
ryantm has joined #nixos-dev
<srhb> Shados: https://nixos.org/~eelco/pubs/phd-thesis.pdf Search for abcdf
<srhb> (Would be nice to stick in the manual I suppose)
<Shados> srhb: Had to go a few pages down for the name portion of the store path, but thanks. For anyone interested: alphanumerics and `+-._?=` are valid.
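That character set is easy to check with `builtins.match`; the regex and test names below are just for illustration of the set Shados quotes:

```nix
# Validates a store-path *name* component against alphanumerics
# plus the characters + - . _ ? =  (the "-" is last so it is literal
# inside the POSIX bracket expression).
let
  validName = s: builtins.match "[A-Za-z0-9+._?=-]+" s != null;
in {
  ok  = validName "hello-2.10";   # true
  bad = validName "has space";    # false
}
```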
drakonis has quit [Quit: WeeChat 2.4]
teto has quit [Quit: WeeChat 2.4]
orivej has quit [Ping timeout: 245 seconds]
pie_ has quit [Ping timeout: 258 seconds]
Synthetica has joined #nixos-dev
pie_ has joined #nixos-dev
pie___ has joined #nixos-dev
pie_ has quit [Ping timeout: 248 seconds]
orivej has joined #nixos-dev
<arianvp> I was talking to the sifive guys 2 years ago at fosdem and if we ask nicely they might be able to provide you a machine for NixOS support
<arianvp> They seemed open to the idea
<arianvp> They're interested in getting more distros running on their platform
<genesis> noticed.
Cale has quit [Ping timeout: 264 seconds]
Cale has joined #nixos-dev
init_6 has joined #nixos-dev
copumpkin has joined #nixos-dev
orivej has quit [Ping timeout: 245 seconds]
init_6 has quit []
Enzime has quit [Ping timeout: 245 seconds]
Jackneill has quit [Read error: Connection reset by peer]
Jackneill has joined #nixos-dev
orivej has joined #nixos-dev
pie___ has quit [Ping timeout: 248 seconds]
<infinisil> ,stuck
<infinisil> nixpkgs-unstable seems to be stuck
<andi-> succeeded ~50min ago: https://hydra.nixos.org/build/94117178
<samueldr> still needs to be fully built
<samueldr> looks like there's exactly one job in queue
<infinisil> andi-: it also succeeded 5 days ago and 2 days ago: https://hydra.nixos.org/job/nixos/trunk-combined/tested/all
<infinisil> Yet no channel update. Or does the "tested" job mean something else?
<samueldr> the channel upgrade script uses `latest-finished` (Latest successful build from a finished evaluation) to update
<samueldr> this can be found in the Links tab of the job https://hydra.nixos.org/job/nixos/trunk-combined/tested#tabs-links
<samueldr> here it looks like that something fishy is going on
sir_guy_carleton has joined #nixos-dev
<samueldr> oh right, nixpkgs* it's not using tested
<samueldr> infinisil: it's that job that matters for nixpkgs channels https://hydra.nixos.org/job/nixpkgs/trunk/unstable#tabs-constituents
<infinisil> Ah
<infinisil> So probably the metrics job was the cause
<infinisil> But no idea why, i can't get to the logs of the last failed metrics build
<samueldr> >> builder for '/nix/store/b7n7axj5yqqnxsx152lv32030b8hx6ab-nixpkgs-metrics.drv' failed with exit code 137
<samueldr> >> builder for '/nix/store/423lmbdhzzc430p4ks7zgn7q47dqjfx5-nixpkgs-metrics.drv' failed with exit code 137
<samueldr> OOM
<infinisil> Ahh
<infinisil> The usual then..
<samueldr> :/ it succeeded on the same machine it failed, so uh
<infinisil> Why does it need so much memory anyways?
<gchristensen> it evaluates all of nixpkgs
<infinisil> And Nix can't throw out part of the result because it might be needed later?
<gchristensen> nixpkgs is a highly connected graph
<gchristensen> and .override() and .overrideAttrs means keeping copies of iirc the entire pkgs set around all the time
<infinisil> Okay but why not evaluate one attribute after the next?
<infinisil> Or 10 at a time
<gchristensen> I don't understand the question
orivej has quit [Ping timeout: 258 seconds]
JosW has joined #nixos-dev
<infinisil> gchristensen: I assumed "evaluating all of nixpkgs" meant evaluating all attributes, is that not it?
<gchristensen> sure, nix does evaluate one at a time
<gchristensen> so one thought is to have a thunk and its evaluated state be interchangable, so if memory needs freeing it can "unevaluate" that thunk
<infinisil> Hmm yeah
<infinisil> gchristensen: How much memory do the build machines failing with OOM have?
<gchristensen> which ones are failing with OOM?
<infinisil> The metrics one in this case
<infinisil> E.g. ^^
<infinisil> I see ~8GB at the bottom, so probably that's it
<samueldr> gchristensen: t2a
<samueldr> IIRC it's been targeting t2a specially
<gchristensen> hmm I think that is vcunat's
<gchristensen> that is annoying. it does target that machine specifically, so the metrics are more reliable
<gchristensen> I'll mail vcunat
<infinisil> So, after some looking around it seems that this is the mapping from channel to hydra job:
<gchristensen> one sec
<infinisil> Ohh
<infinisil> I was wrong on the 19.03 one apparently
<infinisil> That seems like a rather chaotic mapping
drakonis has joined #nixos-dev
<gchristensen> to me, it follows a well defined pattern
<gchristensen> (with nixpkgs-unstable being an outlier))
<infinisil> Everything follows a pattern if you make enough exceptions :)
<gchristensen> fair
<gchristensen> does one outlier make something chaotic?
<infinisil> But nixos-unstable is also out of line, why trunk-*combined*?
<infinisil> Why not just trunk like nixpkgs-unstable?
<gchristensen> ah hehe
<gchristensen> Back In Days Of Yore There Was Nixpkgs and NixOS And It Was Good
<gchristensen> and thus a nixpkgs/trunk and a nixos/trunk
<gchristensen> and then, it was decided to combine Nixpkgs and NixOS in to one, and thus: trunk-combined was produced
<infinisil> I see
<gchristensen> yeah, some cleaning up would be good
<infinisil> And it would make sense to do s/trunk/master, which corresponds with the release channels using the branch names too
<gchristensen> right
<gchristensen> however, I don't think hydra supports that sort of renaming, and so it hasn't been recreated so we don't lose the history
<gchristensen> not sure -- it would definitely be nice to do
<infinisil> Yeah, maybe it would work with some redirects?
<infinisil> Ah but history will be lost yeah
<gchristensen> some manual wizardry in hydra's postgres db could possibly do it, but it would be a transaction from hell
drakonis_ has joined #nixos-dev
drakonis_ has quit [Client Quit]
<infinisil> Can somebody give an opinion in #32005? @peti insists on closing it, even though it's still very much a problem
<samueldr> infinisil: the mapping is the last column http://howoldis.herokuapp.com/
<{^_^}> https://github.com/NixOS/nixpkgs/issues/32005 (by Infinisil, 1 year ago, closed): buildStackProject: fails downloading resolver
<infinisil> (samueldr: For 1 hour nobody talked, yet we decided to at the same second :))
<samueldr> catching up on the ol' backlog
<niksnut> gchristensen: yes, the only reason that trunk hasn't been renamed is that it takes postgres hours to do so
<niksnut> other than that, hydra supports renaming jobsets and will automatically redirect from the old to the new name
<gchristensen> aye :D
<gchristensen> ahh nice
<samueldr> does it need to be renamed? a new jobset could replace it leaving the old one untouched?
<niksnut> sure, but that breaks all history (like performance metrics)
<samueldr> tight, guessed there was something like that I didn't think of
<samueldr> right*
<niksnut> of course we could just shut down the queue runner / evaluator over the weekend to do the rename
<gchristensen> true
<samueldr> are there other operations that would require it to be shut down so they could be done at the same time?
<niksnut> not that I know
<gchristensen> depends if moving the DB off `chef` server is on the table ;)
<niksnut> could be
<matthewbauer> Can you add a library to an ELF's RUNPATH without adding the whole directory it's in?
<matthewbauer> I want to add /usr/lib/libGL.so without adding all of /usr/lib/
JosW has quit [Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/]
<gchristensen> I wonder if each wireguard peer should have its own systemd unit
<ivan> what for?
<andi-> What is the use case you have for multiple peers on a single wireguard interface? The experience I have with them is that if something is off it seems like the node is reachable (interface is up) but data will just be lost. I usually do one interface per peer..
<gchristensen> that is interesting, andi-, I have a `wg0` with several peers -- one for each machine in my "personal" network
<clever> matthewbauer: i think you can put absolute paths into DT_NEEDED
<clever> matthewbauer: and then it will just skip RUNPATH/RPATH entirely
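clever's suggestion, sketched as a small fixup derivation; the package and library path are only examples, but `--add-needed` is a real patchelf flag:

```nix
# Copies a binary and adds an absolute-path DT_NEEDED entry; the dynamic
# loader uses that path directly, skipping the RPATH/RUNPATH search, so
# only this one library is taken from /usr/lib rather than the whole dir.
with import <nixpkgs> { };
runCommand "hello-with-libGL" { nativeBuildInputs = [ patchelf ]; } ''
  install -m755 ${hello}/bin/hello $out
  patchelf --add-needed /usr/lib/libGL.so $out
''
```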
pie_ has joined #nixos-dev
<ekleog> did someone already try to use python's sys.meta_path to do static linking for python? I'm thinking it might make things much better for our python story, if we can get rid of propagatedBuildInputs / python.withPackages
<matthewbauer> clever: thanks!
<ekleog> (downside being it works only with python3.4+, so we won't be able to fully switch to it until we get rid of python2… which might just be “never” :( )
<ekleog> oh wait there's a python2 version too, even though the python3 doc states it came in with python3.4
<ekleog> -> am I missing something obvious before I add that to my “to try someday” idea list?
<samueldr> wouldn't it make sense to have two different infra, a "legacy" one for python2, and the python3 living one?
<samueldr> I mean, if it has clear advantages
<ekleog> once python2 no longer is our default python we will likely be able to, though ideally we wouldn't need that
<gchristensen> do we have a way to escape systemd unit names? I did a proof of concept of making each wireguard peer its own service: https://github.com/grahamc/nixpkgs/commit/4470e9cc08d8e3534341eda7304fb3ebc997b0e5
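The helper they go on to find is `escapeSystemdPath` in `nixos/lib/utils.nix`, available to NixOS modules through the `utils` module argument; a sketch with a hypothetical peer name:

```nix
{ config, lib, utils, ... }:
{
  # escapeSystemdPath turns a path like "/dev/mapper/foo" into a valid
  # unit-name component following systemd's escaping rules, so arbitrary
  # peer names can be embedded in per-peer service names.
  systemd.services."wireguard-wg0-peer-${utils.escapeSystemdPath "laptop"}" = {
    description = "WireGuard peer (sketch)";
    # ...
  };
}
```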
drakonis has quit [Ping timeout: 252 seconds]
drakonis has joined #nixos-dev
<gchristensen> oh awesome -- I was grepping lib/ not nixos/lib
<Profpatsch> ah yeah, good old utils module
<infinisil> Should just move that to lib probably
pie_ has quit [Read error: Connection reset by peer]
pie_ has joined #nixos-dev
orivej has joined #nixos-dev