sphalerite changed the topic of #nixos-dev to: NixOS Development (#nixos for questions) | NixOS 19.03 released! https://discourse.nixos.org/t/nixos-19-03-release/2652 | https://hydra.nixos.org/jobset/nixos/trunk-combined https://channels.nix.gsc.io/graph.html https://r13y.com | 19.03 RMs: samueldr,sphalerite | https://logs.nix.samueldr.com/nixos-dev
<clever> andi-: nixops will also do send-keys for you, after deploy, start, and reboot
<clever> the main use for manual send-keys is when nixops isn't aware of a remote reboot and loss of key material
<gchristensen> andi-: I do
<andi-> gchristensen, clever: mostly interested in your experience with it.. I am looking at it in the context of the systemd bump and dropping a patch that causes a bit of pain for the generic systemd services. I built a libvirt test environment and whenever I want to unlock the disk it takes >30s because systemd_pam runs into a udev timeout. This happens to me on 19.03 and unstable..
<andi-> Is that also the case on your machines?
<andi-> s/udev/dbus/
<gchristensen> ouch
<gchristensen> I can't think of anything like that happening, which makes me think it doesn't happen
<andi-> is basically my configuration
<andi-> that issues also gives some context of the impact of our nixops-enabling systemd patch.. Many related PRs and issues exist.
<gchristensen> what is the hack supporting in nixops? I've not used send-keys in such a way that it had much of anything to do with systemd I think
<gchristensen> early disk decrypting I guess
<andi-> not really early but decrypting during boot (after SSHd has started)
<clever> andi-: i'm not doing anything with systemd/nixops, and no luks, but my nixops is hanging at deploy
<clever> [root@nas:~]# systemctl list-jobs
<clever> 1893709 sys-subsystem-net-devices-enp3s0.device start running
<clever> this appears to be the cause
<andi-> that is probably a completely different issue
<clever> May 27 03:18:45 nas systemd[1]: sys-subsystem-net-devices-enp3s0.device: Job sys-subsystem-net-devices-enp3s0.device/start failed with result 'timeout'.
<clever> yeah
<gchristensen> andi-: ah ... I don't do that
<gchristensen> interesting.
<andi-> the note was removed but the defect is still there :/
<gchristensen> oooOOooo
<andi-> It disappears if you do a `nixos-rebuild switch` but not if you boot the machine.
<gchristensen> ack.
<andi-> Now I am at a point - after a few weeks of dealing with things like these - where I would go for breaking `nixops send-keys` and requiring the `_netdev` option to be set. It fixes issues for everyone (racy ones) and only requires a few additional lines for a few people.
<gchristensen> okay I'm not against that
<gchristensen> what is the breakage users will see?
<andi-> that is the hard part... I had a bit of a heated debate about it with flokli an hour ago.. It breaks local-fs.target if _netdev isn't set for those mount points. I am not for that breakage but he convinced me that at some point we might have to do it.. There is no way we can traverse all the layers of indirection to automagically add the option :/
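For context, the `_netdev` fix being debated amounts to one extra mount option on the affected filesystem. A minimal NixOS sketch (device name and mount point are hypothetical):

```nix
{
  # Hypothetical LUKS-backed filesystem unlocked via `nixops send-keys`.
  # "_netdev" makes systemd order the mount after the network and pull it
  # in via remote-fs.target instead of local-fs.target, so boot does not
  # hang waiting for a device that only appears once keys are sent.
  fileSystems."/data" = {
    device = "/dev/mapper/crypted-data";
    fsType = "ext4";
    options = [ "_netdev" ];
  };
}
```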
<gchristensen> right
<andi-> We could probably add safeguards to nixops to warn if no FS with the option was found etc.. Still not perfect.
<gchristensen> that would warn almost every user, right?
<andi-> probably
<gchristensen> oof
<andi-> every user that uses autoluks that is
<gchristensen> is there any way we can start warning users now?
<gchristensen> warning in such a way that we know 100% they're using this feature
<samueldr> and backport said runtime warning to at least current stable
<andi-> I'd combine the nixpkgs change with a change to nixops that fails (unless opted out?) when we see autoLuks being used without _netdev. Also a big fat warning in the release notes of 19.09... In general I'd like to find a better way to handle this. I'll spend most of the night/tomorrow reading through systemd docs one more time..
<gchristensen> I think having an extremely well targeted alert would be a great start
<andi-> A way to mark any .mount unit as "_netdev" by only knowing the .device unit is something we need.
<andi-> Yes, definitely.
<gchristensen> though: what happens if they miss it?
<gchristensen> what is the risk
<andi-> Rollback, open an issue, pop up on IRC (screaming at us) and we point them to the alert we tried to get out?
<flokli> Well, rolling back needs to be done manually. They might not be able to ssh to machines, as sshd is waiting for local-fs.target
<andi-> it does fail quickly (a few minutes)
<andi-> but then it is stuck in a rescue shell
<flokli> So no ssh
<flokli> I also thought about having some deployment option that needs to be set, and cancelling deployment otherwise. But UX is pretty bad
<gchristensen> that would be very ugly on AWS where there is no console
<andi-> is what you end up in
<flokli> The longer-run solution would be to move the autoluks part to crypttab and add somewhat transitive propagation of the _netdev attribute from the crypt device to mountpoints natively into systemd... But that's a very long stretch
<andi-> (year+)
<gchristensen> I feel very uncomfortable with the possibility of losing data
<gchristensen> hmm I guess it wouldn't be lost, but the recovery would be extremely annoying
<flokli> Well, you manually need to boot an older generation, fix your nixops config, then redeploy
<gchristensen> right, but that isn't really possible on AWS / other hosts without console access
<andi-> (once you figured what went wrong)
<flokli> So maybe enforcing some new deployment option to be set, to kinda ensure the nixops user read the changelog and updated his fstab config?
<flokli> It's ugly, but better than the possible breakage?
<andi-> how does nixpkgs figure out if nixops is being used? Checking for `deployment` configuration(s)?
<Shados> gchristensen: There's literally always a recovery approach possible. If you've got no console, typically you do still have the ability to boot from an ISO, in which case you can just edit the grub menu from there to rollback.
<Shados> But it would be a pain for someone, somewhere, no doubt.
<gchristensen> on AWS I believe the root device would need to be snapshotted, mounted on another machine, edited, snapshotted, and booted from -- destroying the original
<flokli> The thing is, detecting whether all mountpoints have that option set or not is (almost) impossible to 'inspect' from nix, as it can be arbitrarily nested
<andi-> you can check if any have it set.. any other assumptions are easily false (iscsi multipath, raid1, …)
<andi-> making an `autoLuks.foo.mountPoint = "/foo";` mandatory would work around the issue
<andi-> the user must then ensure not to screw it up or provide the wrong ones.. Not very nice but not the worst.
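A sketch of what the proposed option might look like; only `mountPoint` is the actual proposal from the discussion, and the surrounding `autoLuks` attribute names are hypothetical, not verified against the nixops source:

```nix
{
  # Hypothetical shape: making mountPoint mandatory on the autoLuks
  # device would let nixops inject "_netdev" into the matching
  # fileSystems entry by itself, for the trivial (non-nested) cases.
  deployment.autoLuks.foo = {
    device = "/dev/xvdf";   # placeholder device
    mountPoint = "/foo";    # the proposed new, mandatory attribute
  };
}
```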
<gchristensen> this sounds promising
<andi-> I am thinking about workloads where it is not a filesystem that is being used on top of that encrypted device... database logs (oracle did raw disks), ceph, … Would they be affected as well? Probably not.
<gchristensen> I care about these less, as the diagnostics and recovery is much simpler
<gchristensen> (I mean, I care about them plenty -- but am less concerned)
<andi-> At least not at first.. Maybe we figure something out later. Keeping systems bootable is the most important bit.
<gchristensen> +1
<andi-> Guess I'll be working on another NixOps PR then.. introducing the _netdev option when the mountPoint is set and not explicitly set to some "I know what I am doing" value..
<Shados> andi-: flokli: What is it that prevents adding or checking _netdev on the mount points from nixops?
<andi-> Shados: For the trivial cases (filesystem directly on top of the luks device) we could probably do that. For the more interesting cases (lvm on luks, raid on luks, …) we cannot follow the indirections that might be resolvable during runtime.
<andi-> e.g. you might be creating a raid1 device from multiple encrypted volumes. Knowing which part/disk/… uuid the device might have combined with another device isn't really something we usually know.
<Shados> I see what you mean
<andi-> Generally speaking it wouldn't work for any nixos-generated config since those use uuids and not device paths.
Synthetica has quit [Quit: Connection closed for inactivity]
disasm| has quit [Quit: WeeChat 2.0]
drakonis has joined #nixos-dev
disasm has joined #nixos-dev
<Shados> Is there a standard way of getting a Nix list of arbitrary strings (e.g. including ones containing spaces) into a bash array? I am guessing not, given the etc-builder doesn't do this, but confirmation would be nice
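One approach, not confirmed in the log but based on `lib.escapeShellArgs`, which has been in nixpkgs for a long time: escape every element, then splice the result into a bash array literal inside the builder script. The list contents here are just examples:

```nix
# Sketch: xs is an arbitrary Nix list of strings, including ones with
# spaces and quotes; escapeShellArgs single-quotes each element, so bash
# word-splitting reconstructs the original list as array elements.
let
  lib = import <nixpkgs/lib>;
  xs = [ "plain" "with space" "single'quote" ];
in ''
  arr=(${lib.escapeShellArgs xs})
  echo "''${#arr[@]}"   # prints 3
''
```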
<Shados> And on a related note: I can't find the list of characters that are illegal in store paths in the Nix manual; is it in there?
ryantm has joined #nixos-dev
<srhb> Shados: https://nixos.org/~eelco/pubs/phd-thesis.pdf Search for abcdf
<srhb> (Would be nice to stick in the manual I suppose)
<Shados> srhb: Had to go a few pages down for the name portion of the store path, but thanks. For anyone interested: alphanumerics and `+-._?=` are valid.
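That character set is easy to check with `builtins.match`; the regex and test names below are just for illustration of the set Shados quotes:

```nix
# Validates a store-path *name* component against alphanumerics
# plus the characters + - . _ ? =  (the "-" is last so it is literal
# inside the POSIX bracket expression).
let
  validName = s: builtins.match "[A-Za-z0-9+._?=-]+" s != null;
in {
  ok  = validName "hello-2.10";   # true
  bad = validName "has space";    # false
}
```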
drakonis has quit [Quit: WeeChat 2.4]
teto has quit [Quit: WeeChat 2.4]
orivej has quit [Ping timeout: 245 seconds]
pie_ has quit [Ping timeout: 258 seconds]
Synthetica has joined #nixos-dev
pie_ has joined #nixos-dev
pie___ has joined #nixos-dev
pie_ has quit [Ping timeout: 248 seconds]
orivej has joined #nixos-dev
<arianvp> I was talking to the sifive guys 2 years ago at fosdem and if we ask nicely they might be able to provide you a machine for NixOS support
<arianvp> They seemed open to the idea
<arianvp> They're interested in getting more distros running on their platform
<genesis> noticed.
Cale has quit [Ping timeout: 264 seconds]
Cale has joined #nixos-dev
init_6 has joined #nixos-dev
copumpkin has joined #nixos-dev
orivej has quit [Ping timeout: 245 seconds]
init_6 has quit []
Enzime has quit [Ping timeout: 245 seconds]
Jackneill has quit [Read error: Connection reset by peer]
Jackneill has joined #nixos-dev
orivej has joined #nixos-dev
pie___ has quit [Ping timeout: 248 seconds]
<infinisil> ,stuck
<infinisil> nixpkgs-unstable seems to be stuck
<andi-> succeeded ~50min ago: https://hydra.nixos.org/build/94117178
<samueldr> still needs to be fully built
<samueldr> looks like there's exactly one job in queue
<infinisil> andi-: it also succeeded 5 days ago and 2 days ago: https://hydra.nixos.org/job/nixos/trunk-combined/tested/all
<infinisil> Yet no channel update. Or does the "tested" job mean something else?
<samueldr> the channel upgrade script uses `latest-finished` (Latest successful build from a finished evaluation) to update
<samueldr> this can be found in the Links tab of the job https://hydra.nixos.org/job/nixos/trunk-combined/tested#tabs-links
<samueldr> here it looks like that something fishy is going on
sir_guy_carleton has joined #nixos-dev
<samueldr> oh right, nixpkgs* it's not using tested
<samueldr> infinisil: it's that job that matters for nixpkgs channels https://hydra.nixos.org/job/nixpkgs/trunk/unstable#tabs-constituents
<infinisil> Ah
<infinisil> So probably the metrics job was the cause
<infinisil> But no idea why, i can't get to the logs of the last failed metrics build
<samueldr> >> builder for '/nix/store/b7n7axj5yqqnxsx152lv32030b8hx6ab-nixpkgs-metrics.drv' failed with exit code 137
<samueldr> >> builder for '/nix/store/423lmbdhzzc430p4ks7zgn7q47dqjfx5-nixpkgs-metrics.drv' failed with exit code 137
<samueldr> OOM
<infinisil> Ahh
<infinisil> The usual then..
<samueldr> :/ it succeeded on the same machine it failed, so uh
<infinisil> Why does it need so much memory anyways?
<gchristensen> it evaluates all of nixpkgs
<infinisil> And Nix can't throw out part of the result because it might be needed later?
<gchristensen> nixpkgs is a highly connected graph
<gchristensen> and .override() and .overrideAttrs means keeping copies of iirc the entire pkgs set around all the time
<infinisil> Okay but why not evaluate one attribute after the next?
<infinisil> Or 10 at a time
<gchristensen> I don't understand the question
orivej has quit [Ping timeout: 258 seconds]
JosW has joined #nixos-dev
<infinisil> gchristensen: I assumed "evaluating all of nixpkgs" meant evaluating all attributes, is that not it?
<gchristensen> sure, nix does evaluate one at a time
<gchristensen> so one thought is to have a thunk and its evaluated state be interchangable, so if memory needs freeing it can "unevaluate" that thunk
<infinisil> Hmm yeah
<infinisil> gchristensen: How much memory do the build machines failing with OOM have?
<gchristensen> which ones are failing with OOM?
<infinisil> The metrics one in this case
<infinisil> E.g. ^^
<infinisil> I see ~8GB at the bottom, so probably that's it
<samueldr> gchristensen: t2a
<samueldr> IIRC it's been targeting t2a specially
<gchristensen> hmm I think that is vcunat's
<gchristensen> that is annoying. it does target that machine specifically, so the metrics are more reliable
<gchristensen> I'll mail vcunat
<infinisil> So, after some looking around it seems that this is the mapping from channel to hydra job:
<gchristensen> one sec
<infinisil> Ohh
<infinisil> I was wrong on the 19.03 one apparently
<infinisil> That seems like a rather chaotic mapping
drakonis has joined #nixos-dev
<gchristensen> to me, it follows a well defined pattern
<gchristensen> (with nixpkgs-unstable being an outlier))
<infinisil> Everything follows a pattern if you make enough exceptions :)
<gchristensen> fair
<gchristensen> does one outlier make something chaotic?
<infinisil> But nixos-unstable is also out of line, why trunk-*combined*?
<infinisil> Why not just trunk like nixpkgs-unstable?
<gchristensen> ah hehe
<gchristensen> Back In Days Of Yore There Was Nixpkgs and NixOS And It Was Good
<gchristensen> and thus a nixpkgs/trunk and a nixos/trunk
<gchristensen> and then, it was decided to combine Nixpkgs and NixOS in to one, and thus: trunk-combined was produced
<infinisil> I see
<gchristensen> yeah, some cleaning up would be good
<infinisil> And it would make sense to do s/trunk/master, which corresponds with the release channels using the branch names too
<gchristensen> right
<gchristensen> however, I don't think hydra supports that sort of renaming, and so it hasn't been recreated so we don't lose the history
<gchristensen> not sure -- it would definitely be nice to do
<infinisil> Yeah, maybe it would work with some redirects?
<infinisil> Ah but history will be lost yeah
<gchristensen> some manual wizardry in hydra's postgres db could possibly do it, but it would be a transaction from hell
drakonis_ has joined #nixos-dev
drakonis_ has quit [Client Quit]
<infinisil> Can somebody give an opinion in #32005? @peti insists on closing it, even though it's still very much a problem
<samueldr> infinisil: the mapping is the last column http://howoldis.herokuapp.com/
<{^_^}> https://github.com/NixOS/nixpkgs/issues/32005 (by Infinisil, 1 year ago, closed): buildStackProject: fails downloading resolver
<infinisil> (samueldr: For 1 hour nobody talked, yet we decided to at the same second :))
<samueldr> catching up on the ol' backlog
<niksnut> gchristensen: yes, the only reason that trunk hasn't been renamed is that it takes postgres hours to do so
<niksnut> other than that, hydra supports renaming jobsets and will automatically redirect from the old to the new name
<gchristensen> aye :D
<gchristensen> ahh nice
<samueldr> does it need to be renamed? a new jobset could replace it leaving the old one untouched?
<niksnut> sure, but that breaks all history (like performance metrics)
<samueldr> tight, guessed there was something like that I didn't think of
<samueldr> right*
<niksnut> of course we could just shut down the queue runner / evaluator over the weekend to do the rename
<gchristensen> true
<samueldr> are there other operations that would require it to be shut down so they could be done at the same time?
<niksnut> not that I know
<gchristensen> depends if moving the DB off `chef` server is on the table ;)
<niksnut> could be
<matthewbauer> Can you add a library to an ELF's RUNPATH without adding the whole directory it's in?
<matthewbauer> I want to add /usr/lib/libGL.so without adding all of /usr/lib/
JosW has quit [Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/]
<gchristensen> I wonder if each wireguard peer should have its own systemd unit
<ivan> what for?
<andi-> What is the use case you have for multiple peers on a single wireguard interface? The experience I have with them is that if something is off it seems like the node is reachable (interface is up) but data will just be lost. I usually do one interface per peer..
<gchristensen> that is interesting, andi-, I have a `wg0` with several peers -- one for each machine in my "personal" network
<clever> matthewbauer: i think you can put absolute paths into DT_NEEDED
<clever> matthewbauer: and then it will just skip RUNPATH/RPATH entirely
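clever's suggestion, sketched as a small fixup derivation; the package and library path are only examples, but `--add-needed` is a real patchelf flag:

```nix
# Copies a binary and adds an absolute-path DT_NEEDED entry; the dynamic
# loader uses that path directly, skipping the RPATH/RUNPATH search, so
# only this one library is taken from /usr/lib rather than the whole dir.
with import <nixpkgs> { };
runCommand "hello-with-libGL" { nativeBuildInputs = [ patchelf ]; } ''
  install -m755 ${hello}/bin/hello $out
  patchelf --add-needed /usr/lib/libGL.so $out
''
```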
pie_ has joined #nixos-dev
<ekleog> did someone already try to use python's sys.meta_path to do static linking for python? I'm thinking it might make things much better for our python story, if we can get rid of propagatedBuildInputs / python.withPackages
<matthewbauer> clever: thanks!
<ekleog> (downside being it works only with python3.4+, so we won't be able to fully switch to it until we get rid of python2… which might just be “never” :( )
<ekleog> oh wait there's a python2 version too, even though the python3 doc states it came in with python3.4
<ekleog> -> am I missing something obvious before I add that to my “to try someday” idea list?
<samueldr> wouldn't it make sense to have two different infra, a "legacy" one for python2, and the python3 living one?
<samueldr> I mean, if it has clear advantages
<ekleog> once python2 no longer is our default python we will likely be able to, though ideally we wouldn't need that
<gchristensen> do we have a way to escape systemd unit names? I did a proof of concept of making each wireguard peer its own service: https://github.com/grahamc/nixpkgs/commit/4470e9cc08d8e3534341eda7304fb3ebc997b0e5
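The helper they go on to find is `escapeSystemdPath` in `nixos/lib/utils.nix`, available to NixOS modules through the `utils` module argument; a sketch with a hypothetical peer name:

```nix
{ config, lib, utils, ... }:
{
  # escapeSystemdPath turns a path like "/dev/mapper/foo" into a valid
  # unit-name component following systemd's escaping rules, so arbitrary
  # peer names can be embedded in per-peer service names.
  systemd.services."wireguard-wg0-peer-${utils.escapeSystemdPath "laptop"}" = {
    description = "WireGuard peer (sketch)";
    # ...
  };
}
```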
drakonis has quit [Ping timeout: 252 seconds]
drakonis has joined #nixos-dev
<gchristensen> oh awesome -- I was grepping lib/ not nixos/lib
<Profpatsch> ah yeah, good old utils module
<infinisil> Should just move that to lib probably
pie_ has quit [Read error: Connection reset by peer]
pie_ has joined #nixos-dev
orivej has joined #nixos-dev