gchristensen changed the topic of #nixos-dev to: NixOS Development (#nixos for questions) | https://hydra.nixos.org/jobset/nixos/trunk-combined https://channels.nix.gsc.io/graph.html | 18.09 release managers: vcunat and samueldr | https://logs.nix.samueldr.com/nixos-dev
sir_guy_carleton has joined #nixos-dev
drakonis has quit [Read error: Connection reset by peer]
eadwu has joined #nixos-dev
drakonis has joined #nixos-dev
sir_guy_carleton has quit [Quit: WeeChat 2.2]
orivej has joined #nixos-dev
drakonis has quit [Read error: Connection reset by peer]
<worldofpeace> Is the borg ok? I haven't seen build results for anything I've issued today.
<gchristensen> hmm
<worldofpeace> I think others might have noticed this too since I've spotted someone editing their comment to see if it would do the thing
<gchristensen> did that help?
<worldofpeace> Well it's been 3 hours since then
<worldofpeace> *its
<gchristensen> did you just get replies?
<worldofpeace> looking around to see
<worldofpeace> Yes
<gchristensen> good
<worldofpeace> What kind of mystery was this :D
<gchristensen> there was a hiccup in some of the pieces, and a few of the components stopped receiving messages -- but the work was being done
<worldofpeace> praise to you again gchristensen :)
<gchristensen> well... it'd be better if I fixed that bug than just restarted the pieces :P but thanks
copumpkin has joined #nixos-dev
Lingjian has joined #nixos-dev
eadwu has quit [Ping timeout: 252 seconds]
eadwu has joined #nixos-dev
sir_guy_carleton has joined #nixos-dev
<jtojnar> this is funny
<{^_^}> #53520 (by rvolosatovs, 1 day ago, open): kitty: 0.13.1 -> 0.13.2
init_6 has joined #nixos-dev
<jtojnar> borg is now succeeding in negative time
Lingjian has quit [Ping timeout: 252 seconds]
orivej has quit [Ping timeout: 258 seconds]
orivej has joined #nixos-dev
<worldofpeace> I'm convinced the borg is a time traveller :P
<gchristensen> it is true
<gchristensen> the borg's engines tear holes in space-time.
lassulus_ has joined #nixos-dev
lassulus has quit [Ping timeout: 246 seconds]
lassulus_ is now known as lassulus
jbarthelmes has joined #nixos-dev
<dtz> someone working on readline 8/bash 5? :D
lopsided98 has quit [Quit: Disconnected]
eadwu has quit [Ping timeout: 268 seconds]
lopsided98 has joined #nixos-dev
copumpkin has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
copumpkin has joined #nixos-dev
jtojnar has quit [Ping timeout: 272 seconds]
orivej has quit [Ping timeout: 258 seconds]
init_6 has quit [Ping timeout: 246 seconds]
orivej has joined #nixos-dev
sir_guy_carleton has quit [Quit: WeeChat 2.2]
pie_ has joined #nixos-dev
phreedom has quit [Ping timeout: 256 seconds]
sorear has quit [Read error: Connection reset by peer]
worldofpeace has quit [Ping timeout: 246 seconds]
orivej has quit [Ping timeout: 245 seconds]
orivej has joined #nixos-dev
jtojnar has joined #nixos-dev
init_6 has joined #nixos-dev
phreedom has joined #nixos-dev
eadwu has joined #nixos-dev
eadwu has quit [Ping timeout: 268 seconds]
sorear has joined #nixos-dev
__Sander__ has joined #nixos-dev
init_6 has quit [Ping timeout: 258 seconds]
init_6 has joined #nixos-dev
init_6 has quit [Ping timeout: 272 seconds]
<infinisil> Ohh, yeah actually, we should get rid of this static uid/gid mapping convention ASAP: https://github.com/NixOS/nixpkgs/blob/2c9de98/nixos/modules/misc/ids.nix#L340-L342
<gchristensen> why 339?
<gchristensen> 399*
<infinisil> Probably there starts some reserved range, *goes to look*
<infinisil> Ahh
<infinisil> So I guess in theory we could probably still use 500-999 for static ids..
<infinisil> Oh um
<infinisil> "The majority of modern Unix-like systems (e.g., Solaris-2.0 in 1990, Linux 2.4 in 2001) have switched to 32-bit UIDs, allowing 4,294,967,296 (2^32) unique IDs."
<infinisil> I didn't know that
hedning has left #nixos-dev [#nixos-dev]
<fadenb> Things I did not expect: Reading Solaris 2 and modern in the same sentence ;)
worldofpeace has joined #nixos-dev
orivej has quit [Read error: Connection reset by peer]
orivej has joined #nixos-dev
__Sander__ has quit [Quit: Konversation terminated!]
fpletz has joined #nixos-dev
pie_ has quit [Remote host closed the connection]
pie_ has joined #nixos-dev
emily has joined #nixos-dev
drakonis has joined #nixos-dev
jbarthelmes has quit [Ping timeout: 258 seconds]
<globin> most of those aren't necessary anyway
<globin> only services with a lot of data should be in there
<globin> and even that is not really necessary since having the user-group perl script "persisting" uids/gids
<samueldr> maybe what's needed is a canonical way in nixos to share that database between machines?
<samueldr> (though now we're left with another issue, what if you have machines A and B and B and C that needs to share the same IDs, but don't really want A and C to be mixed :))
<samueldr> so, not really ideal
<infinisil> globin: Exactly, that's what I've been thinking too
<infinisil> I'll probably soon make a PR to get rid of the static mapping to discuss this further
<infinisil> So far nobody has brought up a convincing reason to keep the static mapping now that /var/lib/nixos/{g,u}id-map exists
<ekleog> the static mapping is here to handle properly backup/restore, is it not?
<ekleog> so that it's possible to backup on a machine and restore on another without having to fixup all uid/gids
<gchristensen> if you restore the /var/lib/nixos it'll restore properly
<infinisil> ekleog: As long as you backup /var/lib/nixos along with the rest of /var/lib/* it won't be a problem
<ekleog> ie. while DynamicUser is an appropriate replacement for the static mapping, I'm not convinced the perl script is
<ekleog> it will after a rebuild only, right?
<ekleog> meaning that the restore must be done before the first rebuild
<ekleog> which can be problematic when deploying machines directly with their configuration
<infinisil> Yeah
<infinisil> But you should always first restore data, then the config
<ekleog> well, that means first boot with a null-config, then restore data, then switch to a full-config
<ekleog> which means having to actually write a null-config
<ekleog> (with enough stuff to be able to do the restore stage but not too much to not have everything break down when the u/gids will change during the restore)
<infinisil> Hmm, that's how I'm always doing my restores
<infinisil> ekleog: It would be weird if the services were running without the correct data
<infinisil> Some are stateful and need the correct setup
<infinisil> e.g. acme
<infinisil> I think it doesn't make much sense to run the services without the restored data
<ekleog> until now (so only one restore) I've survived by just deploying the same configuration, then `systemctl stop`'ing all services, then restoring and rebooting
<infinisil> And you'd probably run into a couple other problems anyways. At least I did once when I restored the config before the data (can't remember the problem though)
<infinisil> But I mean, even if you restore the data after a config change, the worst thing that'll happen is that services can't read their data initially. After the config change, everything will be fine
<ekleog> hmm, I'm maybe over-cautious, just somehow feel bad about that /etc/users-generating perl script :)
* ekleog actually hands out explicit IDs to his users
<infinisil> I mean, it's worked well for a while now. I don't like perl either, but don't fix what ain't broken :P
<ekleog> it's not really a perl issue, more an impurity issue
<infinisil> Hmm, there's really no nice way to make this pure afaik
<ekleog> yup, that's why I'm trying not to use it by having all users have explicit IDs
<ekleog> it's by-design impure so I feel like it can only go wrong someplace, but now I'm unwillingly moving goal posts so feel free to ignore, it's much more of a feeling than actual data
<infinisil> Oh, crazy idea
<gchristensen> I also don't like dynamic IDs, ekleog.
<gchristensen> DynamicUsers is a different thing, but I feel for NixOS, my very strong preference is for uids to be statically assigned
<infinisil> Oh, no crazy idea won't work. I thought of cramming all possible usernames into the 2^32 range via some deterministic mapping. But username can have 32 chars afaik, which won't fit into that range by far
<gchristensen> lol
<ekleog> infinisil: yup, at some point I was considering writing a hash function in nix for exactly this purpose, for VM's IP addresses :D
<ekleog> but then hash collision resolution is the issue
<infinisil> Yeah and you can't just fix it like with hash tables..
<ekleog> one solution would be to just refuse to evaluate in case of collision
<ekleog> and hope usual stuff won't collide
<ekleog> but that's not really backwards-compatible :D
<infinisil> How about this: Since most usernames only have a couple chars, cram all those into a deterministic mapping (limited by how many letters we can fit into 2^32)
<infinisil> Like, start with 1 chars, then 2, ...
<infinisil> I think this might even work
<ekleog> one lowercase letter is 4.7 bits -> 2^32 would make ~6 chars
<infinisil> Hmm yeah a bit low probably
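The "start with 1 chars, then 2, ..." idea can be checked with a small Python sketch. It assumes usernames use only the 26 lowercase letters (real usernames also allow digits, hyphens and underscores, which shrinks the usable length further) and ignores reserved uid ranges. The function names are made up for illustration.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase letters only
ID_SPACE = 2 ** 32                       # 32-bit uid range


def max_length_that_fits(id_space=ID_SPACE, alphabet_size=len(ALPHABET)):
    """Largest k such that every name of length 1..k gets its own id."""
    total, k = 0, 0
    while total + alphabet_size ** (k + 1) <= id_space:
        k += 1
        total += alphabet_size ** k
    return k


def name_to_id(name):
    """Dense, order-preserving index: a=0, ..., z=25, aa=26, ab=27, ..."""
    if not name or any(c not in ALPHABET for c in name):
        raise ValueError("only non-empty lowercase a-z names are supported")
    n = 0
    for c in name:
        # Bijective base-26: each letter contributes a digit in 1..26.
        n = n * len(ALPHABET) + (ALPHABET.index(c) + 1)
    return n - 1


print(max_length_that_fits())  # 6
print(name_to_id("aa"))        # 26
```

All names of up to six letters fit, seven-letter names do not, which matches the "~6 chars" estimate.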
<ekleog> I'd have more hope for the hash function than for the deterministic mapping, as it'd handle better collisions… like, you'd need 2^16 usernames to have 1/2 of collision
<infinisil> Hmm yeah that's a good point
<ekleog> but then when it actually collides in a configuration that currently works someone's going to hate us
<infinisil> Does all software work with ids >65536 ids btw?
<infinisil> s/ids//
<ekleog> most likely not, indeed
<gchristensen> probably not
<ekleog> hmm and cramming into 65536 ids would make collisions have 1/2 chance at only 256 usernames, so a bit low
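The collision estimates above follow the usual birthday-bound rule of thumb (about sqrt(N) names for an id space of size N). A rough Python sketch, using the standard approximation 1 - exp(-n(n-1)/(2N)) and assuming the hash spreads names uniformly; function names are illustrative only:

```python
import math


def collision_probability(n_names, id_space):
    """Approximate chance that at least two of n_names share an id."""
    return 1.0 - math.exp(-n_names * (n_names - 1) / (2.0 * id_space))


def names_for_half_chance(id_space):
    """Smallest n for which the approximate collision chance reaches 1/2."""
    n = 1
    while collision_probability(n, id_space) < 0.5:
        n += 1
    return n


for space in (2 ** 16, 2 ** 32):
    print(space, names_for_half_chance(space),
          round(collision_probability(int(math.sqrt(space)), space), 2))
```

Under this approximation, sqrt(N) names give a collision chance of roughly 0.39 rather than exactly 1/2, and the 1/2 point sits at about 1.18 * sqrt(N), so the figures quoted in the discussion are in the right order of magnitude.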
<infinisil> Also we can't use most of that range
<infinisil> I think
<ekleog> actually maybe cramming it deterministically in 65536 (or whatever we can use), and bailing out on collision by asking the user to assign an explicit ID could do it? I kind of wonder how bad the churn would be
<ekleog> (and the considered hash function would be the hash function with the overrides provided by the user with .uid = 1234;)
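A minimal Python sketch of that proposal: hash each username into a fixed range, let explicit overrides win, and refuse to continue on a collision. This is not how NixOS assigns ids; the range bounds `lo`/`hi`, the choice of SHA-256 and the function name are all assumptions made for illustration.

```python
import hashlib


def assign_uids(usernames, overrides=None, lo=1000, hi=65536):
    """Map each username to a uid in [lo, hi), deterministically.

    overrides plays the role of an explicit `.uid = 1234;` setting.
    Raises ValueError on any collision instead of silently picking another id.
    """
    overrides = dict(overrides or {})
    owner = {}     # uid -> username that already holds it
    assigned = {}  # username -> uid
    for name in sorted(set(usernames)):
        if name in overrides:
            uid = overrides[name]
        else:
            digest = hashlib.sha256(name.encode("utf-8")).digest()
            uid = lo + int.from_bytes(digest[:8], "big") % (hi - lo)
        if uid in owner:
            raise ValueError(
                f"uid {uid} is wanted by both {owner[uid]!r} and {name!r}; "
                "assign an explicit uid to one of them"
            )
        owner[uid] = name
        assigned[name] = uid
    return assigned
```

The result depends only on the set of names and overrides, not on evaluation order, which is the property the static mapping provides today. The cost is the one raised in the discussion: a configuration that used to work can stop evaluating when a new user happens to collide.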
<infinisil> It's very bad UX
<infinisil> And there's bound to be certain users having usernames that collide, especially as the userbase grows
<infinisil> Oh, I guess you mean only for services?
<ekleog> depends on how frequent these collisions are… like, on my system I've got 51 users
<ekleog> (single-user laptop system, though)
<cransom> time to create the worlds largest ldap server.
<ekleog> actually upon more thought… all this explodes in the air with ldap users and the like
<ekleog> huh inb4'd
<ekleog> but the perl script explodes in the air too, I guess
<infinisil> Yeah, I think a static mapping just doesn't work, and it's very ugly to have it in nixpkgs anyways imo
<gchristensen> let's put ldap on the blockchain
<ekleog> gchristensen: namecoin
<ekleog> (yes, it does exist)
<gchristensen> yeah but not as a lightweight directory access protocol!
<ekleog> legacy doesn't really help us by limiting to 2^16 users…
<ekleog> one solution would be to drop legacy support and run everything in a DynamicUser-like
<gchristensen> not everything is compatible with DynamicUser
<ekleog> I kind-of wonder
<ekleog> (part of the incompatibilities was due to nscd caching failing which should be fixed now, maybe there are others?)
<ekleog> but it'd be painfully slow to search through the whole filesystem to fixup permissions anyway, so it couldn't really replace all uid/gids
<ekleog> only for services
<infinisil> Okay so all static ids doesn't work, and all dynamic ids doesn't work either, so we need something inbetween, and that's what /var/lib/nixos/{g,u}id-map does nicely
<ekleog> :'(
<gchristensen> something something arguing for state something nixos
<infinisil> One slight actual problem with this mapping is that when you restore from backups, but forget to do it for /var/lib/nixos, all services may break
<infinisil> This is fixable by having all services check whether the permissions are correct for their directories and fix them if not
<infinisil> Which might take a while for big directories, but since this happens very rarely, it should be fine
<gchristensen> my preference would be everything use DynamicUser, and those that can't, have statically assigned IDs
<infinisil> Yeah, DynamicUser always if possible, I'm encouraging it in my reviews
<infinisil> Wait, how does dynamic user even work with stateful directories? Does it chmod the dir upon start?
<gchristensen> systemd magic
<infinisil> Oh, some mount namespace magic maybe
<infinisil> A problem with the "every service checks for permissions at the start and chmods if they're wrong" is that we can't just easily change every service to do this
<fpletz> I'm all for DynamicUser but are people still complaining about compatibility with state directories mounted over nfs?
<fpletz> infinisil: yeah, or user namespace magic :)
<infinisil> fpletz: Doesn't nfs support some id translation thing?
<fpletz> infinisil: I don't use nfs so I don't really know, but if uids are dynamic a static translation is probably bound to fail
<fpletz> regarding state directories in systemd service with dynamic users, here are some details: http://0pointer.net/blog/dynamic-users-with-systemd.html
<fpletz> more than in the manpages :)
<fpletz> nfs4++ :)
<gchristensen> "As pretty much the whole OS directory tree is read-only" funny, this is always true for every service of mine...
<infinisil> Yeah so I guess the nfs complaints can be ignored with this
* infinisil reads the blog now
<gchristensen> (also, I should migrate my things over to using dynamicuser... making custom users is so systemdv234!)
<clever> gchristensen: what if the config file needs to be owned by a user?
<clever> or state
<clever> state, the enemy of pure languages!
<infinisil> fpletz: Ah yeah that's a very nice blogpost, explained it perfectly
<infinisil> So DynamicUser is very neat and I'll advocate it even more now
<infinisil> One slight disadvantage is that users can't set their own state directories, but I researched this a bit, and found out that nobody needs that anyways for most services
<infinisil> (I grepped through like 15 different nixos configs for dataDir assignments, only found a couple that set it for postgresql, mysql and syncthing -> concluded it's only needed for services that explicitly are about handling data)
<clever> infinisil: my VPN has to track private keys and whitelisted public keys
<clever> and thats all managed outside the store, so it needs a state dir
<clever> it defaults to $HOME/.toxvpn/
<infinisil> There's no real reason this can't be in /var/lib/toxvpn though is there
<clever> [root@amd-nixos:~]# ls -ltrha /var/lib/toxvpn/.toxvpn/
<clever> infinisil: exactly what i set $HOME to for the toxvpn user
<infinisil> Apparently the nixos module also uses /var/lib/toxvpn
pie_ has quit [Remote host closed the connection]
pie_ has joined #nixos-dev
<infinisil> clever: Wait, are you not using the nixos module?
<clever> i am using the nixos module
<clever> that module auto-creates a user when the service is enabled
<infinisil> Ahh right, so it uses $HOME/.toxvpn, and `toxvpn.home = "/var/lib/toxvpn";` i see
<clever> exactly
<infinisil> Crazy idea: enable DynamicUser by default :P
<gchristensen> that is certainly an idea
<ekleog> if you're looking for crazy ideas around systemd, I was around enabling Restart=yes by default
<gchristensen> :|
<infinisil> why that/
<infinisil> ?*
<ekleog> got hurt way too many times by systemd not auto-restarting some service that transiently failed
<gchristensen> y'all wild
<infinisil> ekleog: By yes you mean on-failure? I don't think yes is an allowed value
<ekleog> like, opensmtpd failing at startup randomly, and basically most if not all services are expected to be running, not stopped
<gchristensen> a better (imo) route is to audit your system for failed services and fix their bugs
<infinisil> Yeah, it's not too bad of an idea tbh, almost all services set Restart already anyways
<infinisil> Oh
<infinisil> Yeah, agreed with gchristensen actually
<ekleog> infinisil: I was more thinking `always` but `on-failure` would be less aggressive
<ekleog> gchristensen: transient failures are transient ._.
<ekleog> like, I think there's some dependency missing in nixos' opensmtpd service
<gchristensen> if we wanted to restart in a loop, we might as well use runit
<ekleog> tbh I much prefer runit to systemd, for exactly this reason
<ekleog> but the opensmtpd failure only happens on system boot and not reliably
<clever> ekleog: does runit even support stop?
<gchristensen> because your services are underspecified?
<ekleog> clever: yup, there's `sv stop`
<gchristensen> that came off snarkier than I intended
<clever> ekleog: ah
<ekleog> gchristensen: they most certainly are, but I have no idea in which way
<gchristensen> but, if a service isn't coming up properly, it is probably because it has a missing dependency
<infinisil> Yeah, hard to figure out the real problem though
<ekleog> if there was a way to purify it, I'd be happy to
<gchristensen> run things with large amounts of logging and see whats up
<gchristensen> or, boot to an emergency shell and then start the service, and see what it complains about
<ekleog> it's like choosing coq vs. erlang, coq would be making everything perfect, erlang is “well that'll fail, just restart it and it'll work well enough”
<gchristensen> another technique is examine other distro's systemd services
<ekleog> gchristensen: I've got another issue with timing on a machine whose disks are really slow to appear, and I couldn't possibly be fast enough by hand to reproduce
<gchristensen> aye, but isn't that one pretty easy to diagnose and fix?
<ekleog> (and even the emergency shell won't block the kernel from accepting new disks)
apaul1729 has joined #nixos-dev
<infinisil> gchristensen: Lots of services already define their own systemd files upstream. Maybe we should have a systemd2nixos
<ekleog> this one should be quite easy hopefully, just wanted to point out another issue where the emergency shell wouldn't help
<gchristensen> infinisil: nixos can already base its config off of a package's contained systemd file
<infinisil> Yeah, but it's all dynamic then
<worldofpeace> infinisil: https://github.com/NixOS/nixpkgs/issues/50105#issuecomment-452435126 Contributing to nixpkgs can feel mysterious, all the unpublicised inner complexity
<infinisil> Sure can, how does this relate to the comment though?
<worldofpeace> This would at least tell people what's going on, decreasing the mystery slightly.
<infinisil> Ah yeah
<worldofpeace> But what I'm also getting at is that we need to be better at exposing our complexities
<worldofpeace> Our rules of engagement are revealed through interaction. That can be uninviting.
orivej_ has joined #nixos-dev
orivej has quit [Read error: Connection reset by peer]
eadwu has joined #nixos-dev
worldofpeace has quit [Quit: worldofpeace]
worldofpeace has joined #nixos-dev
apaul1729 has quit [Remote host closed the connection]
JosW has joined #nixos-dev
<timokau[m]> The uefiUsb test is currently timing out. How long is that timeout? Where is it configured?
<timokau[m]> Should we just double it? I feel like "exponential backoff" is a good solution whenever we get a false-negative from a timeout
<samueldr> this might be a balancing act, detect broken stuff earlier or not, and past failing tests aren't accounted for (and probably shouldn't)
<samueldr> things that would be nice to know: was the host overloaded?
* gchristensen looks at his nascent scheduler code
* samueldr was thinking about it
<samueldr> (hoping it would help guarantee *some* load?)
<timokau[m]> The test took 500s locally for me, a lot of which was spent waiting for the vm to come up
JosW has quit [Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/]
<timokau[m]> I feel like timeouts should be chosen very generous to avoid having to think about load too much
<timokau[m]> Hydra tests are not the place for benchmarks
<samueldr> maybe 5 minutes was a generous value at one point?
<samueldr> see also 95486ca306fc72d23ed4568b91afe35b6af9fab1 9bc10e12916979c5c620be5b521b9218a0077cba
<timokau[m]> Thats a good point, I know basically nothing about how the vm tests work but bringing up a simple headless vm probably shouldn't take that long
<samueldr> part of #49777
<{^_^}> https://github.com/NixOS/nixpkgs/pull/49777 (by srhb, 9 weeks ago, merged): Revert "NixOS tests: Wait for shell for 10x longer (50m)"
<samueldr> (learning along the way btw)
<samueldr> >> This suggests to me that the increased timeout as suspected doesn't actually do anything, because the failure will occur regardless of how long we wait. If that turns out wrong, we can always re-revert, but I think the root cause should be sought elsewhere within the VM test framework.
<gchristensen> tests used to reliably finish on my machine in <1min and no longer do
<gchristensen> (/!\ wildly making stuff up alert /!\) I wonder if spectre stuff has to do with it.
<timokau[m]> Looking at the tests matrix I feel like we need to change something fundamentally, that looks like success or failure is basically random
<samueldr> gchristensen: disable mitigations and microcode and test?
<samueldr> (on your machine)
<timokau[m]> I'd be very surprised if that would slow things down for more than 50%
<samueldr> timokau[m]: yes, *something* is up, I hope people better than I at this stuff will look at it and figure it out :)
<gchristensen> same
<samueldr> looks like it's `pti=off` to disable the kernel bits (unless there were other added)
<timokau[m]> We should be able to `git bisect` it, shouldn't we?
<timokau[m]> I'm trying to run that particular test on 17.03 right now
<samueldr> bisecting unreliable tests?
<timokau[m]> bisecting their runtime
<samueldr> or the apparent increase in time it takes?
<samueldr> ah
<timokau[m]> If it takes 500s locally, its not very surprising that a 300s timeout of some component gets triggered occasionally
<samueldr> though, if it's host-related (e.g. a kernel with spectre mitigations and its /dev/kvm) bisecting will also require the host to be touched
<timokau[m]> So either it shouldn't take 500s locally or the timeout is too strict
<timokau[m]> True
<samueldr> though the timeout AFAIUI is not at the beginning of it all, but at a point it's known to be already starting
<timokau[m]> Or we write a vm test for vm tests :D
<samueldr> timokau[m]: I think we'd need cycle-accurate tests :)
<timokau[m]> Whats that?
<samueldr> borrowing an emulation term, where (oversimplified) "the whole machine" is emulated; here I'm meaning where the wall clock is *not* synchronized with the host's clock, so e.g. even if it takes twice as long to test the tests on my machine, it wouldn't fail earlier in the execution
<samueldr> (half joking here, but I think to test a testing infra it'd be required)
<timokau[m]> Ah right, that makes sense
<gchristensen> also remember metrics about the builders are public
<samueldr> seems it might exist
<gchristensen> and if you have ideas on how to tune them, I'm all ears
<samueldr> but possibly only academically?
<timokau[m]> The uefi-usb test took 118s on 17.03 (500 on master)
<gchristensen> all the best stuff is only available academically
<timokau[m]> Will re-test on master
<ekleog> samueldr: might be doable with minor tweaks to valgrind too, but would be much slower than regular qemu
<samueldr> here I'm budgeting infinite time in tradeoff to perfect reproducibility of the testing infra :)
<timokau[m]> Well its really infinite CPU time vs infinite developer time
<timokau[m]> I can reproduce the 5x increase in runtime of `tests.boot.uefiUsb.x86_64-linux` (from 100 to 500s) between 17.03 and current master
<timokau[m]> Time to sleep now though, if somebody else wants to do the actual bisect that would be amazing :)
<gchristensen> I can provide a big system for whoever wants to do that
sir_guy_carleton has joined #nixos-dev
sir_guy_carleton has quit [Ping timeout: 258 seconds]
sir_guy_carleton has joined #nixos-dev
<srhb> samueldr: Frankly my hypothesis, in light of the result of the revert, seems wrong.
<srhb> samueldr: We did see fewer failures with the increased timeout. Not all of them went away, but I'm fairly sure some did.
<samueldr> hm!
<srhb> I think there's just more than one cause here.
<samueldr> I'm waiting on a thing, probably done in ~1h where I run the test three times for all releases since 17.03 (release commit) and nixos-unstable to see if there's a progression or a sharp increase between one release and the other; then I'll be doing the same between the tags, somehow
<samueldr> I'm sure there's more than one cause :/
<srhb> Tweaking load on machines also has an impact etc. It's frankly a bit confusing, and really, really difficult to replicate on a smaller scale. Like, on my machine I have to load it massively and set the timeout to a few seconds.
<samueldr> and having a cycle-accurate emulation, even if sloooooooooow would help :(
<srhb> fwiw I intend to return to the issue when I'm out of my NixOS-hiatus :P
<samueldr> I'm thinking of maybe adding more data; e.g. dump the time it took where we set the alarm, and other stats in a file for each builds, into a build product; then we could at least use existing builds on hydra to gather data
<srhb> Yes, that would be very helpful.
<samueldr> I was surprised to see I had to parse the logs to get the amount of time it took to build a test
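The "run it three times per release and record the timings" approach can be sketched in Python without parsing logs. This is only an illustration: the function names and the JSON layout are invented here, and it measures wall-clock time of the whole command, not the point at which the test driver sets its alarm.

```python
import json
import subprocess
import sys
import time


def time_runs(cmd, runs=3):
    """Run cmd several times; record wall-clock seconds and exit status."""
    results = []
    for _ in range(runs):
        start = time.monotonic()
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        results.append(
            {"seconds": time.monotonic() - start, "returncode": proc.returncode}
        )
    return results


def write_report(path, label, cmd, runs=3):
    """Store the timings as JSON so runs on different revisions can be compared."""
    report = {"label": label, "cmd": list(cmd), "runs": time_runs(cmd, runs)}
    with open(path, "w") as fh:
        json.dump(report, fh, indent=2)
    return report


if __name__ == "__main__":
    # Placeholder command; substitute the real test invocation.
    print(write_report("timing.json", "demo", [sys.executable, "-c", "pass"]))
```

For the test discussed here, the command might be something like `nix-build nixos/release.nix -A tests.boot.uefiUsb.x86_64-linux` in a checkout of each release (an assumption, not taken from the log). An already-built result would return almost instantly, so the timings are only meaningful when the test actually runs.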
eadwu has quit [Ping timeout: 264 seconds]