#nixos-dev on 2019-01-08

2018-08-16 20:49 gchristensen changed the topic of #nixos-dev to: NixOS Development (#nixos for questions) | https://hydra.nixos.org/jobset/nixos/trunk-combined https://channels.nix.gsc.io/graph.html | 18.09 release managers: vcunat and samueldr | https://logs.nix.samueldr.com/nixos-dev

00:11 sir_guy_carleton has joined #nixos-dev

00:26 drakonis has quit [Read error: Connection reset by peer]

00:44 eadwu has joined #nixos-dev

01:19 drakonis has joined #nixos-dev

01:28 sir_guy_carleton has quit [Quit: WeeChat 2.2]

01:36 orivej has joined #nixos-dev

01:43 drakonis has quit [Read error: Connection reset by peer]

02:12 <worldofpeace> Is the borg ok? I haven't seen build results for anything I've issued today.

02:12 <gchristensen> hmm

02:14 <worldofpeace> I think others might have noticed this too since I've spotted someone editing their comment to see if it would do the thing

02:14 <gchristensen> did that help?

02:15 <worldofpeace> Well it's been 3 hours since then

02:15 <worldofpeace> *its

02:16 <gchristensen> did you just get replies?

02:18 <worldofpeace> looking around to see

02:19 <worldofpeace> Yes

02:19 <gchristensen> good

02:19 <worldofpeace> What kind of mystery was this :D

02:20 <gchristensen> there was a hiccup in some of the pieces, and a few of the components stopped receiving messages -- but the work was being done

02:24 <worldofpeace> praise to you again gchristensen :)

02:25 <gchristensen> well... it'd be better if I fixed that bug than just restarted the pieces :P but thanks

02:54 copumpkin has joined #nixos-dev

02:58 Lingjian has joined #nixos-dev

03:01 eadwu has quit [Ping timeout: 252 seconds]

03:04 eadwu has joined #nixos-dev

03:05 sir_guy_carleton has joined #nixos-dev

03:05 <jtojnar> https://i.imgur.com/LFcAZ6r.png

03:06 <jtojnar> this is funny

03:06 <jtojnar> https://github.com/NixOS/nixpkgs/pull/53520

03:06 <{^_^}> #53520 (by rvolosatovs, 1 day ago, open): kitty: 0.13.1 -> 0.13.2

03:06 init_6 has joined #nixos-dev

03:07 <jtojnar> borg is now succeeding in negative time

03:07 Lingjian has quit [Ping timeout: 252 seconds]

03:08 orivej has quit [Ping timeout: 258 seconds]

03:08 orivej has joined #nixos-dev

03:10 <worldofpeace> I'm convinced the borg is a time traveller :P

03:15 <gchristensen> it is true

03:15 <gchristensen> the borg's engines tear holes in space-time.

03:51 lassulus_ has joined #nixos-dev

03:53 lassulus has quit [Ping timeout: 246 seconds]

03:53 lassulus_ is now known as lassulus

03:59 jbarthelmes has joined #nixos-dev

04:03 <dtz> someone working on readline 8/bash 5? :D

04:10 lopsided98 has quit [Quit: Disconnected]

04:18 eadwu has quit [Ping timeout: 268 seconds]

04:19 lopsided98 has joined #nixos-dev

04:35 copumpkin has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]

05:05 copumpkin has joined #nixos-dev

05:14 jtojnar has quit [Ping timeout: 272 seconds]

05:26 orivej has quit [Ping timeout: 258 seconds]

05:44 init_6 has quit [Ping timeout: 246 seconds]

06:02 orivej has joined #nixos-dev

06:10 sir_guy_carleton has quit [Quit: WeeChat 2.2]

06:21 pie_ has joined #nixos-dev

08:04 phreedom has quit [Ping timeout: 256 seconds]

08:05 sorear has quit [Read error: Connection reset by peer]

08:18 worldofpeace has quit [Ping timeout: 246 seconds]

08:29 orivej has quit [Ping timeout: 245 seconds]

08:33 orivej has joined #nixos-dev

08:51 jtojnar has joined #nixos-dev

09:14 init_6 has joined #nixos-dev

09:58 phreedom has joined #nixos-dev

11:19 eadwu has joined #nixos-dev

11:40 eadwu has quit [Ping timeout: 268 seconds]

12:05 sorear has joined #nixos-dev

12:16 __Sander__ has joined #nixos-dev

13:29 init_6 has quit [Ping timeout: 258 seconds]

13:31 init_6 has joined #nixos-dev

13:50 init_6 has quit [Ping timeout: 272 seconds]

15:30 <infinisil> Ohh, yeah actually, we should get rid of this static uid/gid mapping convention ASAP: https://github.com/NixOS/nixpkgs/blob/2c9de98/nixos/modules/misc/ids.nix#L340-L342

15:31 <gchristensen> why 339?

15:31 <gchristensen> 399*

15:31 <infinisil> Probably there starts some reserved range, *goes to look*

15:32 <gchristensen> https://github.com/NixOS/nixpkgs/blob/2c9de98dbaddc25520296ad22c55eae961cb21a3/nixos/modules/programs/shadow.nix#L9

15:33 <infinisil> Ahh

15:34 <infinisil> So I guess in theory we could probably still use 500-999 for static ids..

15:38 <infinisil> Oh um

15:38 <infinisil> "The majority of modern Unix-like systems (e.g., Solaris-2.0 in 1990, Linux 2.4 in 2001) have switched to 32-bit UIDs, allowing 4,294,967,296 (2^32) unique IDs."

15:39 <infinisil> I didn't know that

15:55 hedning has left #nixos-dev [#nixos-dev]

16:06 <fadenb> Things I did not expect: Reading Solaris 2 and modern in the same sentence ;)

16:30 worldofpeace has joined #nixos-dev

16:41 orivej has quit [Read error: Connection reset by peer]

16:42 orivej has joined #nixos-dev

16:57 __Sander__ has quit [Quit: Konversation terminated!]

16:59 fpletz has joined #nixos-dev

17:19 pie_ has quit [Remote host closed the connection]

17:19 pie_ has joined #nixos-dev

17:27 emily has joined #nixos-dev

17:38 drakonis has joined #nixos-dev

17:49 jbarthelmes has quit [Ping timeout: 258 seconds]

17:59 <globin> most of those aren't necessary anyway

18:00 <globin> only services with a lot of data should be in there

18:01 <globin> and even that is not really necessary since having the user-group perl script "persisting" uids/gids

18:07 <samueldr> maybe what's needed is a canonical way in nixos to share that database between machines?

18:07 <samueldr> (though now we're left with another issue, what if you have machines A and B and B and C that needs to share the same IDs, but don't really want A and C to be mixed :))

18:07 <samueldr> so, not really ideal

18:28 <infinisil> globin: Exactly, that's what I've been thinking too

18:29 <infinisil> I'll probably soon make a PR to get rid of the static mapping to discuss this further

18:29 <infinisil> So far nobody has brought up a convincing reason to keep the static mapping now that /var/lib/nixos/{g,u}id-map exists
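
The persistence idea behind /var/lib/nixos/{g,u}id-map can be sketched roughly as follows: a name that has been seen before keeps its id, and a new name gets the next free one. This is a minimal Python illustration, not the actual Perl script; the function name `allocate_ids`, the JSON file layout, and the 400-999 range are assumptions made for the example.

```python
import json
import os


def allocate_ids(names, map_path, id_min=400, id_max=999):
    """Return {name: id}, reusing ids recorded in map_path (hypothetical JSON file)."""
    # Load previously persisted assignments, if any.
    if os.path.exists(map_path):
        with open(map_path) as f:
            id_map = json.load(f)
    else:
        id_map = {}

    used = set(id_map.values())
    candidate = id_min
    for name in names:
        if name in id_map:
            continue  # a known name keeps its old id
        while candidate in used:
            candidate += 1
        if candidate > id_max:
            raise RuntimeError(f"no free id left for {name!r}")
        id_map[name] = candidate
        used.add(candidate)

    # Persist every assignment ever made, including names that are not
    # currently configured, so a name that comes back gets its old id again.
    with open(map_path, "w") as f:
        json.dump(id_map, f, sort_keys=True)

    return {name: id_map[name] for name in names}
```

In this sketch, running the allocation before the map file has been restored assigns ids from scratch, and those ids need not match the ownership of data restored afterwards.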

18:37 <ekleog> the static mapping is here to handle properly backup/restore, is it not?

18:37 <ekleog> so that it's possible to backup on a machine and restore on another without having to fixup all uid/gids

18:38 <gchristensen> if you restore the /var/lib/nixos it'll restore properly

18:38 <infinisil> ekleog: As long as you backup /var/lib/nixos along with the rest of /var/lib/* it won't be a problem

18:39 <ekleog> ie. while DynamicUser is an appropriate replacement for the static mapping, I'm not convinced the perl script is

18:39 <ekleog> it will after a rebuild only, right?

18:39 <ekleog> meaning that the restore must be done before the first rebuild

18:40 <ekleog> which can be problematic when deploying machines directly with their configuration

18:40 <infinisil> Yeah

18:40 <infinisil> But you should always first restore data, then the config

18:41 <ekleog> well, that means first boot with a null-config, then restore data, then switch to a full-config

18:41 <ekleog> which means having to actually write a null-config

18:42 <ekleog> (with enough stuff to be able to do the restore stage but not too much to not have everything break down when the u/gids will change during the restore)

18:42 <infinisil> Hmm, that's how I'm always doing my restores

18:43 <infinisil> ekleog: It would be weird if the services were running without the correct data

18:43 <infinisil> Some are stateful and need the correct setup

18:43 <infinisil> e.g. acme

18:43 <infinisil> I think it doesn't make much sense to run the services without the restored data

18:44 <ekleog> until now (so only one restore) I've survived by just deploying the same configuration, then `systemctl stop`'ing all services, then restoring and rebooting

18:44 <infinisil> And you'd probably run into a couple other problems anyways. At least I did once when I restored the config before the data (can't remember the problem though)

18:45 <infinisil> But I mean, even if you restore the data after a config change, the worst thing that'll happen is that services can't read their data initially. After the config change, everything will be fine

18:47 <ekleog> hmm, I'm maybe over-cautious, just somehow feel bad about that /etc/users-generating perl script :)

18:47 * ekleog actually hands out explicit IDs to his users

18:47 <infinisil> I mean, it's worked well for a while now. I don't like perl either, but don't fix what ain't broken :P

18:48 <ekleog> it's not really a perl issue, more an impurity issue

18:48 <infinisil> Hmm, there's really no nice way to make this pure afaik

18:49 <ekleog> yup, that's why I'm trying not to use it by having all users have explicit IDs

18:49 <ekleog> it's by-design impure so I feel like it can only go wrong someplace, but now I'm unwillingly moving goal posts so feel free to ignore, it's much more of a feeling than actual data

18:50 <infinisil> Oh, crazy idea

18:50 <gchristensen> I also don't like dynamic IDs, ekleog.

18:51 <gchristensen> DynamicUsers is a different thing, but I feel for NixOS, my very strong preference is for uids to be statically assigned

18:51 <infinisil> Oh, no crazy idea won't work. I thought of cramming all possible usernames into the 2^32 range via some deterministic mapping. But username can have 32 chars afaik, which won't fit into that range by far

18:51 <gchristensen> lol

18:52 <ekleog> infinisil: yup, at some point I was considering writing a hash function in nix for exactly this purpose, for VM's IP addresses :D

18:52 <ekleog> but then hash collision resolution is the issue

18:52 <infinisil> Yeah and you can't just fix it like with hash tables..

18:52 <ekleog> one solution would be to just refuse to evaluate in case of collision

18:52 <ekleog> and hope usual stuff won't collide

18:53 <ekleog> but that's not really backwards-compatible :D

18:53 <infinisil> How about this: Since most usernames only have a couple chars, cram all those into a deterministic mapping (limited by how many letters we can fit into 2^32)

18:53 <infinisil> Like, start with 1 chars, then 2, ...

18:53 <infinisil> I think this might even work

18:54 <ekleog> one lowercase letter is 4.7 bits -> 2^32 would make ~6 chars
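
The "~6 chars" estimate can be checked with a short Python sketch. It assumes usernames consist only of the 26 lowercase letters (a simplification; real usernames may also contain digits and a few other characters) and enumerates names shortest first, then alphabetically. The helper names `max_length` and `name_to_id` are made up for this illustration.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase letters only


def max_length(id_space=2**32, alphabet_size=len(ALPHABET)):
    """Longest n such that all names of length 1..n fit into id_space."""
    total, n = 0, 0
    while total + alphabet_size ** (n + 1) <= id_space:
        n += 1
        total += alphabet_size ** n
    return n


def name_to_id(name):
    """Map a name to its index in shortest-first, then alphabetical order."""
    value = 0
    for ch in name:
        value = value * len(ALPHABET) + ALPHABET.index(ch) + 1
    return value - 1  # "a" -> 0, "z" -> 25, "aa" -> 26
```

Under these assumptions `max_length()` gives 6 for a 2^32 id space and `max_length(2**16)` gives 3, which is in line with the estimate above.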

18:54 <infinisil> Hmm yeah a bit low probably

18:55 <ekleog> I'd have more hope for the hash function than for the deterministic mapping, as it'd handle better collisions… like, you'd need 2^16 usernames to have 1/2 of collision

18:55 <infinisil> Hmm yeah that's a good point

18:56 <ekleog> but then when it actually collides in a configuration that currently works someone's going to hate us

18:56 <infinisil> Does all software work with ids >65536 ids btw?

18:56 <infinisil> s/ids//

18:56 <ekleog> most likely not, indeed

18:56 <gchristensen> probably not

18:56 <ekleog> hmm and cramming into 65536 ids would make collisions have 1/2 chance at only 256 usernames, so a bit low
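
The collision figures quoted above are birthday-problem estimates. A small Python sketch for checking them, assuming the hash spreads names uniformly over the id space (the function name `collision_probability` is made up for this illustration):

```python
import math


def collision_probability(n_names, id_space):
    """Probability that at least two of n_names uniformly hashed names share an id."""
    if n_names > id_space:
        return 1.0
    # Sum the logs of the "no collision so far" factors to avoid underflow.
    log_no_collision = sum(math.log1p(-i / id_space) for i in range(n_names))
    return 1.0 - math.exp(log_no_collision)
```

With the square root of the id space as the number of names (256 names in 65536 ids, or 2^16 names in 2^32 ids) the probability comes out at about 0.39; it crosses 1/2 at roughly 1.18 times the square root. So the rule of thumb is the right order of magnitude, if slightly pessimistic.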

18:57 <infinisil> Also we can't use most of that range

18:57 <infinisil> I think

18:59 <ekleog> actually maybe cramming it deterministically in 65536 (or whatever we can use), and bailing out on collision by asking the user to assign an explicit ID could do it? I kind of wonder how bad the churn would be

18:59 <ekleog> (and the considered hash function would be the hash function with the overrides provided by the user with .uid = 1234;)
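
The scheme described here (hash each name into a fixed range, let an explicit `.uid` override win, refuse to continue on a collision) might look roughly like this in Python. It is only a sketch: the idea in the discussion was a hash function written in Nix, and the choice of SHA-256, the 1000-65533 range, and the name `assign_uids` are assumptions for the example.

```python
import hashlib


def assign_uids(names, overrides=None, id_min=1000, id_max=65533):
    """Derive a uid per name from a hash; explicit overrides win; collisions are errors."""
    overrides = overrides or {}
    span = id_max - id_min + 1
    owner_of = {}  # uid -> name, used to detect collisions
    result = {}
    for name in sorted(names):
        if name in overrides:
            uid = overrides[name]  # plays the role of an explicit `.uid = 1234;`
        else:
            digest = hashlib.sha256(name.encode("utf-8")).digest()
            uid = id_min + int.from_bytes(digest[:8], "big") % span
        if uid in owner_of:
            raise ValueError(
                f"uid {uid} for {name!r} collides with {owner_of[uid]!r}; "
                "assign an explicit uid to one of them"
            )
        owner_of[uid] = name
        result[name] = uid
    return result
```

The result depends only on the set of names and the overrides, so it stays pure; the cost is that adding one more user can turn a previously working configuration into an evaluation error.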

19:00 <infinisil> It's very bad UX

19:01 <infinisil> And there's bound to be certain users having usernames that collide, especially as the userbase grows

19:01 <infinisil> Oh, I guess you mean only for services?

19:01 <ekleog> depends on how frequent these collisions are… like, on my system I've got 51 users

19:01 <ekleog> (single-user laptop system, though)

19:02 <cransom> time to create the world's largest ldap server.

19:02 <ekleog> actually upon more thought… all this explodes in the air with ldap users and the like

19:02 <ekleog> huh inb4'd

19:02 <ekleog> but the perl script explodes in the air too, I guess

19:02 <infinisil> Yeah, I think a static mapping just doesn't work, and it's very ugly to have it in nixpkgs anyways imo

19:02 <gchristensen> let's put ldap on the blockchain

19:02 <ekleog> gchristensen: namecoin

19:03 <ekleog> (yes, it does exist)

19:03 <gchristensen> yeah but not as a lightweight directory access protocol!

19:04 <ekleog> legacy doesn't really help us by limiting to 2^16 users…

19:04 <ekleog> one solution would be to drop legacy support and run everything in a DynamicUser-like

19:04 <gchristensen> not everything is compatible with DynamicUser

19:04 <ekleog> I kind-of wonder

19:05 <ekleog> (part of the incompatibilities was due to nscd caching failing which should be fixed now, maybe there are others?)

19:05 <ekleog> but it'd be painfully slow to search through the whole filesystem to fixup permissions anyway, so it couldn't really replace all uid/gids

19:06 <ekleog> only for services

19:06 <infinisil> Okay so all static ids doesn't work, and all dynamic ids doesn't work either, so we need something inbetween, and that's what /var/lib/nixos/{g,u}id-map does nicely

19:07 <ekleog> :'(

19:08 <gchristensen> something something arguing for state something nixos

19:09 <infinisil> One slight actual problem with this mapping is that when you restore from backups, but forget to do it for /var/lib/nixos, all services may break

19:09 <infinisil> This is fixable by having all services check whether the permissions are correct for their directories and fix them if not

19:10 <infinisil> Which might take a while for big directories, but since this happens very rarely, it should be fine
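
A check-and-fix pass of the kind described could be sketched like this in Python. It is a hypothetical helper, not something NixOS services do today; the name `fix_ownership` and the `dry_run` flag are made up for the example, and actually changing ownership generally requires root.

```python
import os


def fix_ownership(root, uid, gid, dry_run=False):
    """Chown every entry under root that is not owned by uid:gid; return changed paths."""
    changed = []
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)  # lstat: inspect symlinks themselves, do not follow them
        if st.st_uid != uid or st.st_gid != gid:
            if not dry_run:
                os.chown(path, uid, gid, follow_symlinks=False)
            changed.append(path)
    return changed
```

When everything is already owned correctly the pass only costs one `lstat` per entry, which is where the "might take a while for big directories" trade-off comes from.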

19:10 <gchristensen> my preference would be everything use DynamicUser, and those that can't, have statically assigned IDs

19:10 <infinisil> Yeah, DynamicUser always if possible, I'm encouraging it in my reviews

19:11 <infinisil> Wait, how does dynamic user even work with stateful directories? Does it chmod the dir upon start?

19:11 <gchristensen> systemd magic

19:11 <infinisil> Oh, some mount namespace magic maybe

19:14 <infinisil> A problem with the "every service checks for permissions at the start and chmods if they're wrong" is that we can't just easily change every service to do this

19:14 <fpletz> I'm all for DynamicUser but are people still complaining about compatibility with state directories mounted over nfs?

19:15 <fpletz> infinisil: yeah, or user namespace magic :)

19:16 <infinisil> fpletz: Doesn't nfs support some id translation thing?

19:18 <fpletz> infinisil: I don't use nfs so I don't really know, but if uids are dynamic a static translation is probably bound to fail

19:18 <fpletz> regarding state directories in systemd service with dynamic users, here are some details: http://0pointer.net/blog/dynamic-users-with-systemd.html

19:18 <fpletz> more than in the manpages :)

19:19 <infinisil> fpletz: :O https://serverfault.com/a/632315/416725

19:20 <fpletz> nfs4++ :)

19:21 <gchristensen> "As pretty much the whole OS directory tree is read-only" funny, this is always true for every service of mine...

19:21 <infinisil> Yeah so I guess the nfs complaints can be ignored with this

19:21 * infinisil reads the blog now

19:23 <gchristensen> (also, I should migrate my things over to using dynamicuser... making custom users is so systemdv234!)

19:25 <clever> gchristensen: what if the config file needs to be owned by a user?

19:26 <clever> or state

19:26 <clever> state, the enemy of pure languages!

19:29 <infinisil> fpletz: Ah yeah that's a very nice blogpost, explained it perfectly

19:30 <infinisil> So DynamicUser is very neat and I'll advocate it even more now

19:31 <infinisil> One slight disadvantage is that users can't set their own state directories, but I researched this a bit, and found out that nobody needs that anyways for most services

19:31 <infinisil> https://github.com/infinisil/nixos-services#datadir

19:35 <infinisil> (I grepped through like 15 different nixos configs for dataDir assignments, only found a couple that set it for postgresql, mysql and syncthing -> concluded it's only needed for services that explicitly are about handling data)

19:36 <clever> infinisil: my VPN has to track private keys and whitelisted public keys

19:37 <clever> and thats all managed outside the store, so it needs a state dir

19:37 <clever> it defaults to $HOME/.toxvpn/

19:38 <infinisil> There's no real reason this can't be in /var/lib/toxvpn though is there

19:38 <clever> [root@amd-nixos:~]# ls -ltrha /var/lib/toxvpn/.toxvpn/

19:38 <clever> infinisil: exactly what i set $HOME to for the toxvpn user

19:40 <infinisil> Apparently the nixos module also uses /var/lib/toxvpn

19:40 pie_ has quit [Remote host closed the connection]

19:40 pie_ has joined #nixos-dev

19:40 <infinisil> clever: Wait, are you not using the nixos module?

19:40 <clever> i am using the nixos module

19:40 <clever> that module auto-creates a user when the service is enabled

19:41 <infinisil> Ahh right, so it uses $HOME/.toxvpn, and `toxvpn.home = "/var/lib/toxvpn";` i see

19:41 <clever> exactly

19:43 <infinisil> Crazy idea: enable DynamicUser by default :P

19:44 <gchristensen> that is certainly an idea

19:44 <ekleog> if you're looking for crazy ideas around systemd, I was thinking about enabling Restart=yes by default

19:44 <gchristensen> :|

19:44 <infinisil> why that/

19:44 <infinisil> ?*

19:44 <ekleog> got hurt way too many times by systemd not auto-restarting some service that transiently failed

19:44 <gchristensen> y'all wild

19:45 <infinisil> ekleog: By yes you mean on-failure? I don't think yes is an allowed value

19:45 <ekleog> like, opensmtpd failing at startup randomly, and basically most if not all services are expected to be running, not stopped

19:46 <gchristensen> a better (imo) route is to audit your system for failed services and fix their bugs

19:46 <infinisil> Yeah, it's not too bad of an idea tbh, almost all services set Restart already anyways

19:46 <infinisil> Oh

19:46 <infinisil> Yeah, agreed with gchristensen actually

19:46 <ekleog> infinisil: I was more thinking `always` but `on-failure` would be less aggressive

19:46 <ekleog> gchristensen: transient failures are transient ._.

19:47 <ekleog> like, I think there's some dependency missing in nixos' opensmtpd service

19:47 <gchristensen> if we wanted to restart in a loop, we might as well use runit

19:47 <ekleog> tbh I much prefer runit to systemd, for exactly this reason

19:47 <ekleog> but the opensmtpd failure only happens on system boot and not reliably

19:47 <clever> ekleog: does runit even support stop?

19:47 <gchristensen> because your services are underspecified?

19:48 <ekleog> clever: yup, there's `sv stop`

19:48 <gchristensen> that came off snarkier than I intended

19:48 <clever> ekleog: ah

19:48 <ekleog> gchristensen: they most certainly are, but I have no idea in which way

19:48 <gchristensen> but, if a service isn't coming up properly, it is probably because it has a missing dependency

19:48 <infinisil> Yeah, hard to figure out the real problem though

19:48 <ekleog> if there was a way to purify it, I'd be happy to

19:48 <gchristensen> run things with large amounts of logging and see whats up

19:48 <gchristensen> or, boot to an emergency shell and then start the service, and see what it complains about

19:49 <ekleog> it's like choosing coq vs. erlang, coq would be making everything perfect, erlang is “well that'll fail, just restart it and it'll work well enough”

19:49 <gchristensen> another technique is examine other distro's systemd services

19:49 <ekleog> gchristensen: I've got another issue with timing on a machine whose disks are really slow to appear, and I couldn't possibly be fast enough by hand to reproduce

19:50 <gchristensen> aye, but isn't that one pretty easy to diagnose and fix?

19:50 <ekleog> (and even the emergency shell won't block the kernel from accepting new disks)

19:50 apaul1729 has joined #nixos-dev

19:50 <infinisil> gchristensen: Lots of services already define their own systemd files upstream. Maybe we should have a systemd2nixos

19:50 <ekleog> this one should be quite easy hopefully, just wanted to point out another issue where the emergency shell wouldn't help

19:50 <gchristensen> infinisil: nixos can already base its config off of a package's contained systemd file

19:51 <infinisil> Yeah, but it's all dynamic then

20:29 <worldofpeace> infinisil: https://github.com/NixOS/nixpkgs/issues/50105#issuecomment-452435126 Contributing to nixpkgs can feel mysterious, all the unpublicised inner complexity

20:33 <infinisil> Sure can, how does this relate to the comment though?

20:35 <worldofpeace> This would at least tell people what's going on, decreasing the mystery slightly.

20:39 <infinisil> Ah yeah

20:41 <worldofpeace> But what I'm also getting at is that we need to be better at exposing our complexities

20:41 <worldofpeace> Our rules of engagement are revealed through interaction. That can be uninviting.

20:55 orivej_ has joined #nixos-dev

20:56 orivej has quit [Read error: Connection reset by peer]

21:00 eadwu has joined #nixos-dev

21:08 worldofpeace has quit [Quit: worldofpeace]

21:13 worldofpeace has joined #nixos-dev

21:52 apaul1729 has quit [Remote host closed the connection]

21:58 JosW has joined #nixos-dev

22:03 <timokau[m]> The uefiUsb test is currently timing out. How long is that timeout? Where is it configured?

22:03 <timokau[m]> https://hydra.nixos.org/build/86749911

22:06 <samueldr> 300s, https://github.com/NixOS/nixpkgs/blob/cd8c1a40536886114c417d5cab357e059ac41505/nixos/lib/test-driver/Machine.pm#L252-L253

22:07 <timokau[m]> Should we just double it? I feel like "exponential backoff" is a good solution whenever we get a false-negative from a timeout
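
One reading of the "exponential backoff" suggestion, sketched in Python: retry the step with a doubled timeout each time it times out, up to a cap. The 300s starting value is the timeout quoted above; the cap of 2400s, the callback-style `attempt` parameter, and the name `run_with_backoff` are assumptions for this illustration and not part of the NixOS test driver.

```python
def run_with_backoff(attempt, initial_timeout=300, factor=2, max_timeout=2400):
    """Call attempt(timeout) with growing timeouts until it succeeds or the cap is hit."""
    timeout = initial_timeout
    while True:
        try:
            return attempt(timeout), timeout
        except TimeoutError:
            if timeout >= max_timeout:
                raise  # still timing out at the cap: treat it as a real failure
            timeout = min(timeout * factor, max_timeout)
```

The cap matters: without it, a genuinely hung step would never be reported as broken, which is the "detect broken stuff earlier or not" trade-off.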

22:08 <samueldr> this might be a balancing act, detect broken stuff earlier or not, and past failing tests aren't accounted for (and probably shouldn't)

22:08 <samueldr> things that would be nice to know: was the host overloaded?

22:08 * gchristensen looks at his nascent scheduler code

22:09 * samueldr was thinking about it

22:10 <samueldr> (hoping it would help guarantee *some* load?)

22:10 <timokau[m]> The test took 500s locally for me, a lot of which was spent waiting for the vm to come up

22:10 JosW has quit [Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/]

22:11 <timokau[m]> I feel like timeouts should be chosen very generous to avoid having to think about load too much

22:11 <timokau[m]> Hydra tests are not the place for benchmarks

22:11 <samueldr> maybe 5 minutes was a generous value at one point?

22:13 <samueldr> see also 95486ca306fc72d23ed4568b91afe35b6af9fab1 9bc10e12916979c5c620be5b521b9218a0077cba

22:13 <samueldr> https://github.com/NixOS/nixpkgs/commit/95486ca306fc72d23ed4568b91afe35b6af9fab1 https://github.com/NixOS/nixpkgs/commit/9bc10e12916979c5c620be5b521b9218a0077cba

22:13 <timokau[m]> Thats a good point, I know basically nothing about how the vm tests work but bringing up a simple headless vm probably shouldn't take that long

22:13 <samueldr> part of #49777

22:13 <{^_^}> https://github.com/NixOS/nixpkgs/pull/49777 (by srhb, 9 weeks ago, merged): Revert "NixOS tests: Wait for shell for 10x longer (50m)"

22:14 <samueldr> (learning along the way btw)

22:14 <samueldr> >> This suggests to me that the increased timeout as suspected doesn't actually do anything, because the failure will occur regardless of how long we wait. If that turns out wrong, we can always re-revert, but I think the root cause should be sought elsewhere within the VM test framework.

22:14 <gchristensen> tests used to reliably finish on my machine in <1min and no longer do

22:15 <gchristensen> (/!\ wildly making stuff up alert /!\) I wonder if spectre stuff has to do with it.

22:15 <timokau[m]> Looking at the tests matrix I feel like we need to change something fundamentally, that looks like success or failure is basically random

22:15 <samueldr> gchristensen: disable mitigations and microcode and test?

22:15 <samueldr> (on your machine)

22:16 <timokau[m]> I'd be very surprised if that would slow things down by more than 50%

22:16 <samueldr> timokau[m]: yes, *something* is up, I hope people better than I at this stuff will look at it and figure it out :)

22:17 <gchristensen> same

22:17 <samueldr> looks like it's `pti=off` to disable the kernel bits (unless there were other added)

22:17 <timokau[m]> We should be able to `git bisect` it, shouldn't we?

22:18 <timokau[m]> I'm trying to run that particular test on 17.03 right now

22:18 <samueldr> bisecting unreliable tests?

22:18 <timokau[m]> bisecting their runtime
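
Bisecting a runtime rather than a pass/fail result can be done with `git bisect run` and a wrapper that turns "too slow" into a failing exit code. A minimal Python sketch follows; the script layout and the threshold argument are assumptions, and the command to time would be whatever builds the test in question. It relies on the documented `git bisect run` convention that exit code 0 means good, 125 means skip, and other codes from 1 to 127 mean bad.

```python
import subprocess
import sys
import time


def classify(elapsed, succeeded, threshold):
    """Translate one timed run into a `git bisect run` exit code."""
    if not succeeded:
        return 125  # cannot judge this commit: ask git bisect to skip it
    return 0 if elapsed <= threshold else 1  # 0 = good (fast), 1 = bad (slow)


def main(argv):
    threshold = float(argv[1])  # seconds
    command = argv[2:]          # the command whose runtime is being bisected
    start = time.monotonic()
    succeeded = subprocess.run(command).returncode == 0
    return classify(time.monotonic() - start, succeeded, threshold)


if __name__ == "__main__" and len(sys.argv) > 2:
    sys.exit(main(sys.argv))
```

The threshold should sit well between the known fast and slow runtimes so that noise from host load does not flip the verdict; as noted in the discussion, this only works if the slowdown comes from nixpkgs and not from the host.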

22:18 <samueldr> or the apparent increase in time it takes?

22:18 <samueldr> ah

22:18 <timokau[m]> If it takes 500s locally, its not very surprising that a 300s timeout of some component gets triggered occasionally

22:18 <samueldr> though, if it's host-related (e.g. a kernel with spectre mitigations and its /dev/kvm) bisecting will also require the host to be touched

22:19 <timokau[m]> So either it shouldn't take 500s locally or the timeout is too strict

22:19 <timokau[m]> True

22:19 <samueldr> though the timeout AFAIUI is not at the beginning of it all, but at a point it's known to be already starting

22:19 <timokau[m]> Or we write a vm test for vm tests :D

22:19 <samueldr> timokau[m]: I think we'd need cycle-accurate tests :)

22:20 <timokau[m]> Whats that?

22:21 <samueldr> borrowing an emulation term, where (oversimplified) "the whole machine" is emulated; here I'm meaning where the wall clock is *not* synchronized with the host's clock, so e.g. even if it takes twice as long to test the tests on my machine, it wouldn't fail earlier in the execution

22:22 <samueldr> (half joking here, but I think to test a testing infra it'd be required)

22:24 <timokau[m]> Ah right, that makes sense

22:24 <samueldr> https://ieeexplore.ieee.org/document/5475901

22:24 <gchristensen> also remember metrics about the builders are public

22:24 <samueldr> seems it might exist

22:24 <gchristensen> and if you have ideas on how to tune them, I'm all ears

22:25 <samueldr> but possibly only academically?

22:25 <timokau[m]> The uefi-usb test took 118s on 17.03 (500 on master)

22:25 <gchristensen> all the best stuff is only available academically

22:25 <timokau[m]> Will re-test on master

22:29 <ekleog> samueldr: might be doable with minor tweaks to valgrind too, but would be much slower than regular qemu

22:30 <samueldr> here I'm budgeting infinite time in tradeoff to perfect reproducibility of the testing infra :)

22:30 <timokau[m]> Well its really infinite CPU time vs infinite developer time

22:36 <timokau[m]> I can reproduce the 5x increase in runtime of `tests.boot.uefiUsb.x86_64-linux` (from 100 to 500s) between 17.03 and current master

22:37 <timokau[m]> Time to sleep now though, if somebody else wants to do the actual bisect that would be amazing :)

22:39 <gchristensen> I can provide a big system for whoever wants to do that

22:58 sir_guy_carleton has joined #nixos-dev

23:13 sir_guy_carleton has quit [Ping timeout: 258 seconds]

23:17 sir_guy_carleton has joined #nixos-dev

23:44 <srhb> samueldr: Frankly my hypothesis, in light of the result of the revert, seems wrong.

23:44 <srhb> samueldr: We did see fewer failures with the increased timeout. Not all of them went away, but I'm fairly sure some did.

23:45 <samueldr> hm!

23:45 <srhb> I think there's just more than one cause here.

23:46 <samueldr> I'm waiting on a thing, probably done in ~1h where I run the test three times for all releases since 17.03 (release commit) and nixos-unstable to see if there's a progression or a sharp increase between one release and the other; then I'll be doing the same between the tags, somehow

23:46 <samueldr> I'm sure there's more than one cause :/

23:46 <srhb> Tweaking load on machines also has an impact etc. It's frankly a bit confusing, and really, really difficult to replicate on a smaller scale. Like, on my machine I have to load it massively and set the timeout to a few seconds.

23:46 <samueldr> and having a cycle-accurate emulation, even if sloooooooooow would help :(

23:47 <srhb> fwiw I intend to return to the issue when I'm out of my NixOS-hiatus :P

23:47 <samueldr> I'm thinking of maybe adding more data; e.g. dump the time it took where we set the alarm, and other stats in a file for each builds, into a build product; then we could at least use existing builds on hydra to gather data
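
The idea of dumping timings into a file could look something like the following Python sketch. The real test driver at the time is Perl, so this only illustrates the shape of the data; the name `record_timing`, the JSON-lines format, and the field names are assumptions, and how the file would be registered as a Hydra build product is left out.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def record_timing(label, out_path):
    """Time the enclosed block and append one JSON line with the result to out_path."""
    start = time.monotonic()
    status = "ok"
    try:
        yield
    except BaseException:
        status = "failed"
        raise
    finally:
        entry = {"label": label, "seconds": time.monotonic() - start, "status": status}
        with open(out_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
```

Wrapping the step where the alarm is set in `with record_timing("wait-for-shell", path): ...` would then leave one machine-readable line per step, so existing builds could be compared without parsing logs.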

23:47 <srhb> Yes, that would be very helpful.

23:48 <samueldr> I was surprised to see I had to parse the logs to get the amount of time it took to build a test

23:48 eadwu has quit [Ping timeout: 264 seconds]