gchristensen changed the topic of #nixos-dev to: NixOS Development (#nixos for questions) | https://hydra.nixos.org/jobset/nixos/trunk-combined https://channels.nix.gsc.io/graph.html | 18.09 release managers: vcunat and samueldr | https://logs.nix.samueldr.com/nixos-dev
kalbasit[m] is now known as yl[m]
joachifm has quit [Remote host closed the connection]
lopsided98 has quit [Ping timeout: 252 seconds]
lopsided98 has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 264 seconds]
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 264 seconds]
Lisanna has joined #nixos-dev
lassulus_ has joined #nixos-dev
lassulus has quit [Ping timeout: 244 seconds]
lassulus_ is now known as lassulus
sir_guy_carleton has joined #nixos-dev
lassulus has quit [Ping timeout: 268 seconds]
orivej has quit [Ping timeout: 244 seconds]
jtojnar has joined #nixos-dev
sir_guy_carleton has quit [Quit: WeeChat 2.2]
phreedom has quit [Remote host closed the connection]
phreedom has joined #nixos-dev
lassulus has joined #nixos-dev
__Sander__ has joined #nixos-dev
florianjacob has quit [Remote host closed the connection]
roberth has quit [Read error: Connection reset by peer]
thefloweringash has quit [Write error: Connection reset by peer]
timokau[m]1 has quit [Remote host closed the connection]
sphalerit has quit [Remote host closed the connection]
schmittlauch[m] has quit [Read error: Connection reset by peer]
Ericson2314 has quit [Remote host closed the connection]
yegortimoshenko has quit [Remote host closed the connection]
florianjacob has joined #nixos-dev
Ericson2314 has joined #nixos-dev
thefloweringash has joined #nixos-dev
roberth has joined #nixos-dev
schmittlauch[m] has joined #nixos-dev
yegortimoshenko has joined #nixos-dev
sphalerit has joined #nixos-dev
<gchristensen> ok so https://hydra.nixos.org/build/83240579#tabs-buildsteps timed out finally after 10hrs, this is with the host running with 4 cores per build. how about increasing buildCores to 12 and reducing jobs per machine? https://status.nixos.org/grafana/d/hkRCcV0mk/instance-metrics?orgId=1&from=1540894613238&to=1540979304316&var-instance=packet-t2a-1&var-instance=packet-t2a-2&var-machine=All
<gchristensen> cc srhb andi- for guidance
<andi-> sgtm
<andi-> That depends on the decrease in jobs, i.e. whether the overall "core" overbooking stays the same or not
<gchristensen> another option is setting one to big-parallel and the other to not
<gchristensen> big-parallel running fewer jobs with higher cores
<andi-> that means just big-parallel jobs and nothing else?
<gchristensen> we have two options there. if we set big-parallel in the "supported features" list, the machine will build big-parallel jobs and non-big-parallel jobs. if we set big-parallel in the "mandatory features" list, the machine will only build big-parallel jobs
<andi-> ok, so the mandatory case is where I fear we might "waste" idle time... But since we cannot enforce that we prioritize big-parallel vs "regular" jobs, it might be worth a shot.
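The "supported" vs. "mandatory" distinction maps onto two columns of the build machines file Hydra reads. A sketch with a hypothetical hostname (columns, roughly: URI, platforms, SSH key, maxJobs, speedFactor, supportedFeatures, mandatoryFeatures):
    ssh://root@builder x86_64-linux - 8 1 kvm,big-parallel -             # takes big-parallel and ordinary jobs
    ssh://root@builder x86_64-linux - 2 1 kvm,big-parallel big-parallel  # takes only big-parallel jobs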
<srhb> gchristensen: I forget what the verdict was on timeout increasing. There was some default being enforced on (some but not all?) builders despite setting meta.timeout and friends
<srhb> Fixing that is the least wasteful option (slow builds are fine as long as they eventually complete...)
<srhb> That said, we might not want to optimize for that right now given how utterly fragile our jobsets are these days...
<srhb> In which case more cores less jobs is the safer bet
<andi-> srhb: https://github.com/NixOS/hydra/issues/591 is the issue for that. See my comment with a potential fix.. Just haven't managed to reproduce it with reasonable timeouts locally :/
<{^_^}> hydra#591 (by cleverca22, 8 weeks ago, open): meta.timeout does not always work
<srhb> andi-: Right, thanks!
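For context, the timeout under discussion is the per-package meta attribute; a minimal sketch (the attribute is real, the package name and number are made up):
    stdenv.mkDerivation {
      name = "example-1.0";
      # ...
      meta.timeout = 36000;  # seconds; hydra#591 is about builders not always honoring this
    }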
<srhb> gchristensen: I don't think there's a reason to set big-parallel in mandatory AND increase cores/decrease jobs
<gchristensen> right. so let's just start with cores/jobs
<srhb> So right now my vote is more cores/less jobs -- at least until timeout is a safer bet and we've stabilized somewhat
<srhb> Right :)
<gchristensen> how does 12cores/9jobs sound?
<srhb> Out of how many cores?
<gchristensen> 96
aminechikhaoui has quit [Ping timeout: 264 seconds]
<srhb> So that's a target load of 1.1 per core?
<gchristensen> yeah I guess so
<srhb> Seems OK.
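Spelled out with the numbers from this thread (48 is the jobs setting changed below):
    before: 48 jobs x  4 cores = 192 core-slots on 96 cores, ~2.0x overbooked
    after:   9 jobs x 12 cores = 108 core-slots on 96 cores, ~1.1x overbooked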
<srhb> How much memory does that thing have anyway?
<gchristensen> enough that I've never had to check
<srhb> OK :)
<srhb> It does peak occasionally, according to those graphs
<srhb> But not regularly, so I guess as long as the cores*jobs product is less than it is today, we're fine.
<sphalerite> I think it was 128 or 256GB
<gchristensen> 125
<gchristensen> ok, build-jobs changed from 48 to 9
<srhb> gchristensen: Will be very interesting to see! :)
aminechikhaoui has joined #nixos-dev
<gchristensen> indeed, we'll have to watch the queue carefully
<srhb> Yeah.
<srhb> gchristensen: The recent deployment marker on those graphs is this change, right?
<andi-> if you hover over the triangle you should see a message
<gchristensen> yeah, the pink ones are the actual change. the blue one is a manually created one showing what I did.
<srhb> This is really sweet. :)
<srhb> Finally, a sensible use of activationScripts! :D
<gchristensen> oh the blue one appears per-graph... not exactly what I wanted, but close enough I guess.
<srhb> Yep.
<srhb> It would be cool to enrich the automatic ones with a link to the system nix-diff. :D
<gchristensen> motherofgod
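nix-diff compares two derivations, so a hypothetical enriched annotation could link output like this (generation numbers invented; assumes the store still knows the derivers of both system generations):
    nix-diff \
      $(nix-store --query --deriver /nix/var/nix/profiles/system-41-link) \
      $(nix-store --query --deriver /nix/var/nix/profiles/system-42-link)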
<globin> gchristensen: mind that profiles/system changes on dry-activate
<gchristensen> huh
<gchristensen> euoeuoeuoeunthoeunthoeunthoenuthoeunhnh
<gchristensen> oops lol
<Profpatsch> yeah!
alp has quit [Remote host closed the connection]
<gchristensen> nice, globin
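The code itself was never pasted into the channel; a minimal sketch of the idea, assuming a Grafana API token at a hypothetical path:
    # NixOS fragment: post a Grafana annotation on every activation
    system.activationScripts.deployMarker = ''
      ${pkgs.curl}/bin/curl -fsS \
        -H "Authorization: Bearer $(cat /var/keys/grafana-token)" \
        -H "Content-Type: application/json" \
        -d "{\"text\": \"activated $(readlink /nix/var/nix/profiles/system)\", \"tags\": [\"deploy\"]}" \
        https://status.nixos.org/grafana/api/annotations || true
    '';
    # caveat per globin above: profiles/system also changes on dry-activate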
<Profpatsch> long live Regus wifi
<andi-> that seems like a good password to me ;)
<gchristensen> that was me saying "is the wifi up?" and then pressing RET~. to kill the session, but having it come back at the last moment to send my message. :)
<andi-> not a fan of mosh?
<gchristensen> I had a few frustrating moments of dealing with firewall shenanigans and mosh and then went back to ssh :)
<andi-> fair enough
<gchristensen> and then I get to a bad wifi connection where it is high latency and wish I had fought that fight sooner.
<Synthetica> I did `sudo hydra-create-user synthetica --full-name "Patrick Hilhorst" --email-address "patrick@hilhorst.be" --password hahayeahwouldntyoulikethat--role admin --help`, but I can't log in through the web interface, what am I doing wrong?
<andi-> you are missing a space between the pw and --role? ;)
<gchristensen> and also probably don't want to pass --help
<Synthetica> Oh, copy-pasted wrong command, did it without help before (and with the space, that's just for irc :P)
alp has joined #nixos-dev
<gchristensen> globin: where do you make the directory?
<Synthetica> Could it be a permissions thing, that I shouldn't use sudo?
<srhb> Synthetica: sudo -u hydra
<srhb> (Usually)
<srhb> But if you got no error (which is puzzling) I guess it sounds right
<Synthetica> Hmm, with sudo -u hydra it gives an error
<srhb> You're not using the NixOS module?
<Synthetica> Yeah, I am
<srhb> ... odd
<Synthetica> With sudo -u hydra I get an error
<Synthetica> Maybe because I also did hydra-init-db as root?
<srhb> Yes, that sounds wrong as well.
<srhb> The module should take care of that. Nuke the db and start over.
<globin> gchristensen: think it's the runtime directory in the module or similar, will check in ~1 hour when I'm back at my laptop
<arianvp> I want to write a regression test that involves changing the nixos config and asserting that the right systemd units are restarted
<gchristensen> globin: please do :) let me know and I'll switch over
<arianvp> Is there an easy way to do that with the nixos testing infra?
<arianvp> Can you change the config of a VM inside the VM itself?
<gchristensen> arianvp: sure, look at the installer.nix test
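A rough sketch of the shape such a test can take; the installer.nix test gchristensen points at is the full-blown version, and the service here is just a placeholder:
    import ./make-test.nix ({ pkgs, ... }: {
      name = "unit-restart-regression";
      nodes = {
        machine = { ... }: { services.nginx.enable = true; };
        # same config plus the change under test; its toplevel is reachable
        # from the test VM because the guest mounts the host store
        changed = { ... }: {
          services.nginx.enable = true;
          services.nginx.appendConfig = "# tweaked";
        };
      };
      testScript = { nodes, ... }: let
        target = nodes.changed.config.system.build.toplevel;
      in ''
        $machine->waitForUnit("nginx.service");
        my $before = $machine->succeed("systemctl show nginx.service -p ActiveEnterTimestamp");
        $machine->succeed("${target}/bin/switch-to-configuration test");
        my $after = $machine->succeed("systemctl show nginx.service -p ActiveEnterTimestamp");
        die "nginx was not restarted" if $before eq $after;
      '';
    })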
<arianvp> Oh and could you add me to ofborg? :D
<{^_^}> ofborg#256 (by arianvp, 1 week ago, open): Add me to extra known users
<arianvp> Thanks!
<gchristensen> (still need to deploy)
<gchristensen> ok done arianvp
<andi-> could someone abort https://hydra.nixos.org/build/83304979 ? It is just wasting time :/
phreedom has quit [Quit: No Ping reply in 180 seconds.]
phreedom has joined #nixos-dev
<andi-> thanks :)
<srhb> andi-: panic? Fun
init_6 has joined #nixos-dev
<gchristensen> tail ~/.weechat/logs/irc.freenode.#nixos-dev.weechatlog | grep "someone abort https://hydra.nixos.org/build/" | sed -e 's#^.*https://hydra.nixos.org/build/##' -e 's#[^0-9]*$##' | xargs -I{} echo curl https://hydra.nixos.org/build/{}/cancel
<gchristensen> minus an echo, plus some creds ...
<andi-> you are really evolving towards a cyborg :P
<gchristensen> I don't actually do that, but it is something I've dreamed of :P
<srhb> Everything but chromium is green on the latest trunk-combined tested eval, too... So close.
<andi-> I would like to know if working on more granular permissions for hydra is something that we would want.. I opened that one PR that allows restarting of a single job with the "restart-jobs" role but no feedback yet.. I'd also be up to introduce more / different roles if we think that is a good thing..
<andi-> srhb: yeah, let's hope the scheduler puts it on a node that doesn't honor 10h timeouts :D
<srhb> indeed.
<srhb> I know it builds. I have it locally. So 70.x is not broken, just slow as molasses.
<andi-> e.g. packet-epyc-1 seems to just work fine
<srhb> Fingers crossed.
<andi-> but that is an older eval.. better than none though
<srhb> 15h is too long though. That job really needs to hit a node with more cores.
<srhb> It took like 4-5 hours on my laptop...
<gchristensen> requiredSystemFeatures = [ "bountiful-cores" ];
<srhb> Yes
<srhb> I've suggested this before, it could help a lot. :)
<srhb> It needn't even be a big builder, just a very low-jobs one.
<srhb> I think that was essentially the original intention behind big-parallel, but we've misused it.
<andi-> could we somehow slice a build machine into being "two" builders, just with different settings and one having more CPU preference than the other?
<andi-> (technically we can, not sure if it makes much sense)
<srhb> andi-: vms?
<gchristensen> I'm not sure we have? only ~5 jobs have big-parallel, srhb
<srhb> gchristensen: It's more about which builders get it, and what maxJobs they have :)
<srhb> The jobs don't matter much.
<andi-> srhb: without the overhead of VMs; I am thinking of something like preemptive scheduling that prefers the "bountiful-cores" builder
<gchristensen> I think the reason that has happened is due to being resource constrained and feeling not great about depriving the rest of the builds with those jobs.
<srhb> Yep, I understand the reasoning.
<srhb> I too like being environmentally and economically friendly. :-P
<gchristensen> I see no reason why one machine can't be in the hydra machines file twice: once with big-parallel, one job, and many cores, and once without big-parallel and few cores
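In machines-file terms that is just two lines differing in the maxJobs and features columns, using the same hypothetical format as above (the per-job core count itself still comes from the builder's own nix settings):
    ssh://root@bigbox x86_64-linux - 1  2 kvm,big-parallel big-parallel  # one job at a time, whole machine
    ssh://root@bigbox x86_64-linux - 12 1 kvm              -             # ordinary jobs, many at once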
<srhb> andi-: Your restart thing got merged. :o
<andi-> srhb: \o/ must check what's up with my github notifications.. I get mails but they don't show up on the UI
<aanderse> i'm poking around hydra and trying to find out if a package is broken under master... i haven't spent much time looking at hydra, so i'm kinda lost atm (can't seem to find it)
<andi-> gchristensen: things like rustc that just eat what they can are not optimal for such setups without further restricting things
<aanderse> the package name is speed_dreams
<gchristensen> andi-: I think nix-daemon restricts CPUs?
<gchristensen> does it not?
<srhb> aanderse: If it's in one of the sets that actually receives evaluations, the easiest way to go is nixpkgs->trunk->search latest eval
<andi-> gchristensen: I think not on my machines..
<andi-> might be able to..
<Synthetica> srhb: gchristensen: Fixed it, the solution was passing -i to sudo
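So the working invocation was presumably along these lines (password elided):
    sudo -i hydra-create-user synthetica \
      --full-name 'Patrick Hilhorst' \
      --email-address 'patrick@hilhorst.be' \
      --password '...' --role admin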
<andi-> aanderse: last build around 2013, was it removed?
<srhb> No, it's still there.
<srhb> Anyway, this should probably be in #nixos :-)
<aanderse> andi-: ah that makes sense... the build is failing on my machine
<andi-> ahh, hydraPlatforms = [];
<andi-> it just isn't being scheduled on hydra
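That is, the package's meta contains the attribute that tells Hydra to schedule it on no platform:
    meta.hydraPlatforms = [];  # empty list: never built by Hydra, users build it locally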
<aanderse> i'm doing a version bump on a package and noticed speed_dreams depends on it so thought that a good thing to test that it still builds
<srhb> (fwiw #nixos is also about package development while this channel is more about Nix infrastructure and wide-ranging changes to nixpkgs/NixOS)
<aanderse> srhb: understood, thanks :)
<Synthetica> (btw, services.hydra.enable = true; should really be mentioned in https://nixos.org/hydra/manual/)
<globin> gchristensen: that should be all the code, I'll move that to our open source code too, or maybe even nixpkgs
Lisanna has quit [Remote host closed the connection]
<gchristensen> nice, thanks globin
<gchristensen> andi-: your hydra patch is deployed :)
<andi-> \o/
orivej has joined #nixos-dev
genesis has quit [Remote host closed the connection]
genesis has joined #nixos-dev
init_6 has quit []
aanderse has quit [Ping timeout: 252 seconds]
<arianvp> gchristensen: bot doesn't seem to like me yet
<{^_^}> #48771 (by arianvp, 1 week ago, open): nixos/containers: Introduce several tweaks to systemd-nspawn from upstream systemd
<gchristensen> arianvp: it is building: https://logs.nix.ci/?key=nixos/nixpkgs.48771
<gchristensen> right now (it is silly) ofborg posts logs after things are done.
<arianvp> aaah
<globin> flokli: I think I have a working version of the gitlab test, have to clean it up and will ping you in the PR
<flokli> globin: cool, thanks :-)
<flokli> lmk if you need some help in polishing. I added some notes on what I saw broken to https://github.com/NixOS/nixpkgs/pull/43844#issuecomment-424902207
ekleog has quit [Quit: back soon]
ekleog has joined #nixos-dev
aminechikhaoui has quit [Ping timeout: 252 seconds]
aminechikhaoui has joined #nixos-dev
<niksnut> chromium takes nearly 18 hours to build :S
<niksnut> I feel strongly inclined to remove it from the nixpkgs channel
<samueldr> imho this would be a net loss for the user experience if end-users then had to compile it :/
<fpletz> maybe we could get the jumbo builds to work https://chromium.googlesource.com/chromium/src/+/lkgr/docs/jumbo.md
<Synthetica> Can't we package a binary, and have a chromium-from-source for people that really want it?
<niksnut> I guess people can use google-chrome instead?
<fpletz> globin tried that a few weeks ago but just enabling it didn't work
<andi-> Or can we maybe build fewer bundled libraries? It seems to redistribute a lot of stuff.. I know that's not how they want it to be done, but firefox also bundles a bit and we still use system libs
<gchristensen> fpletz: with jumbo builds turned on I was able to build chromium in 20 minutes on the epyc machine
<Synthetica> Oh damn
<samueldr> gchristensen: did you have a number for non-jumbo?
<globin> gchristensen: yes, but that broke linking for us.. :/
<globin> gchristensen: did that succeed for you?
<fpletz> gchristensen: wow, did you need to add changes to the chromium expression?
<gchristensen> globin: the build finished at least, but I didn't try it.
<{^_^}> #43658 (by Ekleog, 15 weeks ago, closed): chromium: do jumbo builds
<globin> hmm, ok that didn't work for me
<gchristensen> samueldr: before was 2hrs, heh
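For reference, "enabling it" is essentially one gn argument in the chromium expression, which is what the closed PR #43658 did (flag names from the linked jumbo docs):
    use_jumbo_build = true          # merge many translation units per compiler invocation
    # jumbo_file_merge_limit = ...  # optional knob for how aggressively to merge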
<gchristensen> it seems overall our builds are straining on overburdened build nodes.
<gchristensen> notably, big-parallel build nodes.
<gchristensen> builds*
<gchristensen> niksnut: does Nix restrict the cores a build job can access based on build-cores?
<aminechikhaoui> I think build-cores is mainly passed to make -j
<niksnut> no
<gchristensen> ah
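In other words build-cores is advisory: it surfaces as NIX_BUILD_CORES in the sandbox and well-behaved packages feed it to make, but nothing pins the build to a cpuset. The opt-in side, sketched with a hypothetical package:
    stdenv.mkDerivation {
      name = "example-1.0";
      enableParallelBuilding = true;  # stdenv then runs make -j$NIX_BUILD_CORES
    }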
<gchristensen> maybe the thing to do is dedicate a node to big-parallel builds, set cores=0 and max-jobs=1 for a few days and see what happens. I feel we don't have enough data about how reliability and timing changes based on how loaded machines are. am I wrong?
<gchristensen> (yes yes environment, but 18hrs of chromium maybe could be just 2hrs of chromium and be more efficient)
<Profpatsch> niksnut: aszlig mentioned that they've put in half of Chrome OS by now.
<Profpatsch> Less of a browser, more of an operating system …
<Profpatsch> Though I still prefer it over FF tbh
<globin> gchristensen: is it possible for ofborg to run tests in nixpkgs that aren't included in release.nix? might be an option for gitlab, which probably uses too much memory for us to want it run on hydra.. trying with 4GB after OOM with 2GB
<gchristensen> right now, no
<gchristensen> though I wonder if the builders are capable of such large builds; since I don't manage them all, I don't know
<globin> fwiw mine is
<gchristensen> https://github.com/NixOS/hydra/blob/adf59a395993d5ed1d7a31108f7666195f789c99/src/hydra-queue-runner/hydra-queue-runner.cc#L579 anyone know the difference between total step time and total step build time?
<shlevy> I've caught the NixCon cold so obviously I'm sitting here thinking about a new configuration scheme: just like object capabilities replace imperative global namespaces and ACLs with combined imperative designation and authority, we could have "declarative capabilities" to replace the module system's declarative global namespace (and lack of ACLs) with combined declarative designation and authority
<shlevy> A component that *creates* a capability expresses it through a function argument, whereas a component that *uses* a capability expresses it through setting attributes. These can be composed in arbitrary ways, e.g. if services A, B, and C each need a database, services A and B can be "passed" a capability to one postgres instance and C can be passed a capability to another, without having to rewrite A, B, C, or the postgres functionality
<shlevy> And it's completely in the control of the user which components have the right to modify which aspects of the overall configuration
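shlevy's A/B/C example, as a hedged sketch in plain Nix with the module system set aside:
    let
      makePostgres = { port }: {
        inherit port;
        connstr = "postgresql://localhost:${toString port}";
      };
      serviceA = { db }: { dsn = db.connstr; };
      serviceB = { db }: { dsn = db.connstr; };
      serviceC = { db }: { dsn = db.connstr; };
      db1 = makePostgres { port = 5432; };
      db2 = makePostgres { port = 5433; };
    in {
      a = serviceA { db = db1; };  # A and B are handed the same capability
      b = serviceB { db = db1; };
      c = serviceC { db = db2; };  # C gets a different instance, nothing rewritten
    }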
<niksnut> I've been thinking about it in terms of Nix configuration modules: https://gist.github.com/edolstra/29ce9d8ea399b703a7023073b0dbc00d#nixos-services
<niksnut> this could already be accomplished using NixOS submodules btw
<simpson> shlevy: Yessssss.
<shlevy> simpson: That's what you get for sharing a bunch of interesting ocap links just before I have nothing to do but read :P
<simpson> shlevy, niksnut: FWIW, in ocap languages this is a fundamentally-common pattern, to have a *maker* function which produces a parameterized object.
<shlevy> simpson: can you point to an example?
<simpson> shlevy: This section of our docs is pretty readable if you can read Python: https://monte.readthedocs.io/en/latest/rosetta-python.html#objects
<shlevy> Ah I see, thanks
<simpson> In Monte, modules have an outermost layer of exported values, which are "DeepFrozen" (transitively immutable), and those are usually makers for objects which will be live and mutable at runtime. This gives us our weird statically-linked dynamically-typed ability.
<Profpatsch> Isn’t that just Ocaml modules?
<simpson> Their module system is a big inspiration, yeah. Racket's too.
drakonis1 has joined #nixos-dev
Mic92 has quit [Quit: WeeChat 2.2]
Mic92 has joined #nixos-dev
__Sander__ has quit [Quit: Konversation terminated!]
Lisanna has joined #nixos-dev
phreedom has quit [Ping timeout: 256 seconds]
Taneb is now known as GHOSTLY_SPOOK
GHOSTLY_SPOOK is now known as Taneb
<shlevy> simpson: Implementing unforgeable references on top of a language with no encapsulation is fun :D
orivej has quit [Ping timeout: 268 seconds]
<simpson> No kidding.
drakonis1 has quit [Quit: WeeChat 2.2]
sir_guy_carleton has joined #nixos-dev
orivej has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 268 seconds]
phreedom has joined #nixos-dev
drakonis has joined #nixos-dev
drakonis_ has quit [Ping timeout: 264 seconds]
drakonis has quit [Read error: Connection reset by peer]
drakonis has joined #nixos-dev
drakonis_ has joined #nixos-dev
drakonis has quit [Ping timeout: 250 seconds]
drakonis has joined #nixos-dev
pie__ has quit [Ping timeout: 272 seconds]
zarel has joined #nixos-dev
<andi-> Yet another test that has been running in circles for a while: https://hydra.nixos.org/build/83319589
<clever> machine# [24316.242222] rcu_sched kthread starved for 23899897 jiffies! g295260 c295259 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
<clever> andi-: and its epyc-1 again
<andi-> haven't paid attention to whether it is the same machine
<clever> epyc has been failing a lot lately, but it might also be chance and getting more jobs than others
<clever> thats the guest kernel, not the host kernel thats starving
<andi-> I know
<clever> the vm may be jammed internally
<andi-> but I'd expect that to happen when the host is saturated
<andi-> [ 277.206535] BUG: Bad page state in process xsltproc pfn:12b48
<andi-> thats the first thing that looks odd
<clever> yeah, thats usually bad
<andi-> let's see what the build from earlier today said..
<clever> /home/clever/apps/linux/mm/page_alloc.c: pr_alert("BUG: Bad page state in process %s pfn:%05lx\n",
<clever> machine# [ 277.211914] page dumped because: non-NULL mapping
<clever> andi-: something went terribly wrong while freeing pages in the kernel
<andi-> Yeah, I would hope it is a hardware issue because otherwise we're in for a nice hunt
<clever> andi-: i would almost say its bad ram, but its a vm, so bad host ram? ....
<andi-> on the other hand it could be a bug in the kernel for that CPU model..
<andi-> while it is running the latest 4.14 kernel (in the VM), that might not be true for the host, which might still be missing something
<andi-> anyway have to get a bit of sleep now.. maybe it was just bad luck ;-)
zarel has quit [Quit: Leaving]
<Mic92> globin: when is the first rfc meeting?