#nixos-infra on 2021-04-16

2020-03-23 19:08 samueldr changed the topic of #nixos-infra to: NixOS infrastructure | logs: https://logs.nix.samueldr.com/nixos-infra/

06:15 cole-h has quit [Ping timeout: 240 seconds]

11:19 supersandro2000 has quit [Remote host closed the connection]

11:19 supersandro2000 has joined #nixos-infra

12:01 adisbladis has quit [Quit: ZNC 1.8.2 - https://znc.in]

13:11 <lukegb> Hmm, I'm not sure LLVM 12 builds properly on wendy - or at least, one of the exegesis benchmark tests doesn't pass. Looking at the previous successful builds, they all happened on a Packet builder

13:50 <hexa-> I think tests do not run on wendy anymore

13:50 <hexa-> https://github.com/NixOS/nixos-org-configurations/issues/146#issuecomment-817792595

14:17 cole-h has joined #nixos-infra

14:31 <lukegb> hexa-: those are nixos tests; I meant the LLVM12 package tests

17:54 <samueldr> are the LLVM12 package tests run on the same machine that built the package?

17:55 <samueldr> depending on the exact failure, maybe LLVM12 is configured with "too new" CPU features in optimizations?

18:29 <hexa-> mac{4,5,6,8}-guest are idle with over 20k x86_64-darwin jobs queued

18:34 <gchristensen> unfortunately the queue runner prioritizes jobsets, not resource utilization

19:23 <lukegb> samueldr: yeah, these are the checkPhase tests

19:24 <lukegb> I think that because llvm is marked big-parallel and wendy now only builds big-parallel things (?), the package will get stuck

19:24 <samueldr> we'd need to know why those tests fail on wendy, and fix the cause

19:25 * lukegb nods

19:25 <lukegb> I saw someone complaining about it on IRC a while back

19:25 <lukegb> but I can't reproduce it on any of the machines I've got lying around

19:25 <samueldr> we have a somewhat big blind spot with cpu features

19:26 <lukegb> what CPU does wendy have?

19:26 <lukegb> (fwiw: I think this has been broken for a while: https://logs.nix.samueldr.com/nixos/2020-11-15#;roconnor)

19:27 <samueldr> plausible

20:53 cole-h has quit [Ping timeout: 265 seconds]

22:18 <lukegb> Bah, I can't get this build to fail, even on an Opteron of a similar vintage to what I assume is in wendy :(

22:20 <gchristensen> wendy: https://gist.github.com/grahamc/c55614bf0a2fae00fdc622177217dbc5

22:38 <lukegb> Ah, I have slightly newer Opterons in this machine, the 6276

22:40 <lukegb> Hmm, which means I have AVX/SSE4.1/SSE4.2/SSSE3. Bah.
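The feature comparison above can be checked mechanically on a Linux builder. What follows is a minimal sketch, not something anyone in the channel ran: the helper names `cpu_flags` and `missing_features` are hypothetical, and it assumes the x86 layout of `/proc/cpuinfo`, where features sit on a `flags` line under the kernel's spellings (`ssse3`, `sse4_1`, `sse4_2`, `avx`).

```python
# Hypothetical helpers for illustration; not taken from the log.
# Assumes the x86 Linux /proc/cpuinfo layout with a "flags" line.
WANTED = {"ssse3", "sse4_1", "sse4_2", "avx"}


def cpu_flags(cpuinfo_text):
    """Return the feature flags found on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, wanted=WANTED):
    """Return the wanted flags that the CPU does not advertise, sorted."""
    return sorted(set(wanted) - cpu_flags(cpuinfo_text))


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Not Linux, or /proc is unavailable: every wanted flag is reported.
        text = ""
    print("missing:", missing_features(text) or "none")
```

Running it on two machines and comparing the output would show whether a reproduction box really lacks the same features as the failing builder. On ARM the kernel uses a `Features` line instead, so this sketch would report every flag as missing there.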

22:44 <gchristensen> I mean, I dunno, I think samueldr has a good point w.r.t. sloppy cpu features

22:44 <samueldr> lukegb: and wendy doesn't?

22:44 <gchristensen> it'd be a shame to disable wendy but I have to wonder what sort of weird performance things are coming out of this

22:45 <samueldr> gchristensen: performance things?

22:45 <lukegb> samueldr: According to this random website, apparently not

22:45 <samueldr> lukegb: good, wanted to be sure I understood what you said right

22:45 <samueldr> gchristensen: it's less about performance, but more about producing working outputs for architectures we claim to support

22:46 <gchristensen> like not detecting sse4, randomly building something performance sensitive without sse4, and then hitting users who have mysteriously good then bad performing code paths

22:46 <samueldr> things like maths software are fine if they don't support those earlier CPUs, but should be tagged as such in some way imo

22:47 <samueldr> right, but my opinion is that performance sensitive things like that should specify in some way that they need cpu features

22:47 <samueldr> (and configure the package to be built that way, without detection, but that's another issue)

22:48 <lukegb> My personal 2c is it's probably fine to pick *a* baseline as long as we're clear on what it is and how to change it? Like, do we want to support everything k8+ for x86-64, ad infinitum?

22:48 <samueldr> right now that's the baseline "x86_64"

22:48 <samueldr> another option would be to specify a higher baseline

22:48 <samueldr> but still wouldn't solve gchristensen's concern about detection

22:49 <samueldr> and we probably still have the same issue in ARM land

22:49 <samueldr> and for even newer CPU features

22:49 <samueldr> (in x86_64)

22:49 <lukegb> It would be nice to be able to e.g. tag things like Chrome as SSE3+-only

22:50 <samueldr> it would be helpful if we could scan binaries

22:51 <samueldr> though even that wouldn't be entirely helpful, as it could do feature detection at runtime

22:51 <samueldr> which is fine

22:51 <lukegb> I mean in the ideal case we'd have e.g. something like intel cc's runtime detection

22:51 <lukegb> but without the vendor bias :p

22:52 <lukegb> this has drifted a bit from the infra discussion of "llvm is bork and probably will be for a while"

22:52 <gchristensen> :)

22:52 <lukegb> one way of "fixing" it would be to untag llvm as big-parallel, which means it might end up getting scheduled on a packet worker

22:52 <lukegb> downside is... it's kinda big-parallel

22:53 <gchristensen> or retag wendy

22:53 <gchristensen> this is a bandaid of course

22:53 <lukegb> or disable the failing LLVM test... but I'd kinda like to understand why it's failing

22:53 <samueldr> or, as I said, understand why the test fails

22:53 * lukegb nods

22:53 <samueldr> is it LLVM producing binaries that may be using more advanced features?

22:53 <lukegb> well, that's the interesting thing

22:53 <samueldr> or is the test run testing opted-in advanced features?

22:54 <samueldr> if it's the latter, then it can't run on a CPU without the features, but is fine otherwise

22:54 <lukegb> it's testing a CMOV instruction that dates back to Pentium Pro(!)

22:54 <samueldr> could it be related to how some features get encoded?

22:54 <samueldr> I don't know much about that

22:54 <lukegb> it seems to be this test: https://github.com/llvm/llvm-project/blob/main/llvm/test/tools/llvm-exegesis/X86/uops-CMOV16rm-noreg.s

22:54 <samueldr> if it's the former, a feature being output by the compiler without being asked for it, then the compiler may produce other binaries that won't run on wendy-class hardware

22:55 <samueldr> which is probably undesirable entirely

22:55 <lukegb> which basically says, from what I understand, generate a microop analysis of CMOV16rm, and then check that it can be parsed again

22:55 <samueldr> I see the source, it doesn't help me :)

22:56 <samueldr> unlikely, but maybe we're exercising llvm in a way the authors did not intend

22:56 <samueldr> has someone got in touch with them to have their opinion?

22:56 <lukegb> I'd kinda like to get some of the intermediate build outputs before we do that

22:56 <lukegb> but I need some time with someone with access to the machine that it's broken on to make that happen, I think

22:57 <samueldr> exotic hardware is hard

22:58 <gchristensen> lukegb: I can help with that

22:58 <lukegb> gchristensen: ooh, that'd be good

22:58 <gchristensen> I'll need to be "dumb" hands though

22:59 <gchristensen> let me know when you're ready

23:01 <lukegb> gchristensen: https://gist.github.com/lukegb/d6bf0479ca9cdece46b34279a6089b69 is my current state dump

23:02 <gchristensen> lukegb: so, ready now?

23:02 <lukegb> a copy of the CMOV16rm-uops.yaml, and the stdout/stderr of the two llvm-exegesis invocations should be enough to at least vaguely work out what's going on

23:02 <lukegb> yeah

23:06 <gchristensen> load needs to go down a bit before we can do it

23:06 <lukegb> ah, fair

23:07 <lukegb> I'll be around for a bit, if you are - if it's a bad time we can reschedule, don't want to put pressure on you or anything 😅

23:07 <gchristensen> ssh ro-FfKLwDY43sSv2fUFSXFKDLS5n@lon1.tmate.io

23:08 <lukegb> oh joys

23:10 <gchristensen> whoops.

23:11 <lukegb> hah

23:11 <gchristensen> maybe this second time the CPU will branch predict all the delta resolution

23:16 <gchristensen> lukegb: will nix-shell -I nixpkgs=$(pwd) -p llvmPackages_12.llvm work?

23:16 <lukegb> should do

23:16 <lukegb> better than nix run but my fingers always type that instead :3

23:17 <lukegb> that looks good at least

23:17 <gchristensen> there you go

23:17 <lukegb> ah, segfault

23:18 <lukegb> can you grab the coredump

23:18 <gchristensen> w/ coredumpctl yea?

23:19 <lukegb> I guess

23:19 <lukegb> yeah

23:20 <lukegb> (unless it just dumped the core into the local dir, but... probably coredumpctl)

23:22 <sterni> what is going on here

23:22 <lukegb> llvm-exegesis debugging

23:22 <lukegb> it, err, segfaults. but only on wendy. and apparently only under this one test

23:23 <sterni> still better than debugging llvm 7 libunwind cross compilation :'(

23:23 <sterni> or worse who knows

23:23 <sterni> actually worse

23:25 <gchristensen> lukegb: DM'd you a wormhole command to get it

23:26 <lukegb> gchristensen: can you grab the .yaml file as well?

23:27 <gchristensen> which?

23:28 <lukegb> err

23:28 <lukegb> CMOV16rm-uops.yaml

23:29 <gchristensen> wormhole receive 4-borderline-sailboat

23:30 <gchristensen> lukegb:

23:31 <lukegb> cheers! thanks, I think that should be enough to at least work out what's up

23:35 <gchristensen> it might be cool to automatically upload coredumps from builds

23:35 <lukegb> gchristensen++

23:36 <gchristensen> :) thanks for digging in

23:38 <gchristensen> [root@wendy:~/scratch/nixpkgs]# coredumpctl list | wc -l

23:38 <gchristensen> 10185

23:38 <lukegb> hahaha

23:42 supersandro2000 has quit [Killed (rothfuss.freenode.net (Nickname regained by services))]

23:42 supersandro2000 has joined #nixos-infra