samueldr changed the topic of #nixos-infra to: NixOS infrastructure | logs: https://logs.nix.samueldr.com/nixos-infra/
cole-h has quit [Ping timeout: 240 seconds]
supersandro2000 has quit [Remote host closed the connection]
supersandro2000 has joined #nixos-infra
adisbladis has quit [Quit: ZNC 1.8.2 - https://znc.in]
<lukegb> Hmm, I'm not sure LLVM 12 builds properly on wendy - or at least, one of the exegesis benchmark tests doesn't pass. Looking at the previous successful builds, they all happened on a Packet builder
<hexa-> I think tests do not run on wendy anymore
cole-h has joined #nixos-infra
<lukegb> hexa-: those are nixos tests; I meant the LLVM12 package tests
<samueldr> are the LLVM12 package tests run on the same machine that built the package?
<samueldr> depending on the exact failure, maybe LLVM12 is configured with "too new" CPU features in optimizations?
<hexa-> mac{4,5,6,8}-guest are idle with over 20k x86_64-darwin jobs queued
<gchristensen> unfortunately the queue runner prioritizes jobsets, not resource utilization
<lukegb> samueldr: yeah, these are the checkPhase tests
<lukegb> I think because llvm is marked big-parallel and wendy now only builds big-parallel things (?) that the package will get stuck
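(Note: the scheduling constraint lukegb describes can be sketched roughly as below. This follows the feature-matching rule documented for Nix remote builders: a machine is eligible only if it offers every feature the derivation requires, and every mandatory feature of the machine appears in the derivation's requiredSystemFeatures. The feature sets given for wendy and for the Packet builder are assumptions for illustration, not the real infra configuration, and the Hydra queue runner may differ in detail.)

```python
def can_build(required, supported, mandatory):
    """Return True if a machine may build a derivation.

    required  -- the derivation's requiredSystemFeatures
    supported -- features the machine offers
    mandatory -- features the machine insists the derivation asks for
    """
    required, supported, mandatory = set(required), set(supported), set(mandatory)
    # Mandatory features count as supported as well.
    offers_everything = required <= (supported | mandatory)
    # The machine refuses derivations that do not ask for its mandatory features.
    derivation_qualifies = mandatory <= required
    return offers_everything and derivation_qualifies


# Hypothetical feature sets, for illustration only.
WENDY = {"supported": {"big-parallel"}, "mandatory": {"big-parallel"}}
PACKET = {"supported": {"kvm", "nixos-test"}, "mandatory": set()}

llvm_tagged = {"big-parallel"}  # llvm as currently marked
llvm_untagged = set()           # llvm with the big-parallel tag removed

print(can_build(llvm_tagged, **WENDY))     # tagged llvm on wendy
print(can_build(llvm_tagged, **PACKET))    # tagged llvm on the Packet builder
print(can_build(llvm_untagged, **PACKET))  # untagged llvm on the Packet builder
```

(Under these assumed sets, tagged llvm is only eligible on wendy, which is why a test that fails only there would keep the package stuck.)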
<samueldr> we'd need to know why those tests fail on wendy, and fix the cause
* lukegb nods
<lukegb> I saw someone complaining about it on IRC a while back
<lukegb> but I can't reproduce it on any of the machines I've got lying around
<samueldr> we have a somewhat big blind spot with cpu features
<lukegb> what CPU does wendy have?
<lukegb> (fwiw: I think this has been broken for a while: https://logs.nix.samueldr.com/nixos/2020-11-15#;roconnor)
<samueldr> plausible
cole-h has quit [Ping timeout: 265 seconds]
<lukegb> Bah, I can't get this build to fail, even on an Opteron of a similar vintage to what I assume is in wendy :(
<lukegb> Ah, I have slightly newer Opterons in this machine, the 6276
<lukegb> Hmm, which means I have AVX/SSE4.1/SSE4.2/SSSE3. Bah.
<gchristensen> I mean, I dunno, I think samueldr has a good point w.r.t. sloppy cpu features
<samueldr> lukegb: and wendy doesn't?
<gchristensen> it'd be a shame to disable wendy but I have to wonder what sort of weird performance things are coming out of this
<samueldr> gchristensen: performance things?
<lukegb> samueldr: According to this random website, apparently not
<samueldr> lukegb: good, wanted to be sure I understood what you said right
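(Note: rather than relying on a random website, the comparison lukegb makes can be checked directly on a Linux builder by reading the `flags` line of /proc/cpuinfo. The sketch below is a minimal illustration; the helper names `parse_cpu_flags` and `missing_features` are hypothetical. The flag spellings `avx`, `sse4_1`, `sse4_2` and `ssse3` are the ones the Linux kernel uses for the features mentioned above.)

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text.

    Only the first "flags" line is used; on x86 Linux every core
    normally reports the same list.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags, required):
    """Return the required features absent from flags, sorted by name."""
    return sorted(set(required) - set(flags))


# The features lukegb lists for the Opteron 6276.
FEATURES_OF_INTEREST = ["avx", "sse4_1", "sse4_2", "ssse3"]
```

(On a builder this could be used as `missing_features(parse_cpu_flags(open("/proc/cpuinfo").read()), FEATURES_OF_INTEREST)`; an empty list means all four features are present.)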
<samueldr> gchristensen: it's less about performance, but more about producing working outputs for architectures we claim to support
<gchristensen> like not detecting sse4, randomly building something performance sensitive without sse4, and then hitting users who have mysteriously good then bad performing code paths
<samueldr> things like maths software is fine if it doesn't support those earlier CPUs, but should be tagged as such in some way imo
<samueldr> right, but my opinion is that performance sensitive things like that should specify in some way that they need cpu features
<samueldr> (and configure the package to be built that way, without detection, but that's another issue)
<lukegb> My personal 2c is it's probably fine to pick *a* baseline as long as we're clear on what it is and how to change it? Like, do we want to support everything k8+ for x86-64, ad infinitum?
<samueldr> right now that's the baseline "x86_64"
<samueldr> another option would be to specify a higher baseline
<samueldr> but still wouldn't solve gchristensen's concern about detection
<samueldr> and we probably still have the same issue in ARM land
<samueldr> and for even newer CPU features
<samueldr> (in x86_64)
<lukegb> It would be nice to be able to e.g. tag things like Chrome as SSE3+-only
<samueldr> it would be helpful if we could scan binaries
<samueldr> though even that wouldn't be entirely helpful, as it could do feature detection at runtime
<samueldr> which is fine
<lukegb> I mean in the ideal case we'd have e.g. something like intel cc's runtime detection
<lukegb> but without the vendor bias :p
<lukegb> this has drifted a bit from the infra discussion of "llvm is bork and probably will be for a while"
<gchristensen> :)
<lukegb> one way of "fixing" it would be to untag llvm as big-parallel, which means it might end up getting scheduled on a packet worker
<lukegb> downside is... it's kinda big-parallel
<gchristensen> or retag wendy
<gchristensen> this is a bandaid of course
<lukegb> or disable the failing LLVM test... but I'd kinda like to understand why it's failing
<samueldr> or, as I said, understand why the test fails
* lukegb nods
<samueldr> is it LLVM producing binaries that may be using more advanced features?
<lukegb> well, that's the interesting thing
<samueldr> or is the test run testing opted-in advanced features?
<samueldr> if it's the latter, then it can't run on a CPU without the features, but is fine otherwise
<lukegb> it's testing a CMOV instruction that dates back to Pentium Pro(!)
<samueldr> could it be related to how some features get encoded?
<samueldr> I don't know much about that
<samueldr> if it's the former, a feature being output by the compiler without being asked for it, then the compiler may produce other binaries that won't run on wendy-class hardware
<samueldr> which is probably undesirable entirely
<lukegb> which basically says, from what I understand, generate a microop analysis of CMOV16rm, and then check that it can be parsed again
<samueldr> I see the source, it doesn't help me :)
<samueldr> unlikely, but maybe we're exercising llvm in a way the authors did not intend
<samueldr> has someone got in touch with them to have their opinion?
<lukegb> I'd kinda like to get some of the intermediate build outputs before we do that
<lukegb> but I need some time with someone with access to the machine that it's broken on to make that happen, I think
<samueldr> exotic hardware is hard
<gchristensen> lukegb: I can help with that
<lukegb> gchristensen: ooh, that'd be good
<gchristensen> I'll need to be "dumb" hands though
<gchristensen> let me know when you're ready
<lukegb> gchristensen: https://gist.github.com/lukegb/d6bf0479ca9cdece46b34279a6089b69 is my current state dump
<gchristensen> lukegb: so, ready now?
<lukegb> a copy of the CMOV16rm-uops.yaml, and the stdout/stderr of the two llvm-exegesis invocations should be enough to at least vaguely work out what's going on
<lukegb> yeah
<gchristensen> load needs to go down a bit before we can do it
<lukegb> ah, fair
<lukegb> I'll be around for a bit, if you are - if it's a bad time we can reschedule, don't want to put pressure on you or anything 😅
<gchristensen> ssh ro-FfKLwDY43sSv2fUFSXFKDLS5n@lon1.tmate.io
<lukegb> oh joys
<gchristensen> whoops.
<lukegb> hah
<gchristensen> maybe this second time the CPU will branch predict all the delta resolution
<gchristensen> lukegb: will nix-shell -I nixpkgs=$(pwd) -p llvmPackages_12.llvm work?
<lukegb> should do
<lukegb> better than nix run but my fingers always type that instead :3
<lukegb> that looks good at least
<gchristensen> there you go
<lukegb> ah, segfault
<lukegb> can you grab the coredump
<gchristensen> w/ coredumpctl yea?
<lukegb> I guess
<lukegb> yeah
<lukegb> (unless it just dumped the core into the local dir, but... probably coredumpctl)
<sterni> what is going on here
<lukegb> llvm-exegesis debugging
<lukegb> it, err, segfaults. but only on wendy. and apparently only under this one test
<sterni> still better than debugging llvm 7 libunwind cross compilation :'(
<sterni> or worse who knows
<sterni> actually worse
<gchristensen> lukegb: DM'd you a wormhole command to get it
<lukegb> gchristensen: can you grab the .yaml file as well?
<gchristensen> which?
<lukegb> err
<lukegb> CMOV16rm-uops.yaml
<gchristensen> wormhole receive 4-borderline-sailboat
<gchristensen> lukegb:
<lukegb> cheers! thanks, I think that should be enough to at least work out what's up
<gchristensen> it might be cool to automatically upload coredumps from builds
<lukegb> gchristensen++
<gchristensen> :) thanks for digging in
<gchristensen> [root@wendy:~/scratch/nixpkgs]# coredumpctl list | wc -l
<gchristensen> 10185
<lukegb> hahaha
supersandro2000 has quit [Killed (rothfuss.freenode.net (Nickname regained by services))]
supersandro2000 has joined #nixos-infra