<infinisil>
With --max-jobs and --cores I feel like the optimum is often something like jobs * cores = 2 * actual cores
<infinisil>
Gut feeling only
<elvishjerricco>
That's basically the intuition I've been using for a couple years
<elvishjerricco>
loadavg never seems to punish me for it :P
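Spelled out, that gut feeling on a hypothetical machine with 8 hardware threads could look like this in `nix.conf` (the 4/4 split is just one way to factor jobs * cores = 16 = 2 * 8):

```ini
# nix.conf on a hypothetical machine with 8 hardware threads:
# max-jobs * cores = 16 = 2 * 8
# max-jobs: how many derivations build concurrently
max-jobs = 4
# cores: exported to each build as NIX_BUILD_CORES (e.g. for make -j)
cores = 4
```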
<infinisil>
Hmm would be cool if we had info on how much parallelism builds used
<infinisil>
Like a probability distribution of core usage for a package build
<elvishjerricco>
On top of wishing GHC actually handled `-j` well, I wish it had something like make's `-l`
<gchristensen>
I think that only makes sense if you have HT?
<gchristensen>
if you don't have HT, then the context switches can be costly
<infinisil>
So you can go "90% of the time this package uses 1 core only"
<infinisil>
elvishjerricco: What do those do?
<infinisil>
gchristensen: Intuition is: many builds won't be parallelizable often
<elvishjerricco>
gchristensen: I meant "threads" when infinisil said "actual cores", not physical cpu cores, sorry. Reason being that Haskell dependency graphs don't tend to be hugely parallel, so you can overprovision cores to builds pretty safely
<elvishjerricco>
infinisil: `-l` causes `make` to watch the system loadavg. It won't start a new job until the loadavg is below the specified number
<gchristensen>
right
<infinisil>
elvishjerricco: :o
<infinisil>
That's pretty neat
<gchristensen>
but putting 2 processes on 1 hardware thread can lead to contention which can slow it down
<elvishjerricco>
So you can do `make -j <n> -l <n>` on many jobs when you have <n> cores, and they'll balance cores between each other reasonably well basically by happenstance
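A concrete sketch of that combination, assuming GNU make on a hypothetical 8-thread machine (`.RECIPEPREFIX` is only used here to avoid literal tab characters; the makefile and target names are made up):

```shell
# Allow up to 8 parallel jobs, but hold off starting a new one while
# the 1-minute loadavg is at or above 8.
cat > /tmp/load-demo.mk <<'EOF'
.RECIPEPREFIX := >
targets := a b c d
all: $(targets)
$(targets):
>@echo building $@
EOF
make -f /tmp/load-demo.mk -j8 -l8 all
```

With several such invocations running at once, each one throttles itself against the shared loadavg, which is how the jobs end up balancing cores "by happenstance".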
<gchristensen>
I wish GHC didn't have a linear slowdown with more hardware threads :|
<elvishjerricco>
gchristensen: well the point is that max-jobs * cores can usually be set to more than the number of hardware threads you have because Haskell build graphs rarely saturate max-jobs * cores, and it's useful to let jobs and cores-used-per-job fluctuate into higher numbers during a build.
<elvishjerricco>
gchristensen: And it's just GHC itself that has linear slowdown with hardware threads, right? Not all GHC-compiled programs?
<gchristensen>
right, but GHC is the only Haskell program "I use" :)
<elvishjerricco>
Fair enough :P
<gchristensen>
(in other words, i'm annoyed I can't give ghc 96 slow cores and get a reasonably fast build)
<elvishjerricco>
Not that most module graphs or package graphs would take advantage of that anyway
<gchristensen>
yeah
<elvishjerricco>
that's one thing I envy about C
<infinisil>
Oh, is that fast in C because of the .h files?
<infinisil>
Like, during compilation, files only need to look at the .h files to know the API of the files they depend on
<infinisil>
Although, if it wants to do inter-file optimizations that won't work
<elvishjerricco>
infinisil: Yea, all C files can be compiled in parallel no matter their real dependency graph.
<infinisil>
Ah so it is that
<infinisil>
Do you need some flag to activate this?
<elvishjerricco>
inter-file optimizations basically aren't a thing in C, unless you use LTO (link-time optimization), which stores a different kind of object file so the optimizer can run again at link time
<infinisil>
Or is this enabled by default in gcc?
<elvishjerricco>
infinisil: gcc doesn't do any of the work. It's all in `make`
<infinisil>
Ohh right
<elvishjerricco>
each C file is compiled by its own gcc process, and all such processes are independent
<infinisil>
Hm yeah that seems pretty nice
<infinisil>
elvishjerricco: Although, couldn't Haskell kind of do the same thing? Assuming all top-level definitions have a type given, it should be able to extract an equivalent ".h" from it
<elvishjerricco>
infinisil: In theory it's something you could potentially get away with in a Haskell-like language (separate types into different files), but not a dependently typed language
<elvishjerricco>
^ same train of thought :P
<infinisil>
:)
<infinisil>
Oh yeah I see how dep types can mess that up
<infinisil>
And I guess GHC already has some dependent type stuff
<infinisil>
type families and co.
<elvishjerricco>
infinisil: I *think* you could put type families into "header" files and it'd be fine in current haskell. They're completely unoptimized, so they're basically eval'd per-usage anyway
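GHC does already have a narrow mechanism in this spirit: `.hs-boot` files, which hold only declarations and type signatures so that another module can be compiled against them before the real module is built (GHC uses them to break module cycles rather than for parallelism). A sketch with hypothetical module names:

```haskell
-- A.hs-boot: a "header" for module A, declaring only the signature
module A where
foo :: Int -> Int

-- B.hs: compiles against A.hs-boot, before A.hs itself is compiled
module B where
import {-# SOURCE #-} A (foo)

bar :: Int
bar = foo 42
```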
<gchristensen>
I would imagine with all the degrees that have come out of ghc, the problem is more complex than dependent types
<elvishjerricco>
gchristensen: yea, TemplateHaskell for example
<gchristensen>
yeah
<infinisil>
Hm, as a Haskeller I kind of feel like I'm cornered into Haskell. I feel like I won't be satisfied with any other language anymore
<gchristensen>
many languages have beautiful ideas and concepts
<gchristensen>
it may be tough to find them, though
<infinisil>
I mean also for realistic and productive coding
<infinisil>
I'm really interested in Idris for example, but I won't find any useful libraries for it
<elvishjerricco>
infinisil: I feel the same way. Which is sad because I have sooo many issues with Haskell :P
<infinisil>
(maybe I should write them then I guess)
<infinisil>
elvishjerricco: Yeah, having some issues too
<infinisil>
elvishjerricco: What are some of yours?
<elvishjerricco>
1) The type system is an eldritch horror attempting to masquerade as dependent types. 2) The compiler produces fast programs but is horrible in almost every other way. 3) The build systems are painful to work with. 4) TemplateHaskell is a terrible thing. 5) There's tons of libraries.... for like 5 things. People love to reinvent the same 5 wheels, so we're missing a lot and work gets wasted.
<elvishjerricco>
There's plenty of language level things that aren't coming to the top of my head though
<elvishjerricco>
records... oh man records
<gchristensen>
how about CPP
<elvishjerricco>
oh god the problems caused by CPP...
<gchristensen>
:D
<elvishjerricco>
Oh the parallel GC has terrible defaults for like 90% of cases but no one notices until they use lots of cores.
<infinisil>
-gy or something, I think
* infinisil
looks it up
<elvishjerricco>
Language level things: 1) ScopedTypeVariables isn't the default because...?? 2) Compiler magic behind `$` is stupid. 3) Reasoning about let-floating and full-laziness is incredibly annoying. 4) Prelude is kinda bad but not bad enough to be worth using alternate ones.
<infinisil>
-qg
<elvishjerricco>
infinisil: Which one is that again?
<infinisil>
Um, no idea, but the one that doesn't make multiple cores run gc
<infinisil>
Well I guess I have an idea then
<elvishjerricco>
Ah, yea I think that just disables parallel GC altogether and forces a serial one
<infinisil>
elvishjerricco: What's let-floating?
<elvishjerricco>
Which is *often* an improvement over the defaults, but not as good as if you actually chose the right params
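For reference, these are runtime-system flags, so they go on the compiled program's command line rather than GHC's; a hypothetical invocation (program name assumed):

```
# 8 capabilities, with the parallel GC disabled in favor of the serial collector
./myprog +RTS -N8 -qg -RTS
```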
<elvishjerricco>
infinisil: It's similar to full-laziness. `let`-bindings can get floated into higher scopes
* infinisil
has no idea how this affects anything
<infinisil>
I guess it can save some thunk allocations?
<elvishjerricco>
infinisil: It's more of a problem with full-laziness, which will take terms that don't refer to a lambda's or function's arguments and push them into a higher scope so they're only allocated once for the whole function, not once on each invocation.
<elvishjerricco>
Sounds like an optimization until you realize it forces that value to be shared, which can blow up your GC
<clever>
elvishjerricco: ive noticed that problem with nix too
<infinisil>
Ohh
<elvishjerricco>
clever: I didn't think Nix did any such thing... or hardly any optimizations really :P
<clever>
elvishjerricco: basically, there was something in the form of: map (x: let foo = 5*5; in x + foo)
<infinisil>
elvishjerricco: Tbh my understanding of multithreading and GC and the combination of them is rather poor, I wish I knew how it all ties together
<clever>
elvishjerricco: and it re-computed 5*5 for every pass thru the list
<clever>
elvishjerricco: i had to manually float the let into: let foo = 5*5; in map (x: x + foo)
<clever>
elvishjerricco: and that gave a measurable difference in function calls
<elvishjerricco>
clever: Yea, I forget if that one would be let-floating or full-laziness, but that optimization would eliminate that. At the cost of keeping `foo = 25` allocated in memory much longer.
<elvishjerricco>
matters more for big data structures that get GC'd along the function's execution
<clever>
elvishjerricco: the problem is less about keeping in ram, and more about nix not knowing it doesnt refer to the local scope
<clever>
elvishjerricco: when you do `let foo = expr;` nix will initially assign foo to be a thunk containing the expr, and if you force it, it becomes a concrete type
<elvishjerricco>
infinisil: GHC's GC has a few neat tricks up its sleeve but it's mostly pretty normal. Its multithreading is pretty sweet though
<clever>
elvishjerricco: but when you do `x: let foo = expr in ...`, the type is a function, which cant contain the result of forcing thunks
<clever>
elvishjerricco: so, every single time you call the function, it creates a new thunk within foo, that has to re-compute expr
<clever>
elvishjerricco: even if expr never depends on x!!
<elvishjerricco>
clever: Yea, it's good for that case to factor `foo` out (and GHC would do so).
<clever>
elvishjerricco: `let foo = expr; in x: ...` would save the result into foo, then the garbage collector will take care of the rest
<elvishjerricco>
But for instance, if you have a lazy linked list that spans millions, or even infinite elements, you REALLY don't want that getting shared between function invocations
<clever>
elvishjerricco: yeah..., maybe a warning would be better, from some optimization scanner
<elvishjerricco>
There was a bug with the `forever` function in `base` a while ago where this was guaranteed to cause a memory leak
<infinisil>
elvishjerricco: My only experience with GHC's GC with multithreading has been pretty bad: Told it to use 8 cores, but only like 600% core usage in htop, no idea what happened
<clever>
maybe the trace stuff angermann has been playing with lately, could identify those slow spots, then use hnix to confirm if it can be floated
<elvishjerricco>
infinisil: You need to be careful with the GC when multithreading. There's lots of advice online about it, and I always forget what should be done :P
<elvishjerricco>
But for instance the spark pool is the coolest multithreading technique I've seen in any language
<infinisil>
I know at least what a spark is, didn't know there was a pool of them though!
<clever>
elvishjerricco: --trace-function-calls
<infinisil>
I guess free cores just take a spark from the pool if there are any
<clever>
my understanding, is that the ghc rts has a list of mutexes, called capabilities
<elvishjerricco>
infinisil: Yea when you spark a thunk, it gets placed in a pool. Then mutator threads come along and start eating away at the pool on their own accord.
<clever>
so if you run it with `+RTS -N3 -RTS`, it creates 3 capabilities (aka, 3 mutexes), which gives you the capability to run up to 3 threads at once
<clever>
any time a thread wants to run haskell code, it must grab one of those mutexes
<clever>
and also, the os threads can internally context switch between many haskell threads
<elvishjerricco>
clever: Well, kinda
<elvishjerricco>
Haskell threads are cooperative
<clever>
i think when you ffi things, it will release the mutex, and then block on that ffi call
<elvishjerricco>
They're shuffled around between capabilities at the RTS's discretion.
<clever>
and another os thread can take over the capability
<elvishjerricco>
And the RTS manages waiting for a thread to yield, then switches to the next without any kind of mutexing
<clever>
and yeah, the haskell threads can roam between os threads
<elvishjerricco>
But there is such a thing with the FFI, yea
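A minimal sketch of sparking a thunk into the pool with `par` from base (assumes the program is compiled with `-threaded` and run with `+RTS -N`, since otherwise there are no idle capabilities to convert sparks):

```haskell
import GHC.Conc (par, pseq)

-- `par` puts the thunk for `a` into the spark pool; an idle capability
-- may evaluate it while the current thread goes on to force `b`.
main :: IO ()
main =
  let a = sum [1 .. 1000000] :: Int
      b = product [1 .. 20] :: Int
  in a `par` (b `pseq` print (a + b))
```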
<clever>
so a busy os thread just pushes its jobs off to another
<clever>
i still havent figured out the black-box that is "proper" io in haskell
<clever>
it somehow registers for a thing to wait until a fd is readable, without blocking in ffi
<clever>
and there is one central thread that uses select/poll/epoll?
<elvishjerricco>
clever: Yea you're talking about the IO manager
<elvishjerricco>
It was replaced several years ago
<elvishjerricco>
I don't know much about it
<clever>
last i heard, its part of why named pipe stuff on windows still sucks as much as it does
<infinisil>
elvishjerricco: Oh and btw, one major gripe I have with Haskell (well, not Haskell directly) is the split of the library ecosystem: there are so many incompatible libraries. Some use String, some Text, some use this effect system, some another, some lens, some not
<clever>
the io manager doesnt support it, so all traffic has to go thru ffi's
<elvishjerricco>
But it's basically a "userspace" thing (as-in, non RTS thing) that manages stuff, and base chooses to use it
<clever>
infinisil: oh, i hate how paths are handled
<clever>
some libraries want String, some want Text, some want ByteString, and there are 2 competing libraries for dealing with paths
<elvishjerricco>
infinisil: Oh yea, that's a huge annoyance
<infinisil>
I personally have only seen paths be Strings
<elvishjerricco>
Though to be fair
<elvishjerricco>
ByteString is deservedly its own thing
<clever>
and turtle cant even make up its mind, the file functions return one type, but the execute functions take a different type
<elvishjerricco>
The only thing I'd consider a competitor to ByteString would be like `UVector Word8` or something
<clever>
so you cant even do something like "execute (dir + binary) [ -o file ]"
<infinisil>
elvishjerricco: Yeah, they do have different use cases (well String not really, should really be [Char])
<elvishjerricco>
Because ByteString is about bytes, not text
<clever>
you have to convert between types several times
<elvishjerricco>
clever: Turtle is awful
<elvishjerricco>
It's got some great ideas but the UX is just horrendous
<clever>
ive also ran into bugs recently that </> would solve
<clever>
but it often needs incompatible types
<clever>
which involves refactoring half the code :P
<infinisil>
Oh and another major gripe: Documentation of libraries is generally pretty bad, error messages similarly
<clever>
can we just decide on a single standard type for a path?
<elvishjerricco>
Haskell is the reason I've become quick to read sources :P Docs are bad
<clever>
infinisil: often i skip the docs, and then click the source link in google :P
<elvishjerricco>
Not just in Haskell; even when they're good in other languages, they're still often wrong
<clever>
infinisil: i can sometimes understand the source better than the docs
<infinisil>
A replacement for lens, with actually good error messages, none of that abstract nonsense
<clever>
something i want, is for the template haskell monad, to be based on the identity monad, not IO
<elvishjerricco>
infinisil: Any idea what the drawbacks of it are?
<clever>
then you can prove at the type level, that the TH isnt doing any IO
<elvishjerricco>
clever: There's a lot more wrong with TH than that :P
<clever>
which means you can run the TH on the "wrong" platform
<clever>
which enables TH in cross-compile
<elvishjerricco>
clever: Not quite
<clever>
that will cover 80% of the use-cases
<elvishjerricco>
There's all sorts of constants and CPP directives that TH has access to that are super annoying
<elvishjerricco>
but yea, 80%
<infinisil>
elvishjerricco: You can't define optics without the library. With lens it's just a `(s -> a) -> (s -> a -> s)` or whatever
<clever>
elvishjerricco: makeLenses'
<clever>
thats about the only thing i use TH for
<elvishjerricco>
Same :P
<clever>
it just takes a type in, and spits out a bunch of functions?
<clever>
it doesnt care what arch it runs on
<elvishjerricco>
infinisil: What about perf? Does it optimize out of existence like lens usually does?
<clever>
the code it generates could even be reused between many platforms
<clever>
elvishjerricco: oh!, something else ive wanted, a bit of a `gcc -E` mode, can i run the TH, and have ghc emit a non-th'd version of the entire file?
<infinisil>
elvishjerricco: No idea, but I hope so, and I'd expect it
<elvishjerricco>
clever: reflex-platform has a GHC patch for exactly that
<clever>
elvishjerricco: so i could manually run that th, and then just commit the generated code, and compile it in a non-th manner
<elvishjerricco>
They apply it to the native build of every package and splice the output onto the cross build
<elvishjerricco>
not sure how it's applied or how it works
<elvishjerricco>
so you may have to do some digging
<clever>
elvishjerricco: i'll have a look at it tomorrow, its getting late here, oh but one last thing
<clever>
ive been working on porting some open-source firmware to the rpi4, and i recently had a crazy idea involving haskell
<clever>
the firmware runs on a cpu without an MMU, so heap fragmentation leads to problems allocating large objects
<clever>
so, the firmware heap can be defragged, moving objects around randomly
<clever>
you must lock an object before you know where it is, and unlock it when you're done
<clever>
and that feels like something the ghc copy collector is already doing....
<clever>
elvishjerricco: am i insane for wanting to run haskell bare on a cpu that lacks an MMU?
<elvishjerricco>
Oh that's a cool idea
<clever>
elvishjerricco: note, that this will also be involved in reading and writing memory mapped IO
<elvishjerricco>
Yea copying GCs are probably the only GC's well suited to a CPU that has to defrag like that
<clever>
elvishjerricco: where simply reading a given address in memory has side-effects :P
<elvishjerricco>
but that also sounds super hard to do right :P
<clever>
i think a copy-collector will waste half the ram?
<elvishjerricco>
how do you do memory mapped IO without an MMU?
<clever>
unless it can copy in smaller chunks?
<clever>
all IO devices are in the main address space
<elvishjerricco>
clever: I forget if GHC chunks up the copies or not...
<elvishjerricco>
ah
<clever>
so if i want to blink the status led, i just write to the correct byte, that is within ~64 bytes of physical address 0x7e20_0000
<clever>
and if i want to check for a button being pushed, i read a byte in that area
<clever>
another factor that may be an issue, is that the address space only has room for 1gig of ram
<clever>
and for most models of the rpi, you must share that ram with the linux kernel (which is on an entirely different cpu)
<elvishjerricco>
oh that sounds like a problem
<elvishjerricco>
Most models?
<clever>
any model with 1gig or less ram
<clever>
the rpi3/4 can basically run 4 entirely seperate instruction sets, at once
<clever>
arm32, arm64, vc4, and qpu
<elvishjerricco>
wut
<clever>
arm32/arm64 are run on the quad-core ARM, which can do the usual switching between 32bit and 64bit mode (they use different instruction sets)
<clever>
vc4 is a dual-core cpu without a MMU, thats where start.elf and the firmware runs
<clever>
qpu is the shaders, for compute and 3d graphics, its basically a 192 core cpu
<clever>
qpu has extremely limited branching capabilities, it might not even be turing complete
<clever>
(barring self-modifying code)
<clever>
the rpi 1/2 is limited to arm32/vc4/qpu, and the rpi1 only has a single arm32 core
<clever>
elvishjerricco: is it less wut-inducing now that its been explained? lol
<elvishjerricco>
clever: Slightly :P
<clever>
when you turn the pi on, the vc4 runs a boot rom, and the dram/arm are offline
<clever>
the boot rom loads bootcode.bin
<clever>
bootcode.bin enables the dram, and loads start.elf (also a vc4 binary)
<clever>
start.elf then loads kernel.img, and enables the arm cpu
<clever>
then start.elf and kernel.img each live in their own cpu cluster (quad-core arm + dual-core vc4)
<clever>
and they share the ram, and talk to each other over a mailbox
<elvishjerricco>
clever: So then you're talking about having Haskell in the equivalent of start.elf?
<clever>
yep
<elvishjerricco>
that sounds very hard :P
<clever>
i also already have a bootcode.bin that can load linux on its own
<elvishjerricco>
GHC likes to use a lot of OS features
<clever>
so the idea is that linux will load the haskell firmware later on, after linux boots
<clever>
more like a typical gpu firmware
<clever>
elvishjerricco: such as?
<elvishjerricco>
IIRC it starts using select/poll right away
<clever>
i learned the internals of halvm, before i knew haskell, lol
<clever>
halvm runs haskell code under xen, without a kernel
<clever>
elvishjerricco: its basically 2 large components, a new "os" for the rts, that provides the "native" wrapper around select/poll, and a custom haskell library, that ffi's into native code, to talk to the hypervisor
<clever>
elvishjerricco: this code is run by some assembly stubs (which prepare things enough that c can run), and it will convert the kernel command line into an argv array
<clever>
elvishjerricco: it will also auto-generate some +RTS flags, based on how much ram the guest has
<clever>
it then runs the main() entry-point in the RTS
<clever>
which thinks you just executed the code on a normal os, with a normal command-line
<elvishjerricco>
clever: Yea in hindsight it's not as crazy as it sounds
<clever>
elvishjerricco: the RTS will then call osGetMBlocks and osFreeMBlocks to manage ram
<elvishjerricco>
clever: but something tells me GHC doesn't have a vc4 backend yet :P
<clever>
elvishjerricco: but ghc has an llvm backend, and ive heard rumors that llvm has a vc4 backend
<elvishjerricco>
clever: So the kernel and the firmware have to be aware of each others' memory allocations, right?
<clever>
for the official firmware, the "gpu" just takes a chunk at the top of the address space, like 256mb
<clever>
and then the linux kernel gets everything below that
<clever>
but my idea, is that the "gpu firmware" wont start until after linux has fully booted
<clever>
so, linux itself could just allocate a 256mb dma-capable block from its own address space, and then tell the gpu, "go nuts, its yours"
<elvishjerricco>
sure
<clever>
another complication, is that the vc4 can only access 1gig of the address space
<clever>
the rpi4 can go up to 4gig, and that creates a weird lowmem/highmem thing
<clever>
for the lower 1gig of the physical address space, the old start.elf api tells you how much ram linux gets, and how much is used by the firmware
<clever>
but the firmware cant see the 3gig of highmem, so thats not even reported, and is 100% in arm's control
<clever>
so you get 768mb of linux ram, 256mb of gpu ram, then 3gig more of linux ram
<clever>
and now you have to deal with a giant hole in your address space :P
<clever>
and anything you want to DMA to the gpu, must be in the lower 768mb block
<clever>
and its getting late here, i should get off to bed!
<colemickens>
so if I am doing `git am`, it fails, I fix what fails, how do I get it to apply the rest of the patch? It seems like it didn't apply any of the hunks and I dont know how to get it to?
* colemickens
gave up and applied it by hand and then `git format-patch`d the new patch :S
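The resume flow colemickens was looking for is `git am --continue`: when a patch fails, you apply the remaining hunks by hand, `git add` the files, and continue, which commits the staged tree with the original patch's message. A throwaway sketch (hypothetical repo, file, and commit names):

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
rm -rf /tmp/am-demo && mkdir -p /tmp/am-demo && cd /tmp/am-demo

# Upstream repo with a base commit and a one-line change, exported as a patch.
git init -q upstream
(cd upstream \
  && echo one > f.txt && git add f.txt && git commit -qm base \
  && echo two >> f.txt && git commit -qam change \
  && git format-patch -q -1)

# A clone whose f.txt has diverged, so the patch context won't apply.
git clone -q upstream work && cd work
git reset -q --hard HEAD~1
echo ONE > f.txt && git commit -qam diverge

git am ../upstream/0001-change.patch || true   # fails: context mismatch
printf 'ONE\ntwo\n' > f.txt                    # resolve the hunk by hand
git add f.txt
git am --continue                              # commits with the patch's message
git log -1 --format=%s                         # -> change
```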