<worldofpeace>
great, I've been helping him out. good to see another committer ❤️
pie_ has quit [Quit: pie_]
<gchristensen>
=) yeah!
ajs124 has quit [Quit: Gateway shutdown]
das_j has quit [Remote host closed the connection]
das_j has joined #nixos-dev
andi- has quit [Remote host closed the connection]
andi- has joined #nixos-dev
drakonis has quit [Quit: WeeChat 2.4]
cjpbirkbeck has joined #nixos-dev
Jackneill has joined #nixos-dev
__monty__ has joined #nixos-dev
orivej has joined #nixos-dev
cjpbirkbeck has quit [Quit: Quitting now.]
johanot has joined #nixos-dev
drakonis_ has quit [Ping timeout: 245 seconds]
drakonis_ has joined #nixos-dev
ajs124 has joined #nixos-dev
cransom has quit [Quit: WeeChat 2.4]
justanotheruser has quit [Ping timeout: 245 seconds]
justanotheruser has joined #nixos-dev
<adisbladis>
Re cache slowness/ipv6 issues: I have friends in Asia telling me that they often see 200k speeds even on a gigabit internet connection
<adisbladis>
After debugging a bit, it seems like fastly is _always_ super slow when it has to fetch from origin, while subsequent fetches can max out even a gigabit connection
<adisbladis>
curl'ing a known old cache-object takes several seconds before the transfer is even started (presumably while fastly does an origin fetch)
<arianvp>
I also vaguely remember that 404's are very slow on fastly
<arianvp>
(e.g. when your build checks the cache first if something already exists)
<arianvp>
(but 404's are very slow on most CDNs. which is kinda sucky for the Nix usecase)
<adisbladis>
arianvp: 900ms for a 404 from that connection
<adisbladis>
And from my current UK connection it's ~100ms
<adisbladis>
Oh wow, 1200ms
<adisbladis>
But avg is around 900
<adisbladis>
Requesting a narinfo: 0m0.981s
<adisbladis>
After cache: 0m0.027s
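(For anyone who wants to reproduce the kind of numbers adisbladis is quoting, here is a minimal timing sketch. It assumes the usual cache.nixos.org narinfo URL layout and takes a store hash as an argument; nothing here is official tooling, and the first/second labels only mean "cold vs. warm" if nothing else has touched the object recently.)

```python
# Rough cold-vs-warm narinfo timing, as a sketch only.
import sys
import time
import urllib.error
import urllib.request

def timed_get(url):
    """Return (status, seconds) for a single GET, discarding the body."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url) as resp:
            status = resp.status
            resp.read()
    except urllib.error.HTTPError as err:
        status = err.code            # e.g. 404 for a missing narinfo
    return status, time.monotonic() - start

if __name__ == "__main__":
    # Pass the 32-character store hash of any /nix/store path.
    url = "https://cache.nixos.org/%s.narinfo" % sys.argv[1]
    for label in ("first", "second"):
        status, secs = timed_get(url)
        print("%-6s request: HTTP %s in %.3fs" % (label, status, secs))
```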
<adisbladis>
I can also observe that fastly has a very short-lived 404 cache
orivej has quit [Ping timeout: 245 seconds]
<Profpatsch>
I’d wager they don’t optimize for that
<Profpatsch>
Also: There’s always a lot more that *isn’t* there, so not restricting how much/long you cache 404s is a serious DDoS vector for such a cache
<Profpatsch>
An attacker could request lots of random items and blow the cache.
<adisbladis>
Profpatsch: I don't think it's a problem that their 404-cache is very short-lived. That bit is fine.
<adisbladis>
It was just for your information, guys, in case anyone else is also testing :)
<adisbladis>
So that cache is taken into account
<Profpatsch>
Hm, so if you want to find out whether something is a cache miss from a layered cache you have to try all layers first, right?
<adisbladis>
Makes sense yeah
<Profpatsch>
At least if you do it the naive way
<Profpatsch>
So the requests probably go all around the world in their network
<Profpatsch>
There’s probably a good way to design that (bloom filters come to mind), but fastly isn’t doing it, it looks like
<adisbladis>
Profpatsch: How would you construct that bloom filter?
<Profpatsch>
wild guess, I always forget which way around those go and I suck at boolean logic
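(To make the idea concrete, and only as a toy: a CDN-side negative cache could keep a Bloom filter of keys known to exist, so a request for something not in the filter can be answered 404 at the edge without an origin round trip. This is nobody's actual design, certainly not Fastly's; the hand-rolled filter below is just the standard construction.)

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: answers 'definitely absent' or 'maybe present'."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Hypothetical usage: populate with every narinfo that exists, then
# answer misses locally; false positives just mean an occasional
# unnecessary trip to the origin.
known = BloomFilter()
known.add("abc123.narinfo")
print(known.might_contain("abc123.narinfo"))    # True
print(known.might_contain("notthere.narinfo"))  # almost certainly False
```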
<adisbladis>
Hmm
<adisbladis>
Interesting, hitting the s3 bucket directly consistently takes ~900-1000ms
<adisbladis>
So I'm guessing this is what happens: Fastly is always requesting the origin directly from the edge?
<adisbladis>
I'd like us to consider alternatives to fastly
<arianvp>
adisbladis: I also remember someone ( gchristensen ? ) evaluating a whole bunch of options
<arianvp>
esp. the 404 performance stuff, and fastly came out on top? but this is all a very vague recollection of old memories
<adisbladis>
I'd love to see that data
<__monty__>
Kludge suggestion, have a cronjob that hits a significant number of things you might want from the cache?
orivej has joined #nixos-dev
<adisbladis>
__monty__: You'd have to know what you want first ;)
<__monty__>
adisbladis: Just record things you hit, potentially cull the list in LRU fashion.
<adisbladis>
Holy ****, an object that I hit just a second before took 3 seconds to fetch
<__monty__>
Note that I didn't call it a "solution" : >
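(A throwaway sketch of __monty__'s kludge, assuming you can intercept or log the paths your own builds ask the cache for: remember the most recently used ones and re-fetch them on a timer so the edge stays warm. None of this is existing tooling, and as noted it papers over the problem rather than solving it.)

```python
from collections import OrderedDict
import urllib.request

MAX_TRACKED = 1000
recent = OrderedDict()          # path -> None, kept in LRU order

def record_hit(path):
    """Call whenever a build asks the cache for `path` (e.g. 'xyz.narinfo')."""
    recent.pop(path, None)
    recent[path] = None
    while len(recent) > MAX_TRACKED:
        recent.popitem(last=False)        # evict the least recently used

def warm_cache():
    """Run periodically (cron): re-request tracked paths to keep edges warm."""
    for path in list(recent):
        try:
            urllib.request.urlopen("https://cache.nixos.org/" + path).read()
        except Exception:
            pass   # 404s and timeouts are fine; we only want to touch the edge
```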
orivej has quit [Ping timeout: 268 seconds]
orivej has joined #nixos-dev
psyanticy has joined #nixos-dev
orivej has quit [Ping timeout: 272 seconds]
<michaelpj>
does Nix race cache lookups if you have multiple binary caches? That would at least prevent something I think I've noticed where having multiple binary caches slows you down a bunch since you have to wait for all of them to 404 in sequence
<michaelpj>
doesn't look like it, looking at `SubstitutionGoal::tryNext`
<Profpatsch>
yeah, it goes through them in order.
* tilpner
thinks sequential lookup is important to keep
<tilpner>
Optional (or default, but allowing to switch back) parallel lookup would be fine
<{^_^}>
nix#3019 (by michaelpj, 14 seconds ago, open): Race substitutors against each other
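(To illustrate what nix#3019 is asking for, not how Nix's SubstitutionGoal actually behaves today: probe all configured binary caches concurrently and take the first hit rather than waiting for each 404 in turn. The cache list here is an assumed example.)

```python
import concurrent.futures
import urllib.error
import urllib.request

CACHES = [                           # assumed example substituters
    "https://cache.nixos.org",
    "https://example-extra-cache.invalid",
]

def has_narinfo(cache, store_hash):
    """Return the cache URL if it has the narinfo, else None."""
    try:
        with urllib.request.urlopen(f"{cache}/{store_hash}.narinfo",
                                    timeout=10) as resp:
            return cache if resp.status == 200 else None
    except (urllib.error.URLError, OSError):
        return None                  # 404, DNS failure, timeout, ...

def first_hit(store_hash):
    """Race all caches; return the first one that answers with a hit."""
    with concurrent.futures.ThreadPoolExecutor(len(CACHES)) as pool:
        futures = [pool.submit(has_narinfo, c, store_hash) for c in CACHES]
        for fut in concurrent.futures.as_completed(futures):
            if fut.result() is not None:
                return fut.result()
    return None
```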
orivej has joined #nixos-dev
orivej has quit [Ping timeout: 246 seconds]
orivej has joined #nixos-dev
__monty__ has quit [Quit: leaving]
<gchristensen>
we use fastly because they donate it to us
<gchristensen>
thoughtpolice and i are going to meet to improve its config
<thoughtpolice>
arianvp: I'm pretty sure we can just cache 404s, which is the real solution. Then whenever Hydra does uploads, it can just purge the 404 cache entries and be on its way.
<gchristensen>
I think we do already cache some 404s
<thoughtpolice>
It's not like a global 30 minute purge "cache updated, wait for CDN to catch up now" thing. You can just purge a ton all the time and it's nearly instant.
<thoughtpolice>
If we got bold enough we could even do things like cache the public APIs/endpoints Hydra exposed
<thoughtpolice>
Searches, for example, which are just crazy slow
<gchristensen>
let's make things a bit better before getting bold :P
<thoughtpolice>
Also I don't know the timeline for how long our TTLs max out at but I'm pretty sure you can stuff things in there for quite a while. Like just set the 404 cache TTL to like 1 month on every 404
<thoughtpolice>
And get rid of it when you upload
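(A sketch of the "cache 404s for a long time, purge on upload" flow being proposed, assuming URL-based purging is enabled for the service. Fastly does accept an HTTP PURGE request for an individual cached URL, but whether it needs authentication depends on service configuration, and the after_upload hook here is purely hypothetical.)

```python
import urllib.request

def purge(path):
    """Ask the CDN to drop its cached entry (including a cached 404) for path."""
    req = urllib.request.Request("https://cache.nixos.org/" + path,
                                 method="PURGE")
    with urllib.request.urlopen(req) as resp:
        return resp.status

def after_upload(store_hash):
    # Hypothetically called by Hydra right after the narinfo and NAR land
    # in S3, so any long-lived negative entry disappears immediately.
    purge(f"{store_hash}.narinfo")
```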
<gchristensen>
interesting. we never expunged cache entries on cloudfront
<gchristensen>
I wonder if that caused problems we didn't know about
orivej has quit [Ping timeout: 246 seconds]
<thoughtpolice>
adisbladis: Yes, right now all systems in the CDN fetch from S3 directly over the "global internet". If you are in Europe, you hit a DC, and it hits us-east-1. If I am in the US, it hits a different DC and goes to us-east-1 as well. This is not good since it exposes S3 latency and also means that individual nodes all cache entries in a "disjoint" way
<thoughtpolice>
In other words, if you do a fetch and populate the cache, and I do a fetch, I don't hit your DC, so I get a miss, and will hit the origin again, wasting time. That's bad but you can fix that.
<thoughtpolice>
Instead you use a two-layer approach: all edges hit one of the DCs, and that DC alone is responsible for talking to S3, sort of like a reverse proxy. This means A) all requests stay on the network backbone, and B) it increases the hit ratio too, because now all requests are funneled through one set of nodes, which will populate the cache nicely.
<thoughtpolice>
I imagine this would dramatically reduce the amount of direct origin fetches people see.
<thoughtpolice>
"Dramatically". It's literally impossible to know right now because we have no insight into any of the cache infrastructure. Ideally we would have things like p99 metrics via logs, but alas, that's a big TODO...
<thoughtpolice>
(And actually, gchristensen may have enabled shielding a while back. But the default TTLs are really not very long, IIRC, so that also means we wouldn't get much benefit. narinfos, for example, should basically get cached effectively forever)
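(A toy simulation of the shielding argument above, with entirely invented numbers: unshielded, every POP pays its own origin fetch for each popular object; shielded, roughly one origin fetch per object in total, with everything else served inside the CDN.)

```python
import random

def origin_fetches(num_pops, num_requests, shielded, popular_objects=100):
    """Count origin fetches for random lookups of a popular-object set."""
    edges = [set() for _ in range(num_pops)]   # per-POP cache contents
    shield = set()                             # the single shield POP's cache
    fetches = 0
    for _ in range(num_requests):
        obj = random.randrange(popular_objects)
        edge = edges[random.randrange(num_pops)]
        if obj in edge:
            continue                           # edge hit, nothing to do
        edge.add(obj)
        if shielded:
            if obj not in shield:
                shield.add(obj)
                fetches += 1                   # only the shield talks to S3
        else:
            fetches += 1                       # every edge miss talks to S3
    return fetches

random.seed(0)
print("origin fetches, unshielded:", origin_fetches(20, 5000, shielded=False))
print("origin fetches, shielded:  ", origin_fetches(20, 5000, shielded=True))
```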
FRidh has quit [Ping timeout: 246 seconds]
drakonis_ has quit [Ping timeout: 244 seconds]
<thoughtpolice>
It actually does not look like shielding is enabled
<gchristensen>
it is not
<gchristensen>
if I remember correctly, when we talked about shielding I was in a weird TZ and falling asleep tired
drakonis_ has joined #nixos-dev
<thoughtpolice>
*nod*
<thoughtpolice>
adisbladis: Also, if you're going to check the cache for hit/miss numbers, timing etc, any information becomes much more valuable if you include the right headers and can relay them
<thoughtpolice>
There's some more information in the comments
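(For anyone gathering that kind of data: Fastly exposes cache diagnostics in response headers, and sending a `Fastly-Debug: 1` request header asks it for a few more. A small sketch that prints the ones worth relaying; the exact set returned can vary, so treat the header list as a best guess.)

```python
import sys
import urllib.request

INTERESTING = (
    "X-Cache", "X-Cache-Hits", "X-Served-By", "Age",
    "Fastly-Debug-Path", "Fastly-Debug-TTL",
)

def dump_debug_headers(url):
    """Fetch `url` with Fastly-Debug enabled and print cache-related headers."""
    req = urllib.request.Request(url, headers={"Fastly-Debug": "1"})
    with urllib.request.urlopen(req) as resp:
        for name in INTERESTING:
            value = resp.headers.get(name)
            if value:
                print(f"{name}: {value}")

if __name__ == "__main__":
    dump_debug_headers(sys.argv[1])   # e.g. any cache.nixos.org narinfo URL
```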
jtojnar has quit [Ping timeout: 245 seconds]
<thoughtpolice>
Profpatsch: Also, requests do not go around the world, the basic nuts and bolts are the exact same as any CDN -- you hit a CDN node, it requests from the origin. There aren't any other multiple levels to it by default as far as a user is concerned (though you can change that if you wish, as I noted above). Not sure what "layering" you mean.
jtojnar has joined #nixos-dev
<Profpatsch>
thoughtpolice: Doesn’t mean the CDN doesn’t have layered caches inside, right?
<thoughtpolice>
I don't know what you mean by "layered" I'm afraid. Do you mean several nodes in the same datacenter all work together? That there are multiple caches on the same server or something?
<thoughtpolice>
In the sense that there are multiple servers working together inside of one "Point of Presence", yes that is true. The request you made is close to what a real request looks like
<thoughtpolice>
Here's what a real request looks like from cache.nixos.org:
<thoughtpolice>
That means there were two nodes involved in the request. One of them "fetched" from the origin (F) while the other "delivered" to the user (D)
<thoughtpolice>
Both of them came from the same datacenter or "POP", which is Dallas, AKA "DFW"
<thoughtpolice>
In this case, "fetched" is a bit of a misnomer. There was actually a cache hit, in that example.
cransom has joined #nixos-dev
<thoughtpolice>
(Well, it's not really a misnomer it's just... not normally what you think of "fetch", which has a specific meaning. In Varnish, "fetch" normally means "I talked to the origin", but here it just means "I had to grab the object from literally anywhere", including other nodes in the cache.)
<thoughtpolice>
So yes, there is what you'd think of as "clustering" going on. Several nodes can participate in a request, but it's not like, every HTTP request requires talking to N other nodes in the datacenter every time. It's more just like a fast lookup -- there are [0...N] nodes in the DC, the object was on node 1 but you talked to node 0, by chance.
<thoughtpolice>
Provided that you get a good cache hit ratio -- and when I say "good", keep in mind 90% is "just okay" -- then you will not see this. Instead every node in the DC will get its own cache populated, so it won't even have to talk to any others.
pie_ has joined #nixos-dev
<thoughtpolice>
Profpatsch: I hope that explains a little bit, at least.
johanot has quit [Quit: WeeChat 2.4]
johanot has joined #nixos-dev
orivej has joined #nixos-dev
<Profpatsch>
thoughtpolice: There is no communication between datacenters?
<cransom>
for a cdn? no
<cransom>
or at least for fastly specifically, no.
<thoughtpolice>
Profpatsch: When you say "communication", what do you mean? Like every DC talks to every other DC on a request? No, no CDNs work like that, the latency would simply be atrocious. With Fastly, you *can* do a limited form of "datacenter layering", which in our case for cache.nixos.org would be beneficial, but that's probably different than what you're thinking.
<thoughtpolice>
(Which is the configuration I mentioned earlier -- you choose one special datacenter to handle ALL requests, and every other datacenter will, instead of going to the origin, go to that DC instead. Only that DC will ever talk to the origin, if it cannot satisfy the request. This is beneficial for a number of reasons in many setups)
<thoughtpolice>
But fundamentally all CDNs work on the idea you contact the closest DC instead of the origin, and it either gives you an object from the cache, or goes to the origin as a last resort. Those are the fundamentals and they're the same basically everywhere.
<cransom>
typically, the largest advantage is that it only requests something of the origin once. so if you have an app that is very slow, it's better to shield/tier requests so that fastly channels all requests to their one caching location, and then the rest of the fastly caches ask that location for the asset.
<Profpatsch>
So is the problem with Asia that fastly just doesn’t have any DCs there?
<thoughtpolice>
Yes, exactly. This improves the hit ratio because it means the "shield" will satisfy many more requests. And those requests go through the intra-dc networks, not over the globally routed internet.
<thoughtpolice>
Profpatsch: It's probably a number of things, including the fact we don't have a great hit ratio -- but there aren't many DCs as compared to other regions, yes (and most of them aren't directly in China, but e.g. Tokyo and Hong Kong, depending on where you mean by "asia")
orivej has quit [Ping timeout: 272 seconds]
<thoughtpolice>
In general things like the Great Firewall etc shouldn't be an issue. The real issue is that if you have bad hit ratios, you simply have to talk to the origin a lot. And that's going to suck going across, say, the Pacific Ocean, no matter what you're doing. But in this case you not only pay that latency but also the latency of the intermediate connection as well.
<thoughtpolice>
At least, that's my working theory. It's hard to tell without more information and without logs from Asia users. Driving up the hit rate and being way more aggressive about caching is the first step, however.
johanot has quit [Quit: WeeChat 2.4]
justanotheruser has quit [Ping timeout: 245 seconds]
justanotheruser has joined #nixos-dev
justanotheruser has quit [Client Quit]
pie_ has quit [Ping timeout: 252 seconds]
pie_ has joined #nixos-dev
drakonis has joined #nixos-dev
jtojnar has quit [Read error: Connection reset by peer]
jtojnar_ has joined #nixos-dev
jtojnar_ is now known as jtojnar
<gchristensen>
so thoughtpolice and I just spent like 2hrs (!!! thank you!)
orivej has joined #nixos-dev
<gchristensen>
looking through Fastly/VCL config settings. we ended up turning on some options to improve cache performance, which should mostly help people outside of the US by making requests go on Fastly's private network until they get right next door to S3.
<thoughtpolice>
you can just give it any object from cache.nixos.org as the parameter and it'll dump the relevant info
<thoughtpolice>
We probably won't see a huge amount of changes immediately, we'll have to watch. (It might look much better here in 12 hours once the EU is back @ work, we'll see)
<gchristensen>
thoughtpolice: mind putting that up on a nixos.wiki page? then we'll add it to {^_^}'s reply about the cache
<thoughtpolice>
Yeah I can push it up there a little later today
<samueldr>
thoughtpolice++
<{^_^}>
thoughtpolice's karma got increased to 5
<gchristensen>
thoughtpolice++
<{^_^}>
thoughtpolice's karma got increased to 6
<gchristensen>
thoughtpolice: also, including a link to fastly-debug.com (or ... whatever that is ... :| :) ) and instructions to send them to me would be good ... but I can add that last part.
<gchristensen>
some more details:
<gchristensen>
we turned on "POP Shielding" which directs all requests through a central DC. so requests to Europe first travel on Fastly's private fiber to their Ashburn DC, then hits their cache infra in Ashburn, and then exits that cache infra to s3 which is like next door
<gchristensen>
oops
<srhb>
Protip: Don't oops after accidental spams :D
<srhb>
(I know...)
<simpson>
It's fine, only one message got through and the other two were killed by the load balancer~
<gchristensen>
this means that requests only hit s3 if ashburn is missing the content.
<gchristensen>
we also don't have to deal with general internet congestion
<gchristensen>
another thing is we found 404s are probably not being cached as we expect, which is not huge but also not great. we weren't sure how to fix it yet, but the issue is the synthetic reply in the VCL seems to ignore the cacheable property
<adisbladis>
gchristensen: Just some quick tests from a server in HK shows that 404 latency has dropped significantly
<gchristensen>
can you give me some info on that?
<gchristensen>
specific #s
<adisbladis>
From consistently ~900-1200ms to <300ms
<gchristensen>
thoughtpolice: ^^^
<andi->
so there is no intermediate caching (anymore?) between say London and Ashburn? Meaning all replies (positive and 404s) will be like ~180ms?
<gchristensen>
London has its own cache infra. if London cache misses, it goes to Ashburn. if Ashburn cache misses, it goes to s3
<andi->
ok
<gchristensen>
404s use a synthetic response of "404", which seems to side-step the instruction to cache the 404
<thoughtpolice>
:D
<andi->
so the 1st cache receives a 404 from ashburn and treats it differently from "normal" responses?
<andi->
or is that just the cache in ashburn that doesn't work then
<gchristensen>
the hot button problem of IPv6 is unsolved. the current understanding, which matches what we've read, is it has to do with how the network is implemented at a layer we can't control, which leaves us very limited options -- none of which are very good. for example, make an ipv4.cache.nixos.org and an ipv6.cache.nixos.org and have cache.nixos.org point to both
<gchristensen>
the cache "chaining "has nothing to do with 404s
<thoughtpolice>
andi-: In Fastly, "Shields" are global. That means if Ashburn is the shield, ALL requests from ALL other DCs will always go through there. London, Tokyo, etc etc. No difference where it comes from.
<andi->
thoughtpolice: okay that makes more sense. Not exactly what gchristensen wrote a few lines up :)
<adisbladis>
It's still very noticeable when the cache is not hot though
<thoughtpolice>
So Ashburn is always "in the middle" of every request, should it get there. It is also possible a request will NOT go to Ashburn. Why? Because say you hit London, and it already has the object. No need to go there!
<gchristensen>
as it stands, 404s never cache, though, so they always go local-pop -> ashburn -> s3
<thoughtpolice>
adisbladis: Yes, hopefully shielding will help more with that over time, though only so much, it will have a limit. But as Ashburn gets populated, it will serve more and more objects directly without ever talking to S3. I'm betting the final S3 hop is the longest time
<thoughtpolice>
(Assuming the object is NOT a 404 -- but an actual cold object that simply hasn't been touched)
<adisbladis>
thoughtpolice: Is there a way I could request objects directly from ashburn?
<thoughtpolice>
I don't think so, I'm afraid.
<adisbladis>
Ok. I can request the object from europe and then try the same in asia
<andi->
thoughtpolice: getting back to the v6 problem: What is the current theory of the issue? Graham said we can't control it, so it is something usually configured wrong by ISPs/CPEs/…?
<gchristensen>
it is the bug in those technicolor modems
<thoughtpolice>
Basically, that cheap shitware routers are very popular, and they do not implement IPv6 correctly. On the Fastly side, I cannot give specifics but it is... Probably not trivial to fix in any immediate timeframe.
<thoughtpolice>
I'm thinking we'll probably just have to move cache.nixos.org to IPv4 only in the meantime, and possibly have a separate v6.cache.nixos.org that does dual stack routing.
webster23_ has joined #nixos-dev
<thoughtpolice>
This will also help determine exactly where the problems are; IPv6, maybe Nix bugs, maybe user issues, etc. Right now there's sort of a lot going on so pinning it down beyond "crappy routers" is hard, but it's the strongest lead at the moment.
<andi->
So what if I tell you that I (used to) hit the problem and there is no modem nor medium change between me and fastly?
<adisbladis>
Great work thoughtpolice and gchristensen \o/ This has already improved things noticeably.
<gchristensen>
adisbladis: <3 so glad
<adisbladis>
thoughtpolice++
<{^_^}>
thoughtpolice's karma got increased to 7
<adisbladis>
gchristensen++
<{^_^}>
gchristensen's karma got increased to 134
<gchristensen>
then maybe your ipv6 problem is not "the ipv6 problem" and is indeed something else
<gchristensen>
I think there are numerous problems here playing together, and they're difficult to tease apart
<andi->
maybe, maybe it is a different one, maybe it is the same :/
<adisbladis>
I've asked some of my friends in asia to keep an eye on things over the next couple of days
<gchristensen>
interestingly, I noticed that github.com and other public customers of fastly don't enable ipv6
<gchristensen>
andi-: not knowing exactly the problem makes it pretty tough to debug :(
<thoughtpolice>
adisbladis: Good to hear! I figured it would probably help quite a lot.
orivej has quit [Ping timeout: 245 seconds]
<andi->
I can start logging all the connection headers to fastly and then check once I observe it..
<thoughtpolice>
And it should make S3 happier (or, less irritable) as well
<thoughtpolice>
gchristensen: I think I kind of know what's going on in the 404 thing.
<gchristensen>
oh cool
drakonis has quit [Quit: WeeChat 2.4]
<thoughtpolice>
Basically I think what's happening is, we are not caching the 404s. We always hit the backend and then map the 403 to a 404 and give a synthetic. What we want is to cache the result itself, *then*, based on what the response *was*, possibly return a synthetic
<thoughtpolice>
So in other words it's something in, like, deliver we have to handle, I think
<gchristensen>
oh!
<thoughtpolice>
So like you look at the object, it's a hit, so you're going to deliver. Then you look closer, and say, "oh, the cached object was a 403, so instead return a 404 synthetic"
<thoughtpolice>
So it's still cached, you just have to recognize the 403 case specifically
<thoughtpolice>
I think
<thoughtpolice>
Slightly weird but makes sense
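(The actual VCL isn't shown in the log, so here is the described logic as a language-neutral sketch: cache whatever the origin returned, S3's 403 included, and only rewrite it to a 404 at delivery time. The dictionary stands in for Varnish's object cache; nothing here is the real configuration.)

```python
import urllib.error
import urllib.request

cache = {}   # url -> (status, body); stands in for the Varnish object cache

def fetch_from_origin(url):
    """One trip to the S3 origin; missing keys come back as 403 Forbidden."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""

def handle_request(url):
    if url not in cache:
        cache[url] = fetch_from_origin(url)    # cache hits *and* misses
    status, body = cache[url]
    if status == 403:
        # Deliver-time rewrite: the cached object stays a 403,
        # but the client sees the synthetic 404.
        return 404, b"404"
    return status, body
```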
<adisbladis>
Hmm, debian seems to enable ipv6 for their fastly mirror
drakonis has joined #nixos-dev
<andi->
right, I vaguely remember that Debian had similar issues a few years ago when they switched
<tilpner>
Even Hound text search does better than their indexing
<gchristensen>
hound's search is incredible
<gchristensen>
(this is why I setup hound)
orivej has joined #nixos-dev
<thoughtpolice>
I don't like ragging on particular things like that too much but, yeah. GitHub search... could be improved
<thoughtpolice>
My suggestion personally is just use a gajillion of them microsoft dollars to replace their search with a global version of livegrep.com
<adisbladis>
gchristensen: What's your workflow around hound?
<adisbladis>
I'd love it if it could auto-discover local checkouts
<cransom>
is it powered by bing yet?
<gchristensen>
I have search.nix.gsc.io and I "just use it". I don't use hound locally
<gchristensen>
there is editor integration for it, but I don't use that either
<adisbladis>
gchristensen: I wrote a nix expression that generates the config :)