persist: introduce a very small in-mem blob cache #19614
Conversation
b55ac18 to c38d8e7
// it's also what's in s3 (if not, then there's a horrible bug somewhere
// else).
if let Some(cached_value) = self.cache.get(key) {
    self.metrics.blob_cache_mem.hits_blobs.inc();
is the intent to calculate hit rate based on the delta between this vs existing blob metrics?
yeah, though we'll have to do hits / (hits + fetches), because I put this "around" the MetricsBlob wrapper so it wouldn't skew our latency histograms
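For reference, a minimal sketch of that derived hit rate, assuming plain counter values read from the two metrics discussed above (the parameter names here are illustrative, not persist's actual metric fields):

```rust
/// Hypothetical helper: because the cache sits "around" the MetricsBlob
/// wrapper, only misses reach MetricsBlob, so total gets = hits + fetches.
fn blob_cache_hit_rate(cache_hits: u64, blob_fetches: u64) -> f64 {
    let total = cache_hits + blob_fetches;
    if total == 0 {
        return 0.0;
    }
    cache_hits as f64 / total as f64
}
```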
blob: Arc<dyn Blob + Send + Sync>,
) -> Arc<dyn Blob + Send + Sync> {
    let cache = Cache::<String, SegmentedBytes>::builder()
        .max_capacity(u64::cast_from(cfg.blob_cache_mem_limit_bytes))
thoughts on making this configurable? the limit could be artificially large, and we could dynamically set a multiplier in the weigher fn to play around with it (e.g. max capacity of "1GiB" but set the multiplier to 1GiB/1MiB so it's effectively a 1MiB cache by default)
typed this in the review notes, but I'd rather not :D. I'd prefer we do something like #19532 before going too much further down the cache road. even by the time we start getting to O(part size), I'd think it'd probably be time to introduce a secondary disk cache layer and the tradeoffs involved just complicate things sooo much. was hoping to keep this PR framed as a really quick win
I was more thinking to have the leeway to change it to like, 4MiB if that turns out to make a big difference, but I get that it adds complexity
(I'm not sure there's an easy way to model the hit rate with moka's LFU/LRU policies beyond just measuring it empirically)
I think my point is that without something like #19532, we won't know if it makes a big difference without just trying things in LaunchDarkly
hm, I think I'm not sure why we need to simulate it to that level of fidelity before adding a knob. it seems like the most useful data would be running this on a staging / prod env and just trying a few different sizes and seeing how the numbers look. is the concern for when the sizes get large enough to affect the amount of memory left for the rest of the process?
okay yeah, I think you're right here. at this point, I'm way over the hours I budgeted for spending on this, so any updates will have to wait until there's a pause in pushdown
Paul and I decided offline to do the compromise solution of a knob that only applies at startup. I dislike the footgun very much but it's a lot simpler to implement. we can always circle back later if we find ourselves needing to adjust this frequently
u32::MAX
})
})
.build();
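Pieced together, the builder that the two fragments above come from might look roughly like this sketch, assuming moka's sync `Cache` and using `Vec<u8>` as a stand-in for `SegmentedBytes` (the config field name is also an assumption, not the PR's exact code):

```rust
use moka::sync::Cache;

// A sketch, not the real implementation: a byte-budgeted cache where each
// entry is weighed by its payload size, clamped because moka weights are u32.
fn build_blob_cache(blob_cache_mem_limit_bytes: usize) -> Cache<String, Vec<u8>> {
    Cache::builder()
        .max_capacity(u64::try_from(blob_cache_mem_limit_bytes).expect("fits in u64"))
        .weigher(|_key: &String, value: &Vec<u8>| {
            u32::try_from(value.len()).unwrap_or(u32::MAX)
        })
        .build()
}
```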
could be interesting to add a listener so we can track how many removals come from explicit delete calls vs size-based evictions
great idea! will do
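A minimal sketch of what such a listener could look like, using moka's eviction listener and `RemovalCause`; the counter bumps are left as comments since the real version would register metrics through persist's metrics registry:

```rust
use moka::notification::RemovalCause;
use moka::sync::Cache;

// Sketch only, assuming the cache is built roughly as above; the listener
// lets us distinguish explicit deletes from size-based evictions.
fn build_cache_with_eviction_listener(limit_bytes: u64) -> Cache<String, Vec<u8>> {
    Cache::builder()
        .max_capacity(limit_bytes)
        .weigher(|_key: &String, value: &Vec<u8>| {
            u32::try_from(value.len()).unwrap_or(u32::MAX)
        })
        .eviction_listener(|_key, _value, cause| match cause {
            // Removed via an explicit delete/invalidate call.
            RemovalCause::Explicit => { /* metrics.evictions_explicit.inc() */ }
            // Evicted because the cache exceeded its byte budget.
            RemovalCause::Size => { /* metrics.evictions_size.inc() */ }
            // Replaced by a newer value or expired by a TTL/TTI policy.
            RemovalCause::Replaced | RemovalCause::Expired => {}
        })
        .build()
}
```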
const_labels: {"cache" => "mem"},
)),
hits_blobs: registry.register(metric!(
    name: "mz_persist_blob_cache_hits_blobs",
nit: elsewhere it's referred to as BlobMemCache vs the metric name blob_cache
yup! did that on purpose to set up a hypothetical disk cache. this has a label of "cache" -> "mem"
oh yeah, I totally saw that and forgot it when reading it over a second time
// This could maybe use moka's async cache to unify any concurrent
// fetches for the same key? That's not particularly expected in
// persist's workload, so punt for now.
I am a little curious about this... I could imagine if multiple persist_source are waiting on the same blob, they could all cache miss because their timings might be so closely aligned. hm... any metrics that would give us insight into how often that might happen? 🤔
I can't think of an easy way to measure this without just literally solving the problem. would prefer to punt this to follow-up work as well
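For the record, a minimal sketch of what the moka async-cache approach floated above could look like: `get_with` coalesces concurrent initializers for the same key, so only one of the racing readers would actually hit S3. `fetch_from_s3` is a placeholder here, not persist's real Blob API:

```rust
use std::sync::Arc;

use moka::future::Cache;

// Sketch only: racing callers for the same key share one fetch future.
async fn get_coalesced(cache: &Cache<String, Arc<Vec<u8>>>, key: String) -> Arc<Vec<u8>> {
    cache
        .get_with(key.clone(), async move {
            // Only one of the concurrent callers runs this; the rest await its result.
            Arc::new(fetch_from_s3(&key).await)
        })
        .await
}

async fn fetch_from_s3(_key: &str) -> Vec<u8> {
    // Placeholder for the real S3 fetch.
    Vec::new()
}
```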
c38d8e7 to 191954e
LGTM. I would like to make the size configurable, but I'm happy to add that in myself later -- right now I'm more curious to see how it affects our S3 numbers in staging/prod next week!
bb24bb4 to 877daf1
TFTR!
(3) it might crash ;) (not sure yet if it actually comes from here, but timing and segfault seem suspicious)
Originally introduced in MaterializeInc#19614 but reverted in MaterializeInc#19945 because we were seeing segfaults in the lru crate this was using. I've replaced it with a new simple implementation of an lru cache. This is particularly interesting to revisit now because we might soon be moving to a world in which each machine has attached disk, and this is a useful stepping stone to a disk-based cache that persists across process restarts (and thus helps rehydration).

The original motivation is as follows. A one-time (skunkworks) experiment showed that an environment running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB cache. This doesn't scale up to prod data sizes and doesn't help with multi-process replicas, but the memory usage seems unobjectionable enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low pubsub latencies, we might write some blob to s3, then within milliseconds notify everyone in-process interested in that blob, waking them up and fetching it. This means even a very small cache is useful because things stay in it just long enough for them to get fetched by everyone that immediately needs them. 1 MiB is enough to fit things like state rollups, remap shard writes, and likely many MVs (probably less so for sources, but atm those still happen in another cluster).
Motivation

A one-time (skunkworks) experiment showed that an environment running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB cache. This doesn't scale up to prod data sizes and doesn't help with multi-process replicas, but the memory usage seems unobjectionable enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low pubsub latencies, we might write some blob to s3, then within milliseconds notify everyone in-process interested in that blob, waking them up and fetching it. This means even a very small cache is useful because things stay in it just long enough for them to get fetched by everyone that immediately needs them. 1 MiB is enough to fit things like state rollups, remap shard writes, and likely many MVs (probably less so for sources, but atm those still happen in another cluster).

Touches MaterializeInc/database-issues#5704
Tips for reviewer
It's possible that we want to gate this behind a feature flag, or otherwise hook the param up to LaunchDarkly... but I just can't talk myself into it being worth it for a quick win. My concerns here would be if it's (1) incorrect or (2) somehow slower in some unexpected case. CI should do a very good job of figuring out any issues with (1) and given s3 latencies I have a very hard time imagining (2) being an issue.
Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.