persist: introduce a very small in-mem blob cache #19614

Merged
merged 1 commit into MaterializeInc:main from persist_blob_cache_mem on Jun 1, 2023

Conversation

@danhhz danhhz (Contributor) commented May 31, 2023

A one-time (skunkworks) experiment showed that an environment running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB cache. This doesn't scale up to prod data sizes and doesn't help with multi-process replicas, but the memory usage seems unobjectionable enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low pubsub latencies, we might write some blob to s3, then within milliseconds notify everyone in-process interested in that blob, waking them up and fetching it. This means even a very small cache is useful because things stay in it just long enough for them to get fetched by everyone that immediately needs them. 1 MiB is enough to fit things like state rollups, remap shard writes, and likely many MVs (probably less so for sources, but atm those still happen in another cluster).

Touches MaterializeInc/database-issues#5704

Motivation

  • This PR fixes a recognized bug.

Tips for reviewer

It's possible that we want to gate this behind a feature flag, or otherwise hook the param up to LaunchDarkly... but I just can't talk myself into it being worth it for a quick win. My concerns here would be if it's (1) incorrect or (2) somehow slower in some unexpected case. CI should do a very good job of figuring out any issues with (1) and given s3 latencies I have a very hard time imagining (2) being an issue.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz danhhz requested a review from a team as a code owner May 31, 2023 18:32
@danhhz danhhz force-pushed the persist_blob_cache_mem branch 2 times, most recently from b55ac18 to c38d8e7 on May 31, 2023 18:54
@danhhz danhhz requested a review from benesch as a code owner May 31, 2023 18:54
// it's also what's in s3 (if not, then there's a horrible bug somewhere
// else).
if let Some(cached_value) = self.cache.get(key) {
    self.metrics.blob_cache_mem.hits_blobs.inc();
Contributor

is the intent to calculate hit rate based on the delta between this vs existing blob metrics?

Contributor Author

yeah, tho we'll have to do hits / (hits + fetches) because I put this "around" the MetricsBlob wrapper so it wouldn't skew our latency histograms
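
As a worked example (illustrative numbers): 900 cache hits alongside 100 fetches that fall through to the backing blob store would give a hit rate of 900 / (900 + 100) = 90%.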

    blob: Arc<dyn Blob + Send + Sync>,
) -> Arc<dyn Blob + Send + Sync> {
    let cache = Cache::<String, SegmentedBytes>::builder()
        .max_capacity(u64::cast_from(cfg.blob_cache_mem_limit_bytes))
Contributor

thoughts on making this configurable? the limit could be artificially large, and we could dynamically set a multiplier in the weigher fn to play around with it (e.g. max capacity of "1GiB" but set the multiplier to 1GiB/1MiB so it's effectively a 1MiB cache by default)
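
A minimal sketch of that multiplier idea, assuming moka's sync cache (the constants, value type, and function name are illustrative, not from the PR): declare an artificially large max_capacity and scale every entry's weight in the weigher, so the effective cache size is max_capacity divided by the multiplier.

use moka::sync::Cache;

fn build_blob_cache() -> Cache<String, Vec<u8>> {
    // Nominal capacity of "1 GiB"; the weigher below scales it down.
    const NOMINAL_CAPACITY_BYTES: u64 = 1024 * 1024 * 1024;
    // With a multiplier of 1024, the cache behaves like a 1 MiB cache.
    const WEIGHT_MULTIPLIER: u32 = 1024;

    Cache::<String, Vec<u8>>::builder()
        .max_capacity(NOMINAL_CAPACITY_BYTES)
        // Weigh each entry as (size in bytes) * multiplier, so the effective
        // capacity is NOMINAL_CAPACITY_BYTES / WEIGHT_MULTIPLIER.
        .weigher(|_key: &String, value: &Vec<u8>| {
            u32::try_from(value.len())
                .unwrap_or(u32::MAX)
                .saturating_mul(WEIGHT_MULTIPLIER)
        })
        .build()
}

Note that moka weighs an entry when it is inserted, so a dynamically changed multiplier would only apply to entries inserted afterwards.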

Contributor Author

typed this in the review notes, but I'd rather not :D. I'd prefer we do something like #19532 before going too much further down the cache road. even by the time we start getting to O(part size), I'd think it'd probably be time to introduce a secondary disk cache layer and the tradeoffs involved just complicate things sooo much. was hoping to keep this PR framed as a really quick win

Contributor

I was more thinking to have the leeway to change it to like, 4MiB if that turns out to make a big difference, but I get that it adds complexity

Contributor

(I'm not sure there's an easy way to model the hit rate with moka's LFU/LRU policies beyond just measuring it empirically)

Contributor Author

I think my point is that without something like #19532, we won't know if it makes a big difference without just trying things in LaunchDarkly

Contributor

hm, I think I'm not sure why we need to simulate it to that level of fidelity before adding a knob. it seems like the most useful data would be running this on a staging / prod env and just trying a few different sizes and seeing how the numbers look. is the concern for when the sizes get large enough to affect the amount of memory left for the rest of the process?

Contributor Author

okay yeah, I think you're right here. at this point, I'm way over the hours I budgeted for spending on this, so any updates will have to wait until there's a pause in pushdown

Contributor Author

paul and I decided offline to do the compromise solution of a knob that only applies at startup. I dislike the footgun very much but it's a lot simpler to implement. we can always circle back later if we find ourselves needing to adjust this frequently

u32::MAX
})
})
.build();
Contributor

could be interesting to add a listener so we can track how many removals come from explicit delete calls vs size-based evictions
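
A minimal sketch of such a listener, assuming moka's sync cache (the value type and the println! placeholders standing in for real metric counters are illustrative):

use moka::{notification::RemovalCause, sync::Cache};

fn build_cache_with_listener() -> Cache<String, Vec<u8>> {
    Cache::<String, Vec<u8>>::builder()
        .max_capacity(1024 * 1024)
        .weigher(|_key: &String, value: &Vec<u8>| {
            u32::try_from(value.len()).unwrap_or(u32::MAX)
        })
        // Called whenever an entry leaves the cache, with the reason why.
        .eviction_listener(|_key, _value, cause| match cause {
            // Removed via Cache::invalidate, e.g. because the blob was deleted.
            RemovalCause::Explicit => println!("explicit removal"),
            // Pushed out to stay under max_capacity.
            RemovalCause::Size => println!("size-based eviction"),
            // Overwritten by an insert for the same key, or expired by TTL/TTI.
            RemovalCause::Replaced | RemovalCause::Expired => {}
        })
        .build()
}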

Contributor Author

great idea! will do

const_labels: {"cache" => "mem"},
)),
hits_blobs: registry.register(metric!(
name: "mz_persist_blob_cache_hits_blobs",
Contributor

nit: elsewhere it's referred to as BlobMemCache vs the metric name blob_cache

Contributor Author

yup! did that on purpose to set up a hypothetical disk cache. this has a label of "cache" -> "mem"

Contributor

oh yeah, I totally saw that and forgot it when reading it over a second time


// This could maybe use moka's async cache to unify any concurrent
// fetches for the same key? That's not particularly expected in
// persist's workload, so punt for now.
Contributor

I am a little curious about this... I could imagine if multiple persist_source are waiting on the same blob, they could all cache miss because their timings might be so closely aligned. hm... any metrics that would give us insight into how often that might happen? 🤔

Contributor Author

i can't think of an easy way to measure this without just literally solving the problem. would prefer to punt this to followup work as well
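
For reference, a minimal sketch of what that follow-up could look like, assuming moka's future cache (requires the crate's future feature; fetch_from_s3 is a hypothetical stand-in for the real Blob::get): get_with coalesces concurrent lookups of the same key, so simultaneous misses result in a single fetch that everyone awaits.

use std::sync::Arc;

use moka::future::Cache;

// Hypothetical stand-in for fetching a blob from S3.
async fn fetch_from_s3(_key: &str) -> Vec<u8> {
    Vec::new()
}

async fn get_blob(cache: &Cache<String, Arc<Vec<u8>>>, key: String) -> Arc<Vec<u8>> {
    cache
        // Only one caller per key runs the init future; concurrent callers for
        // the same key wait for that result instead of fetching again.
        .get_with(key.clone(), async move { Arc::new(fetch_from_s3(&key).await) })
        .await
}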

@danhhz danhhz force-pushed the persist_blob_cache_mem branch from c38d8e7 to 191954e on May 31, 2023 19:00
@pH14 pH14 (Contributor) left a comment

LGTM. I would like to make the size configurable, but I'm happy to add that in myself later -- right now I'm more curious to see how it affects our S3 numbers in staging/prod next week!

A one-time (skunkworks) experiment showed that an environment
running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB
cache. This doesn't scale up to prod data sizes and doesn't help with
multi-process replicas, but the memory usage seems unobjectionable
enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low
pubsub latencies, we might write some blob to s3, then within
milliseconds notify everyone in-process interested in that blob, waking
them up and fetching it. This means even a very small cache is useful
because things stay in it just long enough for them to get fetched by
everyone that immediately needs them. 1 MiB is enough to fit things like
state rollups, remap shard writes, and likely many MVs (probably less so
for sources, but atm those still happen in another cluster).

Touches #19225
@danhhz danhhz force-pushed the persist_blob_cache_mem branch from bb24bb4 to 877daf1 on June 1, 2023 21:04
@danhhz danhhz requested a review from a team as a code owner June 1, 2023 21:04
@danhhz danhhz (Contributor Author) commented Jun 1, 2023

TFTR!

@danhhz danhhz enabled auto-merge June 1, 2023 21:04
@danhhz danhhz merged commit 1bcd1c4 into MaterializeInc:main Jun 1, 2023
@danhhz danhhz deleted the persist_blob_cache_mem branch June 1, 2023 21:42
@def- def- (Contributor) commented Jun 14, 2023

My concerns here would be if it's (1) incorrect or (2) somehow slower in some unexpected case. CI should do a very good job of figuring out any issues with (1) and given s3 latencies I have a very hard time imagining (2) being an issue.

(3) it might crash ;) (not sure yet if it actually comes from here, but timing and segfault seem suspicious)

danhhz added a commit to danhhz/materialize that referenced this pull request Jan 5, 2024
Originally introduced in MaterializeInc#19614 but reverted in MaterializeInc#19945 because we were
seeing segfaults in the lru crate this was using. I've replaced it with
a new simple implementation of an lru cache.

This is particularly interesting to revisit now because we might soon be
moving to a world in which each machine has attached disk and this is a
useful stepping stone to a disk-based cache that persists across process
restarts (and thus helps rehydration). The original motivation is as
follows.

A one-time (skunkworks) experiment showed that an environment
running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB
cache. This doesn't scale up to prod data sizes and doesn't help with
multi-process replicas, but the memory usage seems unobjectionable
enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low
pubsub latencies, we might write some blob to s3, then within
milliseconds notify everyone in-process interested in that blob, waking
them up and fetching it. This means even a very small cache is useful
because things stay in it just long enough for them to get fetched by
everyone that immediately needs them. 1 MiB is enough to fit things like
state rollups, remap shard writes, and likely many MVs (probably less so
for sources, but atm those still happen in another cluster).
danhhz added a commit to danhhz/materialize that referenced this pull request Jan 5, 2024
danhhz added a commit to danhhz/materialize that referenced this pull request Jan 9, 2024
danhhz added a commit to danhhz/materialize that referenced this pull request Jan 10, 2024