
persist: reintroduce in-mem blob cache #24208

Merged (2 commits) on Jan 10, 2024

Conversation

@danhhz (Contributor) commented on Jan 3, 2024

Originally introduced in #19614 but reverted in #19945 because we were
seeing segfaults in the lru crate this was using. I've replaced it with
a new simple implementation of an lru cache.

This is particularly interesting to revisit now because we might soon be
moving to a world in which each machine has attached disk and this is a
useful stepping stone to a disk-based cache that persists across process
restarts (and thus helps rehydration). The original motivation is as
follows.

A one-time (skunkworks) experiment showed that an environment
running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB
cache. This doesn't scale up to prod data sizes and doesn't help with
multi-process replicas, but the memory usage seems unobjectionable
enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low
pubsub latencies, we might write some blob to s3, then within
milliseconds notify everyone in-process interested in that blob, waking
them up and fetching it. This means even a very small cache is useful
because things stay in it just long enough for them to get fetched by
everyone that immediately needs them. 1 MiB is enough to fit things like
state rollups, remap shard writes, and likely many MVs (probably less so
for sources, but atm those still happen in another cluster).

Motivation

  • This PR adds a known-desirable feature.

Tips for reviewer

The first commit is just a revert of the revert, so feel free to skim it. The second commit contains the new lru implementation and the bits that changed in hooking it up.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz force-pushed the persist_blob_cache branch from 1a64c74 to 3120c62 on January 5, 2024 21:24
@danhhz changed the title from "WIP in mem blob cache" to "persist: reintroduce in-mem blob cache" on Jan 5, 2024
@danhhz force-pushed the persist_blob_cache branch from 3120c62 to fa64c99 on January 5, 2024 21:35
@danhhz requested a review from bkirwi on January 5, 2024 21:35
@danhhz marked this pull request as ready for review on January 5, 2024 21:35
@danhhz requested a review from a team as a code owner on January 5, 2024 21:35
@bkirwi (Contributor) left a comment:

I didn't make it through the custom Lru impl today; coming back to this tomorrow!

Comment on lines 88 to 89
// any races or cache invalidations here. If the value is in the cache,
// it's also what's in s3 (if not, then there's a horrible bug somewhere
@bkirwi (Contributor):

Suggested change:
  // any races or cache invalidations here. If the value is in the cache,
- // it's also what's in s3 (if not, then there's a horrible bug somewhere
+ // any value in S3 is guaranteed to match (if not, then there's a horrible bug somewhere

@bkirwi (Contributor):

(Because the value might be deleted from S3, but that's fine too.)

@danhhz (Contributor, author):

Done!

total_weight: usize,

nodes: List<LruNode<K, V>>,
by_key: HashMap<K, ListNodeId>,
@bkirwi (Contributor) commented on Jan 9, 2024:

Haven't gone through the Lru code in detail yet, so maybe I'll figure this out, but I'm tempted to ask whether we could get away with something like:

nodes: HashMap<K, (V, Weight, Time)>,
by_time: BTreeSet<(Time, K)>,

Since BTreeMap supports popping the smallest entry and also efficient removal.
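
For concreteness, here is a minimal sketch of the layout being suggested: a weighted LRU built only from std collections, where by_time orders (Time, K) pairs by last access so the smallest element is always the eviction victim. The names (Lru, cap, remove) and the monotonically increasing Time counter are illustrative assumptions, not the PR's actual code.

use std::collections::{BTreeSet, HashMap};
use std::hash::Hash;

type Weight = usize;
type Time = u64;

struct Lru<K, V> {
    cap: Weight,
    total_weight: Weight,
    next_time: Time,
    entries: HashMap<K, (V, Weight, Time)>,
    by_time: BTreeSet<(Time, K)>,
}

impl<K: Ord + Hash + Clone, V> Lru<K, V> {
    fn insert(&mut self, key: K, val: V, weight: Weight) {
        // Replace any existing entry so the two indexes stay in sync.
        self.remove(&key);
        let time = self.next_time;
        self.next_time += 1;
        self.by_time.insert((time, key.clone()));
        self.entries.insert(key, (val, weight, time));
        self.total_weight += weight;
        // Evict least-recently-used entries until we fit under the cap again.
        while self.total_weight > self.cap {
            let (_, victim) = self.by_time.pop_first().expect("nonzero weight implies entries");
            let (_, w, _) = self.entries.remove(&victim).expect("indexes in sync");
            self.total_weight -= w;
        }
    }

    fn remove(&mut self, key: &K) {
        if let Some((_, weight, time)) = self.entries.remove(key) {
            self.by_time.remove(&(time, key.clone()));
            self.total_weight -= weight;
        }
    }
}

BTreeSet::pop_first makes finding the least-recently-used entry a log-time operation without any hand-rolled linked list.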

@danhhz (Contributor, author):

Not sure I follow! Is there maybe a typo or something in there? In particular, I don't really understand why the BTreeMap has a Weight key. If there's a way to avoid writing my own linked list, I'm quite interested to hear about it.

@bkirwi (Contributor):

Is there maybe a typo or something in there?

Sure was - fixed! Sorry about that - shouldn't have rushed it.

BTreeSet is there because we want to both add and remove arbitrary entries (as cache entries come and go) and to grab the minimum in an efficient way. Maybe a good 1:1 topic!

@danhhz (Contributor, author):

This worked beautifully, thanks for the idea! I am very excited to not maintain my own doubly linked list implementation 😅

The one tweak I made was to replace the BTreeSet<(Time, K)> with a BTreeMap<Time, K>. This was 1) to avoid a couple unfortunate borrow checker issues (all workable but messy) and 2) to avoid an Ord requirement on K (also no biggie, but felt wrong conceptually). I don't really see any downside besides Times needing to be globally unique, but they were anyway.

I went back and forth on whether the map should internally be ordered by increasing or decreasing time and ended up swapping the order compared to the list impl. I think I do prefer this one, but I verified that it's easy to slap a std::cmp::Reverse in there if you feel strongly the other way.
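
A rough sketch of the tweak described above, again illustrative rather than the PR's code: by_time becomes a BTreeMap<Time, K>, K only needs Eq + Hash + Clone instead of Ord, and the Time keys stay globally unique because every touch draws a fresh value from a monotonically increasing next_time counter.

use std::collections::{BTreeMap, HashMap};
use std::hash::Hash;

type Time = u64;

struct Lru<K, V> {
    next_time: Time,
    entries: HashMap<K, (V, usize, Time)>,
    by_time: BTreeMap<Time, K>,
}

impl<K: Eq + Hash + Clone, V> Lru<K, V> {
    /// A hit re-keys the entry under a brand-new (maximal) time.
    fn get(&mut self, key: &K) -> Option<&V> {
        let new_time = self.next_time;
        self.next_time += 1;
        let (_, _, time) = self.entries.get_mut(key)?;
        let old_time = std::mem::replace(time, new_time);
        // Move the key from its old slot in the recency index to the new one.
        let k = self.by_time.remove(&old_time).expect("indexes in sync");
        self.by_time.insert(new_time, k);
        self.entries.get(key).map(|(v, _, _)| v)
    }

    /// Eviction pops the smallest time, i.e. the least recently touched key.
    fn evict_lru(&mut self) -> Option<(K, V)> {
        let (_, key) = self.by_time.pop_first()?;
        let (val, _, _) = self.entries.remove(&key).expect("indexes in sync");
        Some((key, val))
    }
}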

// some read handles).
let mut cache = self.cache.lock().expect("lock poisoned");
cache.insert(key.to_owned(), blob.clone(), blob.len());
self.resize_and_update_size_metrics(&mut cache);
@bkirwi (Contributor):

IIUC this will end up evicting from the underlying map twice: once for the insert and once on the resize. Seems harmless... but it may be cleaner to not resize during map updates and do all the evicting on resize, since it's called for every update anyways.

@danhhz (Contributor, author):

IMO it's important for insert to hold all the invariants at the time it returns. And then resize is a conceptually separate operation; I think it's an implementation detail of BlobMemCache that it happens to call it each time the map is updated.
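
A tiny self-contained sketch of the invariant being described, with a plain FIFO queue standing in for the real recency structure; WeightCapped, cap, and enforce_cap are made-up names, not BlobMemCache's actual API. insert restores total_weight <= cap before it returns, and resize is a separate operation that re-establishes the same invariant under a new cap.

use std::collections::VecDeque;

struct WeightCapped<V> {
    cap: usize,
    total_weight: usize,
    // (value, weight) pairs, oldest at the front; stands in for a real LRU order.
    queue: VecDeque<(V, usize)>,
}

impl<V> WeightCapped<V> {
    fn insert(&mut self, val: V, weight: usize) {
        self.queue.push_back((val, weight));
        self.total_weight += weight;
        // The invariant `total_weight <= cap` holds by the time insert returns.
        self.enforce_cap();
    }

    /// Changing the cap is conceptually separate from mutating the contents,
    /// even if the caller happens to invoke it after every update.
    fn resize(&mut self, new_cap: usize) {
        self.cap = new_cap;
        self.enforce_cap();
    }

    fn enforce_cap(&mut self) {
        while self.total_weight > self.cap {
            let (_, w) = self.queue.pop_front().expect("nonzero weight implies entries");
            self.total_weight -= w;
        }
    }
}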

@danhhz (Contributor, author) left a comment:

RFAL! Rebased and force pushed to resolve the merge skew, but pushed all changes as an append-only commit.

Gonna kick off a round of nightlies while you take a second review pass.


@danhhz force-pushed the persist_blob_cache branch from fa64c99 to 05e0e29 on January 9, 2024 21:13
@bkirwi (Contributor) left a comment:

Looks good - thanks for the followup!

weight: usize,
next_time: Time,
entries: HashMap<K, (V, Weight, Time)>,
by_time: BTreeMap<Time, K>,
@bkirwi (Contributor):

Yeah, BTreeMap definitely works out better here!

@danhhz force-pushed the persist_blob_cache branch from 05e0e29 to 5938def on January 10, 2024 15:43
@danhhz (Contributor, author) commented on Jan 10, 2024

TFTR!

@danhhz enabled auto-merge on January 10, 2024 15:43
@danhhz merged commit 38bee28 into MaterializeInc:main on Jan 10, 2024
63 checks passed
@danhhz deleted the persist_blob_cache branch on January 10, 2024 16:14