
persist: reintroduce in-mem blob cache #24208

Merged (2 commits) on Jan 10, 2024

Conversation

@danhhz (Contributor) commented on Jan 3, 2024

Originally introduced in #19614 but reverted in #19945 because we were
seeing segfaults in the lru crate this was using. I've replaced it with
a new simple implementation of an lru cache.

This is particularly interesting to revisit now because we might soon be
moving to a world in which each machine has attached disk and this is a
useful stepping stone to a disk-based cache that persists across process
restarts (and thus helps rehydration). The original motivation is as
follows.

A one-time (skunkworks) experiment showed that an environment
running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB
cache. This doesn't scale up to prod data sizes and doesn't help with
multi-process replicas, but the memory usage seems unobjectionable
enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low
pubsub latencies, we might write some blob to s3, then within
milliseconds notify everyone in-process interested in that blob, waking
them up and fetching it. This means even a very small cache is useful
because things stay in it just long enough for them to get fetched by
everyone that immediately needs them. 1 MiB is enough to fit things like
state rollups, remap shard writes, and likely many MVs (probably less so
for sources, but atm those still happen in another cluster).

Motivation

  • This PR adds a known-desirable feature.

Tips for reviewer

The first commit is just a revert of the revert, so feel free to skim it. The second commit contains the new lru implementation and the bits that changed in hooking it up.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz force-pushed the persist_blob_cache branch from 1a64c74 to 3120c62 on January 5, 2024 21:24
@danhhz changed the title from "WIP in mem blob cache" to "persist: reintroduce in-mem blob cache" on Jan 5, 2024
@danhhz force-pushed the persist_blob_cache branch from 3120c62 to fa64c99 on January 5, 2024 21:35
@danhhz requested a review from bkirwi on January 5, 2024 21:35
@danhhz marked this pull request as ready for review on January 5, 2024 21:35
@danhhz requested a review from a team as a code owner on January 5, 2024 21:35
@bkirwi (Contributor) left a comment:

I didn't make it through the custom Lru impl today; coming back to this tomorrow!

Comment on lines 88 to 89
// any races or cache invalidations here. If the value is in the cache,
// it's also what's in s3 (if not, then there's a horrible bug somewhere
@bkirwi (Contributor):

Suggested change:
  // any races or cache invalidations here. If the value is in the cache,
- // it's also what's in s3 (if not, then there's a horrible bug somewhere
+ // any value in S3 is guaranteed to match (if not, then there's a horrible bug somewhere

@bkirwi (Contributor):

(Because the value might be deleted from S3, but that's fine too.)

@danhhz (Contributor, author):

Done!

total_weight: usize,

nodes: List<LruNode<K, V>>,
by_key: HashMap<K, ListNodeId>,
@bkirwi (Contributor) commented on Jan 9, 2024:

Haven't gone through the Lru code in detail yet, so maybe I'll figure this out, but I'm tempted to ask whether we could get away with something like:

nodes: HashMap<K, (V, Weight, Time)>,
by_time: BTreeSet<(Time, K)>,

Since BTreeMap supports popping the smallest entry and also efficient removal.
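
For concreteness, here is a minimal sketch of the layout being suggested: a weighted LRU built only from std collections, where by_time orders (Time, K) pairs by last access so the smallest element is always the eviction victim. The names (Lru, cap, remove) and the monotonically increasing Time counter are illustrative assumptions, not the PR's actual code.

use std::collections::{BTreeSet, HashMap};
use std::hash::Hash;

type Weight = usize;
type Time = u64;

struct Lru<K, V> {
    cap: Weight,
    total_weight: Weight,
    next_time: Time,
    entries: HashMap<K, (V, Weight, Time)>,
    by_time: BTreeSet<(Time, K)>,
}

impl<K: Ord + Hash + Clone, V> Lru<K, V> {
    fn insert(&mut self, key: K, val: V, weight: Weight) {
        // Replace any existing entry so the two indexes stay in sync.
        self.remove(&key);
        let time = self.next_time;
        self.next_time += 1;
        self.by_time.insert((time, key.clone()));
        self.entries.insert(key, (val, weight, time));
        self.total_weight += weight;
        // Evict least-recently-used entries until we fit under the cap again.
        while self.total_weight > self.cap {
            let (_, victim) = self.by_time.pop_first().expect("nonzero weight implies entries");
            let (_, w, _) = self.entries.remove(&victim).expect("indexes in sync");
            self.total_weight -= w;
        }
    }

    fn remove(&mut self, key: &K) {
        if let Some((_, weight, time)) = self.entries.remove(key) {
            self.by_time.remove(&(time, key.clone()));
            self.total_weight -= weight;
        }
    }
}

BTreeSet::pop_first makes finding the least-recently-used entry a log-time operation without any hand-rolled linked list.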

@danhhz (Contributor, author):

Not sure I follow! Is there maybe a typo or something in there? In particular, I don't really understand why the BTreeMap has a Weight key. If there's a way to avoid writing my own linked list, I'm quite interested to hear about it.

@bkirwi (Contributor):

Is there maybe a typo or something in there?

Sure was - fixed! Sorry about that - shouldn't have rushed it.

BTreeSet is there because we want to both add and remove arbitrary entries (as cache entries come and go) and to grab the minimum in an efficient way. Maybe a good 1:1 topic!

@danhhz (Contributor, author):

This worked beautifully, thanks for the idea! I am very excited to not maintain my own doubly linked list implementation 😅

The one tweak I made was to replace the BTreeSet<(Time, K)> with a BTreeMap<Time, K>. This was 1) to avoid a couple unfortunate borrow checker issues (all workable but messy) and 2) to avoid an Ord requirement on K (also no biggie, but felt wrong conceptually). I don't really see any downside besides Times needing to be globally unique, but they were anyway.

I went back and forth on whether the map should internally be ordered by increasing or decreasing time and ended up swapping the order compared to the list impl. I think I do prefer this one, but I verified that it's easy to slap a std::cmp::Reverse in there if you feel strongly the other way.
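
A rough sketch of the tweak described above, again illustrative rather than the PR's code: by_time becomes a BTreeMap<Time, K>, K only needs Eq + Hash + Clone instead of Ord, and the Time keys stay globally unique because every touch draws a fresh value from a monotonically increasing next_time counter.

use std::collections::{BTreeMap, HashMap};
use std::hash::Hash;

type Time = u64;

struct Lru<K, V> {
    next_time: Time,
    entries: HashMap<K, (V, usize, Time)>,
    by_time: BTreeMap<Time, K>,
}

impl<K: Eq + Hash + Clone, V> Lru<K, V> {
    /// A hit re-keys the entry under a brand-new (maximal) time.
    fn get(&mut self, key: &K) -> Option<&V> {
        let new_time = self.next_time;
        self.next_time += 1;
        let (_, _, time) = self.entries.get_mut(key)?;
        let old_time = std::mem::replace(time, new_time);
        // Move the key from its old slot in the recency index to the new one.
        let k = self.by_time.remove(&old_time).expect("indexes in sync");
        self.by_time.insert(new_time, k);
        self.entries.get(key).map(|(v, _, _)| v)
    }

    /// Eviction pops the smallest time, i.e. the least recently touched key.
    fn evict_lru(&mut self) -> Option<(K, V)> {
        let (_, key) = self.by_time.pop_first()?;
        let (val, _, _) = self.entries.remove(&key).expect("indexes in sync");
        Some((key, val))
    }
}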

// some read handles).
let mut cache = self.cache.lock().expect("lock poisoned");
cache.insert(key.to_owned(), blob.clone(), blob.len());
self.resize_and_update_size_metrics(&mut cache);
@bkirwi (Contributor):

IIUC this will end up evicting from the underlying map twice: once for the insert and once on the resize. Seems harmless... but it may be cleaner to not resize during map updates and do all the evicting on resize, since it's called for every update anyways.

@danhhz (Contributor, author):

IMO it's important for insert to hold all the invariants at the time it returns. And then resize is a conceptually separate operation; I think it's an implementation detail of BlobMemCache that it happens to call it each time the map is updated.
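
A tiny self-contained sketch of the invariant being described, with a plain FIFO queue standing in for the real recency structure; WeightCapped, cap, and enforce_cap are made-up names, not BlobMemCache's actual API. insert restores total_weight <= cap before it returns, and resize is a separate operation that re-establishes the same invariant under a new cap.

use std::collections::VecDeque;

struct WeightCapped<V> {
    cap: usize,
    total_weight: usize,
    // (value, weight) pairs, oldest at the front; stands in for a real LRU order.
    queue: VecDeque<(V, usize)>,
}

impl<V> WeightCapped<V> {
    fn insert(&mut self, val: V, weight: usize) {
        self.queue.push_back((val, weight));
        self.total_weight += weight;
        // The invariant `total_weight <= cap` holds by the time insert returns.
        self.enforce_cap();
    }

    /// Changing the cap is conceptually separate from mutating the contents,
    /// even if the caller happens to invoke it after every update.
    fn resize(&mut self, new_cap: usize) {
        self.cap = new_cap;
        self.enforce_cap();
    }

    fn enforce_cap(&mut self) {
        while self.total_weight > self.cap {
            let (_, w) = self.queue.pop_front().expect("nonzero weight implies entries");
            self.total_weight -= w;
        }
    }
}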

@danhhz (Contributor, author) left a comment:

RFAL! Rebased and force pushed to resolve the merge skew, but pushed all changes as an append-only commit.

Gonna kick off a round of nightlies while you take a second review pass.


@danhhz force-pushed the persist_blob_cache branch from fa64c99 to 05e0e29 on January 9, 2024 21:13
@bkirwi (Contributor) left a comment:

Looks good - thanks for the followup!

weight: usize,
next_time: Time,
entries: HashMap<K, (V, Weight, Time)>,
by_time: BTreeMap<Time, K>,
@bkirwi (Contributor):

Yeah, BTreeMap definitely works out better here!

@danhhz force-pushed the persist_blob_cache branch from 05e0e29 to 5938def on January 10, 2024 15:43
@danhhz (Contributor, author) commented on Jan 10, 2024

TFTR!

@danhhz enabled auto-merge on January 10, 2024 15:43
@danhhz merged commit 38bee28 into MaterializeInc:main on Jan 10, 2024
63 checks passed
@danhhz deleted the persist_blob_cache branch on January 10, 2024 16:14