persist: reintroduce in-mem blob cache #24208
Conversation
(force-pushed 1a64c74 → 3120c62)
(force-pushed 3120c62 → fa64c99)
I didn't make it through the custom Lru impl today; coming back to this tomorrow!
// any races or cache invalidations here. If the value is in the cache,
// it's also what's in s3 (if not, then there's a horrible bug somewhere
Suggested change:
  // any races or cache invalidations here. If the value is in the cache,
- // it's also what's in s3 (if not, then there's a horrible bug somewhere
+ // any value in S3 is guaranteed to match (if not, then there's a horrible bug somewhere
(Because the value might be deleted from S3, but that's fine too.)
Done!
total_weight: usize,

nodes: List<LruNode<K, V>>,
by_key: HashMap<K, ListNodeId>,
Haven't gone through the Lru code in detail yet, so maybe I'll figure this out, but I'm tempted to ask whether we could get away with something like:
nodes: HashMap<K, (V, Weight, Time)>,
by_time: BTreeSet<(Time, K)>,
Since BTreeMap supports popping the smallest entry and also efficient removal.
Not sure I follow! Is there maybe a typo or something in there? In particular, I don't really understand why the BTreeMap has a Weight key. If there's a way to avoid writing my own linked list, I'm quite interested to hear about it.
"Is there maybe a typo or something in there?"
Sure was - fixed! Sorry about that - shouldn't have rushed it.
BTreeSet is there because we want to both add and remove arbitrary entries (as cache entries come and go) and to grab the minimum in an efficient way. Maybe a good 1:1 topic!
This worked beautifully, thanks for the idea! I am very excited to not maintain an original doubly linked list implementation 😅
The one tweak I made was to replace the BTreeSet<(Time, K)> with a BTreeMap<Time, K>. This was 1) to avoid a couple unfortunate borrow checker issues (all workable but messy) and 2) to avoid an Ord requirement on K (also no biggie, but felt wrong conceptually). I don't really see any downside besides Times needing to be globally unique, but they were anyway.
I went back and forth on whether the map should internally be ordered by increasing or decreasing time and ended up swapping the order compared to the list impl. I think I do prefer this one, but I verified that it's easy to slap a std::cmp::Reverse in there if you feel strongly the other way.
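For concreteness, the HashMap + BTreeMap<Time, K> design discussed above can be sketched roughly like this. This is a hypothetical, simplified illustration, not the PR's actual code: the field names mirror the struct in this thread, but the method bodies and the `Lru` API here are my own assumptions.

```rust
use std::collections::{BTreeMap, HashMap};
use std::hash::Hash;

type Time = u64;
type Weight = usize;

struct Lru<K, V> {
    capacity: Weight,
    weight: Weight,
    // Globally unique, monotonically increasing access times.
    next_time: Time,
    entries: HashMap<K, (V, Weight, Time)>,
    // Ordered by last-access time: the first entry is the eviction
    // candidate. Note that no `Ord` bound on K is needed, unlike with
    // a BTreeSet<(Time, K)>.
    by_time: BTreeMap<Time, K>,
}

impl<K: Hash + Eq + Clone, V> Lru<K, V> {
    fn new(capacity: Weight) -> Self {
        Lru { capacity, weight: 0, next_time: 0, entries: HashMap::new(), by_time: BTreeMap::new() }
    }

    fn insert(&mut self, key: K, value: V, weight: Weight) {
        self.remove(&key);
        let time = self.next_time;
        self.next_time += 1;
        self.weight += weight;
        self.by_time.insert(time, key.clone());
        self.entries.insert(key, (value, weight, time));
        // Evict least-recently-used entries until the weight invariant holds.
        while self.weight > self.capacity {
            let (_, lru_key) = self.by_time.pop_first().expect("nonempty while over capacity");
            let (_, w, _) = self.entries.remove(&lru_key).expect("maps kept in sync");
            self.weight -= w;
        }
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        // Reassign the entry a fresh (maximal) time, making it most recent.
        let time = self.next_time;
        self.next_time += 1;
        let (_, _, t) = self.entries.get_mut(key)?;
        let old = std::mem::replace(t, time);
        let k = self.by_time.remove(&old).expect("maps kept in sync");
        self.by_time.insert(time, k);
        self.entries.get(key).map(|(v, _, _)| v)
    }

    fn remove(&mut self, key: &K) -> Option<V> {
        let (v, w, t) = self.entries.remove(key)?;
        self.weight -= w;
        self.by_time.remove(&t);
        Some(v)
    }
}
```

Grabbing the minimum is `by_time.pop_first()` and arbitrary removal is `by_time.remove(&time)`, both O(log n), so the hand-rolled doubly linked list is avoided entirely.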
// some read handles).
let mut cache = self.cache.lock().expect("lock poisoned");
cache.insert(key.to_owned(), blob.clone(), blob.len());
self.resize_and_update_size_metrics(&mut cache);
IIUC this will end up evicting from the underlying map twice: once for the insert and once on the resize. Seems harmless... but it may be cleaner to not resize during map updates and do all the evicting on resize, since it's called for every update anyways.
IMO it's important for insert to hold all the invariants at the time it returns. And then resize is a conceptually separate operation; I think it's an implementation detail of BlobMemCache that it happens to call it each time the map is updated.
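The distinction being argued for can be illustrated with a short, hypothetical sketch (not the PR's code; the real cache is a weight-based LRU, while this one uses FIFO order to stay small, and the `WeightedCache` name is invented): `insert` evicts down to capacity before it returns, so it upholds the invariant on its own, and `resize` merely re-establishes that same invariant under a new capacity.

```rust
use std::collections::VecDeque;

struct WeightedCache {
    capacity: usize,
    weight: usize,
    entries: VecDeque<(String, Vec<u8>)>,
}

impl WeightedCache {
    fn new(capacity: usize) -> Self {
        WeightedCache { capacity, weight: 0, entries: VecDeque::new() }
    }

    fn evict_to_capacity(&mut self) {
        while self.weight > self.capacity {
            let (_, blob) = self.entries.pop_front().expect("nonempty while over capacity");
            self.weight -= blob.len();
        }
    }

    // Holds the invariant `weight <= capacity` at the time it returns, so a
    // caller never observes an over-full cache.
    fn insert(&mut self, key: String, blob: Vec<u8>) {
        self.weight += blob.len();
        self.entries.push_back((key, blob));
        self.evict_to_capacity();
    }

    // Conceptually separate: re-establishes the same invariant under a new
    // capacity. Whether it is also called after every update is a caller's
    // implementation detail.
    fn resize(&mut self, capacity: usize) {
        self.capacity = capacity;
        self.evict_to_capacity();
    }
}
```

With this split, calling `resize` right after `insert` is at worst redundant work, never a correctness requirement.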
This reverts commit c799456.
RFAL! Rebased and force pushed to resolve the merge skew, but pushed all changes as an append-only commit.
Gonna kick off a round of nightlies while you take a second review pass.
(force-pushed fa64c99 → 05e0e29)
Looks good - thanks for the followup!
weight: usize,
next_time: Time,
entries: HashMap<K, (V, Weight, Time)>,
by_time: BTreeMap<Time, K>,
Yeah, BTreeMap definitely works out better here!
(force-pushed 05e0e29 → 5938def)
TFTR!
Originally introduced in #19614 but reverted in #19945 because we were
seeing segfaults in the lru crate this was using. I've replaced it with
a new simple implementation of an lru cache.
This is particularly interesting to revisit now because we might soon be
moving to a world in which each machine has attached disk and this is a
useful stepping stone to a disk-based cache that persists across process
restarts (and thus helps rehydration). The original motivation is as
follows.
A one-time (skunkworks) experiment showed that an environment
running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB
cache. This doesn't scale up to prod data sizes and doesn't help with
multi-process replicas, but the memory usage seems unobjectionable
enough to have it for the cases that it does help.
Possibly, a decent chunk of why this is true is pubsub. With the low
pubsub latencies, we might write some blob to s3, then within
milliseconds notify everyone in-process interested in that blob, waking
them up and fetching it. This means even a very small cache is useful
because things stay in it just long enough for them to get fetched by
everyone that immediately needs them. 1 MiB is enough to fit things like
state rollups, remap shard writes, and likely many MVs (probably less so
for sources, but atm those still happen in another cluster).
Motivation
Tips for reviewer
The first commit is just a revert of the revert, so feel free to skim it. The second commit contains the new lru implementation and the bits that changed in hooking it up.
Checklist
$T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.