persist: introduce a very small in-mem blob cache #19614

Merged
merged 1 commit into MaterializeInc:main from persist_blob_cache_mem on Jun 1, 2023

Conversation

@danhhz danhhz (Contributor) commented May 31, 2023

A one-time (skunkworks) experiment showed that an environment running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB cache. This doesn't scale up to prod data sizes and doesn't help with multi-process replicas, but the memory usage seems unobjectionable enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low pubsub latencies, we might write some blob to s3, then within milliseconds notify everyone in-process interested in that blob, waking them up and fetching it. This means even a very small cache is useful because things stay in it just long enough for them to get fetched by everyone that immediately needs them. 1 MiB is enough to fit things like state rollups, remap shard writes, and likely many MVs (probably less so for sources, but atm those still happen in another cluster).

Touches MaterializeInc/database-issues#5704

Motivation

  • This PR fixes a recognized bug.

Tips for reviewer

It's possible that we want to gate this behind a feature flag, or otherwise hook the param up to LaunchDarkly... but I just can't talk myself into it being worth it for a quick win. My concerns here would be if it's (1) incorrect or (2) somehow slower in some unexpected case. CI should do a very good job of figuring out any issues with (1) and given s3 latencies I have a very hard time imagining (2) being an issue.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz danhhz requested a review from a team as a code owner May 31, 2023 18:32
@danhhz danhhz force-pushed the persist_blob_cache_mem branch 2 times, most recently from b55ac18 to c38d8e7 on May 31, 2023 18:54
@danhhz danhhz requested a review from benesch as a code owner May 31, 2023 18:54
// it's also what's in s3 (if not, then there's a horrible bug somewhere
// else).
if let Some(cached_value) = self.cache.get(key) {
    self.metrics.blob_cache_mem.hits_blobs.inc();
Contributor

is the intent to calculate hit rate based on the delta between this vs existing blob metrics?

Contributor Author

yeah, tho we'll have to do hits / (hits + fetches) because I put this "around" the MetricsBlob wrapper so it wouldn't skew our latency histograms
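
As a worked example (illustrative numbers): 900 cache hits alongside 100 fetches that fall through to the backing blob store would give a hit rate of 900 / (900 + 100) = 90%.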

    blob: Arc<dyn Blob + Send + Sync>,
) -> Arc<dyn Blob + Send + Sync> {
    let cache = Cache::<String, SegmentedBytes>::builder()
        .max_capacity(u64::cast_from(cfg.blob_cache_mem_limit_bytes))
Contributor

thoughts on making this configurable? the limit could be artificially large, and we could dynamically set a multiplier in the weigher fn to play around with it (e.g. max capacity of "1GiB" but set the multiplier to 1GiB/1MiB so it's effectively a 1MiB cache by default)
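
A minimal sketch of that multiplier idea, assuming moka's sync cache (the constants, value type, and function name are illustrative, not from the PR): declare an artificially large max_capacity and scale every entry's weight in the weigher, so the effective cache size is max_capacity divided by the multiplier.

use moka::sync::Cache;

fn build_blob_cache() -> Cache<String, Vec<u8>> {
    // Nominal capacity of "1 GiB"; the weigher below scales it down.
    const NOMINAL_CAPACITY_BYTES: u64 = 1024 * 1024 * 1024;
    // With a multiplier of 1024, the cache behaves like a 1 MiB cache.
    const WEIGHT_MULTIPLIER: u32 = 1024;

    Cache::<String, Vec<u8>>::builder()
        .max_capacity(NOMINAL_CAPACITY_BYTES)
        // Weigh each entry as (size in bytes) * multiplier, so the effective
        // capacity is NOMINAL_CAPACITY_BYTES / WEIGHT_MULTIPLIER.
        .weigher(|_key: &String, value: &Vec<u8>| {
            u32::try_from(value.len())
                .unwrap_or(u32::MAX)
                .saturating_mul(WEIGHT_MULTIPLIER)
        })
        .build()
}

Note that moka weighs an entry when it is inserted, so a dynamically changed multiplier would only apply to entries inserted afterwards.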

Contributor Author

typed this in the review notes, but I'd rather not :D. I'd prefer we do something like #19532 before going too much further down the cache road. even by the time we start getting to O(part size), I'd think it'd probably be time to introduce a secondary disk cache layer and the tradeoffs involved just complicate things sooo much. was hoping to keep this PR framed as a really quick win

Contributor

I was more thinking to have the leeway to change it to like, 4MiB if that turns out to make a big difference, but I get that it adds complexity

Contributor

(I'm not sure there's an easy way to model the hit rate with moka's LFU/LRU policies beyond just measuring it empirically)

Contributor Author

I think my point is that without something like #19532, we won't know if it makes a big difference without just trying things in LaunchDarkly

Contributor

hm, I think I'm not sure why we need to simulate it to that level of fidelity before adding a knob. it seems like the most useful data would be running this on a staging / prod env and just trying a few different sizes and seeing how the numbers look. is the concern for when the sizes get large enough to affect the amount of memory left for the rest of the process?

Contributor Author

okay yeah, I think you're right here. at this point, I'm way over the hours I budgeted for spending on this, so any updates will have to wait until there's a pause in pushdown

Contributor Author

paul and I decided offline to do the compromise solution of a knob that only applies at startup. I dislike the footgun very much but it's a lot simpler to implement. we can always circle back later if we find ourselves needing to adjust this frequently

u32::MAX
})
})
.build();
Contributor

could be interesting to add a listener so we can track how many removals come from explicit delete calls vs size-based evictions
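
A minimal sketch of such a listener, assuming moka's sync cache (the value type and the println! placeholders standing in for real metric counters are illustrative):

use moka::{notification::RemovalCause, sync::Cache};

fn build_cache_with_listener() -> Cache<String, Vec<u8>> {
    Cache::<String, Vec<u8>>::builder()
        .max_capacity(1024 * 1024)
        .weigher(|_key: &String, value: &Vec<u8>| {
            u32::try_from(value.len()).unwrap_or(u32::MAX)
        })
        // Called whenever an entry leaves the cache, with the reason why.
        .eviction_listener(|_key, _value, cause| match cause {
            // Removed via Cache::invalidate, e.g. because the blob was deleted.
            RemovalCause::Explicit => println!("explicit removal"),
            // Pushed out to stay under max_capacity.
            RemovalCause::Size => println!("size-based eviction"),
            // Overwritten by an insert for the same key, or expired by TTL/TTI.
            RemovalCause::Replaced | RemovalCause::Expired => {}
        })
        .build()
}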

Contributor Author

great idea! will do

const_labels: {"cache" => "mem"},
)),
hits_blobs: registry.register(metric!(
name: "mz_persist_blob_cache_hits_blobs",
Contributor

nit: elsewhere it's referred to as BlobMemCache vs the metric name blob_cache

Contributor Author

yup! did that on purpose to set up a hypothetical disk cache. this has a label of "cache" -> "mem"

Contributor

oh yeah, I totally saw that and forgot it when reading it over a second time


// This could maybe use moka's async cache to unify any concurrent
// fetches for the same key? That's not particularly expected in
// persist's workload, so punt for now.
Contributor

I am a little curious about this... I could imagine if multiple persist_source are waiting on the same blob, they could all cache miss because their timings might be so closely aligned. hm... any metrics that would give us insight into how often that might happen? 🤔

Contributor Author

i can't think of an easy way to measure this without just literally solving the problem. would prefer to punt this to followup work as well
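
For reference, a minimal sketch of what that follow-up could look like, assuming moka's future cache (requires the crate's future feature; fetch_from_s3 is a hypothetical stand-in for the real Blob::get): get_with coalesces concurrent lookups of the same key, so simultaneous misses result in a single fetch that everyone awaits.

use std::sync::Arc;

use moka::future::Cache;

// Hypothetical stand-in for fetching a blob from S3.
async fn fetch_from_s3(_key: &str) -> Vec<u8> {
    Vec::new()
}

async fn get_blob(cache: &Cache<String, Arc<Vec<u8>>>, key: String) -> Arc<Vec<u8>> {
    cache
        // Only one caller per key runs the init future; concurrent callers for
        // the same key wait for that result instead of fetching again.
        .get_with(key.clone(), async move { Arc::new(fetch_from_s3(&key).await) })
        .await
}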

@danhhz danhhz force-pushed the persist_blob_cache_mem branch from c38d8e7 to 191954e on May 31, 2023 19:00
@pH14 pH14 (Contributor) left a comment

LGTM. I would like to make the size configurable, but I'm happy to add that in myself later -- right now I'm more curious to see how it affects our S3 numbers in staging/prod next week!

A one-time (skunkworks) experiment showed that an environment
running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB
cache. This doesn't scale up to prod data sizes and doesn't help with
multi-process replicas, but the memory usage seems unobjectionable
enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low
pubsub latencies, we might write some blob to s3, then within
milliseconds notify everyone in-process interested in that blob, waking
them up and fetching it. This means even a very small cache is useful
because things stay in it just long enough for them to get fetched by
everyone that immediately needs them. 1 MiB is enough to fit things like
state rollups, remap shard writes, and likely many MVs (probably less so
for sources, but atm those still happen in another cluster).

Touches #19225
@danhhz danhhz force-pushed the persist_blob_cache_mem branch from bb24bb4 to 877daf1 on June 1, 2023 21:04
@danhhz danhhz requested a review from a team as a code owner June 1, 2023 21:04
@danhhz danhhz (Contributor Author) commented Jun 1, 2023

TFTR!

@danhhz danhhz enabled auto-merge June 1, 2023 21:04
@danhhz danhhz merged commit 1bcd1c4 into MaterializeInc:main Jun 1, 2023
@danhhz danhhz deleted the persist_blob_cache_mem branch June 1, 2023 21:42
@def- def- (Contributor) commented Jun 14, 2023

My concerns here would be if it's (1) incorrect or (2) somehow slower in some unexpected case. CI should do a very good job of figuring out any issues with (1) and given s3 latencies I have a very hard time imagining (2) being an issue.

(3) it might crash ;) (not sure yet if it actually comes from here, but timing and segfault seem suspicious)

danhhz added a commit to danhhz/materialize that referenced this pull request Jan 5, 2024
Originally introduced in MaterializeInc#19614 but reverted in MaterializeInc#19945 because we were
seeing segfaults in the lru crate this was using. I've replaced it with
a new simple implementation of an lru cache.

This is particularly interesting to revisit now because we might soon be
moving to a world in which each machine has attached disk and this is a
useful stepping stone to a disk-based cache that persists across process
restarts (and thus helps rehydration). The original motivation is as
follows.

A one-time (skunkworks) experiment showed that an environment
running our demo "auction" source + mv got 90%+ cache hits with a 1 MiB
cache. This doesn't scale up to prod data sizes and doesn't help with
multi-process replicas, but the memory usage seems unobjectionable
enough to have it for the cases that it does help.

Possibly, a decent chunk of why this is true is pubsub. With the low
pubsub latencies, we might write some blob to s3, then within
milliseconds notify everyone in-process interested in that blob, waking
them up and fetching it. This means even a very small cache is useful
because things stay in it just long enough for them to get fetched by
everyone that immediately needs them. 1 MiB is enough to fit things like
state rollups, remap shard writes, and likely many MVs (probably less so
for sources, but atm those still happen in another cluster).
danhhz added a commit to danhhz/materialize that referenced this pull request Jan 5, 2024
danhhz added a commit to danhhz/materialize that referenced this pull request Jan 9, 2024
danhhz added a commit to danhhz/materialize that referenced this pull request Jan 10, 2024