cache: investigate mutex contention with TPCC #1997
In a recent customer investigation (https://github.com/cockroachlabs/support/issues/2066) I was looking at what the runnable goroutines on the system were doing, to try to determine why requests were taking longer than expected. There was a large number (445) of runnable goroutines on the system. Looking more into what they were doing, about half of them were waiting in the pebble layer. Note that this does not include goroutines that are in a blocked (waiting) state; instead, these were busy attempting to acquire a lock, not waiting for an OS signal.

goroutine_dump.2023-02-12T06_50_56.716.double_since_last_dump.000002714.txt

Looking at what all the runnable threads are doing:
Looking at just where the mutex waits among the runnable goroutines came from:
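For reproducing this kind of breakdown, one option (a minimal sketch, not part of the original investigation) is to enable Go's built-in mutex profiler alongside the goroutine dumps; `runtime.SetMutexProfileFraction` and the `net/http/pprof` endpoints below are standard Go facilities:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers, including /debug/pprof/mutex
	"runtime"
)

func main() {
	// Sample 1 out of every N mutex contention events. A small N keeps
	// overhead low while still surfacing heavily contended locks.
	runtime.SetMutexProfileFraction(5)

	// The mutex profile can then be pulled with, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

The resulting profile attributes contention wall time to lock sites, which complements the stack-based view from the goroutine dump.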
During compactions, avoid populating the block cache with input files' blocks. These files will soon be removed from the LSM, making it less likely any iterator will need to read these blocks. While Pebble uses a scan-resistant block cache algorithm (ClockPRO), the act of inserting the blocks into the cache increases contention on the block cache mutexes (cockroachdb#1997). This contention has been observed to significantly contribute to tail latencies, both for reads and for writes during memtable reservation. Additionally, although these blocks may soon be replaced with more useful blocks thanks to ClockPRO's scan resistance, they may be freed by a different thread, inducing excessive TLB shootdowns (cockroachdb#2693).

A compaction only requires a relatively small working set of buffers during its scan across input sstables. In this commit, we introduce a per-compaction BufferPool that is used to allocate buffers during cache misses. Buffers are reused throughout the compaction and only freed to the memory allocator when they're too small or the compaction is finished. This reduces pressure on the memory allocator and the block cache.
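As a rough illustration of the buffer-pool pattern described in that commit message (a hedged sketch with made-up names, not Pebble's actual BufferPool API): buffers are handed out on cache misses, returned for reuse, and only released to the allocator when the compaction finishes.

```go
// bufferPool is an illustrative per-compaction pool of byte slices. It is
// not safe for concurrent use; a compaction uses it from a single goroutine.
type bufferPool struct {
	free [][]byte // buffers available for reuse
}

// Get returns a buffer of at least n bytes, reusing a pooled buffer when one
// is large enough and falling back to the allocator otherwise.
func (p *bufferPool) Get(n int) []byte {
	for i, b := range p.free {
		if cap(b) >= n {
			p.free = append(p.free[:i], p.free[i+1:]...)
			return b[:n]
		}
	}
	return make([]byte, n)
}

// Put returns a buffer to the pool for reuse by later block reads.
func (p *bufferPool) Put(b []byte) {
	p.free = append(p.free, b)
}

// Release drops all pooled buffers, returning the memory to the allocator
// once the compaction completes.
func (p *bufferPool) Release() {
	p.free = nil
}
```

Because a compaction reads its input blocks from a single goroutine with a small working set, an unsynchronized pool like this sidesteps both allocator pressure and block cache insertions.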
When opening an sstable for the first time, we read two 'meta' blocks: one describes the layout of the sstable, encoding block handles for the index block, properties block, etc.; the other contains table properties. These blocks are decoded, the necessary state is copied onto the heap, and then the blocks are released. In typical configurations with sufficiently large table caches, these blocks are never read again, or only much later after the table has been evicted from the table cache. This commit uses a BufferPool to hold these temporary blocks and refrains from adding them to the block cache. This reduces contention on the block cache mutexes (cockroachdb#1997).
We've observed block cache mutex contention originating from EvictFile. When evicting a file with a significant count of blocks currently held within the block cache, EvictFile may hold the block cache shard mutex for an extended duration, increasing latency for requests that must access the same shard. This commit mitigates this contention with two approaches. EvictFile now periodically drops the mutex to provide these other requests with an opportunity to make progress. Additionally, rather than free()-ing manually managed memory back to the memory allocator while holding the mutex, the free is deferred until the shard mutex has been dropped. Informs cockroachdb#1997.
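The following sketch illustrates the two mitigations described above, with hypothetical types and field names (the real cache's data structures differ): the eviction loop bounds how long the shard mutex is held per batch, and manually managed memory is freed only after the mutex is released.

```go
package cachesketch

import "sync"

// entry represents a cached block belonging to some file. Illustrative only;
// the real cache indexes blocks by file rather than scanning a map.
type entry struct {
	fileNum uint64
	buf     []byte // manually managed memory in the real cache
}

// shard is one shard of a block cache, guarded by its own mutex.
type shard struct {
	mu      sync.Mutex
	entries map[*entry]struct{}
}

// manualFree stands in for returning manually managed memory to the allocator.
func manualFree(b []byte) {}

// evictFile removes all of a file's blocks from the shard, periodically
// dropping the mutex and deferring frees until after the mutex is released.
func (s *shard) evictFile(fileNum uint64) {
	const batch = 16 // max entries removed per critical section
	var pendingFree [][]byte

	for done := false; !done; {
		s.mu.Lock()
		n := 0
		for e := range s.entries {
			if e.fileNum != fileNum {
				continue
			}
			delete(s.entries, e)
			pendingFree = append(pendingFree, e.buf)
			if n++; n == batch {
				break
			}
		}
		// Fewer than a full batch removed means nothing of this file remains.
		done = n < batch
		// Unlocking between batches lets other requests that touch this shard
		// make progress during a large eviction.
		s.mu.Unlock()
	}

	// Free manually managed memory only after the shard mutex is dropped.
	for _, b := range pendingFree {
		manualFree(b)
	}
}
```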
We've observed large allocations like the 64MB memtable allocation take 10ms+. This can add latency to the WAL/memtable rotation critical section during which the entire commit pipeline is stalled, contributing to batch commit tail latencies. This commit adapts the memtable lifecycle to keep the most recent obsolete memtable around for use as the next mutable memtable. This reduces the commit latency hiccup during a memtable rotation, and it also reduces block cache mutex contention (#1997) by reducing the number of times we must reserve memory from the block cache.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt   │              new.txt               │
                   │   sec/op    │   sec/op     vs base               │
RotateMemtables-24   120.7µ ± 2%   102.8µ ± 4%  -14.85% (p=0.000 n=25)

                   │   old.txt    │              new.txt               │
                   │     B/op     │     B/op     vs base               │
RotateMemtables-24   124.3Ki ± 0%   124.0Ki ± 0%  -0.27% (p=0.000 n=25)

                   │  old.txt   │             new.txt                │
                   │ allocs/op  │  allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   111.0 ± 0%  -2.63% (p=0.000 n=25)
```

Informs #2646.
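A simplified sketch of the memtable-recycling idea (the types and method names here are hypothetical, and the real lifecycle also has to wait for readers of the retired memtable to finish): the backing allocation of the most recently retired memtable is kept and reused for the next mutable memtable, instead of being freed and re-reserved from the block cache.

```go
package memtablesketch

// db holds, among other things, a single recycled memtable buffer.
// Hypothetical sketch of the recycling idea, not Pebble's actual types.
type db struct {
	memTableSize int
	recycled     []byte // backing allocation of the most recently obsolete memtable
}

// newMemTableBuffer returns a buffer for the next mutable memtable, reusing
// the most recently retired memtable's allocation when it is the right size.
// Reuse avoids a large allocation (observed to take 10ms+ for 64MB) and a
// block cache memory reservation inside the rotation critical section.
func (d *db) newMemTableBuffer() []byte {
	if d.recycled != nil && cap(d.recycled) == d.memTableSize {
		buf := d.recycled[:0]
		d.recycled = nil
		return buf
	}
	return make([]byte, 0, d.memTableSize)
}

// retireMemTable is called once a flushed memtable is no longer needed; its
// buffer is kept for the next rotation rather than freed immediately.
func (d *db) retireMemTable(buf []byte) {
	d.recycled = buf
}
```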
We've merged several different patches to mitigate block cache mutex contention. I'll close this out, and we can reopen if we uncover it again.
@ajwerner reported high mutex contention on the block cache when running TPCC on 16-core machines (internal link: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1664480089719489), specifically with nodes under high load (running tpccbench), and also pointed to a discussion regarding poor RWMutex scaling at high core counts (golang/go#17973). The diff between the two profiles in that internal link (taken 60s apart) is shown below and shows 122s of total contention, 34s of which is in the sstable cache.
There doesn't seem to be an easy fix here, but we should reproduce this experiment and measure (a) mutex contention, (b) mutex-related overhead in a CPU profile, and (c) the number of concurrent MVCCGet calls. If (b) is "small" and (c) is "large", the contention wall time may be acceptable, since it would not impact the peak throughput we can achieve on that node.
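For (c), one simple way to measure concurrency (a hedged sketch; the wrapper below is hypothetical and not an existing hook around MVCCGet) is an atomic gauge incremented around each call, recording the maximum observed value:

```go
package concsketch

import "sync/atomic"

// concGauge tracks the current and maximum number of in-flight calls.
type concGauge struct {
	cur atomic.Int64
	max atomic.Int64
}

// enter marks the start of a call and updates the observed maximum.
func (g *concGauge) enter() {
	n := g.cur.Add(1)
	for {
		m := g.max.Load()
		if n <= m || g.max.CompareAndSwap(m, n) {
			return
		}
	}
}

// exit marks the end of a call.
func (g *concGauge) exit() { g.cur.Add(-1) }

// instrumented wraps the operation of interest (e.g. an MVCC read) to measure
// how many instances run concurrently on a node under load.
func instrumented(g *concGauge, op func()) {
	g.enter()
	defer g.exit()
	op()
}
```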