cache: investigate mutex contention with TPCC #1997
In a recent customer investigation (https://github.com/cockroachlabs/support/issues/2066) I was looking at what the runnable goroutines on the system were doing, to try to determine why requests were taking longer than expected. There was a large number (445) of runnable goroutines on the system. Looking more into what they were doing, about half of them were waiting in the pebble layer. Note that this does not include goroutines that are in a blocked (waiting) state; instead, these were busy attempting to acquire a lock, not waiting for an OS signal.

goroutine_dump.2023-02-12T06_50_56.716.double_since_last_dump.000002714.txt

Looking at what all the runnable threads are doing:
Looking at just where the mutex waits among the runnable goroutines came from:
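For reproducing this kind of breakdown, one option (a minimal sketch, not part of the original investigation) is to enable Go's built-in mutex profiler alongside the goroutine dumps; `runtime.SetMutexProfileFraction` and the `net/http/pprof` endpoints below are standard Go facilities:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers, including /debug/pprof/mutex
	"runtime"
)

func main() {
	// Sample 1 out of every N mutex contention events. A small N keeps
	// overhead low while still surfacing heavily contended locks.
	runtime.SetMutexProfileFraction(5)

	// The mutex profile can then be pulled with, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

The resulting profile attributes contention wall time to lock sites, which complements the stack-based view from the goroutine dump.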
During compactions, avoid populating the block cache with input files' blocks. These files will soon be removed from the LSM, making it less likely any iterator will need to read these blocks. While Pebble uses a scan-resistant block cache algorithm (ClockPRO), the act of inserting the blocks into the cache increases contention on the block cache mutexes (cockroachdb#1997). This contention has been observed to significantly contribute to tail latencies, both for reads and for writes during memtable reservation. Additionally, although these blocks may soon be replaced with more useful blocks thanks to ClockPRO's scan resistance, they may be freed by a different thread, inducing excessive TLB shootdowns (cockroachdb#2693).

A compaction only requires a relatively small working set of buffers during its scan across input sstables. In this commit, we introduce a per-compaction BufferPool that is used to allocate buffers during cache misses. Buffers are reused throughout the compaction and only freed to the memory allocator when they're too small or the compaction is finished. This reduces pressure on the memory allocator and the block cache.
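As a rough illustration of the buffer-pool pattern described in that commit message (a hedged sketch with made-up names, not Pebble's actual BufferPool API): buffers are handed out on cache misses, returned for reuse, and only released to the allocator when the compaction finishes.

```go
// bufferPool is an illustrative per-compaction pool of byte slices. It is
// not safe for concurrent use; a compaction uses it from a single goroutine.
type bufferPool struct {
	free [][]byte // buffers available for reuse
}

// Get returns a buffer of at least n bytes, reusing a pooled buffer when one
// is large enough and falling back to the allocator otherwise.
func (p *bufferPool) Get(n int) []byte {
	for i, b := range p.free {
		if cap(b) >= n {
			p.free = append(p.free[:i], p.free[i+1:]...)
			return b[:n]
		}
	}
	return make([]byte, n)
}

// Put returns a buffer to the pool for reuse by later block reads.
func (p *bufferPool) Put(b []byte) {
	p.free = append(p.free, b)
}

// Release drops all pooled buffers, returning the memory to the allocator
// once the compaction completes.
func (p *bufferPool) Release() {
	p.free = nil
}
```

Because a compaction reads its input blocks from a single goroutine with a small working set, an unsynchronized pool like this sidesteps both allocator pressure and block cache insertions.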
When opening an sstable for the first time, we read two 'meta' blocks: one describes the layout of the sstable, encoding block handles for the index block, properties block, etc.; the other contains table properties. These blocks are decoded, the necessary state is copied onto the heap, and then the blocks are released. In typical configurations with sufficiently large table caches, these blocks are never read again, or only much later after the table has been evicted from the table cache. This commit uses a BufferPool to hold these temporary blocks and refrains from adding them to the block cache. This reduces contention on the block cache mutexes (cockroachdb#1997).
We've observed block cache mutex contention originating from EvictFile. When evicting a file with a significant count of blocks currently held within the block cache, EvictFile may hold the block cache shard mutex for an extended duration, increasing latency for requests that must access the same shard. This commit mitigates this contention with two approaches. EvictFile now periodically drops the mutex to provide these other requests with an opportunity to make progress. Additionally, rather than free()-ing manually managed memory back to the memory allocator while holding the mutex, the free is deferred until the shard mutex has been dropped. Informs cockroachdb#1997.
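The following sketch illustrates the two mitigations described above, with hypothetical types and field names (the real cache's data structures differ): the eviction loop bounds how long the shard mutex is held per batch, and manually managed memory is freed only after the mutex is released.

```go
package cachesketch

import "sync"

// entry represents a cached block belonging to some file. Illustrative only;
// the real cache indexes blocks by file rather than scanning a map.
type entry struct {
	fileNum uint64
	buf     []byte // manually managed memory in the real cache
}

// shard is one shard of a block cache, guarded by its own mutex.
type shard struct {
	mu      sync.Mutex
	entries map[*entry]struct{}
}

// manualFree stands in for returning manually managed memory to the allocator.
func manualFree(b []byte) {}

// evictFile removes all of a file's blocks from the shard, periodically
// dropping the mutex and deferring frees until after the mutex is released.
func (s *shard) evictFile(fileNum uint64) {
	const batch = 16 // max entries removed per critical section
	var pendingFree [][]byte

	for done := false; !done; {
		s.mu.Lock()
		n := 0
		for e := range s.entries {
			if e.fileNum != fileNum {
				continue
			}
			delete(s.entries, e)
			pendingFree = append(pendingFree, e.buf)
			if n++; n == batch {
				break
			}
		}
		// Fewer than a full batch removed means nothing of this file remains.
		done = n < batch
		// Unlocking between batches lets other requests that touch this shard
		// make progress during a large eviction.
		s.mu.Unlock()
	}

	// Free manually managed memory only after the shard mutex is dropped.
	for _, b := range pendingFree {
		manualFree(b)
	}
}
```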
We've observed large allocations like the 64MB memtable allocation take 10ms+. This can add latency to the WAL/memtable rotation critical section during which the entire commit pipeline is stalled, contributing to batch commit tail latencies. This commit adapts the memtable lifecycle to keep the most recent obsolete memtable around for use as the next mutable memtable. This reduces the commit latency hiccup during a memtable rotation, and it also reduces block cache mutex contention (#1997) by reducing the number of times we must reserve memory from the block cache.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt   │              new.txt               │
                   │   sec/op    │   sec/op     vs base               │
RotateMemtables-24   120.7µ ± 2%   102.8µ ± 4%  -14.85% (p=0.000 n=25)

                   │   old.txt    │              new.txt               │
                   │     B/op     │     B/op     vs base               │
RotateMemtables-24   124.3Ki ± 0%   124.0Ki ± 0%  -0.27% (p=0.000 n=25)

                   │  old.txt   │             new.txt                │
                   │ allocs/op  │  allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   111.0 ± 0%  -2.63% (p=0.000 n=25)
```

Informs #2646.
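A simplified sketch of the memtable-recycling idea (the types and method names here are hypothetical, and the real lifecycle also has to wait for readers of the retired memtable to finish): the backing allocation of the most recently retired memtable is kept and reused for the next mutable memtable, instead of being freed and re-reserved from the block cache.

```go
package memtablesketch

// db holds, among other things, a single recycled memtable buffer.
// Hypothetical sketch of the recycling idea, not Pebble's actual types.
type db struct {
	memTableSize int
	recycled     []byte // backing allocation of the most recently obsolete memtable
}

// newMemTableBuffer returns a buffer for the next mutable memtable, reusing
// the most recently retired memtable's allocation when it is the right size.
// Reuse avoids a large allocation (observed to take 10ms+ for 64MB) and a
// block cache memory reservation inside the rotation critical section.
func (d *db) newMemTableBuffer() []byte {
	if d.recycled != nil && cap(d.recycled) == d.memTableSize {
		buf := d.recycled[:0]
		d.recycled = nil
		return buf
	}
	return make([]byte, 0, d.memTableSize)
}

// retireMemTable is called once a flushed memtable is no longer needed; its
// buffer is kept for the next rotation rather than freed immediately.
func (d *db) retireMemTable(buf []byte) {
	d.recycled = buf
}
```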
We've merged several different patches to mitigate block cache mutex contention. I'll close this out, and we can reopen if we uncover it again.
@ajwerner reported high mutex contention on the block cache when running TPCC on 16-core machines (internal link: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1664480089719489), specifically with nodes under high load (running tpccbench), and also pointed to a discussion regarding poor RWMutex scaling at high core counts (golang/go#17973). The diff between the two profiles in that internal link (taken 60s apart) is shown below and shows 122s of total contention, 34s of which is in the sstable cache.
There doesn't seem to be an easy fix here, but we should reproduce this experiment and measure (a) mutex contention, (b) mutex-related overhead in a CPU profile, and (c) the number of concurrent MVCCGet calls. If (b) is "small" and (c) is "large", the contention wall time may be acceptable, since it would not impact the peak throughput we can achieve on that node.
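For (c), one simple way to measure concurrency (a hedged sketch; the wrapper below is hypothetical and not an existing hook around MVCCGet) is an atomic gauge incremented around each call, recording the maximum observed value:

```go
package concsketch

import "sync/atomic"

// concGauge tracks the current and maximum number of in-flight calls.
type concGauge struct {
	cur atomic.Int64
	max atomic.Int64
}

// enter marks the start of a call and updates the observed maximum.
func (g *concGauge) enter() {
	n := g.cur.Add(1)
	for {
		m := g.max.Load()
		if n <= m || g.max.CompareAndSwap(m, n) {
			return
		}
	}
}

// exit marks the end of a call.
func (g *concGauge) exit() { g.cur.Add(-1) }

// instrumented wraps the operation of interest (e.g. an MVCC read) to measure
// how many instances run concurrently on a node under load.
func instrumented(g *concGauge, op func()) {
	g.enter()
	defer g.exit()
	op()
}
```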