kvserver: investigate raft entry cache sizing #98666

Closed · tbg opened this issue Mar 15, 2023 · 13 comments · Fixed by #107424
Labels: C-investigation Further steps needed to qualify. C-label will change.

@tbg (Member)

tbg commented Mar 15, 2023

The raft entry cache is sized at 16 MB per store. This seems very small:

// defaultRaftEntryCacheSize is the default size in bytes for a
// store's Raft log entry cache.
defaultRaftEntryCacheSize = 1 << 24 // 16M

The raft.entrycache.{accesses,hits} metrics can be used to arrive at a cache hit rate. This is something we should be paying attention to when optimizing write performance.
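
As a minimal sketch (not CockroachDB code; the counter values are hypothetical), the hit rate falls out of those two counters like so:

```go
// Sketch: derive the entry cache hit rate from the
// raft.entrycache.accesses and raft.entrycache.hits counters.
package main

import "fmt"

// hitRate returns hits/accesses, treating "no accesses yet" as a perfect hit rate.
func hitRate(accesses, hits int64) float64 {
	if accesses == 0 {
		return 1.0
	}
	return float64(hits) / float64(accesses)
}

func main() {
	// Hypothetical counter values scraped from a node's time series.
	const accesses, hits = 1_000_000, 985_000
	fmt.Printf("entry cache hit rate: %.1f%%\n", 100*hitRate(accesses, hits)) // 98.5%
}
```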

x-ref #98576 where this was noticed.

Jira issue: CRDB-25431

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication labels Mar 15, 2023
@blathers-crl

blathers-crl bot commented Mar 15, 2023

cc @cockroachdb/replication

@tbg tbg added C-investigation Further steps needed to qualify. C-label will change. and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Mar 15, 2023
@tbg (Member, Author)

tbg commented Mar 15, 2023

Supplementary screenshot from an 8 TB restore:

[Screenshot]

We're reading from raft storage quite a bit. We're probably usually lucky and Pebble serves most of it from the block cache, since I don't see anywhere near that read bandwidth hitting the disks.

@erikgrinaker (Contributor)

I'd also expect this to still be in OS disk caches.

tbg added a commit to tbg/cockroach that referenced this issue Mar 28, 2023
This tracks the cases where we fall back to reading log entries from storage
(i.e. pebble) in `(raft.Storage.Entries)`

Ideally this is zero, as everything ought to be served from the raft entry cache. We know that this cache is not configured well[^1] and so we can't really expect this to work, but you can't improve what you don't measure.

Additionally, this metric has been useful in investigations related to raft overload[^2].

[^1]: cockroachdb#98666
[^2]: cockroachdb#98576

Epic: none
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 26, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 29, 2023
Informs cockroachdb#98666.

This commit introduces a new `COCKROACH_RAFT_ENTRY_CACHE_SIZE` environment variable, which can
be used to configure the size of the raft entry cache. The default value
is 16 MiB.

Release note: None
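
For illustration only, here is a self-contained Go sketch of the general pattern (this is not the actual CockroachDB code, which uses the project's env helpers and accepts humanized values such as `128MB`, as seen later in this thread); the sketch assumes the variable holds a raw byte count:

```go
// Sketch: override a byte-size default via COCKROACH_RAFT_ENTRY_CACHE_SIZE.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultRaftEntryCacheSize mirrors the 16 MiB default discussed in this issue.
const defaultRaftEntryCacheSize = 1 << 24 // 16 MiB

// raftEntryCacheSize returns the env override if it is set and valid,
// falling back to the default otherwise. For simplicity it only parses a
// raw byte count (e.g. "134217728"), not humanized sizes.
func raftEntryCacheSize() int64 {
	if v := os.Getenv("COCKROACH_RAFT_ENTRY_CACHE_SIZE"); v != "" {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && n > 0 {
			return n
		}
	}
	return defaultRaftEntryCacheSize
}

func main() {
	fmt.Println("raft entry cache size (bytes):", raftEntryCacheSize())
}
```
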
@nvanbenschoten (Member)

nvanbenschoten commented Jun 29, 2023

In a write-heavy workload running kv0 with 2 KB writes (repro steps), we see a gradual drop-off in throughput. This is due in part to an increase in raft scheduler latency, which is a consequence of a reduction in the raft entry cache hit rate as the workload runs. Interestingly, we are able to see this clearly because some of the nodes have higher entry cache hit rates than others, and these perform much better.

This demonstrates that for write-heavy workloads, a 16 MB raft entry cache is insufficient. We should explore dynamically sizing the cache based on available memory capacity in the system.

In the meantime, #105799 makes it configurable through an environment variable.


[Screenshot: qps and p99 workload latency]

[Screenshot: p99 raft scheduler latency; notice nodes 2 and 3 compared to the other nodes]

[Screenshot: raft entry cache accesses and hits]

Same data in tabular form:

| Node | Accesses | Hits   | Hit Rate |
|------|----------|--------|----------|
| n1   | 314468   | 308334 | 98.1%    |
| n2   | 276748   | 260645 | 94.2%    |
| n3   | 271915   | 255306 | 93.9%    |
| n4   | 325052   | 320766 | 98.7%    |
| n5   | 326403   | 321934 | 98.6%    |

[Screenshots: CPU profiles from n3 and n4; notice n3's time spent reading the raft log from Pebble]

@erikgrinaker (Contributor)

Interesting, thanks for these results.

There are other factors at play here too -- we've seen previously (#78412) that performance tanks when recent Raft log entries are flushed from the memtable into L0 before application and truncation, and the Pebble block cache and OS caches also come into play.

I suspect the memtable flushing primarily affects performance by increasing compaction disk IO, which competes with foreground writes. Still, it would be interesting to see whether it would be sufficient, or even better, to keep the log entries in the memtable for as long as possible instead of relying on a separate entry cache, since that would solve two problems at once.

@erikgrinaker (Contributor)

That said, let's see if we can do something tactical for 23.2 -- 16 MB is minuscule.

@jbowens (Collaborator)

jbowens commented Jun 29, 2023

> the log entries stick around in the memtable for as long as possible, instead of relying on a separate entry cache, since that would solve two problems at once.

I'm unsure how much keeping the entries in the memtable will help with these reads, because any read other than a point `DB.Get` lookup still needs to read and merge across all levels of the LSM. Even if the entries themselves are dropped during flushes, the tombstones dilute the keyspace, making it less likely for reads of adjacent raft logs to hit the same sstable blocks.

@erikgrinaker (Contributor)

Good point, I guess that answers that.

craig bot pushed a commit that referenced this issue Jun 29, 2023
104350: sql/schemachanger: support for create sequence inside the declarative schema changer r=fqazi a=fqazi

This patch implements CREATE SEQUENCE in the declarative schema changer. The first three commits can be ignored; they are included in #104348, which should be reviewed and merged first. This patch will:

- Skip validation/backfill for descriptors in an added state
- Fix prefix resolution when creating new objects, where two-part names were not handled properly before
- Add support for CREATE SEQUENCE in the declarative schema changer

Fixes: #104351

105454: roachtest: update `mixedversion` to always use secure clusters r=srosenberg a=renatolabs

Secure clusters are closer to production deployments and also allow
us to test features that we couldn't before, like creating new users
with passwords during a test, and then performing SQL operations with
those users.

In the process of getting this to work, there were a few bugs that needed
to be fixed (first commit), and the `cluster` interface needed a small update
as well (second commit).

Epic: CRDB-19321

Release note: None

105765: ui: fix db page Node/Regions column rendering r=xinhaoz a=xinhaoz

Previously, the db page was not updating its columns if the
`showNodeRegionsColumn` prop changed. The db details page was
also not filtering out the regions column when desired.

Epic: none

Release note (bug fix): node/regions columns in db and db details
page should properly render. This column is hidden for tenants and
otherwise is shown for clusters with > 1 node.

105770: persistedsqlstats: skip TestSQLStatsPersistedLimitReached under stress r=zachlite a=zachlite

TestSQLStatsPersistedLimitReached succeeds under 'normal' testing conditions, and we should keep this test coverage where conditions allow.

Informs #97488
Epic: none
Release note: None

105799: kv: expose env var to configure raft entry cache size r=erikgrinaker a=nvanbenschoten

Informs #98666.

This commit introduces a new `COCKROACH_RAFT_ENTRY_CACHE_SIZE` environment variable, which can be used to configure the size of the raft entry cache. The default value is 16 MiB.

Release note: None

Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: zachlite <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jun 30, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 5, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 5, 2023
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jul 11, 2023
@erikgrinaker erikgrinaker self-assigned this Jul 23, 2023
@erikgrinaker (Contributor)

I sampled CentMon hit rates:

quantile(x, raft_entrycache_hits / raft_entrycache_accesses)
  • 10th percentile: 100%
  • 5th percentile: 99.8%
  • 1st percentile: 99.4%
  • Minimum: 90.3%
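
As an aside, here is a small illustrative Go sketch of how such fleet-wide hit-rate percentiles could be computed from per-node counters; the input values and the simple rounded-index percentile method are assumptions, not how CentMon computes them:

```go
// Sketch: percentiles of per-node raft entry cache hit rates.
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (0-100) of ascending-sorted values,
// using a simple rounded-index approximation; the slice must be non-empty.
func percentile(sorted []float64, p float64) float64 {
	idx := int(p/100*float64(len(sorted)-1) + 0.5)
	return sorted[idx]
}

func main() {
	// Hypothetical per-node hit rates (hits / accesses).
	rates := []float64{0.903, 0.994, 0.998, 0.999, 1.0, 1.0, 1.0}
	sort.Float64s(rates)
	fmt.Printf("min: %.1f%%  p1: %.1f%%  p10: %.1f%%\n",
		100*rates[0], 100*percentile(rates, 1), 100*percentile(rates, 10))
}
```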

@erikgrinaker (Contributor)

Some quick benchmarks, recording hit rates for all 3 nodes:

  • kv0/enc=false/nodes=3/size=4kb: 99.3% 99.4% 99.4%
  • kv0/enc=false/nodes=3/cpu=32/size=4kb: 99.4% 99.4% 99.4%
  • kv0/enc=false/nodes=3/size=64kb: 86.1% 87.3% 87.3%
  • kv0/enc=false/nodes=3/cpu=32/size=64kb: 87.1% 87.7% 87.4%

Will run some experiments with the small-node variants.

@erikgrinaker (Contributor)

Average hit rate on single 10-minute runs of kv0/enc=false/nodes=3/size=64kb with varying COCKROACH_RAFT_ENTRY_CACHE_SIZE:

  • 16 MB: 86.9%
  • 64 MB: 96.3%
  • 128 MB: 98.0%
  • 256 MB: 98.5%
  • 512 MB: 98.5%
  • 1 GB: 98.5%

In the >=256 MB cases, the steady-state hit rate was 100% -- the misses all happened at the start of the workload. The 128 MB case was almost able to keep up, with a 99.6% hit rate if we ignore the first minute.

I don't think we should optimize for the 64 KB case, since it seems like a bit of an outlier. I'll do some 30-minute runs at 128 MB cache size with 1 KB, 4 KB, 16 KB, and 64 KB writes. If 128 MB seems reasonable then I think we can simply change the default without any scaling heuristics, at least as a trivial first step.

Here's the 256 MB chart:

[Screenshot]

@erikgrinaker (Contributor)

3 runs of kv0/enc=false/nodes=3/size= on GCE for 30 minutes with COCKROACH_RAFT_ENTRY_CACHE_SIZE=128MB, at various write sizes. Cache metrics were discarded for the first 2 minutes; the table shows the median of 3 runs.

| Size  | Hit Rate | Throughput | Nightly  |
|-------|----------|------------|----------|
| 1 KB  | 100%     | 12442 w/s  |          |
| 4 KB  | 100%     | 6522 w/s   | 6498 w/s |
| 16 KB | 100%     | 1692 w/s   |          |
| 64 KB | 99.88%   | 417 w/s    | 418 w/s  |

While the cache hit rate is significantly improved compared to #98666 (comment), it unfortunately doesn't seem to have a significant impact on throughput for this specific workload.

The results in #98666 (comment) were on AWS though, which I believe tends to have lower IOPS capacity. I'll rerun that experiment with a 128 MB entry cache.

@erikgrinaker (Contributor)

I re-ran the experiments from #98666 (comment), using 16 MB and 128 MB entry cache sizes respectively. I saw similar cluster behavior as the original results, including the throughput degradation and node bimodality.

The 128 MB entry cache basically eliminated cache misses. Furthermore, these are 256 GB RAM nodes, so if we had used the linear 1/256 memory scaling from #107424 the entry cache would have been 1 GB, which almost certainly would have ensured a 100% hit rate.

Per-node hit rates:

  • 16 MB: 98.84% 98.84% 98.42% 95.53% 94.92%
  • 128 MB: 99.98% 99.98% 99.97% 99.98% 98.85%

Unfortunately, as with other results, this did not translate to a significant throughput improvement. In fact, the 16 MB cluster had higher throughput (40127 w/s vs. 38865 w/s) which is most likely down to VM variation.

Furthermore, the 128 MB cluster also showed Raft scheduler bimodality that did not correlate strongly with entry cache misses:

[Screenshot]

The CPU profile from n3 (one with high scheduler latencies) confirms that we're not reading entries from disk:

[Screenshots]

n1 with low Raft scheduler latency appears very similar:

[Screenshot]

For the purposes of the entry cache, I think we can say that the linear 1/256 scaling proposed in #107424 appears sufficient to avoid cache misses, but the impact on throughput is unclear.

craig bot pushed a commit that referenced this issue Jul 26, 2023
107265: liveness: allow registering callbacks after start r=erikgrinaker a=tbg

I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix #106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

[^1]: #106706 (comment)

Epic: None
Release note: None


107417: kvserver: ignore RPC conn when deciding to campaign/vote r=erikgrinaker a=erikgrinaker

**kvserver: remove stale mayCampaignOnWake comment**

The comment is about a parameter that no longer exists.

**kvserver: revamp shouldCampaign/Forget tests**

**kvserver: ignore RPC conn in `shouldCampaignOnWake`**

Previously, `shouldCampaignOnWake()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent an unquiescing replica from acquiring Raft leadership if the leader is still alive but unable to heartbeat liveness, and the leader will be unable to acquire epoch leases in this case.

This patch ignores the RPC connection state when deciding whether to campaign, relying only on the liveness state.

**kvserver: ignore RPC conn in `shouldForgetLeaderOnVoteRequest`**

Previously, `shouldForgetLeaderOnVoteRequest()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent granting votes to a new leader that may be attempting to acquire an epoch lease (which the current leader can't).

This patch ignores the RPC connection state when deciding whether to forget the leader, relying only on the liveness state.

Resolves #107060.
Epic: none
Release note: None

**kvserver: remove `StoreTestingKnobs.DisableLivenessMapConnHealth`**

107424: kvserver: scale Raft entry cache size with system memory r=erikgrinaker a=erikgrinaker

The Raft entry cache size defaulted to 16 MB, which is rather small. This has been seen to cause tail latency and throughput degradation with high write volume on large nodes, correlating with a reduction in the entry cache hit rate.

This patch linearly scales the Raft entry cache size as 1/256 of total system/cgroup memory, shared evenly between all stores, with a minimum of 32 MB. For example, a 32 GB 8-vCPU node will have a 128 MB entry cache.

This is a conservative default, since this memory is not accounted for in existing memory budgets nor by the `--cache` flag. We rarely see cache misses in production clusters anyway, and have seen significantly improved hit rates with this scaling (e.g. a 64 KB kv0 workload on 8-vCPU nodes increased from 87% to 99% hit rate).

Resolves #98666.
Epic: none

Release note (performance improvement): The default Raft entry cache size has been increased from 16 MB to 1/256 of system memory with a minimum of 32 MB, divided evenly between all stores. This can be configured via `COCKROACH_RAFT_ENTRY_CACHE_SIZE`.
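
For concreteness, a minimal Go sketch of the scaling rule as described above (assumed from the PR text, not the exact CockroachDB implementation; in particular, applying the 32 MiB floor to the node-wide budget before splitting it across stores is an assumption):

```go
// Sketch: size the raft entry cache as 1/256 of system memory with a
// 32 MiB floor, shared evenly between the node's stores.
package main

import "fmt"

const minRaftEntryCacheSize = 32 << 20 // 32 MiB

// raftEntryCacheSize returns the per-store entry cache size in bytes for a
// node with totalMemBytes of system/cgroup memory and numStores stores.
func raftEntryCacheSize(totalMemBytes, numStores int64) int64 {
	if numStores < 1 {
		numStores = 1
	}
	budget := totalMemBytes / 256 // node-wide cache budget
	if budget < minRaftEntryCacheSize {
		budget = minRaftEntryCacheSize
	}
	return budget / numStores // split evenly between stores
}

func main() {
	// Example from the PR text: a 32 GiB single-store node gets 128 MiB.
	fmt.Println(raftEntryCacheSize(32<<30, 1)>>20, "MiB")
}
```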

107442: kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder r=erikgrinaker a=tbg

The test previously relied on aggressive liveness heartbeat expirations to
avoid running for too long. As a result, it was flaky since liveness wasn't
reliably pinned in the way the test wanted.

The hybrid manual clock allows time to jump forward at an opportune moment.
Use it here to avoid running with a tight lease interval.

On my gceworker, previously flaked within a few minutes. As of this commit, I
ran it for double-digit minutes without issue.

Fixes #107200.

Epic: None
Release note: None


107526: kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot r=erikgrinaker a=tbg

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach) to write tests like it in a deterministic framework[^1].

[^1]: see #105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but the above-raft mechanism also exists, and while we didn't want to interact with it, we sometimes do[^1].

The second commit introduces a testing knob that disables the above-raft
mechanism selectively. I've stressed the test for 15 minutes without issues
after this change.

[^1]: somewhat related to #107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Fixes #106383.

Epic: None
Release note: None


107531: kvserver: disable replicate queue and lease transfers in closedts tests r=erikgrinaker a=tbg

For a more holistic suggestion on how to fix this for the likely many other
tests susceptible to similar issues, see:
#107528

> 1171 runs so far, 0 failures, over 15m55s

Fixes #101824.

Release note: None
Epic: none


Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
@craig craig bot closed this as completed in #107424 Jul 26, 2023