liveness: allow registering callbacks after start #107265

tbg · 2023-07-20T14:03:59Z

I discovered¹ a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix #106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

Epic: None
Release note (bug fix): it was possible for a node status to reflect a "last up" timestamp that led the actual last liveness heartbeat of the node. This has been fixed.

¹ https://github.com/cockroachdb/cockroach/issues/106706#issuecomment-1640254715
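
The behavior described above (registration allowed at any time, with an initial set of notifications delivered synchronously) can be sketched roughly as follows. This is a simplified model, not the code in this PR: the `registry` type, `newRegistry`, and `markLive` are hypothetical names, and `NodeID` and `IsLiveCallback` are reduced stand-ins for the real types.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// NodeID and IsLiveCallback are simplified stand-ins for the real types.
type NodeID int

type IsLiveCallback func(NodeID)

// registry is a hypothetical, stripped-down model of the callback handling;
// it is not the actual NodeLiveness implementation.
type registry struct {
	mu        sync.Mutex
	live      map[NodeID]bool
	callbacks []IsLiveCallback
}

func newRegistry() *registry {
	return &registry{live: map[NodeID]bool{}}
}

// RegisterCallback may be called at any time, including after "start".
// Before returning, it tells the new callback about every node that is
// currently considered live.
func (r *registry) RegisterCallback(cb IsLiveCallback) {
	r.mu.Lock()
	r.callbacks = append(r.callbacks, cb)
	var liveNow []NodeID
	for id, ok := range r.live {
		if ok {
			liveNow = append(liveNow, id)
		}
	}
	r.mu.Unlock()
	// Sort for a deterministic order, then invoke outside the lock so that
	// a callback can safely call back into the registry.
	sort.Slice(liveNow, func(i, j int) bool { return liveNow[i] < liveNow[j] })
	for _, id := range liveNow {
		cb(id)
	}
}

// markLive records that a node became live and notifies all callbacks
// registered so far. Repeated calls for an already-live node are no-ops.
func (r *registry) markLive(id NodeID) {
	r.mu.Lock()
	wasLive := r.live[id]
	r.live[id] = true
	cbs := append([]IsLiveCallback(nil), r.callbacks...)
	r.mu.Unlock()
	if wasLive {
		return
	}
	for _, cb := range cbs {
		cb(id)
	}
}

func main() {
	r := newRegistry()
	r.markLive(1) // liveness is already running and tracking n1
	r.RegisterCallback(func(id NodeID) { fmt.Printf("n%d is live\n", id) })
	r.markLive(2) // later transitions reach the late registrant as usual
}
```

In this sketch a callback can observe the same node twice if a liveness transition races with registration, so it assumes callbacks are idempotent. Whether that holds for each real user of RegisterCallback is what the audit mentioned above would have to establish.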

blathers-crl · 2023-07-20T14:04:07Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

erikgrinaker · 2023-07-21T09:05:08Z

Right, this is probably fallout from #103601.

andrewbaptist · 2023-07-21T14:41:56Z

Thanks Tobi!

I see what you mean about this test. It was tricky to get some of the tests, including this one, to work without allowing later registration, as you did.

The approach you have works, but I had considered two other options as well (and could code them up if you want):

1. Don't allow disks/stores to be added to a running instance, instead require a restart. I wasn't sure if this was something that we support in a field deployment but I should have checked first. If we don't support this, then it would be possible to rearrange the code to allow this to work.
2. Have a "single" node level callback that then fans out to the stores on liveness changes. This simplifies the liveness code as there is only a single callback, and the node already has locking in place for adding stores.

Your approach also works and is probably the most straightforward.
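
For illustration, the second option (a single node-level callback that fans out to the stores) might look something like the sketch below. It was not implemented in this PR. Every name in it (`node`, `storeListener`, `AddStore`, `onLive`, `countingStore`) is hypothetical and only models the idea.

```go
package main

import (
	"fmt"
	"sync"
)

type NodeID int
type StoreID int

// storeListener is a hypothetical per-store hook, for example something a
// store's replicate queue would implement.
type storeListener interface {
	OnNodeLive(NodeID)
}

// node owns the single callback that liveness would know about and fans
// each notification out to whatever stores exist at that moment.
type node struct {
	mu     sync.Mutex
	stores map[StoreID]storeListener
}

func newNode() *node {
	return &node{stores: map[StoreID]storeListener{}}
}

// AddStore can run long after liveness has started. Liveness never learns
// about the new store because it only ever holds node.onLive.
func (n *node) AddStore(id StoreID, l storeListener) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.stores[id] = l
}

// onLive is the one callback that would be registered with liveness
// before it starts.
func (n *node) onLive(id NodeID) {
	n.mu.Lock()
	ls := make([]storeListener, 0, len(n.stores))
	for _, l := range n.stores {
		ls = append(ls, l)
	}
	n.mu.Unlock()
	// Notify outside the lock so listeners may call back into the node.
	for _, l := range ls {
		l.OnNodeLive(id)
	}
}

// countingStore records the notifications it receives.
type countingStore struct{ seen []NodeID }

func (c *countingStore) OnNodeLive(id NodeID) { c.seen = append(c.seen, id) }

func main() {
	n := newNode()
	s1 := &countingStore{}
	n.AddStore(1, s1)
	n.onLive(7) // liveness reports n7 as live

	s2 := &countingStore{} // a store bootstrapped later
	n.AddStore(2, s2)
	n.onLive(8)

	fmt.Println(s1.seen, s2.seen)
}
```

Note that in this form a store added late misses earlier notifications, so a real version would presumably still need some way to hand it the current liveness state.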

tbg · 2023-07-21T14:49:55Z

> Don't allow disks/stores to be added to a running instance, instead require a restart. I wasn't sure if this was something that we support in a field deployment but I should have checked first. If we don't support this, then it would be possible to rearrange the code to allow this to work

The new stores are being added following a restart. However, due to the need to assign StoreIDs, KV needs to be up and running. So these new stores are added "async". I don't see a good way around that.

> Have a "single" node level callback that then fans out to the stores on liveness changes. This simplifies the liveness code as there is only a single callback, and the node already has locking in place for adding stores.

That's not a bad idea, though I won't be able to do this before I leave (and it might be crowded out anyway). So even though I like that idea, I'd like to stick with what I have, so that we can at least address the bug.

I think there was a bug here. This method was previously invoked in `updateLiveness`, but that method is the general workhorse for updating anyone's liveness. In particular, it is called by `IncrementEpoch`. So we were invoking `onSelfHeartbeat` when we would increment other nodes' epochs. This doesn't seem great.

Additionally, the code was trying to avoid invoking this callback before liveness was officially "started". Heartbeating yourself before liveness is started is unfortunately a thing due to the tangled start-up initialization sequence; we may see heartbeats triggered by lease requests.

Avoid both complications by invoking `onSelfHeartbeat` from the actual main heartbeat loop, whose only job is to heartbeat the node's own liveness record.

I tried to adapt `TestNodeHeartbeatCallback` to give better coverage, but it's a yak shave. A deterministic node liveness (i.e. a way to invoke the main heartbeat loop manually) would make this a lot simpler. I filed an issue to that effect: cockroachdb#107452
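
A toy model of the fix described above is sketched below, under heavy simplification. Only `onSelfHeartbeat` and `IncrementEpoch` are names taken from the text; the `liveness` and `record` types, `update`, and `heartbeatOnce` are hypothetical stand-ins. The point is that the shared update path knows nothing about the callback, and only the heartbeat loop fires it.

```go
package main

import "fmt"

type NodeID int

// record is a toy liveness record.
type record struct {
	Epoch      int
	Heartbeats int
}

// liveness is a hypothetical model, not the real NodeLiveness.
type liveness struct {
	self            NodeID
	records         map[NodeID]*record
	onSelfHeartbeat func()
}

// update is the general workhorse: it writes any node's record and
// deliberately does not touch the self-heartbeat callback.
func (l *liveness) update(id NodeID, mutate func(*record)) {
	rec, ok := l.records[id]
	if !ok {
		rec = &record{}
		l.records[id] = rec
	}
	mutate(rec)
}

// IncrementEpoch updates another node's record through the workhorse, so
// it can no longer trigger onSelfHeartbeat by accident.
func (l *liveness) IncrementEpoch(id NodeID) {
	l.update(id, func(r *record) { r.Epoch++ })
}

// heartbeatOnce models one iteration of the main heartbeat loop. It is the
// only place that fires onSelfHeartbeat.
func (l *liveness) heartbeatOnce() {
	l.update(l.self, func(r *record) { r.Heartbeats++ })
	if l.onSelfHeartbeat != nil {
		l.onSelfHeartbeat()
	}
}

func main() {
	calls := 0
	l := &liveness{
		self:            1,
		records:         map[NodeID]*record{},
		onSelfHeartbeat: func() { calls++ },
	}
	l.IncrementEpoch(2) // touches n2 only; no callback
	l.heartbeatOnce()   // the loop heartbeats n1 and fires the callback
	fmt.Println("onSelfHeartbeat calls:", calls)
}
```

In this model, a heartbeat issued outside the loop (for example one triggered by a lease request before start) would go through `update` directly and therefore would not fire the callback either.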

erikgrinaker · 2023-07-26T07:46:30Z

We'll want a targeted backport here too, since it seems like prior versions are vulnerable to the same issue (but I'm not sure with all the refactoring in this area).

tbg · 2023-07-26T08:00:02Z

> We'll want a targeted backport here too, since it seems like prior versions are vulnerable to the same issue (but I'm not sure with all the refactoring in this area).

I'm not sure we should fix this in a backport. The chances of messing something up are much higher than anyone hitting this bug. Folks very very rarely add stores to existing nodes (multi-storage in itself is rare) and doing so in a way that loses quorum on the storeIDGen is even rarer. Also, there is a workaround - restart a quorum without the extra store.

If you disagree, let's file an issue because I am not sure I would get to this backport even if we tried.

erikgrinaker · 2023-07-26T08:01:48Z

Ok, I can buy that. Was thinking of surrounding issues like the self callback too though, but they're fairly minor.

tbg · 2023-07-26T08:05:03Z

The self-callback I can fix in a backport - we can just add a conditional (if we don't insist on also adding a test).

The hang at startup, btw, I think has existed in every past version of CRDB, and to the best of my knowledge was never hit by anyone in a real cluster. Even the test that caught it only did so after I refactored it and made it use in-mem engines. So I am fairly confident that not backporting is the reasonable strategy here.

tbg · 2023-07-26T08:19:02Z

bors r=erikgrinaker

tbg · 2023-07-26T08:41:35Z

Backports:

#107605
#107606

craig · 2023-07-26T09:17:06Z

Build succeeded:

Bazel Essential CI (Cockroach)

tbg mentioned this pull request Jul 20, 2023

server: avoid deadlock when initing additional stores #107124

Merged

tbg requested review from erikgrinaker and andrewbaptist July 21, 2023 08:58

tbg added the db-cy-23 label Jul 21, 2023

tbg marked this pull request as ready for review July 21, 2023 11:20

tbg requested a review from a team as a code owner July 21, 2023 11:20

erikgrinaker reviewed Jul 24, 2023

tbg force-pushed the liveness-register-anytime branch 2 times, most recently from 05cbe38 to 9c8bddd Compare July 24, 2023 10:27

tbg added 13 commits July 24, 2023 17:55

kvserver: avoid implicit nils in TestReplicaLeaseCounters

7c2b6ea

The two removed fields are nil. This made a test failure during the refactors in this PR more annoying. If we're going to set up a half-inited NodeLiveness, let's at least be honest about it.

gossip: add GetNodeID accessor

708435a

liveness: add and adopt Gossip interface

7d32c4f

This will help with testing.

liveness: rename storage -> storageImpl

31b9ec4

We'll give this a proper interface soon.

liveness: export storage methods

beb2c9b

So that it can implement a public interface.

liveness: switch storage to ptr receivers

ea7cc94

liveness: export LivenessUpdate

52792fc

It's needed to implement the liveness storage (once it exists).

liveness: half-adopt Storage

ef8410e

We still want a Storage to be passed into NewNodeLiveness as opposed to a `*kv.DB`, but so far, so good.

liveness: finish adopting Storage

f865176

Now Liveness is constructed using a `Storage` as opposed to a `*kv.DB`.

liveness: move liveness and store regex to globals

4bf5865

Helps with testing.

liveness: add test for IsLiveCallback invocation

d767731

This tests that regardless of when a callback is registered, it gets called.

tbg force-pushed the liveness-register-anytime branch from 9c8bddd to d767731 Compare July 24, 2023 15:57

erikgrinaker reviewed Jul 25, 2023

andrewbaptist approved these changes Jul 25, 2023

tbg requested a review from erikgrinaker July 26, 2023 07:23

erikgrinaker approved these changes Jul 26, 2023

tbg mentioned this pull request Jul 26, 2023

release-23.1: liveness: write LastUpTimestamp only on self-heartbeat #107605

Merged

tbg added a commit to tbg/cockroach that referenced this pull request Jul 26, 2023

liveness: write LastUpTimestamp only on self-heartbeat

396a7e6

See cockroachdb#107265 (comment). Epic: none Release note: none

tbg mentioned this pull request Jul 26, 2023

release-22.2: liveness: write LastUpTimestamp only on self-heartbeat #107606

Merged

tbg added a commit to tbg/cockroach that referenced this pull request Jul 26, 2023

liveness: write LastUpTimestamp only on self-heartbeat

37f8131

See cockroachdb#107265 (comment). Epic: none Release note: none

craig bot merged commit f147c2b into cockroachdb:master Jul 26, 2023

kvoli pushed a commit to kvoli/cockroach that referenced this pull request Jul 26, 2023

liveness: write LastUpTimestamp only on self-heartbeat

52c21e0

See cockroachdb#107265 (comment). Epic: none Release note: none
