-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
liveness: remove one-time init fields from the mu lock #103601
liveness: remove one-time init fields from the mu lock #103601
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
0c7bb37
to
053c2e4
Compare
This also fixes some "race" conditions related to node startup and notification of liveness changes. I don't think any of them were particularly important, but it could have caused test flakiness. The goal is to be able to cleanly split the 3 main concepts of liveness: persistence, gossip cache, and liveness publishing as they are currently somewhat overlapping. |
053c2e4
to
e3cc3fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it would be better to set this up during construction, instead of having to do all this non-obvious synchronization around Start()
. Is there a compelling reason not to? The engines are already available during construction, and it seems like it should be possible to propagate the various callbacks as well.
Reviewable status: complete! 0 of 0 LGTMs obtained
e3cc3fc
to
ecd04f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR:
In theory, this should be possible, but it was a larger refactor than I wanted to take on. I started down this path, but soon realized that it required major surgery to the NewServer
and PreStart
methods in server
. It is pretty difficult with the current code structure.
I gave up on this refactor because I kept hitting issues with the "first node start" and "gossip vs health". One problem was that the liveness is read in the cache from a couple of unexpected places, specifically the replica_range_lease
for when it first tries to get an epoch lease (
l, ok := p.repl.store.cfg.NodeLiveness.GetLiveness(nextLeaseHolder.NodeID) |
I moved the one comment from the next commit back into this commit.
Reviewable status: complete! 0 of 0 LGTMs obtained
190465f
to
dae8d61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was a larger refactor than I wanted to take on
Yeah, that's fair -- the startup process is a bit gnarly. But thanks for moving Engines
to construction-time! Looks like it should be straightforward to move onSelfLive
to construction rather than Start()
too, since it only depends on clock
and stores
which are also available at construction time.
I suppose onSelfLive
could also be registered as a regular onIsLive
callback to avoid the duplicate logic. But there might be some subtlety around which ones are called when. Just an idea if you feel like it.
Reviewed 2 of 4 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
pkg/kv/kvserver/liveness/liveness.go
line 226 at r2 (raw file):
// onIsLive is a callback registered by stores prior to starting liveness. // It fires when a node transitions from not live to live. onIsLive []IsLiveCallback // see NodeLivenessOptions.OnSelfLive
nit: NodeLivenessOptions.OnSelfLive
should be RegisterCallback
pkg/kv/kvserver/liveness/liveness.go
line 227 at r2 (raw file):
syncutil.RWMutex onIsLive []IsLiveCallback // see NodeLivenessOptions.OnSelfLive // nodes is an in-memory cache of liveness records that NodeLiveness
This paragraph was wrongfully deleted.
pkg/kv/kvserver/liveness/liveness.go
line 764 at r2 (raw file):
for _, l := range nl.GetLivenesses() { for _, fn := range nl.onIsLive { fn(l)
This needs to check that the nodes are actually live, i.e. l.IsLive()
.
pkg/kv/kvserver/liveness/liveness.go
line 770 at r2 (raw file):
if nl.onSelfLive != nil { if _, found := nl.Self(); found { nl.onSelfLive(ctx)
This also needs to check that we're actually live, i.e. that we were able to heartbeat.
pkg/kv/kvserver/liveness/liveness.go
line 1310 at r2 (raw file):
// different replica and nudge it into acquiring the lease. This can leak a // goroutine in the case of a stalled disk. engines := nl.engines
nit: now that this is no longer mutable, we don't need to copy it into a local variable anymore.
pkg/kv/kvserver/node_liveness_test.go
line 310 at r2 (raw file):
verifyLiveness(t, tc) pauseNodeLivenessHeartbeatLoops(tc) tc.Servers[0].NodeLiveness().(*liveness.NodeLiveness).TestingStart(false)
Seems like it'd be better to just set up the callbacks before starting liveness, and check the time in the callback. We can propagate the callback down via a testing knob, unless we already have a suitable knob.
pkg/kv/kvserver/replica_range_lease_test.go
line 131 at r2 (raw file):
zonepb.DefaultZoneConfigRef()), }) l.TestingStart(true)
This doesn't appear necessary.
6ce6165
to
97e0524
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cleaned up the onSelfLive (renamed to onSelfHeartbeat to clarify) and had it injected at construction time.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/liveness/liveness.go
line 226 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit:
NodeLivenessOptions.OnSelfLive
should beRegisterCallback
done
pkg/kv/kvserver/liveness/liveness.go
line 227 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This paragraph was wrongfully deleted.
recovered that text
pkg/kv/kvserver/liveness/liveness.go
line 764 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This needs to check that the nodes are actually live, i.e.
l.IsLive()
.
added a check here for whether it is actually live.
pkg/kv/kvserver/liveness/liveness.go
line 770 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This also needs to check that we're actually live, i.e. that we were able to heartbeat.
removed this completely now that it is injected in construction time so won't be missed before Start is called.
pkg/kv/kvserver/liveness/liveness.go
line 1310 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: now that this is no longer mutable, we don't need to copy it into a local variable anymore.
fixed
pkg/kv/kvserver/node_liveness_test.go
line 310 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Seems like it'd be better to just set up the callbacks before starting liveness, and check the time in the callback. We can propagate the callback down via a testing knob, unless we already have a suitable knob.
Added a testing knob and injected it that way. This does make it cleaner by removing the need for TestingStart
pkg/kv/kvserver/replica_range_lease_test.go
line 131 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This doesn't appear necessary.
Removed - no longer needed after changing where the callbacks are set up.
Thanks - hopefully last pass! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the cleanups!
Reviewed 6 of 6 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
pkg/kv/kvserver/liveness/liveness.go
line 770 at r2 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
removed this completely now that it is injected in construction time so won't be missed before Start is called.
Nice, that avoids the ambiguity as well.
TFTR! bors r=erikgrinaker |
Seperate out fields that don't need to be under the mu lock since they are initialized either prior or part of start. This simplifies later PRs that refactor liveness. Additionally this addresses an issue where the livenvess can be added to the cache before the callbacks are set up. That would prevent these callbacks from firing and could cause issues if nodes are not correctly registered. Epic: none Release note: None
97e0524
to
9f9848f
Compare
Canceled. |
bors retry |
1 similar comment
bors retry |
Already running a review |
Build succeeded: |
Seperate out fields that don't need to be under the mu lock since they are initialized either prior or part of start.
This simplifies later PRs that refactor liveness.