Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

liveness: remove one-time init fields from the mu lock #103601

Merged

Conversation

andrewbaptist
Copy link
Collaborator

Seperate out fields that don't need to be under the mu lock since they are initialized either prior or part of start.
This simplifies later PRs that refactor liveness.

@blathers-crl
Copy link

blathers-crl bot commented May 18, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andrewbaptist andrewbaptist force-pushed the 2023-05-18-fix-liveness-locks branch 9 times, most recently from 0c7bb37 to 053c2e4 Compare May 19, 2023 17:37
@andrewbaptist
Copy link
Collaborator Author

This also fixes some "race" conditions related to node startup and notification of liveness changes. I don't think any of them were particularly important, but it could have caused test flakiness. The goal is to be able to cleanly split the 3 main concepts of liveness: persistence, gossip cache, and liveness publishing as they are currently somewhat overlapping.

@andrewbaptist andrewbaptist marked this pull request as ready for review May 19, 2023 20:12
@andrewbaptist andrewbaptist requested a review from a team as a code owner May 19, 2023 20:12
@andrewbaptist andrewbaptist requested review from erikgrinaker and removed request for a team May 19, 2023 20:12
@andrewbaptist andrewbaptist force-pushed the 2023-05-18-fix-liveness-locks branch from 053c2e4 to e3cc3fc Compare May 24, 2023 21:16
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it would be better to set this up during construction, instead of having to do all this non-obvious synchronization around Start(). Is there a compelling reason not to? The engines are already available during construction, and it seems like it should be possible to propagate the various callbacks as well.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@andrewbaptist andrewbaptist force-pushed the 2023-05-18-fix-liveness-locks branch from e3cc3fc to ecd04f0 Compare May 25, 2023 12:28
Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR:
In theory, this should be possible, but it was a larger refactor than I wanted to take on. I started down this path, but soon realized that it required major surgery to the NewServer and PreStart methods in server. It is pretty difficult with the current code structure.

I gave up on this refactor because I kept hitting issues with the "first node start" and "gossip vs health". One problem was that the liveness is read in the cache from a couple of unexpected places, specifically the replica_range_lease for when it first tries to get an epoch lease (

l, ok := p.repl.store.cfg.NodeLiveness.GetLiveness(nextLeaseHolder.NodeID)
). So that needs to happen before Start is called otherwise we can't get the lease on this range. I am going to add a note that if we ever completely remove epoch leases this should be possible, but I'm not sure how to do it right now.

I moved the one comment from the next commit back into this commit.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@andrewbaptist andrewbaptist force-pushed the 2023-05-18-fix-liveness-locks branch 2 times, most recently from 190465f to dae8d61 Compare May 25, 2023 21:52
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was a larger refactor than I wanted to take on

Yeah, that's fair -- the startup process is a bit gnarly. But thanks for moving Engines to construction-time! Looks like it should be straightforward to move onSelfLive to construction rather than Start() too, since it only depends on clock and stores which are also available at construction time.

I suppose onSelfLive could also be registered as a regular onIsLive callback to avoid the duplicate logic. But there might be some subtlety around which ones are called when. Just an idea if you feel like it.

Reviewed 2 of 4 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/kv/kvserver/liveness/liveness.go line 226 at r2 (raw file):

	// onIsLive is a callback registered by stores prior to starting liveness.
	// It fires when a node transitions from not live to live.
	onIsLive []IsLiveCallback // see NodeLivenessOptions.OnSelfLive

nit: NodeLivenessOptions.OnSelfLive should be RegisterCallback


pkg/kv/kvserver/liveness/liveness.go line 227 at r2 (raw file):

		syncutil.RWMutex
		onIsLive []IsLiveCallback // see NodeLivenessOptions.OnSelfLive
		// nodes is an in-memory cache of liveness records that NodeLiveness

This paragraph was wrongfully deleted.


pkg/kv/kvserver/liveness/liveness.go line 764 at r2 (raw file):

	for _, l := range nl.GetLivenesses() {
		for _, fn := range nl.onIsLive {
			fn(l)

This needs to check that the nodes are actually live, i.e. l.IsLive().


pkg/kv/kvserver/liveness/liveness.go line 770 at r2 (raw file):

	if nl.onSelfLive != nil {
		if _, found := nl.Self(); found {
			nl.onSelfLive(ctx)

This also needs to check that we're actually live, i.e. that we were able to heartbeat.


pkg/kv/kvserver/liveness/liveness.go line 1310 at r2 (raw file):

		// different replica and nudge it into acquiring the lease. This can leak a
		// goroutine in the case of a stalled disk.
		engines := nl.engines

nit: now that this is no longer mutable, we don't need to copy it into a local variable anymore.


pkg/kv/kvserver/node_liveness_test.go line 310 at r2 (raw file):

	verifyLiveness(t, tc)
	pauseNodeLivenessHeartbeatLoops(tc)
	tc.Servers[0].NodeLiveness().(*liveness.NodeLiveness).TestingStart(false)

Seems like it'd be better to just set up the callbacks before starting liveness, and check the time in the callback. We can propagate the callback down via a testing knob, unless we already have a suitable knob.


pkg/kv/kvserver/replica_range_lease_test.go line 131 at r2 (raw file):

					zonepb.DefaultZoneConfigRef()),
			})
			l.TestingStart(true)

This doesn't appear necessary.

@andrewbaptist andrewbaptist force-pushed the 2023-05-18-fix-liveness-locks branch 2 times, most recently from 6ce6165 to 97e0524 Compare May 26, 2023 14:44
Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned up the onSelfLive (renamed to onSelfHeartbeat to clarify) and had it injected at construction time.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/liveness/liveness.go line 226 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: NodeLivenessOptions.OnSelfLive should be RegisterCallback

done


pkg/kv/kvserver/liveness/liveness.go line 227 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This paragraph was wrongfully deleted.

recovered that text


pkg/kv/kvserver/liveness/liveness.go line 764 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This needs to check that the nodes are actually live, i.e. l.IsLive().

added a check here for whether it is actually live.


pkg/kv/kvserver/liveness/liveness.go line 770 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This also needs to check that we're actually live, i.e. that we were able to heartbeat.

removed this completely now that it is injected in construction time so won't be missed before Start is called.


pkg/kv/kvserver/liveness/liveness.go line 1310 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: now that this is no longer mutable, we don't need to copy it into a local variable anymore.

fixed


pkg/kv/kvserver/node_liveness_test.go line 310 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Seems like it'd be better to just set up the callbacks before starting liveness, and check the time in the callback. We can propagate the callback down via a testing knob, unless we already have a suitable knob.

Added a testing knob and injected it that way. This does make it cleaner by removing the need for TestingStart


pkg/kv/kvserver/replica_range_lease_test.go line 131 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This doesn't appear necessary.

Removed - no longer needed after changing where the callbacks are set up.

@andrewbaptist
Copy link
Collaborator Author

Thanks - hopefully last pass!

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the cleanups!

Reviewed 6 of 6 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/kv/kvserver/liveness/liveness.go line 770 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

removed this completely now that it is injected in construction time so won't be missed before Start is called.

Nice, that avoids the ambiguity as well.

@andrewbaptist
Copy link
Collaborator Author

TFTR!

bors r=erikgrinaker

Seperate out fields that don't need to be under the mu lock since they
are initialized either prior or part of start.
This simplifies later PRs that refactor liveness.

Additionally this addresses an issue where the livenvess can be added to
the cache before the callbacks are set up. That would prevent these
callbacks from firing and could cause issues if nodes are not correctly
registered.

Epic: none

Release note: None
@andrewbaptist andrewbaptist force-pushed the 2023-05-18-fix-liveness-locks branch from 97e0524 to 9f9848f Compare May 26, 2023 16:07
@craig
Copy link
Contributor

craig bot commented May 26, 2023

Canceled.

@andrewbaptist
Copy link
Collaborator Author

bors retry

1 similar comment
@andrewbaptist
Copy link
Collaborator Author

bors retry

@craig
Copy link
Contributor

craig bot commented May 26, 2023

Already running a review

@craig
Copy link
Contributor

craig bot commented May 26, 2023

Build succeeded:

@craig craig bot merged commit 743486e into cockroachdb:master May 26, 2023
@andrewbaptist andrewbaptist deleted the 2023-05-18-fix-liveness-locks branch December 15, 2023 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants