Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always refresh ring on topology changes and reconnections #1680

Merged
merged 7 commits into from
Apr 25, 2023

Conversation

joao-r-reis
Copy link
Contributor

@joao-r-reis joao-r-reis commented Mar 13, 2023

Currently, there are some cases where gocql does not refresh the ring during a control connection reconnection. If topology changes are happening on the cluster during the reconnection, some NEW_NODE and REMOVED_NODE protocol events might be missed.

To fix this, every successful reconnection on the control connection should trigger a ring refresh operation. This PR adds that.

Also, this PR fixes an issue where NEW_NODE events can be overwritten by UP events and cause certain nodes to not be added to the ring. This is an issue already described in #1669 but the fix proposed by this PR is a bit different than one outlined on that PR. This PR causes any topology event to trigger a ring refresh regardless of how many nodes were removed or added. This eliminates the need to have duplicated logic around adding and removing nodes in both ringDescriber and Session.

Note: this is not an implementation of #1681.

@martin-sucha
Copy link
Contributor

Hi! Thanks for the pull request.

Did you mean to put removal of addNewNode to the remove unused code commit instead of handleNewNode and handleRemovedNode?

Could you please update the description of commit 326072b to what the description of the merge request says? refresh ring improvements is a bit vague. The commit title could be something like Always refresh ring on topology change. That should make it easier to find the commit in commit log and see the more details with git blame.

It seems gocql currently does not de-bounce the ring refresh, so always refreshing the ring could theoretically cause a lot of loads if events come in quick succession. I suspect this is not a problem in practice, but we can still add the de-bounce (maybe in a separate pull request?) to guard against that.

Another thing that I noticed while reviewing the code is that refreshRing loads data from system.peers, but seems to not use data from system.local. That happens only in addNewNode resp. ringDescriber.getHostInfo, which is not used in current master. We should read from system.local in ringDescriber.GetHosts as well.

Otherwise this looks good to me.

It would be great if you could add a test to make sure that we don't introduce the same issues in the future.

@joao-r-reis joao-r-reis force-pushed the refresh-ring-reconnect branch from 3d83e75 to 829456e Compare March 20, 2023 17:31
@joao-r-reis joao-r-reis changed the title Fix ring refresh issues Always refresh ring on topology changes and reconnections Mar 20, 2023
@joao-r-reis
Copy link
Contributor Author

joao-r-reis commented Mar 20, 2023

Thanks for the quick review. I've changed the commits a bit so they should be clearer now.

Did you mean to put removal of addNewNode to the remove unused code commit instead of handleNewNode and handleRemovedNode?

I removed addNewNode initially and then I found out that handleNewNode and handleRemovedNode were also unused so I removed these 2 later as well (now they are in the same commit).

It seems gocql currently does not de-bounce the ring refresh, so always refreshing the ring could theoretically cause a lot of loads if events come in quick succession. I suspect this is not a problem in practice, but we can still add the de-bounce (maybe in a separate pull request?) to guard against that.

The topology change events are being debounced but I agree that it would probably be a good idea to debounce the ring refresh operations themselves. I'm happy to address this in a subsequent PR or I can just create a new issue so another contributor can pick it up, let me know what you prefer.

Another thing that I noticed while reviewing the code is that refreshRing loads data from system.peers, but seems to not use data from system.local. That happens only in addNewNode resp. ringDescriber.getHostInfo, which is not used in current master. We should read from system.local in ringDescriber.GetHosts as well.

I'll start working on this and I'll also look into writing a test for this.

@martin-sucha
Copy link
Contributor

Did you mean to put removal of addNewNode to the remove unused code commit instead of handleNewNode and handleRemovedNode?

I removed addNewNode initially and then I found out that handleNewNode and handleRemovedNode were also unused so I removed these 2 later as well (now they are in the same commit).

Techically, handleNewNode and handleRemovedNode are used until 7b506bc so the removal should be part of that commit.

It seems gocql currently does not de-bounce the ring refresh, so always refreshing the ring could theoretically cause a lot of loads if events come in quick succession. I suspect this is not a problem in practice, but we can still add the de-bounce (maybe in a separate pull request?) to guard against that.

The topology change events are being debounced but I agree that it would probably be a good idea to debounce the ring refresh operations themselves. I'm happy to address this in a subsequent PR or I can just create a new issue so another contributor can pick it up, let me know what you prefer.

Feel free to send a pull request, that would be great.

Another thing that I noticed while reviewing the code is that refreshRing loads data from system.peers, but seems to not use data from system.local. That happens only in addNewNode resp. ringDescriber.getHostInfo, which is not used in current master. We should read from system.local in ringDescriber.GetHosts as well.

I'll start working on this and I'll also look into writing a test for this.

Thanks!

@martin-sucha martin-sucha mentioned this pull request Mar 21, 2023
@joao-r-reis joao-r-reis force-pushed the refresh-ring-reconnect branch from 829456e to e204b56 Compare March 27, 2023 16:26
@joao-r-reis
Copy link
Contributor Author

I've rebased my branch and pushed commits to make GetHosts() retrieve the local host info from system.local and to remove getHostInfo() which was only used by addNewNode() and this function was removed in one of my earlier commits because it was also unused.

For the debouncer, I would think having an event debouncer would be enough and I see that gocql already has one, is there a need to have two debouncers? What would be the goal of the ring refresh debouncer?

@StevenLacerda
Copy link

@martin-sucha Can you take a look at the merge request? This should help tremendously with K8's environments. Thanks!

@joao-r-reis
Copy link
Contributor Author

I've implemented a debouncer, can you take a look and see if this is aligned with what you were thinking? I'm going to look into tests now.

host_source.go Outdated Show resolved Hide resolved
@joao-r-reis joao-r-reis force-pushed the refresh-ring-reconnect branch from 295a4a5 to a891a8b Compare March 28, 2023 16:54
@joao-r-reis
Copy link
Contributor Author

joao-r-reis commented Mar 30, 2023

I've added unit tests but I don't see a way to simulate a topology change for an integration test, does the current test code support writing an integration test that could test whether the control connection updates the topology when it reconnects?

Oh I've also changed the host_source_test.go tests to run with the unit flag instead of cassandra flag since this file seems to contain unit tests only.

@martin-sucha
Copy link
Contributor

I've added unit tests but I don't see a way to simulate a topology change for an integration test, does the current test code support writing an integration test that could test whether the control connection updates the topology when it reconnects?

Yeah, the current integration tests use ccm and it would probably be hard to write an integration test that changes topology. We should probably use Go code directly instead of ccm to setup the cluster. That is a bigger chunk of work though. I've opened #1686 to track that.

Oh I've also changed the host_source_test.go tests to run with the unit flag instead of cassandra flag since this file seems to contain unit tests only.

Nice catch. Thanks!

Copy link
Contributor

@martin-sucha martin-sucha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for preparing the changes! Looks good to me in general, I've added some inline comments.

host_source.go Outdated Show resolved Hide resolved
host_source.go Outdated Show resolved Hide resolved
@joao-r-reis
Copy link
Contributor Author

Managed to add a ccm test that tests whether the control connection refreshes the ring on reconnect. I used DisableTopologyEvents config and a custom HostFilter to avoid using real topology changes.

I'll take a look at your review feedback now, thanks.

Copy link
Contributor

@martin-sucha martin-sucha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've noticed one more thing, added inline comment.

host_source.go Show resolved Hide resolved
- remove Conn.localHostInfo() and add Session.hostInfoFromIter()
- add ringDescriber.getLocalHostInfo()
- remove local host from ringDescriber.getClusterPeerInfo() returned slice
- make ringDescriber.GetHosts() call ringDescriber.getLocalHostInfo() in addition to ringDescriber.getClusterPeerInfo()
- Ring refresh can be requested by topology event handler and control connection reconnections
@joao-r-reis joao-r-reis force-pushed the refresh-ring-reconnect branch from c91e08f to b9fa942 Compare April 11, 2023 14:53
@StevenLacerda
Copy link

@martin-sucha @joao-r-reis Is this ready to be reviewed and merged? Really looking forward to getting this out as it will help several of our customers.

@joao-r-reis
Copy link
Contributor Author

I'm waiting on a review and ready to implement any changes that are suggested. I've addressed all comments of the previous reviews.

@martin-sucha martin-sucha merged commit f42e40c into apache:master Apr 25, 2023
@martin-sucha
Copy link
Contributor

Looks good to me. Thank you!

@martin-sucha
Copy link
Contributor

It seems there is a data race in the ccm test. @joao-r-reis could you please check #1691 ?

@joao-r-reis joao-r-reis deleted the refresh-ring-reconnect branch April 25, 2023 10:05
@StevenLacerda
Copy link

@joao-r-reis @martin-sucha Any updates? What's needed at this point?

@joao-r-reis
Copy link
Contributor Author

joao-r-reis commented May 8, 2023

@joao-r-reis @martin-sucha Any updates? What's needed at this point?

This has already been released in v1.4.0 AFAIK

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants