Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refresh ring when node UP event received for missing host #1669

Merged
merged 2 commits into from
Mar 27, 2023

Conversation

sseidman
Copy link
Contributor

@sseidman sseidman commented Jan 11, 2023

The current issue is that there are no events dispatched for NEW_NODE changes. In a 3 node cassandra cluster deployed on cloud infrastructure, a cassandra instance was replaced using the replace_address flag so that it was moved to a new cloud instance (new IP, new host ID). The following debug logs are an example of how the client is currently handling this change in topology.

2023/01/11 20:59:35 gocql: handling frame: [topology_change change=NEW_NODE host=10.128.191.122 port=9042]
2023/01/11 20:59:35 gocql: handling frame: [status_change change=UP host=10.128.191.122 port=9042]
2023/01/11 20:59:35 gocql: dispatching event: &{change:UP host:[10 128 191 122] port:9042}
2023/01/11 20:59:35 gocql: Session.handleNodeUp: 10.128.191.122:9042

There is no event dispatched for the NEW_NODE event and it jumps straight to processing the UP event, so the new node is never added to session ring. If all nodes in the cluster are replaced in this manner, the eventual outcome is that clients lose connection to the cluster and begin outputting gocql: no hosts available in the pool.

After bisecting recent commits, I found this PR introduced the bug into the client. It looks like there was seemingly a fail-safe for when the NEW_NODE event was missed, but this function was removed in the previously mentioned PR.

This change proposes to refresh the hostSource ring on UP events when the host cannot be found in the ring. This ensures that the hostMap stays up to date even if NEW_NODE events are not processed.

This should fix #1668, #1667, and #1582

@sseidman
Copy link
Contributor Author

To be clear, this is occurs when multiple frames are passed to handleNodeEvent here. The NEW_NODE event can get overwritten by the UP event depending on what order they are processed:

2023/01/20 19:55:50 debug: begin frame processing loop
2023/01/20 19:55:50 debug: processing frame
2023/01/20 19:55:50 debug: topology frame event. change: NEW_NODE host: 10.131.0.96
2023/01/20 19:55:50 debug: host not duplicate, adding event to hostEvent map
2023/01/20 19:55:50 debug: setting event change type
2023/01/20 19:55:50 debug: processing frame
2023/01/20 19:55:50 debug: status frame event. change: UP host: 10.131.0.96
2023/01/20 19:55:50 debug: setting event change type
2023/01/20 19:55:50 debug: end frame processing loop

@sseidman
Copy link
Contributor Author

sseidman commented Feb 2, 2023

any thoughts on this?

@mikelococo
Copy link

mikelococo commented Feb 10, 2023

We're now running these changes on our sizeable production infrastructure, having built them off a private fork.

This issue doesn't just affect folks who cycle ip's on Cassandra restart, it also prevents node-rotation. Our IPs are stable across restarts, but we regularly perform vertical scaling operations with -Dcassandra.replace_address_first_boot=<dead_node_ip> where every Cassandra node is replaced in sequence with bigger hardware running on a new IP address. This operation was regularly causing serving impacts where the gocql clients would fail to re-establish connections to the new nodes resulting first in hotspots and then eventually in complete loss of connectivity to the cluster. It appears that a growing number of folks are impacted by this issue.

We (sseidman, myself, and the folks we work with) are happy to do any reasonable amount of work required to make this fix (or a better designed fix, if the maintainers think this is a non-optimal approach) upstreamable. Whether that's more tests, docs, testing alternate approaches... I think there's significant value in addressing this bug and we're motivated to try to get a fix merged if there's a world where that can happen. I don't think we plan on doing further nudges here to try to elicit a maintainer response, though.

mfamador added a commit to mfamador/gocql that referenced this pull request Feb 10, 2023
@StevenLacerda
Copy link

@martin-sucha Any chance this can be reviewed and merged?

@lwimmer
Copy link

lwimmer commented Mar 20, 2023

#915 seems to be related too. Do you think this will fix it too?

@martin-sucha
Copy link
Contributor

Hi! Looks good to me. Sorry for the delay.

What do you guys think of merging #1680 instead? Would that one work for you? Seems that it handles more cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Session hosts are not being updated
5 participants