Fix k8s reconnect #1682
Conversation
36d1f2b to 8593ee8
@martin-sucha At first glance, I'm not thrilled with refreshing the ring on every topology change. I share your fears re: thrashing. It also feels bad/inefficient to disregard the actual information in a NEW_NODE message (i.e., knowledge of the host that is actually added) and trigger a full-scale refresh. OTOH, doing a full ring refresh might help with consistency when the client misses a node add or remove event (if that ever occurs - I can't say). Just to ask - would either approach interfere or fail when

TL;DR I prefer my targeted approach to handling topology events, but #1680 looks like it would be sufficient for the scenarios I've considered. I view both as functionally equivalent.
I've implemented a full ring refresh in #1680 for topology events because some drivers do it this way and I've not heard of issues coming from that approach, although those drivers generally use a debouncer, which I did not include in #1680 (I was planning to add it in a subsequent PR). The full ring refresh approach avoids issues that could come from weird ordering of events and potential race conditions. This added consistency is one of the reasons for the approach, but another is that it reduces the maintenance burden of supporting code that might not need to exist: the code that performs a ring refresh on control connection setup/reconnection can be reused for the topology event handlers.

In any case, both of the PRs fix the events issue, so I can just remove the topology event changes from #1680 if you prefer the approach taken by this PR instead.
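For context, here is a minimal sketch of the kind of debouncer mentioned above. The names (`ringRefreshDebouncer`, `refresh`) are hypothetical and this is not the actual gocql implementation; it only illustrates coalescing a burst of topology events into one ring refresh.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ringRefreshDebouncer coalesces a burst of topology events into a single
// full ring refresh. Names and structure are illustrative, not gocql's own.
type ringRefreshDebouncer struct {
	mu      sync.Mutex
	timer   *time.Timer
	delay   time.Duration
	refresh func() // e.g. the driver's "refresh the whole ring" routine
}

func newRingRefreshDebouncer(delay time.Duration, refresh func()) *ringRefreshDebouncer {
	return &ringRefreshDebouncer{delay: delay, refresh: refresh}
}

// debounce schedules a refresh; if another event arrives before the delay
// expires, the timer is reset so only one refresh runs per burst.
func (d *ringRefreshDebouncer) debounce() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.delay, d.refresh)
}

func main() {
	d := newRingRefreshDebouncer(500*time.Millisecond, func() {
		fmt.Println("refreshing ring once for this burst of events")
	})
	// A rolling restart typically produces many NEW_NODE/REMOVED_NODE events
	// in quick succession; only one refresh should fire for all of them.
	for i := 0; i < 5; i++ {
		d.debounce()
	}
	time.Sleep(time.Second)
}
```

With something like this, every NEW_NODE/REMOVED_NODE handler would just call `debounce()` instead of mutating the ring directly, so a rolling restart triggers one refresh rather than one per event.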
Hi, in our project we are extremely interested in this fix. Is there any plan to merge this and include it in a new release? If so, approximately when might it become available?
All right, I've merged this PR together with #1669 so that we can cut a bug fix release. It seems the change to update the existing host's IP is needed to fix #915.

The arguments for refreshing the whole ring (with the debouncer) are compelling though, since it will reduce the amount of code that needs to be maintained. @joao-r-reis could you please rebase #1680 and add the debouncer? I think with 8593ee8 included, #1680 should work as a fix to #915 too. We can release those changes in a subsequent release; that way, even if someone encounters an issue with always refreshing the ring, they can fall back to a release that contains only #1682 and #1669.
Tagged v1.3.2 |
Will do, thanks! |
If all nodes in the cluster change their IPs at the same time, the driver was previously unable to ever contact the cluster again; the only solution was to restart the driver. A fallback is added to the control connection `reconnect()` logic so that when no known host is reachable, all hostnames provided in ClusterConfig (the initial contact points) are re-resolved and a control connection is attempted to any of them. If this succeeds, a metadata fetch is issued normally and the whole cluster is discovered with its new IPs.

For the driver to correctly learn new IPs when nodes are accessible indirectly (e.g. through a proxy), that is, by translated address rather than `rpc_address` or `broadcast_address`, the code introduced in apache#1682 was extended to remove and re-add a host also when its translated address changes (even when its internal address stays the same).

As a bonus, a misnamed variable `hostport` is renamed to the more suitable `hostaddr`.
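A minimal sketch of how such a fallback could look, assuming hypothetical types and helpers (`controlConn`, `dial`, `reconnect`) and the default CQL port; this is not the actual gocql control connection code:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Illustrative stand-ins for the driver's internals; not the actual gocql types.
type host struct{ addr string }

type controlConn struct {
	contactPoints []string // hostnames given in ClusterConfig (initial contact points)
	known         []host   // hosts currently known from the ring
}

// dial stands in for opening a control connection to one address.
func (c *controlConn) dial(addr string) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	return conn.Close() // a real driver keeps the connection and fetches metadata
}

// reconnect tries the known hosts first; if none are reachable, it falls back
// to re-resolving the initial contact points as described in the commit message.
func (c *controlConn) reconnect() error {
	for _, h := range c.known {
		if c.dial(h.addr) == nil {
			return nil
		}
	}
	// Fallback: every known IP is dead (e.g. all pods were rescheduled),
	// so re-resolve the hostnames from ClusterConfig and try those.
	for _, cp := range c.contactPoints {
		addrs, err := net.LookupHost(cp)
		if err != nil {
			continue
		}
		for _, ip := range addrs {
			// 9042 is assumed here as the default CQL port.
			if c.dial(net.JoinHostPort(ip, "9042")) == nil {
				// A normal metadata fetch would now rediscover the whole
				// cluster with its new IPs.
				return nil
			}
		}
	}
	return errors.New("no known host or contact point reachable")
}

func main() {
	c := &controlConn{
		contactPoints: []string{"cassandra.example.svc"}, // hypothetical k8s service name
		known:         []host{{addr: "10.0.0.5:9042"}},
	}
	if err := c.reconnect(); err != nil {
		fmt.Println("reconnect failed:", err)
	}
}
```

The key point is that the fallback re-resolves hostnames rather than reusing cached IPs, which is what lets the driver recover after every pod in a StatefulSet has been rescheduled.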
With existing behavior, when a Cassandra node changes its IP address (while keeping the same host ID), the gocql client never utilizes that new IP address. In the event ALL nodes change their IP addresses (over the lifetime of the gocql client), the cluster connection becomes unusable and attempts to query Cassandra are met with `gocql: no hosts available in the pool` errors.

I am experiencing this behavior connecting to a Cassandra cluster running on Kubernetes. When a rolling restart occurs (StatefulSet change or otherwise), many of my long-lived gocql clients become unresponsive. Restarting the affected processes is required to restore Cassandra connectivity.
With these changes in place, when a `NEW_NODE` message is received for an existing host (host ID) with a new IP address, 1) the existing host (`HostInfo` struct) is removed and 2) the new host (`HostInfo` struct) is added.

Personal testing has shown this fix to work (i.e., the client learns each new node's IP address and maintains connectivity to the Cassandra cluster); more thorough testing would be welcome.
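For illustration only, here is a minimal sketch of the remove-and-re-add idea, using hypothetical types (`ring`, `hostInfo`, `handleNewNode`) rather than the actual gocql internals:

```go
package main

import "fmt"

// Illustrative stand-ins for the driver's internals; not the actual gocql types.
type hostInfo struct {
	hostID string
	addr   string
}

type ring struct {
	hostsByID map[string]*hostInfo
}

// handleNewNode sketches the fix: if a NEW_NODE event carries a host ID we
// already know but a different address, drop the stale entry and add the host
// again so that it is tracked under its new IP.
func (r *ring) handleNewNode(ev hostInfo) {
	if old, ok := r.hostsByID[ev.hostID]; ok {
		if old.addr == ev.addr {
			return // same host, same address: nothing to do
		}
		// Same host ID, new IP: remove the stale HostInfo first.
		delete(r.hostsByID, ev.hostID)
		fmt.Printf("host %s moved from %s to %s\n", ev.hostID, old.addr, ev.addr)
	}
	// Add (or re-add) the host with its current address.
	r.hostsByID[ev.hostID] = &hostInfo{hostID: ev.hostID, addr: ev.addr}
}

func main() {
	r := &ring{hostsByID: map[string]*hostInfo{
		"host-1": {hostID: "host-1", addr: "10.0.0.5"},
	}}
	// A NEW_NODE event arrives for the same host ID but a new pod IP.
	r.handleNewNode(hostInfo{hostID: "host-1", addr: "10.0.0.42"})
	fmt.Println(r.hostsByID["host-1"].addr) // 10.0.0.42
}
```

In the real driver, the remove step would presumably also tear down the host's connection pool, and re-adding the host lets a pool be built against the new address.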
This may address #915 and/or #1582.