Fix k8s reconnect #1682

Merged (2 commits) Mar 27, 2023

Conversation

@djschaap (Contributor) commented Mar 20, 2023

With existing behavior, when a Cassandra node changes its IP address (while keeping the same host ID), the gocql client never uses the new IP address. If ALL nodes change their IP addresses over the lifetime of the gocql client, the cluster connection becomes unusable: attempts to query Cassandra fail with `gocql: no hosts available in the pool` errors.

I am experiencing this behavior connecting to a Cassandra cluster running on Kubernetes. When a rolling restart occurs (StatefulSet change or otherwise), many of my long-lived gocql clients will become unresponsive. Restarting the affected processes is required to restore Cassandra connectivity.

With these changes in place, when a NEW_NODE message is received for an existing host (host ID) with a new IP address, 1) the existing host (HostInfo struct) is removed and 2) the new host (HostInfo struct) is added.
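For illustration, a minimal sketch of this remove-then-add behavior, using simplified stand-in types (`HostInfo` here is a toy struct, not gocql's actual internal type, and `handleNewNode` is a hypothetical name):

```go
package main

import (
	"fmt"
	"net"
)

// Simplified stand-ins for gocql's internals, for illustration only.
type HostInfo struct {
	hostID string
	addr   net.IP
}

type ring struct {
	hosts map[string]*HostInfo // keyed by host ID
}

// handleNewNode processes a NEW_NODE event: if the host ID is already known
// but its IP changed, the stale entry is removed before the new one is added.
func (r *ring) handleNewNode(incoming *HostInfo) {
	if existing, ok := r.hosts[incoming.hostID]; ok {
		if existing.addr.Equal(incoming.addr) {
			return // same host, same IP: nothing to do
		}
		delete(r.hosts, incoming.hostID) // 1) remove the existing host
	}
	r.hosts[incoming.hostID] = incoming // 2) add the host with its new IP
}

func main() {
	r := &ring{hosts: map[string]*HostInfo{
		"host-a": {hostID: "host-a", addr: net.ParseIP("10.0.0.1")},
	}}
	// A pod restarts and comes back with the same host ID but a new IP.
	r.handleNewNode(&HostInfo{hostID: "host-a", addr: net.ParseIP("10.0.0.9")})
	fmt.Println(r.hosts["host-a"].addr) // 10.0.0.9
}
```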

Personal testing has shown this fix to work (i.e., the client learns each new node's IP address and maintains connectivity to the Cassandra cluster); more thorough testing would be welcome.

This may address #915 and/or #1582.

@djschaap force-pushed the fix-k8s-reconnect branch from 36d1f2b to 8593ee8 on March 20, 2023 19:34
@djschaap marked this pull request as ready for review March 20, 2023 20:21
@martin-sucha (Contributor)

Hello! Thank you for the pull request.

You have a very good point that refreshRing needs to handle re-adding the hosts, so I'll merge 8593ee8.

As for the handling of events, please see #1680; what do you think about merging that pull request instead?

@lwimmer commented Mar 21, 2023

Thank you!

I've briefly tested this PR and it seems to fix the issue described in #915 for us.

I've also tested #1680 and #1669, which did not fix the issue.

@djschaap (Contributor, Author)

@martin-sucha At first glance, I'm not thrilled with refreshing the ring on every topology change. I share your fears re: thrashing. It also feels bad/inefficient to disregard the actual information in a NEW_NODE message (i.e., knowledge of the host that is actually added) and trigger a full-scale refresh.

OTOH, doing a full ring refresh might help with consistency when the client misses a node add or remove event (if that ever occurs - I can't say).

Just to ask - would either approach interfere or fail when DisableInitialHostLookup or DisableTopologyEvents is true? Are there any race/locking conditions that would be more likely to be triggered by doing a full ring refresh (vs a simple remove/add node)?

TL;DR I prefer my targeted approach to handling topology events, but #1680 looks like it would be sufficient for the scenarios I've considered. I view both as functionally equivalent.
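
For reference, both options mentioned above are fields on gocql's ClusterConfig. A minimal sketch of a client that sets them (the contact-point names are hypothetical; the import path is the pre-Apache one current at the time of this PR):

```go
package main

import "github.com/gocql/gocql"

func main() {
	// Hypothetical Kubernetes service names as contact points.
	cluster := gocql.NewCluster("cassandra-0.cassandra", "cassandra-1.cassandra")
	// Skip the initial host discovery performed at session creation.
	cluster.DisableInitialHostLookup = true
	// Ignore NEW_NODE/REMOVED_NODE pushes on the control connection.
	cluster.Events.DisableTopologyEvents = true

	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()
}
```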

@joao-r-reis (Contributor) commented Mar 21, 2023

I've implemented a full ring refresh in #1680 for topology events because some drivers do it this way and I've not heard of issues arising from that approach, although those drivers generally use a debouncer, which I did not include in #1680 (I was planning to add it in a subsequent PR).
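
For illustration, a debouncer of the kind mentioned here could look like the following minimal sketch (names are hypothetical, not the actual #1680 code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// refreshDebouncer coalesces bursts of topology events into one ring refresh.
type refreshDebouncer struct {
	mu       sync.Mutex
	timer    *time.Timer
	interval time.Duration
	refresh  func() // e.g. a full ring refresh
}

// debounce (re)arms the timer, so a burst of NEW_NODE/REMOVED_NODE events
// results in a single refresh after the burst quiets down.
func (d *refreshDebouncer) debounce() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.interval, d.refresh)
}

func main() {
	d := &refreshDebouncer{
		interval: 100 * time.Millisecond,
		refresh:  func() { fmt.Println("refreshing ring once") },
	}
	// Simulate a rolling restart emitting many topology events in succession.
	for i := 0; i < 10; i++ {
		d.debounce()
	}
	time.Sleep(200 * time.Millisecond) // only one refresh fires
}
```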

The full ring refresh approach avoids issues that could come from odd ordering of events and from potential race conditions. This added consistency is one reason for the approach; another is that it reduces the maintenance burden by avoiding code that might not need to exist. The code that performs a ring refresh on control connection setup/reconnection can be reused for the topology event handlers.

In any case, both PRs fix the events issue, so I can just remove the topology event changes from #1680 if you prefer the approach taken by this PR instead.

@Bolodya1997

Hi, in our project we are extremely interested in this fix.

Is there any plan to merge this and include it in a new release? If so, approximately when might it become available?

@martin-sucha merged commit b8a948f into apache:master on Mar 27, 2023
@martin-sucha (Contributor)

All right, I've merged this PR together with #1669 so that we can cut a bug fix release. It seems the change to update the existing host's IP is needed to fix #915. The arguments for refreshing the whole ring (with the debouncer) are compelling though, since it will reduce the amount of code that needs to be maintained. @joao-r-reis could you please rebase #1680 and add the debouncer? I think that with 8593ee8 included, #1680 should work as a fix for #915 too. We can release those changes in a subsequent release; that way, even if someone encounters an issue with always refreshing the ring, they can fall back to a release that contains only #1682 and #1669.

@martin-sucha (Contributor) commented Mar 27, 2023

Tagged v1.3.2

@djschaap deleted the fix-k8s-reconnect branch March 27, 2023 13:34
@joao-r-reis (Contributor)

> @joao-r-reis could you please rebase #1680 and add the debouncer?

Will do, thanks!

wprzytula added a commit to wprzytula/gocql that referenced this pull request Jun 22, 2023
If all nodes in the cluster change their IPs at one time, the driver used to
be unable to ever contact the cluster again; the only solution was to
restart the driver. A fallback is added to the control connection
`reconnect()` logic so that when no known host is reachable,
all hostnames provided in ClusterConfig (the initial contact points)
are re-resolved and a control connection attempt is made to each of them.
If this succeeds, a metadata fetch is issued normally
and the whole cluster is discovered with its new IPs.

For the cluster to correctly learn new IPs in case nodes are
accessible indirectly (e.g. through a proxy), that is, by a translated
address rather than `rpc_address` or `broadcast_address`, the code
introduced in apache#1682 was extended to remove and re-add a host also when
its translated address changes (even when its internal address stays the
same).

As a bonus, a misnamed variable `hostport` is renamed to a suitable
`hostaddr`.
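
A minimal sketch of the fallback this commit message describes (`resolveContactPoints` is a hypothetical helper, not the actual change):

```go
package main

import (
	"fmt"
	"net"
)

// resolveContactPoints re-resolves the hostnames originally passed in
// ClusterConfig; after a full IP rotation, DNS is the only remaining
// source of truth for reaching the cluster.
func resolveContactPoints(hosts []string) ([]string, error) {
	var addrs []string
	for _, h := range hosts {
		ips, err := net.LookupHost(h)
		if err != nil {
			continue // skip contact points that no longer resolve
		}
		addrs = append(addrs, ips...)
	}
	if len(addrs) == 0 {
		return nil, fmt.Errorf("no initial contact points could be resolved")
	}
	return addrs, nil
}

func main() {
	// Hypothetical headless-service name; each lookup reflects current pod IPs.
	addrs, err := resolveContactPoints([]string{"cassandra.default.svc.cluster.local"})
	fmt.Println(addrs, err)
}
```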