Reresolve hostnames as fallback when all hosts are unreachable #1708

wprzytula · 2023-06-22T16:02:48Z

If all nodes in the cluster changed their IPs at the same time, the driver was previously unable to ever contact the cluster again; the only solution was to restart the driver. This PR adds a fallback to the control connection `reconnect()` logic: when no known host is reachable, all hostnames provided in ClusterConfig (the initial contact points) are reresolved, and the driver attempts to open a control connection to any of them. If this succeeds, a metadata fetch is issued as usual and the whole cluster is discovered with its new IPs.
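The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `reconnect`, `dialFunc`, `lookupFunc`, `staticLookup`, `reachableOnly` and the hostname and addresses are all hypothetical names chosen so the example runs without a cluster or a real DNS server.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// dialFunc and lookupFunc are hypothetical seams standing in for
// "open a control connection" and "resolve a hostname".
type dialFunc func(addr string) error
type lookupFunc func(host string) ([]string, error)

var errNoHostReachable = errors.New("no host reachable")

// tryAny returns the first address for which dial succeeds.
func tryAny(addrs []string, dial dialFunc) (string, bool) {
	for _, a := range addrs {
		if dial(a) == nil {
			return a, true
		}
	}
	return "", false
}

// reconnect tries the already known hosts first. Only when none of them is
// reachable does it reresolve the initial contact points and try those.
func reconnect(known, contactPoints []string, port string, lookup lookupFunc, dial dialFunc) (string, error) {
	if addr, ok := tryAny(known, dial); ok {
		return addr, nil
	}
	for _, host := range contactPoints {
		ips, err := lookup(host)
		if err != nil {
			continue // this name did not resolve; try the next contact point
		}
		resolved := make([]string, 0, len(ips))
		for _, ip := range ips {
			resolved = append(resolved, net.JoinHostPort(ip, port))
		}
		if addr, ok := tryAny(resolved, dial); ok {
			return addr, nil
		}
	}
	return "", errNoHostReachable
}

// staticLookup is a test double for DNS backed by a fixed table.
func staticLookup(table map[string][]string) lookupFunc {
	return func(host string) ([]string, error) {
		if ips, ok := table[host]; ok {
			return ips, nil
		}
		return nil, errors.New("no such host: " + host)
	}
}

// reachableOnly returns a dialFunc that succeeds only for the listed addresses.
func reachableOnly(addrs ...string) dialFunc {
	up := make(map[string]bool, len(addrs))
	for _, a := range addrs {
		up[a] = true
	}
	return func(addr string) error {
		if up[addr] {
			return nil
		}
		return errors.New("connection refused: " + addr)
	}
}

func main() {
	// The node moved from 10.0.0.1 to 10.0.1.1; DNS already points at the new IP.
	lookup := staticLookup(map[string][]string{"node1.example.internal": {"10.0.1.1"}})
	dial := reachableOnly("10.0.1.1:9042")
	addr, err := reconnect([]string{"10.0.0.1:9042"}, []string{"node1.example.internal"}, "9042", lookup, dial)
	fmt.Println(addr, err)
}
```

In a real program the lookup seam could be backed by the standard library resolver (for example `net.DefaultResolver.LookupHost`, which additionally takes a context); the table-based double is used here only to keep the sketch deterministic.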

For the cluster to correctly learn new IPs when nodes are accessible only indirectly (e.g. through a proxy), that is, by translated address rather than by `rpc_address` or `broadcast_address`, the code introduced in #1682 is extended to remove and re-add a host also when its translated address has changed (even if its internal address stays the same).
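The extended condition can be illustrated with a small sketch. The `hostInfo` struct, `newHost` and `needsReAdd` below are simplified, hypothetical stand-ins and not the driver's actual types; the point is only that the comparison covers both the internal and the translated address.

```go
package main

import (
	"fmt"
	"net"
)

// hostInfo is a hypothetical, simplified host record.
type hostInfo struct {
	hostID         string
	internalAddr   net.IP // e.g. rpc_address or broadcast_address
	translatedAddr net.IP // address the driver actually connects to
}

// newHost builds a hostInfo from textual IP addresses.
func newHost(id, internal, translated string) hostInfo {
	return hostInfo{
		hostID:         id,
		internalAddr:   net.ParseIP(internal),
		translatedAddr: net.ParseIP(translated),
	}
}

// needsReAdd reports whether the host should be removed and re-added after a
// metadata refresh: either its internal address or its translated address changed.
func needsReAdd(old, fresh hostInfo) bool {
	return !old.internalAddr.Equal(fresh.internalAddr) ||
		!old.translatedAddr.Equal(fresh.translatedAddr)
}

func main() {
	old := newHost("h1", "172.16.0.5", "10.0.0.1")
	// Behind a proxy: the internal address is unchanged, only the translated one moved.
	moved := newHost("h1", "172.16.0.5", "10.0.1.1")

	fmt.Println(needsReAdd(old, old))   // false
	fmt.Println(needsReAdd(old, moved)) // true
}
```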

As a bonus, the misnamed variable `hostport` is renamed to the more suitable `hostaddr`.

wprzytula · 2023-06-27T07:41:01Z

@martin-sucha Could you please take a look?

martin-sucha

@wprzytula thank you for the pull request! This is indeed a useful fix. I've added a few comments inline.

wprzytula · 2023-06-27T18:06:57Z

I've addressed comments and retested.

wprzytula · 2023-06-28T12:47:11Z

@martin-sucha Done.

martin-sucha · 2023-06-28T16:47:28Z

Merged. Thank you!

Would you be willing to contribute a test that would ensure that the function does not break in the future?

wprzytula · 2023-06-29T08:59:09Z

I'm afraid that automated testing of DNS changes would take too much effort. The manual tests that I ran were complex; they involved:

- stopping the systemd DNS service,
- running a custom local DNS service that maps hostnames to "old" IPs,
- using a proxy on connections to all nodes, listening on "old" IPs,
- running a crafted test that periodically sends queries,
- breaking connections by stopping the proxies,
- changing DNS rules to resolve to new IPs,
- reestablishing the proxies on new IPs,
- waiting until all pools get populated again,
- asserting that this happens in reasonable time.
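For illustration only, two of the building blocks of such a procedure (a DNS mapping that can be switched from "old" to "new" IPs, and a bounded wait with a final assertion) might look like the sketch below when simulated in-process. It assumes the resolver can be swapped for a fake, which this thread does not claim the driver supports; `fakeDNS`, `waitUntil` and the hostname are hypothetical, and the sketch does not replace the proxy-based manual test.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// fakeDNS is a hypothetical in-memory resolver whose records can change mid-test.
type fakeDNS struct {
	mu      sync.Mutex
	records map[string][]string
}

func newFakeDNS() *fakeDNS {
	return &fakeDNS{records: map[string][]string{}}
}

// Set replaces the addresses a hostname resolves to.
func (d *fakeDNS) Set(host string, ips ...string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.records[host] = ips
}

// Lookup returns a copy of the current addresses for host.
func (d *fakeDNS) Lookup(host string) ([]string, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	ips, ok := d.records[host]
	if !ok {
		return nil, errors.New("no such host: " + host)
	}
	return append([]string(nil), ips...), nil
}

// waitUntil polls cond until it returns true or the timeout elapses.
func waitUntil(timeout, interval time.Duration, cond func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	dns := newFakeDNS()
	dns.Set("node1.example.internal", "10.0.0.1") // "old" IP

	// Simulate the DNS rule change happening a little later.
	go func() {
		time.Sleep(20 * time.Millisecond)
		dns.Set("node1.example.internal", "10.0.1.1") // "new" IP
	}()

	ok := waitUntil(time.Second, 5*time.Millisecond, func() bool {
		ips, err := dns.Lookup("node1.example.internal")
		return err == nil && len(ips) == 1 && ips[0] == "10.0.1.1"
	})
	fmt.Println("new IP observed in time:", ok)
}
```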

martin-sucha reviewed Jun 27, 2023

wprzytula force-pushed the reresolve-dns-upstream branch from 35aaf59 to 25a06a8 Compare June 27, 2023 18:01

wprzytula requested a review from martin-sucha June 27, 2023 18:01

martin-sucha reviewed Jun 28, 2023

wprzytula force-pushed the reresolve-dns-upstream branch from 25a06a8 to b9737dd Compare June 28, 2023 10:27

wprzytula requested a review from martin-sucha June 28, 2023 10:27

martin-sucha merged commit a507dae into apache:master Jun 28, 2023

wprzytula deleted the reresolve-dns-upstream branch June 29, 2023 08:59

martin-sucha mentioned this pull request Jul 11, 2023

Reconnect control using initial hosts #1713

Closed

sylwiaszunejko mentioned this pull request Aug 29, 2023

Reresolve DNS as fallback when all hosts are unreachable scylladb/python-driver#254

Merged
