Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HostResolver behavior on fail #62652

Merged
merged 7 commits into from
May 17, 2024

Conversation

ianton-ru
Copy link
Contributor

@ianton-ru ianton-ru commented Apr 15, 2024

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

HostResolver has each IP address several times.
If remote host has several IPs and by some reason (firewall rules for example) access on some IPs allowed and on others forbidden, than only first record of forbidden IPs marked as failed, and in each try these IPs have a chance to be chosen (and failed again).
Even if fix this, every 120 seconds DNS cache dropped, and IPs can be chosen again.

This fix

  1. Allows only one record per IP in HostResoler
  2. Adds exponential timeouts between reties of failed IPs

Include tests (required builds will be added automatically):

  • Performance tests

@yariks5s yariks5s added the can be tested Allows running workflows for external contributors label Apr 15, 2024
@yariks5s yariks5s self-assigned this Apr 15, 2024
@robot-ch-test-poll robot-ch-test-poll added the pr-bugfix Pull request with bugfix, not backported by default label Apr 15, 2024
@robot-ch-test-poll
Copy link
Contributor

robot-ch-test-poll commented Apr 15, 2024

This is an automated comment for commit 8a24515 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help❌ failure
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ error
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ error
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors❌ failure
Successful checks
Check nameDescriptionStatus
A SyncThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Mergeable CheckChecks if all other necessary checks are successful✅ success
PR CheckThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success

Copy link
Member

@yariks5s yariks5s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you please add tests emulating the situation that will raise the new behaviour?

Comment on lines 163 to 170
while (it != records.end() && it->address == address)
{
it->failed = true;
it->fail_time = now;
if (it->fail_count < RECORD_FAIL_COUNT_LIMIT)
++it->fail_count;
++it;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the point of creating while cycle here? We assume that we may have the same addresses there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you a right, with single record for address this cycle useless, removed.

@alexey-milovidov alexey-milovidov removed the pr-bugfix Pull request with bugfix, not backported by default label Apr 16, 2024
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-performance Pull request with some performance improvements label Apr 19, 2024
@ianton-ru
Copy link
Contributor Author

Also, can you please add tests emulating the situation that will raise the new behaviour?

Did not managed to write good test. Result of fix can be viewed after long work - reduced count of timeouts. With short test it's too random - in selectBest random IP choosed, so it can be alive IP.

Comment on lines 230 to 233
if (merged.empty() || merged.back().address != *it_next)
merged.push_back(Record(*it_next, now));
else
merged.back().resolve_time = now;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand that lines.
It is a case when new address discovered. Why do you do additional check here if (merged.empty() || merged.back().address != *it_next)? This condition is always true here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my host getaddrinfo returns 3 record per each address, for example

(gdb) frame
#0  Poco::Net::DNS::hostByName (hostname=..., hintFlags=<optimized out>) at ./build-clang-17/./base/poco/Net/src/DNS.cpp:80
80		if (rc == 0)
(gdb) l
75		struct addrinfo* pAI;
76		struct addrinfo hints;
77		std::memset(&hints, 0, sizeof(hints));
78		hints.ai_flags = hintFlags;
79		int rc = getaddrinfo(hostname.c_str(), NULL, &hints, &pAI);
80		if (rc == 0)
81		{
82			HostEntry result(pAI);
83			freeaddrinfo(pAI);
84			return result;
(gdb) p *pAI
$14 = {ai_flags = 32, ai_family = 2, ai_socktype = 1, ai_protocol = 6, ai_addrlen = 16, ai_addr = 0x76bbf2316170, ai_canonname = 0x0, ai_next = 0x76bbf2316180}
(gdb) p *pAI->ai_next
$15 = {ai_flags = 32, ai_family = 2, ai_socktype = 2, ai_protocol = 17, ai_addrlen = 16, ai_addr = 0x76bbf23161b0, ai_canonname = 0x0, ai_next = 0x76bbf23161c0}
(gdb) p *pAI->ai_next->ai_next
$16 = {ai_flags = 32, ai_family = 2, ai_socktype = 3, ai_protocol = 0, ai_addrlen = 16, ai_addr = 0x76bbf23161f0, ai_canonname = 0x0, ai_next = 0x0}
(gdb) p *(sockaddr_in*)pAI->ai_addr
$19 = {sin_family = 2, sin_port = 0, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}
(gdb) p *(sockaddr_in*)pAI->ai_next->ai_addr
$20 = {sin_family = 2, sin_port = 0, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}
(gdb) p *(sockaddr_in*)pAI->ai_next->ai_next->ai_addr
$21 = {sin_family = 2, sin_port = 0, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}

(16777343 ==> '127.0.0.1')
so each IP has three copies in next_gen parameter in HostResolver::updateImpl method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, randomizing in HostResolver does not respect address priority from /etc/gai.conf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not saw any code in Clickhouse which cares about ip order, so I did not supported it here either. if you wish you could try to add it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so each IP has three copies in next_gen parameter in HostResolver::updateImpl method.

I thought that DNSResolver eliminating duplicates. But if it does not, than some one should do it. Lets do it in HostResolver because it expects unique records right now. Thanks for that info.

Copy link
Member

@CheSema CheSema Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW you reminded me that it should be like that here https://github.com/ClickHouse/ClickHouse/blob/f75447bf11e76e09a291b3a8793f3e953bf9e279/src/Common/HostResolvePool.cpp#L50

- , resolve_function([](const String & host_to_resolve) { return DNSResolver::instance().resolveHostAll(host_to_resolve); })
+ , resolve_function([](const String & host_to_resolve) { return DNSResolver::instance().resolveHostAllInOriginOrder(host_to_resolve); })  

Could you fix that as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second random in choosing IPs by weights, so ordering here does not fix it.

@@ -237,7 +244,7 @@ void HostResolver::updateImpl(Poco::Timestamp now, std::vector<Poco::Net::IPAddr
}

for (auto & rec : merged)
if (rec.failed && rec.fail_time < last_effective_resolve)
if (rec.failed && rec.fail_time < now - Poco::Timespan(history.totalSeconds() * (1ull << (rec.fail_count - 1)), 0))
Copy link
Member

@CheSema CheSema Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What that expression means? now - Poco::Timespan(history.totalSeconds() * (1ull << (rec.fail_count - 1))
Make a variable or function, the name will give us a hint what it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved in separate method HostResolver::Record::cleanTimeoutedFailedFlag

@@ -141,6 +142,7 @@ class HostResolver : public std::enable_shared_from_this<HostResolver>
size_t usage = 0;
bool failed = false;
Poco::Timestamp fail_time = 0;
size_t fail_count = 0;
Copy link
Member

@CheSema CheSema Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to count failtures that you do not need this at all bool failed = false;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint_8 fail_count = 0;
+
static_assert that (1 << sizeof (fail_count)) <= RECORD_FAIL_COUNT_LIMIT, or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It;s a different. failed flag cleaned from time to time (HostResolverPool.cpp, line 248), and without fail_count it is cleaned every 2 minutes (DEFAULT_RESOLVE_TIME_HISTORY_SECONDS). It is possible that in some time IP can be accessible again. failed set to true to check address again, but fail_count is not set to zero to make pauses between checks in 2, 4, 8, ..., 62 minutes. fail_count set to zero only after success check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that too after some time ) Tnx.
May be it is better name for fail_count as consecutive_fail_count. You make bigger penalty according consecutive failures.

@CheSema
Copy link
Member

CheSema commented Apr 30, 2024

Actually idea what I see in the code is good. I like it.
But the description did not help me to understand that idea fast. Please write what you did in the PR description more detailed.

It is possible to write good tests here. See src/Common/tests/gtest_resolve_pool.cpp. There is ResolvePoolMock which takes result of resolving from any lambda.

@CheSema
Copy link
Member

CheSema commented Apr 30, 2024

Without tests it wont be merged. Please try to do them. It is better to check your good ideas with some tests than on real clusters.

Usually IP is not bad completely. It just overloaded and connections are failed to it. So there should not be big penalty just for occasional fails. If IP is really expired that it has to be removed from dns and clickhouse forgets it as well.
You are trying to improve detection of the bad IP in order to understand if there is some IP which is failed every connect. Such IP has to be ban more severe for each consecutive fails.
You purposely do not reset fail_count if all IP are marked as faulty and we have to reset their status. It is good. But comments and tests are required. This is why you need separate bool failed = false;.

@CheSema
Copy link
Member

CheSema commented Apr 30, 2024

Also, can you please add tests emulating the situation that will raise the new behaviour?

Did not managed to write good test. Result of fix can be viewed after long work - reduced count of timeouts. With short test it's too random - in selectBest random IP choosed, so it can be alive IP.

history time could be set as a small value in tests.

@ianton-ru
Copy link
Contributor Author

Also, can you please add tests emulating the situation that will raise the new behaviour?

Did not managed to write good test. Result of fix can be viewed after long work - reduced count of timeouts. With short test it's too random - in selectBest random IP choosed, so it can be alive IP.

history time could be set as a small value in tests.

It's a DEFAULT_RESOLVE_TIME_HISTORY_SECONDS and can't be changed with some setting now.

@CheSema
Copy link
Member

CheSema commented Apr 30, 2024

Also, can you please add tests emulating the situation that will raise the new behaviour?

Did not managed to write good test. Result of fix can be viewed after long work - reduced count of timeouts. With short test it's too random - in selectBest random IP choosed, so it can be alive IP.

history time could be set as a small value in tests.

It's a DEFAULT_RESOLVE_TIME_HISTORY_SECONDS and can't be changed with some setting now.

You do not have to make a test for HostResolversPool. You need a test for HostResolver.
It could be created with special agrs. Check this:
https://github.com/ClickHouse/ClickHouse/blob/f75447bf11e76e09a291b3a8793f3e953bf9e279/src/Common/HostResolvePool.cpp#L55-L57

@ianton-ru
Copy link
Contributor Author

Tried to add test for this.

@ianton-ru
Copy link
Contributor Author

@CheSema Is any chance to review tests on next week?

@CheSema CheSema force-pushed the fix-host-resolver-fail branch from a17cf27 to 21d0aeb Compare May 13, 2024 18:07
/// there are could be duplicates in next_gen vector
if (merged.empty() || merged.back().address != *it_next)
{
CurrentMetrics::add(metrics.active_count, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we do not update metrics for duplicates.

@@ -237,10 +254,22 @@ void HostResolver::updateImpl(Poco::Timestamp now, std::vector<Poco::Net::IPAddr
}

for (auto & rec : merged)
if (rec.failed && rec.fail_time < last_effective_resolve)
rec.failed = false;
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I adjust new counter banned_count. class Rec is unaware about metrics, as a result that code does not belong Rec's method.

@@ -149,6 +152,11 @@ class HostResolver : public std::enable_shared_from_this<HostResolver>
return address < r.address;
}

bool operator ==(const Record & r) const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs for is_unuque check under chassert

@@ -166,6 +174,28 @@ class HostResolver : public std::enable_shared_from_this<HostResolver>
return 8;
return 10;
}

bool setFail(const Poco::Timestamp & now)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return true if status has chenged. Needs for adjusting metrics.

@@ -188,7 +219,7 @@ class HostResolver : public std::enable_shared_from_this<HostResolver>

std::mutex mutex;

Poco::Timestamp last_resolve_time TSA_GUARDED_BY(mutex);
Poco::Timestamp last_resolve_time TSA_GUARDED_BY(mutex) = Poco::Timestamp::TIMEVAL_MIN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to be sure that HostResolver::update is called in c-tor even if history is 0.


addresses = std::set<String>{"127.0.0.1", "127.0.0.2", "127.0.0.3"};
addresses = std::multiset<String>{"127.0.0.1", "127.0.0.2", "127.0.0.3"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for testing duplicates in resolve result

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks May 16, 2024
@CheSema CheSema added this pull request to the merge queue May 16, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks May 16, 2024
@CheSema CheSema added this pull request to the merge queue May 17, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks May 17, 2024
@yariks5s
Copy link
Member

yariks5s commented May 17, 2024

image

this is failed due to skipped black check in the CI:
image

@maxknv
Copy link
Member

maxknv commented May 17, 2024

@Felixoid ci in pull request skipped black check for a new .py file

@Felixoid Felixoid added this pull request to the merge queue May 17, 2024
Merged via the queue into ClickHouse:master with commit 38787c4 May 17, 2024
137 of 208 checks passed
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 17, 2024
@ianton-ru
Copy link
Contributor Author

@CheSema Is any chance to backport to 24.3 LTS?

@CheSema
Copy link
Member

CheSema commented May 21, 2024

@CheSema Is any chance to backport to 24.3 LTS?

This is new feature. Normally we do not back port such changes. I will check if there are conflicts.

@Algunenano
Copy link
Member

This is new feature. Normally we do not back port such changes. I will check if there are conflicts.

Please don't backport features. This would introduce unnecessary risk on minor upgrades for stable releases, where we should only backport important bug fixes.

@ianton-ru
Copy link
Contributor Author

HostResolver was changed only in 24.3, so backports to 24.2 and 23.8 are not required.

@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label May 21, 2024
@davenger
Copy link
Member

davenger commented Aug 3, 2024

#67705

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
can be tested Allows running workflows for external contributors pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-backports-created-cloud pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo v24.3-must-backport
Projects
None yet
Development

Successfully merging this pull request may close these issues.