
feat: implement exponential random retry strategy #225

Merged
rhajek merged 13 commits into master from feat/random_retry on Apr 29, 2021

Conversation

@rhajek (Contributor) commented Apr 15, 2021

Proposed Changes

This PR changes the default retry strategy to Full Jitter.

Related discussion is influxdata/influxdb#19722 (comment)

Original retry formula:
retry_interval * exponential_base^(attempts-1) + random(jitter_interval)

Proposed exponential random retry formula:

The retry delay is calculated as a random value within the interval
[retry_interval * exponential_base^(attempts-1), retry_interval * exponential_base^(attempts)]

Example for retry_interval=5_000, exponential_base=2, max_retry_delay=125_000:

Retry delays are randomly distributed values within the ranges of
[5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000]
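For illustration, here is a minimal Python sketch of this formula (the helper name and signature are illustrative, not the library's API):

import random

def exponential_random_delay(attempt, retry_interval=5_000, exponential_base=2,
                             max_retry_delay=125_000):
    """Draw the delay for a 1-based attempt uniformly from
    [retry_interval * base^(attempt-1), retry_interval * base^attempt],
    capped at max_retry_delay."""
    low = retry_interval * exponential_base ** (attempt - 1)
    high = min(retry_interval * exponential_base ** attempt, max_retry_delay)
    return random.uniform(low, high)

# attempts 1..5 with the example parameters fall into the ranges listed above
delays = [exponential_random_delay(attempt) for attempt in range(1, 6)]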

Checklist

  • CHANGELOG.md updated
  • Rebased/mergeable
  • A test has been added if appropriate
  • pytest tests completes successfully
  • Commit messages are in semantic format
  • Sign CLA (if not already signed)

@bednar (Contributor) left a comment

The following jitter_interval property should be removed:

jitter_interval=self._write_options.jitter_interval / 1_000,

@rhajek rhajek requested review from sranka and alespour April 15, 2021 12:28
@rhajek rhajek requested a review from vlastahajek April 15, 2021 12:31
@@ -58,16 +54,10 @@ def get_backoff_time(self):
         if consecutive_errors_len < 0:
             return 0

-        backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) + self._jitter_delay()
+        # Full Jitter strategy
+        backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self._random()
@sranka commented Apr 15, 2021

I propose to change the implementation to compute the next retry delay this way:

from random import random

def nextDelay(attempt, options):
    """Compute the next retry delay; attempt=1 means called for the first time."""
    delay_range = options.first_retry_range
    i = 1
    while i < attempt:
        i += 1
        delay_range = delay_range * options.exponential_base
        if delay_range > options.max_retry_delay:
            break
    # at least min_retry_delay
    delay = options.min_retry_delay + (delay_range - options.min_retry_delay) * random()
    # at most max_retry_delay
    delay = min(options.max_retry_delay, delay)
    return delay

Additionally, the implementation must ensure that the request is not scheduled for retries after options.max_retry_time has elapsed (max_request_time if possible).

options.max_retry_time can be the only meaningful configurable value from the library user's POV. Setting it to 0 disables retries.

These could be the defaults:

options.first_retry_range = 5 seconds
options.exponential_base = 2
options.max_retry_delay = 125 seconds
options.min_retry_delay = 1 second
options.max_retry_time = 180 seconds
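For illustration, those defaults could be plugged into the Python version of nextDelay above like this (the options container is an ad-hoc namespace used only for this sketch):

from types import SimpleNamespace

options = SimpleNamespace(
    first_retry_range=5,   # seconds
    exponential_base=2,
    max_retry_delay=125,   # seconds
    min_retry_delay=1,     # seconds
    max_retry_time=180,    # seconds; 0 disables retry
)

# delays for the first five attempts, each at least min_retry_delay and at most max_retry_delay
delays = [nextDelay(attempt, options) for attempt in range(1, 6)]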

Contributor

This delay function does not guarantee that the delay is increasing. If the generated random value is much smaller than in the previous attempt, the resulting delay is also smaller. I have no smart proposal for how to fix this at the moment though :(

@alespour (Contributor) commented Apr 19, 2021

I propose this simple modification to the above algorithm to ensure that the delay values are increasing, and increasing enough.

+ def randomArbitrary(min, max):
+     return random() * (max - min) + min
...
- delay = options.min_retry_delay + (delay_range - options.min_retry_delay) * random()  # at least min_retry_delay
+ delay = options.min_retry_delay + (delay_range - options.min_retry_delay) * options.random  # at least min_retry_delay
...
+ options.random = randomArbitrary(0.5, 1.0)

Or similarly in the PR, like:

+ self.random = randomArbitrary(0.5, 1.0)
...
- backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self._random()
+ backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self.random
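To make the effect of restricting the random factor to 0.5-1.0 concrete, here is a small self-contained sketch (illustrative only, not code from this PR): with the multiplier bounded below by 0.5 and exponential_base=2, the worst-case next delay already equals the best-case previous delay, so the sequence of delays never decreases.

import random

def random_arbitrary(low, high):
    return random.random() * (high - low) + low

def backoff(backoff_factor, exponential_base, consecutive_errors_len):
    return backoff_factor * (exponential_base ** consecutive_errors_len) * random_arbitrary(0.5, 1.0)

delays = [backoff(5_000, 2, n) for n in range(5)]
# holds whenever exponential_base >= 2, because 2^n * 0.5 >= 2^(n-1) * 1.0
assert all(later >= earlier for earlier, later in zip(delays, delays[1:]))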

@rhajek (Contributor, Author) commented Apr 20, 2021

@alespour (Contributor) left a comment

Anyway, whether in the algorithm in this PR or in the one proposed by sranka, I think the random value needs to be "controlled" somehow, in order to avoid the delay dropping when the generated random value is much smaller than in the previous retry. See #225 (comment)


    def increment(self, method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None):
        """Return a new Retry object with incremented retry counters."""
        if self.retry_timeout < datetime.now():
@sranka commented Apr 20, 2021

Does it also react the same way when retry is disabled? (max_retry_time is 0)

@rhajek (Contributor, Author):

Yes, max_retry_time=0 means retry is disabled; here is the test:

def test_retry_disabled_max_retry_time(self):
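For context, a minimal sketch of what such a test can look like; it assumes that WritesRetry accepts max_retry_time and that increment() raises urllib3's MaxRetryError once the retry window has elapsed, and it is not a copy of the actual test body:

import unittest

from urllib3.exceptions import MaxRetryError

from influxdb_client.client.write.retry import WritesRetry

class RetryDisabledTest(unittest.TestCase):
    def test_retry_disabled_max_retry_time(self):
        # max_retry_time=0 means the retry window has already elapsed
        retry = WritesRetry(max_retry_time=0)
        # so the first increment() refuses to schedule another attempt
        with self.assertRaises(MaxRetryError):
            retry.increment()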

@codecov bot commented Apr 20, 2021

Codecov Report

Merging #225 (5b16095) into master (4015c90) will increase coverage by 1.10%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #225      +/-   ##
==========================================
+ Coverage   89.96%   91.06%   +1.10%     
==========================================
  Files          26       26              
  Lines        2003     2038      +35     
==========================================
+ Hits         1802     1856      +54     
+ Misses        201      182      -19     
Impacted Files Coverage Δ
influxdb_client/client/write/retry.py 100.00% <100.00%> (ø)
influxdb_client/client/write_api.py 99.06% <100.00%> (+5.75%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

        while i <= consecutive_errors_len:
            i += 1
            delay_range = delay_range * self.exponential_base
            if delay_range > self.max_retry_delay:
@sranka commented Apr 21, 2021

@alespour had a good point that the delays should be increasing (on average). This condition makes that hard to happen, since the delay range stays the same after a fixed number of attempts and the delays then oscillate randomly around self.max_retry_delay/2. This can be fixed by restricting the delay range with a large number:

Suggested change
-        if delay_range > self.max_retry_delay:
+        if delay_range > 100_000_000:
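An illustrative simulation of this point (an approximation of the loop above combined with the delay formula proposed earlier in this thread, not the PR's exact code): with the cap at max_retry_delay the range stops growing after a few attempts and the delays keep oscillating, whereas a much larger cap lets the ranges grow until the final min() against max_retry_delay takes over, so later delays no longer drop.

import random

def delay(attempt, cap, retry_interval=5_000, exponential_base=2,
          min_retry_delay=1_000, max_retry_delay=125_000):
    delay_range = retry_interval
    i = 1
    while i <= attempt:
        i += 1
        delay_range = delay_range * exponential_base
        if delay_range > cap:
            break
    value = min_retry_delay + (delay_range - min_retry_delay) * random.random()
    return min(max_retry_delay, value)

oscillating = [delay(n, cap=125_000) for n in range(1, 10)]      # hovers around max_retry_delay / 2
saturating = [delay(n, cap=100_000_000) for n in range(1, 10)]   # grows and then stays near max_retry_delay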

@rhajek rhajek changed the title feat: implement full jitter retry strategy feat: implement exponential random retry strategy Apr 26, 2021
@codecov-commenter commented Apr 26, 2021

Codecov Report

Merging #225 (cd08898) into master (4015c90) will increase coverage by 1.10%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #225      +/-   ##
==========================================
+ Coverage   89.96%   91.06%   +1.10%     
==========================================
  Files          26       26              
  Lines        2003     2037      +34     
==========================================
+ Hits         1802     1855      +53     
+ Misses        201      182      -19     
Impacted Files Coverage Δ
influxdb_client/client/write/retry.py 100.00% <100.00%> (ø)
influxdb_client/client/write_api.py 99.05% <100.00%> (+5.75%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

README.rst Outdated
* - **max_retries**
- the number of max retries when write fails
- ``3``
- ``10``
Contributor

Is this correct?

README.rst Outdated
* - **exponential_base**
- the base for the exponential retry delay, the next delay is computed as ``retry_interval * exponential_base^(attempts-1) + random(jitter_interval)``
- ``5``
- the base for the exponential retry delay, the next delay is computed using random exponential backoff. Example for ``retry_interval=5_000, exponential_base=2, max_retry_delay=125_000, total=5`` Retry delays are randomly distributed values within the ranges of ``[5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000]``
Contributor

Please add a note about the formula used to compute the delay.

README.rst (resolved)
@rhajek rhajek merged commit 6844f60 into master Apr 29, 2021
@rhajek rhajek deleted the feat/random_retry branch April 29, 2021 07:48
@bednar bednar added this to the 1.17.0 milestone Apr 29, 2021