
feat: implement exponential random retry strategy #225

Merged
rhajek merged 13 commits into master from feat/random_retry on Apr 29, 2021

Conversation

@rhajek (Contributor) commented Apr 15, 2021

Proposed Changes

This PR changes the default retry strategy to Full Jitter.

Related discussion is influxdata/influxdb#19722 (comment)

Original retry formula:
retry_interval * exponential_base^(attempts-1) + random(jitter_interval)

Proposed exponential random retry formula:

The retry delay is calculated as a random value within the interval
[retry_interval * exponential_base^(attempts-1), retry_interval * exponential_base^(attempts)]

Example for retry_interval=5_000, exponential_base=2, max_retry_delay=125_000:

Retry delays are randomly distributed values within the ranges of
[5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000]
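For illustration, here is a minimal Python sketch of this formula (the helper name and signature are illustrative, not the library's API):

import random

def exponential_random_delay(attempt, retry_interval=5_000, exponential_base=2,
                             max_retry_delay=125_000):
    """Draw the delay for a 1-based attempt uniformly from
    [retry_interval * base^(attempt-1), retry_interval * base^attempt],
    capped at max_retry_delay."""
    low = retry_interval * exponential_base ** (attempt - 1)
    high = min(retry_interval * exponential_base ** attempt, max_retry_delay)
    return random.uniform(low, high)

# attempts 1..5 with the example parameters fall into the ranges listed above
delays = [exponential_random_delay(attempt) for attempt in range(1, 6)]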

Checklist

  • CHANGELOG.md updated
  • Rebased/mergeable
  • A test has been added if appropriate
  • pytest tests completes successfully
  • Commit messages are in semantic format
  • Sign CLA (if not already signed)

@bednar (Contributor) left a comment

The following jitter_interval property should be removed:

jitter_interval=self._write_options.jitter_interval / 1_000,

@rhajek rhajek requested review from sranka and alespour April 15, 2021 12:28
@rhajek rhajek requested a review from vlastahajek April 15, 2021 12:31
@@ -58,16 +54,10 @@ def get_backoff_time(self):
         if consecutive_errors_len < 0:
             return 0

-        backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) + self._jitter_delay()
+        # Full Jitter strategy
+        backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self._random()
@sranka commented Apr 15, 2021

I propose to change the implementation to compute the next retry delay this way:

from random import random

def nextDelay(attempt, options):
    """Compute the next retry delay; attempt=1 means called for the first time."""
    delay_range = options.first_retry_range
    i = 1
    while i < attempt:
        i += 1
        delay_range = delay_range * options.exponential_base
        if delay_range > options.max_retry_delay:
            break
    # at least min_retry_delay
    delay = options.min_retry_delay + (delay_range - options.min_retry_delay) * random()
    # at most max_retry_delay
    delay = min(options.max_retry_delay, delay)
    return delay

Additionally, the implementation must ensure that the request is not scheduled for retries after options.max_retry_time has elapsed (max_request_time if possible).

options.max_retry_time can be the only meaningful configurable value from the library user's POV. Setting it to 0 disables retries.

These could be the defaults:

options.first_retry_range = 5 seconds
options.exponential_base = 2
options.max_retry_delay = 125 seconds
options.min_retry_delay = 1 second
options.max_retry_time = 180 seconds
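For illustration, those defaults could be plugged into the Python version of nextDelay above like this (the options container is an ad-hoc namespace used only for this sketch):

from types import SimpleNamespace

options = SimpleNamespace(
    first_retry_range=5,   # seconds
    exponential_base=2,
    max_retry_delay=125,   # seconds
    min_retry_delay=1,     # seconds
    max_retry_time=180,    # seconds; 0 disables retry
)

# delays for the first five attempts, each at least min_retry_delay and at most max_retry_delay
delays = [nextDelay(attempt, options) for attempt in range(1, 6)]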

Contributor

This delay function does not guarantee that the delay is increasing. If the generated random value is much smaller than in the previous attempt, the resulting delay is also smaller. I have no smart proposal for how to fix this at the moment though :(

@alespour (Contributor) commented Apr 19, 2021

I propose this simple modification to the above algorithm to ensure that the delay values are increasing, and increasing enough.

+ def randomArbitrary(min, max):
+     return random() * (max - min) + min
...
- delay = options.min_retry_delay + (delay_range - options.min_retry_delay) * random()  # at least min_retry_delay
+ delay = options.min_retry_delay + (delay_range - options.min_retry_delay) * options.random  # at least min_retry_delay
...
+ options.random = randomArbitrary(0.5, 1.0)

Or similarly in the PR, like:

+ self.random = randomArbitrary(0.5, 1.0)
...
- backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self._random()
+ backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self.random
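To make the effect of restricting the random factor to 0.5-1.0 concrete, here is a small self-contained sketch (illustrative only, not code from this PR): with the multiplier bounded below by 0.5 and exponential_base=2, the worst-case next delay already equals the best-case previous delay, so the sequence of delays never decreases.

import random

def random_arbitrary(low, high):
    return random.random() * (high - low) + low

def backoff(backoff_factor, exponential_base, consecutive_errors_len):
    return backoff_factor * (exponential_base ** consecutive_errors_len) * random_arbitrary(0.5, 1.0)

delays = [backoff(5_000, 2, n) for n in range(5)]
# holds whenever exponential_base >= 2, because 2^n * 0.5 >= 2^(n-1) * 1.0
assert all(later >= earlier for earlier, later in zip(delays, delays[1:]))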

@rhajek (Contributor, Author) commented Apr 20, 2021

@alespour (Contributor) left a comment

Anyway, whether in the algorithm in this PR or in the one proposed by sranka, I think the random value needs to be "controlled" somehow, in order to avoid the delay dropping when the generated random value is much smaller than in the previous retry. See #225 (comment)


    def increment(self, method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None):
        """Return a new Retry object with incremented retry counters."""
        if self.retry_timeout < datetime.now():
@sranka commented Apr 20, 2021

Does it also react the same way when retry is disabled? (max_retry_time is 0)

@rhajek (Contributor, Author):

Yes, max_retry_time=0 means retry is disabled; here is the test:

def test_retry_disabled_max_retry_time(self):
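For context, a minimal sketch of what such a test can look like; it assumes that WritesRetry accepts max_retry_time and that increment() raises urllib3's MaxRetryError once the retry window has elapsed, and it is not a copy of the actual test body:

import unittest

from urllib3.exceptions import MaxRetryError

from influxdb_client.client.write.retry import WritesRetry

class RetryDisabledTest(unittest.TestCase):
    def test_retry_disabled_max_retry_time(self):
        # max_retry_time=0 means the retry window has already elapsed
        retry = WritesRetry(max_retry_time=0)
        # so the first increment() refuses to schedule another attempt
        with self.assertRaises(MaxRetryError):
            retry.increment()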

@codecov bot commented Apr 20, 2021

Codecov Report

Merging #225 (5b16095) into master (4015c90) will increase coverage by 1.10%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #225      +/-   ##
==========================================
+ Coverage   89.96%   91.06%   +1.10%     
==========================================
  Files          26       26              
  Lines        2003     2038      +35     
==========================================
+ Hits         1802     1856      +54     
+ Misses        201      182      -19     
Impacted Files Coverage Δ
influxdb_client/client/write/retry.py 100.00% <100.00%> (ø)
influxdb_client/client/write_api.py 99.06% <100.00%> (+5.75%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

        while i <= consecutive_errors_len:
            i += 1
            delay_range = delay_range * self.exponential_base
            if delay_range > self.max_retry_delay:
@sranka commented Apr 21, 2021

@alespour had a good point that the delays should be increasing (on average). This condition makes that hard to happen, since the delay range stays the same after a fixed number of attempts and the delays then oscillate randomly around self.max_retry_delay/2. This can be fixed by restricting the delay range with a large number:

Suggested change
-        if delay_range > self.max_retry_delay:
+        if delay_range > 100_000_000:
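An illustrative simulation of this point (an approximation of the loop above combined with the delay formula proposed earlier in this thread, not the PR's exact code): with the cap at max_retry_delay the range stops growing after a few attempts and the delays keep oscillating, whereas a much larger cap lets the ranges grow until the final min() against max_retry_delay takes over, so later delays no longer drop.

import random

def delay(attempt, cap, retry_interval=5_000, exponential_base=2,
          min_retry_delay=1_000, max_retry_delay=125_000):
    delay_range = retry_interval
    i = 1
    while i <= attempt:
        i += 1
        delay_range = delay_range * exponential_base
        if delay_range > cap:
            break
    value = min_retry_delay + (delay_range - min_retry_delay) * random.random()
    return min(max_retry_delay, value)

oscillating = [delay(n, cap=125_000) for n in range(1, 10)]      # hovers around max_retry_delay / 2
saturating = [delay(n, cap=100_000_000) for n in range(1, 10)]   # grows and then stays near max_retry_delay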

@rhajek rhajek changed the title feat: implement full jitter retry strategy feat: implement exponential random retry strategy Apr 26, 2021
@codecov-commenter commented Apr 26, 2021

Codecov Report

Merging #225 (cd08898) into master (4015c90) will increase coverage by 1.10%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #225      +/-   ##
==========================================
+ Coverage   89.96%   91.06%   +1.10%     
==========================================
  Files          26       26              
  Lines        2003     2037      +34     
==========================================
+ Hits         1802     1855      +53     
+ Misses        201      182      -19     
Impacted Files Coverage Δ
influxdb_client/client/write/retry.py 100.00% <100.00%> (ø)
influxdb_client/client/write_api.py 99.05% <100.00%> (+5.75%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

README.rst Outdated
* - **max_retries**
- the number of max retries when write fails
- ``3``
- ``10``
Contributor

Is this correct?

README.rst Outdated
* - **exponential_base**
- the base for the exponential retry delay, the next delay is computed as ``retry_interval * exponential_base^(attempts-1) + random(jitter_interval)``
- ``5``
- the base for the exponential retry delay, the next delay is computed using random exponential backoff. Example for ``retry_interval=5_000, exponential_base=2, max_retry_delay=125_000, total=5`` Retry delays are randomly distributed values within the ranges of ``[5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000]``
Contributor

Please add a note about the formula used to compute the delay.

README.rst (resolved)
@rhajek rhajek merged commit 6844f60 into master Apr 29, 2021
@rhajek rhajek deleted the feat/random_retry branch April 29, 2021 07:48
@bednar bednar added this to the 1.17.0 milestone Apr 29, 2021