feat: implement exponential random retry strategy #225
Conversation
The following jitter_interval property should be removed:
jitter_interval=self._write_options.jitter_interval / 1_000,
@@ -58,16 +54,10 @@ def get_backoff_time(self):
        if consecutive_errors_len < 0:
            return 0

-       backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) + self._jitter_delay()
+       # Full Jitter strategy
+       backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self._random()
I propose to change the implementation to compute the next retry delay this way:
def nextDelay(attempt /* 1 means called for the first time */, options):
    range = options.first_retry_range
    i = 1
    while i < attempt:
        i++
        range = range * options.exponential_base
        if range > options.max_retry_delay:
            break
    delay = options.min_retry_delay + (range - options.min_retry_delay) * random() /* at least min_retry_delay */
    delay = min(options.max_retry_delay, delay) /* at most max_retry_delay */
    return delay
Additionally, the implementation must ensure that the request is not scheduled for retries after options.max_retry_time has elapsed (max_request_time if possible). options.max_retry_time can be the only meaningful configurable value from the library user's point of view. Setting it to 0 disables retry.
These could be the defaults (a runnable sketch using them follows below):
options.first_retry_range = 5 seconds
options.exponential_base = 2
options.max_retry_delay = 125 seconds
options.min_retry_delay = 1 second
options.max_retry_time = 180 seconds
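A minimal runnable Python sketch of this proposal, assuming seconds as the unit and using illustrative names (RetryOptions, next_delay) that are not part of the client API:

import random
from dataclasses import dataclass

@dataclass
class RetryOptions:
    # Proposed defaults from the comment above (values in seconds).
    first_retry_range: float = 5
    exponential_base: float = 2
    max_retry_delay: float = 125
    min_retry_delay: float = 1
    max_retry_time: float = 180

def next_delay(attempt: int, options: RetryOptions) -> float:
    """Delay before the given attempt; attempt == 1 means called for the first time."""
    delay_range = options.first_retry_range
    i = 1
    while i < attempt:
        i += 1
        delay_range = delay_range * options.exponential_base
        if delay_range > options.max_retry_delay:
            break
    # at least min_retry_delay ...
    delay = options.min_retry_delay + (delay_range - options.min_retry_delay) * random.random()
    # ... and at most max_retry_delay
    return min(options.max_retry_delay, delay)

With these defaults, attempt 1 draws from roughly [1, 5] seconds, attempt 2 from roughly [1, 10] seconds, and later attempts saturate at 125 seconds.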
This delay function does not guarantee that the delay is increasing. If the generated random value is a lot smaller than in the previous attempt, the resulting delay is also smaller. I have no smart proposal for how to fix this at the moment though :(
I propose this simple modification to the above algorithm to ensure that the delay values increase, and increase enough.
+ def randomArbitrary(min, max) {
+ return random() * (max - min) + min;
+ }
...
- delay = options.min_retry_delay + (range - options.min_retry_delay) * random() /* at least min_retry_delay */
+ delay = options.min_retry_delay + (range - options.min_retry_delay) * options.random /* at least min_retry_delay */
...
+ options.random = randomArbitrary(0.5, 1.0)
Or similarly, in the PR:
+ self.random = randomArbitrary(0.5, 1.0)
...
- backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self._random()
+ backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self.random
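A short Python sketch of the same idea, assuming a helper named random_arbitrary (illustrative, not part of the library) that restricts the random factor to [0.5, 1.0) so successive delays keep growing on average:

import random

def random_arbitrary(low, high):
    # Uniform random value in [low, high).
    return random.random() * (high - low) + low

def bounded_backoff(backoff_factor, exponential_base, consecutive_errors_len):
    # The random factor never drops below 0.5, so each delay stays above
    # half of its exponential ceiling and the sequence grows on average.
    return backoff_factor * (exponential_base ** consecutive_errors_len) * random_arbitrary(0.5, 1.0)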
Anyway, in either the algorithm in this PR or the one proposed by sranka, I think the random value needs to be "controlled" somehow, in order to avoid the delay dropping when the generated random value is much smaller than in the previous retry. See #225 (comment).
def increment(self, method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None):
    """Return a new Retry object with incremented retry counters."""
    if self.retry_timeout < datetime.now():
Does it also react the same way when retry is disabled? (max_retry_time is 0)
Yes, max_retry_time=0 means retry is disabled; here is the test:
def test_retry_disabled_max_retry_time(self):
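A condensed sketch (not the full WritesRetry implementation) of why max_retry_time=0 disables retries, assuming retry_timeout is initialized to the current time plus max_retry_time:

from datetime import datetime, timedelta

class RetrySketch:
    def __init__(self, max_retry_time=180_000):
        # Deadline after which no further retries are scheduled (milliseconds).
        # With max_retry_time=0 the deadline is "now", so the very first
        # increment() call already exceeds it and the request is not retried.
        self.retry_timeout = datetime.now() + timedelta(milliseconds=max_retry_time)

    def increment(self, *args, **kwargs):
        if self.retry_timeout < datetime.now():
            raise RuntimeError("max retry timeout exceeded, giving up")
        # ... otherwise compute the next backoff and return a new retry object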
Codecov Report
@@ Coverage Diff @@
## master #225 +/- ##
==========================================
+ Coverage 89.96% 91.06% +1.10%
==========================================
Files 26 26
Lines 2003 2038 +35
==========================================
+ Hits 1802 1856 +54
+ Misses 201 182 -19
Continue to review full report at Codecov.
while i <= consecutive_errors_len:
    i += 1
    delay_range = delay_range * self.exponential_base
    if delay_range > self.max_retry_delay:
@alespour had a good point that the delays should be increasing (on average). This condition makes that hard to achieve, since the delay range stays the same after a fixed number of attempts and the delays then oscillate randomly around self.max_retry_delay/2. This can be fixed by restricting the delay range to a large number instead:
-        if delay_range > self.max_retry_delay:
+        if delay_range > 100_000_000:
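A toy Python model of this loop (milliseconds, illustrative names, not the PR's exact code): with a small range_cap the drawn delays stop growing and just fluctuate below max_retry_delay, while with a very large range_cap the random value is almost always clipped to max_retry_delay, so the delays keep increasing on average before saturating:

import random

def toy_delay(attempt, retry_interval=5_000, exponential_base=2,
              max_retry_delay=125_000, range_cap=100_000_000):
    # Grow the delay range exponentially, stopping once it exceeds range_cap.
    delay_range = retry_interval
    i = 1
    while i < attempt:
        i += 1
        delay_range = delay_range * exponential_base
        if delay_range > range_cap:
            break
    # Draw the delay at random from the range and clip it to max_retry_delay.
    return min(max_retry_delay, delay_range * random.random())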
Codecov Report
@@ Coverage Diff @@
## master #225 +/- ##
==========================================
+ Coverage 89.96% 91.06% +1.10%
==========================================
Files 26 26
Lines 2003 2037 +34
==========================================
+ Hits 1802 1855 +53
+ Misses 201 182 -19
Continue to review full report at Codecov.
README.rst
Outdated
* - **max_retries**
  - the number of max retries when write fails
  - ``3``
  - ``10``
Is this correct?
README.rst
Outdated
* - **exponential_base**
  - the base for the exponential retry delay, the next delay is computed as ``retry_interval * exponential_base^(attempts-1) + random(jitter_interval)``
  - ``5``
  - the base for the exponential retry delay, the next delay is computed using random exponential backoff. Example for ``retry_interval=5_000, exponential_base=2, max_retry_delay=125_000, total=5``: retry delays are randomly distributed values within the ranges of ``[5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000]``
Please add a note on what the formula to compute the delay looks like.
Proposed Changes
This PR changes the default retry strategy to Full Jitter.
Related discussion is influxdata/influxdb#19722 (comment)
Original retry formula:
retry_interval * exponential_base^(attempts-1) + random(jitter_interval)
Proposed exponential random retry formula:
The retry delay is calculated as a random value within the interval
[retry_interval * exponential_base^(attempts-1), retry_interval * exponential_base^(attempts)]
Example for retry_interval=5_000, exponential_base=2, max_retry_delay=125_000:
Retry delays are randomly distributed values within the ranges of
[5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000]
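A few lines of Python (values from the example above; variable names are illustrative) that print the interval each attempt draws from:

retry_interval = 5_000
exponential_base = 2
max_retry_delay = 125_000
total = 5

for attempt in range(1, total + 1):
    low = retry_interval * exponential_base ** (attempt - 1)
    high = min(max_retry_delay, retry_interval * exponential_base ** attempt)
    # Prints 5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000
    print(f"attempt {attempt}: random delay within [{low}, {high}]")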
Checklist
- pytest tests completes successfully