
Ratelimiter polish / fix: improve zero -> nonzero filling behavior for new ratelimiters #6280

Merged: 2 commits into cadence-workflow:master on Sep 12, 2024

Conversation

@Groxx (Member) commented Sep 12, 2024

Motivation:

The global ratelimiter system was exhibiting some weird request-rejection at very low RPS usage.
On our dashboards it looks like this:
[screenshot: dashboard graph of request rejections, 2024-09-11]

Previously I thought this was just due to undesirably-low weights, and #6238 addressed that (and is still a useful addition).

After that was rolled out, behavior improved, but small numbers of rejections still occurred... which should not have happened, because the "boosting" logic should have meant that the global limits were at least identical to the local ones, and likely larger.

That drove me to re-read the details and think harder. And then I found this PR's issue.

Issue and fix

What was happening is that the initial `rate.NewLimiter(0, 0)` state was "leaking" into limits after the first update, so a request that occurred immediately afterward would likely be rejected, regardless of the configured limit.

This happens because (0, 0) creates a zero-burst limit on the "primary" limiter, and the shadowed .Allow() calls were advancing the limiter's internal "now" value...
... and then when the limit and burst were increased, the limiter would have to fill from zero.

This put it in a worse position than local / fallback limiters, which start from (local, local) with a zero "now" value, and then the next .Allow() is basically guaranteed to fill the token bucket due to many years "elapsing".
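To make the fill-from-zero mechanic concrete, here is a minimal, self-contained sketch using `golang.org/x/time/rate` directly (cadence wraps this limiter in `common/clock/ratelimiter.go`, so this illustrates the underlying library's behavior, not our exact code path):

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// The "primary" limiter starts with no limit and no burst.
	primary := rate.NewLimiter(0, 0)

	// Shadowed startup calls are always rejected, since burst is 0.
	primary.Allow()

	// The first update arrives: raise the limit and burst.  The
	// library's setters advance the limiter's internal time and keep
	// the current (zero) token count, so the bucket must now refill
	// from empty in real time.
	primary.SetLimit(100)
	primary.SetBurst(100)

	// A request immediately after the update is rejected, despite the
	// generous new limit.
	fmt.Println(primary.Allow()) // false

	// A limiter *constructed* with the same values starts with a full
	// bucket, which is how the local / fallback limiters behave.
	fresh := rate.NewLimiter(100, 100)
	fmt.Println(fresh.Allow()) // true
}
```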

So the fix has two parts:

1: Avoid advancing the zero-valued limiter's internal time until a reasonable limit/burst has been set.
This is done by simply not calling the primary limiter's `.Allow()` while in startup mode.

2: Avoid advancing limiters' time when setting limit and burst.
This means that after an idle period -> Update() -> Allow(), tokens will fill as if the new values were set all along, and the setters can be called in any order.

The underlying rate.Limiter does not do the second: it advances time when setting these values, which seems undesirable.
It means old values are preferred (which is reasonable: they were set while that time passed), and it means that the order in which you set burst and limit has a significant impact on the outcome, even with the same values and the same timing. Time passes only on the first call, so the second sees essentially zero elapsed time and has no immediate effect at all (unless it lowers burst). I can only see that latter part as surprising, and definitely worth avoiding.

Alternative approach

2 seems worth keeping. But 1 has a relatively clear alternative:
Don't create the "primary" limiter until the first Update().

Because the current code is atomic-oriented, this can't be done safely without adding atomics or locks everywhere... so I didn't do that.
If I were to do this, I would just switch to a mutex; `rate.Limiter` already uses one internally, so it should be near zero cost.
I'm happy to build that if someone prefers, I just didn't bother this time.
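For concreteness, a rough sketch of that mutex-based lazy-creation alternative (type and field names are illustrative, not cadence's actual code):

```go
package sketch

import (
	"sync"

	"golang.org/x/time/rate"
)

type fallbackLimiter struct {
	mu      sync.Mutex
	primary *rate.Limiter // nil until the first Update()
	local   *rate.Limiter // the local / fallback limiter
}

func (f *fallbackLimiter) Update(limit rate.Limit, burst int) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.primary == nil {
		// Constructed with real values, so it starts with a full
		// bucket, just like a fresh local limiter.
		f.primary = rate.NewLimiter(limit, burst)
		return
	}
	f.primary.SetLimit(limit)
	f.primary.SetBurst(burst)
}

func (f *fallbackLimiter) Allow() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.primary == nil {
		// Still in startup mode: defer to the fallback limiter and
		// never touch (or even create) the primary.
		return f.local.Allow()
	}
	return f.primary.Allow()
}
```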

@davidporter-id-au (Member) left a comment:

Ty for the comments, they're certainly helpful


codecov bot commented Sep 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.10%. Comparing base (e5bd91e) to head (b054418).
Report is 1 commit behind head on master.

Additional details and impacted files:

| Files with missing lines | Coverage | Δ |
|---|---|---|
| common/clock/ratelimiter.go | 100.00% <100.00%> | ø |
| ...mmon/quotas/global/collection/internal/fallback.go | 96.66% <100.00%> | +0.17% ⬆️ |

... and 5 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Groxx Groxx enabled auto-merge (squash) September 12, 2024 02:47
@Groxx Groxx merged commit 04add2d into cadence-workflow:master Sep 12, 2024
20 checks passed
@Groxx Groxx deleted the limiter-polish branch September 12, 2024 22:54