catchup: Dynamic parallel catchup #5802
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5802 +/- ##
==========================================
- Coverage 55.61% 55.57% -0.05%
==========================================
Files 475 475
Lines 66869 66880 +11
==========================================
- Hits 37189 37166 -23
- Misses 27161 27192 +31
- Partials 2519 2522 +3
... and 11 files with indirect coverage changes
Could you clarify why the duration measurement works? I expected fetchers to exit immediately if there is no such block, or to return the block if it is available in the remote ledger.
I still think a new REST or WS endpoint streaming blocks would work better.
I'm not sure what that has to do with this PR.
As I understand it, the purpose of this PR is to reduce excess requests for not-yet-existing blocks from follower nodes that have already caught up. I propose an alternative approach to the same problem.
Unfortunately that's not a solution, because Conduit uses the sync API.
This PR actually reduces the number of requests made by a follower once it has caught up.
Need to remove the debugging log.
Looks good. I think we should cleanly get to a resolution on the conversation around line 47 of service.go.
Using the download duration and a 1s threshold seems like a weird hack that affects all nodes, not just follower nodes, and is very arbitrary (why 1s?). @algorandskiy shouldn't the right solution to requesting blocks too far into the future involve smarter polling, or introduce additional calls to WaitMem(), like how fetchAndWrite waits for lookbackComplete?
I think a better solution would incorporate feedback from peers about what their latest round was, like: https://github.com/cce/go-algorand/tree/blockNotAvailableErrMsg-latest
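
For reference, a minimal sketch of the WaitMem-style gating suggested in the comment above, assuming a channel-returning wait like the lookbackComplete pattern in fetchAndWrite; the signatures here are illustrative placeholders, not the actual ledger API:

```go
package catchup

import "context"

// fetchGated sketches gating a fetch on local ledger progress: before
// requesting round r from peers, wait until round r-1 has been written
// locally, the way fetchAndWrite waits on lookbackComplete. waitMem stands
// in for a WaitMem-like call returning a channel that is closed once the
// given round is available locally.
func fetchGated(ctx context.Context, r uint64,
	waitMem func(rnd uint64) chan struct{},
	fetch func(ctx context.Context, rnd uint64) error) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-waitMem(r - 1):
		// The previous round is in the ledger, so round r plausibly
		// exists on peers; issue the request now.
	}
	return fetch(ctx, r)
}
```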
@algorandskiy is working on this approach — smarter polling that introduces pauses in between retries when asking for blocks from the future, in #5809 — I think it would make this PR unnecessary. The problem this PR and #5809 are both trying to solve is aggressive polling once you get close to latest.
These are two different cases: 1) extra requests due to parallel fetching, which this PR addresses, and 2) extra requests due to retries on errors, which my PR addresses.
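
For illustration, a rough sketch of the "pauses between retries" behavior discussed for the second case (the actual change lives in #5809; the function and its parameters here are assumptions, not that PR's code):

```go
package catchup

import (
	"context"
	"time"
)

// fetchWithBackoff retries a block fetch with an increasing pause between
// attempts, instead of immediately re-requesting a round that peers do not
// have yet. initial and max bound the backoff delay.
func fetchWithBackoff(ctx context.Context, fetch func(context.Context) error,
	initial, max time.Duration) error {
	delay := initial
	for {
		if err := fetch(ctx); err == nil {
			return nil
		}
		// The round is probably not available yet (we are near the tip):
		// wait before retrying rather than hammering peers.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		if delay *= 2; delay > max {
			delay = max
		}
	}
}
```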
Assuming this has been tested and works without causing the catchup pipeline to fail when continuously following the latest round, this could be a workaround that prevents cancelling the pipeline unnecessarily. But ideally we would replace it later with better polling and backoff, perhaps incorporating information from #5819, which tells you the latest round of the node you're requesting from.
It would be a little nicer to limit the scope of this change only to follower nodes, since regular nodes don't need to worry about following the tip of the chain using catchup.
Summary
This PR adds rate limiting to the block download service. It downloads two blocks and measures the download duration: if they were fetched in less than the round time, it allows full parallelism; otherwise it stays rate limited, since the node is likely at the latest round.
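
A minimal sketch of that heuristic, with injected placeholders (fetchBlock, roundTime, minParallel, maxParallel) rather than the PR's actual identifiers:

```go
package catchup

import (
	"context"
	"time"
)

// decideParallelism probes two consecutive rounds and times the downloads.
// If both arrive in under one round time, the node is behind and can fetch
// with full parallelism; otherwise it is likely at the tip and should stay
// rate limited to avoid requesting rounds that do not exist yet.
func decideParallelism(ctx context.Context,
	fetchBlock func(ctx context.Context, rnd uint64) error,
	next uint64, roundTime time.Duration, minParallel, maxParallel int) (int, error) {
	start := time.Now()
	for i := uint64(0); i < 2; i++ {
		if err := fetchBlock(ctx, next+i); err != nil {
			return minParallel, err
		}
	}
	if time.Since(start) < roundTime {
		return maxParallel, nil
	}
	return minParallel, nil
}
```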
Test Plan
So far I've tested manually with the print statement in the code. I'm not sure how to unit test this.
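
One possible way to unit test this: inject the fetch function and round time so the durations are controlled, then assert the chosen parallelism. This reuses the hypothetical decideParallelism sketch from the summary above, not the PR's actual code:

```go
package catchup

import (
	"context"
	"testing"
	"time"
)

func TestDecideParallelism(t *testing.T) {
	// Slow fetches: two 20ms downloads exceed the 10ms "round time",
	// so the heuristic should stay rate limited.
	slow := func(ctx context.Context, _ uint64) error {
		time.Sleep(20 * time.Millisecond)
		return nil
	}
	if p, err := decideParallelism(context.Background(), slow, 1, 10*time.Millisecond, 1, 16); err != nil || p != 1 {
		t.Fatalf("expected parallelism 1, got %d (err %v)", p, err)
	}

	// Instant fetches complete well under a one-second round time,
	// so full parallelism should be allowed.
	fast := func(ctx context.Context, _ uint64) error { return nil }
	if p, err := decideParallelism(context.Background(), fast, 1, time.Second, 1, 16); err != nil || p != 16 {
		t.Fatalf("expected parallelism 16, got %d (err %v)", p, err)
	}
}
```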