Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GrandPa warp sync times out on github CI #1562

Closed
lexnv opened this issue Jan 8, 2024 · 6 comments · Fixed by paritytech/subxt#1372 or #1591
Closed

GrandPa warp sync times out on github CI #1562

lexnv opened this issue Jan 8, 2024 · 6 comments · Fixed by paritytech/subxt#1372 or #1591

Comments

@lexnv
Copy link
Contributor

lexnv commented Jan 8, 2024

We have recently encountered a few subxt github CI timeouts with the light client.
GrandPa warp sync seems to be idle at the same block until the CI times out our testing.
This is happening in a flaky manner and could be reproduced by paritytech/subxt#1346.

Smoldot version is 0.15 and smoldot-light is 0.13 (from Cargo.toml).
The test connects the smoldot client to the polkadot network from a freshly downloaded chainSpec; then performs a few subscriptions and method requests.

Logs reported by the subxt test:

2024-01-08T13:22:48.947771Z  WARN json-rpc-polkadot: The JSON-RPC client has just called a JSON-RPC function from the legacy JSON-RPC API (chain_getFinalizedHead). Legacy JSON-RPC functions have loose semantics and cannot be properly implemented on a light client. You are encouraged to use the new JSON-RPC API <https://github.com/paritytech/json-rpc-interface-spec/> instead. The legacy JSON-RPC API functions will be deprecated and removed in the distant future.    
...

2024-01-08T13:22:58.947182Z  WARN sync-service-polkadot: GrandPa warp sync idle at block #18952945 (0x0xfde5…4433)    
...

2024-01-08T13:25:38.948843Z  WARN sync-service-polkadot: GrandPa warp sync idle at block #18952945 (0x0xfde5…4433)    
...

2024-01-08T13:37:58.956476Z  WARN sync-service-polkadot: GrandPa warp sync idle at block #18952945 (0x0xfde5…4433)    
 Error: The operation was canceled.

Logs extracted from: https://github.com/paritytech/subxt/actions/runs/7447975422/job/20261334692?pr=1346

Let me know if I could provide more details here, thanks! 🙏

@tomaka
Copy link
Contributor

tomaka commented Jan 8, 2024

You've set sync-service-polkadot=info in the logging settings for some reason. This unfortunately hides the log entries that would be useful for debugging this.

@lexnv
Copy link
Contributor Author

lexnv commented Jan 8, 2024

I was able to reproduce it with syn-service-polkadot=trace with two entries 🙏

@tomaka
Copy link
Contributor

tomaka commented Jan 8, 2024

Thanks!

What happens is that when we connect to peers, their finalized block is equal to the block of the checkpoint in the chain spec.
For this reason, smoldot doesn't start any warp sync request, as it's already at the head of the chain.
In principle it still needs to perform the rest of the warp sync process, as otherwise it can't sync further block, but it just doesn't.

So while this is obviously not the correct fix, adding a sleep(6 seconds) before starting smoldot would bypass the problem.

In principle, the nodes we're connected to should at some point finalize another block and send a commit message to smoldot, which would make smoldot start the warp syncing, but for some reason they don't and I'm not sure why.

@tomaka
Copy link
Contributor

tomaka commented Jan 8, 2024

I vaguely remember fixing a > that should be a >= in the warp syncing code in a recent PR, but I can't find it.
It's possible that this issue is already fixed in the more recent version of smoldot.

@lexnv
Copy link
Contributor Author

lexnv commented Jan 11, 2024

Hey, thanks for looking into this!

I managed to reproduce this in a new PR which uses the git-dependency smoldot from the latest branch with this commit. Logs are available at: https://github.com/paritytech/subxt/actions/runs/7488555076/job/20383286368?pr=1355

2024-01-11T12:16:58.851770Z  WARN sync-service-polkadot: GrandPa warp sync idle at block #18995151 (0x0x000f…8d36) 
 ...
2024-01-11T12:29:38.861440Z  WARN sync-service-polkadot: GrandPa warp sync idle at block #18995151 (0x0x000f…8d36) 
...
2024-01-11T12:31:48.863079Z  WARN sync-service-polkadot: GrandPa warp sync idle at block #18995151 (0x0x000f…8d36) 

Let me know if I could provide more details, thanks 🙏

@tomaka
Copy link
Contributor

tomaka commented Jan 11, 2024

In principle it still needs to perform the rest of the warp sync process, as otherwise it can't sync further block, but it just doesn't.

This was actually not true.

The checkpoint contains everything needed to start syncing properly, so there's no need to do any warp sync at all.

The issue is actually that the warp sync code is basically in the middle of a refactoring.
The warp sync code is now designed to pause itself when there's nothing to warp sync to (like in this case), however the code on top of it still waits for the initial warp sync to succeed before it starts syncing blocks on top.

This issue would be fixed by #1109, as it basically consists in finishing that refactoring, but that issue is non-trivial. It's not clear to me from the top of my head whether a temporary intermediary fix is easy or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants