
smoldot-light: GrandPa warp sync idle at block #592

Closed
lexnv opened this issue May 22, 2023 · 19 comments

@lexnv
Contributor

lexnv commented May 22, 2023

Hi, I'm trying to integrate smoldot-light in paritytech/subxt#965. Locally on my Mac M1 it works perfectly, but on AMD x86 the light client seems stuck on:

2023-05-22T10:31:36.572347Z  WARN sync-service-dev: GrandPa warp sync idle at block #0 (0x0xaee0…6e8a)
2023-05-22T10:31:46.573234Z  WARN sync-service-dev: GrandPa warp sync idle at block #0 (0x0xaee0…6e8a)

I'm starting my polkadot nodes the same way, and I'm using the same chain spec:

./polkadot --dev --tmp --alice --validator

spec:
  "bootNodes": [
      "/ip4/127.0.0.1/tcp/30333/p2p/12D3KooWKGrBgNaxndgc61G4P9UKnJSgCQjqmb6jRfEg6kxirD86"
  ],

I'm using polkadot version cbd5330173 on x86, compared to 6dc9e84dde2 on the Mac.

Do you know what might cause this warning?

@tomaka
Contributor

tomaka commented May 24, 2023

Please run smoldot with debug logging in order to get more information about what happens.
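
One way to do that, assuming the subxt example that appears later in this thread (the exact log targets used are shown in a later comment):

RUST_LOG=smoldot=debug cargo run --example tx_basic_light_client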

My immediate suspicion is that the PeerId of the chain spec is wrong because you're running with --tmp, or something like that.

@lexnv
Contributor Author

lexnv commented May 24, 2023

Hi,

I've managed to get this working by doing the following:

  • Started polkadot with:
./polkadot-cbd5330173 --dev --node-key 0000000000000000000000000000000000000000000000000000000000000001  --alice --validator

This produces the same genesis hash and the same node identity.

  • Fetched the spec
curl -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "sync_state_genSyncSpec", "params":[true]}' http://localhost:9944/ | jq .result > res.spec
  • Modified the spec (see the sketch below):
    • Removed the lightSyncState entry
    • Added the bootnode address
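
A minimal sketch of those two tweaks, assuming the res.spec file produced by the curl command above; the multiaddr shown is the one from the spec snippet earlier in this issue and should be replaced with the one your node prints, and the res-light.spec output name is just an example:

jq 'del(.lightSyncState)
    | .bootNodes = ["/ip4/127.0.0.1/tcp/30333/p2p/12D3KooWKGrBgNaxndgc61G4P9UKnJSgCQjqmb6jRfEg6kxirD86"]' \
  res.spec > res-light.spec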

If I leave the lightSyncState in, smoldot panics at the following:

https://github.com/smol-dot/smoldot/blob/main/lib/src/chain_spec.rs#L445-L446

Would it be possible that the "freshly" spawned local node does not have 2 epochs yet?

Is there a way to avoid this but still benefit from the lightSyncState? I think that the warp syncing is a bit slow because I don't provide this (around 2~5 minutes).

@tomaka
Contributor

tomaka commented May 24, 2023

I think that the warp syncing is a bit slow because I don't provide this (around 2~5 minutes)

The warp sync time is proportional to the length of the chain. Every epoch (4096 blocks, I believe), there's one signature per validator to download and verify. When the chain is several million blocks long, this adds up and causes slowness. But for a local chain that has just started, it really isn't supposed to take more than a few dozen milliseconds.
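
A rough back-of-the-envelope illustration of that cost (the block count and validator count below are illustrative assumptions, not figures from this thread):

# ~16,000,000 blocks at 4096 blocks per epoch is roughly 3,906 epochs;
# with ~300 validators, that is one signature per validator per epoch:
echo $(( 16000000 / 4096 * 300 ))   # ≈ 1.17 million signatures to download and verify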

@tomaka
Contributor

tomaka commented May 30, 2023

I'm a bit confused about whether you (still) have a problem. If you need help with the warp syncing being stuck, I would need the debug logs of smoldot.

@lexnv
Contributor Author

lexnv commented May 30, 2023

Hi, I managed to get this working; however, connecting to the local node seems to be slower than connecting to the Polkadot chain.

I have attached the logs here:
https://gist.github.com/lexnv/f2e6f6fb920971b9664bee26810a6a20

2023-05-30T15:36:08.955641Z  INFO smoldot: Chain initialization complete for dev.

2023-05-30T15:41:13.019192Z  INFO runtime-dev: Successfully compiled runtime. Spec version: 9410. Size of `:code`: 1.4 MiB.

// TX submitted now
2023-05-30T15:41:13.769082Z DEBUG json-rpc-dev: PendingRequestsQueue <= {"jsonrpc":"2.0","id":"5", "method":"author_submitAndWatchExtrinsic","params":["0x2d028400d43593c715…


// TX included in chain
2023-05-30T15:41:42.700032Z DEBUG json-rpc-dev: JSON-RPC <= {"jsonrpc":"2.0","id":"7","result":"0x2400000000000000e2e8793a55170200000001000000000003c8ac2097c6f9…
Balance transfer success: Transfer { from: AccountId32([212, 53, 147, 199, 21, 253, 211, 28, 97, 20, 26, 189, 4, 169, 159, 214, 130, 44, 133, 88, 133, 76, 205, 227, 154, 86, 132, 231, 165, 109, 1 ..

Is there anything we could do for the local node (spawned as above) to make things a bit faster?

@tomaka
Contributor

tomaka commented May 30, 2023

Is your CPU by any chance extremely busy constantly up until the transaction has been submitted? Or does it have idle times?

@lexnv
Contributor Author

lexnv commented May 30, 2023

The CPU consumption is at 100% for one CPU, while the others are idle the whole time. Verified with htop, it seems that the light client is the process consuming that CPU.

@tomaka
Contributor

tomaka commented May 31, 2023

Another question is: did you truncate the logs between 15:36:53.446268Z and 15:38:39.646368Z?
If so, I do need the actual logs to understand what happens.

@tomaka
Contributor

tomaka commented May 31, 2023

#645 and #644 add a lot of logs which could be useful for understanding the problem.

@lexnv
Contributor Author

lexnv commented May 31, 2023

Hey, I did enable tracing for my example run; however, the transaction gets dropped after a while, resulting in some 271 MB of logs. I expect this is related to the large number of logs.

I ended up running RUST_LOG="smoldot=debug,network=debug,runtime-dev=debug,connections=debug,json-rpc-dev=debug" cargo run --example tx_basic_light_client for the provided logs.

Will retry the experiment fetching those two PRs, thanks! 🙏

@lexnv
Contributor Author

lexnv commented May 31, 2023

I've uploaded the full logs to my drive.

  • smoldot-31-panic.logs:
thread 'async-std/runtime' panicked at 'explicit panic', /home/lexnv/workspace/smoldot/lib/src/sync/all.rs:1791:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  • smoldot-31-2.logs

The tx was submitted successfully in around 5 minutes.

@tomaka
Contributor

tomaka commented May 31, 2023

It seems that the sync service is completely frozen for two minutes.
The freeze happens right before or during the compilation of the runtime (by wasmtime), so one possibility is that wasmtime takes 2 whole minutes to compile the runtime.

At the same time, 99.9% of the log lines come from wasmtime and cranelift. If you remove all the wasmtime and cranelift lines from the 92.2 MiB log file, there's only 87 kiB left.

So my hypothesis is that wasmtime indeed takes 2 whole minutes to compile the runtime, just because it spends 99% of its CPU time writing logs.
I've opened #649 to confirm whether this is true.

Please try again with wasmtime=warn,cranelift=warn for the log options.
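
One way to combine that with the log options used earlier in this thread (the example name and the other directives are the ones from the command above; adjust them to your setup):

RUST_LOG="smoldot=debug,network=debug,runtime-dev=debug,connections=debug,json-rpc-dev=debug,wasmtime=warn,cranelift=warn" cargo run --example tx_basic_light_client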

@lexnv
Contributor Author

lexnv commented Jun 1, 2023

I've uploaded another file to the drive, smoldot-1-06.logs. After fetching the latest origin, the logs indicate that the GrandPa warp sync is in progress for about 10 minutes.

@tomaka
Contributor

tomaka commented Jun 1, 2023

What happens is: smoldot connects, warp syncs, downloads the runtime, compiles it (which takes 221 seconds!), then tries to download other things from the peer, and the peer disconnects smoldot in return. Then smoldot reconnects, and the entire thing repeats.

I don't know why Substrate disconnects smoldot. I would need the sub-libp2p and/or sync logs (of Substrate) to figure it out. Given that I'm no longer working on the networking of Substrate, I'm not sure I'll be able to help here.

However, my main hypothesis is still that your CPU is completely overloaded for some reason. Are you maybe compiling in debug mode? If so, you really should try opt-level = 2.
No matter what I do in smoldot, if wasmtime takes 221 seconds to compile the runtime the warp sync won't take less than 221 seconds.
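
A sketch of that suggestion, assuming the subxt example mentioned earlier is built with cargo (whether the profile tweak goes in the crate's or the workspace's Cargo.toml depends on your setup):

# Either run the example in release mode...
cargo run --release --example tx_basic_light_client
# ...or keep debug builds but have them optimized, by adding to Cargo.toml:
#   [profile.dev]
#   opt-level = 2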

@tomaka
Contributor

tomaka commented Jun 9, 2023

It is now possible to no longer use wasmtime and instead always use wasmi (#695), so you could try doing that.
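
A sketch of one way to try that, based on the "default-features disabled" wording in the next comment (the exact feature flags involved are not spelled out in this thread):

cargo add smoldot-light --no-default-features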

@lexnv
Contributor Author

lexnv commented Jun 13, 2023

The light-client with default-features disabled seems to take a bit longer than before:

2023-06-13T14:43:33.413835Z TRACE light-client: Create light client

2023-06-13T14:53:08.541603Z TRACE async_io::driver: block_on()
2023-06-13T14:53:08.541681Z TRACE light-client: Submit request "{\"jsonrpc\":\"2.0\",\"id\":\"1\", \"method\":\"chain_getBlockHash\",\"params\":[0]}"
2023-06-13T14:53:08.541714Z TRACE light-client-background: Received register message Request { id: "1", sender: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: false, is_tx_task_set: false } }) } }
2023-06-13T14:53:08.541727Z DEBUG json-rpc-polkadot: PendingRequestsQueue <= {"jsonrpc":"2.0","id":"1", "method":"chain_getBlockHash","params":[0]}
2023-06-13T14:53:08.542210Z TRACE polling::epoll: add: epoll_fd=6, fd=8, ev=Event { key: 18446744073709551615, readable: false, writable: false }
2023-06-13T14:53:08.542227Z TRACE polling::epoll: add: epoll_fd=6, fd=7, ev=Event { key: 18446744073709551615, readable: true, writable: false }
2023-06-13T14:53:08.542234Z TRACE polling::epoll: new: epoll_fd=6, event_fd=7, timer_fd=Some(8)
2023-06-13T14:53:08.542289Z TRACE polling: Poller::notify()
2023-06-13T14:53:08.542296Z TRACE polling::epoll: notify: epoll_fd=6, event_fd=7
2023-06-13T14:53:08.542317Z TRACE async_io::driver: main_loop: waiting on I/O
2023-06-13T14:53:08.542343Z TRACE async_io::reactor: process_timers: 0 ready wakers
2023-06-13T14:53:08.542352Z TRACE polling: Poller::wait(_, Some(4.999827661s))
2023-06-13T14:53:08.542360Z TRACE polling::epoll: wait: epoll_fd=6, timeout=Some(4.999827661s)
2023-06-13T14:53:08.542367Z TRACE polling::epoll: modify: epoll_fd=6, fd=8, ev=Event { key: 18446744073709551615, readable: true, writable: false }
2023-06-13T14:53:08.542375Z TRACE polling::epoll: new events: epoll_fd=6, res=1
2023-06-13T14:53:08.542381Z TRACE polling::epoll: modify: epoll_fd=6, fd=7, ev=Event { key: 18446744073709551615, readable: true, writable: false }
2023-06-13T14:53:08.542390Z TRACE async_io::reactor: process_timers: 0 ready wakers
2023-06-13T14:53:08.542395Z TRACE async_io::reactor: react: 0 ready wakers
2023-06-13T14:53:08.542403Z TRACE async_io::driver: main_loop: sleeping for 50 us
2023-06-13T14:53:08.542423Z TRACE polling: Poller::notify()
2023-06-13T14:53:08.542432Z TRACE polling::epoll: notify: epoll_fd=6, event_fd=7
2023-06-13T14:53:08.542515Z TRACE async_io::driver: main_loop: waiting on I/O
2023-06-13T14:53:08.542523Z TRACE async_io::reactor: process_timers: 0 ready wakers
2023-06-13T14:53:08.542529Z TRACE polling: Poller::wait(_, Some(4.999642489s))
2023-06-13T14:53:08.542534Z TRACE polling::epoll: wait: epoll_fd=6, timeout=Some(4.999642489s)
2023-06-13T14:53:08.542540Z TRACE polling::epoll: modify: epoll_fd=6, fd=8, ev=Event { key: 18446744073709551615, readable: true, writable: false }
..
2023-06-13T14:53:15.623893Z  INFO sync-service-polkadot: GrandPa warp sync finished to #15951164 (0xc1e5…7054)
..
2023-06-13T14:53:15.624945Z  INFO runtime-polkadot: Finalized block runtime ready. Spec version: 9370. Size of `:code`: 1.3 MiB.

There is a ~10 minute gap in the logs; might it be related to #661? 🤔

@tomaka
Contributor

tomaka commented Jun 13, 2023

might it be related to #661? 🤔

#661 concerns something completely outside of the light client.

@tomaka
Contributor

tomaka commented Jun 16, 2023

I don't really have any hypothesis except that you're reaching the computation limits of your machine.
While smoldot alone should be able to run on any machine just fine, if you run smoldot plus other nodes at the same time it's not completely surprising for things to be slow.

@lexnv
Contributor Author

lexnv commented Jun 20, 2023

It seems to work just fine on the ARM Mac M1; I'll close this for now, as it might be a limitation of AMD / my box.

Thanks for the help in debugging this and improving the code!
