Use dynamic aura slot duration in lookahead collator #3211

Merged
merged 10 commits into from
Feb 13, 2024


s0me0ne-unkn0wn
Contributor

This is a follow-up to #2949. It enables the lookahead collator to dynamically adjust to the aura slot duration, which may change during a runtime upgrade. It also addresses a couple of issues with time constants that I missed in the original PR.

Good news: it works. The parachain successfully switches from sync backing with 12s slots to async backing with 6s slots.

Bad news: during the transitional period of a single block in which the actual runtime upgrade is performed, it still gets the old slot duration of 12s (as it gets it from the best block), resulting in a runtime panic (logs follow). That doesn't affect the following block production of the parachain. Ideas on how to improve the situation are appreciated.

2024-02-05 12:59:36.373  INFO tokio-runtime-worker sc_basic_authorship::basic_authorship: [Parachain] 🙌 Starting consensus session on top of parent 0x6fd2d8f904f12c22531bfabf77b16dc84a6a29e45d9ae358aa6547fbf3f0438b    
2024-02-05 12:59:36.373 ERROR tokio-runtime-worker runtime: [Parachain] panicked at /home/s0me0ne/wrk/parity/polkadot-sdk/cumulus/pallets/aura-ext/src/consensus_hook.rs:69:9:
assertion `left == right` failed: slot number mismatch
  left: Slot(142261198)
 right: Slot(284522396)    
2024-02-05 12:59:36.373  WARN tokio-runtime-worker sp_state_machine::overlayed_changes::changeset: [Parachain] 1 storage transactions are left open by the runtime. Those will be rolled back.    
2024-02-05 12:59:36.373  WARN tokio-runtime-worker sp_state_machine::overlayed_changes::changeset: [Parachain] 1 storage transactions are left open by the runtime. Those will be rolled back.    
2024-02-05 12:59:36.373  WARN tokio-runtime-worker basic-authorship: [Parachain] ❗ Inherent extrinsic returned unexpected error: Error at calling runtime api: Execution failed: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed
WASM backtrace:
error while executing at wasm backtrace:
    0: 0x4e4a3b - <unknown>!rust_begin_unwind
    1: 0x46cf57 - <unknown>!core::panicking::panic_fmt::h3c280dba88683724
    2: 0x46d238 - <unknown>!core::panicking::assert_failed_inner::hebac5970933beb4d
    3: 0x3d00fc - <unknown>!core::panicking::assert_failed::h640a47e2fb5dfb4b
    4: 0xd0db3 - <unknown>!frame_support::storage::transactional::with_transaction::hcbc31515f81b2ee1
    5: 0x34d654 - <unknown>!<cumulus_pallet_parachain_system::pallet::Call<T> as frame_support::traits::dispatch::UnfilteredDispatchable>::dispatch_bypass_filter::{{closure}}::hb7c2c9a11fa88301
    6: 0x3547db - <unknown>!environmental::local_key::LocalKey<T>::with::h783f2605ae27d6d3
    7: 0x7f454 - <unknown>!<asset_hub_rococo_runtime::RuntimeCall as frame_support::traits::dispatch::UnfilteredDispatchable>::dispatch_bypass_filter::h5e11a01ab97c06c7
    8: 0x7f237 - <unknown>!<asset_hub_rococo_runtime::RuntimeCall as sp_runtime::traits::Dispatchable>::dispatch::h7f8ae4a8fede71ca
    9: 0x26a0f3 - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::apply_extrinsic::h75e524ff34738391
   10: 0x282211 - <unknown>!BlockBuilder_apply_extrinsic. Dropping.    
2024-02-05 12:59:36.374 ERROR tokio-runtime-worker runtime: [Parachain] panicked at /home/s0me0ne/wrk/parity/polkadot-sdk/substrate/frame/aura/src/lib.rs:416:9:
assertion `left == right` failed: Timestamp slot must match `CurrentSlot`
  left: Slot(142261198)
 right: Slot(284522396)    
2024-02-05 12:59:36.374  WARN tokio-runtime-worker sp_state_machine::overlayed_changes::changeset: [Parachain] 1 storage transactions are left open by the runtime. Those will be rolled back.    
2024-02-05 12:59:36.374  WARN tokio-runtime-worker sp_state_machine::overlayed_changes::changeset: [Parachain] 1 storage transactions are left open by the runtime. Those will be rolled back.    
2024-02-05 12:59:36.374  WARN tokio-runtime-worker basic-authorship: [Parachain] ❗ Inherent extrinsic returned unexpected error: Error at calling runtime api: Execution failed: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed
WASM backtrace:
error while executing at wasm backtrace:
    0: 0x4e4a3b - <unknown>!rust_begin_unwind
    1: 0x46cf57 - <unknown>!core::panicking::panic_fmt::h3c280dba88683724
    2: 0x46d238 - <unknown>!core::panicking::assert_failed_inner::hebac5970933beb4d
    3: 0x3d00fc - <unknown>!core::panicking::assert_failed::h640a47e2fb5dfb4b
    4: 0x9ece6 - <unknown>!frame_support::storage::transactional::with_transaction::h26f75cb9f9462088
    5: 0x356d7e - <unknown>!environmental::local_key::LocalKey<T>::with::hbcf2d4e17b48fdb5
    6: 0x7f507 - <unknown>!<asset_hub_rococo_runtime::RuntimeCall as frame_support::traits::dispatch::UnfilteredDispatchable>::dispatch_bypass_filter::h5e11a01ab97c06c7
    7: 0x7f237 - <unknown>!<asset_hub_rococo_runtime::RuntimeCall as sp_runtime::traits::Dispatchable>::dispatch::h7f8ae4a8fede71ca
    8: 0x26a0f3 - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::apply_extrinsic::h75e524ff34738391
    9: 0x282211 - <unknown>!BlockBuilder_apply_extrinsic. Dropping.    
2024-02-05 12:59:36.374 DEBUG tokio-runtime-worker runtime::xcmp-queue-migration: [Parachain] Lazy migration finished: item gone    
2024-02-05 12:59:36.374 ERROR tokio-runtime-worker runtime: [Parachain] panicked at /home/s0me0ne/wrk/parity/polkadot-sdk/cumulus/pallets/parachain-system/src/lib.rs:265:18:
set_validation_data inherent needs to be present in every block!    
2024-02-05 12:59:36.374 ERROR tokio-runtime-worker aura::cumulus: [Parachain] err=Error { inner: Proposing

Caused by:
    0: Error at calling runtime api: Execution failed: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed
       WASM backtrace:
       error while executing at wasm backtrace:
           0: 0x4e4a3b - <unknown>!rust_begin_unwind
           1: 0x46cf57 - <unknown>!core::panicking::panic_fmt::h3c280dba88683724
           2: 0x46da8b - <unknown>!core::option::expect_failed::hdf18d99c3adabca7
           3: 0x2134cb - <unknown>!<cumulus_pallet_parachain_system::pallet::Pallet<T> as frame_support::traits::hooks::OnFinalize<<<<T as frame_system::pallet::Config>::Block as sp_runtime::traits::HeaderProvider>::HeaderT as sp_runtime::traits::Header>::Number>>::on_finalize::hf98aac39802896ba
           4: 0x26a9d6 - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::idle_and_finalize_hook::h32775c0df0749d92
           5: 0x26ad9f - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::finalize_block::h15e5a1a6b9ca8032
           6: 0x2822bd - <unknown>!BlockBuilder_finalize_block
    1: Execution failed: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed
       WASM backtrace:
       error while executing at wasm backtrace:
           0: 0x4e4a3b - <unknown>!rust_begin_unwind
           1: 0x46cf57 - <unknown>!core::panicking::panic_fmt::h3c280dba88683724
           2: 0x46da8b - <unknown>!core::option::expect_failed::hdf18d99c3adabca7
           3: 0x2134cb - <unknown>!<cumulus_pallet_parachain_system::pallet::Pallet<T> as frame_support::traits::hooks::OnFinalize<<<<T as frame_system::pallet::Config>::Block as sp_runtime::traits::HeaderProvider>::HeaderT as sp_runtime::traits::Header>::Number>>::on_finalize::hf98aac39802896ba
           4: 0x26a9d6 - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::idle_and_finalize_hook::h32775c0df0749d92
           5: 0x26ad9f - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::finalize_block::h15e5a1a6b9ca8032
           6: 0x2822bd - <unknown>!BlockBuilder_finalize_block }
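
The two slot numbers in the failed assertions differ by exactly a factor of two (142261198 × 2 = 284522396), which is what one would expect if one side divides the timestamp by a 12s slot duration and the other by 6s. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and the timestamp is an assumed value chosen to fall inside the logged slot; it is not taken from the logs.

```rust
/// Hypothetical helper: derive an Aura-style slot number from a Unix
/// timestamp and a slot duration, both given in milliseconds.
fn slot_from_timestamp(timestamp_ms: u64, slot_duration_ms: u64) -> u64 {
    assert!(slot_duration_ms > 0, "slot duration must be non-zero");
    // Integer division: every timestamp inside a slot maps to the same number.
    timestamp_ms / slot_duration_ms
}

/// Hypothetical helper: the slot numbers that two parties compute for the
/// same timestamp when they disagree on the slot duration.
fn slots_for(timestamp_ms: u64, old_duration_ms: u64, new_duration_ms: u64) -> (u64, u64) {
    (
        slot_from_timestamp(timestamp_ms, old_duration_ms),
        slot_from_timestamp(timestamp_ms, new_duration_ms),
    )
}
```

Under these assumptions, `slots_for(1_707_134_376_000, 12_000, 6_000)` yields the pair `(142_261_198, 284_522_396)`, matching the `left` and `right` values above.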

@s0me0ne-unkn0wn s0me0ne-unkn0wn added R0-silent Changes should not be mentioned in any release notes T14-system_parachains This PR/Issue is related to system parachains. labels Feb 5, 2024
@s0me0ne-unkn0wn s0me0ne-unkn0wn changed the title Use synamic aura slot duration in lookahead collator Use dynamic aura slot duration in lookahead collator Feb 5, 2024
@alexggh alexggh self-requested a review February 5, 2024 12:43
@bkchr
Member

bkchr commented Feb 5, 2024

Bad news: during the transitional period of a single block in which the actual runtime upgrade is performed, it still gets the old slot duration of 12s (as it gets it from the best block), resulting in a runtime panic (logs follow). That doesn't affect the following block production of the parachain. Ideas on how to improve the situation are appreciated.

There is not that much that could be done here. So, I think it is fine that these errors are appearing for this small moment in time.

@s0me0ne-unkn0wn s0me0ne-unkn0wn marked this pull request as ready for review February 5, 2024 12:49
@bkchr
Member

bkchr commented Feb 5, 2024

Bad news: during the transitional period of a single block in which the actual runtime upgrade is performed, it still gets the old slot duration of 12s (as it gets it from the best block), resulting in a runtime panic (logs follow). That doesn't affect the following block production of the parachain. Ideas on how to improve the situation are appreciated.

There is not that much that could be done here. So, I think it is fine that these errors are appearing for this small moment in time.

While thinking about this again and looking at the code, the error should not happen. It happens because we are not using the correct block (or probably not the correct block) to determine the slot duration.
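
One reading of this remark, sketched under assumptions: if the slot duration is looked up in the state of the block being built on, rather than in whatever block is currently considered best, the collator would see the new value as soon as the upgrade is enacted. The `Chain` type, `BlockHash` alias, and `example_chain` function below are hypothetical stand-ins for illustration, not the cumulus or Substrate API.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical stand-in for a block hash.
type BlockHash = u32;

/// Hypothetical view of the chain: for each block, the slot duration that
/// the runtime stored in that block's state would report.
struct Chain {
    slot_duration_at: HashMap<BlockHash, Duration>,
}

impl Chain {
    /// Slot duration as reported by the runtime in the state of `block`,
    /// or `None` if the block is unknown.
    fn slot_duration(&self, block: BlockHash) -> Option<Duration> {
        self.slot_duration_at.get(&block).copied()
    }
}

/// Toy chain in which block 2 enacts the upgrade from 12s to 6s slots.
fn example_chain() -> Chain {
    let mut slot_duration_at = HashMap::new();
    slot_duration_at.insert(1, Duration::from_secs(12)); // old runtime
    slot_duration_at.insert(2, Duration::from_secs(6)); // upgraded runtime
    Chain { slot_duration_at }
}
```

In this toy model, a collator building on block 2 that queries block 1 gets 12s and computes a stale slot, while querying block 2 itself gets 6s.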

@s0me0ne-unkn0wn s0me0ne-unkn0wn requested a review from bkchr February 6, 2024 16:40
Contributor

@alexggh alexggh left a comment

The fix looks sane to me.

However, I'm not familiar with all the internals, so I would like to understand what happens in the following scenarios (I can help with the testing):

  1. We upgrade the parachain runtime, but the collator nodes are still running the old version, from before Enable async backing on all testnet system chains #2949 and Enable async backing on asset-hub-rococo #2826.

  2. We upgrade the collator node, but the runtime is still the old one, from before Enable async backing on all testnet system chains #2949 and Enable async backing on asset-hub-rococo #2826.

  3. We upgrade both the runtime and the collator node. Does the relay chain need async backing enabled, or will it work even with a relay chain where async backing is not enabled?

Btw: #2826 was included in 1.7.0. What will happen if they deploy it (node or runtime) without this fix?

@s0me0ne-unkn0wn
Contributor Author

  1. We upgrade the parachain runtime, but the collator nodes are still running the old version

Not checked yet, but it's probably not a valid scenario. Collators should upgrade before the runtime upgrade (the same way we should wait for the relay chain nodes to upgrade before applying a runtime upgrade with new features utilized by nodes).

  2. We upgrade the collator node, but the runtime is still the old one

The parachain will progress normally, but we'll see runtime panics in logs every second block. They are harmless but frightening 😵‍💫

  3. We upgrade both the runtime and the collator node. Does the relay chain need async backing enabled, or will it work even with a relay chain where async backing is not enabled?

Not sure, my guess would be the same as 2., but would be good to test.

Btw: #2826 was included in 1.7.0. What will happen if they deploy it (node or runtime) without this fix?

Again, not 100% sure, my guess is the collators might have to be restarted after the runtime upgrade, otherwise the parachain will stall.

@s0me0ne-unkn0wn s0me0ne-unkn0wn requested review from a team as code owners February 8, 2024 12:42
@s0me0ne-unkn0wn s0me0ne-unkn0wn force-pushed the s0me0ne/aura-collator-dynamic-slot-duration branch from bfb30e2 to 3e8e268 Compare February 8, 2024 12:45
@s0me0ne-unkn0wn s0me0ne-unkn0wn removed request for a team February 8, 2024 12:46
@bkchr bkchr enabled auto-merge February 12, 2024 15:27
4 participants