Increase para block inclusion reliability #5544

Open
4 tasks
eskimor opened this issue Sep 2, 2024 · 6 comments

Comments

@eskimor
Member

eskimor commented Sep 2, 2024

Having produced parachain blocks retracted is, at the very least, detrimental to the throughput of the chain, and it also harms user and developer experience:

  • Worst case, a block gets retracted, the transaction becomes invalid, and the
    user has to issue a new one.
  • Having a transaction in a block is already some proof of validity, under the
    assumption that we trust collators.
  • Having a transaction in a block that got backed (off-chain) provides an even
    higher level of security.

In general, a block moves up in assurance over time. But if blocks frequently
end up discarded after all, the benefit of this property vanishes and one
actually has to wait for definite finality, which takes the longest.

The following is an unordered list of things that can cause a parachain block to not make it, along with possible solutions.

Speculative Availability

Give availability more time, to enhance likelihood of cores getting freed on
time:

  • Factor out provisioner logic to get backable candidates
  • Reuse that code in availability-distribution to already fetch scheduled
    cores and then backable candidates from prospective parachains
  • Add fetch tasks for these (with the leaf it was scheduled in)
  • Profit
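The steps above could be sketched roughly as follows. This is a minimal illustration only: `FetchTask`, `plan_speculative_fetches` and the `backable_candidates` callback are hypothetical names, not existing subsystem APIs. The callback stands in for the factored-out provisioner logic that asks prospective parachains for backable candidates.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class FetchTask:
    candidate_hash: str
    core: int
    leaf: str  # relay chain leaf in which the core was scheduled


def plan_speculative_fetches(
    leaf: str,
    scheduled_cores: Iterable[int],
    backable_candidates: Callable[[str, int], List[str]],
    in_flight: Set[str],
) -> List[FetchTask]:
    """Create fetch tasks for backable candidates on scheduled cores.

    `backable_candidates(leaf, core)` is a hypothetical stand-in for the
    shared provisioner logic; `in_flight` holds candidate hashes that
    already have a fetch task.
    """
    tasks = []
    planned = set(in_flight)  # copy, so the caller's set is not mutated
    for core in scheduled_cores:
        for candidate_hash in backable_candidates(leaf, core):
            if candidate_hash in planned:
                continue  # already being fetched, do not start a second task
            planned.add(candidate_hash)
            tasks.append(FetchTask(candidate_hash, core, leaf))
    return tasks
```

Starting the fetch when a candidate becomes backable, rather than when it is seen on chain, is what would give availability the extra time.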

Immunity to relay chain forks

Either:

Build on slightly older relay parents

  • Simple
  • Can be made pretty robust, if we choose relay parents which have been
    finalized already.
  • Also provides some resilience against relay chain reorgs.
  • But: Latency with regards to message processing is added.

Build on all forks

  • No additional latency
  • More resources on the collator (likely fine, as block building is mostly
    single core, hence additional cores are free)
  • More complexity: One does not only need to track one fork, but multiple.
  • More work for the relay chain.

Avoid relay parents becoming obsolete

Allow relay parents that survive longer than the claim queue length. Then the
runtime would still accept those candidates if in the current claim queue the
parachain still has assignments on the core. This way, if e.g. a block producer does not
produce a block, the parachain would merely slow down a bit, but not get its
blocks discarded.

Update: We will punt on this for now.
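For reference, the acceptance rule described in this section could look something like the sketch below. All names and parameters are illustrative assumptions (in particular `extended_ancestry` as an upper bound for the more lenient check); this is not the runtime's actual logic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    para_id: int
    core: int
    relay_parent_number: int  # block number of the relay parent it was built on


def is_acceptable(
    candidate: Candidate,
    current_block: int,
    max_ancestry: int,
    extended_ancestry: int,
    claim_queue: Dict[int, List[int]],
) -> bool:
    """`claim_queue` maps a core index to the para ids with upcoming claims."""
    age = current_block - candidate.relay_parent_number
    if age < 0:
        return False  # relay parent is newer than the current block
    if age <= max_ancestry:
        return True  # relay parent is still within the regular ancestry window
    if age > extended_ancestry:
        return False  # too old even for the lenient rule (assumed bound)
    # Lenient rule: accept an older relay parent as long as the parachain
    # still has an assignment on this core in the current claim queue.
    return candidate.para_id in claim_queue.get(candidate.core, [])
```

With such a rule, a skipped relay chain block only ages the relay parent; the candidate is dropped only once the parachain has lost its claims on that core.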

Session boundaries

Even with above optimizations, session boundaries would still make relay parents
obsolete. A simple fix would be for collators to anticipate the session change
and stop producing candidates that would end up getting backed in the last block of the session.
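A collator-side check for this could be as small as the following sketch. It assumes the collator knows the last block number of the current session and that a candidate is backed a fixed number of relay chain blocks after its relay parent (`backing_delay`, one block here). Both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def should_skip_candidate(
    relay_parent: int,
    session_last_block: int,
    backing_delay: int = 1,
) -> bool:
    """Return True if a candidate built on `relay_parent` should not be produced.

    The candidate is skipped when its relay parent belongs to the current
    session but it would be backed in the last block of that session or later,
    because the session change would then make it obsolete.
    """
    expected_backing_block = relay_parent + backing_delay
    built_in_current_session = relay_parent <= session_last_block
    return built_in_current_session and expected_backing_block >= session_last_block
```

For example, with `session_last_block = 2400` the collator would still build on relay parent 2398, skip relay parents 2399 and 2400, and resume with 2401, which already belongs to the next session.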

Core Changes

The "Avoid relay parents becoming obsolete" approach above would not work if the
parachain still has a core assigned, but it is a different core now. This is not
easy to fix in the current design. Luckily, it should also have very little impact:

  • Chains which need high levels of reliability should aim to keep their core
    mappings stable.
  • With the above, you only run into issues if a block producer messes up exactly
    on the rare occasion that you change your core mapping. Given that we have
    pretty solid 6s block times, the chances of this happening seem acceptable.

Reliable Collator Protocol

We want to make validator - collator connections as reliable as possible to
ensure produced blocks also end up getting validated in a timely manner.

@sandreim
Contributor

sandreim commented Sep 2, 2024

Speculative Availability

This is a good solution if at some point we discover that 1.5 seconds is not enough time for availability. I'd expect 10MB PoVs could add some pressure here. Running subsystem benchmark numbers with some realistic latencies should give a hint.

Immunity to relay chain forks

Either:

Build on slightly older relay parents

  • Simple
  • Can be made pretty robust, if we choose relay parents which have been
    finalized already.

Finality can slow down and then this strategy doesn't work.

  • Also provides some resilience against relay chain reorgs.

  • But: Latency with regards to message processing is added.

I'd expect Sassafras to fix this, but until then we need something to alleviate things a bit. I think the relay chain parent choice should be more dynamic so it can optimize for either tput or latency depending on tx pool and relay chain messaging state.

Build on all forks

  • No additional latency

  • More resources on the collator (likely fine, as block building is mostly
    single core, hence additional cores are free)

  • More complexity: One does not only need to track one fork, but multiple.

I think this is what we were doing until the slot-based collator. Beefier (more cores) collators should make this solution a lower hanging fruit.

Avoid relay parents becoming obsolete

Allow relay parents that survive longer than the claim queue length. Then the runtime would still accept those candidates if in the current claim queue the parachain still has assignments on the core. This way, if e.g. a block producer does not produce a block, the parachain would merely slow down a bit, but not get its blocks discarded.

We are planning to use the same value for the max ancestry and claim queue length. I don't really see a point in allowing RPs to survive longer. If we do that, why not also have the same scheduling lookahead?

Session boundaries

Even with above optimizations, session boundaries would still make relay parents obsolete. A simple fix would be for collators to anticipate the session change and stop producing candidates that would end up getting backed in the last block of the session.

💯

Reliable Collator Protocol

We want to make validator - collator connections as reliable as possible to ensure produced blocks also end up getting validated in a timely manner.

I think this will have most impact on block times in general.

@sandreim
Contributor

sandreim commented Sep 2, 2024

Another one that makes sense to have on this list and a lower hanging fruit:

Currently for availability we actually have more time, but we are starting the bitfield signing task and timer as soon as we import a relay chain block. If we imported that block very early, we have more than 1.5s to fetch chunks, and the PRE_PROPOSE_TIMEOUT provisioner timeout can also be higher than 2s. We'd just have to compute when we've imported the block relative to the next slot.
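A rough sketch of that computation, under stated assumptions: 6s slots, the current fixed 1.5s delay kept as a floor, and a hypothetical `reserve_ms` that is held back before the next slot for collecting bitfields and authoring. None of these names exist in the code base as such.

```python
SLOT_DURATION_MS = 6_000  # assumes 6s relay chain slots
FIXED_DELAY_MS = 1_500    # delay used today, counted from block import


def chunk_fetch_budget_ms(import_offset_ms: int, reserve_ms: int = 2_000) -> int:
    """Time to wait for chunks before signing the bitfield.

    `import_offset_ms` is how far into the current slot the block was imported.
    `reserve_ms` is an assumed margin that must stay free before the next slot.
    """
    slot_relative = SLOT_DURATION_MS - reserve_ms - import_offset_ms
    # Never wait less than the fixed delay used today.
    return max(FIXED_DELAY_MS, slot_relative)
```

A block imported 500ms into its slot would get 3.5s for chunk fetching instead of 1.5s, while a block imported late in the slot falls back to the fixed delay.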

@bkchr
Member

bkchr commented Sep 2, 2024

Even with above optimizations, session boundaries would still make relay parents
obsolete. A simple fix would be for collators to anticipate the session change
and stop producing candidates that would end up getting backed in the last block of the session.

If the underlying validator set doesn't change, we should completely stop invalidating candidates on a session change. Or is there any proper reason?

Currently for availability we actually have more time, but we are starting the bitfield signing task and timer as soon as we import a relay chain block. If we imported that block very early, we have more than 1.5s to fetch chunks, and the PRE_PROPOSE_TIMEOUT provisioner timeout can also be higher than 2s. We'd just have to compute when we've imported the block relative to the next slot.

paritytech/polkadot#5484 (comment) 🙈

@eskimor
Member Author

eskimor commented Sep 2, 2024

If the underlying validator set doesn't change, we should completely stop invalidating candidates on a session change. Or is there any proper reason?

Mostly implementation complexity. @rphmeier decided back then that it was not worth it for the time being. Worth checking again though, things have changed a lot.

@rphmeier
Contributor

rphmeier commented Sep 2, 2024

My reasoning at the time was that session changes affect only a tiny proportion of blocks. Session changes happen only once every several hours, and a session spans thousands of blocks. So we'd be chasing like 0.1% efficiency.
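As a back-of-the-envelope check of that figure, assuming a 4 hour session and 6s blocks, with the number of affected blocks per session change being a guess:

```python
SESSION_HOURS = 4  # assumed session length
BLOCK_TIME_S = 6   # relay chain block time

blocks_per_session = SESSION_HOURS * 3600 // BLOCK_TIME_S  # 2400 blocks


def affected_fraction(affected_blocks: int) -> float:
    """Share of blocks in a session that are hit by the session change."""
    return affected_blocks / blocks_per_session


# If roughly two blocks per session are affected, that is 2 / 2400,
# i.e. a little under 0.1% of all blocks.
print(f"{affected_fraction(2):.3%}")
```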

More resources on the collator (likely fine, as block building is mostly
single core, hence additional cores are free)

worth noting that collation is bottlenecked on IOPS, not CPU, so building on all forks might work until parachains actually are under load and then stop working altogether.


maybe things have changed, but AFAIK slow availability shouldn't cause a parachain block to get retracted. it should just become available more slowly. is the 1 minute availability timeout still a thing?

@eskimor
Member Author

eskimor commented Sep 3, 2024

maybe things have changed, but AFAIK slow availability shouldn't cause a parachain block to get retracted. it should just become available more slowly. is the 1 minute availability timeout still a thing?

The issue is that it delays follow-up blocks, up to the point where their relay parent might have gone out of scope. (Fixable by being more lenient with accepted relay parents.)

Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

4 participants