Increase para block inclusion reliability #5544

eskimor · 2024-09-02T08:06:39Z

Having produced parachain blocks retracted is at the very least detrimental to the throughput of the chain, but it also harms user and developer experience:

Worst case, a block gets retracted, a transaction in it becomes invalid, and the user has to issue a new one.
Having a transaction in a block is already some proof of validity, under the
assumption that we trust collators.
Having a transaction in a block which got backed (off-chain) provides an even
higher level of security.

In general the block will move up in assurance over time, but if it happens
frequently that a block just gets discarded after all, the benefit of this
property vanishes and one actually has to wait for definite finality, which
takes the longest.

The following is an unordered list of things that can cause a parachain block to not make it, along with possible solutions.

Speculative Availability

Give availability more time, to enhance likelihood of cores getting freed on
time:

Factor out provisioner logic to get backable candidates
Reuse that code in availability-distribution to already fetch scheduled
cores and then backable candidates from prospective parachains
Add fetch tasks for these (with the leaf it was scheduled in)
Profit
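
As a rough illustration of the steps above, here is a minimal sketch in Rust. It is not the actual polkadot-sdk code: all types and names (`CoreIndex`, `FetchTask`, `plan_speculative_fetches`) are made up for the example, and it assumes that for a given leaf we already know the scheduled cores and which candidate prospective parachains considers backable for each para.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real node types.
type Hash = [u8; 32];
type CoreIndex = u32;
type ParaId = u32;
type CandidateHash = Hash;

/// A chunk fetch that is started before the candidate is backed on chain.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FetchTask {
    candidate: CandidateHash,
    core: CoreIndex,
    /// The leaf in which the core was scheduled.
    leaf: Hash,
}

/// Plan speculative fetch tasks for one leaf.
///
/// `scheduled` lists each scheduled core with its para, `backable` maps a
/// para to its backable candidate (if any), and `running` holds the
/// candidates that already have a fetch task.
fn plan_speculative_fetches(
    leaf: Hash,
    scheduled: &[(CoreIndex, ParaId)],
    backable: &HashMap<ParaId, CandidateHash>,
    running: &mut HashSet<CandidateHash>,
) -> Vec<FetchTask> {
    let mut tasks = Vec::new();
    for &(core, para) in scheduled {
        // No backable candidate for this para yet: nothing to prefetch.
        let Some(&candidate) = backable.get(&para) else { continue };
        // `insert` returns false if a task for this candidate already exists.
        if running.insert(candidate) {
            tasks.push(FetchTask { candidate, core, leaf });
        }
    }
    tasks
}
```

Keeping the leaf on each task would allow dropping the task again once that leaf is no longer relevant, which is one possible reading of "with the leaf it was scheduled in".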

Immunity to relay chain forks

Either:

Build on slightly older relay parents

Simple
Can be made pretty robust, if we choose relay parents which have been
finalized already.
Also provides some resilience against relay chain reorgs.

But: Latency with regards to message processing is added.

Build on all forks

No additional latency

More resources on the collator (likely fine, as block building is mostly
single core, hence additional cores are free)
More complexity: One does not only need to track one fork, but multiple.
More work for the relay chain.
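
To make the first option (building on slightly older relay parents) a bit more concrete, a minimal Rust sketch follows. The function name and the `max_lag` parameter are assumptions for illustration only. The fallback to `best - max_lag` is also an assumption, added so that the extra message-processing latency stays bounded if finality falls behind.

```rust
type BlockNumber = u32;

/// Pick the relay parent (by block number) to build on.
///
/// Prefers the latest finalized block, which cannot be reorged away. If
/// finality is more than `max_lag` blocks behind the best block, it falls
/// back to `best - max_lag`, trading some fork immunity for bounded latency.
fn pick_relay_parent(
    best: BlockNumber,
    finalized: BlockNumber,
    max_lag: BlockNumber,
) -> BlockNumber {
    let oldest_allowed = best.saturating_sub(max_lag);
    // Never older than `oldest_allowed`, never newer than `best`.
    finalized.max(oldest_allowed).min(best)
}
```

For example, with `best = 100`, `finalized = 98` and `max_lag = 3` this builds on block 98. If finality were stuck at block 50, it would build on block 97 instead.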

Avoid relay parents becoming obsolete

Allow relay parents that survive longer than the claim queue length. Then the
runtime would still accept those candidates if in the current claim queue the
parachain still has assignments on the core. This way, if e.g. a block producer does not
produce a block, the parachain would merely slow down a bit, but not get its
blocks discarded.
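
The relaxed acceptance rule described above could be sketched like this in Rust. This is not the runtime's actual check: the function, its parameters and the flat `claim_queue` slice are hypothetical, and a real implementation would presumably still need some upper bound on relay parent age.

```rust
type BlockNumber = u32;
type ParaId = u32;

/// Hypothetical acceptance rule for a candidate's relay parent.
///
/// `claim_queue` is the current claim queue of the candidate's core, i.e.
/// the paras that have upcoming assignments on that core.
fn relay_parent_acceptable(
    now: BlockNumber,
    relay_parent: BlockNumber,
    max_ancestry: BlockNumber,
    para: ParaId,
    claim_queue: &[ParaId],
) -> bool {
    // A relay parent from the future is never valid.
    if relay_parent > now {
        return false;
    }
    // Regular rule: the relay parent is still within the allowed ancestry.
    if now - relay_parent <= max_ancestry {
        return true;
    }
    // Proposed relaxation: an older relay parent stays acceptable as long
    // as the para still has an assignment on the core.
    claim_queue.contains(&para)
}
```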

Update: We will punt on this for now.

Session boundaries

Even with above optimizations, session boundaries would still make relay parents
obsolete. A simple fix would be for collators to anticipate the session change
and stop producing candidates that would end up getting backed in the last block of the session.
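
On the collator side, such a check could look roughly like the Rust sketch below. The names and the `backing_delay` parameter are assumptions. The sketch also skips candidates that would be backed later than the last block of the session, which goes slightly beyond what is stated above.

```rust
type BlockNumber = u32;

/// Returns true if the collator should skip producing a candidate on
/// `relay_parent`.
///
/// `session_last_block` is the last relay chain block of the current
/// session, and `backing_delay` is the assumed number of relay chain blocks
/// between the relay parent and the block in which the candidate is backed.
fn should_skip_for_session_change(
    relay_parent: BlockNumber,
    session_last_block: BlockNumber,
    backing_delay: BlockNumber,
) -> bool {
    // Block in which we expect the candidate to get backed.
    let expected_backed_in = relay_parent.saturating_add(backing_delay);
    expected_backed_in >= session_last_block
}
```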

Core Changes

The relaxation in "Avoid relay parents becoming obsolete" above would not work
if the parachain still has a core assigned, but it is now a different core.
This is not easy to fix in the current design; luckily it should also have
very little impact:

Chains which need high levels of reliability, should aim to keep their core
mappings stable.
With the above, you only run into issues if a block producer messes up exactly
at that rare occasion you changed your core mapping. Given that we have
pretty solid 6s block times, the chances for this happening seem acceptable.

Reliable Collator Protocol

We want to make validator - collator connections as reliable as possible to
ensure produced blocks also end up getting validated in a timely manner.

sandreim · 2024-09-02T10:16:54Z

Speculative Availability

This is a good solution if at some point we discover that 1.5 seconds is not enough time for availability. I'd expect 10MB PoVs could add some pressure here. Running the subsystem benchmarks with some realistic latencies should give a hint.

Immunity to relay chain forks

Either:

Build on slightly older relay parents

Simple

Can be made pretty robust, if we choose relay parents which have been
finalized already.

Finality can slow down and then this strategy doesn't work.

Also provides some resilience against relay chain reorgs.

But: Latency with regards to message processing is added.

I'd expect Sassafras to fix this, but until then we need something to alleviate things a bit. I think the relay parent choice should be more dynamic, so it can optimize for either throughput or latency depending on the state of the tx pool and of relay chain messaging.
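
One possible shape for such a dynamic choice, as a Rust sketch. The heuristic itself is purely an assumption for illustration (a tx backlog means throughput matters, pending messages mean latency matters, and the backlog wins if both apply). None of these names exist in polkadot-sdk.

```rust
type BlockNumber = u32;

/// What the collator optimizes for when picking a relay parent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Goal {
    Latency,
    Throughput,
}

/// Decide the goal from the tx pool and relay chain messaging state.
fn pick_goal(
    pending_messages: usize,
    ready_transactions: usize,
    backlog_threshold: usize,
) -> Goal {
    if ready_transactions >= backlog_threshold {
        // Large backlog: losing a block hurts most, favor reliability.
        Goal::Throughput
    } else if pending_messages > 0 {
        // Messages are waiting: process them as early as possible.
        Goal::Latency
    } else {
        Goal::Throughput
    }
}

/// Map the goal to a relay parent number: the tip for latency, an older
/// block (less likely to be reorged away) for throughput.
fn relay_parent_for(goal: Goal, best: BlockNumber, max_offset: BlockNumber) -> BlockNumber {
    match goal {
        Goal::Latency => best,
        Goal::Throughput => best.saturating_sub(max_offset),
    }
}
```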

Build on all forks

No additional latency

More resources on the collator (likely fine, as block building is mostly
single core, hence additional cores are free)

More complexity: One does not only need to track one fork, but multiple.

I think this is what we were doing until the slot-based collator. Beefier collators (more cores) should make this solution lower-hanging fruit.

Avoid relay parents becoming obsolete

Allow relay parents that survive longer than the claim queue length. Then the runtime would still accept those candidates if in the current claim queue the parachain still has assignments on the core. This way, if e.g. a block producer does not produce a block, the parachain would merely slow down a bit, but not get its blocks discarded.

We are planning to use the same value for the max ancestry and the claim queue length. I don't really see a point in allowing RPs to survive longer. If we do that, why not also have the same scheduling lookahead?

Session boundaries

Even with above optimizations, session boundaries would still make relay parents obsolete. A simple fix would be for collators to anticipate the session change and stop producing candidates that would end up getting backed in the last block of the session.

💯

Reliable Collator Protocol

We want to make validator - collator connections as reliable as possible to ensure produced blocks also end up getting validated in a timely manner.

I think this will have most impact on block times in general.

sandreim · 2024-09-02T10:29:11Z

Another one that makes sense to have on this list and a lower hanging fruit:

Currently for availability we actually have more time, but we are starting the bitfield signing task and timer as soon as we import a relay chain block. If we imported that block very early, we have more than 1.5s to fetch chunks, and the PRE_PROPOSE_TIMEOUT provisioner timeout can also be higher than 2s. We'd just have to compute when we've imported the block with respect to the next slot.
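
A minimal Rust sketch of that computation, under stated assumptions: a 6s slot, a hypothetical `reserve` that must remain before the next slot for everything after bitfield signing, and a floor at today's fixed delay. The function and parameter names are made up and do not correspond to existing node code.

```rust
use std::time::Duration;

/// Assumed relay chain slot duration.
const SLOT_DURATION: Duration = Duration::from_millis(6_000);

/// Time budget for fetching chunks before signing the bitfield.
///
/// `import_offset` is how far into the current slot the block was imported,
/// `reserve` is the time to keep free before the next slot, and
/// `min_budget` is the fixed delay used today (1.5s in this discussion).
fn availability_budget(
    import_offset: Duration,
    reserve: Duration,
    min_budget: Duration,
) -> Duration {
    let until_next_slot = SLOT_DURATION.saturating_sub(import_offset);
    // An early import gets the extra time; a late one falls back to the floor.
    until_next_slot.saturating_sub(reserve).max(min_budget)
}
```

With a block imported 500ms into the slot and a 2s reserve, this yields 3.5s instead of 1.5s. A block imported 5s into the slot still gets the 1.5s floor.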

bkchr · 2024-09-02T15:07:10Z

Even with above optimizations, session boundaries would still make relay parents
obsolete. A simple fix would be for collators to anticipate the session change
and stop producing candidates that would end up getting backed in the last block of the session.

If the underlying validator set doesn't change, we should completely stop invalidating candidates on a session change. Or is there any proper reason?

Currently for availability we actually have more time, but we are starting the bitfield signing task and timer as soon as we import a relay chain block. If we imported that block very early, we have more than 1.5s to fetch chunks, and the PRE_PROPOSE_TIMEOUT provisioner timeout can also be higher than 2s. We'd just have to compute when we've imported the block with respect to the next slot.

paritytech/polkadot#5484 (comment) 🙈

eskimor · 2024-09-02T15:54:20Z

If the underlying validator set doesn't change, we should completely stop invalidating candidates on a session change. Or is there any proper reason?

Mostly implementation complexity. @rphmeier back then decided, that it is not worth it for now. Worth checking again though, things have changed a lot.

rphmeier · 2024-09-02T21:35:24Z

My reasoning at the time was that session changes affect only a tiny proportion of blocks. Session changes happen only once every several hours, which is thousands of blocks apart. So we'd be chasing something like 0.1% efficiency.
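
The order of magnitude can be checked with a one-liner, assuming exactly one relay chain block per session is affected. The function name is made up for the example, and the session lengths below are illustrative inputs.

```rust
/// Fraction of relay chain blocks that are the last block of a session.
/// `session_length_blocks` must be non-zero.
fn session_boundary_fraction(session_length_blocks: u32) -> f64 {
    1.0 / f64::from(session_length_blocks)
}
```

A session of 1000 blocks gives 0.001, i.e. 0.1%. Assuming 4-hour sessions at 6s block times (2400 blocks), it would be roughly 0.04%.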

More resources on the collator (likely fine, as block building is mostly
single core, hence additional cores are free)

worth noting that collation is bottlenecked on IOPS, not CPU, so building on all forks might work until parachains actually are under load and then stop working altogether.

maybe things have changed, but AFAIK slow availability shouldn't cause a parachain block to get retracted. it should just become available more slowly. is the 1 minute availability timeout still a thing?

eskimor · 2024-09-03T08:08:54Z

maybe things have changed, but AFAIK slow availability shouldn't cause a parachain block to get retracted. it should just become available more slowly. is the 1 minute availability timeout still a thing?

The issue is that it delays follow-up blocks, up to the point where their relay parent might have gone out of scope. (Fixable by being more lenient with accepted relay parents.)

eskimor added this to parachains team board Sep 2, 2024

github-project-automation bot moved this to Backlog in parachains team board Sep 2, 2024

eskimor mentioned this issue Sep 30, 2024

"Instant" Collator based finality #5869

Open

eskimor mentioned this issue Dec 16, 2024

Parachain Blocktimes Increasing #6910

Open
