
Large OnDemandQueue queue processing leads to up to 10s block import times #3051

Closed
sandreim opened this issue Jan 24, 2024 · 12 comments · Fixed by #3190
Labels
I2-bug The node fails to follow expected behavior. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

@sandreim (Contributor)

The issue surfaced on Rococo after parachain 4343 placed around 5k on-demand orders. The OnDemandQueue seems to be processed on each block, causing very high inherent creation and import times and leading to liveness issues, with validators building on the same parent multiple times.

[Screenshot attached: 2024-01-24 16:31:18]

CC @eskimor @antonva

@sandreim added the I2-bug and T8-polkadot labels Jan 24, 2024
@ordian (Member) commented Jan 24, 2024

I'd suspect this call

<scheduler::Pallet<T>>::free_cores_and_fill_claimqueue(freed, now);

@bkchr (Member) commented Jan 24, 2024

I mean, this is for sure this code here:

let pos = queue.iter().enumerate().position(|(index, assignment)| {
    if <paras::Pallet<T>>::is_parathread(assignment.para_id) {
        match ParaIdAffinity::<T>::get(&assignment.para_id) {
            Some(affinity) => return affinity.core_idx == core_idx,
            None => return true,
        }
    }
    // Record no longer valid para_ids.
    invalidated_para_id_indexes.push(index);
    return false
});

We should get rid of ParaIdAffinity, move entries that are already "assigned" into some new storage item, and remove them from the OnDemandQueue. The OnDemandQueue should probably be a BTreeMap or similar, indexed by the para_id. We should also bound the OnDemandQueue to some reasonable upper limit like 200 or whatever. While changing these data structures we should maybe also directly take into account that one parachain could at some point be assigned to multiple cores at once.
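A rough sketch of what that split could look like. Names, bounds and the OrderInfo type are hypothetical and only illustrate the suggested data structures; this is not the actual pallet code:

// Illustrative only: stand-in types for the runtime's ParaId/CoreIndex and a
// hypothetical OrderInfo; uses frame_support's bounded collections.
use frame_support::{traits::ConstU32, BoundedBTreeMap, BoundedVec};

type ParaId = u32;    // stand-in for the real ParaId type
type CoreIndex = u32; // stand-in for the scheduler's CoreIndex

/// Hypothetical per-parachain order data (e.g. number of queued orders).
#[derive(Clone, PartialEq, Eq)]
struct OrderInfo {
    count: u32,
}

/// Orders not yet tied to a core, keyed by para_id, so lookup and removal are
/// map operations instead of a linear scan, and bounded to a fixed size
/// (the "200 or whatever" above).
type FreeOrders = BoundedBTreeMap<ParaId, OrderInfo, ConstU32<200>>;

/// Orders that already have core affinity live in their own storage item, so
/// pop_assignment_for_core never walks past them again (this replaces
/// ParaIdAffinity); the same para_id may appear under several core keys,
/// allowing one parachain to hold assignments on multiple cores at once.
type AssignedOrders = BoundedBTreeMap<CoreIndex, BoundedVec<ParaId, ConstU32<8>>, ConstU32<100>>;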

@eskimor (Member) commented Jan 25, 2024

@antonva how expensive would placing 5000 orders be on Kusama/Polkadot?

@eskimor (Member) commented Jan 25, 2024

Related: https://github.com/orgs/paritytech/projects/119/views/17?pane=issue&itemId=42349578

@eskimor (Member) commented Jan 25, 2024

Memo to myself: we should always plan for performance tests for any non-trivial feature, especially if we already suspect that there could be issues.

For the Kusama launch, we have to have some solution here. Unfortunately, we are apparently only updating the spot traffic once per block. @antonva didn't we agree back then that this is a bad idea?

So even if the spot price goes up quickly, if it is possible to squeeze a lot of orders into just one block, we could still be in trouble.

@antonva (Contributor) commented Jan 25, 2024

@antonva how expensive would placing 5000 orders be on Kusama/Polkadot?

Let's assume that we can fit all the orders in the same block and that the OnDemandQueue is empty. The charge in this case would be 5000 * configuration.on_demand_base_fee, which on both Kusama and Polkadot is currently 5_000 * 10_000_000 = 50_000_000_000. That's 0.05 KSM or 5 DOT, due to the different denominations.
We should increase the base fee on all relays, including Rococo.
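For reference, a tiny sketch of that arithmetic with the denominations spelled out (1 KSM = 10^12 planck, 1 DOT = 10^10 planck; the base fee is the value quoted above):

fn main() {
    // 5000 orders at the current on_demand_base_fee quoted above.
    let orders: u128 = 5_000;
    let on_demand_base_fee: u128 = 10_000_000;
    let total_planck = orders * on_demand_base_fee; // 50_000_000_000

    // Denominations: 1 KSM = 10^12 planck, 1 DOT = 10^10 planck.
    let ksm = total_planck as f64 / 1e12; // 0.05 KSM
    let dot = total_planck as f64 / 1e10; // 5 DOT
    println!("{total_planck} planck = {ksm} KSM or {dot} DOT");
}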

... @antonva didn't we agree back then that this is a bad idea?

We did agree on that, not sure why it didn't get added. Will go do that now.

@antonva self-assigned this Jan 25, 2024
@eskimor (Member) commented Jan 25, 2024

We did agree on that, not sure why it didn't get added. Will go do that now.

Thank you! We need to get better at tracking these things. It is just a bummer if we actually thought of an issue but forgot to implement the mitigation. My best take on this right now is to make sure there are tickets on the board; then we only need to make sure we are actually checking the board. 🙈

With the spot traffic going up with each order, how would the math then end up?
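Not an answer with the real pricing function, but as a purely hypothetical illustration of why updating per order matters: if we assume the traffic multiplier bumps the price by 1% for every accepted order (the actual formula is different), the cost of a 5000-order burst explodes instead of staying at a flat 5000 * base_fee:

fn main() {
    // Hypothetical illustration only: the real spot-traffic update is not a
    // flat 1% bump per order; this just shows the shape of the effect.
    let base_fee: f64 = 10_000_000.0; // planck, value quoted above
    let bump_per_order: f64 = 1.01;   // assumed +1% per accepted order

    let mut total = 0.0;
    let mut price = base_fee;
    for _ in 0..5_000 {
        total += price;
        price *= bump_per_order;
    }
    // With per-block updates the whole burst pays the flat 50_000_000_000
    // planck computed above; with per-order updates the last order alone
    // costs base_fee * 1.01^4999.
    println!("total with per-order bump: {total:e} planck");
}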

@eskimor (Member) commented Jan 25, 2024

I will try to find some time today to make the implementation more efficient.

@eskimor (Member) commented Jan 25, 2024

A bit more analysis: we have something like 50 on-demand cores on Rococo, which leads to the following:

For each core (there are 50), we call pop_assignment_for_core, running this loop, which will traverse all 5000 entries on all but one core. That is 50 * 5000 = 250_000 traversals ... does not sound great, but also not like it should take seconds to execute. The loop is not doing much ... are runtimes really that slow? I mean, there are also copies on reads and everything ... 🤷

Also, Anton: for the spot price calculation we fetch the length of the vec, which also does a full read.
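A back-of-the-envelope sketch of those numbers (the per-entry size is an assumption; the point is that every pop reads and writes back the whole queue):

fn main() {
    let cores: u64 = 50;
    let queue_len: u64 = 5_000;

    // Closure iterations per block: each pop_assignment_for_core scans
    // the queue.
    let iterations = cores * queue_len; // 250_000, as noted above

    // Each pop also reads, decodes and (after mutation) re-encodes the whole
    // OnDemandQueue vec. Assuming roughly 40 bytes per entry (hypothetical):
    let entry_bytes: u64 = 40;
    let bytes_per_pop = queue_len * entry_bytes; // roughly 200 KB
    let bytes_per_block = cores * bytes_per_pop; // roughly 10 MB per block
    println!(
        "{iterations} iterations, ~{} KB per pop, ~{} MB (de)coded per block",
        bytes_per_pop / 1024,
        bytes_per_block / (1024 * 1024)
    );
}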

@bkchr (Member) commented Jan 28, 2024

running this loop, which will traverse all 5000 entries on all but one core. That is 50 * 5000 = 250_000 traversals ... does not sound great, but also not like it should take seconds to execute. The loop is not doing much ... are runtimes really that slow?

Maybe there is some other performance issue somewhere else as well.

Nevertheless, we need to fix this by 31.01 to get it into the next release, which will then be used by the fellowship.

@eskimor (Member) commented Jan 29, 2024

Nevertheless, we need to fix this by 31.01 to get it into the next release, which will then be used by the fellowship.

I don't think my fix will be fully ready by then, but as mentioned, if we reduce the maximum queue size, don't go crazy with cores (which we should not do anyway), and maybe don't go super dirt cheap, we should be fine for Kusama. @ordian any news on the measurement front? Do we have a better understanding of where we actually spend so much time?

@ordian (Member) commented Jan 30, 2024

Still trying to repro the issue with https://github.com/paritytech/polkadot-sdk/compare/ao-bench-on-demand. The problem is that, with the removal of the native runtime, running benches even with --execution=native doesn't use the native runtime, so debugging (currently the on-demand queue is not being read) is slow (an add-logs-and-recompile cycle).

UPD: managed to make it read from the OnDemandQueue, but it's not slow during import (enter).

UPD2: managed to repro; we discussed it in Element.

[profile attached]

@eskimor moved this from Backlog to In Progress in the parachains team board Jan 30, 2024
github-project-automation bot moved this from In Progress to Completed in the parachains team board Mar 20, 2024