
Large OnDemandQueue queue processing leads to up to 10s block import times #3051

Closed
sandreim opened this issue Jan 24, 2024 · 12 comments · Fixed by #3190
Labels
I2-bug The node fails to follow expected behavior. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

@sandreim (Contributor)

The issue surfaced on Rococo after parachain 4343 placed around 5k on-demand orders. The OnDemandQueue seems to be processed on each block, causing very high inherent creation and import times and leading to liveness issues, with validators building on the same parent multiple times.

[Screenshot attached: 2024-01-24 16:31:18]

CC @eskimor @antonva

@sandreim added the I2-bug and T8-polkadot labels Jan 24, 2024
@ordian (Member) commented Jan 24, 2024

I'd suspect this call

<scheduler::Pallet<T>>::free_cores_and_fill_claimqueue(freed, now);

@bkchr (Member) commented Jan 24, 2024

I mean, this is for sure this code here:

let pos = queue.iter().enumerate().position(|(index, assignment)| {
    if <paras::Pallet<T>>::is_parathread(assignment.para_id) {
        match ParaIdAffinity::<T>::get(&assignment.para_id) {
            Some(affinity) => return affinity.core_idx == core_idx,
            None => return true,
        }
    }
    // Record no longer valid para_ids.
    invalidated_para_id_indexes.push(index);
    return false
});

We should get rid of ParaIdAffinity, move entries that are already "assigned" into some new storage item, and remove them from the OnDemandQueue. The OnDemandQueue should probably be a BTreeMap or similar, indexed by the para_id. We should also bound the OnDemandQueue to some reasonable upper limit like 200 or whatever. While changing these data structures we should maybe also directly take into account that one parachain could at some point be assigned to multiple cores at once.
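A rough sketch of what that split could look like. Names, bounds and the OrderInfo type are hypothetical and only illustrate the suggested data structures; this is not the actual pallet code:

// Illustrative only: stand-in types for the runtime's ParaId/CoreIndex and a
// hypothetical OrderInfo; uses frame_support's bounded collections.
use frame_support::{traits::ConstU32, BoundedBTreeMap, BoundedVec};

type ParaId = u32;    // stand-in for the real ParaId type
type CoreIndex = u32; // stand-in for the scheduler's CoreIndex

/// Hypothetical per-parachain order data (e.g. number of queued orders).
#[derive(Clone, PartialEq, Eq)]
struct OrderInfo {
    count: u32,
}

/// Orders not yet tied to a core, keyed by para_id, so lookup and removal are
/// map operations instead of a linear scan, and bounded to a fixed size
/// (the "200 or whatever" above).
type FreeOrders = BoundedBTreeMap<ParaId, OrderInfo, ConstU32<200>>;

/// Orders that already have core affinity live in their own storage item, so
/// pop_assignment_for_core never walks past them again (this replaces
/// ParaIdAffinity); the same para_id may appear under several core keys,
/// allowing one parachain to hold assignments on multiple cores at once.
type AssignedOrders = BoundedBTreeMap<CoreIndex, BoundedVec<ParaId, ConstU32<8>>, ConstU32<100>>;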

@eskimor (Member) commented Jan 25, 2024

@antonva how expensive would placing 5000 orders be on Kusama/Polkadot?

@eskimor (Member) commented Jan 25, 2024

Related: https://github.com/orgs/paritytech/projects/119/views/17?pane=issue&itemId=42349578

@eskimor (Member) commented Jan 25, 2024

Memo to myself: we should always plan for performance tests for any non-trivial feature, especially if we already suspect that there could be issues.

For the Kusama launch, we have to have some solution here. Unfortunately, we are apparently only updating the spot traffic once per block. @antonva didn't we agree back then that this is a bad idea?

So even if the spot price goes up quickly, if it is possible to squeeze a lot of orders into just one block, we could still be in trouble.

@antonva (Contributor) commented Jan 25, 2024

@antonva how expensive would placing 5000 orders be on Kusama/Polkadot?

Let's assume that we can fit all the orders in the same block and that the OnDemandQueue is empty. The charge in this case would be 5000 * configuration.on_demand_base_fee, which on both Kusama and Polkadot is currently 5_000 * 10_000_000 = 50_000_000_000. That's 0.05 KSM or 5 DOT, due to the different denominations.
We should increase the base fee on all relays, including Rococo.
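For reference, a tiny sketch of that arithmetic with the denominations spelled out (1 KSM = 10^12 planck, 1 DOT = 10^10 planck; the base fee is the value quoted above):

fn main() {
    // 5000 orders at the current on_demand_base_fee quoted above.
    let orders: u128 = 5_000;
    let on_demand_base_fee: u128 = 10_000_000;
    let total_planck = orders * on_demand_base_fee; // 50_000_000_000

    // Denominations: 1 KSM = 10^12 planck, 1 DOT = 10^10 planck.
    let ksm = total_planck as f64 / 1e12; // 0.05 KSM
    let dot = total_planck as f64 / 1e10; // 5 DOT
    println!("{total_planck} planck = {ksm} KSM or {dot} DOT");
}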

... @antonva didn't we agree back then that this is a bad idea?

We did agree on that, not sure why it didn't get added. Will go do that now.

@antonva self-assigned this Jan 25, 2024
@eskimor (Member) commented Jan 25, 2024

We did agree on that, not sure why it didn't get added. Will go do that now.

Thank you! We need to get better at tracking these things. It is just a bummer if we actually thought of an issue but forgot to implement the mitigation. My best take on this right now is to make sure there are tickets on the board; then we only need to make sure we are actually checking the board. 🙈

With the spot traffic going up with each order, how would the math then end up?
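Not an answer with the real pricing function, but as a purely hypothetical illustration of why updating per order matters: if we assume the traffic multiplier bumps the price by 1% for every accepted order (the actual formula is different), the cost of a 5000-order burst explodes instead of staying at a flat 5000 * base_fee:

fn main() {
    // Hypothetical illustration only: the real spot-traffic update is not a
    // flat 1% bump per order; this just shows the shape of the effect.
    let base_fee: f64 = 10_000_000.0; // planck, value quoted above
    let bump_per_order: f64 = 1.01;   // assumed +1% per accepted order

    let mut total = 0.0;
    let mut price = base_fee;
    for _ in 0..5_000 {
        total += price;
        price *= bump_per_order;
    }
    // With per-block updates the whole burst pays the flat 50_000_000_000
    // planck computed above; with per-order updates the last order alone
    // costs base_fee * 1.01^4999.
    println!("total with per-order bump: {total:e} planck");
}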

@eskimor (Member) commented Jan 25, 2024

I will try to find some time today to make the implementation more efficient.

@eskimor (Member) commented Jan 25, 2024

A bit more analysis: we have something like 50 on-demand cores on Rococo, which leads to the following:

For each core (there are 50), we call pop_assignment_for_core, running this loop, which will traverse all 5000 entries on all but one core. That is 50 * 5000 = 250_000 traversals ... does not sound great, but also not like it should take seconds to execute. The loop is not doing much ... are runtimes really that slow? I mean, there are also copies on reads and everything ... 🤷

Also, Anton: for the spot price calculation we fetch the length of the vec, which also does a full read.
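A back-of-the-envelope sketch of those numbers (the per-entry size is an assumption; the point is that every pop reads and writes back the whole queue):

fn main() {
    let cores: u64 = 50;
    let queue_len: u64 = 5_000;

    // Closure iterations per block: each pop_assignment_for_core scans
    // the queue.
    let iterations = cores * queue_len; // 250_000, as noted above

    // Each pop also reads, decodes and (after mutation) re-encodes the whole
    // OnDemandQueue vec. Assuming roughly 40 bytes per entry (hypothetical):
    let entry_bytes: u64 = 40;
    let bytes_per_pop = queue_len * entry_bytes; // roughly 200 KB
    let bytes_per_block = cores * bytes_per_pop; // roughly 10 MB per block
    println!(
        "{iterations} iterations, ~{} KB per pop, ~{} MB (de)coded per block",
        bytes_per_pop / 1024,
        bytes_per_block / (1024 * 1024)
    );
}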

@bkchr (Member) commented Jan 28, 2024

running this loop, which will traverse all 5000 entries on all but one core. That is 50 * 5000 = 250_000 traversals ... does not sound great, but also not like it should take seconds to execute. The loop is not doing much ... are runtimes really that slow?

Maybe there is some other performance issue somewhere else as well.

Nevertheless, we need to fix this by 31.01 to get it into the next release, which will then be used by the fellowship.

@eskimor (Member) commented Jan 29, 2024

Nevertheless, we need to fix this by 31.01 to get it into the next release, which will then be used by the fellowship.

I don't think my fix will be fully ready by then, but as mentioned, if we reduce the maximum queue size, don't go crazy with cores (which we should not do anyway), and maybe don't go super dirt cheap, we should be fine for Kusama. @ordian any news on the measurement front? Do we have a better understanding of where we actually spend so much time?

@ordian (Member) commented Jan 30, 2024

Still trying to repro the issue with https://github.com/paritytech/polkadot-sdk/compare/ao-bench-on-demand. The problem is that, with the removal of the native runtime, running benches even with --execution=native doesn't use the native runtime, so debugging (currently the on-demand queue is not being read) is slow (an add-logs-and-recompile cycle).

UPD: managed to make it read from the OnDemandQueue, but it's not slow during import (enter).

UPD2: managed to repro; we discussed it in Element.

[profile attached]

@eskimor moved this from Backlog to In Progress in the parachains team board Jan 30, 2024
github-project-automation bot moved this from In Progress to Completed in the parachains team board Mar 20, 2024