This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

bulk child trie deletion #5280

Closed

wants to merge 171 commits
Conversation

cheme
Contributor

@cheme cheme commented Mar 17, 2020

This PR is built upon #4827.

It implements bulk child deletion: instead of deleting every element of a child trie one by one with kill_child_storage, the deletion is done as a whole.
This means:

  • only a single item is stored when deleting a child trie (the identifier of the child trie).
  • A single change trie item is produced for the whole child trie deletion. This is a breaking change, but I do not think the change trie is currently used. I do not know whether the item should appear in the digest too. cc @svyatonik
  • On pruning, deletion is done using the key prefix that rocksdb uses to avoid key collisions. This PR is needed for that: Delete by prefix operator in kvdb parity-common#360
  • The pruning journal and noncanonical journal need to store this change, and a different format is used. This change of format means it may be possible to use the first approach of Manage child trie content independently #4827 to save a bit of space (pushing the rocksdb prefix into the key value later, to reduce journal key redundancy by managing changes per child trie). cc @arkpar
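The first bullet can be sketched as follows. This is a hypothetical, heavily simplified model (the `Change` enum, `apply`, and the plain `BTreeMap` backend are all illustrative, not the actual Substrate changeset types): the point is that a bulk deletion stores a single entry holding only the child trie identifier, instead of one `Remove` entry per key.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum Change {
    Set(Vec<u8>, Vec<u8>),    // insert/overwrite a single key
    Remove(Vec<u8>),          // delete a single key
    DeleteChildTrie(Vec<u8>), // bulk delete: only the child trie identifier is stored
}

/// Apply a changeset to a map of child tries (keyspace -> key/value map).
fn apply(
    tries: &mut BTreeMap<Vec<u8>, BTreeMap<Vec<u8>, Vec<u8>>>,
    keyspace: &[u8],
    changes: &[Change],
) {
    for change in changes {
        match change {
            Change::Set(k, v) => {
                tries
                    .entry(keyspace.to_vec())
                    .or_default()
                    .insert(k.clone(), v.clone());
            }
            Change::Remove(k) => {
                if let Some(t) = tries.get_mut(keyspace) {
                    t.remove(k);
                }
            }
            // One entry drops the whole child trie, regardless of its size.
            Change::DeleteChildTrie(ks) => {
                tries.remove(ks);
            }
        }
    }
}
```

With per-key removal, deleting a trie of n keys would put n entries in the changeset; here the `DeleteChildTrie` variant keeps it at one.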

Since this PR depends on #4827, I set its status to in progress, but it should be usable for contract testing. cc @Robbepop @pepyakin

cheme added 30 commits January 28, 2020 09:45
Note that KeyspacedDB are still use.
KeyspacedDBMut only for test.
allocation whise unless we break all api for a close to nothing perf
change: switching to simple single child info struct.
@cheme
Contributor Author

cheme commented May 7, 2020

I started implementing the possibility to remove deleted nodes from proofs, as described in the previous comment: master...cheme:proof_backends.

The code is incomplete (no handling of child trie prefix deletion) and needs many tests, but it is enough to formulate a few interesting observations.

I ended up with three kinds of proof construction and two kinds of proof verification.
With the bulk trie deletion approach, proofs did not need these different execution paths, because not registering the bulk deletion was a specific operation.

  • Proofs that do not need to trace deletions that are below a parent deletion, and therefore can produce shorter proofs:
    • A first variant that does not process the transaction (WithoutTransaction).

This should be the default choice unless we need to use the transactions (performance).
The implementation uses different logic at the Ext level and additional filtering in the overlay change (deleted child tries and deleted prefixes).
On root calculation, we simply calculate the child trie root on an empty backend.
Note that for the prefix case, we use a hack to insert the unitary changes lazily. That does not achieve the real goal (on proof calculation we will register the access to the unitary elements), and a new trie primitive would be needed to only change the prefix node. Still, the lazy nature of this hack benefits proofs that do not call storage root.

    • A second variant that processes the transaction but does not record it (WithTransaction).

The implementation simply switches off proof node recording during prefixed key access and during child trie key access for the delete-prefix and delete-child-trie operations.
But internally, for storage root calculation, it needs to work the same way as the previous variant and suffers from the same issue in the prefix case.

Both variants need to be verified with a verifier that does not query child trie nodes on deletion; that is the same change in processing as the first variant.
The drawback here is that one host function cannot be used in the proof (unless we do not prove its value, as is currently doable with the Core_execute_block behavior): changes_storage_root, which computes the change trie root. By keeping those deletions out of the proof, we cannot check the change trie. That may be acceptable (note that with the bulk deletion approach everything is fine, because the change trie does not expect those deletion entries).

  • Proofs that need all deletion accesses (the existing code, where we register every trie node touched):

Interesting for testing.
Interesting for proving a change trie. Whether parachains should verify their change trie in Polkadot does not seem obvious to me (at first I thought 'no use', but we probably also want to trust the change trie).
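The overlay filtering that the first variant above relies on can be sketched like this. All names here are illustrative (this is not the actual Substrate overlay type): a child trie marked deleted has its pending changes dropped before proof recording, so the proof never references any of its nodes.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified overlay: per-keyspace pending changes plus the set of child
/// tries marked as deleted in this block (illustrative, not Substrate's).
struct Overlay {
    changes: BTreeMap<Vec<u8>, BTreeMap<Vec<u8>, Option<Vec<u8>>>>,
    deleted_child_tries: BTreeSet<Vec<u8>>,
}

impl Overlay {
    /// Changes that still need to appear in the proof: anything under a
    /// deleted child trie is filtered out, shortening the proof.
    fn changes_to_prove(&self) -> BTreeMap<Vec<u8>, BTreeMap<Vec<u8>, Option<Vec<u8>>>> {
        self.changes
            .iter()
            .filter(|(ks, _)| !self.deleted_child_tries.contains(*ks))
            .map(|(ks, c)| (ks.clone(), c.clone()))
            .collect()
    }
}
```

The root of a filtered-out child trie would then be computed on an empty backend, as described above.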

So my takeaway from this branch is that it is doable to skip unneeded deleted content in proofs, but it relies on the assumption that the change trie root will never be read during block processing.
Generally, I find it better to have a special change for trie deletion (and by extension prefix deletion), and I do not see a disadvantage to having it (except complexity, but given that we would otherwise need a specific proof generation process, that is not true at all).
The benefits are a lighter change trie and a smaller memory footprint.

In both scenarios (the linked branch and this PR's approach), the delete-by-prefix optimization would require additional tools: a radix trie map and a specific trie update operation, plus some reverse deathrow logic on non-archive nodes (this would also be needed for bulk child trie deletion if we allowed using a child trie again after deletion).

@cheme cheme added B3-apinoteworthy C1-low PR touches the given topic and has a low impact on builders. labels Jun 11, 2020
@athei athei mentioned this pull request Jun 23, 2020
@athei
Member

athei commented Jul 3, 2020

What is the status here?
@cheme Do you need feedback (and if so, from whom)?

I am asking because we need this for #6357.

@cheme
Contributor Author

cheme commented Jul 3, 2020

It is frozen; I need feedback on whether we want to make child trie deletion (and by extension possibly prefix deletion) a first-class operation (currently we only have insert key-value and delete key in changesets).

I mean @rphmeier did reply that we would not include child trie keys in proofs, so I started implementing that, but ended up with complicated redundant code (the state machine needs to behave differently depending on context), and I still think that registering a specific child trie deletion operation is the easier and cleaner thing to do.

From a quick look at the changes, things that could be trimmed:

  • remove the InnerHash code; it is an unrelated optimisation
  • revert the Manage child trie content independently #4827 change (but we still need to add additional child trie related info in the client, and organising by child trie makes things a bit more evolutive (bringing child_info into the client db layer to allow different backends depending on child type)). This relates to the discussion with @arkpar and what is referred to as a 'totally different architecture'.

The only way I see to avoid child info in the client would be to create a new storage change that contains the bulk child deletion with either the keyspace for the rocksdb delete or the fetched child trie node hashes for parity-db (but that would mean we fetch child trie content at block processing time, which makes things awkward with proofs, and makes us store a useless journal of keys when we only need the root).
In that case it would be the db type leaking into the state machine, and we would need to avoid fetching keys when running a child deletion in the state machine for proofs, so I think this idea is a bad move.

@athei
Member

athei commented Jul 3, 2020

You outlined multiple possible solutions. Are you having trouble deciding which to choose, or why is it frozen? @rphmeier @arkpar can you comment on which one you would prefer? I have too little knowledge on this topic to fully comprehend the answer.

@arkpar
Member

arkpar commented Jul 6, 2020

I still think subtree deletion should not leak into the state-db abstraction. It should really be backend agnostic, and not rely on prefixes or backend-provided identifiers. ParityDb uses reference counting, so the same node can be shared by multiple tries, potentially saving a lot of space. This does not play well with prefixed keys. The most generic way to implement it is to walk the tree and delete each individual key, avoiding any special handling of this operation. I'd suggest we don't do any DB-optimized deletion until it is shown to be a problem in practice.
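The backend-agnostic walk-and-delete approach can be sketched as follows. The `Backend` trait and `MemBackend` here are illustrative stand-ins (not Substrate's or ParityDb's actual APIs): nothing backend-specific like prefixes or identifiers is needed, only iteration and single-key removal.

```rust
use std::collections::BTreeMap;

/// Minimal backend trait, illustrative only: deletion needs nothing more
/// than iteration over a keyspace and per-key removal.
trait Backend {
    fn keys_in_keyspace(&self, keyspace: &[u8]) -> Vec<Vec<u8>>;
    fn remove(&mut self, keyspace: &[u8], key: &[u8]);
}

/// Backend-agnostic bulk deletion: walk every key and delete it one by one.
/// Returns the number of keys removed; the cost is O(n) in that number.
fn delete_child_trie<B: Backend>(backend: &mut B, keyspace: &[u8]) -> usize {
    let keys = backend.keys_in_keyspace(keyspace);
    let n = keys.len();
    for key in keys {
        backend.remove(keyspace, &key);
    }
    n
}

/// Toy in-memory backend keyed by (keyspace, key), for demonstration.
struct MemBackend(BTreeMap<(Vec<u8>, Vec<u8>), Vec<u8>>);

impl Backend for MemBackend {
    fn keys_in_keyspace(&self, keyspace: &[u8]) -> Vec<Vec<u8>> {
        self.0
            .keys()
            .filter(|(ks, _)| ks.as_slice() == keyspace)
            .map(|(_, k)| k.clone())
            .collect()
    }
    fn remove(&mut self, keyspace: &[u8], key: &[u8]) {
        self.0.remove(&(keyspace.to_vec(), key.to_vec()));
    }
}
```

Because only the trait's two methods are used, the same deletion routine works over a prefixed rocksdb store or a reference-counted ParityDb-style store alike.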

Regarding proofs: pausing recording during the deletion operation looks like the right thing to do. It is an optimization, though. The validator would not access nodes when it validates deletion, so there's no need to provide them. I'm not sure I understand the issue with the change trie. How does skipping proof collection for deletions affect the change trie root? As far as I know, change tries are only implemented for the main tree.
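The pause-recording idea can be sketched with a toy recorder (illustrative type, not the actual Substrate proof recorder): nodes touched while the pause flag is set, e.g. during a child trie deletion, never make it into the proof.

```rust
use std::collections::BTreeSet;

/// Toy proof recorder with a pause switch (not Substrate's recorder API).
struct Recorder {
    recorded: BTreeSet<Vec<u8>>,
    paused: bool,
}

impl Recorder {
    fn new() -> Self {
        Recorder { recorded: BTreeSet::new(), paused: false }
    }

    /// Record a touched trie node, unless recording is paused.
    fn record(&mut self, node: &[u8]) {
        if !self.paused {
            self.recorded.insert(node.to_vec());
        }
    }

    /// Run `f` (e.g. a subtree deletion) with recording switched off,
    /// restoring the previous state afterwards.
    fn while_paused<R>(&mut self, f: impl FnOnce(&mut Self) -> R) -> R {
        let prev = self.paused;
        self.paused = true;
        let out = f(self);
        self.paused = prev;
        out
    }
}
```

The verifier side then has to mirror this: it must not query child trie nodes when validating the deletion, since they were never recorded.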

The long-term plan for ParityDB is to evolve to support trie-related operations. That would involve not just deletion, but also reference-counted insertion and iteration at the DB level. Until then we'd better keep it simple.

@rphmeier
Contributor

rphmeier commented Jul 6, 2020

I feel too disconnected from this issue and the underlying DB roadmap to meaningfully comment. I believe the only place I have weighed in is here upon this request from @arkpar to clarify needs of some higher-level protocol. I defer to @arkpar for comments on the general approach.

@athei
Member

athei commented Jul 6, 2020

The most generic way to implement it is to walk the tree and delete each individual key, avoiding any special handling of this operation. I'd suggest we don't do any DB-optimized deletion until it is shown to be a problem in practice.

This is basically what we have now, right? You argue for keeping child trie deletion an O(n) operation?

@cheme
Contributor Author

cheme commented Jul 7, 2020

ParityDb uses reference counting, so the same node can be shared by multiple tries, potentially saving a lot of space. This does not play well with prefixed keys.

The PR does not use prefixed keys in this case: for rocksdb, the drop-tree operation contains both the child trie root and the keyspace; only in the rocksdb case do we try to apply a delete by prefix, while parity-db still iterates over the trie, but only when pruning is called.
At first I thought it would be useless with parity-db, but the good things are:

  • only a trace of the child root is kept in the pruning journal.
  • in the case of an archive node, it means that we never iterate over the child trie at all
  • a smaller change trie (but a bit less info, which in this case looks good to me)
    Still, those points are either optimizations (I don't expect the size of the journal to be problematic) or breaking (change trie).

I'd suggest we don't do any DB-optimized deletion until it is shown to be a problem in practice

Using trie iteration for rocksdb (the same way as parity-db) seems doable; it would then only help with the previous points.

Pausing recording during the deletion operation looks like the right thing to do.

That is what I did in master...cheme:proof_backends ; it means that there are different modes of execution for the state machine: one that produces the transaction payload, and one that doesn't (IIRC I added a third one for doing both (tx payload with proof not recording); I believe that is what is needed for Cumulus).
I don't remember too well why I was really not happy with this branch; maybe bringing the same logic to delete_with_prefix was a bit too much, or I just ended up seeing the new assumption about the change trie and other things as more complicated than having a delete-trie changeset operation.

I'm not sure I understand the issue with the change trie.

The change trie root can be calculated from a storage host function, so if a parachain uses it in its runtime, it would need to include all deletions in its proof (since running the proof adds all the child trie deletions to the change trie).
In practice, the value of the change trie root would probably be sent directly to the header without being used internally by the state machine and without being stored in the state trie, but it is still possible to do.
Similarly, including individual child trie deletions in the change trie means that if we want to prove the child trie, the proof will always be way bigger (as is the child trie).

As far as I know, change tries are only implemented for the main tree.

I replied too quickly to the previous point: change tries do also include child trie changes.

The long-term plan for ParityDB is to evolve to support trie-related operations. That would involve not just deletion, but also reference-counted insertion and iteration at the DB level. Until then we'd better keep it simple.

Sounds interesting; I will try to keep an eye on parity-db branches and issues.

@cheme
Contributor Author

cheme commented Jul 7, 2020

You argue for keeping child trie deletion an O(n) operation?

Note that in the parity-db case we still have an O(n) trie-traversal removal operation. Regarding the perf gains, the trade-off seems fine.
The issue I see, and I think it is your concern regarding the rent system, is that we iterate during block evaluation, whereas with this PR we only need to iterate on pruning (which ensures this O(n) cost only happens for canonical blocks and is easier to separate or reschedule).
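This deferral of the O(n) work to pruning can be sketched with toy types (a hypothetical `StateDb` with a per-block journal, not the real substrate state-db): at block processing only the child trie identifier is journaled, which is O(1); the per-key work happens when the block is pruned.

```rust
use std::collections::BTreeMap;

/// Toy state-db sketch: child trie deletions are journaled per block and
/// only executed at pruning time (illustrative, not the real state-db).
#[derive(Default)]
struct StateDb {
    /// keyspace -> child trie content
    tries: BTreeMap<Vec<u8>, BTreeMap<Vec<u8>, Vec<u8>>>,
    /// per-block journal of child tries scheduled for deletion
    journal: BTreeMap<u64, Vec<Vec<u8>>>,
}

impl StateDb {
    /// O(1) at block processing: just remember the child trie identifier.
    fn schedule_child_trie_deletion(&mut self, block: u64, keyspace: Vec<u8>) {
        self.journal.entry(block).or_default().push(keyspace);
    }

    /// O(n) at pruning: the journaled child tries are actually removed.
    /// Returns the number of keys deleted; an archive node simply never
    /// calls this, so it never pays the cost.
    fn prune(&mut self, block: u64) -> usize {
        let mut removed = 0;
        for ks in self.journal.remove(&block).unwrap_or_default() {
            if let Some(t) = self.tries.remove(&ks) {
                removed += t.len(); // per-key cost is paid here, not at import
            }
        }
        removed
    }
}
```

The journal entry is a single identifier per deleted child trie, which is what keeps the block-evaluation cost constant.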

@athei
Member

athei commented Jul 7, 2020

The issue I have with O(n) child trie removal is when contracts are evicted due to rent. If a contract has a huge amount of accumulated storage, this eviction might even be too large for one block. We would then need to implement something like lazy removal that can distribute the work across multiple blocks.
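The lazy removal idea can be sketched with a queue and a per-block key budget. Everything here is hypothetical (the `LazyRemoval` type and `process_block` hook are illustrative, not pallet code): evicted contract keyspaces are queued, and each block deletes at most `budget` keys, so arbitrarily large storage never has to go in a single block.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical lazy-removal queue: keyspaces of evicted contracts wait
/// here, and each block processes at most `budget` key deletions.
struct LazyRemoval {
    queue: VecDeque<Vec<u8>>,
    tries: BTreeMap<Vec<u8>, BTreeMap<Vec<u8>, Vec<u8>>>,
}

impl LazyRemoval {
    /// Called once per block (on_initialize-style); returns keys deleted.
    fn process_block(&mut self, budget: usize) -> usize {
        let mut deleted = 0;
        while deleted < budget {
            let ks = match self.queue.front() {
                Some(ks) => ks.clone(),
                None => break, // nothing left to clean up
            };
            match self.tries.get_mut(&ks) {
                None => {
                    // already gone: drop the queue entry
                    self.queue.pop_front();
                }
                Some(trie) => {
                    if let Some(key) = trie.keys().next().cloned() {
                        trie.remove(&key);
                        deleted += 1;
                    } else {
                        // trie fully emptied: remove it, move to the next one
                        self.tries.remove(&ks);
                        self.queue.pop_front();
                    }
                }
            }
        }
        deleted
    }
}
```

A real implementation would also weigh the budget against remaining block weight, but the chunking principle is the same.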

@arkpar
Member

arkpar commented Jul 7, 2020

The most generic way to implement it is to walk the tree and delete each individual key, avoiding any special handling of this operation. I'd suggest we don't do any DB-optimized deletion until it is shown to be a problem in practice.

This is basically what we have now, right? You argue for keeping child trie deletion an O(n) operation?

It is O(n) best case anyway. The only difference is that with rocksdb you get the O(n) on commit, and not on the actual delete operation, unless you keep each subtree in a separate database. We can indeed optimize to perform it in the background at the database level, but I'd argue it is too early for that.

@arkpar
Member

arkpar commented Jul 7, 2020

I think it does not make much sense for change tries to collect all removed keys for subtries. If they do, there's your O(n) right there, and there's little sense in optimizing it in the database, since you have to do the iteration anyway.
There needs to be a single event recorded in the parent tree that allows tracking that all subtree data has been removed.

@athei
Member

athei commented Jul 7, 2020

I see. Thanks for your answers. We then need to cope with the fact that eviction is a linear-time operation.

@Robbepop
Contributor

Robbepop commented Jul 7, 2020

From my current understanding, if we cannot get around the fact that contract eviction is linear time, we should at least make it possible to perform the clean-up lazily, defer it when possible, or break it up into chunks. Otherwise we end up having no way to clean up contract instances that are especially big. Might this even be an attack vector in certain cases? Maybe there are other operations that need to be deferred or chunked into pieces distributed among several blocks, so that we could find a generalized solution to this problem?

@gnunicorn
Contributor

Closing because of inactivity.

@gnunicorn gnunicorn closed this Sep 9, 2020
@cheme
Contributor Author

cheme commented Sep 9, 2020

There needs to be a single event recorded in the parent tree that allows tracking that all subtree data has been removed.

Using a specific child trie delete operation does inherently cover this point.

Closing note:
Bulk child trie deletion still seems very relevant to me.
It is, though, more in line with an architecture where each child trie is managed independently (which is needed for multiple other things and opens perspectives for other kinds of state).
So, in line with #4827, I feel this will be needed in the future (I feel that switching off deletion when recording proofs is more complex than this approach).

@athei athei mentioned this pull request Nov 6, 2020
@cheme
Contributor Author

cheme commented Nov 23, 2020

Also, for #6594, having a single child trie delete info sent back to the client makes the non-O(1) cost only happen at pruning (or never for archive nodes), and it could be easier to implement this way.

@cheme
Contributor Author

cheme commented Apr 8, 2021

Note: this PR lacks a correct trie root calculation in state-machine. Code such as https://github.com/cheme/trie/blob/379bf92720e23b2465badd4739c7d2ba75115c06/trie-db/src/traverse.rs#L658 (old code, please don't use) mainly adds a detach-branch-from-trie action (it can be implemented in triedbmut too, but that is more involved).

Labels
A3-in_progress Pull request is in progress. No review needed at this stage. C1-low PR touches the given topic and has a low impact on builders.