
Proposals to increase state modification performance #4310

Closed
mossid opened this issue May 8, 2019 · 6 comments
Comments

@mossid
Contributor

mossid commented May 8, 2019

The current SDK state uses an IAVL tree, which is backed by goleveldb. The IAVL tree is a Merkle tree exposing a key-value mapping interface, and goleveldb is a key-value database that lives on disk. There are two major bottlenecks in the current workflow:

+--------+Marshal+--------+ Set  +--------+
| object +------>+ []byte +----->+ nodedb |
+---+----+       +-+-+----+      +----+---+
    ^    Unmarshal | ^      Get       |
    +--------------+ +----------------+

Marshal/Unmarshal is the conversion between language-native objects and universal-format byte slices. Certain data has to be marshaled and unmarshaled in each transaction (more specifically, each deliverTx), even if it was already marshaled and unmarshaled earlier within the same block.

Get/Set is database access: a byte slice can be written to disk and read back. In the IAVL implementation, a get can be done with a binary search and does not need to modify existing nodes, but a set requires re-merkleizing all parent nodes, which takes much more time. This cost grows logarithmically with the tree depth, so having more key-value pairs in the tree means more time spent re-merkleizing.
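
To make the marshaling cost concrete, here is a minimal sketch of the access pattern a module typically follows (the KVStore interface is trimmed down, Balance is a hypothetical piece of module state, and encoding/json stands in for the real codec; none of this is actual SDK code):

```go
package example

import "encoding/json" // stand-in for the SDK codec

// KVStore is the subset of the store interface used in this sketch.
type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
}

// Balance is a hypothetical piece of module state.
type Balance struct {
	Amount int64 `json:"amount"`
}

// AddToBalance shows the cost paid on every deliverTx: each read decodes
// the stored bytes and each write re-encodes them and re-merkleizes the
// parent nodes, even if the same object was touched earlier in the block.
func AddToBalance(store KVStore, key []byte, delta int64) {
	var bal Balance
	if bz := store.Get(key); bz != nil { // IAVL/disk read
		_ = json.Unmarshal(bz, &bal) // bytes -> object, every tx
	}
	bal.Amount += delta
	bz, _ := json.Marshal(bal) // object -> bytes, every tx
	store.Set(key, bz)         // triggers re-merkleization of parent nodes
}
```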

There are three methods that could solve these problems without modifying IAVL or the underlying database, purely on the SDK side.

All of the methods expect a fixed-size set of key-value pairs that is updated very frequently, almost every block. Data updated in EndBlocker is a good example.

The benchmark assumes each value has a size of 200 bytes and the total number of key-value pairs is 160000, for a total size of 200*160000 = 32MB.

flushkv

flushkv.Store consists of two parent CommitKVStores, called main and hash. The store also has a cache, which is a cache-wrapped main. On a KVStore method call, it forwards the call to cache and hash:

store.{Get, Has, Iterator}(key) => store.cache.{Get, Has, Iterator}(key)
store.{Set, Delete}(key, value) => store.cache.{Set, Delete}(key, value); store.hash.{Set, Delete}(key, hash(value))

When Commit is called, it first checks the current height: if it is a multiple of a predefined flushRate, it commits both cache and hash. If not, it only commits hash.

In summary, flushkv.Store stores only the hash of each value in order to reduce the set time, and stores the full values only at checkpoint heights. The full values are always kept in memory (by the cache-wrapped main), so they can still be queried and used by the app.
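
A rough sketch of the idea, assuming simplified store interfaces, a plain map standing in for the cache wrap, and sha256 as the hash function (all choices made only for illustration):

```go
package flushkv

import "crypto/sha256"

// CommitKVStore is a trimmed stand-in for the SDK's CommitKVStore.
type CommitKVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
	Commit()
}

type Store struct {
	main      CommitKVStore     // full values, flushed only at checkpoints
	hash      CommitKVStore     // hash(value) per key, committed every block
	cache     map[string][]byte // cache wrap over main, kept in memory
	flushRate int64
	height    int64
}

// Get serves reads from the in-memory cache, falling back to main.
func (s *Store) Get(key []byte) []byte {
	if v, ok := s.cache[string(key)]; ok {
		return v
	}
	return s.main.Get(key)
}

// Set keeps the full value in memory and writes only its hash to the tree.
func (s *Store) Set(key, value []byte) {
	s.cache[string(key)] = value
	h := sha256.Sum256(value)
	s.hash.Set(key, h[:])
}

// Commit commits hash every block, and flushes the full values at checkpoints.
func (s *Store) Commit() {
	s.height++
	s.hash.Commit() // cheap: only small hashes hit the IAVL tree each block
	if s.height%s.flushRate == 0 {
		for k, v := range s.cache { // checkpoint: persist the full values
			s.main.Set([]byte(k), v)
		}
		s.main.Commit()
	}
}
```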

The downside is that when the node process stops, the data stored in the cache is lost, so the node has to restart state machine execution from the last checkpoint. However, we can make the application dump the current cache to disk when it is stopped by an interrupt, or make the flushRate small enough.

The benchmark shows about a 30% time reduction.

BenchmarkData/1000-32kb-IAVLFlushNonpRaw           1        3607ms/op  159000rw      0.0227ms/rw       19017kb alloc
BenchmarkData/1000-32kb-IAVLNonpRaw                1        5347ms/op  159000rw      0.0336ms/rw       18928kb alloc

batched values

If the values are going to be updated frequently, then storing them as a single struct will be more efficient, because setting one large key-value pair is faster than setting multiple small key-value pairs. However, it is hard to simply concatenate unrelated objects, as this forces the client to query the whole batch, which contains a lot of unnecessary information.

One way is to treat a KVStore as a MultiStore. Each leaf of the IAVL store represents a batch of key-value pairs, not a single pair.

I haven't researched how we can effectively batch the data, as this is a proposal, not an implementation detail (yet). One of the ideas is forming a virtual memory segment on a byte slice. A segment consists of a table space and a memory space. The table is an array of pointers {key, offset, size}, each referring to the specific range of the memory space (space[offset:offset+size]) where the value associated with the key lives. Since the pointers have a fixed length, a user can simply binary search for the desired key, decode the offset and size, slice the space, and unmarshal only that slice. Since modules can specify which value will be stored under which segment (as it is retrieved through the same MultiStore interface), they can optimize by storing related data in the same segment.
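
A sketch of such a segment lookup, assuming fixed-length keys and a {key, offset, size} table-entry encoding chosen only for illustration:

```go
package segment

import (
	"bytes"
	"encoding/binary"
	"sort"
)

const (
	keyLen   = 8               // assumed fixed-length keys for this sketch
	entryLen = keyLen + 4 + 4 // key | uint32 offset | uint32 size
)

// Get binary-searches the table for key and returns the value's byte range
// from the memory space, so only that slice needs to be unmarshaled.
// The table entries are assumed to be sorted by key.
func Get(segment []byte, tableEntries int, key []byte) ([]byte, bool) {
	table := segment[:tableEntries*entryLen]
	space := segment[tableEntries*entryLen:]

	i := sort.Search(tableEntries, func(i int) bool {
		entryKey := table[i*entryLen : i*entryLen+keyLen]
		return bytes.Compare(entryKey, key) >= 0
	})
	if i == tableEntries || !bytes.Equal(table[i*entryLen:i*entryLen+keyLen], key) {
		return nil, false
	}
	offset := binary.BigEndian.Uint32(table[i*entryLen+keyLen:])
	size := binary.BigEndian.Uint32(table[i*entryLen+keyLen+4:])
	return space[offset : offset+size], true
}
```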

The benchmark shows an 88% time reduction (not considering segment overhead), where each batch contains 160 pairs (approx. 32KB) and the total number of batches is 1000.

BenchmarkData/1000-32kb-IAVLFlushNonpBatch         3         425ms/op  159000rw      0.0027ms/rw        6320kb alloc
BenchmarkData/1000-32kb-IAVLFlushNonpRaw           1        3607ms/op  159000rw      0.0227ms/rw       19017kb alloc

persinterf

persinterf.Store is an in-memory persistent interface store. The type InterfaceStore has an interface similar to KVStore, but with different methods for Get and Set:

Get(key []byte, ptr Interface)
Set(key []byte, value Interface)

where the type Interface has the methods Marshal and Unmarshal. persinterf.Store persists the pairs in its field cache map[string]types.Interface; if the requested key exists in the cache, the cached object is copied to the pointer. If not, it reads the value from the parent store and caches it.
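
A minimal sketch of this behavior, with trimmed stand-ins for Interface and KVStore, and a reflect-based copy as just one possible way to fill the caller's pointer (assuming matching concrete pointer types):

```go
package persinterf

import "reflect"

// Interface mirrors the marshal/unmarshal contract mentioned above.
type Interface interface {
	Marshal() ([]byte, error)
	Unmarshal([]byte) error
}

// KVStore is the byte-level parent store.
type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
}

type Store struct {
	parent KVStore
	cache  map[string]Interface // survives Commit: inter-block object cache
}

// Get fills ptr with the value stored under key, hitting the parent store
// and the codec only on a cache miss.
func (s *Store) Get(key []byte, ptr Interface) {
	if obj, ok := s.cache[string(key)]; ok {
		// cache hit: no disk read, no unmarshal; copy the cached object
		// into the caller's pointer (assumes matching concrete types)
		reflect.ValueOf(ptr).Elem().Set(reflect.ValueOf(obj).Elem())
		return
	}
	if bz := s.parent.Get(key); bz != nil {
		_ = ptr.Unmarshal(bz)      // decode once...
		s.cache[string(key)] = ptr // ...and keep the object across blocks
	}
}

// Set caches the object and writes its encoding through to the parent.
func (s *Store) Set(key []byte, value Interface) {
	s.cache[string(key)] = value
	bz, _ := value.Marshal()
	s.parent.Set(key, bz)
}
```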

On commit, unlike cache stores, it does not clear the cache but keeps holding it. This essentially works as an inter-block object cache. Except for that, the store works identically to the IAVL store, and does not alter the appHash when applied.

The benchmark shows a 40% time reduction.

BenchmarkData/1000-32kb-IAVLFlushPersBatch         5         255ms/op  159000rw      0.0016ms/rw        1355kb alloc
BenchmarkData/1000-32kb-IAVLFlushNonpBatch         3         425ms/op  159000rw      0.0027ms/rw        6320kb alloc
@alexanderbez
Contributor

alexanderbez commented May 9, 2019

Neat! Thanks for writing up this proposal @mossid! Here are some thoughts of mine:


flushkv

This seems like a neat idea, but aren't we now also incurring the storage overhead of <key, hash(value)> pairs? Also, why have a hash CommitKVStore? What purpose does this store serve? What will ever be retrieved from it?

As to the drawbacks of the checkpoint idea:

However, we can just make the application dump the current cache to the disk when it stops by an interrupt, ...

Can we really guarantee this? What happens if we don't persist? How do we revert to the previous checkpoint? Sounds complicated.

batched values

This seems like a more efficient, yet more complicated approach. My fear here is that there are necessary changes to how modules wish to define and set their "segments". This will create a greater burden on SDK developers and users because they'd have to have a decent understanding of such a concept. Perhaps it's best if we can come up with a solution that abstracts everything away from the end developer.

persinterf

I think this is something we can actually do! But does this save us any time complexity on re-merkleizing the tree? Also, since we possibly keep the cached items between blocks, do we have to worry about stale data, or are the key-value pairs replaced? Also, how large do we keep the cache, and how long do objects live? Is it a fixed-size LRU cache?

/cc @cwgoes

@mossid
Contributor Author

mossid commented May 10, 2019

flushkv

Yes, there will be storage overhead for hash. There are two reasons we need hash: Merkle proofs and the appHash. If we do not store the hashes, the in-memory data will not be included in the appHash; in that case, if the state diverges between validators, it can only be detected at the checkpoints, not at each block. The hash store also works as a Merkle proof generator, since we can query IAVL stores. I should design another benchmark for how much disk space it consumes, but disk storage may be the most extensible resource among the validator node's resources, so I think it is an acceptable tradeoff.

I think I misexplained the checkpoint concept. Checkpointing is necessary for the flushkv store. If we don't checkpoint at all, every node has to re-run the whole blockchain whenever the node process is closed, no matter whether the rest of the state is stored on disk or not. If we checkpoint every 100 blocks and the node stopped at height 1150, it only has to re-run the blocks from 1100.
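
For illustration, the replay point is just the last multiple of flushRate below the stopping height:

```go
package example

// LastCheckpoint returns the most recent checkpoint height for a node that
// stopped at the given height, with checkpoints every flushRate blocks.
func LastCheckpoint(height, flushRate int64) int64 {
	return height - height%flushRate
}

// LastCheckpoint(1150, 100) == 1100, so blocks 1101..1150 are re-executed.
```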

State dumping is another functionality, separate from checkpointing. Checkpointing is expected to be done by every validator at some interval of blocks and is part of the state machine, but state dumping is literally dumping all state when a node stops. If the state dump fails, the node can still run from the latest checkpoint. The latest checkpoint is just an IAVL tree, only at a past version, so it can be easily loaded. The rootmultistore has to be updated to check the loadable versions of the trees before actually loading.

batched values

If we implement this "segment" successfully, the interface for end developers may not change drastically. From their perspective, the store that they get with the storekey is just a multistore with a large number of kvstores mounted, each of which has a limited size for storing key-value pairs. I have also considered using flatbuffers, but that is too complicated for our purpose.

However, I agree that this approach is a bit complicated and may cause client-side overhead. I don't think this is for general use cases, but rather a special feature for high-I/O application developers who have a prior understanding of the SDK.

persinterf

So in this benchmark, the data is defined as protobuf, and persinterf actually worked slower when not used with batch/flushkv. But we are using amino, so it would work in our case.

This does not save re-merkleizing time, only marshal/unmarshal/get time. The benchmark assumed that the keys form a fixed-size set, but we could easily overlay an LRU cache on it, I guess. For now, it persists forever, growing without bound.

@alexanderbez

@alexanderbez
Contributor

Seems like flushkv is the best of the suggested approaches, perhaps with a few small alterations. Also, we should keep in mind how this will play with the upcoming state sync work.

@ebuchman
Member

Cross linking to meta issue re performance concerns: #2131

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 15, 2020
@tac0turtle tac0turtle removed the stale label Apr 3, 2023
@tac0turtle
Member

Closing this as it is covered in the store v2 work that will begin soon.
