
Proposals to increase state modification performance #4310

Closed
mossid opened this issue May 8, 2019 · 6 comments
Comments

@mossid
Contributor

mossid commented May 8, 2019

The current SDK state uses an IAVL tree, which is backed by goleveldb. The IAVL tree is a Merkle tree exposing a key-value mapping interface, and goleveldb is a key-value database that lives on disk. There are two major bottlenecks in the current workflow:

+--------+Marshal+--------+ Set  +--------+
| object +------>+ []byte +----->+ nodedb |
+---+----+       +-+-+----+      +----+---+
    ^    Unmarshal | ^      Get       |
    +--------------+ +----------------+

Marshal/Unmarshal is the conversion between language-native objects and universal-format byte slices. Certain data has to be marshaled and unmarshaled in each transaction (more specifically, each deliverTx), even if it was already marshaled and unmarshaled earlier within the same block.

Get/Set is database access: a byte slice can be written to disk and read back. In the IAVL implementation, a get can be done with a binary search and does not need to modify existing nodes, but a set requires re-merkleizing all parent nodes, which takes much more time. This cost grows logarithmically with the tree depth, so having more key-value pairs in the tree means more time spent re-merkleizing.
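
To make the marshaling cost concrete, here is a minimal sketch of the access pattern a module typically follows (the KVStore interface is trimmed down, Balance is a hypothetical piece of module state, and encoding/json stands in for the real codec; none of this is actual SDK code):

```go
package example

import "encoding/json" // stand-in for the SDK codec

// KVStore is the subset of the store interface used in this sketch.
type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
}

// Balance is a hypothetical piece of module state.
type Balance struct {
	Amount int64 `json:"amount"`
}

// AddToBalance shows the cost paid on every deliverTx: each read decodes
// the stored bytes and each write re-encodes them and re-merkleizes the
// parent nodes, even if the same object was touched earlier in the block.
func AddToBalance(store KVStore, key []byte, delta int64) {
	var bal Balance
	if bz := store.Get(key); bz != nil { // IAVL/disk read
		_ = json.Unmarshal(bz, &bal) // bytes -> object, every tx
	}
	bal.Amount += delta
	bz, _ := json.Marshal(bal) // object -> bytes, every tx
	store.Set(key, bz)         // triggers re-merkleization of parent nodes
}
```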

There are three methods that could solve these problems without modifying IAVL or the underlying database, purely on the SDK side.

All of the methods expect a fixed-size set of key-value pairs that is updated very frequently, almost every block. Data updated in EndBlocker is a good example.

The benchmark assumes each value has a size of 200 bytes and the total number of key-value pairs is 160000, for a total size of 200*160000 = 32MB.

flushkv

flushkv.Store consists of two parent CommitKVStores, called main and hash. The store also has a cache, which is a cache-wrapped main. On a KVStore method call, it forwards the call to cache and hash:

store.{Get, Has, Iterator}(key) => store.cache.{Get, Has, Iterator}(key)
store.{Set, Delete}(key, value) => store.cache.{Set, Delete}(key, value); store.hash.{Set, Delete}(key, hash(value))

When Commit is called, it first checks the current height: if it is a multiple of a predefined flushRate, it commits both cache and hash. If not, it only commits hash.

In summary, flushkv.Store stores only the hash of each value in order to reduce the set time, and stores the full values only at checkpoint heights. The full values are always kept in memory (by the cache-wrapped main), so they can still be queried and used by the app.
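
A rough sketch of the idea, assuming simplified store interfaces, a plain map standing in for the cache wrap, and sha256 as the hash function (all choices made only for illustration):

```go
package flushkv

import "crypto/sha256"

// CommitKVStore is a trimmed stand-in for the SDK's CommitKVStore.
type CommitKVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
	Commit()
}

type Store struct {
	main      CommitKVStore     // full values, flushed only at checkpoints
	hash      CommitKVStore     // hash(value) per key, committed every block
	cache     map[string][]byte // cache wrap over main, kept in memory
	flushRate int64
	height    int64
}

// Get serves reads from the in-memory cache, falling back to main.
func (s *Store) Get(key []byte) []byte {
	if v, ok := s.cache[string(key)]; ok {
		return v
	}
	return s.main.Get(key)
}

// Set keeps the full value in memory and writes only its hash to the tree.
func (s *Store) Set(key, value []byte) {
	s.cache[string(key)] = value
	h := sha256.Sum256(value)
	s.hash.Set(key, h[:])
}

// Commit commits hash every block, and flushes the full values at checkpoints.
func (s *Store) Commit() {
	s.height++
	s.hash.Commit() // cheap: only small hashes hit the IAVL tree each block
	if s.height%s.flushRate == 0 {
		for k, v := range s.cache { // checkpoint: persist the full values
			s.main.Set([]byte(k), v)
		}
		s.main.Commit()
	}
}
```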

The downside is that when the node process stops, the data stored in the cache is lost, so the node has to restart state machine execution from the last checkpoint. However, we can make the application dump the current cache to disk when it is stopped by an interrupt, or make the flushRate small enough.

The benchmark shows about a 30% time reduction.

BenchmarkData/1000-32kb-IAVLFlushNonpRaw           1        3607ms/op  159000rw      0.0227ms/rw       19017kb alloc
BenchmarkData/1000-32kb-IAVLNonpRaw                1        5347ms/op  159000rw      0.0336ms/rw       18928kb alloc

batched values

If the values are going to be updated frequently, then storing them as a single struct will be more efficient, because setting one large key-value pair is faster than setting multiple small key-value pairs. However, it is hard to simply concatenate unrelated objects, as this forces the client to query the whole batch, which contains a lot of unnecessary information.

One way is to treat a KVStore as a MultiStore. Each leaf of the IAVL store represents a batch of key-value pairs, not a single pair.

I haven't researched how we can effectively batch the data, as this is a proposal, not an implementation detail (yet). One of the ideas is forming a virtual memory segment on a byte slice. A segment consists of a table space and a memory space. The table is an array of pointers {key, offset, size}, each referring to the specific range of the memory space (space[offset:offset+size]) where the value associated with the key lives. Since the pointers have a fixed length, a user can simply binary search for the desired key, decode the offset and size, slice the space, and unmarshal only that slice. Since modules can specify which value will be stored under which segment (as it is retrieved through the same MultiStore interface), they can optimize by storing related data in the same segment.
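
A sketch of such a segment lookup, assuming fixed-length keys and a {key, offset, size} table-entry encoding chosen only for illustration:

```go
package segment

import (
	"bytes"
	"encoding/binary"
	"sort"
)

const (
	keyLen   = 8               // assumed fixed-length keys for this sketch
	entryLen = keyLen + 4 + 4 // key | uint32 offset | uint32 size
)

// Get binary-searches the table for key and returns the value's byte range
// from the memory space, so only that slice needs to be unmarshaled.
// The table entries are assumed to be sorted by key.
func Get(segment []byte, tableEntries int, key []byte) ([]byte, bool) {
	table := segment[:tableEntries*entryLen]
	space := segment[tableEntries*entryLen:]

	i := sort.Search(tableEntries, func(i int) bool {
		entryKey := table[i*entryLen : i*entryLen+keyLen]
		return bytes.Compare(entryKey, key) >= 0
	})
	if i == tableEntries || !bytes.Equal(table[i*entryLen:i*entryLen+keyLen], key) {
		return nil, false
	}
	offset := binary.BigEndian.Uint32(table[i*entryLen+keyLen:])
	size := binary.BigEndian.Uint32(table[i*entryLen+keyLen+4:])
	return space[offset : offset+size], true
}
```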

The benchmark shows an 88% time reduction (not considering segment overhead), where each batch contains 160 pairs (approx. 32KB) and the total number of batches is 1000.

BenchmarkData/1000-32kb-IAVLFlushNonpBatch         3         425ms/op  159000rw      0.0027ms/rw        6320kb alloc
BenchmarkData/1000-32kb-IAVLFlushNonpRaw           1        3607ms/op  159000rw      0.0227ms/rw       19017kb alloc

persinterf

persinterf.Store is an in-memory persistent interface store. The type InterfaceStore has an interface similar to KVStore, but with different methods for Get and Set:

Get(key []byte, ptr Interface)
Set(key []byte, value Interface)

where the type Interface has the methods Marshal and Unmarshal. persinterf.Store persists the pairs in its field cache map[string]types.Interface; if the requested key exists in the cache, the cached object is copied to the pointer. If not, it reads the value from the parent store and caches it.
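
A minimal sketch of this behavior, with trimmed stand-ins for Interface and KVStore, and a reflect-based copy as just one possible way to fill the caller's pointer (assuming matching concrete pointer types):

```go
package persinterf

import "reflect"

// Interface mirrors the marshal/unmarshal contract mentioned above.
type Interface interface {
	Marshal() ([]byte, error)
	Unmarshal([]byte) error
}

// KVStore is the byte-level parent store.
type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
}

type Store struct {
	parent KVStore
	cache  map[string]Interface // survives Commit: inter-block object cache
}

// Get fills ptr with the value stored under key, hitting the parent store
// and the codec only on a cache miss.
func (s *Store) Get(key []byte, ptr Interface) {
	if obj, ok := s.cache[string(key)]; ok {
		// cache hit: no disk read, no unmarshal; copy the cached object
		// into the caller's pointer (assumes matching concrete types)
		reflect.ValueOf(ptr).Elem().Set(reflect.ValueOf(obj).Elem())
		return
	}
	if bz := s.parent.Get(key); bz != nil {
		_ = ptr.Unmarshal(bz)      // decode once...
		s.cache[string(key)] = ptr // ...and keep the object across blocks
	}
}

// Set caches the object and writes its encoding through to the parent.
func (s *Store) Set(key []byte, value Interface) {
	s.cache[string(key)] = value
	bz, _ := value.Marshal()
	s.parent.Set(key, bz)
}
```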

On commit, unlike cache stores, it does not clear the cache but keeps holding it. This essentially works as an inter-block object cache. Except for that, the store works identically to the IAVL store, and does not alter the appHash when applied.

The benchmark shows a 40% time reduction.

BenchmarkData/1000-32kb-IAVLFlushPersBatch         5         255ms/op  159000rw      0.0016ms/rw        1355kb alloc
BenchmarkData/1000-32kb-IAVLFlushNonpBatch         3         425ms/op  159000rw      0.0027ms/rw        6320kb alloc
@alexanderbez
Contributor

alexanderbez commented May 9, 2019

Neat! Thanks for writing up this proposal @mossid! Here are some thoughts of mine:


flushkv

This seems like a neat idea, but aren't we now also incurring the storage overhead of <key, hash(value)> pairs? Also, why have a hash CommitKVStore? What purpose does this store serve? What will ever be retrieved from it?

As to the drawbacks of the checkpoint idea:

However, we can just make the application dump the current cache to the disk when it stops by an interrupt, ...

Can we really guarantee this? What happens if we don't persist? How do we revert to the previous checkpoint? Sounds complicated.

batched values

This seems like a more efficient, yet more complicated approach. My fear here is that there are necessary changes to how modules wish to define and set their "segments". This will create a greater burden on SDK developers and users because they'd have to have a decent understanding of such a concept. Perhaps it's best if we can come up with a solution that abstracts everything away from the end developer.

persinterf

I think this is something we can actually do! But does this save us any time complexity on re-merkleizing the tree? Also, since we possibly keep the cached items between blocks, do we have to worry about stale data, or are the key-value pairs replaced? Also, how large do we keep the cache, and how long do objects live? Is it a fixed-size LRU cache?

/cc @cwgoes

@mossid
Contributor Author

mossid commented May 10, 2019

flushkv

Yes, there will be storage overhead for hash. There are two reasons we need hash: Merkle proofs and the appHash. If we do not store the hashes, the in-memory data will not be included in the appHash; in that case, if the state diverges between validators, it can only be detected at the checkpoints, not at each block. The hash store also works as a Merkle proof generator, since we can query IAVL stores. I should design another benchmark for how much disk space it consumes, but disk storage may be the most extensible resource among the validator node's resources, so I think it is an acceptable tradeoff.

I think I misexplained the checkpoint concept. Checkpointing is necessary for the flushkv store. If we don't checkpoint at all, every node has to re-run the whole blockchain whenever the node process is closed, no matter whether the rest of the state is stored on disk or not. If we checkpoint every 100 blocks and the node stopped at height 1150, it only has to re-run the blocks from 1100.
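
For illustration, the replay point is just the last multiple of flushRate below the stopping height:

```go
package example

// LastCheckpoint returns the most recent checkpoint height for a node that
// stopped at the given height, with checkpoints every flushRate blocks.
func LastCheckpoint(height, flushRate int64) int64 {
	return height - height%flushRate
}

// LastCheckpoint(1150, 100) == 1100, so blocks 1101..1150 are re-executed.
```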

State dumping is another functionality, separate from checkpointing. Checkpointing is expected to be done by every validator at some interval of blocks and is part of the state machine, but state dumping is literally dumping all state when a node stops. If the state dump fails, the node can still run from the latest checkpoint. The latest checkpoint is just an IAVL tree, only at a past version, so it can be easily loaded. The rootmultistore has to be updated to check the loadable versions of the trees before actually loading.

batched values

If we implement this "segment" successfully, the interface for end developers may not change drastically. From their perspective, the store that they get with the storekey is just a multistore with a large number of kvstores mounted, each of which has a limited size for storing key-value pairs. I have also considered using flatbuffers, but that is too complicated for our purpose.

However, I agree that this approach is a bit complicated and may cause client-side overhead. I don't think this is for general use cases, but rather a special feature for high-I/O application developers who have a prior understanding of the SDK.

persinterf

So in this benchmark, the data is defined as protobuf, and persinterf actually worked slower when not used with batch/flushkv. But we are using amino, so it would work in our case.

This does not save re-merkleizing time, only marshal/unmarshal/get time. The benchmark assumed that the keys form a fixed-size set, but we could easily overlay an LRU cache on it, I guess. For now, it persists forever, growing without bound.

@alexanderbez

@alexanderbez
Contributor

Seems like flushkv is the best of the suggested approaches, perhaps with a few small alterations. Also, we should keep in mind how this will play with the upcoming state sync work.

@ebuchman
Member

Cross linking to meta issue re performance concerns: #2131

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 15, 2020
@tac0turtle tac0turtle removed the stale label Apr 3, 2023
@tac0turtle
Member

Closing this as it is covered in the store v2 work that will begin soon.
