Use immutable key for nodes to avoid excessive LevelDB reorgs #137

Closed
mdyring opened this issue May 13, 2019 · 13 comments

@mdyring commented May 13, 2019

Follow-up of cosmos/cosmos-sdk#2131.

After discussion on Slack, here is a suggested fix:

  1. For each new node in the IAVL tree, assign a sequential integer ID. This ID never changes for the node.
  2. Use this integer ID as the key in the <key, value> pair stored in LevelDB.
  3. The value part stored in LevelDB contains the node hash (the existing LevelDB key).
  4. All parent/child references must be through integer IDs.
  5. A special "root" key stored in LevelDB has a value that contains the integer ID of the root node. (A sketch of this scheme follows the list.)
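
For concreteness, a minimal sketch of what such a keying scheme could look like (the prefix byte, helper names, and "root" key below are illustrative, not anything IAVL does today):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// nodeKey builds the LevelDB key for a node under the proposed scheme: a
// one-byte prefix plus the node's immutable sequential ID in big-endian form.
func nodeKey(id uint64) []byte {
	k := make([]byte, 1+8)
	k[0] = 'n' // hypothetical prefix for node records
	binary.BigEndian.PutUint64(k[1:], id)
	return k
}

// rootKey is the special key whose value would hold the integer ID of the
// current root node (step 5 above).
var rootKey = []byte("root")

func main() {
	// The value stored under nodeKey(id) would contain the node hash (today's
	// LevelDB key) plus child references expressed as integer IDs.
	fmt.Printf("node key for ID 42: %x\n", nodeKey(42))
	fmt.Printf("root pointer key:   %s\n", rootKey)
}
```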
@mossid commented May 16, 2019

Implementation suggestion:

I think using the key (or the hash of the key) will be better; a sequentially assigned integer key cannot be obtained from the node alone and would have to be stored separately.

Also, we have to separate the hash bytes and key bytes of the nodes. Currently, the full hash of the node is used both for calculating the root hash and as the LevelDB key. We can keep the current hash format and only change the LevelDB key format to something immutable.

I suggest a hash(node.key)|bigendian(node.version) format. The node.version part will be empty in the case of the mutable tree and filled in the case of the immutable tree. This preserves backward compatibility (since the root hash calculation is not changed) and compatibility between nodes using immutable/mutable versions of IAVL (since the transition between them can be done simply by emptying/filling node.version).
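
A rough sketch of how that key could be assembled (sha256 and the helper name are placeholders; IAVL's actual node hashing and encoding differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// dbKey builds the proposed immutable LevelDB key: hash(node.key) followed by
// the node's version as a big-endian uint64. The version part is left off
// while the node only lives in the mutable (uncommitted) tree.
func dbKey(nodeKey []byte, version uint64, committed bool) []byte {
	h := sha256.Sum256(nodeKey) // sha256 is illustrative; IAVL's hashing differs
	out := append([]byte{}, h[:]...)
	if committed {
		var v [8]byte
		binary.BigEndian.PutUint64(v[:], version)
		out = append(out, v[:]...)
	}
	return out
}

func main() {
	fmt.Printf("uncommitted: %x\n", dbKey([]byte("balance/alice"), 0, false))
	fmt.Printf("version 7:   %x\n", dbKey([]byte("balance/alice"), 7, true))
}
```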

@ethanfrey (Contributor)

One issue is that we can never overwrite nodes as long as we want to maintain history.
This is why iavl uses the unique hashes (including version) for the keys.

If history can be sacrificed, we can definitely make major speed-ups. But I think it is very important to allow some historical queries for consistent reads when block times are fast. On the other hand, we may be able to use a persistence layer that handles multiple snapshots very efficiently, and thereby avoid this orphan and garbage-collection overhead.

@ethanfrey (Contributor)

Are there concrete performance numbers (benchmarks and pprof) that show this root issue?

@ebuchman (Member)

I suggest hash(node.key)|bigendian(node.version) format.

This sounds reasonable. Fully derived from the node, unique, immutable. The node.key will always be the same (right?), but the node.version will change any time the value does, so the nodes are still immutable.

Here's some related info not about IAVL but Tendermint DB performance.

For one, Tendermint seems to slow down over time: tendermint/tendermint#1835

It seems to be primarily related to indexing txs in the Tendermint daemon (i.e. hashes in LevelDB).

Anton opened an issue on GoLevelDB with more details, and some info from a go-eth dev and a user who decided to use RocksDB: syndtr/goleveldb#226. RocksDB apparently might perform much better for SSDs.

@AlexeyAkhunov did some investigations, and might be able to give some advice here too.

@mossid commented May 17, 2019

node.version will change any time the value does, so the nodes are still immutable.

However, when the tree is running in mutable mode, node.version will be empty (not sure yet whether that should be bigendian(0), bigendian(maxuint64), or nil), so when the value changes or the tree rotates, the key does not change and the existing node is overwritten.

@ebuchman (Member)

Sounds like we might want to look into RocksDB. It has a native snapshotting feature that we might be able to use instead of having to build it ourselves. Then we could just use the key itself as the persistent key in RocksDB and let RocksDB deal with the versioning!
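
For illustration only, reading through a RocksDB snapshot via the tecbot/gorocksdb binding might look roughly like this (the binding choice and path are assumptions; note that RocksDB snapshots are in-memory read views, so surviving restarts would still require checkpoints or similar):

```go
package main

import (
	"log"

	"github.com/tecbot/gorocksdb"
)

func main() {
	opts := gorocksdb.NewDefaultOptions()
	opts.SetCreateIfMissing(true)
	db, err := gorocksdb.OpenDb(opts, "/tmp/iavl-rocksdb-demo")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	wo := gorocksdb.NewDefaultWriteOptions()
	db.Put(wo, []byte("k"), []byte("v1"))

	// Take a snapshot: reads through it see the DB as of this point.
	snap := db.NewSnapshot()
	defer db.ReleaseSnapshot(snap)

	db.Put(wo, []byte("k"), []byte("v2")) // later write, invisible to the snapshot

	ro := gorocksdb.NewDefaultReadOptions()
	ro.SetSnapshot(snap)
	val, err := db.Get(ro, []byte("k"))
	if err != nil {
		log.Fatal(err)
	}
	defer val.Free()
	log.Printf("value at snapshot: %s", val.Data()) // prints v1
}
```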

@ethanfrey (Contributor)

BadgerDB has the same snapshotting feature and is native Go (no cgo required).
But I agree with using a more powerful persistence layer to reduce complexity (and performance overhead) in our code. Any dedicated DB package will have much more optimized routines and sit closer to the disk level, offering more potential for optimization.

@ethanfrey (Contributor)

However, when the tree is running in mutable mode, node.version will be empty (not sure yet whether that should be bigendian(0), bigendian(maxuint64), or nil), so when the value changes or the tree rotates, the key does not change and the existing node is overwritten.

Nothing is written to disk until we Commit, which sets versions and hashes for all in-memory updated nodes. I think the above concern is not a problem; we just need to be concerned about the behavior of committed versions.
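
To restate that with a toy model (the types and names below are illustrative, not IAVL's actual structures): dirty nodes carry no version until commit stamps them, so an immutable on-disk key that includes the version only ever exists for committed state.

```go
package main

import "fmt"

// node is a minimal stand-in for an in-memory tree node.
type node struct {
	key     []byte
	value   []byte
	version int64 // 0 means "not yet committed"
}

// commit stamps every dirty node with the new tree version; only at this
// point would anything be written to disk under a version-bearing key.
func commit(dirty []*node, newVersion int64) {
	for _, n := range dirty {
		n.version = newVersion
	}
}

func main() {
	n := &node{key: []byte("k"), value: []byte("v")}
	fmt.Println("before commit, version:", n.version) // 0: nothing persisted yet
	commit([]*node{n}, 8)
	fmt.Println("after commit, version:", n.version) // 8: key now includes version
}
```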

@silasdavis (Contributor)

We are hitting some fairly big slowdowns, on compaction in particular. @seanyoung is looking into this, and we'd be interested in helping fix it. We do rely on a complete versioned history at every block height.

I have done some ad hoc pprof on live systems that identified compaction as an issue; we could probably come up with something more specific for IAVL.
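
For reference, the usual way to capture such profiles from a running Go node is to expose net/http/pprof (the port here is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	// Expose profiling endpoints alongside the node; then e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// captures a 30-second CPU profile that will show compaction hot spots.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the node's main loop
}
```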

@ethanfrey (Contributor)

Hi @silasdavis, I think a good test harness that reproduces this case would be a great place to start. I assume you have one from your comments? Would you be willing to share?

I would like reproducible code to demo this issue, so we can all get the same pprof numbers and tune it. Ideally as a separate repo. If you don't have this, maybe I will build it... when I find time.

@ultraeric

Bringing together some conversations from the Tendermint Slack and in person with @mdyring, @zmanian, and @jackzampolin, I'll be pushing this issue forward from the Tendermint side over the next few weeks. I believe there are a few things to look into in terms of a fix:

  1. Changing the persistence layer. Look into other DBs, but for most performant KV stores compaction still seems to be a problem. See "Excessive per-block write IO activity" cosmos-sdk#2131 for more context.
  2. Immutable indexing. This seems like it will only improve LevelDB performance by some small constant factor (<4x), due to the append-only structure of LevelDB and the way compaction is used to remove old values. This will be an issue with any level-based KV store, and even the speedup would simply delay the problem until the trees grow larger rather than resolve it.
  3. Write-back LRU caching. This would allow, in a way, "streaming" writes to the DB, similar to what @jlandrews suggested in the original issue. The important thing, as @mdyring mentions, is to consider how this interacts with the consensus layer and what guarantees nodes can give about signed txns that have not yet been persisted to disk.
  4. Less frequent writes and "backups." @zmanian mentions that as networks grow, it becomes less important for individual nodes to constantly persist to non-volatile storage. In combination with a caching solution from above, I believe this would significantly reduce the amount of I/O required.
  5. Timed compactions and flushing. By manually triggering compactions across small ranges at regular intervals, we can amortize the compaction cost over longer periods of time (a sketch follows this list).
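
As a rough illustration of item 5, goleveldb exposes manual range compaction that could be driven on a timer over small key ranges (the ranges, interval, and path below are arbitrary):

```go
package main

import (
	"log"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

func main() {
	db, err := leveldb.OpenFile("/tmp/iavl-compaction-demo", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Periodically compact one small key range at a time, spreading the
	// compaction cost over many intervals instead of one large stall.
	ranges := []util.Range{
		{Start: []byte("a"), Limit: []byte("m")},
		{Start: []byte("m"), Limit: nil}, // nil Limit means "to the end"
	}
	i := 0
	for range time.Tick(30 * time.Second) {
		r := ranges[i%len(ranges)]
		if err := db.CompactRange(r); err != nil {
			log.Printf("compaction failed: %v", err)
		}
		i++
	}
}
```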

@ethanfrey I've been working on putting together a simulation to consistently reproduce these issues and benchmark them. Hopefully I will have that done in the next few days.

@ultraeric commented Jun 3, 2019

Apologies for not updating on the benchmark.

For now, I'm simply using pprof on a full node sync with gigabit internet, with only the peers in the launch repo. Attempts to use a simulation are interesting: while compaction still grows to around 35% of CPU time, it does so at a different rate. I suspect this is due to access patterns.

For the full node sync, compaction sits at around 20% of CPU time for the first few minutes but slowly grows to 35% on my HDD. On my SSD, I/O never goes above 10%.

I will upload my code to a branch soon.

@tac0turtle (Member)

This is done as part of IAVL v1.
