Use immutable key for nodes to avoid excessive LevelDB reorgs #137

Closed
mdyring opened this issue May 13, 2019 · 13 comments

@mdyring commented May 13, 2019

Follow-up of cosmos/cosmos-sdk#2131.

After discussion on Slack, here is a suggested fix:

  1. For each new node in the IAVL tree, assign a sequential integer ID. This ID never changes for the node.
  2. Use this integer ID as the key in the <key, value> pair stored in LevelDB.
  3. The value part stored in LevelDB contains the node hash (the existing LevelDB key).
  4. All parent/child references must be through integer IDs.
  5. A special "root" key stored in LevelDB has a value that contains the integer ID of the root node. (A sketch of this scheme follows the list.)
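
For concreteness, a minimal sketch of what such a keying scheme could look like (the prefix byte, helper names, and "root" key below are illustrative, not anything IAVL does today):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// nodeKey builds the LevelDB key for a node under the proposed scheme: a
// one-byte prefix plus the node's immutable sequential ID in big-endian form.
func nodeKey(id uint64) []byte {
	k := make([]byte, 1+8)
	k[0] = 'n' // hypothetical prefix for node records
	binary.BigEndian.PutUint64(k[1:], id)
	return k
}

// rootKey is the special key whose value would hold the integer ID of the
// current root node (step 5 above).
var rootKey = []byte("root")

func main() {
	// The value stored under nodeKey(id) would contain the node hash (today's
	// LevelDB key) plus child references expressed as integer IDs.
	fmt.Printf("node key for ID 42: %x\n", nodeKey(42))
	fmt.Printf("root pointer key:   %s\n", rootKey)
}
```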
@mossid commented May 16, 2019

Implementation suggestion:

I think using the key (or the hash of the key) will be better; a sequentially assigned integer key cannot be obtained from the node alone and would have to be stored separately.

Also, we have to separate the hash bytes and key bytes of the nodes. Currently, the full hash of the node is used both for calculating the root hash and as the LevelDB key. We can keep the current hash format and only change the LevelDB key format to something immutable.

I suggest a hash(node.key)|bigendian(node.version) format. The node.version part will be empty in the case of the mutable tree and filled in the case of the immutable tree. This preserves backward compatibility (since the root hash calculation is not changed) and compatibility between nodes using immutable/mutable versions of IAVL (since the transition between them can be done simply by emptying/filling node.version).
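
A rough sketch of how that key could be assembled (sha256 and the helper name are placeholders; IAVL's actual node hashing and encoding differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// dbKey builds the proposed immutable LevelDB key: hash(node.key) followed by
// the node's version as a big-endian uint64. The version part is left off
// while the node only lives in the mutable (uncommitted) tree.
func dbKey(nodeKey []byte, version uint64, committed bool) []byte {
	h := sha256.Sum256(nodeKey) // sha256 is illustrative; IAVL's hashing differs
	out := append([]byte{}, h[:]...)
	if committed {
		var v [8]byte
		binary.BigEndian.PutUint64(v[:], version)
		out = append(out, v[:]...)
	}
	return out
}

func main() {
	fmt.Printf("uncommitted: %x\n", dbKey([]byte("balance/alice"), 0, false))
	fmt.Printf("version 7:   %x\n", dbKey([]byte("balance/alice"), 7, true))
}
```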

@ethanfrey (Contributor)

One issue is that we can never overwrite nodes as long as we want to maintain history.
This is why iavl uses the unique hashes (including version) for the keys.

If history can be sacrificed, we can definitely make major speed-ups. But I think it is very important to allow some historical queries for consistent reads when block times are fast. On the other hand, we may be able to use a persistence layer that handles multiple snapshots very efficiently, and thereby avoid this orphan and garbage-collection overhead.

@ethanfrey (Contributor)

Are there concrete performance numbers (benchmarks and pprof) that show this root issue?

@ebuchman (Member)

I suggest hash(node.key)|bigendian(node.version) format.

This sounds reasonable. Fully derived from the node, unique, immutable. The node.key will always be the same (right?), but the node.version will change any time the value does, so the nodes are still immutable.

Here's some related info not about IAVL but Tendermint DB performance.

For one, Tendermint seems to slow down over time: tendermint/tendermint#1835

It seems to be primarily related to indexing txs in the Tendermint daemon (i.e. hashes in LevelDB).

Anton opened an issue on GoLevelDB with more details, and some info from a go-eth dev and a user who decided to use RocksDB: syndtr/goleveldb#226. RocksDB apparently might perform much better for SSDs.

@AlexeyAkhunov did some investigations, and might be able to give some advice here too.

@mossid commented May 17, 2019

node.version will change any time the value does, so the nodes are still immutable.

However, when the tree is running in mutable mode, node.version will be empty (not sure yet whether that should be bigendian(0), bigendian(maxuint64), or nil), so when the value changes or the tree rotates, the key does not change and the existing node is overwritten.

@ebuchman (Member)

Sounds like we might want to look into RocksDB. It has a native snapshotting feature that we might be able to use instead of having to build it ourselves. Then we could just use the key itself as the persistent key in RocksDB and let RocksDB deal with the versioning!
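
For illustration only, reading through a RocksDB snapshot via the tecbot/gorocksdb binding might look roughly like this (the binding choice and path are assumptions; note that RocksDB snapshots are in-memory read views, so surviving restarts would still require checkpoints or similar):

```go
package main

import (
	"log"

	"github.com/tecbot/gorocksdb"
)

func main() {
	opts := gorocksdb.NewDefaultOptions()
	opts.SetCreateIfMissing(true)
	db, err := gorocksdb.OpenDb(opts, "/tmp/iavl-rocksdb-demo")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	wo := gorocksdb.NewDefaultWriteOptions()
	db.Put(wo, []byte("k"), []byte("v1"))

	// Take a snapshot: reads through it see the DB as of this point.
	snap := db.NewSnapshot()
	defer db.ReleaseSnapshot(snap)

	db.Put(wo, []byte("k"), []byte("v2")) // later write, invisible to the snapshot

	ro := gorocksdb.NewDefaultReadOptions()
	ro.SetSnapshot(snap)
	val, err := db.Get(ro, []byte("k"))
	if err != nil {
		log.Fatal(err)
	}
	defer val.Free()
	log.Printf("value at snapshot: %s", val.Data()) // prints v1
}
```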

@ethanfrey (Contributor)

BadgerDB has the same snapshotting feature and is native Go (no cgo required).
But I agree with using a more powerful persistence layer to reduce complexity (and performance overhead) in our code. Any dedicated DB package will have much more optimized routines and sit closer to the disk level, offering more potential for optimization.

@ethanfrey (Contributor)

However, when the tree is running in mutable mode, node.version will be empty (not sure yet whether that should be bigendian(0), bigendian(maxuint64), or nil), so when the value changes or the tree rotates, the key does not change and the existing node is overwritten.

Nothing is written to disk until we Commit, which sets versions and hashes for all in-memory updated nodes. I think the above concern is not a problem; we just need to be concerned about the behavior of committed versions.
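
To restate that with a toy model (the types and names below are illustrative, not IAVL's actual structures): dirty nodes carry no version until commit stamps them, so an immutable on-disk key that includes the version only ever exists for committed state.

```go
package main

import "fmt"

// node is a minimal stand-in for an in-memory tree node.
type node struct {
	key     []byte
	value   []byte
	version int64 // 0 means "not yet committed"
}

// commit stamps every dirty node with the new tree version; only at this
// point would anything be written to disk under a version-bearing key.
func commit(dirty []*node, newVersion int64) {
	for _, n := range dirty {
		n.version = newVersion
	}
}

func main() {
	n := &node{key: []byte("k"), value: []byte("v")}
	fmt.Println("before commit, version:", n.version) // 0: nothing persisted yet
	commit([]*node{n}, 8)
	fmt.Println("after commit, version:", n.version) // 8: key now includes version
}
```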

@silasdavis (Contributor)

We are hitting some fairly big slowdowns, on compaction in particular. @seanyoung is looking into this, and we'd be interested in helping fix it. We do rely on a complete versioned history at every block height.

I have done some ad hoc pprof on live systems that identified compaction as an issue; we could probably come up with something more specific for IAVL.
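
For reference, the usual way to capture such profiles from a running Go node is to expose net/http/pprof (the port here is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	// Expose profiling endpoints alongside the node; then e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// captures a 30-second CPU profile that will show compaction hot spots.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the node's main loop
}
```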

@ethanfrey (Contributor)

Hi @silasdavis, I think a good test harness that reproduces this case would be a great place to start. I assume you have one from your comments? Would you be willing to share?

I would like reproducible code to demo this issue, so we can all get the same pprof numbers and tune it. Ideally as a separate repo. If you don't have this, maybe I will build it... when I find time.

@ultraeric

Bringing together some conversations from the Tendermint Slack and in person with @mdyring, @zmanian, and @jackzampolin, I'll be pushing this issue forward from the Tendermint side over the next few weeks. I believe there are a few things to look into in terms of a fix:

  1. Changing the persistence layer. Look into other DBs, but for most performant KV stores compaction still seems to be a problem. See "Excessive per-block write IO activity" cosmos-sdk#2131 for more context.
  2. Immutable indexing. This seems like it will only improve LevelDB performance by some small constant factor (<4x), due to the append-only structure of LevelDB and the way compaction is used to remove old values. This will be an issue with any level-based KV store, and even the speedup would simply delay the problem until the trees grow larger rather than resolve it.
  3. Write-back LRU caching. This would allow, in a way, "streaming" writes to the DB, similar to what @jlandrews suggested in the original issue. The important thing, as @mdyring mentions, is to consider how this interacts with the consensus layer and what guarantees nodes can give about signed txns that have not yet been persisted to disk.
  4. Less frequent writes and "backups." @zmanian mentions that as networks grow, it becomes less important for individual nodes to constantly persist to non-volatile storage. In combination with a caching solution from above, I believe this would significantly reduce the amount of I/O required.
  5. Timed compactions and flushing. By manually triggering compactions across small ranges at regular intervals, we can amortize the compaction cost over longer periods of time (a sketch follows this list).
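
As a rough illustration of item 5, goleveldb exposes manual range compaction that could be driven on a timer over small key ranges (the ranges, interval, and path below are arbitrary):

```go
package main

import (
	"log"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

func main() {
	db, err := leveldb.OpenFile("/tmp/iavl-compaction-demo", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Periodically compact one small key range at a time, spreading the
	// compaction cost over many intervals instead of one large stall.
	ranges := []util.Range{
		{Start: []byte("a"), Limit: []byte("m")},
		{Start: []byte("m"), Limit: nil}, // nil Limit means "to the end"
	}
	i := 0
	for range time.Tick(30 * time.Second) {
		r := ranges[i%len(ranges)]
		if err := db.CompactRange(r); err != nil {
			log.Printf("compaction failed: %v", err)
		}
		i++
	}
}
```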

@ethanfrey I've been working on putting together a simulation to consistently reproduce these issues and benchmark them. Hopefully I will have that done in the next few days.

@ultraeric commented Jun 3, 2019

Apologies for not updating on the benchmark.

For now, I'm simply using pprof on a full node sync with gigabit internet, with only the peers in the launch repo. Attempts to use a simulation are interesting: while compaction still grows to around 35% of CPU time, it does so at a different rate. I suspect this is due to access patterns.

For the full node sync, compaction sits at around 20% of CPU time for the first few minutes but slowly grows to 35% on my HDD. On my SSD, I/O never goes above 10%.

I will upload my code to a branch soon.

@tac0turtle (Member)

This is done as part of IAVL v1.
