Use immutable key for nodes to avoid excessive LevelDB reorgs #137
Implementation suggestion: I think using the key (or the hash of the key) will be better; a sequentially assigned integer key cannot be obtained from the node alone and has to be stored separately. Also, we have to separate the hash bytes and key bytes of the nodes. Currently, the full hash of the node is used both for calculating the root hash and as the LevelDB key. We can maintain the current format of the hash and change only the LevelDB key format to something immutable. I suggest
One issue is that we can never overwrite nodes as long as we want to maintain history. If history can be sacrificed, we can definitely make major speed-ups, but I think it is very important to allow some historical queries for consistent reads on fast-moving block times. On the other hand, we may be able to use a persistence layer that handles multiple snapshots very efficiently, and then avoid this orphaning and garbage-collection overhead.
Are there concrete performance numbers (benchmarks and pprof) that show this root issue?
This sounds reasonable. Fully derived from the node, unique, immutable. Here's some related info, not about IAVL but about Tendermint DB performance. For one, Tendermint seems to slow down over time: tendermint/tendermint#1835. It seems to be primarily related to indexing txs in the Tendermint daemon (i.e. hashes in LevelDB). Anton opened an issue on GoLevelDB with more details, plus some info from a go-ethereum dev and a user who decided to use RocksDB: syndtr/goleveldb#226. RocksDB apparently might perform much better on SSDs. @AlexeyAkhunov did some investigations, and might be able to give some advice here too.
However, when the tree is running in mutable mode
Sounds like we might want to look into RocksDB. It has a native snapshotting feature that we might be able to utilize instead of having to build it ourselves. Then we could just use the key itself as the persistent key in RocksDB, and let RocksDB deal with the versioning!
BadgerDB has the same snapshotting feature and is native Go (no cgo required).
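To illustrate why engine-level snapshots would help here: if the storage layer keeps every version of a value readable, the tree no longer needs its own orphan bookkeeping. Below is a toy multi-version store sketching that semantics; it is not RocksDB's or Badger's actual API (those use sequence numbers and managed timestamps respectively, far more efficiently).

```go
package main

import "fmt"

// entry records the version at which a value was written.
type entry struct {
	version uint64
	value   string
}

// versionedKV is a toy multi-version store: Set appends a new entry
// rather than overwriting, and Get(v) returns the newest value
// written at or before version v. Old versions stay readable, which
// is what a snapshot-capable engine provides natively.
type versionedKV struct {
	data map[string][]entry
	now  uint64
}

func newVersionedKV() *versionedKV {
	return &versionedKV{data: make(map[string][]entry)}
}

// Commit seals the pending writes into a new version and returns it.
func (s *versionedKV) Commit() uint64 { s.now++; return s.now }

// Set stages a write for the next version; nothing is overwritten.
func (s *versionedKV) Set(key, value string) {
	s.data[key] = append(s.data[key], entry{version: s.now + 1, value: value})
}

// Get reads key as of a historical version.
func (s *versionedKV) Get(key string, version uint64) (string, bool) {
	es := s.data[key]
	for i := len(es) - 1; i >= 0; i-- {
		if es[i].version <= version {
			return es[i].value, true
		}
	}
	return "", false
}

func main() {
	s := newVersionedKV()
	s.Set("a", "1")
	v1 := s.Commit()
	s.Set("a", "2")
	v2 := s.Commit()
	fmt.Println(s.Get("a", v1)) // historical read
	fmt.Println(s.Get("a", v2)) // latest read
}
```

With this shape, "consistent reads on fast-moving block times" fall out for free: a reader pins a version and sees a stable view while new blocks keep committing.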
Nothing is written to disk until we
We are hitting some fairly big slowdowns on compaction in particular. @seanyoung is looking into this. We'd be interested in helping fix this. We do rely on a complete versioned history at every block height. I have done some ad hoc pprof on live systems that identified compaction as an issue; we could probably come up with something more specific on IAVL.
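For reproducing those pprof observations locally, a CPU profile can be captured around just the suspect code path using the standard library's `runtime/pprof`. The `workload` function below is a placeholder assumption standing in for whatever write pattern triggers compaction.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// profileWorkload captures a CPU profile of fn into path; the
// resulting file can be inspected with `go tool pprof`.
func profileWorkload(path string, fn func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	defer pprof.StopCPUProfile()
	fn()
	return nil
}

// workload burns some CPU so the profile has samples; in a real
// reproduction this would be the IAVL/LevelDB write pattern that
// triggers compaction.
func workload() {
	sum := 0
	for i := 0; i < 50_000_000; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	if err := profileWorkload("cpu.pprof", workload); err != nil {
		panic(err)
	}
	fmt.Println("wrote cpu.pprof; inspect with: go tool pprof cpu.pprof")
}
```

Scoping the profile to the workload (rather than the whole process) makes the compaction share of CPU time easier to compare across runs and machines.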
Hi @silasdavis. I think a good test harness that reproduces this case would be a great place to start. I assume from your comments that you have one? Would you be willing to share? I would like reproducible code that demos this issue, so we can all get the same pprof numbers and tune it. Ideally as a separate repo. If you don't have this, maybe I will build it when I find time.
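A minimal shape such a harness could take: drive a storage backend with a deterministic write pattern and time it, so everyone measures the same thing. The `Backend` interface and in-memory map here are illustrative assumptions; a real run would wrap goleevldb-style storage behind the same interface.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

// Backend is the minimal surface the harness needs; a real harness
// would wrap goleveldb (or RocksDB/Badger) behind this interface.
type Backend interface {
	Set(key, value []byte)
}

// mapBackend is an in-memory stand-in so the harness runs on its own.
type mapBackend struct{ m map[string][]byte }

func (b *mapBackend) Set(k, v []byte) { b.m[string(k)] = v }

// runWrites issues n deterministic writes and returns the elapsed
// time, so different backends and machines produce comparable numbers.
func runWrites(db Backend, n int) time.Duration {
	start := time.Now()
	key := make([]byte, 8)
	for i := 0; i < n; i++ {
		binary.BigEndian.PutUint64(key, uint64(i))
		db.Set(key, []byte("value"))
	}
	return time.Since(start)
}

func main() {
	db := &mapBackend{m: make(map[string][]byte)}
	elapsed := runWrites(db, 100_000)
	fmt.Printf("100000 writes in %v (%.0f writes/s)\n",
		elapsed, 100_000/elapsed.Seconds())
}
```

Swapping the backend while keeping the write pattern fixed is what makes the pprof numbers from different people directly comparable.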
Bringing together some conversations from the Tendermint slack and in person with @mdyring, @zmanian, and @jackzampolin, I'll be pushing forward on this issue for the next few weeks from the Tendermint side. I believe there are a few things to look into in terms of a fix:
@ethanfrey I've been working on putting together a simulation to consistently reproduce these issues and benchmark them. Hopefully I will have that done in the next few days.
Apologies for not updating on the benchmark. For now, I'm simply using pprof on a full-node sync over gigabit internet, with only the peers in the launch repo. Attempts to use a simulation are interesting because, while compaction still grows to around 35% of CPU time, it does so at a different rate; I suspect this is due to access patterns. For the full-node sync, compaction is at around 20% of CPU time for the first few minutes, but slowly grows to 35% on my HDD. On my SSD, I/O never goes above 10%. I will upload my code to a branch soon.
This is done as part of IAVL v1.
Follow-up of cosmos/cosmos-sdk#2131.
After discussion on Slack, a suggestion to implement a fix: