experiment: switch from zlib to Zstd for index file #1438
Conversation
if ( compress( &bufferCompressed.front(), &compressedSize, &buffer.front(), bufferUsed ) != Z_OK )
const size_t size_or_err =
The API design is different.
In zlib, `compress` writes the size written to its 2nd parameter.
In Zstd, `ZSTD_compress` returns either the size written or an error code. facebook/zstd#1825 (comment)
I ran the same benchmark on my Linux box with this dict: https://jitendex.org/pages/downloads.html in a debug build. The speedup is around 8%.
I think 8% is not worth the trouble. I would like to replace the entire index file with leveldb, rocksdb, or even xapian, which would give a boost to the headword-browsing requirement. The current index structure does not perform well when browsing all the headwords of a dictionary with a very large number of headwords.
But it is a consistent improvement for the moment.
OK, we will get there. I find the main challenge is dealing with the existing code rather than writing the new one 😅. It needs lots of time.
I think the main concern is that the new compression method will force users to reindex all their dictionaries.
Yes, but it is a one-time cost. (However, it is not a one-time cost for someone switching between the original version and this one.)
It would also cause compatibility issues between our own releases. Compression time and decompression time should both be considered. Maybe we can start a beta version to try all the incompatible changes, such as unifying the dictionaryId generation logic between the portable and normal versions.
I am unsure how to proceed. I believe most users of this project are not really technical, and breakages are devastating for them. Maybe we should label issues that require a breaking change, so we know the scope of the problem?
I think we should call it the “optimized version” to give a reason for migration. On the release page, we can say it includes optimizations that aren't possible while keeping compatibility with the original GD and previous GD-ng versions. A little psychological trick 😅
Create a beta version that can co-exist with the alpha version.
Maybe we should accumulate features (both planned and implemented) before publishing it, to avoid the cost of switching back and forth. We can also just reuse the main branch and add lots of cumbersome ... A new page in the docs is needed, like: optimized-version changes (rationales, issues...).
The code will become too complex in the future.
TBH, I don't have lots of spare time anymore. I prefer to work on gradually replacing the current index implementation, or at least making it simple to replace. 😅 Maybe at some point we can declare that the main branch is in maintenance mode and only gets critical bug fixes. All new code enters the beta branch, as you said.
Compression speed is not the only thing to consider; decompression time, disk consumption, etc. should also be considered. I guess if no compression method is used, it should be even faster. A more elegant approach should consider all of the following, such as the index structure.
What about using leveldb/rocksdb for the index file storage?
I checked rocksdb before, and I think it is not suitable for our purpose, because their definition of “lightweight” and ours are very different.
99% of rocksdb's “features” are useless to us. We don't need any of its analytical features. Also, as a product of Facebook, it appears to depend on ... For the other choices (leveldb, GDBM, ...), the differences aren't that big; the APIs are pretty much the same.
Once we can move all implementation details out of the dict implementations, switching between them shouldn't be hard. If I had time right now, I would probably choose this one: https://dbmx.net/tkrzw/ (TreeDBM), because it provides a C++-style API and the implementation is very clean. Reading the webpage from top to bottom is all that is needed to learn the whole API. However, there are almost no other users. But the author of that project has implemented a lineage of KV databases and has a good reputation. We could probably fix issues even if he stops maintaining it. The author himself has a dictionary implementation: https://github.com/estraier/tkrzw-dict For example, its implementation of wildcard matching is pretty similar to what GD currently has: the data are kept in a sorted tree.
No idea when I will have time to pull off this move. I will just refactor slowly. If you want to push for it, I will help with whatever you choose.
I prefer leveldb (widely used), which should have good quality and support from Google. The index's metadata can still be stored in the index file; the btree index can be stored in leveldb separately, maybe with the name ... Another option: ...
Most chunks in the index file are small, but some formats produce slightly larger chunks, like mdx, sometimes up to 64 KiB for one chunk. The total index file size is usually less than 1 MiB, but sometimes it can go up to 10~15 MiB.
A better compression library may lead to an obvious improvement in indexing time.
Zstd on its website claims that it is better than zlib in all aspects. Various benchmarks online confirm this.
I added a simple time measurement to mdx's index file creation to compare zstd & zlib.
Run GD with a single large MDX dict, then grep `ms` from stdout.
Measured with a release build, MacBook Air (M1, 2020), and a ~140 MB mdx dictionary.
Both generated index files are ~3.5 MB; the diff is less than 0.2 MB.
Indexing time can be reduced by >10% (note that the measured time also includes unrelated things like file writing).
Spreadsheet file used (to download: top left -> File -> Download): https://docs.google.com/spreadsheets/d/1In6Qvpp3M1GmWPN4L6AdLkKnLFnF9MUUwWedDdVG0Ms/edit?usp=sharing
An alternative is lz4, which is significantly faster at compression/decompression, but its compression ratio is lower. Larger files have their own cost, e.g. on slow disks (the benefit of faster compression has to outweigh the cost of writing larger files; I don't know which wins).
The default compression level of Zstd is 3.
The default compression level of zlib is 6.
Since in our use case the sizes are small, some adjustment may yield a better result.