
experiment: switch from zlib to Zstd for index file #1438

Draft · wants to merge 3 commits into base: staged
Conversation

shenlebantongying (Collaborator)

Most chunks in the index file are small, but some formats produce slightly larger ones, like mdx, sometimes up to 64 KiB for one chunk. The total index file size is usually less than 1 MiB, but sometimes it can go up to 10~15 MiB.

A better compression library may yield an obvious improvement in indexing time.

Zstd claims on its website to be better than zlib in all aspects. Various benchmarks online confirm this.


I added a simple time measurement to mdx's index file creation to compare zstd & zlib.

Method: run GD with a single large MDX dict, then grep for "ms" in stdout.

Measured with a release build on a MacBook Air (M1, 2020), with a ~140 MB mdx dictionary.

Both generated index files are ~3.5 MB; the difference is less than 0.2 MB.

Indexing time can be reduced by >10% (note that the measured time also includes unrelated work such as file writing).
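For reference, a minimal sketch of this kind of measurement (illustrative only; `timeMs` and the dummy workload are made up, the real measurement wraps mdx's index creation):

```cpp
#include <chrono>
#include <cstdio>

// Time a unit of work and report milliseconds, the way the benchmark above
// greps "ms" from stdout. The workload below is a stand-in for index building.
template< typename Work >
long long timeMs( Work && work )
{
  auto const start = std::chrono::steady_clock::now();
  work();
  auto const end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast< std::chrono::milliseconds >( end - start ).count();
}

int main()
{
  volatile long long sink = 0;
  long long const ms = timeMs( [ & ] {
    for ( long long i = 0; i < 100'000'000; ++i ) // stand-in for index building
      sink = sink + i;
  } );
  std::printf( "indexing stand-in took %lld ms\n", ms );
}
```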

[Screenshot: zstd vs zlib indexing-time benchmark, 2024-03-24]

Spreadsheet file used (to download: top left -> File -> Download): https://docs.google.com/spreadsheets/d/1In6Qvpp3M1GmWPN4L6AdLkKnLFnF9MUUwWedDdVG0Ms/edit?usp=sharing


An alternative is lz4, which is significantly faster at compression/decompression, but its compression ratio is lower. Larger files have their own cost, e.g. on slow disks; the benefit of faster compression would have to outweigh the cost of writing larger files, and I don't know whether it does.

The default compression level of Zstd is 3.

The default compression level of zlib is 6.

Since the data in our use case are small, some adjustment may yield a better result (see the sketch below).
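For the record, zstd exposes its level bounds programmatically; a tiny sketch (nothing here is from the PR):

```cpp
#include <zstd.h>
#include <cstdio>

int main()
{
  // ZSTD_CLEVEL_DEFAULT is 3, as noted above; min/max show the tuning range.
  std::printf( "zstd default level: %d, min: %d, max: %d\n",
               ZSTD_CLEVEL_DEFAULT, ZSTD_minCLevel(), ZSTD_maxCLevel() );
}
```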


if ( compress( &bufferCompressed.front(), &compressedSize, &buffer.front(), bufferUsed ) != Z_OK )
const size_t size_or_err =
shenlebantongying (Collaborator, Author)


The API design differs.

In zlib, compress reports the number of bytes written through its 2nd parameter.

In Zstd, the compress call returns either the size written or an error code. facebook/zstd#1825 (comment)
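A minimal sketch of the two conventions side by side (the buffer handling is illustrative, not the PR's actual code):

```cpp
#include <zlib.h>
#include <zstd.h>
#include <vector>

// zlib: the return value is a status code; the written size comes back
// through the 2nd parameter.
bool compressWithZlib( std::vector< char > const & in, std::vector< char > & out )
{
  uLongf written = compressBound( in.size() );
  out.resize( written );
  if ( compress( reinterpret_cast< Bytef * >( out.data() ), &written,
                 reinterpret_cast< Bytef const * >( in.data() ), in.size() )
       != Z_OK )
    return false;
  out.resize( written );
  return true;
}

// Zstd: a single return value is either the written size or an error code,
// distinguished via ZSTD_isError().
bool compressWithZstd( std::vector< char > const & in, std::vector< char > & out )
{
  out.resize( ZSTD_compressBound( in.size() ) );
  size_t const sizeOrErr =
    ZSTD_compress( out.data(), out.size(), in.data(), in.size(), ZSTD_CLEVEL_DEFAULT );
  if ( ZSTD_isError( sizeOrErr ) )
    return false;
  out.resize( sizeOrErr );
  return true;
}
```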

@shenlebantongying changed the title from “feat: uses Zstd for index file instead of Zlib (and index building benchmarks)” to “feat: uses Zstd for index file instead of zlib for faster indexing” on Mar 24, 2024
@shenlebantongying (Collaborator, Author)

I ran the same benchmark on my Linux box, with this dict: https://jitendex.org/pages/downloads.html, in a debug build.

The speedup is around 8%.

[Screenshot: Linux benchmark results]

@shenlebantongying changed the title from “feat: uses Zstd for index file instead of zlib for faster indexing” to “feat: switch from zlib to Zstd for index file to boost indexing time by >10%” on Mar 24, 2024
@xiaoyifang (Owner) commented Mar 25, 2024

The speedup is around 8%

I think 8% is not worth the trouble.

I would like to replace the entire index file with leveldb or rocksdb or even xapian, which would also give a boost for the headword browsing requirement.

The current index structure does not perform well when browsing all the headwords of a dictionary that has a very large number of headwords.

@shenlebantongying (Collaborator, Author)

I think 8% is not worth the trouble.

But it is a consistent improvement for the moment.

I would like to replace the entire index file with leveldb or rocksdb or even xapian which will give a boost in the headword browse requirement.

OK, we will get there. I find the main challenge is dealing with the existing code rather than writing the new one 😅. It needs lots of time.

@shenlebantongying changed the title from “feat: switch from zlib to Zstd for index file to boost indexing time by >10%” to “experiment: switch from zlib to Zstd for index file” on Mar 25, 2024
@xiaoyifang (Owner)

But it is a consistent improvement for the moment.

I think the main concern is that using the new compression method will force users to reindex all their dictionaries.

@shenlebantongying (Collaborator, Author)

Yes, but it is a one-time cost.

(However, it is not a one-time cost for someone switching between the original version and this one.)

@xiaoyifang (Owner) commented Mar 25, 2024

It would also cause compatibility issues between our own releases.

Both compression and decompression time should be considered.

Maybe we can start a beta version to try all the incompatible changes, such as unifying the dictionaryId generation logic between the portable and normal versions.

@shenlebantongying (Collaborator, Author) commented Mar 25, 2024

I am unsure how to proceed. I believe most users of this program are not really technical; breakages are devastating for them.

Maybe we should label issues that would need a breaking change, to gauge the scope of the problem?

beta version

I think we should call it an “optimized version” to give a reason for migration. On the release page, we would say it includes optimizations that are not possible while keeping compatibility with the original GD and previous GD-ng versions. A little psychological trick 😅

@xiaoyifang (Owner)

I am unsure how to proceed

Create a beta branch, enable auto-builds for that branch on push, and add an attention note about the incompatibility in the release notes.

The beta version can coexist with the alpha version.

@shenlebantongying (Collaborator, Author)

Maybe we should accumulate features (both planned & implemented) before publishing it, to avoid the cost of switching back and forth.

We can also just reuse the main branch: add lots of cumbersome #if FEATURE_XXX_ENABLED guards and a compile option ENABLE_BREAKING_CHANGES. One workflow could build and publish both versions.

A new page in the docs would be needed, like: optimized version changes (rationale, issues...)

@shenlebantongying added the “vNext Improvments and optimizations that need incompatible changes.” label on Mar 25, 2024
@xiaoyifang (Owner)

We can also just reuse the main branch: add lots of cumbersome #if FEATURE_XXX_ENABLED guards and a compile option ENABLE_BREAKING_CHANGES. One workflow could build and publish both versions.

The code will become too complex in the future.

@shenlebantongying (Collaborator, Author) commented Mar 25, 2024

TBH, I don't have lots of spare time anymore. I prefer to work on gradually replacing the current index implementation, or at least making it simple to replace. 😅

Maybe at some point we can declare that the main branch is in maintenance mode and gets only critical bug fixes. All new code would enter the beta branch, as you said.

@xiaoyifang (Owner) commented Mar 26, 2024

Compression speed is not the only thing to consider; decompression time, disk consumption, etc. should also be considered.

I guess if no compression is used at all, it would be even faster.

A more elegant approach should consider all of the following:

  • the index structure,
  • the compression algorithm and the compressed size,
  • compatibility,
  • etc.

@shenlebantongying marked this pull request as draft on April 5, 2024 13:26

sonarqubecloud bot commented Apr 5, 2024

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

@xiaoyifang (Owner) commented Oct 26, 2024

What about using leveldb/rocksdb for the index file storage?
leveldb/rocksdb would handle all the stuff like compression, query/save performance, etc.

@shenlebantongying (Collaborator, Author) commented Oct 26, 2024

I checked rocksdb before, and I think it is not suitable for our purpose, because their definition of “lightweight” and ours are very different.

  • rocksdb -> works fine on something like 64 cores and 256 GB RAM.
  • GD -> must work fine on PCs with as few as 4 cores and 8 GB RAM.

99% of rocksdb's “features” are useless to us. We don't need any of its analytical features. Also, as a Facebook product, it appears to depend on folly (?), which makes it a big burden to build. The APIs are not really documented, and they evolve over time. It is for big companies that directly hire the developers.


For the other choices, leveldb, GDBM, ..., the differences aren't that big. The API is pretty much write(char*) + read(char*) and that's all. The development models differ, though:

  • GDBM is GNU-style development -> a single maintainer, and bugs must be reported via a mailing list 🤮
  • LevelDB is pretty much bug-fixing only (fixing Google's bugs 😅).
  • LMDB's development model somewhat prioritizes sponsors' needs.
  • ....

Once we can move all implementation details out of dict implementations, switching between them shouldn't be hard.


If I had time right now, I would probably choose this one: https://dbmx.net/tkrzw/ (TreeDBM), because it provides a C++-style API and the implementation is very clean. Reading the webpage from top to bottom is all that is needed to learn the whole API. However, there are almost no other users. Still, the author of that project has implemented a lineage of KV databases and has a good reputation. We could probably fix issues ourselves even if he stops maintaining it.

The author himself has a dictionary implementation https://github.com/estraier/tkrzw-dict

For example, the wildcard implementation would be pretty similar to what GD currently has: data live in a sorted tree -> JumpLower to the first entry matching the first char -> then match one by one using an iterator https://dbmx.net/tkrzw/api/classtkrzw_1_1DBM_1_1Iterator.html (which is much more type-safe than a C-style API, e.g. GDBM's https://www.gnu.org.ua/software/gdbm/manual/Sequential.html). A sketch follows.

SetOpaqueMetadata -> the equivalent of the idx headers.
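A minimal sketch of that jump-then-iterate pattern against tkrzw's documented TreeDBM/Iterator API (the file name, keys, and values are made up):

```cpp
#include <tkrzw_dbm_tree.h>
#include <iostream>
#include <string>

int main()
{
  // TreeDBM keeps keys sorted, which is what the jump-then-iterate trick needs.
  tkrzw::TreeDBM dbm;
  dbm.Open( "headwords.tkt", true ).OrDie();

  dbm.Set( "a lot", "offset-1" );
  dbm.Set( "a lot of", "offset-2" );
  dbm.Set( "zebra", "offset-3" );

  // Jump to the first key >= the prefix, then walk forward while it matches.
  auto iter = dbm.MakeIterator();
  iter->Jump( "a lot" );
  std::string key, value;
  while ( iter->Get( &key, &value ) == tkrzw::Status::SUCCESS ) {
    if ( key.rfind( "a lot", 0 ) != 0 ) // stop once past the prefix
      break;
    std::cout << key << " -> " << value << "\n";
    iter->Next();
  }

  dbm.Close();
}
```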


No idea when I will have time to pull this off; I will just refactor slowly. If you want to push for it, I will help with whatever you choose.

@xiaoyifang (Owner) commented Oct 26, 2024

I prefer leveldb (widely used), which should have good quality and is supported by Google. Put, Get, and Seek methods should be all we need (see the sketch below).

The index's metadata can still be stored in the index file. The btree index can be stored separately in leveldb, maybe with the name [index filename]_db as leveldb's database name.
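A minimal sketch of that Put/Get/Seek surface (the database name follows the "[index filename]_db" suggestion above; keys and values are made up):

```cpp
#include <leveldb/db.h>
#include <cassert>
#include <iostream>
#include <string>

int main()
{
  leveldb::Options options;
  options.create_if_missing = true;

  // Hypothetical "[index filename]_db" database name.
  leveldb::DB * db = nullptr;
  leveldb::Status status = leveldb::DB::Open( options, "mydict.idx_db", &db );
  assert( status.ok() );

  db->Put( leveldb::WriteOptions(), "a lot of", "article-offset" );

  std::string value;
  db->Get( leveldb::ReadOptions(), "a lot of", &value );

  // Seek positions an iterator at the first key >= the target, which is the
  // primitive that headword browsing needs.
  leveldb::Iterator * it = db->NewIterator( leveldb::ReadOptions() );
  for ( it->Seek( "a lot" ); it->Valid() && it->key().starts_with( "a lot" ); it->Next() )
    std::cout << it->key().ToString() << " -> " << it->value().ToString() << "\n";

  delete it;
  delete db;
}
```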

Another option: use xapian for the headword index instead. We already use xapian for the full-text index; using it for the headword index would avoid importing a new library.

The current gd-ng uses a trick to search in the middle of a headword: for the headword “a lot of”, searching “lot of” will find it. gd-ng stores 3 different headwords in the index:

  • a lot of,
  • lot of with prefix a,
  • of with prefix a lot.

Using xapian would save us the trouble (a sketch of the current trick follows this list).
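For clarity, a sketch of that trick in isolation (not gd-ng's actual code): each word boundary yields one stored key, with the words before it kept as a "prefix" payload.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
  std::string const headword = "a lot of";

  // Split the headword into words.
  std::vector< std::string > words;
  std::istringstream in( headword );
  for ( std::string w; in >> w; )
    words.push_back( w );

  // One entry per word boundary: the suffix is the searchable key,
  // the preceding words are kept as its prefix.
  for ( size_t i = 0; i < words.size(); ++i ) {
    std::string prefix, suffix;
    for ( size_t j = 0; j < words.size(); ++j ) {
      std::string & part = j < i ? prefix : suffix;
      if ( !part.empty() )
        part += ' ';
      part += words[ j ];
    }
    std::cout << "key: \"" << suffix << "\", prefix: \"" << prefix << "\"\n";
  }
}
```

This prints exactly the three entries listed above: “a lot of” with no prefix, “lot of” with prefix “a”, and “of” with prefix “a lot”.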
