experiment: switch from zlib to Zstd for index file #1438
Conversation
if ( compress( &bufferCompressed.front(), &compressedSize, &buffer.front(), bufferUsed ) != Z_OK )
const size_t size_or_err =
The API design is different.
In zlib, `compress` writes the size written to its 2nd parameter.
In Zstd, `ZSTD_compress` returns either the size written or an error code. facebook/zstd#1825 (comment)
I ran the same benchmark on my Linux box with this dict: https://jitendex.org/pages/downloads.html in a debug build. The speedup is around 8%.
I think 8% is not worth the trouble. I would like to replace the entire index file with leveldb, rocksdb, or even xapian, which would give a boost to the headword-browsing requirement. The current index structure does not perform well when browsing all the headwords of a dictionary with a very large number of headwords.
But it is a consistent improvement for the moment.
OK, we will get there. I find the main challenge is dealing with the existing code rather than writing the new one 😅. It needs lots of time.
I think the main concern is that the new compression method will force users to reindex all their dictionaries.
Yes, but it is a one-time cost. (However, it is not a one-time cost for someone switching between the original version and this one.)
It would also cause compatibility issues between our own releases. Compression time and decompression time should both be considered. Maybe we can start a beta version to try all the incompatible changes, such as unifying the dictionaryId generation logic between the portable and normal versions.
I am unsure how to proceed. I believe most users of this project are not really technical, and breakages are devastating for them. Maybe we should label issues that require a breaking change, so we know the scope of the problem?
I think we should call it the “optimized version” to give a reason for migration. On the release page, we can say it includes optimizations that aren't possible while keeping compatibility with the original GD and previous GD-ng versions. A little psychological trick 😅
Create a beta version that can co-exist with the alpha version.
Maybe we should accumulate features (both planned and implemented) before publishing it, to avoid the cost of switching back and forth. We can also just reuse the main branch and add lots of cumbersome ... A new page in the docs is needed, like: optimized-version changes (rationales, issues...).
The code will become too complex in the future.
TBH, I don't have lots of spare time anymore. I prefer to work on gradually replacing the current index implementation, or at least making it simple to replace. 😅 Maybe at some point we can declare that the main branch is in maintenance mode and only gets critical bug fixes. All new code enters the beta branch, as you said.
Compression speed is not the only thing to consider; decompression time, disk consumption, etc. should also be considered. I guess if no compression method is used, it should be even faster. A more elegant approach should consider all of the following, such as the index structure.
What about using leveldb/rocksdb for the index file storage?
I checked rocksdb before, and I think it is not suitable for our purpose, because their definition of “lightweight” and ours are very different.
99% of rocksdb's “features” are useless to us. We don't need any of its analytical features. Also, as a product of Facebook, it appears to depend on ... For the other choices (leveldb, GDBM, ...), the differences aren't that big; the APIs are pretty much the same.
Once we can move all implementation details out of the dict implementations, switching between them shouldn't be hard. If I had time right now, I would probably choose this one: https://dbmx.net/tkrzw/ (TreeDBM), because it provides a C++-style API and the implementation is very clean. Reading the webpage from top to bottom is all that is needed to learn the whole API. However, there are almost no other users. But the author of that project has implemented a lineage of KV databases and has a good reputation. We could probably fix issues even if he stops maintaining it. The author himself has a dictionary implementation: https://github.com/estraier/tkrzw-dict For example, its implementation of wildcard matching is pretty similar to what GD currently has: the data are kept in a sorted tree.
No idea when I will have time to pull off this move. I will just refactor slowly. If you want to push for it, I will help with whatever you choose.
I prefer leveldb (widely used), which should have good quality and support from Google. The index's metadata can still be stored in the index file; the btree index can be stored in leveldb separately, maybe with the name ... Another option: ...
Most chunks in the index file are small, but some formats produce slightly larger chunks, like mdx, sometimes up to 64 KiB for one chunk. The total index file size is usually less than 1 MiB, but sometimes it can go up to 10~15 MiB.
A better compression library may lead to an obvious improvement in indexing time.
Zstd on its website claims that it is better than zlib in all aspects. Various benchmarks online confirm this.
I added a simple time measurement to mdx's index file creation to compare zstd & zlib.
Run GD with a single large MDX dict, then grep `ms` from stdout.
Measured with a release build, MacBook Air (M1, 2020), and a ~140 MB mdx dictionary.
Both generated index files are ~3.5 MB; the diff is less than 0.2 MB.
Indexing time can be reduced by >10% (note that the measured time also includes unrelated things like file writing).
Spreadsheet file used (to download: top left -> File -> Download): https://docs.google.com/spreadsheets/d/1In6Qvpp3M1GmWPN4L6AdLkKnLFnF9MUUwWedDdVG0Ms/edit?usp=sharing
An alternative is lz4, which is significantly faster at compression/decompression, but its compression ratio is lower. Larger files have their own cost, e.g. on slow disks (the benefit of faster compression has to outweigh the cost of writing larger files; I don't know which wins).
The default compression level of Zstd is 3.
The default compression level of zlib is 6.
Since in our use case the sizes are small, some adjustment may yield a better result.