MongoDB 3 Compression Options #1099

Open
robodude666 opened this issue Aug 13, 2015 · 3 comments

@robodude666

One of the new features added in MongoDB 3.0 is compression. Could these compression options please be supported, especially combined with GridFS for storing text files?

From the article on MongoDB's blog, it appears the WiredTiger Storage Engine would be required to support this functionality; not sure if this is yet supported or not.

@iici-gli
Contributor

Compression can be configured in MongoDB's startup options. MongoEngine does not need to do anything.
For example, in your config file:
storage:
  dbPath: "C:/mongodb3/db"
  engine: "wiredTiger"

@lafrech
Member

lafrech commented Feb 22, 2016

I'm not familiar with this, so I may be wrong, but these compression options can also be set at the collection level, as described in the article @robodude666 linked to. To use them, one needs to pass specific options through kwargs (see also this SO question) to pymongo's create_collection.

It would make sense to expose these options in MongoEngine, and from a quick glance the meta attribute of the collection seems like an appropriate place. If anyone is willing to propose some code, I think it would be an interesting feature indeed.
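
For illustration, the collection-level route through pymongo might look roughly like the sketch below. It assumes the `storageEngine` option of MongoDB's create command with a WiredTiger `configString`, forwarded via the keyword arguments of `create_collection`. The helper names and the list of accepted compressors (those shipped with MongoDB 3.0) are assumptions made for this example, not part of MongoEngine or pymongo.

```python
# Hypothetical helpers; only db.create_collection(...) is a pymongo call.
ALLOWED_COMPRESSORS = {"none", "snappy", "zlib"}  # assumed MongoDB 3.0 set


def wiredtiger_collection_options(block_compressor="zlib"):
    """Build the extra create-command options that select a block compressor."""
    if block_compressor not in ALLOWED_COMPRESSORS:
        raise ValueError("unsupported compressor: %r" % (block_compressor,))
    config = "block_compressor=%s" % block_compressor
    return {"storageEngine": {"wiredTiger": {"configString": config}}}


def create_compressed_collection(db, name, block_compressor="zlib"):
    """Create a collection with the given compressor.

    `db` is expected to be a pymongo Database; the options are passed
    through **kwargs and end up in the server's create command.
    """
    options = wiredtiger_collection_options(block_compressor)
    return db.create_collection(name, **options)
```

With a live connection this would be called as `create_compressed_collection(MongoClient().mydb, "articles")`, where the database and collection names are placeholders. The option only takes effect when the server runs the WiredTiger storage engine.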

@amcgregor
Contributor

You don't have control over the construction of GridFS collections, either the file tracking one or the one containing the actual chunks. That leaves such configuration to manual effort or server-wide configuration, as was previously pointed out. Additionally, the MongoDB in-database compression algorithm defaults to Snappy, for performance reasons, or lets you use fast zlib, neither of which offers worthy compression ratios. (Zlib being a typical dictionary-based Huffman coder, Snappy using no entropy coding at all, instead relying on repetitions described by relative references in the output stream; so, at worst, it's literally 100% worse than gzip. More akin to RLE. ;)

On Lewis Carroll's "Through the Looking Glass" (Project Gutenberg txt edition), which should be highly compressible, "fast gzip" (-2) reduces the 168K source material to 70K. A 58.2% reduction is nothing to sneeze at. (Snappy, using gross estimates and comparisons to gzip, would get around 23% reduction on this file.)

Compare that to something a bit more… modern… like xz… given room to work (compression level -9). Same source material: 54K result. A 68% reduction.
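
The reduction figures above follow from `1 - compressed / original`. A rough way to repeat the comparison on your own file is sketched below, assuming Python's standard `zlib` at level 2 as a stand-in for "fast gzip" (the gzip header adds a few bytes on top) and `lzma` at preset 9 for xz. The function names are made up for this example.

```python
import lzma
import zlib


def reduction_percent(original_size, compressed_size):
    """Size reduction as a percentage of the original size."""
    return 100.0 * (1.0 - compressed_size / original_size)


def compare_compressors(data):
    """Return the reduction achieved by fast zlib and by xz on `data` (bytes)."""
    fast_zlib = zlib.compress(data, 2)          # comparable to gzip -2
    xz_best = lzma.compress(data, preset=9)     # comparable to xz -9
    return {
        "zlib level 2": reduction_percent(len(data), len(fast_zlib)),
        "xz preset 9": reduction_percent(len(data), len(xz_best)),
    }


# Rounded sizes quoted in the comment, in KB.
print(round(reduction_percent(168, 70), 1))  # 58.3
print(round(reduction_percent(168, 54), 1))  # 67.9
```

The rounded sizes give values within a fraction of a percent of the quoted ones; the exact byte counts would be needed to reproduce them precisely. To test a real file, pass its contents as bytes, e.g. `compare_compressors(open(path, "rb").read())`.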

Conclusion: compress material before archiving it into GridFS; WiredTiger compression is intended for absolute speed and data mutability, not efficient archival. This is doubly important if you store mixed content in GridFS, such as including images, audio, or video alongside the text content. Any form of in-database compression would actually increase the size of the stored data, if it's already extremely tightly entropy coded as sound and video are.
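
A minimal sketch of the "compress before archiving" approach, assuming `fs` is a `gridfs.GridFS` instance and using the standard library's `lzma` module for xz. The helper names, the `.xz` filename suffix, and the `uncompressed_length` field are illustrative choices, not an established convention.

```python
import lzma


def put_compressed(fs, data, filename, preset=9):
    """xz-compress `data` (bytes) and store it via fs.put(); returns the file id.

    Extra keyword arguments to GridFS.put() are stored as fields on the
    file document; `uncompressed_length` is a made-up field name.
    """
    compressed = lzma.compress(data, preset=preset)
    return fs.put(
        compressed,
        filename=filename + ".xz",
        uncompressed_length=len(data),
    )


def get_decompressed(fs, file_id):
    """Read a file stored by put_compressed() and return the original bytes."""
    return lzma.decompress(fs.get(file_id).read())
```

For mixed content, it would be reasonable to skip compression for media that is already entropy coded (images, audio, video) and store those bytes as they are.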
