
Mongoengine is very slow on large documents compared to native pymongo usage #1230

baruchoxman opened this issue Feb 7, 2016 · 47 comments


@baruchoxman

(See also this StackOverflow question)

I have the following mongoengine model:

class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    data_dict_2 = DictField(required=True)

In some cases the document in the DB can be very large (around 5-10 MB), and the data_dict fields contain complex nested documents (dicts of lists of dicts, etc.).

I have encountered two (possibly related) issues:

  1. When I run a native pymongo find_one() query, it returns within a second. When I run MyModel.objects.first(), it takes 5-10 seconds.
  2. When I query a single large document from the DB and then access one of its fields, it takes 10-20 seconds just to do the following:
    m = MyModel.objects.first()
    val = m.data_dict_1.get(some_key)

The data in the object does not contain references to any other objects, so this is not an issue of object dereferencing.
I suspect it is related to some inefficiency in mongoengine's internal data representation, which affects document construction as well as field access.
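
For concreteness, a minimal timing harness for the comparison above might look like this (a sketch; the connection details and the my_model collection name are assumptions based on MongoEngine's default naming):

import time

from pymongo import MongoClient
from mongoengine import connect

connect("mydb")  # illustrative database name
client = MongoClient()

# Native pymongo: returns the raw BSON-decoded dict.
start = time.time()
raw = client["mydb"]["my_model"].find_one()
print("pymongo find_one():", time.time() - start)

# MongoEngine: same data, but every nested value is recursively
# converted to field types while the Document instance is built.
start = time.time()
m = MyModel.objects.first()
print("mongoengine first():", time.time() - start)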

@touilleMan
Member

Hi,

Have you profiled the execution with profile/cProfile? A graph of it with objgraph should give us a better view of where the trouble is.
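
For example, something along these lines (a generic cProfile run from the standard library, nothing MongoEngine-specific):

import cProfile
import pstats

# Profile a single query and dump the stats to a file.
cProfile.run("MyModel.objects.first()", "query.prof")

# Show the 20 most expensive calls by cumulative time.
pstats.Stats("query.prof").sort_stats("cumulative").print_stats(20)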

@baruchoxman
Author

Hi,

Please see the following chart: http://i.stack.imgur.com/qAb0t.png (attached in the answer to my StackOverflow question). It shows that the bottleneck is in "DictField.to_python" (called 600,000 times).

@lafrech lafrech changed the title Mongoengine is very slow on large documents comapred to native pymongo usage Mongoengine is very slow on large documents compared to native pymongo usage Mar 22, 2016
@amcgregor
Contributor

The entire approach of eager conversion is potentially fundamentally flawed. Lazy conversion on first access would defer all conversion overhead to the point where the structure is actually accessed, eliminating it entirely when no access is made. (Beyond that, such a situation indicates that use of .only() and related helpers is warranted.)
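
For reference, the .only() mitigation mentioned above, applied to the model from the original report, would look like:

# Project only the small field; the multi-megabyte dicts are never
# transferred from the server, so there is nothing to convert.
m = MyModel.objects.only("date").first()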

@amcgregor
Contributor

Duplicate of #1137?

@apolkosnik

I've played a bit with the snippet, with some modifications (https://gist.github.com/apolkosnik/e94327e92dd2e0642e2b263efd87d1b1), and ran it against Mongoengine 0.8.8 and 0.11:

Please see the profiler graphs.
On 0.8.8:
mongoengine with dict took 16.95s
[profiler graph: viz_dict_me0_8_8]

On 0.11:
mongoengine with dict took 32.74s
[profiler graph: viz_dict_me0_11_0]

@apolkosnik

It looks like some change from 0.8.8 to 0.9+ caused get() in the ComplexBaseField class to go on a dereference spree for dicts.

@sauravshah

@wojcikstefan first of all, thank you for your contributions to Mongoengine.

We are using Mongoengine heavily in production and running into this issue. Is this something you are actively looking into?

@touilleMan
Member

@sauravshah I started investigating this issue and plan to release a fix for it.

If you cannot wait, the trouble is in the Document initialization, where a class is created and then instantiated.
By replacing self._data = SemiStrictDict.create(allowed_keys=self._fields_ordered)() with a simple self._data = {}, I get a 30% boost on my entire application (not a microbenchmark).

It is the same for StrictDict, but that is not as simple to fix (the StrictDict subclass should be generated in the metaclass defining the Document). However, I didn't see where it is really used.

There are two other performance-related issues that could hit you:

#1446 (PyMongo 3.4 doesn't have ensure_index, so a create_index request is actually sent to the MongoDB server before every save done by mongoengine). The solution is to handle index creation manually and disable it in mongoengine with meta = {..., 'auto_create_index': False} (see the sketch after this list).

#298 Accessing a reference field causes the referenced document to be fetched from the database. This is really an issue if you only wanted to access its _id field, which was already known in the first place... I'm working on a fix for this, but it is a really complicated issue given that early dereferencing is at the core of mongoengine :(
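
A sketch of the workaround for the first point, reusing the model from the original report (the index spec here is illustrative):

from mongoengine import Document, DateTimeField, DictField

class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField()
    data_dict_2 = DictField(required=True)

    meta = {
        "auto_create_index": False,  # no create_index call before each save
        "indexes": ["date"],
    }

# Run once, e.g. at deploy time, instead of implicitly on every save:
MyModel.ensure_indexes()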

@sauravshah

Thanks, @touilleMan - that helps a bit. I looked at closeio's fork of the project and they seem to have some good ideas.

Thank you for pointing out the other issues; they have already started hitting us in production :). I'm excited to know that you are on it, and really looking forward to a performant mongoengine.

Please let us know if you figure out any more quick wins in the meantime! Also, let me know if I can help in any way.

@sauravshah

@touilleMan were you able to fix these?

@touilleMan
Member

touilleMan commented Aug 24, 2017

@sauravshah sorry, I had a branch ready but forgot to issue a PR. Here it is: #1630, can you have a look at it?

Considering #298, I've tried numerous methods to create lazy referencing, but it involves too much magic (typically, with inheritance within the referenced class, you cannot know before dereferencing what the type of the returned instance will be).
So in the end, I will try to provide a new type of field, LazyReferenceField, which would return a Reference class instance, allowing access to the pk or a call to fetch() to get back the actual document. But this means one would have to rework one's code to make use of this feature :-(

@sauravshah

@touilleMan #1630 looks good to me.

Reg. #298, is it possible to take the class to be referenced as a kwarg on ReferenceField and solve this issue? Calling .fetch() would be too much rework in most cases (including ours). Also, how would you solve the referenced-class issue in .fetch()?

@touilleMan
Member

is it possible to take the class to be referenced as a kwarg on ReferenceField and solve this issue?

Not sure what you mean...
We could add __getattr__/__setattr__ methods to the LazyReference, which would dereference the document when it is accessed or modified.
This way you wouldn't have to change your code, except where you use isinstance, and this should greatly reduce the amount of code that needs to be reworked ;-)

@sauravshah

Why can't we follow this approach?

class A(Document):
    b = ReferenceField(B)

When A is loaded, we already have B's id, so an instance of class B can be created with just the id (plus a flag to denote that it hasn't been loaded yet). isinstance would work correctly in this case.

Once __getattr__/__setattr__ is called, a query to the DB could load the actual mongo document.

@touilleMan
Member

The trouble is that B may have child classes:

class B(Document):
    meta = {'allow_inheritance': True}


class Bchild(B):
    pass


class A(Document):
    b = ReferenceField(B)


b_child = Bchild().save()
a = A(b=b_child).save()

In this example you cannot know that a.b is a Bchild instance before dereferencing it.

@sauravshah

Ah ok, I understand the problem now. This is not a big deal for us (and, I would assume, for most projects).

For backward compatibility, is it possible to add a kwarg to LazyReference (maybe ignore_inheritance) and make isinstance work when that kwarg is present?

isinstance is used all over the place in django-mongoengine, so it would be great not to dereference on it.

@amcgregor
Contributor

As an interesting idea: it can know what it references prior to dereferencing if the _cls reference is stored in the DBRef (concrete; technically allowed via **kwargs inclusion in the resulting SON), or if it is stored similarly to a CachedReferenceField that incorporates that value.

@benjhastings

Does anyone know if there is a patch in the works for this issue?

@touilleMan
Member

@benjhastings #1690 is the solution, but it requires some changes to your code (switching from ReferenceField to LazyReferenceField).
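
For anyone landing here later, the migration is roughly as follows (a minimal sketch; the model names are illustrative):

from mongoengine import Document, LazyReferenceField, StringField

class Org(Document):
    name = StringField()

class User(Document):
    org = LazyReferenceField(Org)  # was: ReferenceField(Org)

user = User.objects.first()
pk = user.org.pk      # the referenced id, no extra query
org = user.org.fetch()  # explicit round-trip to load the Org document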

@benjhastings

@touilleMan
Member

@benjhastings If your perf trouble comes from a document that is too big... well, there is nothing that can save you right now :-(
I guess DictField could be improved (or a RawDictField could be created) to do no deserialization at all on the data.
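
A hypothetical sketch of that idea; RawDictField is not an actual MongoEngine field, and depending on the MongoEngine version the ComplexBaseField descriptor machinery may still do some work on access:

from mongoengine import DictField

class RawDictField(DictField):
    """Hypothetical field returning the BSON-decoded dict untouched."""

    def to_python(self, value):
        # Skip the recursive per-value conversion that dominates the
        # profiles above; values are handed over exactly as pymongo
        # decoded them.
        return value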

@amcgregor
Contributor

amcgregor commented Jul 4, 2018

I have written an alternative Document container type which internally preserves the MongoDB-native value types rather than Python-typecast values, casts only on actual attribute access (get/set) of the desired field, and is directly usable with the PyMongo base APIs as if it were a dictionary; no conversion on use. There is no eager bulk conversion of records' values as they stream in, which is a large part of the overhead, and similarly no eager dereferencing (additional independent read queries for each record loaded), going with a hardline interpretation of "explicit is better than implicit". Relevant links:

  • Document (MutableMapping proxy to an ordered dictionary)
  • Container (underlying declarative base class)
  • Reference (Field subclass)

Use of Reference(Foo, concrete=True, cache=['_cls']) would store an import-path reference (e.g. "myapp.model:Foo") within the DBRef. (If Foo is a model that allows subclassing, typically by inheriting the Derived trait, which defines and automates the calculation of a _cls field import reference.)

@shr00mie

...well... I just got annoyed enough with mongoengine to google what's what and found this... great.

Should be on current versions of pymongo and mongoengine, per pip install -U.

here's my output a la @apolkosnik:
dict:
[profiler graph: viz_dict]

embed:
[profiler graph: viz_embed]

console:
pymongo with dict took 0.06s
pymongo with embed took 0.06s
mongoengine with dict took 16.72s
mongoengine with embed took 0.74s
mongoengine with dict as_pymongo() took 0.06s
mongoengine with embed as_pymongo() took 0.06s
mongoengine aggregation with dict took 0.11s
mongoengine aggregation with embed took 0.11s

if DictField is the issue, then please, for the love of all that is holy, let us know what to change it to, or fix it. Watching mongo and pymongo respond almost immediately and then waiting close to a minute for mongoengine to... do whatever it's doing... is kind of a massive bottleneck. I dig the rest of the package, but if this can't be resolved on the package side...
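
As the numbers above already show, one workaround that exists today is .as_pymongo(), which skips Document construction entirely and yields raw dicts (at the cost of losing field access and the other Document helpers):

# Raw dicts straight from pymongo: ~0.06s in the benchmark above,
# versus ~16.7s for full Document construction.
for raw in MyModel.objects.as_pymongo():
    value = raw["data_dict_1"].get("some_key")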

@shr00mie

shr00mie commented Aug 2, 2018

cricket...cricket...

oh look at that. pymodm. and to porting we go.

@nickfaughey

Just hit this bottleneck inserting a couple of ~5MB documents. Pretty much a deal breaker: an insert that takes less than a second with pymongo takes over a minute with MongoEngine.

@shr00mie

shr00mie commented Apr 2, 2019

@nickfaughey I switched to pymodm. It took very little (if any) modification to my existing code and is lightning fast. And it's by MongoDB, so development is ongoing.

@Cayke

Cayke commented Apr 2, 2019

pymodm has a similar syntax and is much faster. You should try it.

@amcgregor
Contributor

amcgregor commented Apr 4, 2019

As some absolutely direct <ShamelessSelfPromotion>: I'd like to point out again that I also offer an alternative, directly designed to alleviate some of the issues with ME I've encountered, or issues I've submitted but never had corrected. E.g. promotion/demotion; the distinction between embedded and top-level documents (there should be none: allow embedding of top-level documents, with collection and active-record behaviour isolated and optional); lazy conversion (not eager, let alone eager sub-findMany and conversion of References, or worse, lists of References…); minimal interposing (I don't track document dirty state); inline comparison generating filter documents (an alternative to parametric querying, which is… limiting); extremely rich and expressive allowable type conversions across most field types (ObjectId ~= datetime, but also anything date-like, like timedelta); 99.27% test coverage, 100% if you ignore two codepaths rarely hit (unless you dir() or star-import specific modules…). My package even has an opinion on how one should store localized data, something a naive approach harshly penalizes. (Naive being {"en": "English Text", "fr": "Text Francois", …}; don't do that.)

Marrow Mongo (see also: WIP documentation manual)

Using the parametric helpers, the syntax is nearly identical to MongoEngine, even keeping most of the same operator prefixes and suffixes so as to maintain compatibility:

q1 = F(Foo, age__gt=30)  # {'age': {'$gt': 30}}
q2 = (Foo.age > 30)  # {'age': {'$gt': 30}}
q3 = F(Foo, not__age__gt=30)  # {'age': {'$not': {'$gt': 30}}}
q4 = F(Foo, attribute__name__exists=False)  # {'attribute.name': {'$exists': False}}

Combinable using the & and | operators. There are much more interesting things you can do, though. (Direct iteration of filter sets is currently planned.)

# Iterate all threads created or replied to within the last 7 days.
for record in (Thread.id | Thread.reply.id) >= -timedelta(days=7):
    ...

@nickfaughey

Sweet, I'll check these two out. In the meantime I've literally just bypassed MongoEngine for these large documents and access mongo directly with PyMongo, but it would be nice to keep an ODM there for schema sanity.

@shr00mie

shr00mie commented Apr 5, 2019

@nickfaughey you didn't even need to go that far. Pymodm has pretty much the same ODM syntax as mongoengine. Literally has ODM in the name. 😉

@amcgregor
Contributor

amcgregor commented Apr 5, 2019

I'd love to formalize this benchmark set (akin to the template engines' "bigtable" test) and add more contenders to it. The code below the file of older results demonstrates, effectively side by side, identical solutions.

This is a more direct comparison of querying, specifically and in isolation, with the note that as_query is entirely unnecessary on the MM side; just pass find_{one/many} the Filter instance, since it is natively a suitable mapping. (Oh, and ME appears to be unable to "continue" from a "compiled" query, e.g. reconstitute the rich Q object from a plain dict; at least it couldn't when I made that comparison.)

@olka

olka commented Nov 27, 2019

Same here: the to_python call produces 70% of the overhead.
[profiler graph: to_python]

@pikeas

pikeas commented Oct 10, 2020

Is this still an issue? I'm using pymodm and would prefer switching to MongoEngine as a more popular ODM, but poor large-document performance would be a deal breaker.

@amcgregor
Contributor

amcgregor commented Oct 11, 2020

@pikeas Yes, with some variance for additional optimization here and further complication there… the underlying mechanism remains "eager": upon retrieval of a record, MongoEngine recursively casts all elements of the document it can to native Python types via repeated to_python invocation.

This contrasts with my own DAO's approach (if I'm going to be fixing everything, I might as well start from scratch), which is purely lazy: transformers to "cast" (or just generally process) MongoDB values to native values are executed on attribute access, and bypassed by dictionary dereferencing. The Document class's equivalent from_mongo factory class method only performs the outermost Document object lookup and wrapping. Mine was written after many years of MongoEngine use and frustration with the lack of progress on numerous fronts. Parts are still enjoyably crazy, but at least I can very exactly explain the "crazy" in mine. 😉
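
A toy illustration of the lazy-on-access pattern being described (generic Python, not Marrow Mongo's actual implementation):

from datetime import datetime

class LazyField:
    # Descriptor that casts the raw stored value only when the
    # attribute is actually read, instead of eagerly at load time.
    def __init__(self, name, cast):
        self.name = name
        self.cast = cast

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return self.cast(obj._raw[self.name])

class Doc:
    created = LazyField("created", datetime.fromtimestamp)

    def __init__(self, raw):
        self._raw = raw  # BSON-decoded mapping kept untouched

d = Doc({"created": 1_600_000_000})
print(d.created)  # the cast happens here, on access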

Edited to add: note that a double underscore-prefixed (dunderscore / soft private) initializer argument is available to disable eager conversion. The underlying machinery iteratively utilizes both explicit to_python invocation and indirect invocation via setters (L125), which doesn't make it much easier to follow. 🙁

Using my silly simple benchmark, the latest deserialization numbers:

  • MongoEngine 0.20.0 0.4516570568084717s (5× longer)
  • Marrow Mongo 2.0 (next) 0.08598494529724121s

Admittedly, other areas differ in the opposite direction. Unsaved instance construction is faster under MongoEngine:

  • MongoEngine: 0.03315997123718262s
  • Marrow Mongo: 0.26718783378601074s (8× longer; Marrow Mongo shifts most of the responsibility to the initializer, leaving zero work at save-time: no waiting until save for a validation error, for example; make the assignment and you get your ValueError. The Document instance is directly usable with native PyMongo APIs as a dictionary.)

@bagerard
Collaborator

Note that I'll probably start experimenting soon with lazy-loading attributes in mongoengine (i.e. deferring the Python deserialization until the attribute is actually accessed, as is done in pymodm).

@amcgregor
Contributor

I completed the initial lazy version myself 5 years ago, with minor deficits later corrected (e.g. from_mongo of an already-cast Document). I hope you don't mind that I didn't wait.

@bagerard
Collaborator

bagerard commented Oct 12, 2020

You do whatever you want in your own project :) I'll make sure to check how you dealt with that (compared to pymodm) when I work on it, out of curiosity. I understand the reasons that made you move away from mongoengine, but I would appreciate it if we could keep the discussions in the mongoengine project constructive.

@amcgregor
Contributor

amcgregor commented Oct 13, 2020

@bagerard As the tag on my comments identifies, I'm a past direct code contributor.

I understand the reasons that made you move away from mongoengine

marrow/contentment#12 — I indexed my issues for handy reference. A number date back to 2013: almost 8 years with little to no progress. Progress on some, of course! #1136 (a regression in limit use in 0.10.0) was a milestone, giving me the needed kick to get going on my own.

Many others documented there were simply never engineered to be problems in the first place: e.g. removal of the "Document" vs. "DynamicDocument" differentiation, clear segregation and isolation of "active collection"-like behavior, explicit avoidance of the self-saving (change-tracking) "active record" pattern, no global registry demanding unique model class names, plus virtually no form of implicit caching or automatic (edit: eager/recursive) casting/conversion. No result-set middleware or connection-management middleware, and so on. The Document instances are usable anywhere a dictionary is, within plain PyMongo APIs, with zero use-time work. (And near-zero effort to encapsulate raw PyMongo result-set records.) It's, unfortunately, almost exactly the opposite of MongoEngine. 😕

@bagerard
Collaborator

I've seen your post many times, along with all the references it created in our tickets; I didn't need you to elaborate. If you don't like MongoEngine and gave up on improving it, that's ok, but if we could keep the discussions in the MongoEngine project (thus its issues) focused on actually improving MongoEngine, that would be more helpful.

@bkntr

bkntr commented Dec 7, 2020

Is there a plan to add lazy init to MongoEngine any time soon?

@neilsh

neilsh commented Jan 17, 2022

Any news on this, or ways others can help?

@cundi

cundi commented Mar 1, 2023

PyMODM is better than this lib. Does someone know of a lib like it?

@amcgregor
Contributor

amcgregor commented Nov 10, 2023

PyMODM is better than this lib. Does someone know of a lib like it?

I… somewhat hate to do this, but given how stale this issue (and numerous others; see my previous comments) has become, I wrote a replacement for myself. I haven't really gone out of my way to advertise it, but it follows some of the suggestions given here re: lazy conversion. (The Document and Query classes act like plain, PyMongo-compatible dictionaries/mappings, always storing MongoDB-safe types internally, thus no conversion on final use.) It is rather stable, extremely well unit-tested, and in active use.

It does not, however, reimplement the full nested change-tracking active-record approach, nor automatic foreign-record lookup. (Those are also performance and memory-utilization problems.) It uses a "layer your needs" approach: the base Document class is essentially just a fancy, declaratively defined dictionary with validation and typecasting. Mix-ins such as Queryable (a type of Collection of Identified records) add bind (to connect a PyMongo DB or collection), find, insert_one, &c. methods. There are also data-defining mix-ins beyond Identified, such as Localized or Published. .get() or .first() are replaced by class dereferencing on bound classes: User[identifier]

To allow multiple installed packages to cooperate in defining fields or traits, entry_points-based pseudo-namespaces can be imported from: marrow.mongo.document, marrow.mongo.field, and marrow.mongo.trait. Register an entry_point in the appropriate namespace, and your custom document, field, or trait can be imported from there too. (Making conflicts explicit.) There is a minimal beginning of proper documentation.
