Mongoengine is very slow on large documents compared to native pymongo usage #1230
Hi, have you profiled the execution with …?
Hi, please see the following chart: http://i.stack.imgur.com/qAb0t.png (attached in the answer to my StackOverflow question). It shows that the bottleneck is in `DictField.to_python` (being called 600,000 times).
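For anyone who wants to reproduce that kind of measurement, here is a minimal, driver-free profiling sketch using the stdlib `cProfile`. The `to_python` function below is an illustrative stand-in for MongoEngine's recursive field casting, not the real implementation:

```python
import cProfile
import io
import pstats

def to_python(value):
    """Stand-in for a field's to_python(): recursively rebuild a nested
    structure, mimicking MongoEngine's eager per-element casting."""
    if isinstance(value, dict):
        return {k: to_python(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_python(v) for v in value]
    return value

# A document shaped roughly like a large DictField payload.
doc = {str(i): {"a": {"b": list(range(5))}} for i in range(1000)}

profiler = cProfile.Profile()
profiler.enable()
to_python(doc)
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats("to_python")
print(buf.getvalue())  # shows the per-call count for to_python, as in the chart
```

Running this against real model loading (profiling `MyModel.objects.first()` instead of the stand-in) is what surfaces the 600,000-call hot spot shown in the chart.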
The entire approach of eager conversion is potentially fundamentally flawed. Lazy conversion on first access would defer all conversion overhead to the point where the structure is actually accessed, eliminating it completely in the case where no access is made. (Beyond such a situation indicating proper use of …)
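The lazy-on-first-access idea can be sketched in a few lines. This is purely illustrative: `LazyDocument`, `_raw`, and the converter mapping are hypothetical names, not MongoEngine API.

```python
class LazyDocument:
    """Sketch of lazy conversion: raw values from the driver are stored
    untouched, and a field's converter runs only on first attribute access."""

    def __init__(self, raw, converters):
        self._raw = raw                # raw dict straight from the driver
        self._converters = converters  # field name -> to_python-style callable
        self._cache = {}

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for document fields.
        if name in self._cache:
            return self._cache[name]
        try:
            raw_value = self._raw[name]
        except KeyError:
            raise AttributeError(name)
        convert = self._converters.get(name, lambda v: v)
        value = convert(raw_value)     # conversion deferred until this access
        self._cache[name] = value
        return value

calls = []
def dict_to_python(value):             # stand-in for DictField.to_python
    calls.append(value)
    return dict(value)

doc = LazyDocument({"data": {"k": 1}, "name": "x"}, {"data": dict_to_python})
print(len(calls))   # 0 -- no conversions at load time
print(doc.data)     # {'k': 1} -- first access triggers exactly one conversion
print(len(calls))   # 1 -- and the result is cached for later reads
```

A document that is loaded but never inspected field-by-field would, under this scheme, pay zero conversion cost, which is the point being made above.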
Duplicate of #1137?
I've played a bit with a snippet and with some modifications (https://gist.github.com/apolkosnik/e94327e92dd2e0642e2b263efd87d1b1), then ran it against MongoEngine 0.8.8 and 0.11. Please see the pictures...
It looks like some change from 0.8.8 to 0.9+ caused `get()` in the `ComplexBaseField` class to go on a dereference spree for dicts.
@wojcikstefan first of all, thank you for your contributions to MongoEngine. We are using MongoEngine heavily in production and running into this issue. Is this something you are actively looking into?
@sauravshah I started investigating this issue and planned to release a fix for it. If you cannot wait, the trouble is in the Document initialization, where a class is created and then instantiated. It is the same thing for …. There are two other performance-related issues that could hit you: #1446 (pymongo 3.4 doesn't have …) and #298 (accessing a reference field causes the referenced document to be fetched from the database, which is really an issue if you only wanted to access its id).
Thanks, @touilleMan - that helps a bit. I looked at closeio's fork of the project and they seem to have some good ideas. Thank you for pointing to the other issues; they have already started hitting us in production :). I'm excited to know that you are on it, and I'm really looking forward to a performant MongoEngine. Please let us know if you figure out any more quick wins in the meantime! Also, let me know if I can help in any way.
@touilleMan were you able to fix these?
@sauravshah sorry, I had a branch ready but forgot to open a PR; here it is: #1630. Can you have a look at it? Concerning #298, I've tried numerous methods to create lazy referencing, but it involves too much magic (typically, when there is inheritance within the referenced class, you can't know before dereferencing what the type of the returned instance will be).
@touilleMan #1630 looks good to me. Regarding #298, is it possible to take the class to be referenced as a kwarg on …?
Not sure what you mean...
Why can't we follow this approach?
When A is loaded, we already have B's …. Once …
The trouble:

```python
class B(Document):
    meta = {'allow_inheritance': True}

class Bchild(B):
    pass

class A(Document):
    b = ReferenceField(B)

b_child = Bchild().save()
a = A(b=b_child).save()
```

In this example you cannot know …
Ah ok, I understand the problem now. This is not a big deal for us (and, I would assume, for most projects). For backward compatibility, is it possible to add a kwarg to …?
As an interesting idea: it can know what it references prior to dereferencing if the …
Does anyone know if there is a patch in the works for this issue? |
@benjhastings #1690 is the solution, but it requires some changes to your code (switching from …).
How does that work if a `DictField` is used, though, as per https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-compared-to-native-pymongo-usage/35274930#35274930 ?
@benjhastings If your perf trouble comes from a document that is too big... well, there is nothing that can save you right now :-(
I have written an alternative Document container type which internally preserves the MongoDB-native value types rather than Python-typecast values, casts only on actual attribute access (get/set) to the desired field, and is directly usable with the base PyMongo APIs as if it were a dictionary; no conversion on use. There is no eager bulk conversion of records' values as they stream in (a large part of the overhead) and, similarly, no eager dereferencing (additional independent read queries for each record loaded), in keeping with a hardline interpretation of "explicit is better than implicit". Relevant links:
Use of a …
...well, I just got annoyed enough with mongoengine to google what's what and find this...great. Should be on current versions of pymongo and mongoengine per pip install -U. Here's my output a la @apolkosnik: console: If DictField is the issue, then please, for the love of all that is holy, let us know what to change it to, or fix it. Watching mongo and pymongo respond almost immediately and then waiting close to a minute for mongoengine to...do whatever it's doing...is kind of a massive bottleneck. I dig the rest of the package, but if this can't be resolved on the package side...
cricket...cricket... Oh, look at that: pymodm. And to porting we go.
Just hit this bottleneck inserting a couple of ~5MB documents. Pretty much a deal breaker: an insert that takes less than a second with pymongo takes over a minute with MongoEngine.
@nickfaughey I switched to pymodm. It took very little (if any) modification of my existing code and is lightning fast. And it's by MongoDB, so development is ongoing.
pymodm has a similar syntax and is much faster. You should try it.
Some absolutely direct Marrow Mongo examples (see also the WIP documentation manual): using the parametric helpers, the syntax is nearly identical to MongoEngine's, even with most of the same operator prefixes and suffixes, so as to maintain that compatibility:
Combinable using …
Sweet, I'll check these two out. In the meantime I've literally just bypassed MongoEngine for these large documents and access Mongo directly with PyMongo, but it would be nice to keep an ODM there for schema sanity.
@nickfaughey you didn't even need to go that far. Pymodm has pretty much the same ODM syntax as mongoengine. Literally has ODM in the name. 😉 |
I'd love to formalize this benchmark set (akin to the template engines' "bigtable" test) and add more contenders to it. The code below the file of older results demonstrates effectively side-by-side identical solutions. This is a more direct comparison of querying, specifically and in isolation, with the note that …
Is this still an issue? I'm using Pymodm and would prefer switching to MongoEngine as a more popular ODM, but poor large object performance would be a deal breaker. |
@pikeas Yes, with some variance for additional optimization here and further complication over there… the underlying mechanism remains "eager": upon retrieval of a record, MongoEngine recursively casts every element of the document it can to native Python types via repeated … calls. This contrasts with my own DAO's approach (if I'm going to be fixing everything, I might as well start from scratch), which is purely lazy: transformers that "cast" (or just generally process) MongoDB values into native values are executed on attribute access, and bypassed by dictionary dereferencing. The Document class' equivalent … Edited to add: note that a double-underscore-prefixed (dunderscore / soft private) initializer argument is available to disable eager conversion. The underlying machinery iteratively utilizes both explicit … Using my silly simple benchmark, the latest deserialization numbers:
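The "transformers executed on attribute access" approach described above can be illustrated with a plain non-data descriptor. Because it defines no `__set__`, the first access can cache the converted value in the instance `__dict__`, which then shadows the descriptor on later reads. All names here are illustrative, not taken from either library:

```python
class LazyAttr:
    """Non-data descriptor sketch of on-access deserialization. With no
    __set__, the value cached in the instance __dict__ shadows the
    descriptor, so conversion runs at most once per instance."""

    def __init__(self, name, to_python):
        self.name = name
        self.to_python = to_python

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        value = self.to_python(obj._data[self.name])  # convert on first read
        obj.__dict__[self.name] = value               # cache: shadows descriptor
        return value

conversions = []
def to_int(raw):
    conversions.append(raw)
    return int(raw)

class Doc:
    age = LazyAttr("age", to_int)
    def __init__(self, data):
        self._data = data   # raw values straight from the driver

d = Doc({"age": "42"})
print(conversions)  # [] -- nothing converted at load
print(d.age)        # 42 -- converted here, on first access
print(d.age)        # 42 -- served from the instance __dict__
print(conversions)  # ['42'] -- exactly one conversion
```

Dictionary-style access to the raw `_data` bypasses the transformer entirely, matching the "bypassed by dictionary dereferencing" behavior described above.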
Admittedly, other areas differ in the opposite direction. Unsaved instance construction is faster under MongoEngine:
Note that I'll probably start experimenting with lazy loading of attributes in mongoengine soon (i.e. defer the Python deserialization until the attribute is actually accessed, as is done in pymodm).
The initial lazy version I completed myself 5 years ago, with minor deficits later corrected, e.g. …
You can do whatever you want in your own project :) I'll make sure to check how you dealt with that (compared to pymodm) when I work on it, out of curiosity. I understand the reasons that made you move away from mongoengine, but I would appreciate it if we could keep the discussions in the mongoengine project constructive.
@bagerard As the tag on my comments identifies, I'm a past direct code contributor.
marrow/contentment#12 — I indexed my issues for handy reference. A number date back to 2013: almost 8 years with little to no progress. Progress on some, of course! #1136 (regression in …). Many others documented there were simply never engineered to be problems in the first place, e.g. by removing the "Document" vs. "DynamicDocument" differentiation, clearly segregating and isolating "active collection"-like behavior, explicitly avoiding the self-saving (change-tracking) "active record" pattern, having no global registry demanding unique model class names, plus virtually no form of implicit caching or automatic (edit: eager/recursive) casting/conversion. No result-set middleware, no connection-management middleware, and so on. The Document instances are usable anywhere a dictionary is, within plain PyMongo APIs, with zero use-time work. (And near-zero effort to encapsulate raw PyMongo result-set records.) It's, unfortunately, almost exactly the opposite of MongoEngine. 😕
I've seen your post many times, along with all the references it created in our tickets; I didn't need you to elaborate. If you don't like MongoEngine and gave up on improving it, that's OK, but if we could keep the discussions in the MongoEngine project (thus its issues) focused on actually improving MongoEngine, that would be more helpful.
Is there a plan to add lazy init to MongoEngine any time soon? |
Any news on this, or ways others can help? |
PyMODM is better than this lib. Does anyone know of a library like it?
I… somewhat hate to do this, but given how stale this (and numerous other issues; see my previous comments) have become, I wrote a replacement for myself. I haven't really gone out of my way to advertise it, but it follows some of the suggestions given here re: lazy conversion. (The …) It does not, however, reimplement the full nested change-tracking active-record approach, nor automatic foreign record lookup. (Those are also performance and memory utilization problems.) It uses a "layer your needs" approach. The base …. To allow multiple installed packages to cooperate in defining fields or traits, entry_points-based pseudo-namespaces can be imported from: …
(See also this StackOverflow question)
I have the following mongoengine model:
In some cases the document in the DB can be very large (around 5-10 MB), and the data_dict fields contain complex nested documents (dicts of lists of dicts, etc.).
I have encountered two (possibly related) issues:
When I run a pymongo `find_one()` query, it returns within a second. When I run `MyModel.objects.first()`, it takes 5-10 seconds.

```python
m = MyModel.objects.first()
val = m.data_dict_1.get(some_key)
```
The data in the object does not contain any references to any other objects, so it is not an issue of object dereferencing.
I suspect it is related to some inefficiency in the internal data representation of mongoengine, which affects the document object construction as well as field access.
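That suspicion is easy to demonstrate without a database: one eager recursive pass over a large nested structure dwarfs the cost of simply keeping the raw driver output. The `eager_convert` function below is an illustrative stand-in for the recursive casting an ODM performs on load, not MongoEngine code:

```python
import timeit

def eager_convert(value):
    """Recursively rebuild a nested structure, mimicking the eager
    per-element casting an ODM does on document load (illustrative)."""
    if isinstance(value, dict):
        return {k: eager_convert(v) for k, v in value.items()}
    if isinstance(value, list):
        return [eager_convert(v) for v in value]
    return value

# Roughly the shape of a large DictField payload.
raw = {str(i): {"vals": list(range(20))} for i in range(5000)}

eager = timeit.timeit(lambda: eager_convert(raw), number=10)
lazy = timeit.timeit(lambda: raw, number=10)  # "lazy" keeps the raw dict as-is

print(f"eager: {eager:.4f}s  lazy: {lazy:.6f}s")
```

The absolute numbers depend on the machine, but the eager pass is consistently orders of magnitude slower than returning the untouched dict, which matches the pymongo-vs-MongoEngine gap reported above.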