INTPYTHON-617 - Improve DictField to_python performance #2888

NoahStapp · 2025-06-24T14:33:23Z

Improves performance of large documents with nested DictFields by nearly 20x, addressing #1230.

Modified benchmark script pulled from https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-compared-to-native-pymongo-usage/:

import datetime
import itertools
import random
import timeit
from collections import defaultdict

import mongoengine as db

db.connect("test-dicts")

class MyModel(db.Document):
    date = db.DateTimeField(required=True, default=datetime.date.today)
    data_dict_1 = db.DictField(required=False)

MyModel.drop_collection()

data_1 = ['foo', 'bar']
data_2 = ['spam', 'eggs', 'ham']
data_3 = ["subf{}".format(f) for f in range(5)]

m = MyModel()
tree = lambda: defaultdict(tree)
data = tree()
for _d1, _d2, _d3 in itertools.product(data_1, data_2, data_3):
    data[_d1][_d2][_d3] = list(random.sample(range(50000), 20000))
m.data_dict_1 = data
m.save()

def pymongo_doc():
    return db.connection.get_connection()["test-dicts"]['my_model'].find_one()

def mongoengine_doc():
    model = MyModel.objects.first()
    return model

if __name__ == '__main__':
    print("pymongo took {:2.2f}s".format(timeit.timeit(pymongo_doc, number=10)))
    print("mongoengine took {:2.2f}s".format(timeit.timeit(mongoengine_doc, number=10)))

Before:

pymongo took 0.21s
mongoengine took 4.98s

After:

pymongo took 0.20s
mongoengine took 0.20s

terencehonles · 2025-08-14T06:58:16Z

Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the to_python caching you do at the top level not to be that impactful (but maybe it is). If this is really the case, then it may make sense to move your change down into the base class so this applies to lists too.

NoahStapp · 2025-08-15T18:57:18Z

Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the to_python caching you do at the top level not to be that impactful (but maybe it is). If this is really the case, then it may make sense to move your change down into the base class so this applies to lists too.

What do you by mean "still apply"? The benchmarking script was run against the current master branch as of June with and then without the change to DictField.to_python.

neilsh · 2025-08-18T18:40:00Z

mongoengine/fields.py

+        return (
+            {k: to_python(v) for k, v in value.items()}
+            if to_python and value
+            else value or None


I think this would convert a {} to None

class MyModel(db.Document): data_dict_1 = db.DictField(required=False) m = MyModel() m.data_dict_1 = {} m.save() model = MyModel.objects.first() print(model.data_dict_1)

Outputs {}.

Some example test cases showing where the behavior differs:

from mongoengine import Document, DictField, IntField, ListField, MapField from tests.utils import MongoDBTestCase class TestDictFieldRegressions(MongoDBTestCase): def test_nested_empty_dict_in_list_of_dictfield_roundtrips_as_dict_regression(self): """Regression: master => {}, PR => None. Empty dict appended into ListField(DictField()) should roundtrip as {}. """ class Model(Document): items = ListField(DictField()) Model.drop_collection() doc = Model(items=[{"a": 1}]).save() Model.objects(id=doc.id).update(push__items={}) doc.reload() assert doc.items[-1] == {} assert isinstance(doc.items[-1], dict) def test_nested_empty_dict_in_list_of_mapfield_roundtrips_as_dict_regression(self): """Regression: master => {}, PR => None. Empty dict appended into ListField(MapField(IntField())) should roundtrip as {}. """ class Model(Document): items = ListField(MapField(IntField())) Model.drop_collection() doc = Model(items=[{"a": 1}]).save() Model.objects(id=doc.id).update(push__items={}) doc.reload() assert doc.items[-1] == {} assert isinstance(doc.items[-1], dict) def test_top_level_dictfield_default_preserves_empty_dict(self): """No regression: master => {}, PR => {}.""" class Post(Document): info = DictField() Post.drop_collection() Post(info={}).save() p = Post.objects.first() assert p.info == {} assert isinstance(p.info, dict) def test_top_level_dictfield_null_true_preserves_empty_dict_regression(self): """Regression: master => {}, PR => None. With null=True, empty dict should still roundtrip as {} at top level. """ class PostNull(Document): info = DictField(null=True) PostNull.drop_collection() PostNull(info={}).save() pn = PostNull.objects.first() assert pn.info == {} assert isinstance(pn.info, dict)

It's not as elegant, but I think this should handle the failed cases while preserving most of the performance improvement:

def to_python(self, value): # Fast-path for mapping of homogenous child field conversions, # preserving behavior (no {} -> None) while avoiding ComplexBaseField's # generic recursion when possible. if not isinstance(value, dict) or self.field is None: return super().to_python(value) elif not value: # Keep empty dicts intact return value elif type(self.field).to_python is BaseField.to_python: # identity return value else: child_to_python = self.field.to_python return {k: child_to_python(v) for k, v in value.items()}

I believe this can be simplified:

def to_python(self, value): if value is None: return None to_python = getattr(self.field, "to_python", None) if not to_python or not value: return value return {k: to_python(v) for k, v in value.items()}

This passes all of the current test cases plus the ones you've added above.

@NoahStapp if a DictField receives an invalid/non-dict value (e.g. a list instead of a dict), I think the one I proposed (and orig) would give a mongoengine ValidationError, while the one you proposed would give a standard error (AttributeError?)

getattr also has a bit of a performance hit, although I haven't benchmarked it

Using my proposed code, I get a ValidationError when passing an invalid value to a DictField:

class Model(db.Document): field = db.DictField() m = Model(field=[]) m.save() ------ mongoengine.errors.ValidationError: ValidationError (Model:None) (Only dictionaries may be used in a DictField: ['field'])

terencehonles · 2025-08-19T06:35:22Z

Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the to_python caching you do at the top level not to be that impactful (but maybe it is). If this is really the case, then it may make sense to move your change down into the base class so this applies to lists too.

What do you by mean "still apply"? The benchmarking script was run against the current master branch as of June with and then without the change to DictField.to_python.

I wrote what I meant right after

You seem to have changed the (your) code substantially

It's not obvious what commit you ran your benchmarking from, so I was just confirming. The rest of my comment applies. Have you profiled this to see what code is the bottleneck that you're removing?

NoahStapp · 2025-08-19T13:43:05Z

Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the to_python caching you do at the top level not to be that impactful (but maybe it is). If this is really the case, then it may make sense to move your change down into the base class so this applies to lists too.

What do you by mean "still apply"? The benchmarking script was run against the current master branch as of June with and then without the change to DictField.to_python.

I wrote what I meant right after

You seem to have changed the (your) code substantially

It's not obvious what commit you ran your benchmarking from, so I was just confirming. The rest of my comment applies. Have you profiled this to see what code is the bottleneck that you're removing?

The speedup comes from not calling to_python on builtin Python types that don't need to be converted. Here's the snakeviz visualization of the profiling for master versus this change:

Master

This PR

The call graph shown is different due to nearly all of the calls from the root being PyMongo driver background operations

The core difference shown is that in the current behavior, DictField.to_python makes 6,000,400 recursive calls to itself even though all of its data is primitive Python types and not other Field objects.

In comparison, with this change, DictField.to_python makes 10 calls without any recursion.

bagerard · 2025-08-19T17:26:06Z

I was curious and had a quick look yesterday but I'm suspecting that it's only working with simple dicts (with native types that don't need any parsing), we must double check that it's also working with nested reference fields or embedded docs or fields that would require conversion when loaded from db. If you look at the basecomplexfield.to_python it has some code related with that that the current proposal is just skipping

NoahStapp · 2025-08-19T20:34:47Z

I was curious and had a quick look yesterday but I'm suspecting that it's only working with simple dicts (with native types that don't need any parsing), we must double check that it's also working with nested reference fields or embedded docs or fields that would require conversion when loaded from db. If you look at the basecomplexfield.to_python it has some code related with that that the current proposal is just skipping

Is passing the existing test suite sufficient to say that this change doesn't introduce any breaking changes? Or is there a specific example or test I could add to verify correct behavior beyond what already exists in the suite?

ShaneHarvey · 2025-09-19T18:29:53Z

Is this the test that proves embedded reference fields still work as expected?

mongoengine/tests/fields/test_dict_field.py

Lines 345 to 390 in e844138

    
           def test_dictfield_with_referencefield_complex_nesting_cases(self): 
        
               """Ensure complex nesting inside DictField handles dereferencing of ReferenceField(dbref=True | False)""" 
        
               # Relates to Issue #1453 
        
               class Doc(Document): 
        
                   s = StringField() 
        
               class Simple(Document): 
        
                   mapping0 = DictField(ReferenceField(Doc, dbref=True)) 
        
                   mapping1 = DictField(ReferenceField(Doc, dbref=False)) 
        
                   mapping2 = DictField(ListField(ReferenceField(Doc, dbref=True))) 
        
                   mapping3 = DictField(ListField(ReferenceField(Doc, dbref=False))) 
        
                   mapping4 = DictField(DictField(field=ReferenceField(Doc, dbref=True))) 
        
                   mapping5 = DictField(DictField(field=ReferenceField(Doc, dbref=False))) 
        
                   mapping6 = DictField(ListField(DictField(ReferenceField(Doc, dbref=True)))) 
        
                   mapping7 = DictField(ListField(DictField(ReferenceField(Doc, dbref=False)))) 
        
                   mapping8 = DictField( 
        
                       ListField(DictField(ListField(ReferenceField(Doc, dbref=True)))) 
        
                   ) 
        
                   mapping9 = DictField( 
        
                       ListField(DictField(ListField(ReferenceField(Doc, dbref=False)))) 
        
                   ) 
        
               Doc.drop_collection() 
        
               Simple.drop_collection() 
        
               d = Doc(s="aa").save() 
        
               e = Simple() 
        
               e.mapping0["someint"] = e.mapping1["someint"] = d 
        
               e.mapping2["someint"] = e.mapping3["someint"] = [d] 
        
               e.mapping4["someint"] = e.mapping5["someint"] = {"d": d} 
        
               e.mapping6["someint"] = e.mapping7["someint"] = [{"d": d}] 
        
               e.mapping8["someint"] = e.mapping9["someint"] = [{"d": [d]}] 
        
               e.save() 
        
               s = Simple.objects.first() 
        
               assert isinstance(s.mapping0["someint"], Doc) 
        
               assert isinstance(s.mapping1["someint"], Doc) 
        
               assert isinstance(s.mapping2["someint"][0], Doc) 
        
               assert isinstance(s.mapping3["someint"][0], Doc) 
        
               assert isinstance(s.mapping4["someint"]["d"], Doc) 
        
               assert isinstance(s.mapping5["someint"]["d"], Doc) 
        
               assert isinstance(s.mapping6["someint"][0]["d"], Doc) 
        
               assert isinstance(s.mapping7["someint"][0]["d"], Doc) 
        
               assert isinstance(s.mapping8["someint"][0]["d"][0], Doc) 
        
               assert isinstance(s.mapping9["someint"][0]["d"][0], Doc)

NoahStapp added 3 commits June 23, 2025 16:58

WIP

2931801

INTPYTHON-617 - Improve DictField to_python performance

d4619bd

Cleanup

40aa9a5

neilsh reviewed Aug 18, 2025

View reviewed changes

NoahStapp added 2 commits August 19, 2025 13:27

Refresh CI

e377528

Refresh CI

2586d45

NoahStapp marked this pull request as ready for review August 19, 2025 20:34

to_python should return falsy values

6cf60d2

INTPYTHON-617 - Improve DictField to_python performance #2888

Are you sure you want to change the base?

INTPYTHON-617 - Improve DictField to_python performance #2888

Uh oh!

Conversation

NoahStapp commented Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

terencehonles commented Aug 14, 2025

Uh oh!

NoahStapp commented Aug 15, 2025

Uh oh!

neilsh Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

NoahStapp Aug 19, 2025

Choose a reason for hiding this comment

Uh oh!

neilsh Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

neilsh Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NoahStapp Sep 24, 2025

Choose a reason for hiding this comment

Uh oh!

neilsh Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NoahStapp Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

terencehonles commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NoahStapp commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Master

This PR

Uh oh!

bagerard commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NoahStapp commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ShaneHarvey commented Sep 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

NoahStapp commented Jun 24, 2025 •

edited

Loading

neilsh Sep 22, 2025 •

edited

Loading

neilsh Sep 25, 2025 •

edited

Loading

terencehonles commented Aug 19, 2025 •

edited

Loading

NoahStapp commented Aug 19, 2025 •

edited

Loading

bagerard commented Aug 19, 2025 •

edited

Loading

NoahStapp commented Aug 19, 2025 •

edited

Loading