-
Notifications
You must be signed in to change notification settings - Fork 1.2k
INTPYTHON-617 - Improve DictField to_python performance #2888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the |
What do you by mean "still apply"? The benchmarking script was run against the current master branch as of June with and then without the change to |
mongoengine/fields.py
Outdated
return ( | ||
{k: to_python(v) for k, v in value.items()} | ||
if to_python and value | ||
else value or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would convert a {}
to None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class MyModel(db.Document):
data_dict_1 = db.DictField(required=False)
m = MyModel()
m.data_dict_1 = {}
m.save()
model = MyModel.objects.first()
print(model.data_dict_1)
Outputs {}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some example test cases showing where the behavior differs:
from mongoengine import Document, DictField, IntField, ListField, MapField
from tests.utils import MongoDBTestCase
class TestDictFieldRegressions(MongoDBTestCase):
def test_nested_empty_dict_in_list_of_dictfield_roundtrips_as_dict_regression(self):
"""Regression: master => {}, PR => None.
Empty dict appended into ListField(DictField()) should roundtrip as {}.
"""
class Model(Document):
items = ListField(DictField())
Model.drop_collection()
doc = Model(items=[{"a": 1}]).save()
Model.objects(id=doc.id).update(push__items={})
doc.reload()
assert doc.items[-1] == {}
assert isinstance(doc.items[-1], dict)
def test_nested_empty_dict_in_list_of_mapfield_roundtrips_as_dict_regression(self):
"""Regression: master => {}, PR => None.
Empty dict appended into ListField(MapField(IntField())) should roundtrip as {}.
"""
class Model(Document):
items = ListField(MapField(IntField()))
Model.drop_collection()
doc = Model(items=[{"a": 1}]).save()
Model.objects(id=doc.id).update(push__items={})
doc.reload()
assert doc.items[-1] == {}
assert isinstance(doc.items[-1], dict)
def test_top_level_dictfield_default_preserves_empty_dict(self):
"""No regression: master => {}, PR => {}."""
class Post(Document):
info = DictField()
Post.drop_collection()
Post(info={}).save()
p = Post.objects.first()
assert p.info == {}
assert isinstance(p.info, dict)
def test_top_level_dictfield_null_true_preserves_empty_dict_regression(self):
"""Regression: master => {}, PR => None.
With null=True, empty dict should still roundtrip as {} at top level.
"""
class PostNull(Document):
info = DictField(null=True)
PostNull.drop_collection()
PostNull(info={}).save()
pn = PostNull.objects.first()
assert pn.info == {}
assert isinstance(pn.info, dict)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not as elegant, but I think this should handle the failed cases while preserving most of the performance improvement:
def to_python(self, value):
# Fast-path for mapping of homogenous child field conversions,
# preserving behavior (no {} -> None) while avoiding ComplexBaseField's
# generic recursion when possible.
if not isinstance(value, dict) or self.field is None:
return super().to_python(value)
elif not value: # Keep empty dicts intact
return value
elif type(self.field).to_python is BaseField.to_python: # identity
return value
else:
child_to_python = self.field.to_python
return {k: child_to_python(v) for k, v in value.items()}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this can be simplified:
def to_python(self, value):
if value is None:
return None
to_python = getattr(self.field, "to_python", None)
if not to_python or not value:
return value
return {k: to_python(v) for k, v in value.items()}
This passes all of the current test cases plus the ones you've added above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NoahStapp if a DictField receives an invalid/non-dict value (e.g. a list instead of a dict), I think the one I proposed (and orig) would give a mongoengine ValidationError, while the one you proposed would give a standard error (AttributeError?)
getattr also has a bit of a performance hit, although I haven't benchmarked it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using my proposed code, I get a ValidationError when passing an invalid value to a DictField
:
class Model(db.Document):
field = db.DictField()
m = Model(field=[])
m.save()
------
mongoengine.errors.ValidationError: ValidationError (Model:None) (Only dictionaries may be used in a DictField: ['field'])
I wrote what I meant right after
It's not obvious what commit you ran your benchmarking from, so I was just confirming. The rest of my comment applies. Have you profiled this to see what code is the bottleneck that you're removing? |
I was curious and had a quick look yesterday but I'm suspecting that it's only working with simple dicts (with native types that don't need any parsing), we must double check that it's also working with nested reference fields or embedded docs or fields that would require conversion when loaded from db. If you look at the basecomplexfield.to_python it has some code related with that that the current proposal is just skipping |
Is passing the existing test suite sufficient to say that this change doesn't introduce any breaking changes? Or is there a specific example or test I could add to verify correct behavior beyond what already exists in the suite? |
Is this the test that proves embedded reference fields still work as expected? mongoengine/tests/fields/test_dict_field.py Lines 345 to 390 in e844138
|
Improves performance of large documents with nested
DictFields
by nearly 20x, addressing #1230.Modified benchmark script pulled from https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-compared-to-native-pymongo-usage/:
Before:
After: