Skip to content
Merged
5 changes: 4 additions & 1 deletion pymongo/asynchronous/client_bulk.py
Original file line number Diff line number Diff line change
Expand Up @@ -133,7 +133,10 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None:
validate_is_document_type("document", document)
# Generate ObjectId client side.
if not (isinstance(document, RawBSONDocument) or "_id" in document):
document["_id"] = ObjectId()
new_document = {"_id": ObjectId()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is also incomplete because it does not put the _id field first if the user supplies it. For example when inserting {"a": 1, "_id": 2}. We should add tests for this case for insert/bulk/clientBulk as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've decided to make re-ordering user-supplied _id fields optional due to the complexity of doing so across different driver implementations. We can do it in PyMongo if we want, but it won't be standard across all drivers.

new_document.update(document)
document.clear()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about it the more I think it's problematic to call clear() and update() here. Those methods could have unintentional side effects aside from the perf problems. For example consider a user passing a custom mapping class which overrides clear()/update().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ChainMap makes sense, agreed. Explicitly modifying a user-supplied mapping will always carry some risks unfortunately, using the least amount of APIs as possible seems like a safer bet here.

Copy link
Contributor Author

@NoahStapp NoahStapp Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some more thinking: is the added complexity and changing of the type to ChainMap here worth it over this much simpler approach:

  if "_id" in document:
      document = {"_id": document["_id"]} | document
  else:
      id = ObjectId()
      document["_id"] = id
      document = {"_id": id} | document

If the original document already had an _id field, this doesn't modify it, which is consistent with our other insert code paths. If we generated an _id field, we still add it to the original document, but we don't worry about the order of the original document.

This also resolves the doctest error we're seeing due to using ChainMap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler yes, but it's not performant:

$ python -m timeit -s 'd={str(k):k for k in range(10000)};from collections import ChainMap' 'ChainMap(d,{"_id":1})'
2000000 loops, best of 5: 163 nsec per loop
$ python -m timeit -s 'd={str(k):k for k in range(10000)}' '{"_id":1}|d'
2000 loops, best of 5: 143 usec per loop

Copy link
Member

@ShaneHarvey ShaneHarvey Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the above, the slow approach adds 2 milliseconds per 100,000 fields copied on my machine. That's significant enough to warrant the complexity.

For the doc test, we probably want to unwrap the ChainMap (via .maps) before exposing it back to the user in bulk errors and possibly even command monitoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah unwrapping it back into the original map makes sense.

document.update(new_document)
Copy link
Member

@ShaneHarvey ShaneHarvey Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on the perf implications of this vs ChainMap? Yet another way is to encode the documents to RawBSONDocuments thus relying on the bson layer to reorder the id field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ChainMap is cleaner, encoding to RawBSONDocuments might have additional performance costs.

cmd = {"insert": -1, "document": document}
self.ops.append(("insert", cmd))
self.namespaces.append(namespace)
Expand Down
5 changes: 4 additions & 1 deletion pymongo/synchronous/client_bulk.py
Original file line number Diff line number Diff line change
Expand Up @@ -133,7 +133,10 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None:
validate_is_document_type("document", document)
# Generate ObjectId client side.
if not (isinstance(document, RawBSONDocument) or "_id" in document):
document["_id"] = ObjectId()
new_document = {"_id": ObjectId()}
new_document.update(document)
document.clear()
document.update(new_document)
cmd = {"insert": -1, "document": document}
self.ops.append(("insert", cmd))
self.namespaces.append(namespace)
Expand Down
52 changes: 52 additions & 0 deletions test/mockupdb/test_id_ordering.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
from __future__ import annotations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the boilerplate License comment.


from test import PyMongoTestCase

import pytest

from pymongo import InsertOne

try:
from mockupdb import MockupDB, OpMsg, go, going

_HAVE_MOCKUPDB = True
except ImportError:
_HAVE_MOCKUPDB = False


from bson.objectid import ObjectId

pytestmark = pytest.mark.mockupdb


class TestIdOrdering(PyMongoTestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a link to the crud spec that describes this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the spec is merged, yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the spec merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. added!

def test_id_ordering(self):
server = MockupDB()
server.autoresponds(
"hello",
isWritablePrimary=True,
msg="isdbgrid",
minWireVersion=0,
maxWireVersion=25,
helloOk=True,
serviceId=ObjectId(),
)
server.run()
self.addCleanup(server.stop)

client = self.simple_client(server.uri, loadBalanced=True)
collection = client.db.coll
with going(collection.insert_one, {"x": 1}):
request = server.receives()
self.assertEqual("_id", next(iter(request["documents"][0])))
request.reply({"ok": 1})

with going(collection.bulk_write, [InsertOne({"x1": 1})]):
request = server.receives()
self.assertEqual("_id", next(iter(request["documents"][0])))
request.reply({"ok": 1})

with going(client.bulk_write, [InsertOne(namespace="db.coll", document={"x2": 1})]):
request = server.receives()
self.assertEqual("_id", next(iter(request["ops"][0]["document"])))
request.reply({"ok": 1})
Loading