
PYTHON-4947 - GridFS spec: Add performant 'delete revisions by filena… #2218

Merged
7 commits merged on Mar 31, 2025

2 changes: 2 additions & 0 deletions doc/changelog.rst
@@ -9,6 +9,8 @@ PyMongo 4.12 brings a number of changes including:
- Support for configuring DEK cache lifetime via the ``key_expiration_ms`` argument to
:class:`~pymongo.encryption_options.AutoEncryptionOpts`.
- Support for ``$lookup`` in CSFLE and QE on MongoDB 8.1+.
- Added :meth:`gridfs.asynchronous.grid_file.AsyncGridFSBucket.delete_by_name` and :meth:`gridfs.grid_file.GridFSBucket.delete_by_name`
for more performant deletion of a file with multiple revisions.
- AsyncMongoClient no longer performs DNS resolution for "mongodb+srv://" connection strings on creation.
To avoid blocking the asyncio loop, the resolution is now deferred until the client is first connected.
- Added index hinting support to the
29 changes: 29 additions & 0 deletions gridfs/asynchronous/grid_file.py
@@ -834,6 +834,35 @@ async def delete(self, file_id: Any, session: Optional[AsyncClientSession] = Non
if not res.deleted_count:
raise NoFile("no file could be deleted because none matched %s" % file_id)

@_csot.apply
Member:
We need to add this helper to all the tests for sessions/transactions/CSOT. For example test_gridfs_bucket in test_session and test_gridfs_does_not_support_transactions.

Contributor Author:
Should this be a separate ticket?

Member:
Should be in this ticket.

Contributor Author:
Shouldn't the existing tests missing this decorator have it added in a separate ticket? Those changes seem unrelated to this addition.

Member (@ShaneHarvey, Mar 27, 2025):
It's completely related. We're adding a new API, so we have to test all the features. Although I'm not sure what you mean by "tests missing this decorator".

Member:
I'm saying we need to add the delete_by_name helper to the tests for sessions/transactions/CSOT...

async def delete_by_name(
self, filename: str, session: Optional[AsyncClientSession] = None
) -> None:
"""Given a filename, delete this stored file's files collection document(s)
and associated chunks from a GridFS bucket.

For example::

my_db = AsyncMongoClient().test
fs = AsyncGridFSBucket(my_db)
await fs.upload_from_stream("test_file", "data I want to store!")
await fs.delete_by_name("test_file")

Raises :exc:`~gridfs.errors.NoFile` if no file with the given filename exists.

:param filename: The name of the file to be deleted.
:param session: a :class:`~pymongo.client_session.AsyncClientSession`

.. versionadded:: 4.12
"""
_disallow_transactions(session)
files = self._files.find({"filename": filename}, {"_id": 1}, session=session)
file_ids = [file["_id"] async for file in files]
res = await self._files.delete_many({"_id": {"$in": file_ids}}, session=session)
Member:
Does the spec say anything about what should happen if file_ids is so large/numerous it overflows maxBsonObjectSize? Assuming all ids are ObjectIds, this will happen when a single filename has around 850,000 revisions:

>>> from bson import ObjectId, encode
>>> from pymongo.common import MAX_BSON_SIZE
>>> len(encode({"_id": {"$in": [ObjectId()]*850000}}))
16888915
>>> MAX_BSON_SIZE
16777216

Contributor Author:
It makes no mention of that edge case. Do we have a standard pattern for overflow issues in other APIs?
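
A minimal sketch of one possible mitigation, assuming batching is acceptable here; neither the GridFS spec nor this PR prescribes it, and the batch size is an arbitrary illustration:

# Hypothetical batched variant of delete_by_name; the 100,000 batch size is
# an illustrative assumption, not from the GridFS spec or this PR.
BATCH = 100_000

async def _delete_by_name_batched(self, filename, session=None):
    cursor = self._files.find({"filename": filename}, {"_id": 1}, session=session)
    file_ids = [file["_id"] async for file in cursor]
    deleted = 0
    for i in range(0, len(file_ids), BATCH):
        batch = file_ids[i : i + BATCH]
        # Each filter now holds at most BATCH ids, keeping the command
        # document well under the 16 MB BSON limit.
        res = await self._files.delete_many({"_id": {"$in": batch}}, session=session)
        await self._chunks.delete_many({"files_id": {"$in": batch}}, session=session)
        deleted += res.deleted_count
    if not deleted:
        raise NoFile(f"no file could be deleted because none matched filename {filename!r}")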

await self._chunks.delete_many({"files_id": {"$in": file_ids}}, session=session)
Member:
We could improve this by using client.bulk_write on 8.0+ servers to combine both deletes into one command, but that's a spec issue.
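
A minimal sketch of that idea, assuming the client-level bulk write API available with PyMongo 4.9+ against MongoDB 8.0+ servers; the helper name and namespace strings are illustrative, not part of this PR:

# Hypothetical sketch: combine both deletes into one client-level bulk write
# command (requires MongoDB 8.0+). Names and namespaces are illustrative.
from pymongo import AsyncMongoClient
from pymongo.operations import DeleteMany

async def delete_revisions_bulk(client: AsyncMongoClient, file_ids: list) -> int:
    result = await client.bulk_write(
        [
            DeleteMany({"_id": {"$in": file_ids}}, namespace="test.fs.files"),
            DeleteMany({"files_id": {"$in": file_ids}}, namespace="test.fs.chunks"),
        ]
    )
    # ClientBulkWriteResult reports the total number of deleted documents.
    return result.deleted_count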

if not res.deleted_count:
raise NoFile(f"no file could be deleted because none matched filename {filename!r}")

def find(self, *args: Any, **kwargs: Any) -> AsyncGridOutCursor:
"""Find and return the files collection documents that match ``filter``

27 changes: 27 additions & 0 deletions gridfs/synchronous/grid_file.py
@@ -830,6 +830,33 @@ def delete(self, file_id: Any, session: Optional[ClientSession] = None) -> None:
if not res.deleted_count:
raise NoFile("no file could be deleted because none matched %s" % file_id)

@_csot.apply
def delete_by_name(self, filename: str, session: Optional[ClientSession] = None) -> None:
"""Given a filename, delete this stored file's files collection document(s)
and associated chunks from a GridFS bucket.

For example::

my_db = MongoClient().test
fs = GridFSBucket(my_db)
fs.upload_from_stream("test_file", "data I want to store!")
fs.delete_by_name("test_file")

Raises :exc:`~gridfs.errors.NoFile` if no file with the given filename exists.

:param filename: The name of the file to be deleted.
:param session: a :class:`~pymongo.client_session.ClientSession`

.. versionadded:: 4.12
"""
_disallow_transactions(session)
files = self._files.find({"filename": filename}, {"_id": 1}, session=session)
file_ids = [file["_id"] for file in files]
res = self._files.delete_many({"_id": {"$in": file_ids}}, session=session)
self._chunks.delete_many({"files_id": {"$in": file_ids}}, session=session)
if not res.deleted_count:
raise NoFile(f"no file could be deleted because none matched filename {filename!r}")

def find(self, *args: Any, **kwargs: Any) -> GridOutCursor:
"""Find and return the files collection documents that match ``filter``

11 changes: 11 additions & 0 deletions test/asynchronous/test_gridfs_bucket.py
@@ -115,6 +115,17 @@ async def test_multi_chunk_delete(self):
self.assertEqual(0, await self.db.fs.files.count_documents({}))
self.assertEqual(0, await self.db.fs.chunks.count_documents({}))

async def test_delete_by_name(self):
self.assertEqual(0, await self.db.fs.files.count_documents({}))
self.assertEqual(0, await self.db.fs.chunks.count_documents({}))
gfs = gridfs.AsyncGridFSBucket(self.db)
await gfs.upload_from_stream("test_filename", b"hello", chunk_size_bytes=1)
Member:
This test should upload multiple versions of test_filename and assert all are deleted.

Member:
Oh I see the spec test already does this so you can disregard.

self.assertEqual(1, await self.db.fs.files.count_documents({}))
self.assertEqual(5, await self.db.fs.chunks.count_documents({}))
await gfs.delete_by_name("test_filename")
self.assertEqual(0, await self.db.fs.files.count_documents({}))
self.assertEqual(0, await self.db.fs.chunks.count_documents({}))
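
For reference on the review discussion above, a multi-revision variant of this test might look roughly like the sketch below; it is hypothetical and not part of this PR, since the deleteByName.json spec test further down already exercises this case:

# Hypothetical multi-revision variant of test_delete_by_name (would live in
# the same test class); not part of this PR.
async def test_delete_by_name_multiple_revisions(self):
    gfs = gridfs.AsyncGridFSBucket(self.db)
    # Upload three revisions under the same filename.
    for revision in (b"rev0", b"rev1", b"rev2"):
        await gfs.upload_from_stream("test_filename", revision, chunk_size_bytes=1)
    self.assertEqual(3, await self.db.fs.files.count_documents({}))
    # A single delete_by_name call removes every revision and its chunks.
    await gfs.delete_by_name("test_filename")
    self.assertEqual(0, await self.db.fs.files.count_documents({}))
    self.assertEqual(0, await self.db.fs.chunks.count_documents({}))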

async def test_empty_file(self):
oid = await self.fs.upload_from_stream("test_filename", b"")
self.assertEqual(b"", await (await self.fs.open_download_stream(oid)).read())
4 changes: 2 additions & 2 deletions test/asynchronous/test_session.py
@@ -45,7 +45,7 @@

from bson import DBRef
from gridfs.asynchronous.grid_file import AsyncGridFS, AsyncGridFSBucket
from pymongo import ASCENDING, AsyncMongoClient, monitoring
from pymongo import ASCENDING, AsyncMongoClient, _csot, monitoring
from pymongo.asynchronous.command_cursor import AsyncCommandCursor
from pymongo.asynchronous.cursor import AsyncCursor
from pymongo.asynchronous.helpers import anext
@@ -543,7 +543,7 @@ async def find(session=None):
(bucket.rename, [1, "f2"], {}),
# Delete both files so _test_ops can run these operations twice.
(bucket.delete, [1], {}),
(bucket.delete, [2], {}),
(bucket.delete_by_name, ["f"], {}),
)

async def test_gridfsbucket_cursor(self):
3 changes: 2 additions & 1 deletion test/asynchronous/test_transactions.py
@@ -32,7 +32,7 @@

from bson import encode
from bson.raw_bson import RawBSONDocument
from pymongo import WriteConcern
from pymongo import WriteConcern, _csot
from pymongo.asynchronous import client_session
from pymongo.asynchronous.client_session import TransactionOptions
from pymongo.asynchronous.command_cursor import AsyncCommandCursor
@@ -295,6 +295,7 @@ async def gridfs_open_upload_stream(*args, **kwargs):
"new-name",
),
),
(bucket.delete_by_name, ("new-name",)),
]

async with client.start_session() as s, await s.start_transaction():
4 changes: 2 additions & 2 deletions test/asynchronous/unified_format.py
@@ -66,7 +66,7 @@
from bson import SON, json_util
from bson.codec_options import DEFAULT_CODEC_OPTIONS
from bson.objectid import ObjectId
from gridfs import AsyncGridFSBucket, GridOut
from gridfs import AsyncGridFSBucket, GridOut, NoFile
from pymongo import ASCENDING, AsyncMongoClient, CursorType, _csot
from pymongo.asynchronous.change_stream import AsyncChangeStream
from pymongo.asynchronous.client_session import AsyncClientSession, TransactionOptions, _TxnState
@@ -632,7 +632,7 @@ def process_error(self, exception, spec):
# Connection errors are considered client errors.
if isinstance(error, ConnectionFailure):
self.assertNotIsInstance(error, NotPrimaryError)
elif isinstance(error, (InvalidOperation, ConfigurationError, EncryptionError)):
elif isinstance(error, (InvalidOperation, ConfigurationError, EncryptionError, NoFile)):
pass
else:
self.assertNotIsInstance(error, PyMongoError)
230 changes: 230 additions & 0 deletions test/gridfs/deleteByName.json
@@ -0,0 +1,230 @@
{
"description": "gridfs-deleteByName",
"schemaVersion": "1.0",
"createEntities": [
{
"client": {
"id": "client0"
}
},
{
"database": {
"id": "database0",
"client": "client0",
"databaseName": "gridfs-tests"
}
},
{
"bucket": {
"id": "bucket0",
"database": "database0"
}
},
{
"collection": {
"id": "bucket0_files_collection",
"database": "database0",
"collectionName": "fs.files"
}
},
{
"collection": {
"id": "bucket0_chunks_collection",
"database": "database0",
"collectionName": "fs.chunks"
}
}
],
"initialData": [
{
"collectionName": "fs.files",
"databaseName": "gridfs-tests",
"documents": [
{
"_id": {
"$oid": "000000000000000000000001"
},
"length": 0,
"chunkSize": 4,
"uploadDate": {
"$date": "1970-01-01T00:00:00.000Z"
},
"filename": "filename",
"metadata": {}
},
{
"_id": {
"$oid": "000000000000000000000002"
},
"length": 0,
"chunkSize": 4,
"uploadDate": {
"$date": "1970-01-01T00:00:00.000Z"
},
"filename": "filename",
"metadata": {}
},
{
"_id": {
"$oid": "000000000000000000000003"
},
"length": 2,
"chunkSize": 4,
"uploadDate": {
"$date": "1970-01-01T00:00:00.000Z"
},
"filename": "filename",
"metadata": {}
},
{
"_id": {
"$oid": "000000000000000000000004"
},
"length": 8,
"chunkSize": 4,
"uploadDate": {
"$date": "1970-01-01T00:00:00.000Z"
},
"filename": "otherfilename",
"metadata": {}
}
]
},
{
"collectionName": "fs.chunks",
"databaseName": "gridfs-tests",
"documents": [
{
"_id": {
"$oid": "000000000000000000000001"
},
"files_id": {
"$oid": "000000000000000000000002"
},
"n": 0,
"data": {
"$binary": {
"base64": "",
"subType": "00"
}
}
},
{
"_id": {
"$oid": "000000000000000000000002"
},
"files_id": {
"$oid": "000000000000000000000003"
},
"n": 0,
"data": {
"$binary": {
"base64": "",
"subType": "00"
}
}
},
{
"_id": {
"$oid": "000000000000000000000003"
},
"files_id": {
"$oid": "000000000000000000000003"
},
"n": 0,
"data": {
"$binary": {
"base64": "",
"subType": "00"
}
}
},
{
"_id": {
"$oid": "000000000000000000000004"
},
"files_id": {
"$oid": "000000000000000000000004"
},
"n": 0,
"data": {
"$binary": {
"base64": "",
"subType": "00"
}
}
}
]
}
],
"tests": [
{
"description": "delete when multiple revisions of the file exist",
"operations": [
{
"name": "deleteByName",
"object": "bucket0",
"arguments": {
"filename": "filename"
}
}
],
"outcome": [
{
"collectionName": "fs.files",
"databaseName": "gridfs-tests",
"documents": [
{
"_id": {
"$oid": "000000000000000000000004"
},
"length": 8,
"chunkSize": 4,
"uploadDate": {
"$date": "1970-01-01T00:00:00.000Z"
},
"filename": "otherfilename",
"metadata": {}
}
]
},
{
"collectionName": "fs.chunks",
"databaseName": "gridfs-tests",
"documents": [
{
"_id": {
"$oid": "000000000000000000000004"
},
"files_id": {
"$oid": "000000000000000000000004"
},
"n": 0,
"data": {
"$binary": {
"base64": "",
"subType": "00"
}
}
}
]
}
]
},
{
"description": "delete when file name does not exist",
"operations": [
{
"name": "deleteByName",
"object": "bucket0",
"arguments": {
"filename": "missing-file"
},
"expectError": {
"isClientError": true
}
}
]
}
]
}
11 changes: 11 additions & 0 deletions test/test_gridfs_bucket.py
@@ -115,6 +115,17 @@ def test_multi_chunk_delete(self):
self.assertEqual(0, self.db.fs.files.count_documents({}))
self.assertEqual(0, self.db.fs.chunks.count_documents({}))

def test_delete_by_name(self):
self.assertEqual(0, self.db.fs.files.count_documents({}))
self.assertEqual(0, self.db.fs.chunks.count_documents({}))
gfs = gridfs.GridFSBucket(self.db)
gfs.upload_from_stream("test_filename", b"hello", chunk_size_bytes=1)
self.assertEqual(1, self.db.fs.files.count_documents({}))
self.assertEqual(5, self.db.fs.chunks.count_documents({}))
gfs.delete_by_name("test_filename")
self.assertEqual(0, self.db.fs.files.count_documents({}))
self.assertEqual(0, self.db.fs.chunks.count_documents({}))

def test_empty_file(self):
oid = self.fs.upload_from_stream("test_filename", b"")
self.assertEqual(b"", (self.fs.open_download_stream(oid)).read())