feat: Update POST PATCH PUT /collections allowing collections update #506
Walkthrough
Collection endpoints now accept form parameters and optional Parquet uploads; new ParquetReader/ParquetDataExtractor, DocumentsUpdater, and ChunkHasher components implement Parquet-driven document extraction, hashing, change detection, upserts, and rollback. The parser, document manager, exceptions, tests, cassettes, docs, and optional dependencies are updated; the version is bumped to 0.3.0.
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Client
participant Endpoint as Collections Endpoint
participant Reader as ParquetReader
participant Extractor as ParquetDataExtractor
participant Updater as DocumentsUpdater
participant Vector as Vector Store
Client->>Endpoint: POST/PATCH/PUT /v1/collections (multipart form with optional parquet)
Endpoint->>Reader: read_and_validate(parquet_file)
alt invalid file (size/type/format)
Reader-->>Endpoint: raise (413/400/422)
Endpoint-->>Client: 4xx error
else valid file
Reader-->>Endpoint: (table, available_columns, has_chunk_index)
Endpoint->>Extractor: get_unique_document_names(table)
loop per document
Endpoint->>Extractor: filter_document_table(...) / extract_chunk_ids(...)
Endpoint->>Updater: process_document(session,user,collection,doc_name,table_slice,...)
Updater->>Vector: fetch existing chunks / upsert new chunks
Vector-->>Updater: existing chunks / upsert result
Updater-->>Endpoint: update stats (added/updated/deleted/chunks)
end
Endpoint-->>Client: 200/201 { id, details: { added, updated, deleted, total_chunks_processed } }
end
```
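For orientation, here is a minimal client-side sketch of the happy path in the diagram above, assuming an httpx client; the query parameter names (`name`, `visibility`) and the multipart field name (`file`) are taken from the integration tests discussed below and may not match the final API exactly.

```python
import httpx

def create_collection_with_parquet(base_url: str, token: str, parquet_path: str) -> dict:
    # Hedged sketch: one POST creates the collection and seeds it from a Parquet file.
    # Parameter and field names are assumptions based on the tests in this PR.
    with open(parquet_path, "rb") as fh:
        response = httpx.post(
            f"{base_url}/v1/collections",
            params={"name": "my-collection", "visibility": "private"},
            files={"file": ("chunks.parquet", fh, "application/octet-stream")},
            headers={"Authorization": f"Bearer {token}"},
        )
    response.raise_for_status()
    # Per the diagram, the body should look like {"id": ..., "details": {"added": ..., ...}}.
    return response.json()
```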
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (3)
📒 Files selected for processing (35)
🧰 Additional context used
🧬 Code graph analysis (21)
api/tests/integ/test_documents.py (1)
api/helpers/_documentmanager.py (3)
api/helpers/files/parquet/__init__.py (2)
api/tests/integ/test_search.py (1)
api/tests/unit/test_helpers/test_documents_updater.py (2)
api/helpers/data/hasher/chunk_hasher.py (1)
api/helpers/documents/documents_updater.py (5)
api/helpers/files/parquet/reader.py (3)
api/tests/integ/test_files.py (2)
api/tests/integ/test_admin.py (1)
api/tests/unit/test_helpers/test_parsermanager.py (3)
api/helpers/_parsermanager.py (2)
api/helpers/_accesscontroller.py (2)
api/endpoints/collections.py (2)
api/helpers/data/hasher/__init__.py (1)
api/helpers/files/parquet/extractor.py (1)
api/tests/integ/test_chunks.py (1)
api/tests/integ/test_collections.py (3)
api/helpers/documents/__init__.py (1)
api/tests/integ/test_chat.py (1)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
🪛 markdownlint-cli2 (0.18.1)
docs/docs/functionalities/document-management/collections.md
191-191: Files should end with a single newline character (MD047, single-trailing-newline)
🔇 Additional comments (30)
Actionable comments posted: 12
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (3)
api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (31)
.github/badges/coverage-unit.json (1 hunks), .github/badges/coverage.json (1 hunks), api/endpoints/collections.py (3 hunks), api/helpers/_accesscontroller.py (1 hunks), api/helpers/_documentmanager.py (2 hunks), api/helpers/_parsermanager.py (3 hunks), api/helpers/data/hasher/__init__.py (1 hunks), api/helpers/data/hasher/chunk_hasher.py (1 hunks), api/helpers/documents/__init__.py (1 hunks), api/helpers/documents/documents_updater.py (1 hunks), api/helpers/files/__init__.py (1 hunks), api/helpers/files/parquet/__init__.py (1 hunks), api/helpers/files/parquet/extractor.py (1 hunks), api/helpers/files/parquet/reader.py (1 hunks), api/schemas/collections.py (2 hunks), api/schemas/core/documents.py (1 hunks), api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks), api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks), api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks), api/tests/integ/conftest.py (1 hunks), api/tests/integ/test_admin.py (1 hunks), api/tests/integ/test_chat.py (1 hunks), api/tests/integ/test_chunks.py (1 hunks), api/tests/integ/test_collections.py (14 hunks), api/tests/integ/test_documents.py (1 hunks), api/tests/integ/test_files.py (1 hunks), api/tests/integ/test_search.py (2 hunks), api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks), api/tests/unit/test_helpers/test_parsermanager.py (2 hunks), api/utils/exceptions.py (2 hunks), pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (18)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
client(335-361)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
Chunk(6-10)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
DocumentsUpdater(19-303)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
client(335-361)api/schemas/collections.py (1)
CollectionVisibility(9-11)api/tests/integ/test_documents.py (1)
collection(13-21)
api/helpers/files/parquet/reader.py (1)
api/utils/exceptions.py (1)
InvalidPARQUETFormatException(143-145)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
CollectionVisibility(9-11)api/schemas/admin/roles.py (1)
PermissionType(10-14)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
ParquetDataExtractor(13-167)api/helpers/files/parquet/reader.py (1)
ParquetReader(15-116)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
client(335-361)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
FileType(19-25)api/utils/exceptions.py (1)
UnsupportedFileTypeException(163-165)
api/helpers/documents/documents_updater.py (5)
api/helpers/data/hasher/chunk_hasher.py (4)
ChunkHasher(10-100)concatenate_chunk_values(19-47)compute_chunk_hash(50-67)compute_existing_chunk_hash(70-100)api/helpers/files/parquet/extractor.py (5)
ParquetDataExtractor(13-167)extract_chunk_ids(54-81)extract_metadata_columns(109-120)build_chunk_metadata(123-152)get_row_values(155-167)api/schemas/chunks.py (1)
Chunk(6-10)api/utils/exceptions.py (2)
CollectionNotFoundException(78-80)VectorizationFailedException(190-192)api/helpers/_documentmanager.py (1)
get_chunks(303-332)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
ChunkHasher(10-100)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (3)
ChunkHasher(10-100)concatenate_chunk_values(19-47)compute_chunk_hash(50-67)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
CollectionVisibility(9-11)api/tests/integ/conftest.py (1)
client(335-361)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
FileType(19-25)api/helpers/_parsermanager.py (2)
ParserManager(16-214)parse_file(109-119)api/utils/exceptions.py (1)
UnsupportedFileTypeException(163-165)api/helpers/_documentmanager.py (1)
parse_file(335-346)
api/helpers/_documentmanager.py (5)
api/helpers/documents/documents_updater.py (2)
DocumentsUpdater(19-303)process_document(42-122)api/helpers/files/parquet/extractor.py (3)
ParquetDataExtractor(13-167)get_unique_document_names(22-35)filter_document_table(38-51)api/helpers/files/parquet/reader.py (4)
ParquetReader(15-116)read_and_validate(27-60)sort_table(78-95)identify_metadata_columns(98-116)api/utils/exceptions.py (1)
CollectionNotFoundException(78-80)api/clients/vector_store/_basevectorstoreclient.py (1)
delete_document(55-56)
api/endpoints/collections.py (4)
api/schemas/core/documents.py (1)
FileType(19-25)api/utils/exceptions.py (3)
CollectionNotFoundException(78-80)FileSizeLimitExceededException(124-129)UnsupportedFileTypeException(163-165)api/helpers/_documentmanager.py (4)
create_collection(74-83)delete_collection(86-101)update_collection_from_parquet(496-598)update_collection(104-126)api/helpers/_parsermanager.py (1)
_detect_file_type(60-91)
🔇 Additional comments (45)
api/helpers/files/__init__.py (1)
1-1: LGTM!The module docstring provides clear documentation for the package purpose.
api/tests/integ/conftest.py (3)
343-343: LGTM! The addition of the `put_without_permissions` test helper follows the established pattern for other HTTP methods and enables comprehensive testing of PUT endpoints.
350-350: LGTM! The addition of the `put_with_permissions` test helper follows the established pattern and supports testing PUT operations with admin permissions.
357-357: LGTM! The addition of the `put_master` test helper follows the established pattern and enables testing PUT operations with master key authentication.
api/utils/exceptions.py (1)
143-146: LGTM! The `InvalidPARQUETFormatException` correctly follows the established pattern for format validation exceptions (HTTP 422) and aligns well with the existing `InvalidJSONFormatException`.
api/schemas/collections.py (1)
1-14: LGTM! The introduction of `CollectionName` as a validated type alias with `strip_whitespace=True` and `min_length=1` provides consistent validation for collection names across the API. This is a clean, type-safe approach that replaces the removed request models and ensures that collection names are always properly validated.
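A minimal sketch of what such an alias looks like in Pydantic v2, assuming the names shown in the review; the actual module layout may differ:

```python
from typing import Annotated

from pydantic import StringConstraints, TypeAdapter

# Validated alias: whitespace is stripped and empty names are rejected.
CollectionName = Annotated[str, StringConstraints(strip_whitespace=True, min_length=1)]

# "  my-collection  " validates to "my-collection"; "   " raises a ValidationError.
print(TypeAdapter(CollectionName).validate_python("  my-collection  "))
```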
46-47: Remove the incorrect assumption about PyArrow version ranges.
PyArrow has progressed beyond the 10.x-15.x range to 22.0.0 (released October 24, 2025). The pinned version pyarrow==21.0.0 (released July 17, 2025) exists and is a stable release, though not the absolute latest. xxhash==3.6.0 exists and is the latest stable version. Both versions are valid and appropriate for production use.
Likely an incorrect or invalid review comment.
api/helpers/_accesscontroller.py (1)
278-281: No issues found. The query parameter handling for None values is correct and intentional. The access controller correctly handles the case when `visibility` is not provided in query parameters. When `request.query_params.get("visibility")` returns `None`, the equality check `None == CollectionVisibility.PUBLIC.value` simply evaluates to `False`, and the permission exception is not raised—which is the intended behavior. Permission validation only applies when visibility is explicitly set to `"public"`.
The endpoint parameters with defaults (`CollectionVisibility.PRIVATE` for create, `None` for update) operate separately at the FastAPI layer and are applied after the access controller check completes successfully.
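A small illustrative sketch of that behaviour; the permission flag and exception here are placeholders, not the project's actual identifiers:

```python
def check_public_visibility_permission(query_params: dict, can_create_public: bool) -> None:
    # Only an explicit visibility=public triggers the permission check; a missing
    # parameter yields None, the comparison is False, and no exception is raised.
    if query_params.get("visibility") == "public" and not can_create_public:
        raise PermissionError("creating public collections requires elevated permissions")

check_public_visibility_permission({}, can_create_public=False)            # no error
check_public_visibility_permission({"visibility": "private"}, False)       # no error
# check_public_visibility_permission({"visibility": "public"}, False)      # would raise
```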
1-170: LGTM - Test cassette structure is correct.This test cassette file follows the standard VCR format for recording HTTP interactions during integration tests. The use of dummy tokens ("Bearer dummy_token_for_test") is appropriate for test fixtures, and the recorded embeddings API responses will enable deterministic testing of the Parquet collection patch functionality.
api/schemas/core/documents.py (1)
21-21: LGTM - Parquet file type support added. The addition of `PARQUET = "parquet"` to the `FileType` enum follows the existing pattern and enables the Parquet file handling functionality described in the PR objectives.
api/tests/integ/test_chunks.py (1)
17-17: LGTM - Correct enum value usage. The change properly uses `CollectionVisibility.PRIVATE.value` to pass the string value `"private"` as a query parameter, which is the correct approach for query parameter serialization.
16-16: LGTM - Consistent with test suite pattern. The collection creation correctly uses query parameters with the enum value, matching the pattern established across other test files in this PR.
api/tests/integ/test_search.py (1)
17-17: LGTM - Default visibility behavior.Both collection creation calls now use query parameters and rely on the server's default visibility setting. This is acceptable for these test scenarios where the visibility isn't the focus of the test.
Also applies to: 103-103
api/tests/integ/test_files.py (1)
15-15: LGTM - Proper enum value usage for both visibility types. Both collection creations correctly use the enum's `.value` property to pass the appropriate visibility string values as query parameters. This properly tests both PRIVATE and PUBLIC collection scenarios.
Also applies to: 22-22
api/tests/integ/test_chat.py (1)
29-29: Default collection visibility is PRIVATE—verification complete. The endpoint defines `visibility: CollectionVisibility = Query(CollectionVisibility.PRIVATE, description="Visibility for the collection")`, confirming that omitting the visibility parameter in the test defaults to a private collection, which is appropriate for isolated test scenarios.
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
1-68: LGTM! Well-structured test coverage for ChunkHasher. The test suite covers essential hash computation behaviors: determinism, content sensitivity, metadata sensitivity, and concatenation consistency. The tests validate both the individual methods and their integration.
api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1)
1-86: Test cassette file looks correct.This VCR cassette records embeddings API interactions for the parquet collection test. The format is standard and the recorded interactions align with the test expectations.
api/tests/unit/test_helpers/test_parsermanager.py (3)
437-437: LGTM! Extension map updated for PARQUET support. The addition of the `.parquet` mapping to `FileType.PARQUET` aligns with the new file type support introduced in this PR.
451-453: LGTM! Content types check updated for PARQUET. The test now correctly validates that `FileType.PARQUET` is included in the valid content types mapping.
455-466: LGTM! Parquet rejection test validates the expected behavior. The test correctly verifies that parquet files are explicitly rejected by `parse_file` with an `UnsupportedFileTypeException`, as parquet files require special handling outside the document parsing flow.
api/helpers/files/parquet/__init__.py (1)
1-6: LGTM! Clean package initialization. The module clearly defines its public API by re-exporting `ParquetReader` and `ParquetDataExtractor`, making them easily accessible to consumers.
11-13: LGTM! Asset directory setup is correct. The path setup for test assets enables file-based testing with parquet files.
19-21: LGTM! Consistent migration to params-based requests. The tests now consistently use `params` for query parameters instead of JSON bodies, aligning with the updated API contract for collection endpoints.
Also applies to: 40-42, 54-56, 71-73, 95-97, 100-102, 106-108, 125-127, 137-139, 148-150, 159-161, 177-179, 192-194, 210-212, 215-217, 229-231
254-270: LGTM! Well-structured parquet creation test. The test validates the new collection creation flow with an initial parquet file upload, checking for the expected response structure (`id` and `details` fields).
271-294: LGTM! PUT endpoint test validates force update flow. The test correctly validates the PUT endpoint for force-updating a collection with a parquet file, checking for `updated` and `details` in the response.
295-371: Excellent comprehensive test for document change detection! This test provides thorough coverage of the parquet update workflow:
- First upload with metadata update
- Second upload with identical file (validates no spurious changes detected)
- Third upload with updated file (validates correct change detection)
The assertions on `added_documents`, `updated_documents`, `deleted_documents`, and `total_chunks_processed` ensure the change detection logic works correctly.
372-384: LGTM! Validation test for malformed parquet files. The test correctly validates that parquet files missing required columns are rejected with a 422 status code. Test asset files verified to exist in the repository.
61-62: PATCH status code verification complete. All three test assertions (lines 61-62, 133, 245) correctly expect `status_code == 200`, which aligns with the PATCH endpoint implementation in api/endpoints/collections.py (line 148 decorator and line 207 return). The status code change from 204 to 200 has been properly implemented and consistently tested.
api/helpers/data/hasher/chunk_hasher.py (1)
49-67: LGTM! Hash computation and existing chunk handling are well-designed. The use of xxHash64 is appropriate for fast, non-cryptographic hashing. The exclusion of internal metadata fields (collection_id, document_id, document_created_at) in `compute_existing_chunk_hash` correctly focuses on content comparison.
Also applies to: 69-100
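A hedged sketch of that hashing approach, assuming the `xxhash` dependency added in this PR; the concatenation format is illustrative rather than the exact implementation:

```python
import xxhash

INTERNAL_FIELDS = {"collection_id", "document_id", "document_created_at"}

def existing_chunk_hash(content: str, metadata: dict) -> str:
    # Drop store-internal fields so only user-visible content drives the comparison.
    visible = {k: v for k, v in metadata.items() if k not in INTERNAL_FIELDS}
    payload = content + "".join(f"{k}:{visible[k]}" for k in sorted(visible))
    # xxh64 is fast and non-cryptographic; hexdigest() is 16 hex characters.
    return xxhash.xxh64(payload.encode("utf-8")).hexdigest()
```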
api/helpers/files/parquet/extractor.py (6)
21-35: LGTM! Efficient use of PyArrow vectorized operations. Using `pc.unique()` for extracting unique document names leverages PyArrow's optimized compute functions for better performance with large datasets.
37-51: LGTM! Clean filtering implementation. The use of PyArrow's `pc.equal()` and `filter()` provides efficient row filtering without loading unnecessary data into memory.
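A short sketch of those two PyArrow operations; the column names (`document_name`, `content`) are assumptions for illustration:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "document_name": ["a.md", "a.md", "b.md"],
    "content": ["chunk 1", "chunk 2", "chunk 3"],
})

# Vectorized unique() avoids materialising every row as a Python object.
unique_names = pc.unique(table["document_name"]).to_pylist()        # ['a.md', 'b.md']

# Boolean-mask filtering keeps only the rows of one document.
doc_table = table.filter(pc.equal(table["document_name"], "a.md"))
print(unique_names, doc_table.num_rows)                              # ['a.md', 'b.md'] 2
```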
53-81: LGTM! Robust chunk ID handling.The method correctly handles both scenarios: extracting and validating existing chunk indices, or assigning sequential IDs when the column is absent. The logging on line 80 provides helpful debugging information.
83-106: LGTM! Sequential validation logic is sound.The validation correctly checks for gaps in the sequence by comparing against the expected range. The early return for empty lists (lines 95-96) prevents errors.
108-152: LGTM! Safe JSON parsing with appropriate fallbacks. The `build_chunk_metadata` method safely handles:
- None values
- JSON strings (both objects and arrays)
- Parse errors (falls back to raw string)
This defensive approach prevents crashes from malformed data.
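A minimal sketch of that fallback behaviour (the helper name is hypothetical, not the method's real signature):

```python
import json

def parse_metadata_value(raw):
    # None stays None; JSON objects/arrays are decoded; anything else falls back to the raw value.
    if raw is None or not isinstance(raw, str):
        return raw
    try:
        parsed = json.loads(raw)
        return parsed if isinstance(parsed, (dict, list)) else raw
    except json.JSONDecodeError:
        return raw  # malformed JSON never crashes the import

print(parse_metadata_value('{"lang": "fr"}'))   # {'lang': 'fr'}
print(parse_metadata_value("plain text"))        # 'plain text'
```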
154-167: LGTM! Straightforward row extraction.The method provides a clean interface for extracting all column values for a specific row.
api/helpers/files/parquet/reader.py (3)
26-60: LGTM! Robust file reading with proper error handling.The method correctly:
- Reads the file content asynchronously
- Validates the parquet format
- Extracts and validates columns
- Returns all necessary information
The exception handling on lines 44-46 provides clear error messages for debugging.
62-75: LGTM! Clear validation with helpful error messages. The validation correctly identifies missing required columns and provides actionable error messages listing both what's missing and what's available.
77-95: LGTM! Efficient sorting for document grouping. The multi-level sort ensures chunks are grouped by document and ordered sequentially, which is essential for downstream processing.
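Sketch of such a multi-level sort with PyArrow, assuming `document_name` and `chunk_index` columns (names are illustrative):

```python
import pyarrow as pa

table = pa.table({
    "document_name": ["b.md", "a.md", "a.md"],
    "chunk_index": [0, 1, 0],
})
# Group chunks by document, then order them sequentially within each document.
sorted_table = table.sort_by([("document_name", "ascending"), ("chunk_index", "ascending")])
print(sorted_table.to_pydict())
# {'document_name': ['a.md', 'a.md', 'b.md'], 'chunk_index': [0, 1, 0]}
```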
api/helpers/documents/documents_updater.py (5)
27-40: LGTM: Clean dependency injection pattern.The initialization properly accepts dependencies and instantiates helper utilities. The design allows for flexible testing and component composition.
124-169: LGTM: Efficient chunk preparation.The method efficiently prepares chunk data with conditional hash computation. The logic correctly builds metadata and only computes hashes when needed for comparison.
229-255: LGTM: Proper error handling.The method correctly handles foreign key constraint violations and translates them into appropriate exceptions. The error handling ensures meaningful feedback when collections are deleted during document creation.
257-303: LGTM: Robust chunk insertion.The method properly prepares chunks with complete metadata and handles vectorization failures gracefully with appropriate error wrapping and logging.
102-111: Document IDs change on update—verify if this design is acceptable.
The code exposes document IDs in the public API (the schema includes the `id` field), but the update logic deletes the old document and creates a new one, resulting in a new ID being returned to clients. If external systems, bookmarks, or stored references rely on document IDs, they will break after an update. This appears intentional (document names serve as the stable identifier internally), but the implications depend on your use case:
- If IDs are transient/internal-only: Current behavior is fine.
- If external systems reference IDs: Consider maintaining stable IDs and updating in-place, or implementing ID redirection/mapping.
Confirm whether this behavior aligns with your design requirements.
api/endpoints/collections.py (2)
1-1: LGTM: Appropriate imports for Parquet support.The new imports correctly support file upload handling and Parquet-specific validations for the enhanced collection endpoints.
Also applies to: 6-7, 10-10
148-148: Verify status code change for PATCH endpoint.The status code for the PATCH endpoint was changed from the default (likely 200 or 204) to explicitly 200. Ensure this aligns with your API design:
- 200 (OK) with response body is appropriate when returning update details
- 204 (No Content) would be more RESTful if no body is returned
The current implementation returns a JSON response, so 200 is correct.
47956d4 to d15048e
Actionable comments posted: 4
♻️ Duplicate comments (2)
api/utils/exceptions.py (1)
126-129: MAX_PARQUET_CONTENT_SIZE is unused in the default error message. The newly added `MAX_PARQUET_CONTENT_SIZE` constant (line 126) is not used in the default error message (line 128), which still references `MAX_CONTENT_SIZE`. This was previously raised and marked as "Skipped" by the developer.
api/helpers/data/hasher/chunk_hasher.py (1)
38-46: Add an unambiguous serialization before hashing. Different metadata dictionaries still collapse to the same concatenated string (e.g., `{"a": "bc", "ab": "c"}` vs `{"a": "bca", "b": "c"}` both yield `a:bcab:c`). Because the hash is built on top of this string, distinct chunks can produce identical hashes and slip through change detection. Please serialize each key/value pair unambiguously (e.g., JSON encode values and separate entries) before hashing.
Apply this diff to fix the issue:

```diff
     parts = []
     for key in sorted(chunk.keys()):
         value = chunk.get(key, None)
-        if isinstance(value, (dict | list)):
-            # Serialize complex types to JSON for consistent representation
-            parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
-        else:
-            parts.append(f"{key}:{value}")
-    return "".join(parts)
+        if isinstance(value, (dict | list)):
+            serialized = json.dumps(value, ensure_ascii=False, sort_keys=True)
+        else:
+            serialized = json.dumps(value, ensure_ascii=False, default=str)
+        parts.append(f"{key}={serialized}")
+    return "|".join(parts)
```
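To make the collision concrete, a quick interpreter check (illustrative only) shows why the delimiter and per-value encoding matter:

```python
import json

a = {"a": "bc", "ab": "c"}
b = {"a": "bca", "b": "c"}

# Without a delimiter the two distinct chunks concatenate to the same string.
print("".join(f"{k}:{v}" for k, v in sorted(a.items())))  # a:bcab:c
print("".join(f"{k}:{v}" for k, v in sorted(b.items())))  # a:bcab:c

# With JSON-encoded values and a separator they no longer collide.
print("|".join(f"{k}={json.dumps(v)}" for k, v in sorted(a.items())))  # a="bc"|ab="c"
print("|".join(f"{k}={json.dumps(v)}" for k, v in sorted(b.items())))  # a="bca"|b="c"
```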
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (3)
api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (31)
.github/badges/coverage-unit.json (1 hunks), .github/badges/coverage.json (1 hunks), api/endpoints/collections.py (4 hunks), api/helpers/_accesscontroller.py (1 hunks), api/helpers/_documentmanager.py (2 hunks), api/helpers/_parsermanager.py (3 hunks), api/helpers/data/hasher/__init__.py (1 hunks), api/helpers/data/hasher/chunk_hasher.py (1 hunks), api/helpers/documents/__init__.py (1 hunks), api/helpers/documents/documents_updater.py (1 hunks), api/helpers/files/__init__.py (1 hunks), api/helpers/files/parquet/__init__.py (1 hunks), api/helpers/files/parquet/extractor.py (1 hunks), api/helpers/files/parquet/reader.py (1 hunks), api/schemas/collections.py (2 hunks), api/schemas/core/documents.py (1 hunks), api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks), api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks), api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks), api/tests/integ/conftest.py (1 hunks), api/tests/integ/test_admin.py (2 hunks), api/tests/integ/test_chat.py (1 hunks), api/tests/integ/test_chunks.py (1 hunks), api/tests/integ/test_collections.py (14 hunks), api/tests/integ/test_documents.py (1 hunks), api/tests/integ/test_files.py (1 hunks), api/tests/integ/test_search.py (2 hunks), api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks), api/tests/unit/test_helpers/test_parsermanager.py (2 hunks), api/utils/exceptions.py (2 hunks), pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (18)
api/helpers/_documentmanager.py (5)
api/helpers/documents/documents_updater.py (2)
DocumentsUpdater(19-303)process_document(42-122)api/helpers/files/parquet/extractor.py (3)
ParquetDataExtractor(13-167)get_unique_document_names(22-35)filter_document_table(38-51)api/helpers/files/parquet/reader.py (4)
ParquetReader(20-139)read_and_validate(32-83)sort_table(101-118)identify_metadata_columns(121-139)api/utils/exceptions.py (1)
CollectionNotFoundException(78-80)api/clients/vector_store/_basevectorstoreclient.py (1)
delete_document(55-56)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
CollectionVisibility(9-11)api/schemas/admin/roles.py (1)
PermissionType(10-14)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
ParquetDataExtractor(13-167)api/helpers/files/parquet/reader.py (1)
ParquetReader(20-139)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
ChunkHasher(10-99)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
CollectionVisibility(9-11)api/tests/integ/conftest.py (1)
client(335-361)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (3)
ChunkHasher(10-99)concatenate_chunk_values(19-46)compute_chunk_hash(49-66)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
client(335-361)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
client(335-361)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
Chunk(6-10)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
FileType(19-25)api/utils/exceptions.py (2)
FileSizeLimitExceededException(124-129)InvalidPARQUETFormatException(143-145)api/helpers/_parsermanager.py (1)
_detect_file_type(61-92)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
FileType(19-25)api/utils/exceptions.py (1)
UnsupportedFileTypeException(163-165)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
client(335-361)api/schemas/collections.py (1)
CollectionVisibility(9-11)api/tests/integ/test_documents.py (1)
collection(13-21)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
DocumentsUpdater(19-303)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
FileType(19-25)api/helpers/_parsermanager.py (2)
ParserManager(16-215)parse_file(110-120)api/utils/exceptions.py (1)
UnsupportedFileTypeException(163-165)api/helpers/_documentmanager.py (1)
parse_file(335-346)
api/helpers/documents/documents_updater.py (4)
api/helpers/data/hasher/chunk_hasher.py (4)
ChunkHasher(10-99)concatenate_chunk_values(19-46)compute_chunk_hash(49-66)compute_existing_chunk_hash(69-99)api/helpers/files/parquet/extractor.py (5)
ParquetDataExtractor(13-167)extract_chunk_ids(54-81)extract_metadata_columns(109-120)build_chunk_metadata(123-152)get_row_values(155-167)api/schemas/chunks.py (1)
Chunk(6-10)api/utils/exceptions.py (2)
CollectionNotFoundException(78-80)VectorizationFailedException(190-192)
🔇 Additional comments (28)
.github/badges/coverage.json (1)
1-1: Overall coverage increased: +0.37 percentage points.The overall test coverage improved to 73.14%, which is encouraging. However, review the unit test coverage trend (see related badge file).
api/helpers/files/__init__.py (1)
1-1: LGTM!The docstring appropriately describes this package's purpose.
pyproject.toml (2)
3-3: LGTM!Version bump to 0.3.0 is appropriate for the new Parquet ingestion feature.
46-47: Dependency versions verified—both are secure.
- xxhash==3.6.0: Already at latest version with no known vulnerabilities.
- pyarrow==21.0.0: Secure (not affected by known CVEs). Latest available is 22.0.0; consider upgrading to 22.0.0 for the most current release.
api/helpers/documents/__init__.py (1)
1-5: LGTM!Clean module structure with appropriate public export of DocumentsUpdater.
api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1)
1-86: LGTM!Standard VCR cassette file recording embeddings API interactions for the PUT collection with Parquet file test.
api/tests/integ/test_chat.py (1)
29-29: LGTM!Test correctly updated to send collection name via query parameters, aligning with the updated API endpoint signature.
api/schemas/core/documents.py (1)
21-21: LGTM!Addition of PARQUET file type to the enum is appropriate and follows the existing pattern.
api/tests/integ/test_chunks.py (1)
15-18: LGTM!Test correctly updated to use query parameters and enum value, aligning with API endpoint changes.
api/tests/integ/test_documents.py (1)
14-17: LGTM!Test correctly updated to use query parameters with enum value serialization, consistent with the updated collections endpoint.
api/tests/integ/conftest.py (1)
343-343: LGTM! Consistent test helper pattern. The addition of PUT method helpers (`put_without_permissions`, `put_with_permissions`, `put_master`) follows the established pattern for other HTTP methods and properly configures authorization headers using `partial`.
Also applies to: 350-350, 357-357
api/tests/integ/test_search.py (1)
17-17: LGTM! Consistent migration to query parameters.The collection creation calls correctly switched from JSON body to query parameters, aligning with the endpoint signature changes documented in the broader PR context.
Also applies to: 103-103
api/utils/exceptions.py (1)
143-145: LGTM! Clear exception for Parquet format validation. The new `InvalidPARQUETFormatException` follows the established pattern for format-specific exceptions and provides appropriate HTTP 422 status for validation errors.
api/tests/integ/test_admin.py (1)
11-11: LGTM! Improved type safety with enum. The import of `CollectionVisibility` and its usage with `.PRIVATE.value` in the collection creation request improves type safety and consistency across the test suite.
Also applies to: 406-406
api/helpers/_accesscontroller.py (1)
278-281: LGTM! Correctly adapted to query parameter inputs. The permission check now reads `visibility` from `request.query_params` instead of the request body, aligning with the endpoint signature changes that accept collection metadata as query parameters.
api/tests/integ/test_files.py (1)
15-15: LGTM! Consistent enum usage and query parameters. The collection creation calls correctly use query parameters with `CollectionVisibility.PRIVATE.value` and `CollectionVisibility.PUBLIC.value`, ensuring type safety and consistency with the updated endpoint signatures.
Also applies to: 22-22
api/helpers/data/hasher/__init__.py (1)
1-5: LGTM! Clean module structure. The package initialization follows Python best practices by providing a clear docstring and using `__all__` to explicitly define the public API surface.
api/tests/unit/test_helpers/test_chunk_hasher.py (4)
12-26: LGTM! Essential determinism test.The test correctly verifies that identical chunk dictionaries produce identical hashes and validates the expected 16-character hexadecimal output format from xxHash64.
27-38: LGTM! Collision detection test.The test correctly verifies that different content produces different hashes, which is critical for change detection functionality.
40-51: LGTM! Metadata sensitivity test. The test correctly validates that metadata changes (e.g., `metadata_version`) affect the hash output, ensuring comprehensive change detection beyond just content.
53-68: LGTM! Concatenation consistency test.The test correctly verifies that
concatenate_chunk_valuesproduces deterministic output and includes all expected keys, which is essential for reproducible hash computation.api/helpers/files/parquet/reader.py (2)
31-83: LGTM! Well-structured file validation and reading. The `read_and_validate` method efficiently handles file size validation using `seek`/`tell`, properly resets the file pointer, validates file type and structure, and provides clear error messages. The separation of sorting logic (via `sort_table`) to the caller is appropriate.
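A tiny sketch of the seek/tell size check described here, written against a plain file object since the exact upload type is not shown in this review; the limit constant is illustrative:

```python
import io

MAX_PARQUET_CONTENT_SIZE = 100 * 1024 * 1024  # illustrative limit, not the project's constant

def validate_size(fh: io.BufferedIOBase) -> int:
    fh.seek(0, io.SEEK_END)   # jump to the end of the stream
    size = fh.tell()          # the position there is the total size in bytes
    fh.seek(0)                # rewind so the Parquet reader sees the full content
    if size > MAX_PARQUET_CONTENT_SIZE:
        raise ValueError(f"file of {size} bytes exceeds the allowed limit")
    return size
```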
120-139: LGTM! Mutation bug properly fixed. The `identify_metadata_columns` method now correctly uses `.copy()` to avoid mutating the class constant `REQUIRED_COLUMNS`. Good fix from the previous review feedback.
api/helpers/documents/documents_updater.py (2)
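The bug class being fixed here is easy to reproduce; a small sketch with assumed column names shows why the `.copy()` matters:

```python
class ParquetReaderSketch:
    REQUIRED_COLUMNS = ["document_name", "content"]  # assumed names, shared by all instances

    def identify_metadata_columns(self, available_columns: list[str]) -> list[str]:
        known = self.REQUIRED_COLUMNS.copy()  # without .copy(), append() would mutate the class constant
        known.append("chunk_index")
        return [col for col in available_columns if col not in known]

reader = ParquetReaderSketch()
print(reader.identify_metadata_columns(["document_name", "content", "chunk_index", "lang"]))  # ['lang']
print(ParquetReaderSketch.REQUIRED_COLUMNS)  # still ['document_name', 'content']
```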
42-122: LGTM! Well-orchestrated document processing flow. The `process_document` method cleanly orchestrates the update workflow: extracting data, checking for changes, handling new vs. updated documents, and managing statistics. The delete-before-create pattern for updates is appropriate.
171-227: LGTM! Efficient change detection with hash comparison. The update detection logic efficiently compares chunk counts first, then hashes, stopping at the first difference. The logging of specific differences is helpful for debugging.
Note: The 10,000 chunk limit at line 206 was previously flagged and you chose to skip addressing it.
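In sketch form, the comparison described above amounts to the following (function and variable names are illustrative, not the helper's real API):

```python
def update_needed(new_hashes: list[str], existing_hashes: list[str]) -> bool:
    # A different chunk count always means the document changed.
    if len(new_hashes) != len(existing_hashes):
        return True
    # Otherwise stop at the first chunk whose hash differs.
    return any(new != old for new, old in zip(new_hashes, existing_hashes))

print(update_needed(["h1", "h2"], ["h1", "h2"]))  # False – identical upload, nothing to do
print(update_needed(["h1", "hX"], ["h1", "h2"]))  # True – second chunk changed
```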
api/endpoints/collections.py (3)
39-59: LGTM! Robust error handling with collection cleanup. The exception handling properly rolls back the collection creation if parquet processing fails, ensuring no orphaned collections are left in the database. The re-raising of the exception maintains proper error propagation to the client.
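A condensed sketch of that create-then-ingest-or-roll-back flow; the helper names mirror the ones referenced in this review but the signatures are assumptions:

```python
async def create_collection_with_parquet(document_manager, session, user, name, visibility, file):
    collection_id = await document_manager.create_collection(session, user, name, visibility)
    try:
        details = await document_manager.update_collection_from_parquet(session, user, collection_id, file)
    except Exception:
        # Clean up the freshly created collection so a bad upload leaves no orphan behind,
        # then re-raise so the client still receives the original error.
        await document_manager.delete_collection(session, user, collection_id)
        raise
    return {"id": collection_id, "details": details}
```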
138-189: LGTM! Clean implementation of metadata updates with good refactoring.The update endpoint properly separates metadata updates from document updates. The dictionary-based approach for tracking metadata changes (lines 162-177) is a good improvement from previous feedback, making the code more maintainable and the response construction cleaner.
192-220: LGTM! Well-designed force update endpoint. The `force_update_collection` endpoint appropriately uses PUT for full replacement semantics and makes the file parameter required. The `force_update=True` flag clearly communicates the delete-all-then-recreate behavior, and the documentation properly explains the requirements and behavior.
534ce22 to 6b9aadd
Actionable comments posted: 4
♻️ Duplicate comments (3)
.github/badges/coverage-unit.json (1)
1-1: Coverage decline already flagged in previous review. The unit test coverage decrease has already been comprehensively addressed in the previous review comment, which identified that DocumentsUpdater and Parquet-related features lack unit test coverage.
api/helpers/data/hasher/chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
38-46: Fix chunk hash collisions between content and metadata. Concatenating key/value pairs without a delimiter collapses different chunks to the same hash. For example, `{"content": "foo", "source": "bar"}` and `{"content": "foosource:bar"}` both currently produce `content:foosource:bar`, so we would fail to notice metadata changes. Please emit each pair with an unambiguous separator (and normalize `None`) before hashing to make collisions impossible.

```diff
     parts = []
     for key in sorted(chunk.keys()):
-        value = chunk.get(key, None)
-        if isinstance(value, (dict | list)):
-            # Serialize complex types to JSON for consistent representation
-            parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
-        else:
-            parts.append(f"{key}:{value}")
-    return "".join(parts)
+        value = chunk.get(key)
+        if isinstance(value, (dict | list)):
+            serialized = json.dumps(value, ensure_ascii=False, separators=(",", ":"))
+        elif value is None:
+            serialized = "null"
+        else:
+            serialized = str(value)
+        parts.append(f"{key}:{serialized}")
+    return "\u001f".join(parts)
```

api/helpers/documents/documents_updater.py (1)
200-215: Remove the 10 000 chunk cap; it still breaks update detection. `get_chunks(..., limit=10000)` means we only ever compare the first 10 000 stored chunks. Any document larger than that is always flagged as "changed" (because the new dataset has N chunks but we only fetched 10 000), so we delete and rebuild it on every update—even when nothing changed. Combined with the issue above, that false positive can wipe a healthy document if vectorization glitches once. Please paginate until all chunks are loaded (as previously suggested) or adopt a hash that covers the full set.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (3)
api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (31)
.github/badges/coverage-unit.json (1 hunks), .github/badges/coverage.json (1 hunks), api/endpoints/collections.py (4 hunks), api/helpers/_accesscontroller.py (1 hunks), api/helpers/_documentmanager.py (2 hunks), api/helpers/_parsermanager.py (3 hunks), api/helpers/data/hasher/__init__.py (1 hunks), api/helpers/data/hasher/chunk_hasher.py (1 hunks), api/helpers/documents/__init__.py (1 hunks), api/helpers/documents/documents_updater.py (1 hunks), api/helpers/files/__init__.py (1 hunks), api/helpers/files/parquet/__init__.py (1 hunks), api/helpers/files/parquet/extractor.py (1 hunks), api/helpers/files/parquet/reader.py (1 hunks), api/schemas/collections.py (2 hunks), api/schemas/core/documents.py (1 hunks), api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks), api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks), api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks), api/tests/integ/conftest.py (1 hunks), api/tests/integ/test_admin.py (2 hunks), api/tests/integ/test_chat.py (1 hunks), api/tests/integ/test_chunks.py (1 hunks), api/tests/integ/test_collections.py (14 hunks), api/tests/integ/test_documents.py (1 hunks), api/tests/integ/test_files.py (1 hunks), api/tests/integ/test_search.py (2 hunks), api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks), api/tests/unit/test_helpers/test_parsermanager.py (2 hunks), api/utils/exceptions.py (2 hunks), pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (19)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
CollectionVisibility(9-11)api/schemas/admin/roles.py (1)
PermissionType(10-14)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
client(335-361)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
CollectionVisibility(9-11)api/tests/integ/conftest.py (1)
client(335-361)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
DocumentsUpdater(19-301)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
FileType(19-25)api/utils/exceptions.py (2)
FileSizeLimitExceededException(124-129)InvalidPARQUETFormatException(143-145)api/helpers/_parsermanager.py (1)
_detect_file_type(61-92)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
client(335-361)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
ChunkHasher(10-99)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
client(335-361)api/schemas/collections.py (1)
CollectionVisibility(9-11)api/tests/integ/test_documents.py (1)
collection(13-21)
api/helpers/documents/documents_updater.py (4)
api/helpers/data/hasher/chunk_hasher.py (4)
ChunkHasher(10-99)concatenate_chunk_values(19-46)compute_chunk_hash(49-66)compute_existing_chunk_hash(69-99)api/helpers/files/parquet/extractor.py (5)
ParquetDataExtractor(13-167)extract_chunk_ids(54-81)extract_metadata_columns(109-120)build_chunk_metadata(123-152)get_row_values(155-167)api/schemas/chunks.py (1)
Chunk(6-10)api/utils/exceptions.py (2)
CollectionNotFoundException(78-80)VectorizationFailedException(190-192)
api/endpoints/collections.py (2)
api/schemas/collections.py (3)
Collection(17-26)Collections(29-31)CollectionVisibility(9-11)api/helpers/_documentmanager.py (4)
create_collection(74-83)update_collection_from_parquet(496-608)delete_collection(86-101)update_collection(104-126)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (3)
ChunkHasher(10-99)concatenate_chunk_values(19-46)compute_chunk_hash(49-66)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
ParquetDataExtractor(13-167)api/helpers/files/parquet/reader.py (1)
ParquetReader(20-139)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
Chunk(6-10)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
FileType(19-25)api/helpers/_parsermanager.py (2)
ParserManager(16-215)parse_file(110-120)api/utils/exceptions.py (1)
UnsupportedFileTypeException(163-165)api/helpers/_documentmanager.py (1)
parse_file(335-346)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
FileType(19-25)api/utils/exceptions.py (1)
UnsupportedFileTypeException(163-165)
api/helpers/_documentmanager.py (8)
api/helpers/documents/documents_updater.py (2)
DocumentsUpdater(19-301)process_document(40-120)api/helpers/files/parquet/extractor.py (3)
ParquetDataExtractor(13-167)get_unique_document_names(22-35)filter_document_table(38-51)api/helpers/files/parquet/reader.py (4)
ParquetReader(20-139)read_and_validate(32-83)sort_table(101-118)identify_metadata_columns(121-139)api/utils/exceptions.py (2)
CollectionNotFoundException(78-80)VectorizationFailedException(190-192)api/endpoints/collections.py (1)
create_collection(16-61)api/clients/vector_store/_elasticsearchvectorstoreclient.py (2)
create_collection(27-97)delete_document(117-120)api/clients/vector_store/_qdrantvectorstoreclient.py (2)
create_collection(45-54)delete_document(77-79)api/clients/vector_store/_basevectorstoreclient.py (2)
create_collection(39-40)delete_document(55-56)
🔇 Additional comments (21)
.github/badges/coverage.json (1)
1-1: Confirm this is CI-generated, not manually committed. Coverage badge files should typically be auto-generated by your CI/CD pipeline (e.g., codecov workflow) rather than manually committed to the repository. If this was generated locally or via a pre-commit hook, please ensure the generation process is documented and automated to prevent manual divergence.
Do you have a GitHub Actions workflow or similar CI job that regenerates this badge automatically? If so, this file may belong in `.gitignore` to avoid merge conflicts and drift from actual coverage values.
api/schemas/core/documents.py (1)
21-21: LGTM! FileType enum extended for Parquet support.The addition of the PARQUET file type is consistent with the existing enum structure and aligns with the broader Parquet support being added across the codebase.
api/helpers/files/__init__.py (1)
1-1: LGTM! Module documentation added.The docstring appropriately describes the module's purpose.
api/tests/integ/conftest.py (1)
343-357: LGTM! PUT helpers added consistently. The new PUT method helpers (`put_without_permissions`, `put_with_permissions`, `put_master`) follow the same pattern as existing HTTP method helpers and properly configure authorization headers for different permission levels.
278-281: LGTM! Visibility check updated to use query parameters.The change from reading visibility from the request body to query parameters aligns with the updated API design where collection metadata is passed via query params. The permission check logic remains correct.
api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1)
1-170: LGTM! Test cassette for collection metadata and file patching.This VCR cassette fixture correctly records embeddings API interactions for integration testing of the collection PATCH endpoint with parquet file uploads.
api/tests/unit/test_helpers/test_parsermanager.py (2)
437-446: LGTM! Extension map test updated for PARQUET. The test correctly verifies that `.parquet` is mapped to `FileType.PARQUET` in the extension map.
451-453: LGTM! Content types validation includes PARQUET. The test correctly verifies that `FileType.PARQUET` has valid content types defined.
api/tests/integ/test_chat.py (1)
29-29: LGTM: Clean migration to query parameters.The collection creation correctly uses query parameters instead of a JSON body, aligning with the updated endpoint signature across the codebase.
api/tests/integ/test_documents.py (1)
14-17: LGTM: Proper enum serialization and parameter usage. The fixture correctly passes collection metadata via query parameters and uses `.value` to serialize the visibility enum to its string representation.
15-18: LGTM: Consistent with test suite pattern.The setup fixture follows the same query parameter pattern established across other integration tests.
api/tests/integ/test_search.py (2)
17-17: LGTM: Setup fixture updated correctly.The collection creation in the setup fixture properly uses query parameters.
103-103: LGTM: Test updated consistently.The test correctly creates a collection using query parameters, maintaining consistency with other test updates.
api/utils/exceptions.py (2)
126-126: LGTM: Parquet size limit constant added.The MAX_PARQUET_CONTENT_SIZE constant provides a clear size constraint for Parquet file processing.
143-146: LGTM: Clean exception class for Parquet validation.The InvalidPARQUETFormatException follows the established exception pattern and provides appropriate HTTP status code (422) for format validation errors.
api/tests/integ/test_files.py (2)
13-17: LGTM: Private collection setup updated correctly.The fixture properly creates a private collection using query parameters with enum value serialization.
20-24: LGTM: Public collection setup follows same pattern.Consistent with the private collection setup and other test files in the suite.
api/tests/integ/test_admin.py (2)
11-11: LGTM: Import added for enum usage.The CollectionVisibility import supports the updated collection creation pattern in the test.
403-407: LGTM: Test correctly uses enum value. The collection creation properly uses query parameters with `CollectionVisibility.PRIVATE.value` for type safety and consistency.
api/helpers/data/hasher/__init__.py (1)
1-5: LGTM: Clean module interface. The module properly exposes ChunkHasher via `__all__` with a clear docstring describing its purpose. This follows Python packaging best practices for public API definition.
12-68: Nice coverage for the hash helper.The deterministic, content/metadata variance, and concat smoke checks give us confidence the helper behaves as intended.
cfcef3e to 2035450
Actionable comments posted: 7
♻️ Duplicate comments (2)
api/helpers/data/hasher/chunk_hasher.py (1)
38-46: Add separators to prevent hash collisions (unresolved from past review). The past review flagged this collision risk, and while marked "Partially Done," the fundamental issue remains unresolved. Without delimiters between key-value pairs, distinct dictionaries can produce identical concatenated strings:
{"a": "b:c", "d": "e"} → "a:b:cd:e"
{"a": "b", "c": "d:e"} → "a:bcd:e" ← collision!
This creates a risk of false hash matches in change detection, causing the system to incorrectly skip legitimate document updates.
Apply this fix to add unambiguous separators:

```diff
     parts = []
     for key in sorted(chunk.keys()):
         value = chunk.get(key, None)
         if isinstance(value, (dict | list)):
             # Serialize complex types to JSON for consistent representation
             parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
         else:
             parts.append(f"{key}:{value}")
-    return "".join(parts)
+    return "|".join(parts)  # Separator prevents collisions
```

api/helpers/documents/documents_updater.py (1)
215-233: Remove the 10k chunk cap when comparing existing content. Limiting `get_chunks` to 10 000 still causes us to ignore differences in chunks beyond that index, so large documents can silently skip updates. Please page through all chunks (or otherwise compare full content) before deciding "no change."

```diff
-        existing_chunks = await self.vector_store.get_chunks(
-            collection_id=collection_id,
-            document_id=existing_doc_id,
-            offset=0,
-            limit=10000,
-        )
-        existing_chunks.sort(key=lambda c: c.id)
+        existing_chunks: list[Chunk] = []
+        offset = 0
+        page_size = 1000
+        while True:
+            batch = await self.vector_store.get_chunks(
+                collection_id=collection_id,
+                document_id=existing_doc_id,
+                offset=offset,
+                limit=page_size,
+            )
+            if not batch:
+                break
+            existing_chunks.extend(batch)
+            if len(batch) < page_size:
+                break
+            offset += page_size
+        existing_chunks.sort(key=lambda c: c.id)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (3)
api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (32)
.github/badges/coverage-unit.json (1 hunks), .github/badges/coverage.json (1 hunks), api/endpoints/collections.py (4 hunks), api/helpers/_accesscontroller.py (1 hunks), api/helpers/_documentmanager.py (2 hunks), api/helpers/_parsermanager.py (3 hunks), api/helpers/data/hasher/__init__.py (1 hunks), api/helpers/data/hasher/chunk_hasher.py (1 hunks), api/helpers/documents/__init__.py (1 hunks), api/helpers/documents/documents_updater.py (1 hunks), api/helpers/files/__init__.py (1 hunks), api/helpers/files/parquet/__init__.py (1 hunks), api/helpers/files/parquet/extractor.py (1 hunks), api/helpers/files/parquet/reader.py (1 hunks), api/schemas/collections.py (2 hunks), api/schemas/core/documents.py (1 hunks), api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks), api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks), api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks), api/tests/integ/conftest.py (1 hunks), api/tests/integ/test_admin.py (2 hunks), api/tests/integ/test_chat.py (1 hunks), api/tests/integ/test_chunks.py (1 hunks), api/tests/integ/test_collections.py (14 hunks), api/tests/integ/test_documents.py (1 hunks), api/tests/integ/test_files.py (1 hunks), api/tests/integ/test_search.py (2 hunks), api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks), api/tests/unit/test_helpers/test_documents_updater.py (1 hunks), api/tests/unit/test_helpers/test_parsermanager.py (2 hunks), api/utils/exceptions.py (2 hunks), pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (20)
api/tests/integ/test_files.py (2)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
- api/tests/integ/conftest.py (1): client (335-361)
api/tests/integ/test_search.py (1)
- api/tests/integ/conftest.py (1): client (335-361)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
- api/helpers/data/hasher/chunk_hasher.py (3): ChunkHasher (10-99), concatenate_chunk_values (19-46), compute_chunk_hash (49-66)
api/tests/unit/test_helpers/test_parsermanager.py (4)
- api/schemas/core/documents.py (1): FileType (19-25)
- api/helpers/_parsermanager.py (2): ParserManager (16-215), parse_file (110-120)
- api/utils/exceptions.py (1): UnsupportedFileTypeException (163-165)
- api/helpers/_documentmanager.py (1): parse_file (335-346)
api/helpers/files/parquet/reader.py (3)
- api/schemas/core/documents.py (1): FileType (19-25)
- api/utils/exceptions.py (2): FileSizeLimitExceededException (124-129), InvalidPARQUETFormatException (143-145)
- api/helpers/_parsermanager.py (1): _detect_file_type (61-92)
api/helpers/data/hasher/chunk_hasher.py (1)
- api/schemas/chunks.py (1): Chunk (6-10)
api/helpers/documents/__init__.py (1)
- api/helpers/documents/documents_updater.py (1): DocumentsUpdater (19-318)
api/tests/integ/test_chunks.py (1)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
api/tests/integ/test_collections.py (3)
- api/tests/integ/conftest.py (1): client (335-361)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
- api/tests/integ/test_documents.py (1): collection (13-21)
api/helpers/files/parquet/__init__.py (2)
- api/helpers/files/parquet/extractor.py (1): ParquetDataExtractor (13-167)
- api/helpers/files/parquet/reader.py (1): ParquetReader (20-139)
api/endpoints/collections.py (6)
- api/helpers/_accesscontroller.py (1): AccessController (56-399)
- api/schemas/collections.py (3): Collection (17-26), Collections (29-31), CollectionVisibility (9-11)
- api/helpers/_documentmanager.py (4): create_collection (74-83), update_collection_from_parquet (496-608), delete_collection (86-101), update_collection (104-126)
- api/sql/session.py (1): get_db_session (33-39)
- api/tests/integ/test_documents.py (1): collection (13-21)
- api/utils/exceptions.py (1): CollectionNotFoundException (78-80)
api/tests/integ/test_chat.py (1)
- api/tests/integ/conftest.py (1): client (335-361)
api/helpers/data/hasher/__init__.py (1)
- api/helpers/data/hasher/chunk_hasher.py (1): ChunkHasher (10-99)
api/tests/integ/test_documents.py (1)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
api/helpers/_parsermanager.py (2)
- api/schemas/core/documents.py (1): FileType (19-25)
- api/utils/exceptions.py (1): UnsupportedFileTypeException (163-165)
api/tests/unit/test_helpers/test_documents_updater.py (2)
- api/helpers/documents/documents_updater.py (3): _prepare_chunks_data (139-184), _check_update_needed (186-242), process_document (40-137)
- api/utils/exceptions.py (1): VectorizationFailedException (190-192)
api/helpers/documents/documents_updater.py (5)
- api/helpers/data/hasher/chunk_hasher.py (4): ChunkHasher (10-99), concatenate_chunk_values (19-46), compute_chunk_hash (49-66), compute_existing_chunk_hash (69-99)
- api/helpers/files/parquet/extractor.py (5): ParquetDataExtractor (13-167), extract_chunk_ids (54-81), extract_metadata_columns (109-120), build_chunk_metadata (123-152), get_row_values (155-167)
- api/schemas/chunks.py (1): Chunk (6-10)
- api/utils/exceptions.py (2): CollectionNotFoundException (78-80), VectorizationFailedException (190-192)
- api/helpers/_documentmanager.py (1): get_chunks (303-332)
api/helpers/_accesscontroller.py (2)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
- api/schemas/admin/roles.py (1): PermissionType (10-14)
api/helpers/_documentmanager.py (3)
- api/helpers/documents/documents_updater.py (2): DocumentsUpdater (19-318), process_document (40-137)
- api/helpers/files/parquet/extractor.py (3): ParquetDataExtractor (13-167), get_unique_document_names (22-35), filter_document_table (38-51)
- api/helpers/files/parquet/reader.py (4): ParquetReader (20-139), read_and_validate (32-83), sort_table (101-118), identify_metadata_columns (121-139)
api/tests/integ/test_admin.py (1)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
🔇 Additional comments (27)
.github/badges/coverage.json (1)
1-1: Coverage improved: +0.28 percentage points.Overall test coverage increased from 72.77% to 73.05%. This is a positive change.
.github/badges/coverage-unit.json (1)
1-1: Unit test coverage declined slightly: −0.17 percentage points.Unit test coverage decreased from 58.95% to 58.78%. However, the enriched summary confirms that new unit tests were added for ChunkHasher (api/tests/unit/test_helpers/test_chunk_hasher.py) and DocumentsUpdater (api/tests/unit/test_helpers/test_documents_updater.py). The slight decline is likely due to the proportion of new code added vs. unit test coverage added.
api/schemas/core/documents.py (1)
19-26: LGTM! Clean addition of Parquet file type.The PARQUET enum member follows the existing pattern and enables Parquet file handling in the codebase.
api/helpers/files/__init__.py (1)
1-1: LGTM! Appropriate module docstring.Standard Python practice to document module purposes.
pyproject.toml (2)
3-3: LGTM! Appropriate version bump.Version bumped from 0.2.8post3 to 0.3.0, which correctly indicates a minor release with new features (Parquet support, collection update endpoints).
46-48: LGTM! Security vulnerability resolved.pyarrow upgraded to 22.0.0, which addresses the critical security vulnerability flagged in previous reviews. xxhash 3.6.0 is the latest version. Both dependencies are correctly placed in the optional api group to support Parquet processing and chunk hashing features.
api/tests/integ/conftest.py (3)
343-343: LGTM! Consistent test helper for PUT requests. The put_without_permissions helper follows the existing pattern for other HTTP methods, enabling permission-scoped testing of the new PUT endpoint.
350-350: LGTM! Consistent test helper for PUT requests. The put_with_permissions helper follows the existing pattern, enabling admin-level testing of the new PUT endpoint.
357-357: LGTM! Consistent test helper for PUT requests. The put_master helper follows the existing pattern, enabling master-level testing of the new PUT endpoint.
api/schemas/collections.py (1)
1-14: LGTM! Clean migration to Pydantic v2 patterns.The code now uses modern Pydantic v2 features (Annotated, StringConstraints) instead of v1-style patterns. The CollectionName type alias provides reusable validation (whitespace stripping, minimum length) for collection names. This aligns with the API changes where collection metadata is now passed via query parameters instead of request bodies.
api/tests/integ/test_chat.py (1)
29-29: LGTM! Test updated to match new API contract.Collection creation now uses query parameters instead of a JSON body, aligning with the removal of CollectionRequest and the new query-parameter-based API design.
api/tests/integ/test_chunks.py (1)
17-17: LGTM! Collection creation migrated to query parameters. The test correctly uses query parameters for collection creation with CollectionVisibility.PRIVATE.value to pass the enum's underlying string value, aligning with the updated endpoint signature.
api/tests/integ/test_documents.py (1)
16-16: LGTM! Consistent with endpoint signature changes. The collection fixture correctly migrated to query parameters, using CollectionVisibility.PRIVATE.value to provide the string representation of the enum.
api/tests/integ/test_admin.py (2)
11-11: LGTM! Import added to support enum usage. The CollectionVisibility import is correctly added to support the updated collection creation pattern using enum values.
406-406: LGTM! Test updated to use enum for type safety. The collection creation now correctly uses CollectionVisibility.PRIVATE.value with query parameters, improving type safety and consistency across tests.
api/utils/exceptions.py (2)
126-126: LGTM! Parquet size limit constant added. The MAX_PARQUET_CONTENT_SIZE constant provides a clear size limit for Parquet file validation, maintaining consistency with the existing MAX_CONTENT_SIZE pattern.
143-145: LGTM! Exception class follows established patterns. The InvalidPARQUETFormatException is correctly implemented with status code 422 and follows the same structure as other format validation exceptions (e.g., InvalidJSONFormatException).
api/helpers/_accesscontroller.py (1)
277-281: LGTM! Access control correctly migrated to query parameters. The _check_collections method now reads visibility from request.query_params instead of the request body, correctly aligning with the updated endpoint signature. The permission check logic remains sound.
api/tests/integ/test_files.py (2)
15-15: LGTM! Private collection creation updated correctly. The test properly uses query parameters with CollectionVisibility.PRIVATE.value to create the private test collection.
22-22: LGTM! Public collection creation updated correctly. The test properly uses query parameters with CollectionVisibility.PUBLIC.value to create the public test collection, ensuring appropriate test coverage for both visibility types.
api/helpers/data/hasher/__init__.py (1)
1-5: LGTM! Well-structured module initialization. The module correctly exposes the ChunkHasher utility with proper documentation and follows Python packaging conventions. The explicit __all__ export list provides clear public API boundaries.
api/helpers/_parsermanager.py (1)
114-116: LGTM! PARQUET handling is correctly implemented. The guard clause properly prevents PARQUET files from being parsed as documents with a clear error message. This aligns with the PR's design where PARQUET files are exclusively handled by collection endpoints.
api/tests/unit/test_helpers/test_parsermanager.py (1)
455-466: LGTM! PARQUET rejection test is adequate. The test correctly verifies that PARQUET files trigger UnsupportedFileTypeException in parse_file. While the exception message isn't explicitly verified (per the skipped past review suggestion), the test adequately ensures the rejection behavior.
api/helpers/files/parquet/__init__.py (1)
1-6: LGTM! Package exports are correctly defined. The module properly exports the two main Parquet utility classes with an appropriate __all__ declaration.
api/tests/unit/test_helpers/test_documents_updater.py (1)
1-450: LGTM! Comprehensive test coverage for DocumentsUpdater.The test suite is well-structured and thorough:
- Proper fixtures for mocking dependencies (vector store, callbacks)
- Comprehensive coverage of chunk preparation scenarios (metadata, force_update, empty content)
- Update detection logic tested across multiple scenarios (new docs, force updates, hash comparisons, chunk count differences)
- Document processing workflow tested including rollback behavior on failure
- Appropriate use of AsyncMock for async operations and patch for internal methods
api/tests/integ/test_collections.py (2)
295-370: Well-structured multi-step integration test.This test effectively validates the complete Parquet processing pipeline through three upload scenarios:
- Initial upload with metadata update (lines 304-325)
- Identical file re-upload detecting no changes (lines 327-347)
- Updated file detecting specific changes (lines 349-370)
The assertions correctly verify both metadata updates and document processing statistics, providing good coverage of the update detection logic.
11-13: All required Parquet test assets are present in the repository. Verification confirms that all three Parquet files referenced in the integration tests are committed and available:
- api/tests/integ/assets/parquet.parquet
- api/tests/integ/assets/updated_parquet.parquet
- api/tests/integ/assets/parquet_wrong_format.parquet
The tests will execute successfully with these assets in place.
9f01d87 to fb77e16
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
api/tests/integ/test_files.py (1)
40-41: Remove redundant file.close() inside context managers. The with block closes the file; explicit close is unnecessary and brittle if reused later.
```diff
- response = client.post_without_permissions(url=f"/v1{ENDPOINT__FILES}", data=data, files=files)
- file.close()
+ response = client.post_without_permissions(url=f"/v1{ENDPOINT__FILES}", data=data, files=files)
```
(Apply similarly to other tests in this file.)
Also applies to: 52-53, 65-65, 76-76, 88-88, 100-100, 112-112, 124-124, 136-136, 148-148, 159-159
api/tests/unit/test_helpers/test_parsermanager.py (2)
158-167: Fix tautological assertion; assert the replacement character. `"" in result` is always true for any string; assert the U+FFFD occurrence and non-empty content instead.
```diff
- # Should contain replacement characters
- assert "" in result or len(result) > 0
+ # Should contain the Unicode replacement character from decode(errors="replace")
+ assert "\uFFFD" in result
+ assert len(result) > 0
```
110-121: Use PropertyMock on the type, not the instance. To raise on attribute access, set the PropertyMock on the mock's type; the current code may not exercise the intended path.
```diff
- mock_file = MagicMock(spec=UploadFile)
- # Make accessing filename raise an exception
- mock_file.filename = PropertyMock(side_effect=Exception("Access error"))
- mock_file.content_type = "text/plain"
+ mock_file = MagicMock(spec=UploadFile)
+ type(mock_file).filename = PropertyMock(side_effect=Exception("Access error"))
+ type(mock_file).content_type = PropertyMock(return_value="text/plain")
```
api/helpers/_parsermanager.py (2)
131-139: Bug: same metadata object reused across PDF pages. Mutating metadata.page in the loop reuses the same instance, so every page ends up with the last page number.
```diff
- document = ParsedDocument(data=[])
- metadata = ParsedDocumentMetadata(document_name=params.file.filename)
+ document = ParsedDocument(data=[])
  for page_num in range(len(pdf)):
      page = pdf[page_num]
      text = page.get_text()
-     metadata.page = page_num
-     document.data.append(ParsedDocumentPage(content=text, images={}, metadata=metadata))
+     page_metadata = ParsedDocumentMetadata(document_name=params.file.filename, page=page_num)
+     document.data.append(ParsedDocumentPage(content=text, images={}, metadata=page_metadata))
```
200-211: Decode TXT via shared reader for consistency. _read_content already handles encodings; reuse it so TXT content is str.
```diff
- text = await params.file.read()
+ text = await self._read_content(file=params.file)
  document = ParsedDocument(
      data=[
          ParsedDocumentPage(
              content=text,
              images={},
              metadata=ParsedDocumentMetadata(document_name=params.file.filename),
          )
      ]
  )
```
♻️ Duplicate comments (5)
api/helpers/data/hasher/chunk_hasher.py (1)
18-46: Critical: Hash collision risk remains unresolved. The current concatenation without delimiters between key-value pairs can produce identical hashes for different content. For example:
- `{"content": "textpage", "x": "1"}` → `"content:textpagex:1"`
- `{"content": "text", "pagex": "1"}` → `"content:textpagex:1"` (identical)
This collision can cause different chunks to be incorrectly identified as duplicates, leading to data corruption during updates.
Apply this diff to fix the collision risk:
parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}") else: parts.append(f"{key}:{value}") - return "".join(parts) + return "|".join(parts) # Separator prevents collisionsapi/tests/unit/test_helpers/test_parsermanager.py (1)
api/tests/unit/test_helpers/test_parsermanager.py (1)
455-466: Also assert the exception message for clarity. Check the detail text to lock in the user-facing error contract. This was suggested earlier and still applies.
```diff
  with pytest.raises(UnsupportedFileTypeException) as exc_info:
      await manager.parse_file(file=file)
+ assert exc_info.value.detail == "Parquet files cannot be parsed as documents."
```
api/helpers/_documentmanager.py (1)
559-568: Do not mass-delete on force_update; delete only after successful re-ingest.Deleting all documents upfront risks permanent data loss if embedding/upsert fails mid-run. The per-document updater already deletes the old doc after successful insert; rely on that instead.
```diff
- # Handle force update (delete all existing documents)
- if force_update and existing_docs_map:
-     document_ids = list(existing_docs_map.values())
-     logger.info(f"Force update: deleting {len(document_ids)} documents from collection {collection_id}")
-
-     for document_id in document_ids:
-         await self.delete_document(session=session, user_id=user_id, document_id=document_id)
-
-     stats["deleted_documents"] = len(document_ids)
-     existing_docs_map = {}
+ # For force_update, defer deletion to DocumentsUpdater (delete after successful reinsert)
+ # Keep existing_docs_map so updater can detect updates and remove old docs safely.
```
Additionally, if you still want to expose deleted_documents, set it after processing to the number of updated documents:
- logger.info(f"Collection update complete. Stats: {stats}") - return stats + if force_update: + stats["deleted_documents"] = stats.get("updated_documents", 0) + logger.info(f"Collection update complete. Stats: {stats}") + return statsapi/helpers/files/parquet/extractor.py (1)
54-81: Chunk ID indexing mismatch remains unresolved. The code assigns 1-based chunk IDs (line 81: range(1, num_chunks + 1)), but tests in api/tests/unit/test_helpers/test_documents_updater.py expect 0-based IDs (doc_chunk_ids = [0, 1, 2] at lines 52, 84, 112, 132). This was previously flagged as critical and marked "Skipped." While respecting the decision to skip, be aware this inconsistency may cause runtime issues when auto-generated chunk IDs don't match test expectations or vector store conventions.
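The mismatch is easy to see in isolation (illustrative snippet, not project code):

```python
num_chunks = 3
print(list(range(1, num_chunks + 1)))  # [1, 2, 3]  <- what the extractor generates
print(list(range(num_chunks)))         # [0, 1, 2]  <- what the unit tests expect
```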
api/helpers/documents/documents_updater.py (1)
217-221: Hard limit on chunk comparison remains unresolved.The 10,000 chunk limit (line 221) was previously flagged as a critical issue for documents exceeding this size. Changes beyond the first 10,000 chunks won't be detected, potentially causing incorrect update decisions. This was marked "Skipped" by the developer.
While respecting this decision, be aware this limitation may cause issues with very large documents.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (3)
- api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
- api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
- api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (32)
- .github/badges/coverage-unit.json (1 hunks)
- .github/badges/coverage.json (1 hunks)
- api/endpoints/collections.py (4 hunks)
- api/helpers/_accesscontroller.py (1 hunks)
- api/helpers/_documentmanager.py (2 hunks)
- api/helpers/_parsermanager.py (3 hunks)
- api/helpers/data/hasher/__init__.py (1 hunks)
- api/helpers/data/hasher/chunk_hasher.py (1 hunks)
- api/helpers/documents/__init__.py (1 hunks)
- api/helpers/documents/documents_updater.py (1 hunks)
- api/helpers/files/__init__.py (1 hunks)
- api/helpers/files/parquet/__init__.py (1 hunks)
- api/helpers/files/parquet/extractor.py (1 hunks)
- api/helpers/files/parquet/reader.py (1 hunks)
- api/schemas/collections.py (2 hunks)
- api/schemas/core/documents.py (1 hunks)
- api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
- api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
- api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
- api/tests/integ/conftest.py (1 hunks)
- api/tests/integ/test_admin.py (2 hunks)
- api/tests/integ/test_chat.py (1 hunks)
- api/tests/integ/test_chunks.py (1 hunks)
- api/tests/integ/test_collections.py (14 hunks)
- api/tests/integ/test_documents.py (1 hunks)
- api/tests/integ/test_files.py (1 hunks)
- api/tests/integ/test_search.py (2 hunks)
- api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
- api/tests/unit/test_helpers/test_documents_updater.py (1 hunks)
- api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
- api/utils/exceptions.py (2 hunks)
- pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (20)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
ChunkHasher(10-99)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
DocumentsUpdater(19-318)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
FileType(19-25)api/helpers/_parsermanager.py (2)
ParserManager(16-215)parse_file(110-120)api/utils/exceptions.py (1)
UnsupportedFileTypeException(163-165)api/helpers/_documentmanager.py (1)
parse_file(335-346)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
client(335-361)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
FileType(19-25)api/utils/exceptions.py (2)
FileSizeLimitExceededException(124-129)InvalidPARQUETFormatException(143-145)api/helpers/_parsermanager.py (1)
_detect_file_type(61-92)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
FileType(19-25)api/utils/exceptions.py (1)
UnsupportedFileTypeException(163-165)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (4)
ChunkHasher(10-99)concatenate_chunk_values(19-46)compute_chunk_hash(49-66)compute_existing_chunk_hash(69-99)
api/tests/unit/test_helpers/test_documents_updater.py (2)
api/helpers/documents/documents_updater.py (3)
_prepare_chunks_data(139-184)_check_update_needed(186-242)process_document(40-137)api/utils/exceptions.py (1)
VectorizationFailedException(190-192)
api/helpers/_documentmanager.py (6)
api/helpers/documents/documents_updater.py (2)
DocumentsUpdater(19-318)process_document(40-137)api/helpers/files/parquet/extractor.py (3)
ParquetDataExtractor(13-179)get_unique_document_names(22-35)filter_document_table(38-51)api/helpers/files/parquet/reader.py (4)
ParquetReader(20-139)read_and_validate(32-83)sort_table(101-118)identify_metadata_columns(121-139)api/utils/exceptions.py (2)
CollectionNotFoundException(78-80)VectorizationFailedException(190-192)api/clients/vector_store/_elasticsearchvectorstoreclient.py (2)
create_collection(27-97)delete_document(117-120)api/clients/vector_store/_qdrantvectorstoreclient.py (2)
create_collection(45-54)delete_document(77-79)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
CollectionVisibility(9-11)api/schemas/admin/roles.py (1)
PermissionType(10-14)
api/endpoints/collections.py (5)
api/schemas/collections.py (3)
Collection(17-26)Collections(29-31)CollectionVisibility(9-11)api/helpers/_documentmanager.py (4)
create_collection(74-83)update_collection_from_parquet(496-608)delete_collection(86-101)update_collection(104-126)api/sql/session.py (1)
get_db_session(33-39)api/tests/integ/test_documents.py (1)
collection(13-21)api/utils/exceptions.py (1)
CollectionNotFoundException(78-80)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
CollectionVisibility(9-11)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
client(335-361)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
ParquetDataExtractor(13-179)api/helpers/files/parquet/reader.py (1)
ParquetReader(20-139)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
Chunk(6-10)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
CollectionVisibility(9-11)api/tests/integ/conftest.py (1)
client(335-361)
api/helpers/documents/documents_updater.py (4)
api/helpers/data/hasher/chunk_hasher.py (4)
ChunkHasher(10-99)concatenate_chunk_values(19-46)compute_chunk_hash(49-66)compute_existing_chunk_hash(69-99)api/helpers/files/parquet/extractor.py (5)
ParquetDataExtractor(13-179)extract_chunk_ids(54-81)extract_metadata_columns(109-120)build_chunk_metadata(123-164)get_row_values(167-179)api/schemas/chunks.py (1)
Chunk(6-10)api/utils/exceptions.py (2)
CollectionNotFoundException(78-80)VectorizationFailedException(190-192)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
client(335-361)api/schemas/collections.py (1)
CollectionVisibility(9-11)api/tests/integ/test_documents.py (1)
collection(13-21)
🔇 Additional comments (34)
pyproject.toml (2)
3-3: Version bump to 0.3.0 is appropriate for this feature release.The minor version bump aligns with the scope of changes—Parquet support, document updates via hashing, and collection endpoints—representing new functionality without breaking compatibility.
46-47: Security fix and new dependencies approved. The upgrade to pyarrow==22.0.0 successfully resolves the critical security vulnerability from version 21.0.0 (as previously flagged and addressed). The addition of xxhash==3.6.0 is well-placed in the data dependencies section and pinned to the latest stable version with no known advisories. Both dependencies are needed for the new Parquet-driven document processing and hashing workflows. Ensure all tests pass with these new optional dependencies, particularly integration tests for Parquet file handling and collection update flows.
api/helpers/data/hasher/chunk_hasher.py (2)
48-66: LGTM!The use of xxHash64 is appropriate for fast, non-cryptographic content comparison. The implementation correctly encodes to UTF-8 before hashing.
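For readers unfamiliar with the library, xxHash64 via the xxhash package (the dependency pinned in this PR) is used roughly as below; the helper name is illustrative, not the PR's exact API:

```python
import xxhash

def hash_text(text: str) -> str:
    # xxh64 is a fast, non-cryptographic hash; encode to UTF-8 before hashing.
    return xxhash.xxh64(text.encode("utf-8")).hexdigest()

print(hash_text("content:hello|page:1"))  # deterministic 16-hex-character digest
```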
68-99: Logic is correct, but inherits collision risk. The metadata filtering logic correctly excludes internal fields that shouldn't affect comparison. However, this method depends on concatenate_chunk_values, which has the hash collision issue flagged above.
api/schemas/core/documents.py (1)
21-21: LGTM!The PARQUET enum member addition is consistent with existing file type declarations and follows the established pattern.
api/tests/integ/test_chunks.py (1)
17-17: LGTM!The test correctly migrates to parameter-based collection creation using
CollectionVisibility.PRIVATE.value, aligning with the updated API endpoint behavior.api/tests/integ/test_admin.py (1)
11-11: LGTM!The import and test update correctly implement the parameter-based collection creation pattern with proper enum value usage.
Also applies to: 406-406
api/tests/integ/conftest.py (1)
343-343: LGTM!The PUT request wrappers are correctly implemented using the same pattern as existing HTTP method wrappers, with proper authorization header injection for all three permission levels.
Also applies to: 350-350, 357-357
api/tests/integ/test_documents.py (1)
16-16: LGTM!The test correctly migrates to parameter-based collection creation, consistent with the API endpoint changes.
api/tests/integ/test_search.py (1)
17-17: LGTM - relies on default visibility.The tests correctly migrate to parameter-based collection creation. Note that these tests rely on default visibility behavior rather than explicitly specifying it like other tests in the PR.
Also applies to: 103-103
api/utils/exceptions.py (1)
126-126: LGTM!The new
InvalidPARQUETFormatExceptionfollows existing exception patterns, and theMAX_PARQUET_CONTENT_SIZEconstant provides a clear size constraint for PARQUET file handling.Also applies to: 143-145
api/tests/integ/test_files.py (1)
15-16: LGTM: switch to query params with enum primitives.Using params and CollectionVisibility.*.value matches schema and endpoint expectations.
Also applies to: 22-23
api/helpers/data/hasher/__init__.py (1)
1-5: LGTM: clear public export.Docstring + all expose ChunkHasher as intended.
api/tests/unit/test_helpers/test_parsermanager.py (2)
435-447: LGTM: include .parquet and mapping completeness.Extension map test updated correctly with FileType.PARQUET.
451-453: LGTM: content-type coverage includes PARQUET.Test ensures VALID_CONTENT_TYPES has entries for all supported types.
api/helpers/_parsermanager.py (2)
33-37: LGTM: comprehensive Parquet MIME types.Accepts application/vnd.apache.parquet, application/x-parquet, and octet-stream. Good coverage.
114-117: LGTM: explicit PARQUET rejection at parse level.Clear error path keeps parquet flow scoped to collections endpoints.
api/helpers/_documentmanager.py (1)
486-493: Python 3.12+ is properly declared and enforced.Verification confirms the codebase declares
requires-python = ">=3.12"in pyproject.toml and CI workflows (run_tests.yml, codeql_scan.yml) are configured to use Python 3.12. The import ofitertools.batchedat line 3 of _documentmanager.py is compatible with these requirements. No replacement or runtime compatibility issues exist.api/helpers/files/parquet/__init__.py (1)
1-6: LGTM!Clean package initialization following Python conventions. The module properly imports and re-exports the Parquet utilities.
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
1-141: LGTM!Comprehensive test coverage for ChunkHasher with well-structured fixtures and parametrized tests. The tests properly validate:
- Hash determinism
- Content and metadata differences
- Exclusion of auto-generated metadata fields
- Conditional chunk_index behavior
api/tests/unit/test_helpers/test_documents_updater.py (1)
1-450: LGTM!Excellent test suite with comprehensive coverage of DocumentsUpdater functionality:
- Chunk data preparation with and without metadata
- Update necessity detection (new documents, force updates, hash comparisons)
- Document processing flows (creation, updates, rollbacks)
- Error handling and cleanup on failures
The tests properly use mocks and fixtures to isolate behavior.
api/tests/integ/test_collections.py (2)
254-384: LGTM!Excellent integration test coverage for parquet-based collection updates:
- Creation with initial parquet upload
- PUT endpoint for force updates
- PATCH with metadata and file updates, including multi-step validation of idempotency and change detection
- Validation of malformed parquet files
The multi-step test (lines 295-371) particularly well validates that identical files don't trigger updates and that actual changes are properly detected.
11-13: All required test assets are present and properly committed. The verification confirms that all three parquet files referenced in the tests (parquet.parquet, updated_parquet.parquet, and parquet_wrong_format.parquet) exist in the api/tests/integ/assets/ directory with appropriate file sizes. Tests can access these files without issues.
api/helpers/files/parquet/extractor.py (2)
143-163: LGTM!The JSON parsing heuristic has been properly improved to check both start and end characters, reducing false positives. The exception handling is now specific to json.JSONDecodeError while still catching unexpected errors with proper logging.
22-35: LGTM!The extraction and validation methods are well-implemented:
- Efficient use of PyArrow's vectorized compute operations
- Proper filtering and validation logic
- Clear separation of concerns
Also applies to: 38-51, 84-106, 109-120, 167-179
api/helpers/files/parquet/reader.py (3)
32-83: LGTM! The read_and_validate method properly handles (see the sketch after this list):
- Memory-efficient file size checking using seek/tell
- File type validation
- Parquet parsing with appropriate error handling
- Column validation and chunk_index detection
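A minimal sketch of the seek/tell size check mentioned in the first bullet; the helper name and limit are illustrative, not the repository's:

```python
import os

def check_size(file_obj, max_bytes: int) -> int:
    # Measure the stream without loading it into memory.
    file_obj.seek(0, os.SEEK_END)
    size = file_obj.tell()
    file_obj.seek(0)  # rewind so the Parquet reader can consume the stream afterwards
    if size > max_bytes:
        raise ValueError(f"File of {size} bytes exceeds the {max_bytes}-byte limit")
    return size
```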
120-139: LGTM! The class constant mutation issue has been properly fixed. Line 135 now creates a copy before modification, preventing unintended side effects across method calls.
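For readers unfamiliar with this bug class, the fix pattern is simply to copy a class-level constant before mutating it; a generic sketch with placeholder names:

```python
class Reader:
    REQUIRED_COLUMNS = ["document_name", "content"]  # shared class-level constant

    def columns_with(self, extra: list[str]) -> list[str]:
        columns = self.REQUIRED_COLUMNS.copy()  # copy before modifying
        columns.extend(extra)                   # the class constant stays untouched
        return columns
```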
86-98: LGTM!The validation and sorting methods are well-implemented:
- Required columns validation is clear and provides helpful error messages
- Sorting strategy ensures chunks are properly grouped by document and ordered by index
Also applies to: 101-118
api/endpoints/collections.py (3)
39-59: LGTM!Excellent error handling in the collection creation flow:
- Creates collection first
- Processes parquet file if provided
- On any failure, properly cleans up by deleting the created collection
- Ensures no orphaned collections remain on error
The try-except with cleanup prevents partial state from persisting.
170-187: LGTM! The atomicity concern has been properly addressed. The file processing now happens before metadata updates (lines 170-178), ensuring that if the parquet upload fails, the collection metadata remains unchanged. This prevents partial updates.
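The ordering the reviewer approves of can be sketched as a hedged pseudo-endpoint (all names are placeholders, not the repository's code): do the risky file-processing step first, and only write metadata once it has succeeded.

```python
import asyncio

async def process_parquet(collection_id: int, file: bytes) -> dict:
    # Stand-in for the real ingestion step; if this raises, nothing below runs.
    return {"added": 1, "updated": 0, "deleted": 0}

async def update_metadata(collection_id: int, name: str) -> None:
    print(f"collection {collection_id} renamed to {name!r}")

async def patch_collection(collection_id: int, name: str | None = None, file: bytes | None = None) -> dict:
    details = {}
    if file is not None:
        # Risky step first: a failure here leaves the metadata untouched.
        details = await process_parquet(collection_id, file)
    if name is not None:
        # Metadata is only written after the upload step has succeeded.
        await update_metadata(collection_id, name=name)
    return {"id": collection_id, "details": details}

print(asyncio.run(patch_collection(1, name="docs", file=b"...")))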
192-220: LGTM!Clean implementation of the force update endpoint. The endpoint properly:
- Validates collection existence
- Calls the document manager with force_update=True
- Returns processing statistics
api/helpers/documents/documents_updater.py (3)
101-115: LGTM - Critical data loss issue properly resolved! The update flow now correctly prevents data loss (a sketch follows below):
- Creates new document (line 102)
- Inserts and vectorizes chunks (lines 105-110)
- Only deletes old document after successful vectorization (lines 113-115)
- On failure, cleans up the new document while preserving the old one (lines 131-133)
This ensures transient vectorization errors don't result in permanent data loss.
Also applies to: 126-137
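A hedged sketch of the create-then-swap pattern those bullets describe; the store interface and names are placeholders, not the PR's implementation. The old document is removed only after the new one has been fully inserted and vectorized, and a failure rolls back the new copy instead.

```python
async def replace_document(store, old_doc_id: str | None, name: str, chunks: list[str]) -> str:
    new_doc_id = await store.create_document(name)       # 1. create the new document
    try:
        await store.insert_chunks(new_doc_id, chunks)     # 2. insert and vectorize chunks
    except Exception:
        await store.delete_document(new_doc_id)           # rollback: the old document stays intact
        raise
    if old_doc_id is not None:
        await store.delete_document(old_doc_id)            # 3. only now drop the old version
    return new_doc_id
```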
139-184: LGTM!The chunk preparation logic is clean and efficient:
- Properly extracts content and metadata for each chunk
- Conditionally computes hashes for change detection (skipped on force_update)
- Uses static utility methods appropriately
186-242: LGTM! The remaining methods are well-implemented:
- _check_update_needed: properly detects new documents, respects the force_update flag, and compares hashes for change detection
- _create_empty_document: creates document records with appropriate error handling for missing collections
- _insert_chunks: properly batches chunk insertion with full metadata and wraps vectorization errors
Also applies to: 244-270, 272-318
api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml
api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml
api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml
13743a6 to 6578eab
Actionable comments posted: 4
♻️ Duplicate comments (4)
.github/badges/coverage-unit.json (1)
1-1: Unit test coverage trade-off noted.The slight decline in unit test coverage (-0.09pp) reflects the addition of Parquet features primarily covered by integration tests. While unit tests would provide faster feedback, the existing integration test coverage provides end-to-end validation of the critical paths.
docs/docs/guides/rag.md (1)
49-57: Verify consistency with notebook and API implementation. This change mirrors the one in docs/tutorials/rag.ipynb (line 112), switching from JSON to form-encoded data. Ensure both documentation sources remain synchronized and accurately reflect the API's expected request format.
api/helpers/_documentmanager.py (1)
558-567: Force update still risks data loss despite earlier flag.The force update flow deletes all existing documents (lines 563-565) before processing new ones from the parquet file. If any subsequent operation fails (e.g., vectorization timeout at line 605, memory issues during batch processing), the collection is left empty with no recovery path. The earlier review comment flagging this was marked "Skipped," but the data loss risk remains unmitigated.
Consider one of these approaches:
- Staged deletion: Process all new documents first, then delete old ones only after successful completion
- Backup mechanism: Store document IDs in a recovery table before deletion, with cleanup on success
- Transaction wrapper: Use a database transaction with rollback capability (though vector store operations may not be transactional)
# Handle force update (delete all existing documents) if force_update and existing_docs_map: document_ids = list(existing_docs_map.values()) logger.info(f"Force update: deleting {len(document_ids)} documents from collection {collection_id}") + # Store for potential rollback + original_doc_ids = document_ids.copy() + + # Process new documents FIRST + # ... (process documents here) + + # Only delete old documents after successful processing + if force_update and original_doc_ids: for document_id in document_ids: await self.delete_document(session=session, user_id=user_id, document_id=document_id) stats["deleted_documents"] = len(document_ids) - existing_docs_map = {}api/helpers/data/hasher/chunk_hasher.py (1)
38-46: Hash collision risk persists—add inter-pair delimiter.The concatenation still lacks a delimiter between key-value pairs (line 46), leaving the earlier collision concern unresolved. While each pair includes
key:value, the absence of a separator between pairs can create ambiguous strings:{"content": "abc", "page": "1"} → "content:abcpage:1" {"content": "abcpage:", "metadata": "1"} → "content:abcpage:metadata:1"Add a separator character between pairs:
for key in sorted(chunk.keys()): value = chunk.get(key, None) if isinstance(value, (dict | list)): parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}") else: parts.append(f"{key}:{value}") - return "".join(parts) + return "|".join(parts) # or "\n", "\x00", etc.Also update the docstring example to reflect the separator.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (3)
- api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
- api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
- api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (35)
- .github/badges/coverage-unit.json (1 hunks)
- .github/badges/coverage.json (1 hunks)
- api/endpoints/collections.py (4 hunks)
- api/helpers/_accesscontroller.py (1 hunks)
- api/helpers/_documentmanager.py (2 hunks)
- api/helpers/_parsermanager.py (3 hunks)
- api/helpers/data/hasher/__init__.py (1 hunks)
- api/helpers/data/hasher/chunk_hasher.py (1 hunks)
- api/helpers/documents/__init__.py (1 hunks)
- api/helpers/documents/documents_updater.py (1 hunks)
- api/helpers/files/__init__.py (1 hunks)
- api/helpers/files/parquet/__init__.py (1 hunks)
- api/helpers/files/parquet/extractor.py (1 hunks)
- api/helpers/files/parquet/reader.py (1 hunks)
- api/schemas/collections.py (2 hunks)
- api/schemas/core/documents.py (1 hunks)
- api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
- api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
- api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
- api/tests/integ/conftest.py (1 hunks)
- api/tests/integ/test_admin.py (2 hunks)
- api/tests/integ/test_chat.py (1 hunks)
- api/tests/integ/test_chunks.py (1 hunks)
- api/tests/integ/test_collections.py (14 hunks)
- api/tests/integ/test_documents.py (1 hunks)
- api/tests/integ/test_files.py (1 hunks)
- api/tests/integ/test_search.py (2 hunks)
- api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
- api/tests/unit/test_helpers/test_documents_updater.py (1 hunks)
- api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
- api/utils/exceptions.py (2 hunks)
- docs/docs/functionalities/document-management/collections.md (1 hunks)
- docs/docs/guides/rag.md (1 hunks)
- docs/tutorials/rag.ipynb (2 hunks)
- pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (20)
api/tests/integ/test_admin.py (1)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
api/helpers/data/hasher/__init__.py (1)
- api/helpers/data/hasher/chunk_hasher.py (1): ChunkHasher (10-99)
api/helpers/_parsermanager.py (2)
- api/schemas/core/documents.py (1): FileType (19-25)
- api/utils/exceptions.py (1): UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (5)
- api/helpers/documents/documents_updater.py (2): DocumentsUpdater (19-318), process_document (40-137)
- api/helpers/files/parquet/extractor.py (3): ParquetDataExtractor (13-179), get_unique_document_names (22-35), filter_document_table (38-51)
- api/helpers/files/parquet/reader.py (4): ParquetReader (20-139), read_and_validate (32-83), sort_table (101-118), identify_metadata_columns (121-139)
- api/utils/exceptions.py (2): CollectionNotFoundException (78-80), VectorizationFailedException (190-192)
- api/endpoints/collections.py (1): create_collection (16-61)
api/tests/integ/test_search.py (1)
- api/tests/integ/conftest.py (1): client (335-361)
api/helpers/documents/__init__.py (1)
- api/helpers/documents/documents_updater.py (1): DocumentsUpdater (19-318)
api/helpers/files/parquet/reader.py (3)
- api/schemas/core/documents.py (1): FileType (19-25)
- api/utils/exceptions.py (2): FileSizeLimitExceededException (124-129), InvalidPARQUETFormatException (143-145)
- api/helpers/_parsermanager.py (1): _detect_file_type (61-92)
api/tests/integ/test_collections.py (3)
- api/tests/integ/conftest.py (1): client (335-361)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
- api/tests/integ/test_documents.py (1): collection (13-21)
api/tests/integ/test_documents.py (1)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
api/tests/unit/test_helpers/test_parsermanager.py (4)
- api/schemas/core/documents.py (1): FileType (19-25)
- api/helpers/_parsermanager.py (2): ParserManager (16-215), parse_file (110-120)
- api/utils/exceptions.py (1): UnsupportedFileTypeException (163-165)
- api/helpers/_documentmanager.py (1): parse_file (335-346)
api/helpers/files/parquet/__init__.py (2)
- api/helpers/files/parquet/extractor.py (1): ParquetDataExtractor (13-179)
- api/helpers/files/parquet/reader.py (1): ParquetReader (20-139)
api/tests/integ/test_chunks.py (1)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
api/helpers/_accesscontroller.py (2)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
- api/schemas/admin/roles.py (1): PermissionType (10-14)
api/helpers/documents/documents_updater.py (4)
- api/helpers/data/hasher/chunk_hasher.py (4): ChunkHasher (10-99), concatenate_chunk_values (19-46), compute_chunk_hash (49-66), compute_existing_chunk_hash (69-99)
- api/helpers/files/parquet/extractor.py (5): ParquetDataExtractor (13-179), extract_chunk_ids (54-81), extract_metadata_columns (109-120), build_chunk_metadata (123-164), get_row_values (167-179)
- api/schemas/chunks.py (1): Chunk (6-10)
- api/utils/exceptions.py (2): CollectionNotFoundException (78-80), VectorizationFailedException (190-192)
api/tests/integ/test_chat.py (1)
- api/tests/integ/conftest.py (1): client (335-361)
api/tests/integ/test_files.py (2)
- api/schemas/collections.py (1): CollectionVisibility (9-11)
- api/tests/integ/conftest.py (1): client (335-361)
api/tests/unit/test_helpers/test_documents_updater.py (2)
- api/helpers/documents/documents_updater.py (4): DocumentsUpdater (19-318), _prepare_chunks_data (139-184), _check_update_needed (186-242), process_document (40-137)
- api/utils/exceptions.py (1): VectorizationFailedException (190-192)
api/endpoints/collections.py (2)
- api/schemas/collections.py (3): Collection (17-26), Collections (29-31), CollectionVisibility (9-11)
- api/helpers/_documentmanager.py (4): create_collection (74-83), update_collection_from_parquet (496-608), delete_collection (86-101), update_collection (104-126)
api/helpers/data/hasher/chunk_hasher.py (1)
- api/schemas/chunks.py (1): Chunk (6-10)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
- api/helpers/data/hasher/chunk_hasher.py (4): ChunkHasher (10-99), concatenate_chunk_values (19-46), compute_chunk_hash (49-66), compute_existing_chunk_hash (69-99)
🪛 markdownlint-cli2 (0.18.1)
docs/docs/functionalities/document-management/collections.md
51-51: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
59-59: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
67-67: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
94-94: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
115-115: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
124-124: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
128-128: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
133-133: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
136-136: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
142-142: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
154-154: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
168-168: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Above
(MD022, blanks-around-headings)
168-168: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
🔇 Additional comments (36)
api/schemas/core/documents.py (1)
19-26: LGTM! Clean addition of PARQUET file type support.The new enum member follows the established pattern and enables Parquet file handling throughout the codebase.
api/helpers/files/__init__.py (1)
1-1: LGTM! Good practice to document the module's purpose.The docstring clearly describes the module's role in the codebase.
api/utils/exceptions.py (1)
143-146: LGTM! New exception follows established pattern. The InvalidPARQUETFormatException is consistent with existing format exceptions (e.g., InvalidJSONFormatException) and appropriately returns a 422 status code for validation errors.
api/helpers/data/hasher/__init__.py (1)
1-5: LGTM! Clean module structure with clear public API. The module provides a well-defined interface for chunk hashing utilities with:
- Clear docstring describing the module's purpose
- Explicit __all__ declaration for controlled public exports
- Single import/export keeping the interface focused
docs/tutorials/rag.ipynb (1)
99-119: The tutorial correctly uses form-encoded data; no changes needed. The verification confirms the API endpoint has been properly updated to accept form-encoded data via Form(...) parameters, including support for optional Parquet file uploads. The tutorial's change from json= to data= correctly matches this implementation.
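For readers updating their own scripts, the form-encoded call looks roughly like this with requests; the endpoint path and field names follow the tutorial's pattern, and the base URL, API key, and collection name are placeholders:

```python
import requests

base_url = "http://localhost:8080/v1"                 # hypothetical deployment URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}     # hypothetical credential

response = requests.post(
    f"{base_url}/collections",
    headers=headers,
    data={"name": "my-collection", "visibility": "private"},   # form fields, not a JSON body
    # files={"file": open("chunks.parquet", "rb")},             # optional Parquet upload
)
print(response.status_code, response.json())
```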
343-343: LGTM! PUT method wrappers added consistently.The three new PUT wrappers (put_without_permissions, put_with_permissions, put_master) follow the established pattern for other HTTP methods and enable comprehensive testing of the new PUT endpoints across different permission levels.
Also applies to: 350-350, 357-357
pyproject.toml (2)
3-3: Version bump looks appropriate.The version increment from 0.2.8post3 to 0.3.0 correctly reflects the addition of new features (Parquet support, PUT endpoints for collections).
46-47: Dependencies added correctly.Both pyarrow==22.0.0 and xxhash==3.6.0 are properly scoped to the api optional dependencies. The pyarrow upgrade addresses the critical security vulnerability identified in previous reviews.
api/helpers/_parsermanager.py (2)
19-19: LGTM! Parquet file type properly registered.The extension mapping and MIME type validation for Parquet files are correctly configured. The MIME types include the official IANA-registered type (
application/vnd.apache.parquet) and common variants, as confirmed in previous reviews.Also applies to: 33-37
114-116: Appropriate runtime restriction for Parquet files.The explicit rejection of Parquet files in
parse_fileis correct, as Parquet files are intended for direct collection ingestion (via the new endpoints) rather than document parsing. The error message clearly communicates this constraint.api/tests/integ/test_chat.py (1)
29-29: LGTM! Test updated for form-data payload.The change from
json=todata=correctly aligns with the API's shift to form-encoded payloads for collection creation, ensuring the test exercises the updated endpoint behavior.api/tests/integ/test_chunks.py (1)
17-17: LGTM! Consistent with API changes.The update to use form data (
data=) and the enum value (CollectionVisibility.PRIVATE.value) correctly aligns with the API's transition to form-encoded payloads and enum value serialization.api/tests/integ/test_admin.py (2)
11-11: LGTM! Import added for enum usage.The
CollectionVisibilityimport is correctly added to support the enum value usage in the test.
406-406: LGTM! Test updated consistently.The change to form data and enum value usage aligns with the broader API updates and improves type safety, as confirmed in previous reviews.
api/tests/integ/test_files.py (1)
15-22: LGTM! Test payload updated correctly for form-based collection creation.The changes properly reflect the API's shift from JSON body to form data for collection creation. Using
CollectionVisibility.*.valueensures the correct string values are passed in the form data.api/tests/integ/test_documents.py (1)
16-16: LGTM! Consistent with form-based collection creation.The test correctly uses form data with the visibility enum value.
api/tests/integ/test_search.py (2)
17-17: LGTM! Payload format updated correctly.The test now uses form data (
data=) instead of JSON payload. Note that visibility is not explicitly specified here, which relies on the API's default value. This aligns with your previous decision to skip adding explicit visibility parameters.
103-103: LGTM! Consistent payload format.Same pattern as line 17 - using form data without explicit visibility, consistent with the previous design decision.
api/helpers/_accesscontroller.py (1)
278-282: LGTM! Form-based visibility check implemented correctly.The code now correctly reads visibility from form data and compares it against the enum value. This aligns with the broader API changes to use form-based inputs for collection creation/updates.
Note: As you mentioned, the security concern about PUT/PATCH requests bypassing this check is being addressed in a separate PR.
api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1)
1-170: LGTM! Valid cassette fixture for integration tests.This cassette correctly records embeddings API interactions for testing. The full embedding vectors are preserved, which ensures test fidelity. As you noted previously, cassette optimization was considered but skipped.
api/tests/unit/test_helpers/test_parsermanager.py (3)
437-437: LGTM! PARQUET extension mapping added correctly.The test correctly verifies that
.parquetfiles are mapped toFileType.PARQUET.
451-453: LGTM! Content type validation updated for PARQUET.The test now correctly includes
FileType.PARQUETin the validation check, ensuring PARQUET has valid content types defined.
455-466: LGTM! PARQUET rejection in parse_file verified.The test correctly verifies that PARQUET files are rejected by
parse_filewithUnsupportedFileTypeException. As you noted previously, checking the specific exception message was considered but skipped.api/helpers/documents/__init__.py (1)
1-5: LGTM! Clean module structure.The module correctly re-exports
DocumentsUpdaterwith proper docstring and__all__declaration, following Python packaging best practices.api/schemas/collections.py (1)
14-14: LGTM! Well-defined type alias for collection names.The
CollectionNamealias provides clear validation requirements (strip whitespace, minimum length of 1) and can be reused across form-based collection endpoints. This is a clean approach for the shift from body-based to form-based inputs.api/helpers/files/parquet/__init__.py (1)
1-6: LGTM!Clean module initialization with proper public API exposure via
__all__. The structure follows Python best practices.api/tests/unit/test_helpers/test_chunk_hasher.py (1)
1-141: LGTM!Comprehensive test suite with good coverage of ChunkHasher functionality:
- Deterministic hashing behavior
- Hash sensitivity to content and metadata changes
- Proper exclusion of internal fields
- Conditional chunk_index handling
The use of fixtures and parametrized tests enhances maintainability.
api/tests/unit/test_helpers/test_documents_updater.py (1)
1-450: LGTM!Excellent test coverage for DocumentsUpdater with:
- Comprehensive fixture setup using AsyncMock for async operations
- Full coverage of chunk preparation, update detection, and document processing flows
- Proper rollback testing to ensure error handling works correctly
- Good use of patching to isolate units under test
The test suite provides confidence in the DocumentsUpdater implementation.
api/helpers/files/parquet/extractor.py (1)
1-179: LGTM!Well-implemented ParquetDataExtractor with efficient PyArrow operations:
- Vectorized operations for unique/filter using PyArrow compute
- Proper sequential validation for chunk IDs
- Improved JSON parsing heuristic with start/end checks and specific exception handling
- Clean static method organization
The implementation provides a solid foundation for parquet-based document processing.
api/tests/integ/test_collections.py (1)
254-384: Excellent integration test coverage for parquet uploads — verified.The new tests comprehensively validate parquet file handling:
- Initial upload on collection creation
- Force update via PUT
- Incremental update via PATCH with change detection (no-op on identical file, proper stats on changes)
- Error handling for invalid formats
The multi-step test at lines 295–371 is particularly valuable as it verifies the hash-based change detection works correctly across multiple uploads.
All test asset files (parquet.parquet, updated_parquet.parquet, parquet_wrong_format.parquet) are present in the repository and committed.
api/helpers/files/parquet/reader.py (1)
1-139: LGTM! The ParquetReader implementation is solid. The class correctly handles Parquet file reading and validation with appropriate error handling. The mutation bug previously flagged (line 135) has been properly fixed using .copy(), and the file validation logic is integrated as intended.
api/endpoints/collections.py (3)
39-60: Excellent error handling with cleanup.The try-except block correctly handles failures during file processing by cleaning up the created collection before re-raising the exception. This prevents orphaned collections when Parquet processing fails.
170-188: Correct operation ordering for atomicity.The file processing now happens before metadata updates, ensuring that metadata changes only persist if the Parquet upload succeeds. This correctly addresses the atomicity concern raised in previous reviews.
192-220: LGTM! Clean force update implementation. The force_update_collection endpoint correctly delegates to update_collection_from_parquet with force_update=True and has a consistent response structure (no type mismatch like the other endpoints).
api/helpers/documents/documents_updater.py (2)
101-125: Excellent update flow preventing data loss.The implementation correctly creates the new document and vectorizes chunks before deleting the old document. This ensures the original document remains intact if vectorization fails, addressing the critical data loss concern raised in previous reviews.
272-318: Clean chunk insertion with proper error handling. The _insert_chunks method correctly prepares chunk metadata and handles vectorization failures by raising VectorizationFailedException with context. The batch operation approach is efficient.
b5515c2 to 9e9f8bf
Summary by CodeRabbit
New Features
Improvements
Tests
Chores