
Conversation

@FaheemBEG (Contributor) commented Nov 10, 2025

Summary by CodeRabbit

  • New Features

    • Parquet file upload to create, update, or force-replace collection content via API (PUT replaces all documents).
  • Improvements

    • Create/Update endpoints now accept multipart/form-data (form fields + file) and return JSON with detailed processing results.
    • Stronger Parquet validation, size limits, deterministic chunk hashing, clearer error responses, and automatic rollback on failures.
    • Docs updated with Parquet requirements and examples.
  • Tests

    • New unit and integration tests covering Parquet flows and hashing.
  • Chores

    • Project version bumped to 0.3.0

@coderabbitai bot commented Nov 10, 2025

Walkthrough

Collection endpoints now accept form parameters and optional Parquet uploads; new ParquetReader/ParquetDataExtractor, DocumentsUpdater, and ChunkHasher implement Parquet-driven document extraction, hashing, change detection, upserts, and rollback. Parser, document manager, exceptions, tests, cassettes, docs, and optional deps updated; version bumped to 0.3.0.

Changes

Cohort / File(s) Summary
Coverage badges
.github/badges/coverage.json, .github/badges/coverage-unit.json
Updated coverage badge messages: coverage.json -> 73.01%; coverage-unit.json -> 58.86%. Labels, colors, and schemaVersion unchanged.
Collection endpoints
api/endpoints/collections.py
Create/PATCH/PUT endpoints now accept form params (name, visibility, description) and optional UploadFile Parquet; added force_update_collection (PUT) to replace collection from Parquet; endpoints return JSON payloads with id and details; parquet failures trigger cleanup and re-raise.
Access control
api/helpers/_accesscontroller.py
_check_collections reads visibility from request.form() for public-creation permission checks instead of request body parsing.
Document manager
api/helpers/_documentmanager.py
Added update_collection_from_parquet(session, user_id, collection_id, parquet_file, force_update=False) to validate access, read/validate Parquet, orchestrate DocumentsUpdater, optionally force-delete existing docs, and return processing stats.
Parser manager
api/helpers/_parsermanager.py
Added .parquet -> FileType.PARQUET mapping and Parquet MIME types; parse_file now rejects PARQUET with UnsupportedFileTypeException.
Chunk hashing utilities
api/helpers/data/hasher/__init__.py, api/helpers/data/hasher/chunk_hasher.py
New ChunkHasher with deterministic concatenation and xxHash64 hashing; methods: concatenate_chunk_values, compute_chunk_hash, compute_existing_chunk_hash (excludes internal metadata, optional chunk_index).
Parquet file processing
api/helpers/files/parquet/__init__.py, api/helpers/files/parquet/reader.py, api/helpers/files/parquet/extractor.py
New ParquetReader validates size/type, reads PyArrow table, enforces required columns, sorts table, identifies metadata columns; ParquetDataExtractor extracts document names, per-document tables, chunk IDs (sequential validation), metadata, and row values; raises 4xx exceptions for invalid inputs.
Documents updater
api/helpers/documents/__init__.py, api/helpers/documents/documents_updater.py
New DocumentsUpdater prepares per-chunk payloads (with optional hashes), decides create/update/no-op by comparing against existing chunks, creates documents, upserts chunks via callback, optionally deletes old documents, tracks stats, and rolls back on errors.
Files package initializer
api/helpers/files/__init__.py
Added package docstring/init for file utilities.
Schemas
api/schemas/collections.py, api/schemas/core/documents.py
Removed CollectionRequest/CollectionUpdateRequest; added CollectionName = Annotated[str, StringConstraints(...)]; added FileType.PARQUET enum member.
Exceptions
api/utils/exceptions.py
Added MAX_PARQUET_CONTENT_SIZE to FileSizeLimitExceededException; new InvalidPARQUETFormatException (HTTP 422).
Tests — fixtures & helpers
api/tests/integ/conftest.py
Added PUT test client helpers: put_without_permissions, put_with_permissions, put_master.
Tests — integration & cassettes
api/tests/integ/*.py, api/tests/integ/cassettes/*.yaml
Many tests updated to send collection metadata as form data (use .value for enums); PATCH expectations updated to 200; added Parquet-focused integration tests and new embedding VCR cassettes.
Tests — unit
api/tests/unit/test_helpers/test_chunk_hasher.py, api/tests/unit/test_helpers/test_parsermanager.py, api/tests/unit/test_helpers/test_documents_updater.py
Added unit tests for ChunkHasher, ParserManager PARQUET mapping and rejection, and comprehensive DocumentsUpdater prepare/check/process/rollback flows.
Docs & tutorials
docs/docs/functionalities/..., docs/docs/guides/rag.md, docs/tutorials/rag.ipynb
Examples switched from JSON bodies to multipart/form-data; documentation added for Parquet requirements, PATCH vs PUT semantics, and new workflows.
Project metadata / deps
pyproject.toml
Bumped project version to 0.3.0; added optional deps pyarrow==22.0.0 and xxhash==3.6.0 under [project.optional-dependencies].api.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Endpoint as Collections Endpoint
    participant Reader as ParquetReader
    participant Extractor as ParquetDataExtractor
    participant Updater as DocumentsUpdater
    participant Vector as Vector Store

    Client->>Endpoint: POST/PATCH/PUT /v1/collections (multipart form with optional parquet)
    Endpoint->>Reader: read_and_validate(parquet_file)
    alt invalid file (size/type/format)
        Reader-->>Endpoint: raise (413/400/422)
        Endpoint-->>Client: 4xx error
    else valid file
        Reader-->>Endpoint: (table, available_columns, has_chunk_index)
        Endpoint->>Extractor: get_unique_document_names(table)
        loop per document
            Endpoint->>Extractor: filter_document_table(...) / extract_chunk_ids(...)
            Endpoint->>Updater: process_document(session,user,collection,doc_name,table_slice,...)
            Updater->>Vector: fetch existing chunks / upsert new chunks
            Vector-->>Updater: existing chunks / upsert result
            Updater-->>Endpoint: update stats (added/updated/deleted/chunks)
        end
        Endpoint-->>Client: 200/201 { id, details: { added, updated, deleted, total_chunks_processed } }
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to:
    • api/helpers/documents/documents_updater.py — chunk hashing, comparison logic, upsert/delete callbacks, rollback/error propagation.
    • api/helpers/files/parquet/reader.py and api/helpers/files/parquet/extractor.py — PyArrow usage, required-column validation, chunk_index sequencing, and table sorting.
    • api/endpoints/collections.py and api/helpers/_documentmanager.py — new signatures, multipart handling, rollback semantics, and JSONResponse payload shapes.
    • Tests and cassettes — ensure mocks, cassettes, and test parameter usage align with form payload formats and embeddings interactions.

Poem

🐰 Parquet leaves fall in tidy rows,
xxHash hums where the carrot grows.
Chunks hop in order, hashes bright —
Documents sprout beneath moonlight.
Upserts stitched, the meadow sleeps tonight.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 64.52%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately captures the main feature: updating the POST, PATCH, and PUT endpoints for /collections to enable collection updates.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch 467-add-a-put-collections-endpoint-for-collections-update

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5515c2 and a64b222.

⛔ Files ignored due to path filters (3)
  • api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (35)
  • .github/badges/coverage-unit.json (1 hunks)
  • .github/badges/coverage.json (1 hunks)
  • api/endpoints/collections.py (4 hunks)
  • api/helpers/_accesscontroller.py (1 hunks)
  • api/helpers/_documentmanager.py (2 hunks)
  • api/helpers/_parsermanager.py (3 hunks)
  • api/helpers/data/hasher/__init__.py (1 hunks)
  • api/helpers/data/hasher/chunk_hasher.py (1 hunks)
  • api/helpers/documents/__init__.py (1 hunks)
  • api/helpers/documents/documents_updater.py (1 hunks)
  • api/helpers/files/__init__.py (1 hunks)
  • api/helpers/files/parquet/__init__.py (1 hunks)
  • api/helpers/files/parquet/extractor.py (1 hunks)
  • api/helpers/files/parquet/reader.py (1 hunks)
  • api/schemas/collections.py (2 hunks)
  • api/schemas/core/documents.py (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/conftest.py (1 hunks)
  • api/tests/integ/test_admin.py (2 hunks)
  • api/tests/integ/test_chat.py (1 hunks)
  • api/tests/integ/test_chunks.py (1 hunks)
  • api/tests/integ/test_collections.py (14 hunks)
  • api/tests/integ/test_documents.py (1 hunks)
  • api/tests/integ/test_files.py (1 hunks)
  • api/tests/integ/test_search.py (2 hunks)
  • api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
  • api/tests/unit/test_helpers/test_documents_updater.py (1 hunks)
  • api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
  • api/utils/exceptions.py (2 hunks)
  • docs/docs/functionalities/document-management/collections.md (1 hunks)
  • docs/docs/guides/rag.md (1 hunks)
  • docs/tutorials/rag.ipynb (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (21)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/_documentmanager.py (3)
api/helpers/documents/documents_updater.py (2)
  • DocumentsUpdater (19-318)
  • process_document (40-137)
api/helpers/files/parquet/extractor.py (3)
  • ParquetDataExtractor (13-179)
  • get_unique_document_names (22-35)
  • filter_document_table (38-51)
api/helpers/files/parquet/reader.py (4)
  • ParquetReader (20-139)
  • read_and_validate (32-83)
  • sort_table (101-118)
  • identify_metadata_columns (121-139)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
  • ParquetDataExtractor (13-179)
api/helpers/files/parquet/reader.py (1)
  • ParquetReader (20-139)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/unit/test_helpers/test_documents_updater.py (2)
api/helpers/documents/documents_updater.py (4)
  • DocumentsUpdater (19-318)
  • _prepare_chunks_data (139-184)
  • _check_update_needed (186-242)
  • process_document (40-137)
api/utils/exceptions.py (1)
  • VectorizationFailedException (190-192)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/helpers/documents/documents_updater.py (5)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
api/helpers/files/parquet/extractor.py (5)
  • ParquetDataExtractor (13-179)
  • extract_chunk_ids (54-81)
  • extract_metadata_columns (109-120)
  • build_chunk_metadata (123-164)
  • get_row_values (167-179)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/helpers/_documentmanager.py (1)
  • get_chunks (303-332)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (2)
  • FileSizeLimitExceededException (124-129)
  • InvalidPARQUETFormatException (143-145)
api/helpers/_parsermanager.py (1)
  • _detect_file_type (61-92)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/unit/test_helpers/test_parsermanager.py (3)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/helpers/_parsermanager.py (2)
  • ParserManager (16-215)
  • parse_file (110-120)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/schemas/admin/roles.py (1)
  • PermissionType (10-14)
api/endpoints/collections.py (2)
api/schemas/collections.py (3)
  • Collection (17-26)
  • Collections (29-31)
  • CollectionVisibility (9-11)
api/helpers/_documentmanager.py (4)
  • create_collection (74-83)
  • update_collection_from_parquet (496-608)
  • delete_collection (86-101)
  • update_collection (104-126)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
  • ChunkHasher (10-99)
api/helpers/files/parquet/extractor.py (1)
api/utils/logging.py (1)
  • filter (11-14)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
  • DocumentsUpdater (19-318)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
🪛 markdownlint-cli2 (0.18.1)
docs/docs/functionalities/document-management/collections.md

191-191: Files should end with a single newline character

(MD047, single-trailing-newline)

🔇 Additional comments (30)
api/helpers/files/__init__.py (1)

1-1: LGTM!

Clean module docstring for the file processing utilities package.

api/tests/integ/test_chunks.py (1)

17-17: LGTM!

The test correctly updates to use form data encoding and extracts the enum value. This aligns with the API changes to accept form-based collection creation.

api/schemas/collections.py (1)

1-14: LGTM! Clean schema refactoring.

The CollectionName type alias with StringConstraints is a cleaner approach than separate request models. The validation correctly ensures non-empty collection names after stripping whitespace.
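For reference, a minimal sketch of the alias as described (Pydantic v2; exact constraint arguments assumed):

    from typing import Annotated
    from pydantic import BaseModel, StringConstraints

    # Sketch of the CollectionName alias summarized above; constraints assumed
    CollectionName = Annotated[str, StringConstraints(strip_whitespace=True, min_length=1)]

    class Example(BaseModel):
        name: CollectionName

    Example(name="  docs  ")  # validates to "docs"
    # Example(name="   ")     # would fail: empty after stripping whitespace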

api/tests/integ/test_chat.py (1)

29-29: LGTM!

The test correctly updates to use form data encoding, consistent with the API changes to support form-based collection creation.

api/schemas/core/documents.py (1)

19-25: LGTM! Clean enum extension for Parquet support.

The addition of the PARQUET member to the FileType enum is straightforward and consistent with existing enum members. This properly extends document type support for Parquet files introduced elsewhere in the PR.

api/tests/integ/test_admin.py (2)

11-11: LGTM! Import added to support enum value usage.

The import addition supports the consistent use of CollectionVisibility.PRIVATE.value in the test, aligning with the PR's migration to form-data payloads.


403-407: LGTM! Consistent migration to form data with enum values.

The change from JSON payload to form data (data=) and the use of CollectionVisibility.PRIVATE.value (instead of the enum instance) is correct and consistent with the broader API changes in this PR.
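As an illustration, the before/after request shape looks roughly like this (endpoint path and helper name assumed from the test fixtures; `client` is the conftest fixture):

    # Before: JSON body
    # response = client.post_with_permissions("/v1/collections", json={"name": "docs", "visibility": "private"})

    # After: form data, passing the enum's string value
    from api.schemas.collections import CollectionVisibility

    response = client.post_with_permissions(
        "/v1/collections",
        data={"name": "docs", "visibility": CollectionVisibility.PRIVATE.value},
    )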

docs/tutorials/rag.ipynb (1)

98-119: LGTM! Documentation updated consistently with API changes.

The notebook has been updated to use form data (data=) instead of JSON payload, which aligns with the API changes for collection creation. Clearing the execution count is also standard practice for version control.

api/tests/integ/test_documents.py (1)

14-17: LGTM! Properly migrated to form data with enum value.

The change correctly switches to form data encoding and uses CollectionVisibility.PRIVATE.value to pass the string value (required for form data) instead of the enum instance.

api/tests/integ/test_files.py (1)

12-27: LGTM! Consistent form data migration across both collection creations.

Both collection creation calls have been properly updated to use form data encoding with enum values. The changes are consistent with the broader API migration in this PR.

docs/docs/guides/rag.md (1)

46-57: LGTM! Documentation updated to reflect API changes.

The guide has been properly updated to use form data (data=) for collection creation, which aligns with the API endpoint changes in this PR.

api/tests/integ/test_search.py (2)

14-19: LGTM! Consistent migration to form data encoding.

The collection creation has been updated to use form data encoding, aligning with the broader API changes in this PR.


99-105: LGTM! Consistent form data usage.

The collection creation follows the same pattern as other test updates, using form data encoding.

api/utils/exceptions.py (2)

143-146: LGTM! New exception added for Parquet validation.

The InvalidPARQUETFormatException follows the existing exception pattern and provides appropriate error signaling for Parquet file validation failures. The 422 status code is appropriate for format validation errors.


124-129: LGTM! Parquet size limit constant added.

The MAX_PARQUET_CONTENT_SIZE constant has been added to support Parquet-specific size validation. The value matches the existing MAX_CONTENT_SIZE, which is reasonable for consistency.

api/helpers/_accesscontroller.py (1)

278-282: LGTM: Form-based visibility check aligns with API changes.

The switch from JSON body parsing to request.form() correctly reflects the broader form-based collection endpoint changes introduced in this PR. The permission check logic remains sound: comparing against CollectionVisibility.PUBLIC.value and enforcing CREATE_PUBLIC_COLLECTION permission.
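A minimal sketch of that check, using the enum values named above (the exception type here is a placeholder, not the repo's):

    from fastapi import Request

    async def check_public_creation(request: Request, user_permissions: set[str]) -> None:
        # Read visibility from the multipart form instead of the JSON body
        form = await request.form()
        if form.get("visibility") == "public":  # CollectionVisibility.PUBLIC.value
            if "create_public_collection" not in user_permissions:  # PermissionType.CREATE_PUBLIC_COLLECTION
                raise PermissionError("CREATE_PUBLIC_COLLECTION permission required")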

api/tests/integ/conftest.py (1)

343-343: LGTM: PUT helper methods complete the test client API.

The put_without_permissions, put_with_permissions, and put_master additions (lines 343, 350, 357) follow the established pattern for permission-scoped test helpers, enabling comprehensive integration testing of PUT endpoints with Parquet-based collection updates.

api/helpers/data/hasher/__init__.py (1)

1-5: LGTM: Clean module export for chunk hashing utilities.

The module follows Python packaging best practices with a clear docstring and explicit __all__ export. ChunkHasher provides deterministic hashing for chunk comparison during Parquet-based document updates.

api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1)

1-86: Cassette serves its testing purpose.

This VCR cassette correctly captures embedding API interactions for PUT collection with Parquet file flows. The large response bodies are typical for embedding fixtures. (Note: Optimization suggestions were previously skipped per developer feedback.)

api/helpers/documents/__init__.py (1)

1-5: LGTM: Clean module export for document management.

The module follows consistent package structure with clear docstring and explicit __all__ export. DocumentsUpdater orchestrates document updates in vector-store workflows with chunk comparison, creation, and rollback capabilities.

api/tests/unit/test_helpers/test_parsermanager.py (3)

437-437: LGTM: Extension map test updated for PARQUET support.

Line 437 correctly adds the .parquet -> FileType.PARQUET mapping to the expected extensions, aligning with the new PARQUET file type introduced in this PR.


451-451: LGTM: Content type validation includes PARQUET.

Line 451 appropriately includes FileType.PARQUET in the list of file types expected to have valid content-type definitions, ensuring comprehensive validation coverage.


455-466: LGTM: PARQUET rejection test validates intended behavior.

The test correctly verifies that parse_file raises UnsupportedFileTypeException when encountering PARQUET files, enforcing the design constraint that Parquet files are only supported via collection endpoints, not document parsing. The mock file with PAR1 magic bytes appropriately simulates a real Parquet file.

api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1)

1-170: Cassette correctly captures PATCH workflow interactions.

This VCR cassette properly records the embedding API calls for PATCH collection metadata and Parquet file operations. The fixture supports integration testing of the form-based PATCH flows introduced in this PR. (Note: Size optimization suggestions were previously acknowledged and skipped.)

api/helpers/_parsermanager.py (3)

19-19: LGTM: PARQUET extension mapping added correctly.

Line 19 properly extends EXTENSION_MAP with .parquet -> FileType.PARQUET, following the established pattern for file type detection by extension.


33-37: LGTM: Comprehensive PARQUET MIME type coverage.

Lines 33-37 define appropriate PARQUET MIME types:

  • application/vnd.apache.parquet (official IANA-registered type)
  • application/x-parquet (common cloud service identifier)
  • application/octet-stream (generic binary fallback)

This combination handles legitimate Parquet uploads from various sources. Good refinement from past review feedback.


114-117: LGTM: Clear rejection message for PARQUET parsing.

Lines 114-117 correctly enforce the design constraint that Parquet files cannot be parsed as documents. The explicit error message guides users to the appropriate collections endpoints. The check is well-placed before the method dispatch, preventing potential KeyError.
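The guard reduces to an early check of this shape (a sketch; the real constructor arguments and message differ):

    from api.schemas.core.documents import FileType
    from api.utils.exceptions import UnsupportedFileTypeException

    def reject_parquet(file_type: FileType) -> None:
        # PARQUET is a valid FileType for detection, but parse_file refuses it
        if file_type == FileType.PARQUET:
            raise UnsupportedFileTypeException()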

api/helpers/_documentmanager.py (1)

559-567: Avoid purging documents before the new parquet import succeeds.

Force updates still delete every existing document (lines 559-567) before any replacement chunks are persisted. If vectorization fails mid-run, the collection stays empty—exactly the data-loss scenario previously flagged. Please reorder the flow so the old documents are only removed after their replacements are fully inserted, or wrap the whole delete/create/upsert sequence in a transaction with compensating logic.

api/helpers/data/hasher/chunk_hasher.py (1)

38-46: Add separators between key/value pairs to avoid hash collisions.

concatenate_chunk_values still joins entries without a delimiter, so distinct payloads can collapse to the same string (e.g., {"a": "bc", "ab": "c"} and {"a": "bca", "b": "c"} both concatenate to a:bcab:c), breaking change detection. Please serialize each key:value pair deterministically and join them with an explicit separator before hashing.

         parts = []
         for key in sorted(chunk.keys()):
             value = chunk.get(key, None)
             if isinstance(value, (dict | list)):
                 # Serialize complex types to JSON for consistent representation
                 parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
             else:
                 parts.append(f"{key}:{value}")
-        return "".join(parts)
+        return "||".join(parts)
docs/docs/functionalities/document-management/collections.md (1)

190-191: Add trailing newline for markdownlint (MD047).

markdownlint still flags this file because it lacks a terminating newline. Please add a single newline at the end of the document to satisfy MD047.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment

Actionable comments posted: 12

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2370427 and 47956d4.

⛔ Files ignored due to path filters (3)
  • api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (31)
  • .github/badges/coverage-unit.json (1 hunks)
  • .github/badges/coverage.json (1 hunks)
  • api/endpoints/collections.py (3 hunks)
  • api/helpers/_accesscontroller.py (1 hunks)
  • api/helpers/_documentmanager.py (2 hunks)
  • api/helpers/_parsermanager.py (3 hunks)
  • api/helpers/data/hasher/__init__.py (1 hunks)
  • api/helpers/data/hasher/chunk_hasher.py (1 hunks)
  • api/helpers/documents/__init__.py (1 hunks)
  • api/helpers/documents/documents_updater.py (1 hunks)
  • api/helpers/files/__init__.py (1 hunks)
  • api/helpers/files/parquet/__init__.py (1 hunks)
  • api/helpers/files/parquet/extractor.py (1 hunks)
  • api/helpers/files/parquet/reader.py (1 hunks)
  • api/schemas/collections.py (2 hunks)
  • api/schemas/core/documents.py (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/conftest.py (1 hunks)
  • api/tests/integ/test_admin.py (1 hunks)
  • api/tests/integ/test_chat.py (1 hunks)
  • api/tests/integ/test_chunks.py (1 hunks)
  • api/tests/integ/test_collections.py (14 hunks)
  • api/tests/integ/test_documents.py (1 hunks)
  • api/tests/integ/test_files.py (1 hunks)
  • api/tests/integ/test_search.py (2 hunks)
  • api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
  • api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
  • api/utils/exceptions.py (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (18)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
  • DocumentsUpdater (19-303)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
api/helpers/files/parquet/reader.py (1)
api/utils/exceptions.py (1)
  • InvalidPARQUETFormatException (143-145)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/schemas/admin/roles.py (1)
  • PermissionType (10-14)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
  • ParquetDataExtractor (13-167)
api/helpers/files/parquet/reader.py (1)
  • ParquetReader (15-116)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/documents/documents_updater.py (5)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-100)
  • concatenate_chunk_values (19-47)
  • compute_chunk_hash (50-67)
  • compute_existing_chunk_hash (70-100)
api/helpers/files/parquet/extractor.py (5)
  • ParquetDataExtractor (13-167)
  • extract_chunk_ids (54-81)
  • extract_metadata_columns (109-120)
  • build_chunk_metadata (123-152)
  • get_row_values (155-167)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/helpers/_documentmanager.py (1)
  • get_chunks (303-332)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
  • ChunkHasher (10-100)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (3)
  • ChunkHasher (10-100)
  • concatenate_chunk_values (19-47)
  • compute_chunk_hash (50-67)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/helpers/_parsermanager.py (2)
  • ParserManager (16-214)
  • parse_file (109-119)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (1)
  • parse_file (335-346)
api/helpers/_documentmanager.py (5)
api/helpers/documents/documents_updater.py (2)
  • DocumentsUpdater (19-303)
  • process_document (42-122)
api/helpers/files/parquet/extractor.py (3)
  • ParquetDataExtractor (13-167)
  • get_unique_document_names (22-35)
  • filter_document_table (38-51)
api/helpers/files/parquet/reader.py (4)
  • ParquetReader (15-116)
  • read_and_validate (27-60)
  • sort_table (78-95)
  • identify_metadata_columns (98-116)
api/utils/exceptions.py (1)
  • CollectionNotFoundException (78-80)
api/clients/vector_store/_basevectorstoreclient.py (1)
  • delete_document (55-56)
api/endpoints/collections.py (4)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (3)
  • CollectionNotFoundException (78-80)
  • FileSizeLimitExceededException (124-129)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (4)
  • create_collection (74-83)
  • delete_collection (86-101)
  • update_collection_from_parquet (496-598)
  • update_collection (104-126)
api/helpers/_parsermanager.py (1)
  • _detect_file_type (60-91)
🔇 Additional comments (45)
api/helpers/files/__init__.py (1)

1-1: LGTM!

The module docstring provides clear documentation for the package purpose.

api/tests/integ/conftest.py (3)

343-343: LGTM!

The addition of put_without_permissions test helper follows the established pattern for other HTTP methods and enables comprehensive testing of PUT endpoints.


350-350: LGTM!

The addition of put_with_permissions test helper follows the established pattern and supports testing PUT operations with admin permissions.


357-357: LGTM!

The addition of put_master test helper follows the established pattern and enables testing PUT operations with master key authentication.

api/utils/exceptions.py (1)

143-146: LGTM!

The InvalidPARQUETFormatException correctly follows the established pattern for format validation exceptions (HTTP 422) and aligns well with the existing InvalidJSONFormatException.

api/schemas/collections.py (1)

1-14: LGTM!

The introduction of CollectionName as a validated type alias with strip_whitespace=True and min_length=1 provides consistent validation for collection names across the API. This is a clean, type-safe approach that replaces the removed request models and ensures that collection names are always properly validated.

pyproject.toml (1)

46-47: Remove the incorrect assumption about PyArrow version ranges.

PyArrow has progressed beyond the 10.x-15.x range to 22.0.0 (released October 24, 2025). The pinned version pyarrow==21.0.0 (released July 17, 2025) exists and is a stable release, though not the absolute latest. xxhash==3.6.0 exists and is the latest stable version. Both versions are valid and appropriate for production use.

Likely an incorrect or invalid review comment.

api/helpers/_accesscontroller.py (1)

278-281: No issues found. The query parameter handling for None values is correct and intentional.

The access controller correctly handles the case when visibility is not provided in query parameters. When request.query_params.get("visibility") returns None, the equality check None == CollectionVisibility.PUBLIC.value simply evaluates to False, and the permission exception is not raised—which is the intended behavior. Permission validation only applies when visibility is explicitly set to "public".

The endpoint parameters with defaults (CollectionVisibility.PRIVATE for create, None for update) operate separately at the FastAPI layer and are applied after the access controller check completes successfully.

api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1)

1-170: LGTM - Test cassette structure is correct.

This test cassette file follows the standard VCR format for recording HTTP interactions during integration tests. The use of dummy tokens ("Bearer dummy_token_for_test") is appropriate for test fixtures, and the recorded embeddings API responses will enable deterministic testing of the Parquet collection patch functionality.

api/schemas/core/documents.py (1)

21-21: LGTM - Parquet file type support added.

The addition of PARQUET = "parquet" to the FileType enum follows the existing pattern and enables the Parquet file handling functionality described in the PR objectives.

api/tests/integ/test_chunks.py (1)

17-17: LGTM - Correct enum value usage.

The change properly uses CollectionVisibility.PRIVATE.value to pass the string value "private" as a query parameter, which is the correct approach for query parameter serialization.

api/tests/integ/test_documents.py (1)

16-16: LGTM - Consistent with test suite pattern.

The collection creation correctly uses query parameters with the enum value, matching the pattern established across other test files in this PR.

api/tests/integ/test_search.py (1)

17-17: LGTM - Default visibility behavior.

Both collection creation calls now use query parameters and rely on the server's default visibility setting. This is acceptable for these test scenarios where the visibility isn't the focus of the test.

Also applies to: 103-103

api/tests/integ/test_files.py (1)

15-15: LGTM - Proper enum value usage for both visibility types.

Both collection creations correctly use the enum's .value property to pass the appropriate visibility string values as query parameters. This properly tests both PRIVATE and PUBLIC collection scenarios.

Also applies to: 22-22

api/tests/integ/test_chat.py (1)

29-29: Default collection visibility is PRIVATE—verification complete.

The endpoint defines visibility: CollectionVisibility = Query(CollectionVisibility.PRIVATE, description="Visibility for the collection"), confirming that omitting the visibility parameter in the test defaults to a private collection, which is appropriate for isolated test scenarios.

api/tests/unit/test_helpers/test_chunk_hasher.py (1)

1-68: LGTM! Well-structured test coverage for ChunkHasher.

The test suite covers essential hash computation behaviors: determinism, content sensitivity, metadata sensitivity, and concatenation consistency. The tests validate both the individual methods and their integration.

api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1)

1-86: Test cassette file looks correct.

This VCR cassette records embeddings API interactions for the parquet collection test. The format is standard and the recorded interactions align with the test expectations.

api/tests/unit/test_helpers/test_parsermanager.py (3)

437-437: LGTM! Extension map updated for PARQUET support.

The addition of .parquet mapping to FileType.PARQUET aligns with the new file type support introduced in this PR.


451-453: LGTM! Content types check updated for PARQUET.

The test now correctly validates that FileType.PARQUET is included in the valid content types mapping.


455-466: LGTM! Parquet rejection test validates the expected behavior.

The test correctly verifies that parquet files are explicitly rejected by parse_file with an UnsupportedFileTypeException, as parquet files require special handling outside the document parsing flow.

api/helpers/files/parquet/__init__.py (1)

1-6: LGTM! Clean package initialization.

The module clearly defines its public API by re-exporting ParquetReader and ParquetDataExtractor, making them easily accessible to consumers.

api/tests/integ/test_collections.py (7)

11-13: LGTM! Asset directory setup is correct.

The path setup for test assets enables file-based testing with parquet files.


19-21: LGTM! Consistent migration to params-based requests.

The tests now consistently use params for query parameters instead of JSON bodies, aligning with the updated API contract for collection endpoints.

Also applies to: 40-42, 54-56, 71-73, 95-97, 100-102, 106-108, 125-127, 137-139, 148-150, 159-161, 177-179, 192-194, 210-212, 215-217, 229-231


254-270: LGTM! Well-structured parquet creation test.

The test validates the new collection creation flow with an initial parquet file upload, checking for the expected response structure (id and details fields).


271-294: LGTM! PUT endpoint test validates force update flow.

The test correctly validates the PUT endpoint for force-updating a collection with a parquet file, checking for updated and details in the response.


295-371: Excellent comprehensive test for document change detection!

This test provides thorough coverage of the parquet update workflow:

  1. First upload with metadata update
  2. Second upload with identical file (validates no spurious changes detected)
  3. Third upload with updated file (validates correct change detection)

The assertions on added_documents, updated_documents, deleted_documents, and total_chunks_processed ensure the change detection logic works correctly.


372-384: LGTM! Validation test for malformed parquet files.

The test correctly validates that parquet files missing required columns are rejected with a 422 status code. Test asset files verified to exist in the repository.


61-62: PATCH status code verification complete.

All three test assertions (lines 61-62, 133, 245) correctly expect status_code == 200, which aligns with the PATCH endpoint implementation in api/endpoints/collections.py (line 148 decorator and line 207 return). The status code change from 204 to 200 has been properly implemented and consistently tested.

api/helpers/data/hasher/chunk_hasher.py (1)

49-67: LGTM! Hash computation and existing chunk handling are well-designed.

The use of xxHash64 is appropriate for fast, non-cryptographic hashing. The exclusion of internal metadata fields (collection_id, document_id, document_created_at) in compute_existing_chunk_hash correctly focuses on content comparison.

Also applies to: 69-100
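For context, the xxHash64 call itself is a one-liner over the concatenated string (input format per the concatenation described above):

    import xxhash

    # 64-bit non-cryptographic hash; hexdigest() yields 16 hex characters
    digest = xxhash.xxh64("content:hello|lang:fr".encode("utf-8")).hexdigest()
    assert len(digest) == 16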

api/helpers/files/parquet/extractor.py (6)

21-35: LGTM! Efficient use of PyArrow vectorized operations.

Using pc.unique() for extracting unique document names leverages PyArrow's optimized compute functions for better performance with large datasets.


37-51: LGTM! Clean filtering implementation.

The use of PyArrow's pc.equal() and filter() provides efficient row filtering without loading unnecessary data into memory.


53-81: LGTM! Robust chunk ID handling.

The method correctly handles both scenarios: extracting and validating existing chunk indices, or assigning sequential IDs when the column is absent. The logging on line 80 provides helpful debugging information.


83-106: LGTM! Sequential validation logic is sound.

The validation correctly checks for gaps in the sequence by comparing against the expected range. The early return for empty lists (lines 95-96) prevents errors.


108-152: LGTM! Safe JSON parsing with appropriate fallbacks.

The build_chunk_metadata method safely handles:

  • None values
  • JSON strings (both objects and arrays)
  • Parse errors (falls back to raw string)

This defensive approach prevents crashes from malformed data.
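The fallback pattern is roughly the following sketch:

    import json

    def parse_metadata_value(value: str | None):
        # None passes through; JSON strings are decoded; malformed input stays raw
        if value is None:
            return None
        try:
            return json.loads(value)
        except (json.JSONDecodeError, TypeError):
            return value

    parse_metadata_value('{"lang": "fr"}')  # -> {'lang': 'fr'}
    parse_metadata_value("not json")        # -> 'not json'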


154-167: LGTM! Straightforward row extraction.

The method provides a clean interface for extracting all column values for a specific row.

api/helpers/files/parquet/reader.py (3)

26-60: LGTM! Robust file reading with proper error handling.

The method correctly:

  • Reads the file content asynchronously
  • Validates the parquet format
  • Extracts and validates columns
  • Returns all necessary information

The exception handling on lines 44-46 provides clear error messages for debugging.


62-75: LGTM! Clear validation with helpful error messages.

The validation correctly identifies missing required columns and provides actionable error messages listing both what's missing and what's available.


77-95: LGTM! Efficient sorting for document grouping.

The multi-level sort ensures chunks are grouped by document and ordered sequentially, which is essential for downstream processing.

api/helpers/documents/documents_updater.py (5)

27-40: LGTM: Clean dependency injection pattern.

The initialization properly accepts dependencies and instantiates helper utilities. The design allows for flexible testing and component composition.


124-169: LGTM: Efficient chunk preparation.

The method efficiently prepares chunk data with conditional hash computation. The logic correctly builds metadata and only computes hashes when needed for comparison.


229-255: LGTM: Proper error handling.

The method correctly handles foreign key constraint violations and translates them into appropriate exceptions. The error handling ensures meaningful feedback when collections are deleted during document creation.


257-303: LGTM: Robust chunk insertion.

The method properly prepares chunks with complete metadata and handles vectorization failures gracefully with appropriate error wrapping and logging.


102-111: Document IDs change on update—verify if this design is acceptable.

The code exposes document IDs in the public API (schema includes id field), but the update logic deletes the old document and creates a new one, resulting in a new ID being returned to clients. If external systems, bookmarks, or stored references rely on document IDs, they will break after an update.

This appears intentional (document names serve as the stable identifier internally), but the implications depend on your use case:

  • If IDs are transient/internal-only: Current behavior is fine.
  • If external systems reference IDs: Consider maintaining stable IDs and updating in-place, or implementing ID redirection/mapping.

Confirm whether this behavior aligns with your design requirements.

api/endpoints/collections.py (2)

1-1: LGTM: Appropriate imports for Parquet support.

The new imports correctly support file upload handling and Parquet-specific validations for the enhanced collection endpoints.

Also applies to: 6-7, 10-10


148-148: Verify status code change for PATCH endpoint.

The status code for the PATCH endpoint was changed from the default (likely 200 or 204) to explicitly 200. Ensure this aligns with your API design:

  • 200 (OK) with response body is appropriate when returning update details
  • 204 (No Content) would be more RESTful if no body is returned

The current implementation returns a JSON response, so 200 is correct.

@FaheemBEG force-pushed the 467-add-a-put-collections-endpoint-for-collections-update branch from 47956d4 to d15048e on November 10, 2025 at 15:45
@coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (2)
api/utils/exceptions.py (1)

126-129: MAX_PARQUET_CONTENT_SIZE is unused in the default error message.

The newly added MAX_PARQUET_CONTENT_SIZE constant (line 126) is not used in the default error message (line 128), which still references MAX_CONTENT_SIZE. This was previously raised and marked as "Skipped" by the developer.

api/helpers/data/hasher/chunk_hasher.py (1)

38-46: Add an unambiguous serialization before hashing.

Different metadata dictionaries still collapse to the same concatenated string (e.g., {"a": "bc", "ab": "c"} vs {"a": "bca", "b": "c"} both yield a:bcab:c). Because the hash is built on top of this string, distinct chunks can produce identical hashes and slip through change detection. Please serialize each key/value pair unambiguously (e.g., JSON encode values and separate entries) before hashing.

Apply this diff to fix the issue:

         parts = []
         for key in sorted(chunk.keys()):
             value = chunk.get(key, None)
-            if isinstance(value, (dict | list)):
-                # Serialize complex types to JSON for consistent representation
-                parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
-            else:
-                parts.append(f"{key}:{value}")
-        return "".join(parts)
+            if isinstance(value, (dict | list)):
+                serialized = json.dumps(value, ensure_ascii=False, sort_keys=True)
+            else:
+                serialized = json.dumps(value, ensure_ascii=False, default=str)
+            parts.append(f"{key}={serialized}")
+        return "|".join(parts)
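A two-line repro of the collision, independent of the repo:

    def concat(chunk: dict) -> str:
        # Pre-fix behavior: no separator between key:value pairs
        return "".join(f"{key}:{chunk[key]}" for key in sorted(chunk))

    assert concat({"a": "bc", "ab": "c"}) == concat({"a": "bca", "b": "c"})  # both 'a:bcab:c'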
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47956d4 and 534ce22.

⛔ Files ignored due to path filters (3)
  • api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (31)
  • .github/badges/coverage-unit.json (1 hunks)
  • .github/badges/coverage.json (1 hunks)
  • api/endpoints/collections.py (4 hunks)
  • api/helpers/_accesscontroller.py (1 hunks)
  • api/helpers/_documentmanager.py (2 hunks)
  • api/helpers/_parsermanager.py (3 hunks)
  • api/helpers/data/hasher/__init__.py (1 hunks)
  • api/helpers/data/hasher/chunk_hasher.py (1 hunks)
  • api/helpers/documents/__init__.py (1 hunks)
  • api/helpers/documents/documents_updater.py (1 hunks)
  • api/helpers/files/__init__.py (1 hunks)
  • api/helpers/files/parquet/__init__.py (1 hunks)
  • api/helpers/files/parquet/extractor.py (1 hunks)
  • api/helpers/files/parquet/reader.py (1 hunks)
  • api/schemas/collections.py (2 hunks)
  • api/schemas/core/documents.py (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/conftest.py (1 hunks)
  • api/tests/integ/test_admin.py (2 hunks)
  • api/tests/integ/test_chat.py (1 hunks)
  • api/tests/integ/test_chunks.py (1 hunks)
  • api/tests/integ/test_collections.py (14 hunks)
  • api/tests/integ/test_documents.py (1 hunks)
  • api/tests/integ/test_files.py (1 hunks)
  • api/tests/integ/test_search.py (2 hunks)
  • api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
  • api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
  • api/utils/exceptions.py (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (18)
api/helpers/_documentmanager.py (5)
api/helpers/documents/documents_updater.py (2)
  • DocumentsUpdater (19-303)
  • process_document (42-122)
api/helpers/files/parquet/extractor.py (3)
  • ParquetDataExtractor (13-167)
  • get_unique_document_names (22-35)
  • filter_document_table (38-51)
api/helpers/files/parquet/reader.py (4)
  • ParquetReader (20-139)
  • read_and_validate (32-83)
  • sort_table (101-118)
  • identify_metadata_columns (121-139)
api/utils/exceptions.py (1)
  • CollectionNotFoundException (78-80)
api/clients/vector_store/_basevectorstoreclient.py (1)
  • delete_document (55-56)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/schemas/admin/roles.py (1)
  • PermissionType (10-14)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
  • ParquetDataExtractor (13-167)
api/helpers/files/parquet/reader.py (1)
  • ParquetReader (20-139)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
  • ChunkHasher (10-99)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (3)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (2)
  • FileSizeLimitExceededException (124-129)
  • InvalidPARQUETFormatException (143-145)
api/helpers/_parsermanager.py (1)
  • _detect_file_type (61-92)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
  • DocumentsUpdater (19-303)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/helpers/_parsermanager.py (2)
  • ParserManager (16-215)
  • parse_file (110-120)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (1)
  • parse_file (335-346)
api/helpers/documents/documents_updater.py (4)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
api/helpers/files/parquet/extractor.py (5)
  • ParquetDataExtractor (13-167)
  • extract_chunk_ids (54-81)
  • extract_metadata_columns (109-120)
  • build_chunk_metadata (123-152)
  • get_row_values (155-167)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
🔇 Additional comments (28)
.github/badges/coverage.json (1)

1-1: Overall coverage increased: +0.37 percentage points.

The overall test coverage improved to 73.14%, which is encouraging. However, review the unit test coverage trend (see related badge file).

api/helpers/files/__init__.py (1)

1-1: LGTM!

The docstring appropriately describes this package's purpose.

pyproject.toml (2)

3-3: LGTM!

Version bump to 0.3.0 is appropriate for the new Parquet ingestion feature.


46-47: Dependency versions verified—both are secure.

  • xxhash==3.6.0: Already at latest version with no known vulnerabilities.
  • pyarrow==21.0.0: Secure (not affected by known CVEs). Latest available is 22.0.0; consider upgrading to 22.0.0 for the most current release.
api/helpers/documents/__init__.py (1)

1-5: LGTM!

Clean module structure with appropriate public export of DocumentsUpdater.

api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1)

1-86: LGTM!

Standard VCR cassette file recording embeddings API interactions for the PUT collection with Parquet file test.

api/tests/integ/test_chat.py (1)

29-29: LGTM!

Test correctly updated to send collection name via query parameters, aligning with the updated API endpoint signature.

api/schemas/core/documents.py (1)

21-21: LGTM!

Addition of PARQUET file type to the enum is appropriate and follows the existing pattern.

api/tests/integ/test_chunks.py (1)

15-18: LGTM!

Test correctly updated to use query parameters and enum value, aligning with API endpoint changes.

api/tests/integ/test_documents.py (1)

14-17: LGTM!

Test correctly updated to use query parameters with enum value serialization, consistent with the updated collections endpoint.

api/tests/integ/conftest.py (1)

343-343: LGTM! Consistent test helper pattern.

The addition of PUT method helpers (put_without_permissions, put_with_permissions, put_master) follows the established pattern for other HTTP methods and properly configures authorization headers using partial.

Also applies to: 350-350, 357-357
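The pattern amounts to binding authorization headers with functools.partial, roughly (fixture internals and token placeholders assumed):

    from functools import partial

    # Hedged sketch of how the PUT helpers are bound on the test client
    client.put_with_permissions = partial(client.put, headers={"Authorization": "Bearer <admin-token>"})
    client.put_master = partial(client.put, headers={"Authorization": "Bearer <master-key>"})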

api/tests/integ/test_search.py (1)

17-17: LGTM! Consistent migration to query parameters.

The collection creation calls correctly switched from JSON body to query parameters, aligning with the endpoint signature changes documented in the broader PR context.

Also applies to: 103-103

api/utils/exceptions.py (1)

143-145: LGTM! Clear exception for Parquet format validation.

The new InvalidPARQUETFormatException follows the established pattern for format-specific exceptions and provides appropriate HTTP 422 status for validation errors.

api/tests/integ/test_admin.py (1)

11-11: LGTM! Improved type safety with enum.

The import of CollectionVisibility and its usage with .PRIVATE.value in the collection creation request improves type safety and consistency across the test suite.

Also applies to: 406-406

api/helpers/_accesscontroller.py (1)

278-281: LGTM! Correctly adapted to query parameter inputs.

The permission check now reads visibility from request.query_params instead of the request body, aligning with the endpoint signature changes that accept collection metadata as query parameters.

api/tests/integ/test_files.py (1)

15-15: LGTM! Consistent enum usage and query parameters.

The collection creation calls correctly use query parameters with CollectionVisibility.PRIVATE.value and CollectionVisibility.PUBLIC.value, ensuring type safety and consistency with the updated endpoint signatures.

Also applies to: 22-22

api/helpers/data/hasher/__init__.py (1)

1-5: LGTM! Clean module structure.

The package initialization follows Python best practices by providing a clear docstring and using __all__ to explicitly define the public API surface.

api/tests/unit/test_helpers/test_chunk_hasher.py (4)

12-26: LGTM! Essential determinism test.

The test correctly verifies that identical chunk dictionaries produce identical hashes and validates the expected 16-character hexadecimal output format from xxHash64.


27-38: LGTM! Collision detection test.

The test correctly verifies that different content produces different hashes, which is critical for change detection functionality.


40-51: LGTM! Metadata sensitivity test.

The test correctly validates that metadata changes (e.g., metadata_version) affect the hash output, ensuring comprehensive change detection beyond just content.


53-68: LGTM! Concatenation consistency test.

The test correctly verifies that concatenate_chunk_values produces deterministic output and includes all expected keys, which is essential for reproducible hash computation.

api/helpers/files/parquet/reader.py (2)

31-83: LGTM! Well-structured file validation and reading.

The read_and_validate method efficiently handles file size validation using seek/tell, properly resets the file pointer, validates file type and structure, and provides clear error messages. The separation of sorting logic (via sort_table) to the caller is appropriate.
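
A minimal sketch of the seek/tell size check described above (function name and error type are illustrative):

import os

def validated_size(f, max_bytes: int) -> int:
    # Measure size without loading the file, then restore the pointer.
    pos = f.tell()
    f.seek(0, os.SEEK_END)
    size = f.tell()
    f.seek(pos)
    if size > max_bytes:
        raise ValueError(f"File is {size} bytes; limit is {max_bytes}.")
    return size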


120-139: LGTM! Mutation bug properly fixed.

The identify_metadata_columns method now correctly uses .copy() to avoid mutating the class constant REQUIRED_COLUMNS. Good fix from the previous review feedback.

api/helpers/documents/documents_updater.py (2)

42-122: LGTM! Well-orchestrated document processing flow.

The process_document method cleanly orchestrates the update workflow: extracting data, checking for changes, handling new vs. updated documents, and managing statistics. The delete-before-create pattern for updates is appropriate.


171-227: LGTM! Efficient change detection with hash comparison.

The update detection logic efficiently compares chunk counts first, then hashes, stopping at the first difference. The logging of specific differences is helpful for debugging.

Note: The 10,000 chunk limit at line 206 was previously flagged and you chose to skip addressing it.
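
A sketch of the count-then-hash comparison described above (names are illustrative, not the PR's actual signatures):

def needs_update(new_hashes: list[str], existing_hashes: list[str]) -> bool:
    # Cheap check first: a different chunk count always means an update.
    if len(new_hashes) != len(existing_hashes):
        return True
    # Otherwise compare per-chunk hashes, stopping at the first mismatch.
    return any(new != old for new, old in zip(new_hashes, existing_hashes))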

api/endpoints/collections.py (3)

39-59: LGTM! Robust error handling with collection cleanup.

The exception handling properly rolls back the collection creation if parquet processing fails, ensuring no orphaned collections are left in the database. The re-raising of the exception maintains proper error propagation to the client.
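
A condensed sketch of that flow, reusing the manager method names shown in the code graph above (argument names are assumptions, so this is illustrative rather than the endpoint's literal body):

async def create_collection_with_parquet(manager, name: str, file):
    # Hypothetical wrapper: create first, roll back on parquet failure.
    collection_id = await manager.create_collection(name=name)
    try:
        if file is not None:
            await manager.update_collection_from_parquet(
                collection_id=collection_id, parquet_file=file
            )
    except Exception:
        # Roll back so no orphaned collection remains, then re-raise.
        await manager.delete_collection(collection_id=collection_id)
        raise
    return collection_id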


138-189: LGTM! Clean implementation of metadata updates with good refactoring.

The update endpoint properly separates metadata updates from document updates. The dictionary-based approach for tracking metadata changes (lines 162-177) is a good improvement from previous feedback, making the code more maintainable and the response construction cleaner.


192-220: LGTM! Well-designed force update endpoint.

The force_update_collection endpoint appropriately uses PUT for full replacement semantics and makes the file parameter required. The force_update=True flag clearly communicates the delete-all-then-recreate behavior, and the documentation properly explains the requirements and behavior.
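
For context, a hypothetical client call exercising the PUT semantics (route, field names, and auth are assumptions for illustration):

import httpx

with open("data.parquet", "rb") as f:
    response = httpx.put(
        "http://localhost:8000/v1/collections/<collection-id>",  # assumed route
        files={"file": ("data.parquet", f, "application/vnd.apache.parquet")},
        headers={"Authorization": "Bearer <token>"},
    )
print(response.json())  # expected: processing stats for the replaced documents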

@FaheemBEG FaheemBEG force-pushed the 467-add-a-put-collections-endpoint-for-collections-update branch from 534ce22 to 6b9aadd Compare November 10, 2025 16:43
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (3)
.github/badges/coverage-unit.json (1)

1-1: Coverage decline already flagged in previous review.

The unit test coverage decrease has already been comprehensively addressed in the previous review comment, which identified that DocumentsUpdater and Parquet-related features lack unit test coverage.

api/helpers/data/hasher/chunk_hasher.py (1)

38-46: Fix chunk hash collisions between content and metadata.

Concatenating key/value pairs without a delimiter collapses different chunks to the same hash. For example, {"content": "foo", "source": "bar"} and {"content": "foosource:bar"} both currently produce content:foosource:bar, so we would fail to notice metadata changes. Please emit each pair with an unambiguous separator (and normalize None) before hashing to make collisions impossible.

         parts = []
         for key in sorted(chunk.keys()):
-            value = chunk.get(key, None)
-            if isinstance(value, (dict | list)):
-                # Serialize complex types to JSON for consistent representation
-                parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
-            else:
-                parts.append(f"{key}:{value}")
-        return "".join(parts)
+            value = chunk.get(key)
+            if isinstance(value, (dict | list)):
+                serialized = json.dumps(value, ensure_ascii=False, separators=(",", ":"))
+            elif value is None:
+                serialized = "null"
+            else:
+                serialized = str(value)
+            parts.append(f"{key}:{serialized}")
+        return "\u001f".join(parts)
api/helpers/documents/documents_updater.py (1)

200-215: Remove the 10 000 chunk cap; it still breaks update detection.

get_chunks(..., limit=10000) means we only ever compare the first 10 000 stored chunks. Any document larger than that is always flagged as “changed” (because the new dataset has N chunks but we only fetched 10 000), so we delete and rebuild it on every update—even when nothing changed. Combined with the issue above, that false positive can wipe a healthy document if vectorization glitches once. Please paginate until all chunks are loaded (as previously suggested) or adopt a hash that covers the full set.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 534ce22 and cfcef3e.

⛔ Files ignored due to path filters (3)
  • api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (31)
  • .github/badges/coverage-unit.json (1 hunks)
  • .github/badges/coverage.json (1 hunks)
  • api/endpoints/collections.py (4 hunks)
  • api/helpers/_accesscontroller.py (1 hunks)
  • api/helpers/_documentmanager.py (2 hunks)
  • api/helpers/_parsermanager.py (3 hunks)
  • api/helpers/data/hasher/__init__.py (1 hunks)
  • api/helpers/data/hasher/chunk_hasher.py (1 hunks)
  • api/helpers/documents/__init__.py (1 hunks)
  • api/helpers/documents/documents_updater.py (1 hunks)
  • api/helpers/files/__init__.py (1 hunks)
  • api/helpers/files/parquet/__init__.py (1 hunks)
  • api/helpers/files/parquet/extractor.py (1 hunks)
  • api/helpers/files/parquet/reader.py (1 hunks)
  • api/schemas/collections.py (2 hunks)
  • api/schemas/core/documents.py (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/conftest.py (1 hunks)
  • api/tests/integ/test_admin.py (2 hunks)
  • api/tests/integ/test_chat.py (1 hunks)
  • api/tests/integ/test_chunks.py (1 hunks)
  • api/tests/integ/test_collections.py (14 hunks)
  • api/tests/integ/test_documents.py (1 hunks)
  • api/tests/integ/test_files.py (1 hunks)
  • api/tests/integ/test_search.py (2 hunks)
  • api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
  • api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
  • api/utils/exceptions.py (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (19)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/schemas/admin/roles.py (1)
  • PermissionType (10-14)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
  • DocumentsUpdater (19-301)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (2)
  • FileSizeLimitExceededException (124-129)
  • InvalidPARQUETFormatException (143-145)
api/helpers/_parsermanager.py (1)
  • _detect_file_type (61-92)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
  • ChunkHasher (10-99)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
api/helpers/documents/documents_updater.py (4)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
api/helpers/files/parquet/extractor.py (5)
  • ParquetDataExtractor (13-167)
  • extract_chunk_ids (54-81)
  • extract_metadata_columns (109-120)
  • build_chunk_metadata (123-152)
  • get_row_values (155-167)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/endpoints/collections.py (2)
api/schemas/collections.py (3)
  • Collection (17-26)
  • Collections (29-31)
  • CollectionVisibility (9-11)
api/helpers/_documentmanager.py (4)
  • create_collection (74-83)
  • update_collection_from_parquet (496-608)
  • delete_collection (86-101)
  • update_collection (104-126)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (3)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
  • ParquetDataExtractor (13-167)
api/helpers/files/parquet/reader.py (1)
  • ParquetReader (20-139)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/helpers/_parsermanager.py (2)
  • ParserManager (16-215)
  • parse_file (110-120)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (1)
  • parse_file (335-346)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (8)
api/helpers/documents/documents_updater.py (2)
  • DocumentsUpdater (19-301)
  • process_document (40-120)
api/helpers/files/parquet/extractor.py (3)
  • ParquetDataExtractor (13-167)
  • get_unique_document_names (22-35)
  • filter_document_table (38-51)
api/helpers/files/parquet/reader.py (4)
  • ParquetReader (20-139)
  • read_and_validate (32-83)
  • sort_table (101-118)
  • identify_metadata_columns (121-139)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/endpoints/collections.py (1)
  • create_collection (16-61)
api/clients/vector_store/_elasticsearchvectorstoreclient.py (2)
  • create_collection (27-97)
  • delete_document (117-120)
api/clients/vector_store/_qdrantvectorstoreclient.py (2)
  • create_collection (45-54)
  • delete_document (77-79)
api/clients/vector_store/_basevectorstoreclient.py (2)
  • create_collection (39-40)
  • delete_document (55-56)
🔇 Additional comments (21)
.github/badges/coverage.json (1)

1-1: Confirm this is CI-generated, not manually committed.

Coverage badge files should typically be auto-generated by your CI/CD pipeline (e.g., codecov workflow) rather than manually committed to the repository. If this was generated locally or via a pre-commit hook, please ensure the generation process is documented and automated to prevent manual divergence.

Do you have a GitHub Actions workflow or similar CI job that regenerates this badge automatically? If so, this file may belong in .gitignore to avoid merge conflicts and drift from actual coverage values.

api/schemas/core/documents.py (1)

21-21: LGTM! FileType enum extended for Parquet support.

The addition of the PARQUET file type is consistent with the existing enum structure and aligns with the broader Parquet support being added across the codebase.

api/helpers/files/__init__.py (1)

1-1: LGTM! Module documentation added.

The docstring appropriately describes the module's purpose.

api/tests/integ/conftest.py (1)

343-357: LGTM! PUT helpers added consistently.

The new PUT method helpers (put_without_permissions, put_with_permissions, put_master) follow the same pattern as existing HTTP method helpers and properly configure authorization headers for different permission levels.

api/helpers/_accesscontroller.py (1)

278-281: LGTM! Visibility check updated to use query parameters.

The change from reading visibility from the request body to query parameters aligns with the updated API design where collection metadata is passed via query params. The permission check logic remains correct.

api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1)

1-170: LGTM! Test cassette for collection metadata and file patching.

This VCR cassette fixture correctly records embeddings API interactions for integration testing of the collection PATCH endpoint with parquet file uploads.

api/tests/unit/test_helpers/test_parsermanager.py (2)

437-446: LGTM! Extension map test updated for PARQUET.

The test correctly verifies that .parquet is mapped to FileType.PARQUET in the extension map.


451-453: LGTM! Content types validation includes PARQUET.

The test correctly verifies that FileType.PARQUET has valid content types defined.

api/tests/integ/test_chat.py (1)

29-29: LGTM: Clean migration to query parameters.

The collection creation correctly uses query parameters instead of a JSON body, aligning with the updated endpoint signature across the codebase.

api/tests/integ/test_documents.py (1)

14-17: LGTM: Proper enum serialization and parameter usage.

The fixture correctly passes collection metadata via query parameters and uses .value to serialize the visibility enum to its string representation.

api/tests/integ/test_chunks.py (1)

15-18: LGTM: Consistent with test suite pattern.

The setup fixture follows the same query parameter pattern established across other integration tests.

api/tests/integ/test_search.py (2)

17-17: LGTM: Setup fixture updated correctly.

The collection creation in the setup fixture properly uses query parameters.


103-103: LGTM: Test updated consistently.

The test correctly creates a collection using query parameters, maintaining consistency with other test updates.

api/utils/exceptions.py (2)

126-126: LGTM: Parquet size limit constant added.

The MAX_PARQUET_CONTENT_SIZE constant provides a clear size constraint for Parquet file processing.


143-146: LGTM: Clean exception class for Parquet validation.

The InvalidPARQUETFormatException follows the established exception pattern and provides appropriate HTTP status code (422) for format validation errors.

api/tests/integ/test_files.py (2)

13-17: LGTM: Private collection setup updated correctly.

The fixture properly creates a private collection using query parameters with enum value serialization.


20-24: LGTM: Public collection setup follows same pattern.

Consistent with the private collection setup and other test files in the suite.

api/tests/integ/test_admin.py (2)

11-11: LGTM: Import added for enum usage.

The CollectionVisibility import supports the updated collection creation pattern in the test.


403-407: LGTM: Test correctly uses enum value.

The collection creation properly uses query parameters with CollectionVisibility.PRIVATE.value for type safety and consistency.

api/helpers/data/hasher/__init__.py (1)

1-5: LGTM: Clean module interface.

The module properly exposes ChunkHasher via __all__ with a clear docstring describing its purpose. This follows Python packaging best practices for public API definition.

api/tests/unit/test_helpers/test_chunk_hasher.py (1)

12-68: Nice coverage for the hash helper.

The deterministic, content/metadata variance, and concat smoke checks give us confidence the helper behaves as intended.

@FaheemBEG FaheemBEG force-pushed the 467-add-a-put-collections-endpoint-for-collections-update branch from cfcef3e to 2035450 Compare November 12, 2025 13:37
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (2)
api/helpers/data/hasher/chunk_hasher.py (1)

38-46: Add separators to prevent hash collisions (unresolved from past review).

The past review flagged this collision risk, and while marked "Partially Done," the fundamental issue remains unresolved. Without delimiters between key-value pairs, distinct dictionaries can produce identical concatenated strings:

{"a": "b:c", "d": "e"}  → "a:b:cd:e"
{"a": "b", "c": "d:e"}  → "a:bcd:e"  ← collision!

This creates a risk of false hash matches in change detection, causing the system to incorrectly skip legitimate document updates.

Apply this fix to add unambiguous separators:

         parts = []
         for key in sorted(chunk.keys()):
             value = chunk.get(key, None)
             if isinstance(value, (dict | list)):
                 # Serialize complex types to JSON for consistent representation
                 parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
             else:
                 parts.append(f"{key}:{value}")
-        return "".join(parts)
+        return "|".join(parts)  # Separator prevents collisions
api/helpers/documents/documents_updater.py (1)

215-233: Remove the 10k chunk cap when comparing existing content.

Limiting get_chunks to 10 000 still causes us to ignore differences in chunks beyond that index, so large documents can silently skip updates. Please page through all chunks (or otherwise compare full content) before deciding “no change.”

-        existing_chunks = await self.vector_store.get_chunks(
-            collection_id=collection_id,
-            document_id=existing_doc_id,
-            offset=0,
-            limit=10000,
-        )
-        existing_chunks.sort(key=lambda c: c.id)
+        existing_chunks: list[Chunk] = []
+        offset = 0
+        page_size = 1000
+        while True:
+            batch = await self.vector_store.get_chunks(
+                collection_id=collection_id,
+                document_id=existing_doc_id,
+                offset=offset,
+                limit=page_size,
+            )
+            if not batch:
+                break
+            existing_chunks.extend(batch)
+            if len(batch) < page_size:
+                break
+            offset += page_size
+        existing_chunks.sort(key=lambda c: c.id)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfcef3e and 9f01d87.

⛔ Files ignored due to path filters (3)
  • api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (32)
  • .github/badges/coverage-unit.json (1 hunks)
  • .github/badges/coverage.json (1 hunks)
  • api/endpoints/collections.py (4 hunks)
  • api/helpers/_accesscontroller.py (1 hunks)
  • api/helpers/_documentmanager.py (2 hunks)
  • api/helpers/_parsermanager.py (3 hunks)
  • api/helpers/data/hasher/__init__.py (1 hunks)
  • api/helpers/data/hasher/chunk_hasher.py (1 hunks)
  • api/helpers/documents/__init__.py (1 hunks)
  • api/helpers/documents/documents_updater.py (1 hunks)
  • api/helpers/files/__init__.py (1 hunks)
  • api/helpers/files/parquet/__init__.py (1 hunks)
  • api/helpers/files/parquet/extractor.py (1 hunks)
  • api/helpers/files/parquet/reader.py (1 hunks)
  • api/schemas/collections.py (2 hunks)
  • api/schemas/core/documents.py (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/conftest.py (1 hunks)
  • api/tests/integ/test_admin.py (2 hunks)
  • api/tests/integ/test_chat.py (1 hunks)
  • api/tests/integ/test_chunks.py (1 hunks)
  • api/tests/integ/test_collections.py (14 hunks)
  • api/tests/integ/test_documents.py (1 hunks)
  • api/tests/integ/test_files.py (1 hunks)
  • api/tests/integ/test_search.py (2 hunks)
  • api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
  • api/tests/unit/test_helpers/test_documents_updater.py (1 hunks)
  • api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
  • api/utils/exceptions.py (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (20)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (3)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/helpers/_parsermanager.py (2)
  • ParserManager (16-215)
  • parse_file (110-120)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (1)
  • parse_file (335-346)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (2)
  • FileSizeLimitExceededException (124-129)
  • InvalidPARQUETFormatException (143-145)
api/helpers/_parsermanager.py (1)
  • _detect_file_type (61-92)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
  • DocumentsUpdater (19-318)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
  • ParquetDataExtractor (13-167)
api/helpers/files/parquet/reader.py (1)
  • ParquetReader (20-139)
api/endpoints/collections.py (6)
api/helpers/_accesscontroller.py (1)
  • AccessController (56-399)
api/schemas/collections.py (3)
  • Collection (17-26)
  • Collections (29-31)
  • CollectionVisibility (9-11)
api/helpers/_documentmanager.py (4)
  • create_collection (74-83)
  • update_collection_from_parquet (496-608)
  • delete_collection (86-101)
  • update_collection (104-126)
api/sql/session.py (1)
  • get_db_session (33-39)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
api/utils/exceptions.py (1)
  • CollectionNotFoundException (78-80)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
  • ChunkHasher (10-99)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/tests/unit/test_helpers/test_documents_updater.py (2)
api/helpers/documents/documents_updater.py (3)
  • _prepare_chunks_data (139-184)
  • _check_update_needed (186-242)
  • process_document (40-137)
api/utils/exceptions.py (1)
  • VectorizationFailedException (190-192)
api/helpers/documents/documents_updater.py (5)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
api/helpers/files/parquet/extractor.py (5)
  • ParquetDataExtractor (13-167)
  • extract_chunk_ids (54-81)
  • extract_metadata_columns (109-120)
  • build_chunk_metadata (123-152)
  • get_row_values (155-167)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/helpers/_documentmanager.py (1)
  • get_chunks (303-332)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/schemas/admin/roles.py (1)
  • PermissionType (10-14)
api/helpers/_documentmanager.py (3)
api/helpers/documents/documents_updater.py (2)
  • DocumentsUpdater (19-318)
  • process_document (40-137)
api/helpers/files/parquet/extractor.py (3)
  • ParquetDataExtractor (13-167)
  • get_unique_document_names (22-35)
  • filter_document_table (38-51)
api/helpers/files/parquet/reader.py (4)
  • ParquetReader (20-139)
  • read_and_validate (32-83)
  • sort_table (101-118)
  • identify_metadata_columns (121-139)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
🔇 Additional comments (27)
.github/badges/coverage.json (1)

1-1: Coverage improved: +0.28 percentage points.

Overall test coverage increased from 72.77% to 73.05%. This is a positive change.

.github/badges/coverage-unit.json (1)

1-1: Unit test coverage declined slightly: −0.17 percentage points.

Unit test coverage decreased from 58.95% to 58.78%. However, the enriched summary confirms that new unit tests were added for ChunkHasher (api/tests/unit/test_helpers/test_chunk_hasher.py) and DocumentsUpdater (api/tests/unit/test_helpers/test_documents_updater.py). The slight decline likely reflects new code landing faster than the unit tests written to cover it.

api/schemas/core/documents.py (1)

19-26: LGTM! Clean addition of Parquet file type.

The PARQUET enum member follows the existing pattern and enables Parquet file handling in the codebase.

api/helpers/files/__init__.py (1)

1-1: LGTM! Appropriate module docstring.

Standard Python practice to document module purposes.

pyproject.toml (2)

3-3: LGTM! Appropriate version bump.

Version bumped from 0.2.8post3 to 0.3.0, which correctly indicates a minor release with new features (Parquet support, collection update endpoints).


46-48: LGTM! Security vulnerability resolved.

pyarrow upgraded to 22.0.0, which addresses the critical security vulnerability flagged in previous reviews. xxhash 3.6.0 is the latest version. Both dependencies are correctly placed in the optional api group to support Parquet processing and chunk hashing features.

api/tests/integ/conftest.py (3)

343-343: LGTM! Consistent test helper for PUT requests.

The put_without_permissions helper follows the existing pattern for other HTTP methods, enabling permission-scoped testing of the new PUT endpoint.


350-350: LGTM! Consistent test helper for PUT requests.

The put_with_permissions helper follows the existing pattern, enabling admin-level testing of the new PUT endpoint.


357-357: LGTM! Consistent test helper for PUT requests.

The put_master helper follows the existing pattern, enabling master-level testing of the new PUT endpoint.

api/schemas/collections.py (1)

1-14: LGTM! Clean migration to Pydantic v2 patterns.

The code now uses modern Pydantic v2 features (Annotated, StringConstraints) instead of v1-style patterns. The CollectionName type alias provides reusable validation (whitespace stripping, minimum length) for collection names. This aligns with the API changes where collection metadata is now passed via query parameters instead of request bodies.
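
A minimal sketch of the Annotated/StringConstraints pattern described above (constraint values are assumptions):

from typing import Annotated

from pydantic import StringConstraints

# Reusable name type: strips surrounding whitespace, requires >= 1 char.
CollectionName = Annotated[str, StringConstraints(strip_whitespace=True, min_length=1)]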

api/tests/integ/test_chat.py (1)

29-29: LGTM! Test updated to match new API contract.

Collection creation now uses query parameters instead of a JSON body, aligning with the removal of CollectionRequest and the new query-parameter-based API design.

api/tests/integ/test_chunks.py (1)

17-17: LGTM! Collection creation migrated to query parameters.

The test correctly uses query parameters for collection creation with CollectionVisibility.PRIVATE.value to pass the enum's underlying string value, aligning with the updated endpoint signature.

api/tests/integ/test_documents.py (1)

16-16: LGTM! Consistent with endpoint signature changes.

The collection fixture correctly migrated to query parameters, using CollectionVisibility.PRIVATE.value to provide the string representation of the enum.

api/tests/integ/test_admin.py (2)

11-11: LGTM! Import added to support enum usage.

The CollectionVisibility import is correctly added to support the updated collection creation pattern using enum values.


406-406: LGTM! Test updated to use enum for type safety.

The collection creation now correctly uses CollectionVisibility.PRIVATE.value with query parameters, improving type safety and consistency across tests.

api/utils/exceptions.py (2)

126-126: LGTM! Parquet size limit constant added.

The MAX_PARQUET_CONTENT_SIZE constant provides a clear size limit for Parquet file validation, maintaining consistency with the existing MAX_CONTENT_SIZE pattern.


143-145: LGTM! Exception class follows established patterns.

The InvalidPARQUETFormatException is correctly implemented with status code 422 and follows the same structure as other format validation exceptions (e.g., InvalidJSONFormatException).

api/helpers/_accesscontroller.py (1)

277-281: LGTM! Access control correctly migrated to query parameters.

The _check_collections method now reads visibility from request.query_params instead of the request body, correctly aligning with the updated endpoint signature. The permission check logic remains sound.

api/tests/integ/test_files.py (2)

15-15: LGTM! Private collection creation updated correctly.

The test properly uses query parameters with CollectionVisibility.PRIVATE.value to create the private test collection.


22-22: LGTM! Public collection creation updated correctly.

The test properly uses query parameters with CollectionVisibility.PUBLIC.value to create the public test collection, ensuring appropriate test coverage for both visibility types.

api/helpers/data/hasher/__init__.py (1)

1-5: LGTM! Well-structured module initialization.

The module correctly exposes the ChunkHasher utility with proper documentation and follows Python packaging conventions. The explicit __all__ export list provides clear public API boundaries.

api/helpers/_parsermanager.py (1)

114-116: LGTM! PARQUET handling is correctly implemented.

The guard clause properly prevents PARQUET files from being parsed as documents with a clear error message. This aligns with the PR's design where PARQUET files are exclusively handled by collection endpoints.
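
The guard reduces to something like the following self-contained sketch (the enum and exception here are abbreviated stand-ins; the message matches the one quoted later in this thread):

from enum import Enum

class FileType(str, Enum):
    # Abbreviated stand-in for the project's real enum.
    PARQUET = "parquet"

class UnsupportedFileTypeException(Exception):
    pass

def parse_file(file_type: FileType) -> None:
    # Sketch of the guard described above; the real parse_file does far more.
    if file_type == FileType.PARQUET:
        raise UnsupportedFileTypeException("Parquet files cannot be parsed as documents.")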

api/tests/unit/test_helpers/test_parsermanager.py (1)

455-466: LGTM! PARQUET rejection test is adequate.

The test correctly verifies that PARQUET files trigger UnsupportedFileTypeException in parse_file. While the exception message isn't explicitly verified (per the skipped past review suggestion), the test adequately ensures the rejection behavior.

api/helpers/files/parquet/__init__.py (1)

1-6: LGTM! Package exports are correctly defined.

The module properly exports the two main Parquet utility classes with appropriate __all__ declaration.

api/tests/unit/test_helpers/test_documents_updater.py (1)

1-450: LGTM! Comprehensive test coverage for DocumentsUpdater.

The test suite is well-structured and thorough:

  • Proper fixtures for mocking dependencies (vector store, callbacks)
  • Comprehensive coverage of chunk preparation scenarios (metadata, force_update, empty content)
  • Update detection logic tested across multiple scenarios (new docs, force updates, hash comparisons, chunk count differences)
  • Document processing workflow tested including rollback behavior on failure
  • Appropriate use of AsyncMock for async operations and patch for internal methods
api/tests/integ/test_collections.py (2)

295-370: Well-structured multi-step integration test.

This test effectively validates the complete Parquet processing pipeline through three upload scenarios:

  1. Initial upload with metadata update (lines 304-325)
  2. Identical file re-upload detecting no changes (lines 327-347)
  3. Updated file detecting specific changes (lines 349-370)

The assertions correctly verify both metadata updates and document processing statistics, providing good coverage of the update detection logic.


11-13: All required Parquet test assets are present in the repository.

Verification confirms that all three Parquet files referenced in the integration tests are committed and available:

  • api/tests/integ/assets/parquet.parquet
  • api/tests/integ/assets/updated_parquet.parquet
  • api/tests/integ/assets/parquet_wrong_format.parquet

The tests will execute successfully with these assets in place.

@FaheemBEG FaheemBEG force-pushed the 467-add-a-put-collections-endpoint-for-collections-update branch from 9f01d87 to fb77e16 Compare November 12, 2025 14:25
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
api/tests/integ/test_files.py (1)

40-41: Remove redundant file.close() inside context managers.

The with block closes the file; explicit close is unnecessary and brittle if reused later.

-            response = client.post_without_permissions(url=f"/v1{ENDPOINT__FILES}", data=data, files=files)
-            file.close()
+            response = client.post_without_permissions(url=f"/v1{ENDPOINT__FILES}", data=data, files=files)

(Apply similarly to other tests in this file.)

Also applies to: 52-53, 65-65, 76-76, 88-88, 100-100, 112-112, 124-124, 136-136, 148-148, 159-159

api/tests/unit/test_helpers/test_parsermanager.py (2)

158-167: Fix tautological assertion; assert replacement char.

"" in result is always true for any string; assert U+FFFD occurrence and non-empty content.

-        # Should contain replacement characters
-        assert "" in result or len(result) > 0
+        # Should contain the Unicode replacement character from decode(errors="replace")
+        assert "\uFFFD" in result
+        assert len(result) > 0

110-121: Use PropertyMock on the type, not the instance.

To raise on attribute access, set the PropertyMock on the mock’s type; current code may not exercise the intended path.

-        mock_file = MagicMock(spec=UploadFile)
-        # Make accessing filename raise an exception
-        mock_file.filename = PropertyMock(side_effect=Exception("Access error"))
-        mock_file.content_type = "text/plain"
+        mock_file = MagicMock(spec=UploadFile)
+        type(mock_file).filename = PropertyMock(side_effect=Exception("Access error"))
+        type(mock_file).content_type = PropertyMock(return_value="text/plain")
api/helpers/_parsermanager.py (2)

131-139: Bug: same metadata object reused across PDF pages.

Mutating metadata.page inside the loop reuses the same instance, so every page ends up tagged with the last page number.

-            document = ParsedDocument(data=[])
-            metadata = ParsedDocumentMetadata(document_name=params.file.filename)
+            document = ParsedDocument(data=[])
             for page_num in range(len(pdf)):
                 page = pdf[page_num]
                 text = page.get_text()
-                metadata.page = page_num
-                document.data.append(ParsedDocumentPage(content=text, images={}, metadata=metadata))
+                page_metadata = ParsedDocumentMetadata(document_name=params.file.filename, page=page_num)
+                document.data.append(ParsedDocumentPage(content=text, images={}, metadata=page_metadata))

200-211: Decode TXT via shared reader for consistency.

_read_content already handles encodings; reuse it so TXT content is str.

-            text = await params.file.read()
+            text = await self._read_content(file=params.file)
             document = ParsedDocument(
                 data=[
                     ParsedDocumentPage(
                         content=text,
                         images={},
                         metadata=ParsedDocumentMetadata(document_name=params.file.filename),
                     )
                 ]
             )
♻️ Duplicate comments (5)
api/helpers/data/hasher/chunk_hasher.py (1)

18-46: Critical: Hash collision risk remains unresolved.

The current concatenation without delimiters between key-value pairs can produce identical hashes for different content. For example:

  • {"content": "textpage", "x": "1"}"content:textpagex:1"
  • {"content": "text", "pagex": "1"}"content:textpagex:1" (identical)

This collision can cause different chunks to be incorrectly identified as duplicates, leading to data corruption during updates.

Apply this diff to fix the collision risk:

             parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
         else:
             parts.append(f"{key}:{value}")
-    return "".join(parts)
+    return "|".join(parts)  # Separator prevents collisions
api/tests/unit/test_helpers/test_parsermanager.py (1)

455-466: Also assert the exception message for clarity.

Check the detail text to lock user-facing error contract. This was suggested earlier and still applies.

         with pytest.raises(UnsupportedFileTypeException) as exc_info:
             await manager.parse_file(file=file)
+        assert exc_info.value.detail == "Parquet files cannot be parsed as documents."
api/helpers/_documentmanager.py (1)

559-568: Do not mass-delete on force_update; delete only after successful re-ingest.

Deleting all documents upfront risks permanent data loss if embedding/upsert fails mid-run. The per-document updater already deletes the old doc after successful insert; rely on that instead.

-        # Handle force update (delete all existing documents)
-        if force_update and existing_docs_map:
-            document_ids = list(existing_docs_map.values())
-            logger.info(f"Force update: deleting {len(document_ids)} documents from collection {collection_id}")
-
-            for document_id in document_ids:
-                await self.delete_document(session=session, user_id=user_id, document_id=document_id)
-
-            stats["deleted_documents"] = len(document_ids)
-            existing_docs_map = {}
+        # For force_update, defer deletion to DocumentsUpdater (delete after successful reinsert)
+        # Keep existing_docs_map so updater can detect updates and remove old docs safely.

Additionally, if you still want to expose deleted_documents, set it after processing to the number of updated documents:

-        logger.info(f"Collection update complete. Stats: {stats}")
-        return stats
+        if force_update:
+            stats["deleted_documents"] = stats.get("updated_documents", 0)
+        logger.info(f"Collection update complete. Stats: {stats}")
+        return stats
api/helpers/files/parquet/extractor.py (1)

54-81: Chunk ID indexing mismatch remains unresolved.

The code assigns 1-based chunk IDs (line 81: range(1, num_chunks + 1)), but tests in api/tests/unit/test_helpers/test_documents_updater.py expect 0-based IDs (doc_chunk_ids = [0, 1, 2] at lines 52, 84, 112, 132). This was previously flagged as critical and marked "Skipped."

While respecting the decision to skip, be aware this inconsistency may cause runtime issues when auto-generated chunk IDs don't match test expectations or vector store conventions.

api/helpers/documents/documents_updater.py (1)

217-221: Hard limit on chunk comparison remains unresolved.

The 10,000 chunk limit (line 221) was previously flagged as a critical issue for documents exceeding this size. Changes beyond the first 10,000 chunks won't be detected, potentially causing incorrect update decisions. This was marked "Skipped" by the developer.

While respecting this decision, be aware this limitation may cause issues with very large documents.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f01d87 and 13743a6.

⛔ Files ignored due to path filters (3)
  • api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (32)
  • .github/badges/coverage-unit.json (1 hunks)
  • .github/badges/coverage.json (1 hunks)
  • api/endpoints/collections.py (4 hunks)
  • api/helpers/_accesscontroller.py (1 hunks)
  • api/helpers/_documentmanager.py (2 hunks)
  • api/helpers/_parsermanager.py (3 hunks)
  • api/helpers/data/hasher/__init__.py (1 hunks)
  • api/helpers/data/hasher/chunk_hasher.py (1 hunks)
  • api/helpers/documents/__init__.py (1 hunks)
  • api/helpers/documents/documents_updater.py (1 hunks)
  • api/helpers/files/__init__.py (1 hunks)
  • api/helpers/files/parquet/__init__.py (1 hunks)
  • api/helpers/files/parquet/extractor.py (1 hunks)
  • api/helpers/files/parquet/reader.py (1 hunks)
  • api/schemas/collections.py (2 hunks)
  • api/schemas/core/documents.py (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/conftest.py (1 hunks)
  • api/tests/integ/test_admin.py (2 hunks)
  • api/tests/integ/test_chat.py (1 hunks)
  • api/tests/integ/test_chunks.py (1 hunks)
  • api/tests/integ/test_collections.py (14 hunks)
  • api/tests/integ/test_documents.py (1 hunks)
  • api/tests/integ/test_files.py (1 hunks)
  • api/tests/integ/test_search.py (2 hunks)
  • api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
  • api/tests/unit/test_helpers/test_documents_updater.py (1 hunks)
  • api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
  • api/utils/exceptions.py (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (20)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
  • ChunkHasher (10-99)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
  • DocumentsUpdater (19-318)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/helpers/_parsermanager.py (2)
  • ParserManager (16-215)
  • parse_file (110-120)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (1)
  • parse_file (335-346)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (2)
  • FileSizeLimitExceededException (124-129)
  • InvalidPARQUETFormatException (143-145)
api/helpers/_parsermanager.py (1)
  • _detect_file_type (61-92)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
api/tests/unit/test_helpers/test_documents_updater.py (2)
api/helpers/documents/documents_updater.py (3)
  • _prepare_chunks_data (139-184)
  • _check_update_needed (186-242)
  • process_document (40-137)
api/utils/exceptions.py (1)
  • VectorizationFailedException (190-192)
api/helpers/_documentmanager.py (6)
api/helpers/documents/documents_updater.py (2)
  • DocumentsUpdater (19-318)
  • process_document (40-137)
api/helpers/files/parquet/extractor.py (3)
  • ParquetDataExtractor (13-179)
  • get_unique_document_names (22-35)
  • filter_document_table (38-51)
api/helpers/files/parquet/reader.py (4)
  • ParquetReader (20-139)
  • read_and_validate (32-83)
  • sort_table (101-118)
  • identify_metadata_columns (121-139)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/clients/vector_store/_elasticsearchvectorstoreclient.py (2)
  • create_collection (27-97)
  • delete_document (117-120)
api/clients/vector_store/_qdrantvectorstoreclient.py (2)
  • create_collection (45-54)
  • delete_document (77-79)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/schemas/admin/roles.py (1)
  • PermissionType (10-14)
api/endpoints/collections.py (5)
api/schemas/collections.py (3)
  • Collection (17-26)
  • Collections (29-31)
  • CollectionVisibility (9-11)
api/helpers/_documentmanager.py (4)
  • create_collection (74-83)
  • update_collection_from_parquet (496-608)
  • delete_collection (86-101)
  • update_collection (104-126)
api/sql/session.py (1)
  • get_db_session (33-39)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
api/utils/exceptions.py (1)
  • CollectionNotFoundException (78-80)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
  • ParquetDataExtractor (13-179)
api/helpers/files/parquet/reader.py (1)
  • ParquetReader (20-139)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/documents/documents_updater.py (4)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
api/helpers/files/parquet/extractor.py (5)
  • ParquetDataExtractor (13-179)
  • extract_chunk_ids (54-81)
  • extract_metadata_columns (109-120)
  • build_chunk_metadata (123-164)
  • get_row_values (167-179)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
🔇 Additional comments (34)
pyproject.toml (2)

3-3: Version bump to 0.3.0 is appropriate for this feature release.

The minor version bump aligns with the scope of changes—Parquet support, document updates via hashing, and collection endpoints—representing new functionality without breaking compatibility.


46-47: Security fix and new dependencies approved.

The upgrade to pyarrow==22.0.0 successfully resolves the critical security vulnerability from version 21.0.0 (as previously flagged and addressed). The addition of xxhash==3.6.0 is well-placed in the data dependencies section and pinned to the latest stable version with no known advisories. Both dependencies are needed for the new Parquet-driven document processing and hashing workflows.

Ensure all tests pass with these new optional dependencies, particularly integration tests for Parquet file handling and collection update flows.

api/helpers/data/hasher/chunk_hasher.py (2)

48-66: LGTM!

The use of xxHash64 is appropriate for fast, non-cryptographic content comparison. The implementation correctly encodes to UTF-8 before hashing.
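
A sketch of the hashing step as described (assuming the standard xxhash package API):

import xxhash

def compute_chunk_hash(concatenated: str) -> str:
    # xxHash64 over the UTF-8 bytes; hexdigest() yields 16 hex characters.
    return xxhash.xxh64(concatenated.encode("utf-8")).hexdigest()

assert len(compute_chunk_hash("content:hello")) == 16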


68-99: Logic is correct, but inherits collision risk.

The metadata filtering logic correctly excludes internal fields that shouldn't affect comparison. However, this method depends on concatenate_chunk_values, which has the hash collision issue flagged above.

api/schemas/core/documents.py (1)

21-21: LGTM!

The PARQUET enum member addition is consistent with existing file type declarations and follows the established pattern.

api/tests/integ/test_chunks.py (1)

17-17: LGTM!

The test correctly migrates to parameter-based collection creation using CollectionVisibility.PRIVATE.value, aligning with the updated API endpoint behavior.

api/tests/integ/test_admin.py (1)

11-11: LGTM!

The import and test update correctly implement the parameter-based collection creation pattern with proper enum value usage.

Also applies to: 406-406

api/tests/integ/conftest.py (1)

343-343: LGTM!

The PUT request wrappers are correctly implemented using the same pattern as existing HTTP method wrappers, with proper authorization header injection for all three permission levels.

Also applies to: 350-350, 357-357

api/tests/integ/test_documents.py (1)

16-16: LGTM!

The test correctly migrates to parameter-based collection creation, consistent with the API endpoint changes.

api/tests/integ/test_search.py (1)

17-17: LGTM - relies on default visibility.

The tests correctly migrate to parameter-based collection creation. Note that these tests rely on default visibility behavior rather than explicitly specifying it like other tests in the PR.

Also applies to: 103-103

api/utils/exceptions.py (1)

126-126: LGTM!

The new InvalidPARQUETFormatException follows existing exception patterns, and the MAX_PARQUET_CONTENT_SIZE constant provides a clear size constraint for PARQUET file handling.

Also applies to: 143-145

api/tests/integ/test_files.py (1)

15-16: LGTM: switch to query params with enum primitives.

Using params and CollectionVisibility.*.value matches schema and endpoint expectations.

Also applies to: 22-23

api/helpers/data/hasher/__init__.py (1)

1-5: LGTM: clear public export.

Docstring + all expose ChunkHasher as intended.

api/tests/unit/test_helpers/test_parsermanager.py (2)

435-447: LGTM: include .parquet and mapping completeness.

Extension map test updated correctly with FileType.PARQUET.


451-453: LGTM: content-type coverage includes PARQUET.

Test ensures VALID_CONTENT_TYPES has entries for all supported types.

api/helpers/_parsermanager.py (2)

33-37: LGTM: comprehensive Parquet MIME types.

Accepts application/vnd.apache.parquet, application/x-parquet, and octet-stream. Good coverage.


114-117: LGTM: explicit PARQUET rejection at parse level.

Clear error path keeps parquet flow scoped to collections endpoints.

api/helpers/_documentmanager.py (1)

486-493: Python 3.12+ is properly declared and enforced.

Verification confirms the codebase declares requires-python = ">=3.12" in pyproject.toml and CI workflows (run_tests.yml, codeql_scan.yml) are configured to use Python 3.12. The import of itertools.batched at line 3 of _documentmanager.py is compatible with these requirements. No replacement or runtime compatibility issues exist.
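
For reference, the 3.12-only helper in question behaves like this (plain standard-library usage, not project code):

from itertools import batched  # available from Python 3.12

list(batched(range(7), 3))
# -> [(0, 1, 2), (3, 4, 5), (6,)]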

api/helpers/files/parquet/__init__.py (1)

1-6: LGTM!

Clean package initialization following Python conventions. The module properly imports and re-exports the Parquet utilities.

api/tests/unit/test_helpers/test_chunk_hasher.py (1)

1-141: LGTM!

Comprehensive test coverage for ChunkHasher with well-structured fixtures and parametrized tests. The tests properly validate:

  • Hash determinism
  • Content and metadata differences
  • Exclusion of auto-generated metadata fields
  • Conditional chunk_index behavior
api/tests/unit/test_helpers/test_documents_updater.py (1)

1-450: LGTM!

Excellent test suite with comprehensive coverage of DocumentsUpdater functionality:

  • Chunk data preparation with and without metadata
  • Update necessity detection (new documents, force updates, hash comparisons)
  • Document processing flows (creation, updates, rollbacks)
  • Error handling and cleanup on failures

The tests properly use mocks and fixtures to isolate behavior.

api/tests/integ/test_collections.py (2)

254-384: LGTM!

Excellent integration test coverage for parquet-based collection updates:

  • Creation with initial parquet upload
  • PUT endpoint for force updates
  • PATCH with metadata and file updates, including multi-step validation of idempotency and change detection
  • Validation of malformed parquet files

The multi-step test (lines 295-371) particularly well validates that identical files don't trigger updates and that actual changes are properly detected.


11-13: All required test assets are present and properly committed.

The verification confirms that all three parquet files referenced in the tests (parquet.parquet, updated_parquet.parquet, and parquet_wrong_format.parquet) exist in the api/tests/integ/assets/ directory with appropriate file sizes. Tests can access these files without issues.

api/helpers/files/parquet/extractor.py (2)

143-163: LGTM!

The JSON parsing heuristic has been properly improved to check both start and end characters, reducing false positives. The exception handling is now specific to json.JSONDecodeError while still catching unexpected errors with proper logging.


22-35: LGTM!

The extraction and validation methods are well-implemented:

  • Efficient use of PyArrow's vectorized compute operations
  • Proper filtering and validation logic
  • Clear separation of concerns

Also applies to: 38-51, 84-106, 109-120, 167-179
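
An illustrative use of the vectorized operations mentioned above (column names are assumptions about the Parquet schema):

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"document_name": ["a", "a", "b"], "content": ["x", "y", "z"]})
unique_names = pc.unique(table["document_name"]).to_pylist()
# unique_names == ['a', 'b']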

api/helpers/files/parquet/reader.py (3)

32-83: LGTM!

The read_and_validate method properly handles:

  • Memory-efficient file size checking using seek/tell
  • File type validation
  • Parquet parsing with appropriate error handling
  • Column validation and chunk_index detection

120-139: LGTM!

The class constant mutation issue has been properly fixed. Line 135 now creates a copy before modification, preventing unintended side effects across method calls.


86-98: LGTM!

The validation and sorting methods are well-implemented:

  • Required columns validation is clear and provides helpful error messages
  • Sorting strategy ensures chunks are properly grouped by document and ordered by index

Also applies to: 101-118
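
The grouping-then-ordering sort reads like this in PyArrow (column names again assumed):

import pyarrow as pa

table = pa.table({
    "document_name": ["b", "a", "a"],
    "chunk_index": [1, 2, 1],
    "content": ["z", "y", "x"],
})
# Group chunks by document, then order within each document.
sorted_table = table.sort_by([("document_name", "ascending"), ("chunk_index", "ascending")])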

api/endpoints/collections.py (3)

39-59: LGTM!

Excellent error handling in the collection creation flow:

  • Creates collection first
  • Processes parquet file if provided
  • On any failure, properly cleans up by deleting the created collection
  • Ensures no orphaned collections remain on error

The try-except with cleanup prevents partial state from persisting.


170-187: LGTM!

The atomicity concern has been properly addressed. The file processing now happens before metadata updates (lines 170-178), ensuring that if the parquet upload fails, the collection metadata remains unchanged. This prevents partial updates.


192-220: LGTM!

Clean implementation of the force update endpoint. The endpoint properly:

  • Validates collection existence
  • Calls the document manager with force_update=True
  • Returns processing statistics
api/helpers/documents/documents_updater.py (3)

101-115: LGTM - Critical data loss issue properly resolved!

The update flow now correctly prevents data loss:

  1. Creates new document (line 102)
  2. Inserts and vectorizes chunks (lines 105-110)
  3. Only deletes old document after successful vectorization (lines 113-115)
  4. On failure, cleans up the new document while preserving the old one (lines 131-133)

This ensures transient vectorization errors don't result in permanent data loss.

Also applies to: 126-137
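
A minimal sketch of that ordering, with the collaborators passed in as callables (names and signatures are illustrative, not the class's actual API):

    async def replace_document(old_doc_id, chunks, create_doc, insert_chunks, delete_doc):
        new_doc_id = await create_doc()              # 1. create the replacement document
        try:
            await insert_chunks(new_doc_id, chunks)  # 2. insert and vectorize chunks
        except Exception:
            await delete_doc(new_doc_id)             # rollback: drop the new document...
            raise                                    # ...while the old one stays intact
        await delete_doc(old_doc_id)                 # 3. delete the old document only on success
        return new_doc_id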


139-184: LGTM!

The chunk preparation logic is clean and efficient:

  • Properly extracts content and metadata for each chunk
  • Conditionally computes hashes for change detection (skipped on force_update)
  • Uses static utility methods appropriately

186-242: LGTM!

The remaining methods are well-implemented:

  • _check_update_needed: Properly detects new documents, respects force_update flag, and compares hashes for change detection
  • _create_empty_document: Creates document records with appropriate error handling for missing collections
  • _insert_chunks: Properly batches chunk insertion with full metadata and wraps vectorization errors

Also applies to: 244-270, 272-318
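
The decision logic of _check_update_needed can be summarized in a hedged standalone form (all names assumed):

    def update_needed(doc_name, existing_docs, force_update, new_hashes, old_hashes) -> bool:
        if doc_name not in existing_docs:
            return True                  # brand-new document: always create
        if force_update:
            return True                  # force mode: rebuild regardless of content
        return new_hashes != old_hashes  # otherwise update only when content changed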

@FaheemBEG FaheemBEG force-pushed the 467-add-a-put-collections-endpoint-for-collections-update branch from 13743a6 to 6578eab on November 13, 2025 at 10:36

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (4)
.github/badges/coverage-unit.json (1)

1-1: Unit test coverage trade-off noted.

The slight decline in unit test coverage (-0.09pp) reflects the addition of Parquet features primarily covered by integration tests. While unit tests would provide faster feedback, the existing integration test coverage provides end-to-end validation of the critical paths.

docs/docs/guides/rag.md (1)

49-57: Verify consistency with notebook and API implementation.

This change mirrors the one in docs/tutorials/rag.ipynb (line 112), switching from JSON to form-encoded data. Ensure both documentation sources remain synchronized and accurately reflect the API's expected request format.

api/helpers/_documentmanager.py (1)

558-567: Force update still risks data loss despite earlier flag.

The force update flow deletes all existing documents (lines 563-565) before processing new ones from the parquet file. If any subsequent operation fails (e.g., vectorization timeout at line 605, memory issues during batch processing), the collection is left empty with no recovery path. The earlier review comment flagging this was marked "Skipped," but the data loss risk remains unmitigated.

Consider one of these approaches:

  1. Staged deletion: Process all new documents first, then delete old ones only after successful completion
  2. Backup mechanism: Store document IDs in a recovery table before deletion, with cleanup on success
  3. Transaction wrapper: Use a database transaction with rollback capability (though vector store operations may not be transactional)
         # Handle force update (delete all existing documents)
+        original_doc_ids = []
         if force_update and existing_docs_map:
             document_ids = list(existing_docs_map.values())
             logger.info(f"Force update: deleting {len(document_ids)} documents from collection {collection_id}")
 
+            # Store for potential rollback
+            original_doc_ids = document_ids.copy()
+
+        # Process new documents FIRST
+        # ... (process documents here)
+
+        # Only delete old documents after successful processing
+        if force_update and original_doc_ids:
-            for document_id in document_ids:
+            for document_id in original_doc_ids:
                 await self.delete_document(session=session, user_id=user_id, document_id=document_id)
 
-            stats["deleted_documents"] = len(document_ids)
-            existing_docs_map = {}
+            stats["deleted_documents"] = len(original_doc_ids)
api/helpers/data/hasher/chunk_hasher.py (1)

38-46: Hash collision risk persists—add inter-pair delimiter.

The concatenation still lacks a delimiter between key-value pairs (line 46), leaving the earlier collision concern unresolved. While each pair includes key:value, the absence of a separator between pairs can create ambiguous strings:

{"content": "abc", "page": "1"}      → "content:abcpage:1"
{"content": "abcpage:", "metadata": "1"} → "content:abcpage:metadata:1"  

Add a separator character between pairs:

         for key in sorted(chunk.keys()):
             value = chunk.get(key, None)
             if isinstance(value, (dict | list)):
                 parts.append(f"{key}:{json.dumps(value, ensure_ascii=False)}")
             else:
                 parts.append(f"{key}:{value}")
-        return "".join(parts)
+        return "|".join(parts)  # or "\n", "\x00", etc.

Also update the docstring example to reflect the separator.
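
A standalone demonstration of the ambiguity, using a simplified stand-in for concatenate_chunk_values (the JSON branch is omitted):

    def concat(chunk: dict, sep: str = "") -> str:
        return sep.join(f"{key}:{chunk[key]}" for key in sorted(chunk))

    # Without a separator, two different chunks collapse to the same string:
    assert concat({"content": "abc", "page": "1"}) == concat({"content": "abcpage:1"})
    # With "|" between pairs they remain distinguishable:
    assert concat({"content": "abc", "page": "1"}, "|") != concat({"content": "abcpage:1"}, "|")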

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13743a6 and b5515c2.

⛔ Files ignored due to path filters (3)
  • api/tests/integ/assets/parquet.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/parquet_wrong_format.parquet is excluded by !**/*.parquet
  • api/tests/integ/assets/updated_parquet.parquet is excluded by !**/*.parquet
📒 Files selected for processing (35)
  • .github/badges/coverage-unit.json (1 hunks)
  • .github/badges/coverage.json (1 hunks)
  • api/endpoints/collections.py (4 hunks)
  • api/helpers/_accesscontroller.py (1 hunks)
  • api/helpers/_documentmanager.py (2 hunks)
  • api/helpers/_parsermanager.py (3 hunks)
  • api/helpers/data/hasher/__init__.py (1 hunks)
  • api/helpers/data/hasher/chunk_hasher.py (1 hunks)
  • api/helpers/documents/__init__.py (1 hunks)
  • api/helpers/documents/documents_updater.py (1 hunks)
  • api/helpers/files/__init__.py (1 hunks)
  • api/helpers/files/parquet/__init__.py (1 hunks)
  • api/helpers/files/parquet/extractor.py (1 hunks)
  • api/helpers/files/parquet/reader.py (1 hunks)
  • api/schemas/collections.py (2 hunks)
  • api/schemas/core/documents.py (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_create_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1 hunks)
  • api/tests/integ/cassettes/test_collections.test_put_collection_with_parquet_file.yaml (1 hunks)
  • api/tests/integ/conftest.py (1 hunks)
  • api/tests/integ/test_admin.py (2 hunks)
  • api/tests/integ/test_chat.py (1 hunks)
  • api/tests/integ/test_chunks.py (1 hunks)
  • api/tests/integ/test_collections.py (14 hunks)
  • api/tests/integ/test_documents.py (1 hunks)
  • api/tests/integ/test_files.py (1 hunks)
  • api/tests/integ/test_search.py (2 hunks)
  • api/tests/unit/test_helpers/test_chunk_hasher.py (1 hunks)
  • api/tests/unit/test_helpers/test_documents_updater.py (1 hunks)
  • api/tests/unit/test_helpers/test_parsermanager.py (2 hunks)
  • api/utils/exceptions.py (2 hunks)
  • docs/docs/functionalities/document-management/collections.md (1 hunks)
  • docs/docs/guides/rag.md (1 hunks)
  • docs/tutorials/rag.ipynb (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (20)
api/tests/integ/test_admin.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/data/hasher/__init__.py (1)
api/helpers/data/hasher/chunk_hasher.py (1)
  • ChunkHasher (10-99)
api/helpers/_parsermanager.py (2)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (5)
api/helpers/documents/documents_updater.py (2)
  • DocumentsUpdater (19-318)
  • process_document (40-137)
api/helpers/files/parquet/extractor.py (3)
  • ParquetDataExtractor (13-179)
  • get_unique_document_names (22-35)
  • filter_document_table (38-51)
api/helpers/files/parquet/reader.py (4)
  • ParquetReader (20-139)
  • read_and_validate (32-83)
  • sort_table (101-118)
  • identify_metadata_columns (121-139)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/endpoints/collections.py (1)
  • create_collection (16-61)
api/tests/integ/test_search.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/helpers/documents/__init__.py (1)
api/helpers/documents/documents_updater.py (1)
  • DocumentsUpdater (19-318)
api/helpers/files/parquet/reader.py (3)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/utils/exceptions.py (2)
  • FileSizeLimitExceededException (124-129)
  • InvalidPARQUETFormatException (143-145)
api/helpers/_parsermanager.py (1)
  • _detect_file_type (61-92)
api/tests/integ/test_collections.py (3)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/test_documents.py (1)
  • collection (13-21)
api/tests/integ/test_documents.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/unit/test_helpers/test_parsermanager.py (4)
api/schemas/core/documents.py (1)
  • FileType (19-25)
api/helpers/_parsermanager.py (2)
  • ParserManager (16-215)
  • parse_file (110-120)
api/utils/exceptions.py (1)
  • UnsupportedFileTypeException (163-165)
api/helpers/_documentmanager.py (1)
  • parse_file (335-346)
api/helpers/files/parquet/__init__.py (2)
api/helpers/files/parquet/extractor.py (1)
  • ParquetDataExtractor (13-179)
api/helpers/files/parquet/reader.py (1)
  • ParquetReader (20-139)
api/tests/integ/test_chunks.py (1)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/helpers/_accesscontroller.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/schemas/admin/roles.py (1)
  • PermissionType (10-14)
api/helpers/documents/documents_updater.py (4)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
api/helpers/files/parquet/extractor.py (5)
  • ParquetDataExtractor (13-179)
  • extract_chunk_ids (54-81)
  • extract_metadata_columns (109-120)
  • build_chunk_metadata (123-164)
  • get_row_values (167-179)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/utils/exceptions.py (2)
  • CollectionNotFoundException (78-80)
  • VectorizationFailedException (190-192)
api/tests/integ/test_chat.py (1)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/integ/test_files.py (2)
api/schemas/collections.py (1)
  • CollectionVisibility (9-11)
api/tests/integ/conftest.py (1)
  • client (335-361)
api/tests/unit/test_helpers/test_documents_updater.py (2)
api/helpers/documents/documents_updater.py (4)
  • DocumentsUpdater (19-318)
  • _prepare_chunks_data (139-184)
  • _check_update_needed (186-242)
  • process_document (40-137)
api/utils/exceptions.py (1)
  • VectorizationFailedException (190-192)
api/endpoints/collections.py (2)
api/schemas/collections.py (3)
  • Collection (17-26)
  • Collections (29-31)
  • CollectionVisibility (9-11)
api/helpers/_documentmanager.py (4)
  • create_collection (74-83)
  • update_collection_from_parquet (496-608)
  • delete_collection (86-101)
  • update_collection (104-126)
api/helpers/data/hasher/chunk_hasher.py (1)
api/schemas/chunks.py (1)
  • Chunk (6-10)
api/tests/unit/test_helpers/test_chunk_hasher.py (1)
api/helpers/data/hasher/chunk_hasher.py (4)
  • ChunkHasher (10-99)
  • concatenate_chunk_values (19-46)
  • compute_chunk_hash (49-66)
  • compute_existing_chunk_hash (69-99)
🪛 markdownlint-cli2 (0.18.1)
docs/docs/functionalities/document-management/collections.md

51-51: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


59-59: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


67-67: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


94-94: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


115-115: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


124-124: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


128-128: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


133-133: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


136-136: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)


142-142: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


154-154: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


168-168: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Above

(MD022, blanks-around-headings)


168-168: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🔇 Additional comments (36)
api/schemas/core/documents.py (1)

19-26: LGTM! Clean addition of PARQUET file type support.

The new enum member follows the established pattern and enables Parquet file handling throughout the codebase.

api/helpers/files/__init__.py (1)

1-1: LGTM! Good practice to document the module's purpose.

The docstring clearly describes the module's role in the codebase.

api/utils/exceptions.py (1)

143-146: LGTM! New exception follows established pattern.

The InvalidPARQUETFormatException is consistent with existing format exceptions (e.g., InvalidJSONFormatException) and appropriately returns a 422 status code for validation errors.
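
As a hedged illustration of that pattern (a reconstruction, not the repository's exact code):

    from fastapi import HTTPException

    class InvalidPARQUETFormatException(HTTPException):
        # Mirrors the existing Invalid*FormatException pattern: fixed 422 status.
        def __init__(self, detail: str = "Invalid Parquet file format."):
            super().__init__(status_code=422, detail=detail)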

api/helpers/data/hasher/__init__.py (1)

1-5: LGTM! Clean module structure with clear public API.

The module provides a well-defined interface for chunk hashing utilities with:

  • Clear docstring describing the module's purpose
  • Explicit __all__ declaration for controlled public exports
  • Single import/export keeping the interface focused
docs/tutorials/rag.ipynb (1)

99-119: The tutorial correctly uses form-encoded data—no changes needed.

The verification confirms the API endpoint has been properly updated to accept form-encoded data via Form(...) parameters, including support for optional Parquet file uploads. The tutorial's change from json= to data= correctly matches this implementation.
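
For readers following along, a hedged sketch of such a request (the endpoint path, header, and placeholder values are assumptions for illustration, not taken verbatim from the tutorial):

    import requests

    base_url = "http://localhost:8000"  # placeholder deployment URL (assumption)
    api_key = "sk-..."                  # placeholder credential (assumption)

    with open("chunks.parquet", "rb") as f:
        response = requests.post(
            f"{base_url}/v1/collections",
            headers={"Authorization": f"Bearer {api_key}"},
            data={"name": "my-collection", "visibility": "private"},  # form fields, not JSON
            files={"file": ("chunks.parquet", f, "application/vnd.apache.parquet")},
        )
    response.raise_for_status()
    print(response.json()["id"])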

api/tests/integ/conftest.py (1)

343-343: LGTM! PUT method wrappers added consistently.

The three new PUT wrappers (put_without_permissions, put_with_permissions, put_master) follow the established pattern for other HTTP methods and enable comprehensive testing of the new PUT endpoints across different permission levels.

Also applies to: 350-350, 357-357

pyproject.toml (2)

3-3: Version bump looks appropriate.

The version increment from 0.2.8post3 to 0.3.0 correctly reflects the addition of new features (Parquet support, PUT endpoints for collections).


46-47: Dependencies added correctly.

Both pyarrow==22.0.0 and xxhash==3.6.0 are properly scoped to the api optional dependencies. The pyarrow upgrade addresses the critical security vulnerability identified in previous reviews.

api/helpers/_parsermanager.py (2)

19-19: LGTM! Parquet file type properly registered.

The extension mapping and MIME type validation for Parquet files are correctly configured. The MIME types include the official IANA-registered type (application/vnd.apache.parquet) and common variants, as confirmed in previous reviews.

Also applies to: 33-37


114-116: Appropriate runtime restriction for Parquet files.

The explicit rejection of Parquet files in parse_file is correct, as Parquet files are intended for direct collection ingestion (via the new endpoints) rather than document parsing. The error message clearly communicates this constraint.
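
A minimal runnable sketch of that guard, with stand-in types (the real module imports FileType and the exception rather than redefining them as done here):

    from enum import Enum

    class FileType(str, Enum):
        PDF = "pdf"
        PARQUET = "parquet"

    class UnsupportedFileTypeException(Exception):
        pass

    def parse_file(file_type: FileType) -> str:
        # Parquet is ingested directly into collections, never parsed as a document.
        if file_type == FileType.PARQUET:
            raise UnsupportedFileTypeException(
                "Parquet files are only supported for collection ingestion, not parsing."
            )
        return f"parsed a {file_type.value} file"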

api/tests/integ/test_chat.py (1)

29-29: LGTM! Test updated for form-data payload.

The change from json= to data= correctly aligns with the API's shift to form-encoded payloads for collection creation, ensuring the test exercises the updated endpoint behavior.

api/tests/integ/test_chunks.py (1)

17-17: LGTM! Consistent with API changes.

The update to use form data (data=) and the enum value (CollectionVisibility.PRIVATE.value) correctly aligns with the API's transition to form-encoded payloads and enum value serialization.

api/tests/integ/test_admin.py (2)

11-11: LGTM! Import added for enum usage.

The CollectionVisibility import is correctly added to support the enum value usage in the test.


406-406: LGTM! Test updated consistently.

The change to form data and enum value usage aligns with the broader API updates and improves type safety, as confirmed in previous reviews.

api/tests/integ/test_files.py (1)

15-22: LGTM! Test payload updated correctly for form-based collection creation.

The changes properly reflect the API's shift from JSON body to form data for collection creation. Using CollectionVisibility.*.value ensures the correct string values are passed in the form data.

api/tests/integ/test_documents.py (1)

16-16: LGTM! Consistent with form-based collection creation.

The test correctly uses form data with the visibility enum value.

api/tests/integ/test_search.py (2)

17-17: LGTM! Payload format updated correctly.

The test now uses form data (data=) instead of JSON payload. Note that visibility is not explicitly specified here, which relies on the API's default value. This aligns with your previous decision to skip adding explicit visibility parameters.


103-103: LGTM! Consistent payload format.

Same pattern as line 17 - using form data without explicit visibility, consistent with the previous design decision.

api/helpers/_accesscontroller.py (1)

278-282: LGTM! Form-based visibility check implemented correctly.

The code now correctly reads visibility from form data and compares it against the enum value. This aligns with the broader API changes to use form-based inputs for collection creation/updates.

Note: As you mentioned, the security concern about PUT/PATCH requests bypassing this check is being addressed in a separate PR.

api/tests/integ/cassettes/test_collections.test_patch_collection_metadata_and_file.yaml (1)

1-170: LGTM! Valid cassette fixture for integration tests.

This cassette correctly records embeddings API interactions for testing. The full embedding vectors are preserved, which ensures test fidelity. As you noted previously, cassette optimization was considered but skipped.

api/tests/unit/test_helpers/test_parsermanager.py (3)

437-437: LGTM! PARQUET extension mapping added correctly.

The test correctly verifies that .parquet files are mapped to FileType.PARQUET.


451-453: LGTM! Content type validation updated for PARQUET.

The test now correctly includes FileType.PARQUET in the validation check, ensuring PARQUET has valid content types defined.


455-466: LGTM! PARQUET rejection in parse_file verified.

The test correctly verifies that PARQUET files are rejected by parse_file with UnsupportedFileTypeException. As you noted previously, checking the specific exception message was considered but skipped.

api/helpers/documents/__init__.py (1)

1-5: LGTM! Clean module structure.

The module correctly re-exports DocumentsUpdater with proper docstring and __all__ declaration, following Python packaging best practices.

api/schemas/collections.py (1)

14-14: LGTM! Well-defined type alias for collection names.

The CollectionName alias provides clear validation requirements (strip whitespace, minimum length of 1) and can be reused across form-based collection endpoints. This is a clean approach for the shift from body-based to form-based inputs.

api/helpers/files/parquet/__init__.py (1)

1-6: LGTM!

Clean module initialization with proper public API exposure via __all__. The structure follows Python best practices.

api/tests/unit/test_helpers/test_chunk_hasher.py (1)

1-141: LGTM!

Comprehensive test suite with good coverage of ChunkHasher functionality:

  • Deterministic hashing behavior
  • Hash sensitivity to content and metadata changes
  • Proper exclusion of internal fields
  • Conditional chunk_index handling

The use of fixtures and parametrized tests enhances maintainability.
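
A tiny standalone check of the determinism these tests assert (xxhash.xxh64 is the real library API; the input string mimics the key:value concatenation discussed earlier in this review):

    import xxhash

    payload = "content:abc|page:1"
    h1 = xxhash.xxh64(payload.encode("utf-8")).hexdigest()
    h2 = xxhash.xxh64(payload.encode("utf-8")).hexdigest()
    assert h1 == h2  # the same input always yields the same 64-bit digest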

api/tests/unit/test_helpers/test_documents_updater.py (1)

1-450: LGTM!

Excellent test coverage for DocumentsUpdater with:

  • Comprehensive fixture setup using AsyncMock for async operations
  • Full coverage of chunk preparation, update detection, and document processing flows
  • Proper rollback testing to ensure error handling works correctly
  • Good use of patching to isolate units under test

The test suite provides confidence in the DocumentsUpdater implementation.

api/helpers/files/parquet/extractor.py (1)

1-179: LGTM!

Well-implemented ParquetDataExtractor with efficient PyArrow operations:

  • Vectorized operations for unique/filter using PyArrow compute
  • Proper sequential validation for chunk IDs
  • Improved JSON parsing heuristic with start/end checks and specific exception handling
  • Clean static method organization

The implementation provides a solid foundation for parquet-based document processing.

api/tests/integ/test_collections.py (1)

254-384: Excellent integration test coverage for parquet uploads — verified.

The new tests comprehensively validate parquet file handling:

  • Initial upload on collection creation
  • Force update via PUT
  • Incremental update via PATCH with change detection (no-op on identical file, proper stats on changes)
  • Error handling for invalid formats

The multi-step test at lines 295–371 is particularly valuable as it verifies the hash-based change detection works correctly across multiple uploads.

All test asset files (parquet.parquet, updated_parquet.parquet, parquet_wrong_format.parquet) are present in the repository and committed.

api/helpers/files/parquet/reader.py (1)

1-139: LGTM! The ParquetReader implementation is solid.

The class correctly handles Parquet file reading and validation with appropriate error handling. The mutation bug previously flagged (line 135) has been properly fixed using .copy(), and the file validation logic is integrated as intended.
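
The .copy() fix follows the standard pattern of never mutating class-level constants (an illustrative reduction, not the class's actual code):

    class ParquetReaderSketch:
        REQUIRED_COLUMNS = ["document_name", "content"]

        def identify_metadata_columns(self, all_columns: list[str]) -> list[str]:
            known = self.REQUIRED_COLUMNS.copy()  # copy first: the class constant stays pristine
            known.append("chunk_index")
            return [col for col in all_columns if col not in known]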

api/endpoints/collections.py (3)

39-60: Excellent error handling with cleanup.

The try-except block correctly handles failures during file processing by cleaning up the created collection before re-raising the exception. This prevents orphaned collections when Parquet processing fails.


170-188: Correct operation ordering for atomicity.

The file processing now happens before metadata updates, ensuring that metadata changes only persist if the Parquet upload succeeds. This correctly addresses the atomicity concern raised in previous reviews.
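
Reduced to a hedged sketch, with stub coroutines standing in for the document manager (all names illustrative):

    import asyncio

    async def update_from_parquet(file: str) -> dict:
        if file.endswith("bad.parquet"):
            raise ValueError("invalid parquet")  # upload failure aborts the whole PATCH
        return {"created_documents": 1}

    async def update_metadata(name: str) -> None:
        print(f"metadata updated: {name}")

    async def patch_collection(name: str, file: str | None) -> None:
        if file is not None:
            await update_from_parquet(file)  # file first: a failure here...
        await update_metadata(name)          # ...means metadata is never touched

    asyncio.run(patch_collection("demo", "good.parquet"))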


192-220: LGTM! Clean force update implementation.

The force_update_collection endpoint correctly delegates to update_collection_from_parquet with force_update=True and returns a consistent response structure, avoiding the response-type mismatch flagged earlier on the other endpoints.

api/helpers/documents/documents_updater.py (2)

101-125: Excellent update flow preventing data loss.

The implementation correctly creates the new document and vectorizes chunks before deleting the old document. This ensures the original document remains intact if vectorization fails, addressing the critical data loss concern raised in previous reviews.


272-318: Clean chunk insertion with proper error handling.

The _insert_chunks method correctly prepares chunk metadata and handles vectorization failures by raising VectorizationFailedException with context. The batch operation approach is efficient.

@FaheemBEG FaheemBEG force-pushed the 467-add-a-put-collections-endpoint-for-collections-update branch from b5515c2 to 9e9f8bf on November 13, 2025 at 11:11


Development

Successfully merging this pull request may close these issues.

Add a PUT /collections and modify the POST & PATCH /collection endpoints for collections update
