Implement chunking #88

Merged
shubham3121 merged 12 commits into main from chore/implement-chunking on Feb 23, 2026

@shubham3121 (Member) commented Feb 19, 2026

This pull request refactors and enhances the ChromaDB-backed dataset type, focusing on improved document chunking, ingestion, and search functionality. It removes the dependency on docling for direct document conversion, introduces a new chunking pipeline, and enables richer context retrieval for search results. The update also adds support for deleting entire datasets and their associated images.

Major changes include:

Dependency and Import Updates

  • Upgraded docling to version 2.60.0 and added docling-core as a dependency in pyproject.toml, reflecting the move to a new document chunking pipeline.
  • Removed direct imports and usage of docling’s DocumentConverter and related classes, switching to the new DocumentChunker and utility functions.

Ingestion Pipeline Refactor

  • Replaced the old document parsing and ingestion logic with a chunk-based pipeline using DocumentChunker. Each file is now split into multiple chunks, with each chunk stored as a separate vector in ChromaDB, including metadata for navigation and context retrieval.
  • Added a new _store_chunks method to handle embedding and storing document chunks with neighbor references (prev/next pointers), supporting better context-aware retrieval.
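
To illustrate the prev/next pointer idea, here is a minimal sketch in plain Python. It is not the code from this PR: the function name `build_chunk_records`, the `file_id:index` ID scheme, and the metadata keys are assumptions made for the example. Missing neighbors are stored as an empty string on the assumption that the vector store's metadata should hold only scalar values.

```python
def build_chunk_records(file_id: str, texts: list[str]) -> list[dict]:
    """Turn an ordered list of chunk texts into records with neighbor pointers.

    Hypothetical helper: names and ID scheme are illustrative only.
    """
    ids = [f"{file_id}:{i}" for i in range(len(texts))]
    last = len(texts) - 1
    records = []
    for i, text in enumerate(texts):
        records.append(
            {
                "id": ids[i],
                "document": text,
                "metadata": {
                    "file_id": file_id,
                    "chunk_index": i,
                    # Empty string marks "no neighbor" at either end of the file.
                    "prev_id": ids[i - 1] if i > 0 else "",
                    "next_id": ids[i + 1] if i < last else "",
                },
            }
        )
    return records


records = build_chunk_records("report", ["Intro.", "Method.", "Results."])
```

Each record can then be embedded and written as its own vector, with the pointers carried along as metadata.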

Search Improvements

  • Refactored the search method to return top-k matching chunks, each enriched with neighboring chunk text for improved retrieval-augmented generation (RAG) context. Added helper methods for processing query results and enriching them with neighbor context.
  • Updated similarity filtering logic and ensured image URLs are constructed using dataset IDs to prevent leaking internal collection names.
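
The neighbor enrichment step can be sketched as follows. This is an illustration under stated assumptions, not the PR's helper methods: `enrich_with_neighbors` is a hypothetical name, and an in-memory dict stands in for the lookups that would go to ChromaDB.

```python
def enrich_with_neighbors(hit_ids: list[str], store: dict[str, dict]) -> list[dict]:
    """Attach the text of the previous and next chunk to each search hit.

    `store` maps chunk ID to a record with "text", "prev_id" and "next_id";
    it is a stand-in for a vector-store lookup (assumption for this sketch).
    """
    enriched = []
    for chunk_id in hit_ids:
        record = store[chunk_id]
        # An empty or unknown neighbor ID simply yields no extra text.
        prev_rec = store.get(record["prev_id"])
        next_rec = store.get(record["next_id"])
        prev_text = prev_rec["text"] if prev_rec else ""
        next_text = next_rec["text"] if next_rec else ""
        context = "\n".join(p for p in (prev_text, record["text"], next_text) if p)
        enriched.append({"id": chunk_id, "text": record["text"], "context": context})
    return enriched


STORE = {
    "doc:0": {"text": "Intro.", "prev_id": "", "next_id": "doc:1"},
    "doc:1": {"text": "Method.", "prev_id": "doc:0", "next_id": "doc:2"},
    "doc:2": {"text": "Results.", "prev_id": "doc:1", "next_id": ""},
}
results = enrich_with_neighbors(["doc:1", "doc:0"], STORE)
```

Keeping the matched text and the widened context as separate fields lets the caller rank on the hit itself while passing the larger window to the generator.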

API and Interface Changes

  • Changed ingestion and search method signatures to use IngestContext and SearchContext, reflecting more precise context handling.
  • Added a delete method to support deletion of entire ChromaDB collections and their associated page images.
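
For the image half of dataset deletion, a rough sketch is shown below. It assumes one image directory per dataset under a shared root; the function name `purge_dataset_images` and that layout are assumptions for illustration, and dropping the ChromaDB collection itself is left out.

```python
import shutil
import tempfile
from pathlib import Path


def purge_dataset_images(image_root: Path, dataset_id: str) -> bool:
    """Delete one dataset's image directory; return True if it was removed.

    Hypothetical helper: assumes images live under image_root/<dataset_id>/.
    """
    root = image_root.resolve()
    target = (root / dataset_id).resolve()
    # Refuse to delete the root itself or anything that resolves outside it.
    if target == root or not target.is_relative_to(root):
        return False
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root_dir = Path(tmp)
    (root_dir / "ds1").mkdir()
    (root_dir / "ds1" / "page.png").write_bytes(b"")
    (root_dir / "ds2").mkdir()
    first = purge_dataset_images(root_dir, "ds1")
    second = purge_dataset_images(root_dir, "ds1")
    escaped = purge_dataset_images(root_dir, "..")
    ds2_kept = (root_dir / "ds2").is_dir()
```

The containment check matters because the dataset identifier originates from a request, so it should never be able to steer a recursive delete outside the image root.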

Miscellaneous

  • Updated the healthcheck and enabled checks to reflect the new dependency requirements (ChromaDB only).

These changes collectively modernize the ingestion and search experience, improve scalability, and lay the groundwork for richer document processing and retrieval features.

Introduce tokenizer-aware chunking using Docling's HybridChunker so documents are stored as individual vectors instead of whole documents.

- Add shared DocumentChunker utility (chunking.py) used by all dataset types
- Use Docling's DocumentStream for in-memory conversion (no temp files)
- Save page images (PDF) and extracted pictures to disk during ingest
- Store chunk metadata (page_numbers, headings, picture_ids) for image references
- Rewrite chromadb and weaviate dataset types to use shared chunker
- Simplify Dockerfile by removing torch and docling system deps

- Add serve_image handler with regex validation and path traversal prevention
- Add build_image_urls utility and IMAGE_ENDPOINT_PREFIX constant in chunking.py
- Partition image storage by collection name for dataset isolation
- Update ChromaDB and Weaviate search to include image_urls in metadata
- Revert the torch removal: add CPU-only PyTorch back to the Dockerfile
…nd dataset handlers

- Change image filename generation to use UUIDs instead of sequential naming.
- Update validation regex for filenames to match the new UUID format.
- Adjust documentation to reflect changes in filename structure for image URLs.
- Update dataset-related context classes to include dataset_id for better identification.
- Refactor image-serving endpoint to use dataset_id instead of collection_name, improving security by avoiding exposure of internal names.
- Introduce new purge_page_images method for managing image storage.
- Adjust various dataset types and handlers to utilize the updated context structure for ingestion and search operations.
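
The filename validation and path traversal checks described in these commits could look roughly like this. It is a sketch under assumptions, not the `serve_image` handler itself: the regexes, the `.png` extension, the `uuid4().hex` naming, and the function names are all hypothetical.

```python
import re
import uuid
from pathlib import Path
from typing import Optional

# Assumed formats: 32 lowercase hex characters plus ".png" for filenames,
# and a conservative character set for dataset IDs.
FILENAME_RE = re.compile(r"[0-9a-f]{32}\.png")
DATASET_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def new_image_filename() -> str:
    """Generate a non-sequential filename so image URLs are not guessable."""
    return f"{uuid.uuid4().hex}.png"


def resolve_image_path(image_root: Path, dataset_id: str, filename: str) -> Optional[Path]:
    """Return the on-disk path for an image request, or None if it is invalid."""
    # First line of defense: both components must match a strict pattern.
    if not DATASET_ID_RE.fullmatch(dataset_id) or not FILENAME_RE.fullmatch(filename):
        return None
    root = image_root.resolve()
    candidate = (root / dataset_id / filename).resolve()
    # Second line of defense: the resolved path must stay under the root.
    if not candidate.is_relative_to(root):
        return None
    return candidate


name = new_image_filename()
ok = resolve_image_path(Path("/tmp/images"), "ds1", name)
bad_file = resolve_image_path(Path("/tmp/images"), "ds1", "../../etc/passwd")
bad_dataset = resolve_image_path(Path("/tmp/images"), "../x", name)
```

Validating with `fullmatch` rejects separators and dot segments up front, and the resolved-path check guards against anything the patterns might miss.
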
@shubham3121 merged commit 9f63f55 into main on Feb 23, 2026
2 checks passed