Conversation
Introduce tokenizer-aware chunking using Docling's HybridChunker so documents are stored as individual vectors instead of whole documents.
- Add shared DocumentChunker utility (chunking.py) used by all dataset types
- Use Docling's DocumentStream for in-memory conversion (no temp files)
- Save page images (PDF) and extracted pictures to disk during ingest
- Store chunk metadata (page_numbers, headings, picture_ids) for image references
- Rewrite chromadb and weaviate dataset types to use shared chunker
- Simplify Dockerfile by removing torch and docling system deps
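One practical detail behind the "store chunk metadata" bullet: ChromaDB metadata values must be scalars (str/int/float/bool), so list-valued fields like page_numbers typically get JSON-encoded on write and decoded on read. A minimal sketch of that flattening; the helper name and exact field shapes are assumptions for illustration, not the PR's actual code:

```python
import json

def chunk_metadata(chunk_index: int, page_numbers: list[int],
                   headings: list[str], picture_ids: list[str]) -> dict:
    # ChromaDB metadata values must be scalar, so list-valued
    # fields are JSON-encoded before storage...
    return {
        "chunk_index": chunk_index,
        "page_numbers": json.dumps(page_numbers),
        "headings": json.dumps(headings),
        "picture_ids": json.dumps(picture_ids),
    }

meta = chunk_metadata(0, [1, 2], ["Introduction"], ["pic-a1"])
# ...and decoded with json.loads on the read path:
pages = json.loads(meta["page_numbers"])
```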
- Add serve_image handler with regex validation and path traversal prevention
- Add build_image_urls utility and IMAGE_ENDPOINT_PREFIX constant in chunking.py
- Partition image storage by collection name for dataset isolation
- Update ChromaDB and Weaviate search to include image_urls in metadata
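The validation described in the first two bullets could look roughly like the sketch below. The function names build_image_urls and IMAGE_ENDPOINT_PREFIX come from the commit message; their bodies, the constant's value, the filename pattern, and the storage root are all assumptions, not the PR's actual implementation:

```python
import re
from pathlib import Path

# Assumed values; the real constants live in chunking.py.
IMAGE_ENDPOINT_PREFIX = "/images"
IMAGE_ROOT = Path("/var/lib/app/images")

# Illustrative filename pattern: '.' is excluded from the stem,
# so names like '../x.png' can never match.
FILENAME_RE = re.compile(r"^[A-Za-z0-9_-]+\.(png|jpg|jpeg)$")

def build_image_urls(collection: str, filenames: list[str]) -> list[str]:
    """Turn stored filenames into endpoint URLs, partitioned by collection."""
    return [f"{IMAGE_ENDPOINT_PREFIX}/{collection}/{name}" for name in filenames]

def resolve_image_path(collection: str, filename: str) -> Path:
    """Validate the filename and reject path traversal before serving."""
    if not FILENAME_RE.fullmatch(filename):
        raise ValueError("invalid image filename")
    candidate = (IMAGE_ROOT / collection / filename).resolve()
    # Even a hostile collection segment ('../../etc') cannot escape the root.
    if not candidate.is_relative_to(IMAGE_ROOT.resolve()):
        raise ValueError("path traversal detected")
    return candidate
```

Checking the resolved path against the root (rather than only the filename) is what closes the traversal hole when the collection segment itself is attacker-influenced.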
- Revert: re-add PyTorch (CPU) to the Dockerfile
…nd dataset handlers
- Change image filename generation to use UUIDs instead of sequential naming.
- Update validation regex for filenames to match the new UUID format.
- Adjust documentation to reflect changes in filename structure for image URLs.
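The UUID switch pairs naturally with a strict validation pattern, since uuid.uuid4() always renders as lowercase hex in 8-4-4-4-12 groups. A minimal sketch (names and the .png suffix are assumptions):

```python
import re
import uuid

# Strict pattern for UUID4-style names; rejects the old sequential names.
UUID_IMAGE_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.png$"
)

def new_image_filename() -> str:
    """Unpredictable, collision-resistant name (replaces sequential naming)."""
    return f"{uuid.uuid4()}.png"
```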
- Update dataset-related context classes to include dataset_id for better identification.
- Refactor image-serving endpoint to use dataset_id instead of collection_name, improving security by avoiding exposure of internal names.
- Introduce new purge_page_images method for managing image storage.
- Adjust various dataset types and handlers to utilize the updated context structure for ingestion and search operations.
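Because image storage is partitioned per dataset, purge_page_images can reduce to a single subdirectory removal. A sketch under that assumption; the signature and storage root are illustrative, not the PR's actual code:

```python
import shutil
from pathlib import Path

IMAGE_ROOT = Path("/var/lib/app/images")  # assumed storage root

def purge_page_images(dataset_id: str, root: Path = IMAGE_ROOT) -> None:
    """Delete all stored images for one dataset. Storage is partitioned
    by dataset_id, so a purge is one directory removal."""
    target = (root / dataset_id).resolve()
    # Guard against a dataset_id that would escape the image root.
    if target.is_relative_to(root.resolve()) and target.is_dir():
        shutil.rmtree(target)
```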
This pull request refactors and enhances the ChromaDB-backed dataset type, focusing on improved document chunking, ingestion, and search functionality. It removes the dependency on `docling` for direct document conversion, introduces a new chunking pipeline, and enables richer context retrieval for search results. The update also adds support for deleting entire datasets and their associated images.

Major changes include:
Dependency and Import Updates
- Updated `docling` to version 2.60.0 and added `docling-core` as a dependency in `pyproject.toml`, reflecting the move to a new document chunking pipeline.
- Removed imports of `docling`'s `DocumentConverter` and related classes, switching to the new `DocumentChunker` and utility functions.

Ingestion Pipeline Refactor
- Ingestion now runs documents through the shared `DocumentChunker`. Each file is split into multiple chunks, with each chunk stored as a separate vector in ChromaDB, including metadata for navigation and context retrieval. [1] [2] [3]
- Added a `_store_chunks` method to handle embedding and storing document chunks with neighbor references (prev/next pointers), supporting better context-aware retrieval.

Search Improvements
API and Interface Changes
- Updated ingestion and search interfaces to use the new `IngestContext` and `SearchContext`, reflecting more precise context handling. [1] [2]
- Extended the `delete` method to support deletion of entire ChromaDB collections and their associated page images.

Miscellaneous
These changes collectively modernize the ingestion and search experience, improve scalability, and lay the groundwork for richer document processing and retrieval features.
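As a rough illustration of the neighbor-reference scheme described under the ingestion refactor, each chunk can carry the ids of its previous and next siblings so a search hit can pull adjacent chunks for extra context. This is a sketch with assumed shapes, not the PR's `_store_chunks` implementation (ChromaDB, for one, cannot store `None` metadata values, so the real code would encode missing neighbors differently):

```python
import uuid

def link_chunks(texts: list[str]) -> list[dict]:
    """Assign an id to each chunk and wire prev/next neighbor references."""
    ids = [str(uuid.uuid4()) for _ in texts]
    records = []
    for i, text in enumerate(texts):
        records.append({
            "id": ids[i],
            "text": text,
            "prev_id": ids[i - 1] if i > 0 else None,
            "next_id": ids[i + 1] if i < len(texts) - 1 else None,
        })
    return records
```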