Conversation
Introduce tokenizer-aware chunking using Docling's HybridChunker so documents are stored as individual vectors instead of whole documents.
- Add shared DocumentChunker utility (chunking.py) used by all dataset types
- Use Docling's DocumentStream for in-memory conversion (no temp files)
- Save page images (PDF) and extracted pictures to disk during ingest
- Store chunk metadata (page_numbers, headings, picture_ids) for image references
- Rewrite chromadb and weaviate dataset types to use shared chunker
- Simplify Dockerfile by removing torch and docling system deps
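One practical detail behind the "store chunk metadata" bullet: ChromaDB metadata values must be scalars (str/int/float/bool), so list-valued fields like page_numbers typically get JSON-encoded on write and decoded on read. A minimal sketch of that flattening; the helper name and exact field shapes are assumptions for illustration, not the PR's actual code:

```python
import json

def chunk_metadata(chunk_index: int, page_numbers: list[int],
                   headings: list[str], picture_ids: list[str]) -> dict:
    # ChromaDB metadata values must be scalar, so list-valued
    # fields are JSON-encoded before storage...
    return {
        "chunk_index": chunk_index,
        "page_numbers": json.dumps(page_numbers),
        "headings": json.dumps(headings),
        "picture_ids": json.dumps(picture_ids),
    }

meta = chunk_metadata(0, [1, 2], ["Introduction"], ["pic-a1"])
# ...and decoded with json.loads on the read path:
pages = json.loads(meta["page_numbers"])
```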
- Add serve_image handler with regex validation and path traversal prevention
- Add build_image_urls utility and IMAGE_ENDPOINT_PREFIX constant in chunking.py
- Partition image storage by collection name for dataset isolation
- Update ChromaDB and Weaviate search to include image_urls in metadata
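The validation described in the first two bullets could look roughly like the sketch below. The function names build_image_urls and IMAGE_ENDPOINT_PREFIX come from the commit message; their bodies, the constant's value, the filename pattern, and the storage root are all assumptions, not the PR's actual implementation:

```python
import re
from pathlib import Path

# Assumed values; the real constants live in chunking.py.
IMAGE_ENDPOINT_PREFIX = "/images"
IMAGE_ROOT = Path("/var/lib/app/images")

# Illustrative filename pattern: '.' is excluded from the stem,
# so names like '../x.png' can never match.
FILENAME_RE = re.compile(r"^[A-Za-z0-9_-]+\.(png|jpg|jpeg)$")

def build_image_urls(collection: str, filenames: list[str]) -> list[str]:
    """Turn stored filenames into endpoint URLs, partitioned by collection."""
    return [f"{IMAGE_ENDPOINT_PREFIX}/{collection}/{name}" for name in filenames]

def resolve_image_path(collection: str, filename: str) -> Path:
    """Validate the filename and reject path traversal before serving."""
    if not FILENAME_RE.fullmatch(filename):
        raise ValueError("invalid image filename")
    candidate = (IMAGE_ROOT / collection / filename).resolve()
    # Even a hostile collection segment ('../../etc') cannot escape the root.
    if not candidate.is_relative_to(IMAGE_ROOT.resolve()):
        raise ValueError("path traversal detected")
    return candidate
```

Checking the resolved path against the root (rather than only the filename) is what closes the traversal hole when the collection segment itself is attacker-influenced.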
- Revert: re-add PyTorch (CPU) to the Dockerfile
…nd dataset handlers
- Change image filename generation to use UUIDs instead of sequential naming.
- Update validation regex for filenames to match the new UUID format.
- Adjust documentation to reflect changes in filename structure for image URLs.
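The UUID switch pairs naturally with a strict validation pattern, since uuid.uuid4() always renders as lowercase hex in 8-4-4-4-12 groups. A minimal sketch (names and the .png suffix are assumptions):

```python
import re
import uuid

# Strict pattern for UUID4-style names; rejects the old sequential names.
UUID_IMAGE_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.png$"
)

def new_image_filename() -> str:
    """Unpredictable, collision-resistant name (replaces sequential naming)."""
    return f"{uuid.uuid4()}.png"
```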
- Update dataset-related context classes to include dataset_id for better identification.
- Refactor image-serving endpoint to use dataset_id instead of collection_name, improving security by avoiding exposure of internal names.
- Introduce new purge_page_images method for managing image storage.
- Adjust various dataset types and handlers to utilize the updated context structure for ingestion and search operations.
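Because image storage is partitioned per dataset, purge_page_images can reduce to a single subdirectory removal. A sketch under that assumption; the signature and storage root are illustrative, not the PR's actual code:

```python
import shutil
from pathlib import Path

IMAGE_ROOT = Path("/var/lib/app/images")  # assumed storage root

def purge_page_images(dataset_id: str, root: Path = IMAGE_ROOT) -> None:
    """Delete all stored images for one dataset. Storage is partitioned
    by dataset_id, so a purge is one directory removal."""
    target = (root / dataset_id).resolve()
    # Guard against a dataset_id that would escape the image root.
    if target.is_relative_to(root.resolve()) and target.is_dir():
        shutil.rmtree(target)
```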
This pull request refactors and enhances the ChromaDB-backed dataset type, focusing on improved document chunking, ingestion, and search functionality. It removes the dependency on `docling` for direct document conversion, introduces a new chunking pipeline, and enables richer context retrieval for search results. The update also adds support for deleting entire datasets and their associated images.

Major changes include:
Dependency and Import Updates
- Updated `docling` to version 2.60.0 and added `docling-core` as a dependency in `pyproject.toml`, reflecting the move to a new document chunking pipeline.
- Removed imports of `docling`'s `DocumentConverter` and related classes, switching to the new `DocumentChunker` and utility functions.

Ingestion Pipeline Refactor
- Ingestion now runs documents through the shared `DocumentChunker`. Each file is split into multiple chunks, with each chunk stored as a separate vector in ChromaDB, including metadata for navigation and context retrieval. [1] [2] [3]
- Added a `_store_chunks` method to handle embedding and storing document chunks with neighbor references (prev/next pointers), supporting better context-aware retrieval.

Search Improvements
API and Interface Changes
- Updated ingestion and search interfaces to use the new `IngestContext` and `SearchContext`, reflecting more precise context handling. [1] [2]
- Extended the `delete` method to support deletion of entire ChromaDB collections and their associated page images.

Miscellaneous
These changes collectively modernize the ingestion and search experience, improve scalability, and lay the groundwork for richer document processing and retrieval features.
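As a rough illustration of the neighbor-reference scheme described under the ingestion refactor, each chunk can carry the ids of its previous and next siblings so a search hit can pull adjacent chunks for extra context. This is a sketch with assumed shapes, not the PR's `_store_chunks` implementation (ChromaDB, for one, cannot store `None` metadata values, so the real code would encode missing neighbors differently):

```python
import uuid

def link_chunks(texts: list[str]) -> list[dict]:
    """Assign an id to each chunk and wire prev/next neighbor references."""
    ids = [str(uuid.uuid4()) for _ in texts]
    records = []
    for i, text in enumerate(texts):
        records.append({
            "id": ids[i],
            "text": text,
            "prev_id": ids[i - 1] if i > 0 else None,
            "next_id": ids[i + 1] if i < len(texts) - 1 else None,
        })
    return records
```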