
feat: add document chunking #53

Closed
MagnusS0 wants to merge 2 commits into docling-project:main from MagnusS0:feature/add-chunking
Conversation

@MagnusS0
Contributor

@MagnusS0 MagnusS0 commented Aug 26, 2025

This should build the foundation for new chunking endpoints in docling-project/docling-serve#299.

@dolfim-ibm when you have time, please let me know whether this is the direction you were thinking of 😊

This pull request introduces a new document chunking feature. The changes include new configuration options for chunking, a chunking implementation using HybridChunker from docling-core, updates to result handling to support chunked responses, and dependency management for chunking functionality.

Document Chunking Feature:

  • Added ChunkingOptions, ChunkedDocumentResponseItem, and ChunkedDocumentResponse models to describe chunking configuration and output format in docling_jobkit/datamodel/chunking.py.
  • Implemented the DocumentChunker class in docling_jobkit/convert/chunking.py to handle document chunking, caching of chunker instances, and conversion of documents and conversion results into chunked responses.
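To make the two bullets above concrete, here is a minimal sketch of the idea: option and response models plus a cache that reuses one chunker per distinct configuration. It is not the code from this PR. The `max_words` field, the fields on the response item, the `_WordChunker` stand-in, and the `get_chunker` and `chunk_document` helpers are all assumptions for illustration, and plain dataclasses are used so the block runs without docling installed. The PR itself delegates the actual chunking to HybridChunker from docling-core.

```python
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass(frozen=True)
class ChunkingOptions:
    """Hypothetical options; frozen so instances are hashable cache keys."""
    max_words: int = 64


@dataclass
class ChunkedDocumentResponseItem:
    # Field names are assumptions, not the PR's actual schema
    filename: str
    chunk_index: int
    text: str


@dataclass
class ChunkedDocumentResponse:
    chunks: list[ChunkedDocumentResponseItem] = field(default_factory=list)


class _WordChunker:
    """Stand-in chunker: groups whitespace-separated words, max_words per chunk."""

    def __init__(self, options: ChunkingOptions) -> None:
        self.options = options

    def chunk(self, text: str) -> list[str]:
        words = text.split()
        size = self.options.max_words
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


@lru_cache(maxsize=8)
def get_chunker(options: ChunkingOptions) -> _WordChunker:
    # Equal option sets hash the same, so they share one chunker instance
    return _WordChunker(options)


def chunk_document(filename: str, text: str, options: ChunkingOptions) -> ChunkedDocumentResponse:
    chunker = get_chunker(options)
    items = [
        ChunkedDocumentResponseItem(filename, index, piece)
        for index, piece in enumerate(chunker.chunk(text))
    ]
    return ChunkedDocumentResponse(chunks=items)
```

Caching by option set matters mostly when constructing the chunker is expensive (for example, when it loads a tokenizer), which is presumably why the PR caches chunker instances.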

Integration with Conversion Workflow:

  • Updated ConvertDocumentsOptions in docling_jobkit/datamodel/convert.py to include do_chunking and chunking_options fields; chunking defaults to false for backward compatibility.
  • Modified result processing in docling_jobkit/convert/results.py to support chunked document export when requested.
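The opt-in behaviour described in these two bullets could look roughly like the sketch below. Only `do_chunking` and `chunking_options` come from the PR description. The `max_words` field, the `export_result` helper, and the dictionary result shape are assumptions made for the example, and the word-based splitting stands in for the real chunker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkingOptions:
    max_words: int = 64  # hypothetical field


@dataclass
class ConvertDocumentsOptions:
    # Only the two fields named in the PR description; the real model has more
    do_chunking: bool = False
    chunking_options: Optional[ChunkingOptions] = None


def export_result(text: str, options: ConvertDocumentsOptions) -> dict:
    """Return the plain export unless chunking was explicitly requested."""
    if not options.do_chunking:
        # Default path is unchanged, which keeps existing callers working
        return {"kind": "document", "content": text}
    chunk_opts = options.chunking_options or ChunkingOptions()
    words = text.split()
    size = chunk_opts.max_words
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    return {"kind": "chunked", "chunks": chunks}
```

Because `do_chunking` is false unless set, a request that does not mention chunking takes the same path as before the change.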

Result Type and Dependency Updates:

  • Extended the ResultType union in docling_jobkit/datamodel/result.py to include ChunkedDocumentResponse, ensuring chunked results are handled consistently.
  • Added a new chunking dependency group in pyproject.toml to require docling[chunking] for chunking support.
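As a rough sketch of what these two bullets imply for calling code: a union result type lets consumers branch on the chunked case, and an optional dependency group means the chunking path should check that its package is importable before use. `PlainDocumentResult`, `describe`, and `chunking_available` are hypothetical names, the fields are simplified, and `docling_core` is assumed to be the import name of the package that provides the chunker.

```python
from dataclasses import dataclass
from importlib.util import find_spec
from typing import Union


@dataclass
class PlainDocumentResult:
    """Hypothetical stand-in for the result types that already existed."""
    content: str


@dataclass
class ChunkedDocumentResponse:
    chunks: list[str]  # simplified; the real model holds response items


# Extending the union means every consumer of ResultType must handle chunks
ResultType = Union[PlainDocumentResult, ChunkedDocumentResponse]


def describe(result: ResultType) -> str:
    if isinstance(result, ChunkedDocumentResponse):
        return f"chunked ({len(result.chunks)} chunks)"
    return "document"


def chunking_available(module: str = "docling_core") -> bool:
    """True if the top-level package providing the chunker is importable."""
    return find_spec(module) is not None
```

A caller could use `chunking_available()` to return a clear error when chunking is requested but the optional group was not installed, instead of failing on import.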

Issue resolved by this Pull Request:
Resolves docling-project/docling-serve#44

Signed-off-by: Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>
Also improves documentation

Signed-off-by: Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>
@github-actions
Contributor

DCO Check Passed

Thanks @MagnusS0, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Aug 26, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
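For anyone who wants to check a title locally before pushing, the rule above can be approximated with Python's `re` module. The pattern is copied from the rule. The `is_conventional` helper is a hypothetical name, and this assumes Python's regex semantics are close enough to the matcher Mergify applies.

```python
import re

# Pattern copied verbatim from the merge protection rule
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """True if the title starts with a conventional-commit type, optional scope, optional '!'."""
    return CONVENTIONAL_TITLE.match(title) is not None


print(is_conventional("feat: add document chunking"))  # title of this PR
print(is_conventional("add document chunking"))        # missing type prefix
```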

@dolfim-ibm dolfim-ibm self-requested a review September 1, 2025 12:28
@codecov

codecov bot commented Sep 1, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 76 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling_jobkit/convert/chunking.py | 36.53% | 66 Missing ⚠️ |
| docling_jobkit/convert/results.py | 37.50% | 10 Missing ⚠️ |


@dolfim-ibm
Member

Thanks @MagnusS0. If you don't mind I can pick it up from here. What we would like to change is:

  1. Define a new independent task for chunking (which should be reflected in docling-serve as a new endpoint)
  2. Design the task so that it can also populate a vector database. That part will come later, but we should make sure the task is able to handle it.

I expect quick iterations over the next few days.
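One way to read the two points above is a standalone task that hands its chunks to a pluggable sink, so a vector database writer can be added later without changing the task. The sketch below is only an illustration of that shape, not the design that followed: `ChunkSink`, `InMemorySink`, `ChunkTask`, and `max_words` are hypothetical names, and word-based splitting stands in for the real chunker.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ChunkSink(Protocol):
    """Anything that can receive the chunks of one source document."""

    def write(self, source: str, chunks: list[str]) -> None: ...


@dataclass
class InMemorySink:
    """Collects chunks in a dict; a vector database writer could replace it."""
    stored: dict[str, list[str]] = field(default_factory=dict)

    def write(self, source: str, chunks: list[str]) -> None:
        self.stored[source] = chunks


@dataclass
class ChunkTask:
    """Chunking as its own task, independent of document conversion."""
    source: str
    text: str
    max_words: int = 64

    def run(self, sink: ChunkSink) -> int:
        words = self.text.split()
        size = self.max_words
        chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
        sink.write(self.source, chunks)
        return len(chunks)  # number of chunks handed to the sink
```

With this split, a new endpoint only needs to build a `ChunkTask` and choose a sink; the task itself does not need to know where the chunks end up.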

@MagnusS0
Contributor Author

MagnusS0 commented Sep 1, 2025

Hey @dolfim-ibm absolutely!

Excited to test it out when it's ready!

@dolfim-ibm dolfim-ibm mentioned this pull request Sep 3, 2025
@dolfim-ibm
Member

Superseded by #54.

@dolfim-ibm dolfim-ibm closed this Sep 8, 2025

Development

Successfully merging this pull request may close these issues.

Feature Proposal: Add Chunked Output Endpoints

2 participants