
feat: add chunking support for document conversion #299

Closed
MagnusS0 wants to merge 4 commits into docling-project:main from MagnusS0:feat-chunked-endpoints

MagnusS0 commented Aug 4, 2025

Document Chunking Endpoint

This pull request adds document chunking support to the API, enabling Retrieval-Augmented Generation (RAG) workflows by returning chunked document responses. The changes introduce new request options, response models, and backend logic to handle chunking, including configuration for tokenization and serialization. Documentation is updated to explain usage and response formats for chunking. Existing usage is unaffected: the implementation uses a lazy-import pattern, so chunking dependencies are loaded only when chunking is requested. When do_chunking=false (the default), the API behaves exactly as before.
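As a rough illustration of the lazy-import idea described above, here is a minimal sketch. It is not the PR's code: `ChunkingUnavailableError`, `load_chunker_module`, and `prepare_response` are hypothetical names, and the default module name `docling.chunking` is an assumption about where the optional dependency lives.

```python
import importlib


class ChunkingUnavailableError(RuntimeError):
    """Raised when chunking is requested but its optional dependencies are missing."""


def load_chunker_module(module_name: str = "docling.chunking"):
    """Import the chunking module only at the moment it is needed.

    The default module name is an assumption made for this sketch.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ChunkingUnavailableError(
            f"Chunking requested but '{module_name}' could not be imported"
        ) from exc


def prepare_response(document_text: str, do_chunking: bool = False,
                     module_name: str = "docling.chunking") -> dict:
    """Hypothetical response builder showing the two code paths."""
    if not do_chunking:
        # Default path: no chunking import is attempted at all.
        return {"kind": "document", "content": document_text}
    # Chunking path: the optional dependency is imported lazily here.
    chunking = load_chunker_module(module_name)
    return {"kind": "chunked", "chunker_module": chunking.__name__}
```

With `do_chunking=False` the function never touches the optional module, so installations without the chunking extras keep working as before. A missing dependency only surfaces as an explicit error when chunking is actually requested.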

API and Model Changes:

  • Added do_chunking and chunking_options to ConvertDocumentsRequestOptions, allowing clients to request chunked document responses with customizable chunking parameters such as max_tokens, overlap, tokenizer, and markdown table serialization.
  • Introduced ChunkedDocumentResponse and ChunkedDocumentResponseItem models to represent chunked output, including contextualized text, raw text, headings, page numbers, and metadata.
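To make the shape of these options and models more concrete, the sketch below uses standard-library dataclasses rather than whatever model base class the PR actually uses. The class names `ChunkedDocumentResponse` and `ChunkedDocumentResponseItem` and the option names `do_chunking`, `chunking_options`, `max_tokens`, `overlap`, and `tokenizer` come from the description above. Everything else is assumed for illustration: `ChunkingOptions`, `build_request_options`, the `use_markdown_tables` flag, the response field names, and all default values.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Optional


@dataclass
class ChunkingOptions:
    max_tokens: int = 512              # default value is an assumption
    overlap: int = 0                   # default value is an assumption
    tokenizer: str = "sentence-transformers/all-MiniLM-L6-v2"  # assumed default
    use_markdown_tables: bool = False  # hypothetical field name


@dataclass
class ChunkedDocumentResponseItem:
    # Field names are guesses based on the description, not the PR's schema.
    contextualized_text: str
    raw_text: str
    headings: list[str] = field(default_factory=list)
    page_numbers: list[int] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class ChunkedDocumentResponse:
    chunks: list[ChunkedDocumentResponseItem] = field(default_factory=list)


def build_request_options(do_chunking: bool = False,
                          chunking_options: Optional[ChunkingOptions] = None) -> dict[str, Any]:
    """Build the options part of a request payload (hypothetical helper)."""
    options: dict[str, Any] = {"do_chunking": do_chunking}
    if do_chunking:
        # Chunking parameters are only sent when chunking is requested.
        options["chunking_options"] = asdict(chunking_options or ChunkingOptions())
    return options
```

For example, `build_request_options(True, ChunkingOptions(max_tokens=256, overlap=32))` yields a dictionary with `do_chunking` set and a nested `chunking_options` mapping, while the default call produces only `{"do_chunking": False}`.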

Backend Logic:

  • Updated API endpoints (/v1/convert/source, /v1/convert/file, /v1/result/{task_id}) to support returning either standard or chunked document responses based on request parameters.
  • Implemented chunking logic in response_preparation.py, including HuggingFace tokenizer caching, chunk creation, and response assembly. Added support for markdown table serialization and custom tokenizer selection.
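The tokenizer caching and chunk creation mentioned above could look roughly like the following. This is a simplified stand-in, not the code in response_preparation.py: it swaps in a whitespace tokenizer so the example runs without extra dependencies, and uses a plain sliding window in place of whatever chunker the PR uses. `WhitespaceTokenizer`, `get_tokenizer`, `chunk_text`, and `LOAD_COUNT` are all hypothetical names.

```python
from functools import lru_cache

LOAD_COUNT = {"n": 0}  # counts loader calls, only to make the caching visible


class WhitespaceTokenizer:
    """Stand-in for a HuggingFace tokenizer; splits on whitespace."""

    def __init__(self, name: str):
        self.name = name

    def tokenize(self, text: str) -> list[str]:
        return text.split()


@lru_cache(maxsize=8)
def get_tokenizer(name: str) -> WhitespaceTokenizer:
    # A real service would do the expensive load here, for example
    # transformers.AutoTokenizer.from_pretrained(name). The cache ensures
    # each tokenizer name is loaded at most once per process.
    LOAD_COUNT["n"] += 1
    return WhitespaceTokenizer(name)


def chunk_text(text: str, tokenizer_name: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most max_tokens tokens, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = get_tokenizer(tokenizer_name).tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Calling `chunk_text("a b c d e", "demo", max_tokens=3, overlap=1)` returns `["a b c", "c d e"]`, and repeated calls with the same tokenizer name reuse the cached instance instead of loading it again.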

Testing:

  • Added tests covering various chunking scenarios and configurations.
  • Tests verify both single-document and multiple-document chunking workflows.
  • Tests validate error handling for missing dependencies.

Documentation Updates:


github-actions bot commented Aug 4, 2025

DCO Check Passed

Thanks @MagnusS0, all your commits are properly signed off. 🎉


mergify bot commented Aug 4, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

…noreply.github.com>

I, Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8a19f46

Signed-off-by: Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>
…noreply.github.com>

I, Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8a19f46
I, Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8f4b928

Signed-off-by: Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>
@dolfim-ibm
Member

@MagnusS0 as introduced in the issue, we moved the results generation to the jobkit library, so most of this PR should be moved there.

My proposal would also be to use a different endpoint, e.g. /v1/chunk/....

Do you want to contribute this change, or would you prefer that we do it on our side?
I was thinking of moving part of the work to docling-jobkit and then updating your PR to use it. That way your contribution history would be preserved.

@MagnusS0
Author

@dolfim-ibm Awesome to hear! I'm a bit busy this weekend, but I can probably spend some time on it in the next couple of weeks.

If you have any input and have given it some thought for how to implement it in the jobkit library I would love to hear it 😊

@dolfim-ibm
Member

Superseded by #353

dolfim-ibm closed this Sep 9, 2025

Development

Successfully merging this pull request may close these issues.

Feature Proposal: Add Chunked Output Endpoints