feat: add chunking support for document conversion #299
MagnusS0 wants to merge 4 commits into docling-project:main from
Conversation
✅ DCO Check Passed. Thanks @MagnusS0, all your commits are properly signed off. 🎉
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
I, Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8a19f46
I, Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8f4b928
Signed-off-by: Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>
@MagnusS0 as introduced in the issue, we moved the results generation to the jobkit library, so most of this PR should be moved there. My proposal would also be to use a different endpoint, e.g. Do you want to contribute this change, or do you prefer that we do it from our side?
@dolfim-ibm Awesome to hear! I'm a bit busy this weekend, but I can probably spend some time on it in the next couple of weeks. If you have any input and have given some thought to how to implement it in the jobkit library, I would love to hear it 😊
Superseded by #353
Document Chunking Endpoint
This pull request adds support for document chunking in the API, enabling Retrieval-Augmented Generation (RAG) workflows by returning chunked document responses. The changes introduce new request options, response models, and backend logic to handle chunking, including configuration for tokenization and serialization. Documentation is updated to explain usage and response formats for chunking. Current usage is unaffected: the implementation uses a lazy-import pattern, so chunking dependencies are only loaded when requested. When `do_chunking=false` (the default), the API behaves exactly as before.

API and Model Changes:
- Added `do_chunking` and `chunking_options` to `ConvertDocumentsRequestOptions`, allowing clients to request chunked document responses with customizable chunking parameters such as `max_tokens`, `overlap`, `tokenizer`, and markdown table serialization.
- Added `ChunkedDocumentResponse` and `ChunkedDocumentResponseItem` models to represent chunked output, including contextualized text, raw text, headings, page numbers, and metadata.

Backend Logic:
- Updated the conversion endpoints (`/v1/convert/source`, `/v1/convert/file`, `/v1/result/{task_id}`) to support returning either standard or chunked document responses based on request parameters.
- Implemented the chunking logic in `response_preparation.py`, including HuggingFace tokenizer caching, chunk creation, and response assembly. Added support for markdown table serialization and custom tokenizer selection.

Testing:
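As an illustration of what the windowed-chunking behavior (`max_tokens` with `overlap`) can be unit-tested against, here is a minimal sketch. `chunk_tokens` and its plain list-of-tokens interface are hypothetical stand-ins for the PR's actual helpers in `response_preparation.py`, which operate on HuggingFace tokenizers:

```python
# Illustrative sketch only: chunk_tokens is a hypothetical helper, not
# the real response_preparation.py logic, and plain string tokens stand
# in for HuggingFace tokenizer output.
def chunk_tokens(tokens: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of at most max_tokens, where consecutive
    windows share `overlap` tokens."""
    if max_tokens <= overlap:
        raise ValueError("max_tokens must be larger than overlap")
    step = max_tokens - overlap
    chunks: list[list[str]] = []
    i = 0
    while i < len(tokens):
        chunks.append(tokens[i : i + max_tokens])
        if i + max_tokens >= len(tokens):
            break  # last window reached; avoid a fully redundant trailing chunk
        i += step
    return chunks


def test_chunk_sizes_and_overlap() -> None:
    tokens = [f"t{i}" for i in range(10)]
    chunks = chunk_tokens(tokens, max_tokens=4, overlap=2)
    assert all(len(c) <= 4 for c in chunks)
    # consecutive windows share the configured overlap
    assert chunks[0][-2:] == chunks[1][:2]
    # no tokens are dropped
    assert chunks[-1][-1] == "t9"
```

A real test suite would instead exercise the endpoints with `do_chunking=true` and check the `ChunkedDocumentResponse` shape; this sketch only pins down the windowing arithmetic.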
Documentation Updates:
- Updated `docs/usage.md` with detailed instructions and examples for using chunking options, including configuration parameters, example requests, and the chunked response format.

Resolves #44 (Feature Proposal: Add Chunked Output Endpoints)
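To make the new options concrete, a request enabling chunking might be built like this. The field names (`do_chunking`, `chunking_options`, `max_tokens`, `overlap`, `tokenizer`) come from the PR description above; the source entry, tokenizer id, host, and exact JSON nesting are illustrative assumptions, not the PR's confirmed schema:

```python
import json

# Hypothetical payload for the chunking-enabled /v1/convert/source endpoint.
# The tokenizer id, source URL, and JSON nesting are assumptions for
# illustration only.
payload = {
    "options": {
        "do_chunking": True,
        "chunking_options": {
            "max_tokens": 512,
            "overlap": 64,
            "tokenizer": "sentence-transformers/all-MiniLM-L6-v2",
        },
    },
    "sources": [{"url": "https://example.com/paper.pdf"}],
}

# Against a live deployment this would be POSTed, e.g.:
# requests.post("http://localhost:5001/v1/convert/source", json=payload)
print(json.dumps(payload, indent=2))
```

With `do_chunking` omitted or false, the same request returns the standard (unchunked) conversion response.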