feat: add chunking support for document conversion #299
MagnusS0 wants to merge 4 commits into docling-project:main from
Conversation
✅ DCO Check Passed. Thanks @MagnusS0, all your commits are properly signed off. 🎉
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
I, Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8a19f46
I, Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8f4b928
Signed-off-by: Magnus Samuelsen <97634880+MagnusS0@users.noreply.github.com>
@MagnusS0 as introduced in the issue, we moved the results generation to the jobkit library, so most of this PR should be moved there. My proposal would also be to use a different endpoint, e.g. Do you want to contribute this change, or do you prefer that we do it from our side?
@dolfim-ibm Awesome to hear! I'm a bit busy this weekend, but I can probably spend some time on it in the next couple of weeks. If you have any input and have given some thought to how to implement it in the jobkit library, I would love to hear it 😊
Superseded by #353
Document Chunking Endpoint
This pull request adds support for document chunking in the API, enabling Retrieval-Augmented Generation (RAG) workflows by returning chunked document responses. The changes introduce new request options, response models, and backend logic to handle chunking, including configuration for tokenization and serialization. Documentation is updated to explain usage and response formats for chunking. Current usage is unaffected: the implementation uses a lazy-import pattern, so chunking dependencies are only loaded when requested. When `do_chunking=false` (the default), the API behaves exactly as before.

API and Model Changes:
- Added `do_chunking` and `chunking_options` to `ConvertDocumentsRequestOptions`, allowing clients to request chunked document responses with customizable chunking parameters such as `max_tokens`, `overlap`, `tokenizer`, and markdown table serialization.
- Added `ChunkedDocumentResponse` and `ChunkedDocumentResponseItem` models to represent chunked output, including contextualized text, raw text, headings, page numbers, and metadata.

Backend Logic:
- Updated the conversion endpoints (`/v1/convert/source`, `/v1/convert/file`, `/v1/result/{task_id}`) to support returning either standard or chunked document responses based on request parameters.
- Implemented the chunking logic in `response_preparation.py`, including HuggingFace tokenizer caching, chunk creation, and response assembly. Added support for markdown table serialization and custom tokenizer selection.

Testing:
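As an illustration of what the windowed-chunking behavior (`max_tokens` with `overlap`) can be unit-tested against, here is a minimal sketch. `chunk_tokens` and its plain list-of-tokens interface are hypothetical stand-ins for the PR's actual helpers in `response_preparation.py`, which operate on HuggingFace tokenizers:

```python
# Illustrative sketch only: chunk_tokens is a hypothetical helper, not
# the real response_preparation.py logic, and plain string tokens stand
# in for HuggingFace tokenizer output.
def chunk_tokens(tokens: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of at most max_tokens, where consecutive
    windows share `overlap` tokens."""
    if max_tokens <= overlap:
        raise ValueError("max_tokens must be larger than overlap")
    step = max_tokens - overlap
    chunks: list[list[str]] = []
    i = 0
    while i < len(tokens):
        chunks.append(tokens[i : i + max_tokens])
        if i + max_tokens >= len(tokens):
            break  # last window reached; avoid a fully redundant trailing chunk
        i += step
    return chunks


def test_chunk_sizes_and_overlap() -> None:
    tokens = [f"t{i}" for i in range(10)]
    chunks = chunk_tokens(tokens, max_tokens=4, overlap=2)
    assert all(len(c) <= 4 for c in chunks)
    # consecutive windows share the configured overlap
    assert chunks[0][-2:] == chunks[1][:2]
    # no tokens are dropped
    assert chunks[-1][-1] == "t9"
```

A real test suite would instead exercise the endpoints with `do_chunking=true` and check the `ChunkedDocumentResponse` shape; this sketch only pins down the windowing arithmetic.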
Documentation Updates:
- Updated `docs/usage.md` with detailed instructions and examples for using chunking options, including configuration parameters, example requests, and the chunked response format.

Resolves #44 (Feature Proposal: Add Chunked Output Endpoints)
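To make the new options concrete, a request enabling chunking might be built like this. The field names (`do_chunking`, `chunking_options`, `max_tokens`, `overlap`, `tokenizer`) come from the PR description above; the source entry, tokenizer id, host, and exact JSON nesting are illustrative assumptions, not the PR's confirmed schema:

```python
import json

# Hypothetical payload for the chunking-enabled /v1/convert/source endpoint.
# The tokenizer id, source URL, and JSON nesting are assumptions for
# illustration only.
payload = {
    "options": {
        "do_chunking": True,
        "chunking_options": {
            "max_tokens": 512,
            "overlap": 64,
            "tokenizer": "sentence-transformers/all-MiniLM-L6-v2",
        },
    },
    "sources": [{"url": "https://example.com/paper.pdf"}],
}

# Against a live deployment this would be POSTed, e.g.:
# requests.post("http://localhost:5001/v1/convert/source", json=payload)
print(json.dumps(payload, indent=2))
```

With `do_chunking` omitted or false, the same request returns the standard (unchunked) conversion response.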