feat: convert Docling JSON inputs to image streams in FileDatasetBuilder #184

Merged: cau-git merged 2 commits into main from cau/file-dataset-builder-images on Dec 5, 2025
Conversation

@cau-git (Member) commented Dec 4, 2025

This lets the FileDatasetBuilder produce normalized ground-truth parquet datasets in which the DatasetRecord is populated with a PNG or multi-page TIFF image of the pages when the source is an existing DoclingDocument JSON file. The goal is to avoid the custom logic that BasePredictionProvider would otherwise need for this case, and to avoid defects in parts of the pipelines that make format assumptions about the DatasetRecord.original field (a.k.a. the BinaryDocument column).
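
For readers unfamiliar with the idea, here is a minimal sketch of encoding page images as a single PNG or multi-page TIFF byte stream using Pillow. It is not the code from this PR: the helper name `pages_to_image_bytes`, the rule "one page gives PNG, several pages give TIFF", and the blank stand-in page images are all assumptions made for illustration.

```python
from io import BytesIO

from PIL import Image


def pages_to_image_bytes(pages):
    """Encode page images into one byte stream (hypothetical helper).

    Assumption for this sketch: a single page is stored as PNG,
    several pages are stored as one multi-page TIFF.
    """
    if not pages:
        raise ValueError("at least one page image is required")

    buf = BytesIO()
    if len(pages) == 1:
        pages[0].save(buf, format="PNG")
        mime = "image/png"
    else:
        # save_all + append_images writes every page into one TIFF file
        pages[0].save(
            buf, format="TIFF", save_all=True, append_images=pages[1:]
        )
        mime = "image/tiff"
    return buf.getvalue(), mime


# Stand-in page images; a real pipeline would render them from the document.
pages = [Image.new("RGB", (64, 64), "white") for _ in range(3)]
data, mime = pages_to_image_bytes(pages)

# Re-open the stream to confirm the number of stored pages.
with Image.open(BytesIO(data)) as im:
    print(mime, im.n_frames)
```

Keeping the result as plain bytes plus a MIME type would let downstream consumers treat it like any other image input, which is the kind of format assumption the description mentions.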

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

github-actions bot (Contributor) commented Dec 4, 2025

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

mergify bot commented Dec 4, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
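
As an illustration of the title rule above, the following sketch checks a title against the same pattern with Python's `re` module. The function name `title_is_conventional` is hypothetical, and treating the `~=` operator as an unanchored regex search is an assumption about how the rule is evaluated.

```python
import re

# Pattern copied from the merge protection rule.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(?:\(.+\))?(!)?:"
)


def title_is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None


for title in (
    "feat: convert Docling JSON inputs to image streams in FileDatasetBuilder",
    "fix(parser)!: handle empty pages",
    "Convert Docling JSON inputs to image streams",
):
    print(title_is_conventional(title), "|", title)
```

The pattern only constrains the prefix (type, optional scope, optional `!`, then a colon), so the rest of the title is free-form.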

@cau-git changed the title from "feat: convert Docling JSON inputs to image streamsin FileDatasetBuilder" to "feat: convert Docling JSON inputs to image streams in FileDatasetBuilder" on Dec 4, 2025
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git marked this pull request as ready for review on December 5, 2025 09:51
@cau-git merged commit 15888fd into main on Dec 5, 2025 (10 checks passed)
@cau-git deleted the cau/file-dataset-builder-images branch on December 5, 2025 12:36