
Commit 09c40f1

feat: Add DocVQA dataset builder (#54)

Authored by cau-git, PeterStaar-IBM, Saidgurbuz, and nikos-livathinos
Squashed commits:

* correct mpy (Peter Staar)
* reformatting (Peter Staar)
* adding the script to make an initial dataset from pdf's (Peter Staar)
* before switching to specific docling-core branch (Peter Staar)
* rebased on kv-items and updated the create script in CVAT (Peter Staar)
* fixed the cvat (Peter Staar)
* added the annotation description on CVAT (Peter Staar)
* added the annotation description on CVAT (2) (Peter Staar)
* added the annotation description on CVAT (3) (Peter Staar)
* [WIP] Crafting new dataset builder and prediction provider API (Christoph Auer)
* Restructure to docling_eval_next (Christoph Auer)
* Fix mypy (Christoph Auer)
* Fix f-strings (Christoph Auer)
* Changes for prediction_provider interface, to support all cases. (Christoph Auer)
* Add omnidocbench DatasetBuilder (Christoph Auer)
* Add doclaynet v1, funsd (Christoph Auer)
* Fixes (Christoph Auer)
* Add XFUND, more fixes (Christoph Auer)
* update the kv cell creation to prevent false positives (Saidgurbuz)
* chore: Fixing imports (Nikos Livathinos)
* chore: Update docling-core version (Nikos Livathinos)
* feat: Introduce new design for Evaluators based on BaseEvaluator that accept external predictions. And utility adapters. (Nikos Livathinos)
* Factor PredictionProvider out of dataset builder, many fixes on DatasetRecord (Christoph Auer)
* Sketch example for file-directory prediction provider (Christoph Auer)
* chore: Fix typing hints (Nikos Livathinos)
* chore: Update poetry to doclign-core 2.24.0 (Nikos Livathinos)
* feat: WIP: Introduce the FilePredictionProvider that reads files with predictions from the disk. It currently supports doctags, markdown, json, yaml formats. We still need to improve the returned type so that it allows for no DoclingDocument but only for the source data (e.g. in case of markdown). (Nikos Livathinos)
* Add DocLayNetV2DatasetBuilder (Christoph Auer)
* Added TableDatasetBuilder and test, update TableFormerPredictionProvider (Christoph Auer)
* chore: Update MyPy configuration in toml (Nikos Livathinos)
* feat: Refactor the BasePredictionProvider.predict() to return DatasetRecordWithPrediction (Nikos Livathinos)
* Fixes (Christoph Auer)
* fix: Fix the FilePredictionProvider. Return None in the predicted document in case of Markdown. (Nikos Livathinos)
* fix: Remove the kwargs from all PredictonProvider classes and introduce provider specific initialization arguments (Nikos Livathinos)
* feat: Introduce the parameter "ignore_missing_files" in FilePredictionProvider (Nikos Livathinos)
* Add do_visualization to PredictionProvider (Christoph Auer)
* Move next-gen API to main source tree, re-organize module paths (Christoph Auer)
* Fixes (Christoph Auer)
* Cleanup, change path handling (Christoph Auer)
* Cleanup, change path handling (Christoph Auer)
* More module removal and renaming (Christoph Auer)
* Small test fixes (Christoph Auer)
* fix: Add the "prediction_format" in the serialization of DatasetRecordWithPrediction (Nikos Livathinos)
* feat: Refactor the MarkdownTextEvaluator to support the new classes design. Add unit test. (Nikos Livathinos)
* fix: Improve the new design of MarkdownEvaluator to move common functionalities into the base class (Nikos Livathinos)
* feat: Refactor the LayoutEvaluator to use the new class design. Add unit test. (Nikos Livathinos)
* fix: Clean up LayoutEvaluator code (Nikos Livathinos)
* chore: Implementation cleanup and fixes for new class design (#52): More module removal and renaming; Small test fixes; Small test fixes; Cleanup of tests and more fixes (Christoph Auer)
* Add visualization for tables (Christoph Auer)
* Add visualization for all tests (Christoph Auer)
* Fixes for test files, FilePredictionProvider changes (Christoph Auer)
* Put new CLI (Christoph Auer)
* Cleanup (Christoph Auer)
* Rename CLI (Christoph Auer)
* Update all README with new commands. (Christoph Auer)
* Remove old examples (Christoph Auer)
* Several Fixes (Christoph Auer)
* README updates (Christoph Auer)
* Add gt_dir arg to create-eval, README fixes (Christoph Auer)
* Fixes, pass tests (Christoph Auer)
* feat: Refactor the TableEvaluator to use the new class design. Move common evaluator code to BaseEvaluator. Add more unit tests. Introduce pytest dependencies. (Nikos Livathinos)
* Update lockfile (Christoph Auer)
* Update lockfile (Christoph Auer)
* Make pytest CI output more verbose (Christoph Auer)
* feat: Refactor the ReadingOrderEvaluator to use the new class design. Remove the BaseReadingOrderEvaluator. Add unit test. (Nikos Livathinos)
* Optimize GT downloading behaviour (Christoph Auer)
* Add file sources (Christoph Auer)
* Allow pytest output on CI (Christoph Auer)
* Disable tests in CI (Christoph Auer)
* Reenable tests in CI (Christoph Auer)
* Add correct @pytest.mark.dependency() (Christoph Auer)
* feat: Introduce TypeVars for the UnitEvaluation and DatasetEvaluation used by the BaseEvaluator. (Nikos Livathinos)
* Minimize tests in CI (Christoph Auer)
* feat: Refactor BboxTestEvaluator to use the new design. Introduce unit test. (Nikos Livathinos)
* Remove streaming in DocLaynet v1 (Christoph Auer)
* Add back test dependency (Christoph Auer)
* Add DocVQA dataset builder (Christoph Auer)
* Bugfixes (Christoph Auer)
* Remove prints (Christoph Auer)
* Cleanup (Christoph Auer)
* Add DocVQA to CLI (Christoph Auer)

Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Saidgurbuz <[email protected]>
Signed-off-by: Nikos Livathinos <[email protected]>
Co-authored-by: Peter Staar <[email protected]>
Co-authored-by: Saidgurbuz <[email protected]>
Co-authored-by: Nikos Livathinos <[email protected]>
Parent: a3d99b9. Commit: 09c40f1.

File tree: 4 files changed, +224 -0 lines changed


docling_eval/cli/main.py

Lines changed: 4 additions & 0 deletions

```diff
@@ -19,6 +19,7 @@
 )
 from docling_eval.dataset_builders.doclaynet_v1_builder import DocLayNetV1DatasetBuilder
 from docling_eval.dataset_builders.doclaynet_v2_builder import DocLayNetV2DatasetBuilder
+from docling_eval.dataset_builders.docvqa_builder import DocVQADatasetBuilder
 from docling_eval.dataset_builders.dpbench_builder import DPBenchDatasetBuilder
 from docling_eval.dataset_builders.funsd_builder import FUNSDDatasetBuilder
 from docling_eval.dataset_builders.omnidocbench_builder import (
@@ -171,6 +172,9 @@ def get_dataset_builder(
     elif benchmark == BenchMarkNames.PUBTABNET:
         return PubTabNetDatasetBuilder(**common_params)  # type: ignore

+    elif benchmark == BenchMarkNames.DOCVQA:
+        return DocVQADatasetBuilder(**common_params)  # type: ignore
+
     else:
         raise ValueError(f"Unsupported benchmark: {benchmark}")
```
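The CLI routes each benchmark name to its dataset builder through a plain `elif` chain, which the new branch extends. A minimal, self-contained sketch of that dispatch pattern (stand-in return values instead of constructing the real builder classes):

```python
from enum import Enum


class BenchMarkNames(str, Enum):
    # stand-in subset of docling_eval's benchmark names
    PUBTABNET = "PubTabNet"
    DOCVQA = "DocVQA"


def get_dataset_builder(benchmark: BenchMarkNames) -> str:
    # mirrors the elif dispatch in docling_eval/cli/main.py;
    # returns the builder class name instead of an instance
    if benchmark == BenchMarkNames.PUBTABNET:
        return "PubTabNetDatasetBuilder"
    elif benchmark == BenchMarkNames.DOCVQA:
        return "DocVQADatasetBuilder"
    else:
        raise ValueError(f"Unsupported benchmark: {benchmark}")


print(get_dataset_builder(BenchMarkNames.DOCVQA))  # DocVQADatasetBuilder
```

Because every supported benchmark must appear in the chain, forgetting a branch surfaces as the explicit `ValueError` rather than a silent fallback.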

docling_eval/datamodels/types.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -45,6 +45,7 @@ class EvaluationModality(str, Enum):
     CAPTIONING = "captioning"  # to compute the accuracy of captions to table/figure
     BBOXES_TEXT = "bboxes_text"
     KEY_VALUE = "key_value"
+    QUESTION_ANSWERING = "question_answering"


 class BenchMarkNames(str, Enum):
@@ -67,6 +68,8 @@ class BenchMarkNames(str, Enum):
     FINTABNET = "FinTabNet"
     WIKITABNET = "WikiTabNet"

+    DOCVQA = "DocVQA"
+
     # Formula
     # ???
```
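Both modified enums mix in `str`, so the new members behave like plain strings. A quick standalone illustration mimicking the two additions (the class bodies are trimmed to the relevant members):

```python
from enum import Enum


class EvaluationModality(str, Enum):
    KEY_VALUE = "key_value"
    QUESTION_ANSWERING = "question_answering"  # new member


class BenchMarkNames(str, Enum):
    DOCVQA = "DocVQA"  # new member


# a (str, Enum) member compares equal to its raw value...
assert BenchMarkNames.DOCVQA == "DocVQA"

# ...and its .value interpolates cleanly into paths,
# which is how the new test constructs its scratch directory
path = f"./scratch/{BenchMarkNames.DOCVQA.value}/"
print(path)  # ./scratch/DocVQA/
```

Using `.value` in f-strings keeps the output stable across Python versions, since `str(member)` formatting for mixed-in enums changed in Python 3.11.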

docling_eval/dataset_builders/docvqa_builder.py

Lines changed: 195 additions & 0 deletions (new file)

```python
import io
import logging
from pathlib import Path
from typing import Iterable, List, Optional, Set

import PIL.Image
from datasets import load_dataset
from docling_core.types import DoclingDocument
from docling_core.types.doc import (
    BoundingBox,
    CoordOrigin,
    DocItemLabel,
    GroupItem,
    GroupLabel,
    ImageRef,
    PageItem,
    ProvenanceItem,
    Size,
    TableCell,
    TableData,
)
from docling_core.types.io import DocumentStream
from tqdm import tqdm

from docling_eval.datamodels.dataset_record import DatasetRecord
from docling_eval.datamodels.types import BenchMarkColumns, EvaluationModality
from docling_eval.dataset_builders.dataset_builder import (
    BaseEvaluationDatasetBuilder,
    HFSource,
)
from docling_eval.utils.utils import (
    add_pages_to_true_doc,
    crop_bounding_box,
    extract_images,
    from_pil_to_base64uri,
    get_binhash,
)

# Get logger
_log = logging.getLogger(__name__)


class DocVQADatasetBuilder(BaseEvaluationDatasetBuilder):
    """
    DocVQA dataset builder implementing the base dataset builder interface.

    This builder processes the DocVQA dataset, which contains document
    layout annotations for a variety of document types.
    """

    def __init__(
        self,
        target: Path,
        split: str = "test",
        begin_index: int = 0,
        end_index: int = -1,
    ):
        """
        Initialize the DocVQA dataset builder.

        Args:
            target: Path where processed dataset will be saved
            split: Dataset split to use
            begin_index: Start index for processing (inclusive)
            end_index: End index for processing (exclusive), -1 means process all
        """
        super().__init__(
            name="DocVQA",
            dataset_source=HFSource(repo_id="lmms-lab/DocVQA"),
            target=target,
            split=split,
            begin_index=begin_index,
            end_index=end_index,
        )

    def _process_document(self, doc_id, qa_items) -> DatasetRecord:
        """Process all QA items for a single document."""
        _log.debug(f"Processing document: {doc_id}")

        doc = DoclingDocument(name=f"{doc_id}")
        image: PIL.Image.Image = qa_items[0]["image"]
        image = image.convert("RGB")
        image_ref = ImageRef(
            mimetype="image/png",
            dpi=72,
            size=Size(width=image.width, height=image.height),
            uri=from_pil_to_base64uri(image),
        )
        page_item = PageItem(
            page_no=1,
            size=Size(width=float(image.width), height=float(image.height)),
            image=image_ref,
        )

        doc.pages[1] = page_item
        for qa_item in qa_items:
            _log.debug(f"  Processing QA item data...")

        # Extract images from the ground truth document
        doc, true_pictures, true_page_images = extract_images(
            document=doc,
            pictures_column=BenchMarkColumns.GROUNDTRUTH_PICTURES.value,
            page_images_column=BenchMarkColumns.GROUNDTRUTH_PAGE_IMAGES.value,
        )

        # Convert image to bytes for storage
        with io.BytesIO() as img_byte_stream:
            image.save(img_byte_stream, format="PNG")
            img_byte_stream.seek(0)
            img_bytes = img_byte_stream.getvalue()

        # Create dataset record
        record = DatasetRecord(
            doc_id=str(doc_id),
            doc_hash=get_binhash(img_bytes),
            ground_truth_doc=doc,
            original=DocumentStream(name=str(doc_id), stream=io.BytesIO(img_bytes)),
            mime_type="image/png",
            modalities=[
                EvaluationModality.LAYOUT,
                EvaluationModality.QUESTION_ANSWERING,
            ],
            ground_truth_pictures=true_pictures,
            ground_truth_page_images=true_page_images,
        )

        return record

    def iterate(self) -> Iterable[DatasetRecord]:
        """
        Iterate through the dataset and yield DatasetRecord objects.

        Yields:
            DatasetRecord objects
        """
        assert isinstance(self.dataset_source, HFSource)

        path = self.dataset_source.repo_id
        if self.dataset_local_path is not None:
            path = str(self.dataset_local_path)
        # Load dataset from the retrieved path
        ds = load_dataset(path, split=self.split, name="DocVQA")

        # Apply HuggingFace's select method for index ranges
        total_ds_len = len(ds)
        begin, end = self.get_effective_indices(total_ds_len)

        # Select the range (HuggingFace datasets have a convenient select method)
        ds = ds.select(range(begin, end))
        selected_ds_len = len(ds)

        # Log stats
        self.log_dataset_stats(total_ds_len, selected_ds_len)

        skipped_rows = 0
        exported_rows = 0

        sorted_dataset = ds.sort("docId")

        # Initialize variables
        current_doc_id = None
        current_doc_qa_items = []  # type: ignore

        # Iterate through the sorted dataset
        for sample in tqdm(
            sorted_dataset,
            total=selected_ds_len,
            ncols=128,
            desc="Processing DocVQA records...",
        ):
            # Check if we've moved to a new docId
            if sample["docId"] != current_doc_id:
                # Process the previous doc's QA items (skip first iteration)
                if current_doc_qa_items:
                    rec = self._process_document(current_doc_id, current_doc_qa_items)
                    yield rec
                    exported_rows += 1

                # Start a new document group
                current_doc_id = sample["docId"]
                current_doc_qa_items = [sample]
            else:
                current_doc_qa_items.append(sample)

        # Process the final document group
        if current_doc_qa_items:
            rec = self._process_document(current_doc_id, current_doc_qa_items)
            yield rec
            exported_rows += 1

        _log.info(
            "Exported rows: %s. Skipped rows: %s.",
            exported_rows,
            skipped_rows,
        )
```
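The `iterate()` method groups the docId-sorted rows manually with a `current_doc_id` accumulator, emitting one record per document. The same grouping can be expressed with `itertools.groupby`; a standalone sketch on toy rows (the sample dicts are hypothetical stand-ins, only the `docId` key comes from the real dataset):

```python
from itertools import groupby

# toy stand-ins for DocVQA rows: several QA pairs can share one docId
samples = [
    {"docId": 2, "question": "What is the total?"},
    {"docId": 1, "question": "What is the date?"},
    {"docId": 1, "question": "Who signed it?"},
]

# sort by docId, then group consecutive rows sharing the same key --
# equivalent to the manual current_doc_id loop in iterate()
by_doc = sorted(samples, key=lambda s: s["docId"])
grouped = {
    doc_id: list(items)
    for doc_id, items in groupby(by_doc, key=lambda s: s["docId"])
}

assert list(grouped) == [1, 2]
assert len(grouped[1]) == 2 and len(grouped[2]) == 1
```

The sort-first step matters: `groupby` only merges adjacent runs, which is exactly why the builder calls `ds.sort("docId")` before its loop.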

tests/test_dataset_builder.py

Lines changed: 22 additions & 0 deletions

```diff
@@ -21,6 +21,7 @@
 )
 from docling_eval.dataset_builders.doclaynet_v1_builder import DocLayNetV1DatasetBuilder
 from docling_eval.dataset_builders.doclaynet_v2_builder import DocLayNetV2DatasetBuilder
+from docling_eval.dataset_builders.docvqa_builder import DocVQADatasetBuilder
 from docling_eval.dataset_builders.dpbench_builder import DPBenchDatasetBuilder
 from docling_eval.dataset_builders.funsd_builder import FUNSDDatasetBuilder
 from docling_eval.dataset_builders.omnidocbench_builder import (
@@ -579,3 +580,24 @@ def test_run_pubtabnet_builder():
         odir=target_path / "evaluations" / EvaluationModality.TABLE_STRUCTURE.value,
         split="val",
     )
+
+
+@pytest.mark.skipif(
+    IS_CI, reason="Skipping test in CI because the dataset is too heavy."
+)
+def test_run_docvqa_builder():
+    target_path = Path(f"./scratch/{BenchMarkNames.DOCVQA.value}/")
+
+    dataset_layout = DocVQADatasetBuilder(
+        target=target_path / "gt_dataset",
+        end_index=25,
+    )
+
+    dataset_layout.save_to_disk()  # does all the job of iterating the dataset, making GT+prediction records, and saving them in shards as parquet.
+    docling_provider = create_docling_prediction_provider(page_image_scale=2.0)
+
+    docling_provider.create_prediction_dataset(
+        name=dataset_layout.name,
+        gt_dataset_dir=target_path / "gt_dataset",
+        target_dataset_dir=target_path / "eval_dataset_e2e",
+    )
```
