|
| 1 | +import io |
| 2 | +import logging |
| 3 | +from pathlib import Path |
| 4 | +from typing import Iterable, List, Optional, Set |
| 5 | + |
| 6 | +import PIL.Image |
| 7 | +from datasets import load_dataset |
| 8 | +from docling_core.types import DoclingDocument |
| 9 | +from docling_core.types.doc import ( |
| 10 | + BoundingBox, |
| 11 | + CoordOrigin, |
| 12 | + DocItemLabel, |
| 13 | + GroupItem, |
| 14 | + GroupLabel, |
| 15 | + ImageRef, |
| 16 | + PageItem, |
| 17 | + ProvenanceItem, |
| 18 | + Size, |
| 19 | + TableCell, |
| 20 | + TableData, |
| 21 | +) |
| 22 | +from docling_core.types.io import DocumentStream |
| 23 | +from tqdm import tqdm |
| 24 | + |
| 25 | +from docling_eval.datamodels.dataset_record import DatasetRecord |
| 26 | +from docling_eval.datamodels.types import BenchMarkColumns, EvaluationModality |
| 27 | +from docling_eval.dataset_builders.dataset_builder import ( |
| 28 | + BaseEvaluationDatasetBuilder, |
| 29 | + HFSource, |
| 30 | +) |
| 31 | +from docling_eval.utils.utils import ( |
| 32 | + add_pages_to_true_doc, |
| 33 | + crop_bounding_box, |
| 34 | + extract_images, |
| 35 | + from_pil_to_base64uri, |
| 36 | + get_binhash, |
| 37 | +) |
| 38 | + |
# Module-level logger, named after this module per the standard logging convention.
_log = logging.getLogger(__name__)
| 41 | + |
| 42 | + |
class DocVQADatasetBuilder(BaseEvaluationDatasetBuilder):
    """
    DocVQA dataset builder implementing the base dataset builder interface.

    This builder processes the DocVQA dataset, which contains document
    layout annotations for a variety of document types. Dataset rows that
    share the same ``docId`` are grouped into a single one-page
    ground-truth document.
    """

    def __init__(
        self,
        target: Path,
        split: str = "test",
        begin_index: int = 0,
        end_index: int = -1,
    ):
        """
        Initialize the DocVQA dataset builder.

        Args:
            target: Path where processed dataset will be saved
            split: Dataset split to use
            begin_index: Start index for processing (inclusive)
            end_index: End index for processing (exclusive), -1 means process all
        """
        super().__init__(
            name="DocVQA",
            dataset_source=HFSource(repo_id="lmms-lab/DocVQA"),
            target=target,
            split=split,
            begin_index=begin_index,
            end_index=end_index,
        )

    def _process_document(self, doc_id, qa_items) -> DatasetRecord:
        """
        Build one ground-truth DatasetRecord from all QA items of a document.

        Args:
            doc_id: The DocVQA ``docId`` shared by every row in ``qa_items``.
            qa_items: Non-empty list of dataset rows for this document.
                All rows of one document are assumed to carry the same page
                image; the image of the first row is used.

        Returns:
            A DatasetRecord holding the single-page ground-truth document.

        Raises:
            ValueError: If ``qa_items`` is empty.
        """
        # Guard: qa_items[0] below would otherwise raise a bare IndexError.
        if not qa_items:
            raise ValueError(f"No QA items supplied for document {doc_id!r}")

        # Lazy %-style args keep this consistent with the logging style used
        # in iterate() and avoid formatting when DEBUG is disabled.
        _log.debug("Processing document %s with %d QA item(s)", doc_id, len(qa_items))

        doc = DoclingDocument(name=f"{doc_id}")

        image: PIL.Image.Image = qa_items[0]["image"].convert("RGB")
        image_ref = ImageRef(
            mimetype="image/png",
            dpi=72,
            size=Size(width=image.width, height=image.height),
            uri=from_pil_to_base64uri(image),
        )
        doc.pages[1] = PageItem(
            page_no=1,
            size=Size(width=float(image.width), height=float(image.height)),
            image=image_ref,
        )

        # Extract images from the ground-truth document into the dedicated
        # dataset columns (pictures / page images).
        doc, true_pictures, true_page_images = extract_images(
            document=doc,
            pictures_column=BenchMarkColumns.GROUNDTRUTH_PICTURES.value,
            page_images_column=BenchMarkColumns.GROUNDTRUTH_PAGE_IMAGES.value,
        )

        # Serialize the page image once; the bytes serve both as the
        # "original" document stream and as input for the content hash.
        # (getvalue() is position-independent, so no seek is needed.)
        with io.BytesIO() as img_byte_stream:
            image.save(img_byte_stream, format="PNG")
            img_bytes = img_byte_stream.getvalue()

        return DatasetRecord(
            doc_id=str(doc_id),
            doc_hash=get_binhash(img_bytes),
            ground_truth_doc=doc,
            original=DocumentStream(name=str(doc_id), stream=io.BytesIO(img_bytes)),
            mime_type="image/png",
            modalities=[
                EvaluationModality.LAYOUT,
                EvaluationModality.QUESTION_ANSWERING,
            ],
            ground_truth_pictures=true_pictures,
            ground_truth_page_images=true_page_images,
        )

    def iterate(self) -> Iterable[DatasetRecord]:
        """
        Iterate through the dataset and yield one DatasetRecord per document.

        Rows are sorted by ``docId`` so that all QA items belonging to the
        same document appear consecutively and can be grouped in a single
        linear pass.

        Yields:
            DatasetRecord objects
        """
        assert isinstance(self.dataset_source, HFSource)

        # Prefer a locally cached copy of the dataset when one is available.
        path = self.dataset_source.repo_id
        if self.dataset_local_path is not None:
            path = str(self.dataset_local_path)
        ds = load_dataset(path, split=self.split, name="DocVQA")

        # Restrict to the configured [begin, end) index window using
        # HuggingFace's select method.
        total_ds_len = len(ds)
        begin, end = self.get_effective_indices(total_ds_len)
        ds = ds.select(range(begin, end))
        selected_ds_len = len(ds)

        self.log_dataset_stats(total_ds_len, selected_ds_len)

        # skipped_rows is kept for log symmetry; this builder never skips rows.
        skipped_rows = 0
        exported_rows = 0

        # Sorting by docId guarantees all rows of a document are contiguous.
        sorted_dataset = ds.sort("docId")

        current_doc_id = None
        current_doc_qa_items: List = []

        for sample in tqdm(
            sorted_dataset,
            total=selected_ds_len,
            ncols=128,
            desc="Processing DocVQA records...",
        ):
            if sample["docId"] != current_doc_id:
                # Flush the previous document's group (no-op on the first row).
                if current_doc_qa_items:
                    yield self._process_document(current_doc_id, current_doc_qa_items)
                    exported_rows += 1
                # Start a new document group.
                current_doc_id = sample["docId"]
                current_doc_qa_items = [sample]
            else:
                current_doc_qa_items.append(sample)

        # Flush the final document group.
        if current_doc_qa_items:
            yield self._process_document(current_doc_id, current_doc_qa_items)
            exported_rows += 1

        _log.info(
            "Exported rows: %s. Skipped rows: %s.",
            exported_rows,
            skipped_rows,
        )
0 commit comments