Request to Release or Integrate 89-PDF Benchmark Dataset Mentioned in the Technical Report #117

@CHN-ChenYi

Hi Docling Team,

I really enjoyed reading your technical report, especially the section describing the 89-PDF benchmark dataset:

> To enable a meaningful benchmark, we composed a test set of 89 PDF files covering a large variety of styles, features, content, and length (see Figure 2). This dataset is based to a large extend on our DocLayNet (Pfitzmann et al. 2022) dataset and augmented with additional samples from CCpdf (Turski et al. 2023) to increase the variety. Overall, it includes 4008 pages, 56246 text items, 1842 tables and 4676 pictures. As such, it is large enough to provide variety without requiring excessively long benchmarking times.

Would you consider making this dataset available to the community? Existing datasets such as OmniDocBench and DP-Bench are composed primarily of single-page PDFs, so access to a larger, more diverse dataset like the 89-PDF set would be incredibly valuable.

It would be great if:

  1. The 89-PDF dataset could be added as an optional benchmark within docling-eval, alongside OmniDocBench and DP-Bench; or
  2. The specific CCpdf and DocLayNet samples used for constructing the benchmark could be released separately.

Thanks again for your excellent work!
