Request to Release or Integrate 89-PDF Benchmark Dataset Mentioned in the Technical Report

Hi Docling Team,

I really enjoyed reading your technical report, especially the section describing the 89-PDF benchmark dataset:

> To enable a meaningful benchmark, we composed a test set of 89 PDF files covering a large variety of styles, features, content, and length (see Figure 2). This dataset is based to a large extend on our DocLayNet (Pfitzmann et al. 2022) dataset and augmented with additional samples from CCpdf (Turski et al. 2023) to increase the variety. Overall, it includes 4008 pages, 56246 text items, 1842 tables and 4676 pictures. As such, it is large enough to provide variety without requiring excessively long benchmarking times.

Would you consider making this dataset more accessible to the community? In particular, since existing datasets like OmniDocBench and DP-Bench are primarily composed of single-page PDFs, having access to a larger, more diverse dataset like 89-PDF would be incredibly valuable.

It would be great if:

1. The 89-PDF dataset could be added as an optional benchmark within docling-eval, alongside OmniDocBench and DP-Bench; **or**
2. The specific CCpdf and DocLayNet samples used for constructing the benchmark could be released separately.

Thanks again for your excellent work!

Request to Release or Integrate 89-PDF Benchmark Dataset Mentioned in the Technical Report #117

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Request to Release or Integrate 89-PDF Benchmark Dataset Mentioned in the Technical Report #117

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions