Hi Docling Team,
I really enjoyed reading your technical report, especially the section describing the 89-PDF benchmark dataset:
> To enable a meaningful benchmark, we composed a test set of 89 PDF files covering a large variety of styles, features, content, and length (see Figure 2). This dataset is based to a large extent on our DocLayNet (Pfitzmann et al. 2022) dataset and augmented with additional samples from CCpdf (Turski et al. 2023) to increase the variety. Overall, it includes 4008 pages, 56246 text items, 1842 tables and 4676 pictures. As such, it is large enough to provide variety without requiring excessively long benchmarking times.
Would you consider making this dataset available to the community? Existing datasets such as OmniDocBench and DP-Bench consist primarily of single-page PDFs, so access to a larger, more diverse multi-page dataset like 89-PDF would be incredibly valuable.
It would be great if:
- The 89-PDF dataset could be added as an optional benchmark within docling-eval, alongside OmniDocBench and DP-Bench; or
- The specific CCpdf and DocLayNet samples used for constructing the benchmark could be released separately.
Thanks again for your excellent work!