Yes — there are several widely used benchmarks / training datasets for table parsing, and they map pretty cleanly onto PDF, HTML, and XML sources (often with aligned pairs like PDF+XML).
Best-known datasets for PDF / document-table extraction
PubTabNet (PDF table images ↔ HTML structure)
What it’s good for: table structure recognition from rendered table images (often with OCR as a separate step).
Scale: ~568k table images with HTML representations; introduced the TEDS metric for structure similarity.
Built by aligning PMC Open Access XML with PDFs. (arXiv)
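PubTabNet ships each table as an image plus a JSON annotation in which the HTML structure tokens and the cell contents are stored separately, so rebuilding the full HTML string is a common preprocessing step. A minimal sketch, assuming the field layout I recall from the released JSON ("html" → "structure"/"cells") — verify the keys against the actual files before relying on this:

```python
# Sketch: rebuild an HTML table string from a PubTabNet-style record.
# Field names ("html", "structure", "tokens", "cells") follow my reading
# of the released annotations and may need adjusting. Real records also
# contain attribute tokens (e.g. for spans), which this sketch ignores.

def record_to_html(record):
    structure = record["html"]["structure"]["tokens"]
    cells = iter(record["html"]["cells"])  # one entry per <td>...</td>
    out = []
    for tok in structure:
        if tok == "</td>":
            # cell content is stored separately; splice it in before the close tag
            out.extend(next(cells).get("tokens", []))
        out.append(tok)
    return "<table>" + "".join(out) + "</table>"

example = {
    "html": {
        "structure": {"tokens": ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]},
        "cells": [{"tokens": ["Header"]}, {"tokens": ["Value"]}],
    }
}
print(record_to_html(example))  # <table><tr><td>Header</td><td>Value</td></tr></table>
```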
Marmot / ICDAR competition datasets (classics; smaller but common)
What they’re good for: benchmarking table detection and table recognition in older pipelines / comparisons.
Marmot is a public distribution site for a ground-truthed table dataset. (icst.pku.edu.cn)
ICDAR 2019 cTDaR is a well-known competition dataset for detection + recognition. (cndplab-founder.github.io)
ICDAR 2021 also ran a scientific table image recognition competition (image to LaTeX). (arXiv)
HTML / web-table corpora (great for “tables in the wild”)
Web Data Commons (WDC) Web Tables corpora
What it’s good for: large-scale HTML relational tables mined from Common Crawl; great for classification/filtering (relational vs layout), schema matching, table-to-KB, etc.
Note: these are often weakly labeled or require downstream filtering/annotation depending on your task. (webdatacommons.org)
WDC Schema.org Table Corpus (2023)
What it’s good for: big set of tables derived from schema.org-annotated content (more structured/typed than generic HTML tables).
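The WDC table corpora ship as JSON records, one table per line; in the releases I have seen, the cell data lives in a "relation" field stored column-major, alongside flags like "hasHeader". A hedged sketch of the row-major conversion (field names from my reading of the corpus docs — check the specific release you download):

```python
import json

# Sketch: turn one WDC-style table record into header + body rows.
# Assumes "relation" is column-major (a list of columns), as in the
# 2015-style corpus layout; verify against the release's documentation.

def wdc_to_rows(line):
    rec = json.loads(line)
    columns = rec["relation"]               # [[col0 cells], [col1 cells], ...]
    rows = list(map(list, zip(*columns)))   # transpose to row-major
    if rec.get("hasHeader"):
        return rows[0], rows[1:]
    return None, rows

sample = json.dumps({
    "relation": [["city", "Paris", "Rome"], ["country", "France", "Italy"]],
    "hasHeader": True,
})
header, body = wdc_to_rows(sample)
print(header)  # ['city', 'country']
print(body)    # [['Paris', 'France'], ['Rome', 'Italy']]
```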
XML sources that are especially useful for supervision
PMC Open Access Subset (bulk XML, plus aligned PDFs for many articles)
What it’s good for: building your own aligned corpora (XML table markup ↔ PDF rendering), or reproducing approaches like PubTabNet-style alignment.
PMC explicitly provides machine-readable datasets for text mining, including OA subsets. (PMC)
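PMC OA articles use JATS-style XML, where each table sits inside a <table-wrap> element wrapping a caption plus an HTML-like <table>. A minimal extraction sketch with the standard library — real articles vary in namespaces and nesting, so treat this as a starting point rather than a full JATS parser:

```python
import xml.etree.ElementTree as ET

# Sketch: pull (id, caption, cell grid) out of JATS-style <table-wrap>
# elements, as found in PMC Open Access XML. Not namespace-aware.

def extract_tables(xml_string):
    root = ET.fromstring(xml_string)
    tables = []
    for wrap in root.iter("table-wrap"):
        cap = wrap.find("caption/p")
        caption = "".join(cap.itertext()) if cap is not None else ""
        # each <tr>'s children are <th>/<td> cells; join their text content
        rows = [["".join(c.itertext()) for c in tr] for tr in wrap.iter("tr")]
        tables.append((wrap.get("id"), caption, rows))
    return tables

jats = """<article>
  <table-wrap id="tbl1">
    <label>Table 1</label>
    <caption><p>Example measurements.</p></caption>
    <table>
      <thead><tr><th>sample</th><th>value</th></tr></thead>
      <tbody><tr><td>A</td><td>1.2</td></tr></tbody>
    </table>
  </table-wrap>
</article>"""

print(extract_tables(jats))
```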
How to pick the right benchmark (quick guide)
If you want PDF → structure (cells/rows/cols): start with PubTables-1M, PubTabNet, SciTSR.
If you want full-page / multi-table document extraction: PubTables-v2 (and PubTables-1M as baseline).
If you want HTML “real web” tables: WDC WebTables + (optionally) WDC Schema.org Tables.
If you want lots of synthetic-ish variety / pretraining: TableBank.
If you tell me your target output format (CSV grid? HTML? LaTeX? cell bounding boxes? logical header structure?) and whether your inputs are born-digital PDFs vs scans, I can recommend a concrete “training mix” (e.g., pretrain on TableBank → finetune on PubTables-1M → evaluate on SciTSR-COMP, etc.).
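If your target output is a flat CSV grid, the step most pipelines share is expanding rowspan/colspan into a rectangular matrix. A self-contained sketch using only the standard library; duplicating the spanned value into every covered cell is one design choice among several (some pipelines emit blanks instead):

```python
from html.parser import HTMLParser

# Sketch: flatten an HTML table (including rowspan/colspan) into a
# rectangular grid of strings, duplicating spanned values.

class GridBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.grid, self.row, self.col = {}, -1, 0
        self.in_cell, self.text, self.span = False, [], (1, 1)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            # skip columns already claimed by a rowspan from an earlier row
            while (self.row, self.col) in self.grid:
                self.col += 1
            self.span = (int(a.get("rowspan", 1)), int(a.get("colspan", 1)))
            self.in_cell, self.text = True, []

    def handle_data(self, data):
        if self.in_cell:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            value = "".join(self.text).strip()
            rs, cs = self.span
            for dr in range(rs):          # write the value into every
                for dc in range(cs):      # position the span covers
                    self.grid[self.row + dr, self.col + dc] = value
            self.col += cs
            self.in_cell = False

def html_table_to_grid(html):
    p = GridBuilder()
    p.feed(html)
    n_rows = max(r for r, _ in p.grid) + 1
    n_cols = max(c for _, c in p.grid) + 1
    return [[p.grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

html = """<table>
<tr><td rowspan="2">A</td><td>B</td><td>C</td></tr>
<tr><td colspan="2">D</td></tr>
</table>"""
print(html_table_to_grid(html))  # [['A', 'B', 'C'], ['A', 'D', 'D']]
```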
Here is a list of benchmark datasets that could be useful:
PubTables-1M (≈948k tables from PMC scientific PDFs; multimodal annotations covering table detection, structure recognition, and functional analysis; the training data behind the Table Transformer models)
PubTables-v2 (newer; full-page + multi-page)
TableBank (≈417k tables auto-labeled from Word/LaTeX documents; weak supervision for detection + recognition)
SciTSR (≈15k PDF tables with structure labels derived from LaTeX sources; SciTSR-COMP is its harder subset of complicated tables)