Yes — there are several widely used benchmarks / training datasets for table parsing, and they map pretty cleanly onto PDF, HTML, and XML sources (often with aligned pairs like PDF+XML).
Best-known datasets for PDF / document-table extraction
PubTabNet (PDF table images ↔ HTML structure)
What it’s good for: table structure recognition from rendered table images (often with OCR as a separate step).
Scale: ~568k table images with HTML representations; introduced the TEDS metric for structure similarity.
Built by aligning PMC Open Access XML with PDFs. (arXiv)
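PubTabNet ships each table as an image plus a JSON annotation in which the HTML structure tokens and the cell contents are stored separately, so rebuilding the full HTML string is a common preprocessing step. A minimal sketch, assuming the field layout I recall from the released JSON ("html" → "structure"/"cells") — verify the keys against the actual files before relying on this:

```python
# Sketch: rebuild an HTML table string from a PubTabNet-style record.
# Field names ("html", "structure", "tokens", "cells") follow my reading
# of the released annotations and may need adjusting. Real records also
# contain attribute tokens (e.g. for spans), which this sketch ignores.

def record_to_html(record):
    structure = record["html"]["structure"]["tokens"]
    cells = iter(record["html"]["cells"])  # one entry per <td>...</td>
    out = []
    for tok in structure:
        if tok == "</td>":
            # cell content is stored separately; splice it in before the close tag
            out.extend(next(cells).get("tokens", []))
        out.append(tok)
    return "<table>" + "".join(out) + "</table>"

example = {
    "html": {
        "structure": {"tokens": ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]},
        "cells": [{"tokens": ["Header"]}, {"tokens": ["Value"]}],
    }
}
print(record_to_html(example))  # <table><tr><td>Header</td><td>Value</td></tr></table>
```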
Marmot / ICDAR competition datasets (classics; smaller but common)
What they’re good for: benchmarking table detection and table recognition in older pipelines / comparisons.
Marmot is a public distribution site for a ground-truthed table dataset. (icst.pku.edu.cn)
ICDAR 2019 cTDaR is a well-known competition dataset for detection + recognition. (cndplab-founder.github.io)
ICDAR 2021 also ran a scientific table image recognition competition (image to LaTeX). (arXiv)
HTML / web-table corpora (great for “tables in the wild”)
Web Data Commons (WDC) Web Tables corpora
What it’s good for: large-scale HTML relational tables mined from Common Crawl; great for classification/filtering (relational vs layout), schema matching, table-to-KB, etc.
Note: these are often weakly labeled or require downstream filtering/annotation depending on your task. (webdatacommons.org)
WDC Schema.org Table Corpus (2023)
What it’s good for: big set of tables derived from schema.org-annotated content (more structured/typed than generic HTML tables).
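The WDC table corpora ship as JSON records, one table per line; in the releases I have seen, the cell data lives in a "relation" field stored column-major, alongside flags like "hasHeader". A hedged sketch of the row-major conversion (field names from my reading of the corpus docs — check the specific release you download):

```python
import json

# Sketch: turn one WDC-style table record into header + body rows.
# Assumes "relation" is column-major (a list of columns), as in the
# 2015-style corpus layout; verify against the release's documentation.

def wdc_to_rows(line):
    rec = json.loads(line)
    columns = rec["relation"]               # [[col0 cells], [col1 cells], ...]
    rows = list(map(list, zip(*columns)))   # transpose to row-major
    if rec.get("hasHeader"):
        return rows[0], rows[1:]
    return None, rows

sample = json.dumps({
    "relation": [["city", "Paris", "Rome"], ["country", "France", "Italy"]],
    "hasHeader": True,
})
header, body = wdc_to_rows(sample)
print(header)  # ['city', 'country']
print(body)    # [['Paris', 'France'], ['Rome', 'Italy']]
```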
XML sources that are especially useful for supervision
PMC Open Access Subset (bulk XML, plus aligned PDFs for many articles)
What it’s good for: building your own aligned corpora (XML table markup ↔ PDF rendering), or reproducing approaches like PubTabNet-style alignment.
PMC explicitly provides machine-readable datasets for text mining, including OA subsets. (PMC)
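PMC OA articles use JATS-style XML, where each table sits inside a <table-wrap> element wrapping a caption plus an HTML-like <table>. A minimal extraction sketch with the standard library — real articles vary in namespaces and nesting, so treat this as a starting point rather than a full JATS parser:

```python
import xml.etree.ElementTree as ET

# Sketch: pull (id, caption, cell grid) out of JATS-style <table-wrap>
# elements, as found in PMC Open Access XML. Not namespace-aware.

def extract_tables(xml_string):
    root = ET.fromstring(xml_string)
    tables = []
    for wrap in root.iter("table-wrap"):
        cap = wrap.find("caption/p")
        caption = "".join(cap.itertext()) if cap is not None else ""
        # each <tr>'s children are <th>/<td> cells; join their text content
        rows = [["".join(c.itertext()) for c in tr] for tr in wrap.iter("tr")]
        tables.append((wrap.get("id"), caption, rows))
    return tables

jats = """<article>
  <table-wrap id="tbl1">
    <label>Table 1</label>
    <caption><p>Example measurements.</p></caption>
    <table>
      <thead><tr><th>sample</th><th>value</th></tr></thead>
      <tbody><tr><td>A</td><td>1.2</td></tr></tbody>
    </table>
  </table-wrap>
</article>"""

print(extract_tables(jats))
```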
How to pick the right benchmark (quick guide)
If you want PDF → structure (cells/rows/cols): start with PubTables-1M, PubTabNet, SciTSR.
If you want full-page / multi-table document extraction: PubTables-v2 (and PubTables-1M as baseline).
If you want HTML “real web” tables: WDC WebTables + (optionally) WDC Schema.org Tables.
If you want lots of synthetic-ish variety / pretraining: TableBank.
If you tell me your target output format (CSV grid? HTML? LaTeX? cell bounding boxes? logical header structure?) and whether your inputs are born-digital PDFs vs scans, I can recommend a concrete “training mix” (e.g., pretrain on TableBank → finetune on PubTables-1M → evaluate on SciTSR-COMP, etc.).
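If your target output is a flat CSV grid, the step most pipelines share is expanding rowspan/colspan into a rectangular matrix. A self-contained sketch using only the standard library; duplicating the spanned value into every covered cell is one design choice among several (some pipelines emit blanks instead):

```python
from html.parser import HTMLParser

# Sketch: flatten an HTML table (including rowspan/colspan) into a
# rectangular grid of strings, duplicating spanned values.

class GridBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.grid, self.row, self.col = {}, -1, 0
        self.in_cell, self.text, self.span = False, [], (1, 1)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            # skip columns already claimed by a rowspan from an earlier row
            while (self.row, self.col) in self.grid:
                self.col += 1
            self.span = (int(a.get("rowspan", 1)), int(a.get("colspan", 1)))
            self.in_cell, self.text = True, []

    def handle_data(self, data):
        if self.in_cell:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            value = "".join(self.text).strip()
            rs, cs = self.span
            for dr in range(rs):          # write the value into every
                for dc in range(cs):      # position the span covers
                    self.grid[self.row + dr, self.col + dc] = value
            self.col += cs
            self.in_cell = False

def html_table_to_grid(html):
    p = GridBuilder()
    p.feed(html)
    n_rows = max(r for r, _ in p.grid) + 1
    n_cols = max(c for _, c in p.grid) + 1
    return [[p.grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

html = """<table>
<tr><td rowspan="2">A</td><td>B</td><td>C</td></tr>
<tr><td colspan="2">D</td></tr>
</table>"""
print(html_table_to_grid(html))  # [['A', 'B', 'C'], ['A', 'D', 'D']]
```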
Here is a list of benchmark datasets that could be useful:
PubTables-1M (≈948k tables from PMC scientific PDFs; multimodal annotations covering table detection, structure recognition, and functional analysis; the training data behind the Table Transformer models)
PubTables-v2 (newer; full-page + multi-page)
TableBank (≈417k tables auto-labeled from Word/LaTeX documents; weak supervision for detection + recognition)
SciTSR (≈15k PDF tables with structure labels derived from LaTeX sources; SciTSR-COMP is its harder subset of complicated tables)