Commit 55d3f5b

feat: add ground truth for benchmark quality scoring across all non-archive file types
Add ~469 ground truth text files covering all non-archive, non-image file types, expanding quality score coverage from 77 to 546 fixtures (95% of all benchmarks).

Ground truth generated using independent tools not part of benchmarked frameworks:
- Raw source copy for text formats (md, txt, rst, org, tex, toml, yaml, etc.)
- pdftotext for PDFs, python-docx/pptx/openpyxl for Office XML formats
- pandoc for epub, fb2, docbook, odt, rtf, opml
- beautifulsoup for HTML, Python email stdlib for EML, extract-msg for MSG
- odfpy for ODS, xlrd for XLS, nbformat for ipynb, textutil for legacy doc/ppt

Includes generation script, fixture JSON patches, and updated ground truth mapping.
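The per-format dispatch described in the commit message could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the commit's actual generation script: the names `ground_truth_for` and `eml_to_text`, the extension set, and the UTF-8 assumption are all hypothetical, and formats that need third-party tools (pdftotext, pandoc, python-docx, and so on) are deliberately left out.

```python
from email import message_from_bytes, policy
from pathlib import Path

# Hypothetical subset of the "raw source copy" formats named in the commit message.
RAW_COPY_EXTENSIONS = {".md", ".txt", ".rst", ".org", ".tex", ".toml", ".yaml"}


def eml_to_text(data: bytes) -> str:
    """Return the subject and plain-text body using the stdlib email package."""
    msg = message_from_bytes(data, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    subject = msg["subject"] or ""
    return f"{subject}\n\n{text}".strip()


def ground_truth_for(path: Path) -> str:
    """Pick an extractor by file extension (assumed dispatch rule)."""
    suffix = path.suffix.lower()
    if suffix in RAW_COPY_EXTENSIONS:
        # Text formats: the source file itself serves as ground truth.
        return path.read_text(encoding="utf-8")
    if suffix == ".eml":
        return eml_to_text(path.read_bytes())
    raise ValueError(f"no ground-truth extractor sketched for {suffix!r}")
```

Keeping the extractor independent of the benchmarked frameworks is the point of the design: a framework cannot score well merely by agreeing with its own output.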
1 parent ccae248 commit 55d3f5b

942 files changed, +314033 -95 lines changed

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
+@article{entry_article_one,
+author = {Alice Smith and Bob Johnson},
+title = {Exploring Article Formats},
+journal = {Journal of Examples},
+year = {2012},
+volume = {42},
+number = {3},
+pages = {101-120}
+}
+
+@book{entry_book_foundations,
+author = {Carla Ruiz and Deepak Patel},
+title = {Foundations of Structured Documents},
+publisher = {Example Press},
+year = {2010},
+address = {Berlin}
+}
+
+@inproceedings{entry_inproceedings_parallel,
+author = {Ethan Lee and Fatima Khan},
+title = {Parallel Parsing Techniques},
+booktitle = {Proceedings of the Structured Data Conference},
+year = {2019},
+pages = {55-68}
+}
+
+@phdthesis{entry_phdthesis_semantics,
+author = {Grace Muller},
+title = {Semantics Aware Document Pipelines},
+school = {University of Hamburg},
+year = {2015}
+}
+
+@mastersthesis{entry_mastersthesis_design,
+author = {Hugo Silva},
+title = {Designing Resilient Extraction Systems},
+school = {Technical University of Lisbon},
+year = {2018}
+}
+
+@techreport{entry_techreport_scalability,
+author = {Ingrid Novak},
+title = {Scalability Benchmarks for Extraction Engines},
+institution = {Alpine Research Labs},
+number = {ARL-TR-011},
+year = {2011}
+}
+
+@manual{entry_manual_reference,
+author = {Javier Torres},
+title = {Kreuzberg Reference Manual},
+organization = {Kreuzberg Labs},
+year = {2005}
+}
+
+@misc{entry_misc_dataset,
+author = {Keiko Tanaka},
+title = {Annotated Extraction Dataset},
+year = {2022},
+howpublished = {Dataset Repository},
+note = {Version 3.2}
+}
+
+@unpublished{entry_unpublished_notes,
+author = {Liam O'Connor},
+title = {Notes on Incremental Extraction},
+note = {Draft manuscript},
+year = {2021}
+}
+
+@incollection{entry_incollection_story,
+author = {Mei Huang},
+title = {Story Driven Testing},
+booktitle = {Modern Document Pipelines},
+publisher = {Insight Publishing},
+year = {2016},
+chapter = {4}
+}
+
+@inbook{entry_inbook_chapter,
+author = {Noah Becker},
+title = {Advanced Pipeline Patterns},
+chapter = {11},
+publisher = {TechWorks},
+year = {2009}
+}
+
+@proceedings{entry_proceedings_ai,
+title = {Proceedings of the AI Extraction Summit},
+year = {2017},
+editor = {Olivia Rossi}
+}
+
+@booklet{entry_booklet_summary,
+author = {Priya Desai},
+title = {Summary of Extraction Benchmarks},
+year = {2014},
+howpublished = {Internal Memo}
+}
+
+@article{entry_article_modern,
+author = {Quinn Parker},
+title = {Modern Approaches to Layout Analysis},
+journal = {International Journal of OCR},
+year = {2020},
+volume = {7},
+number = {1}
+}
+
+@book{entry_book_distributed,
+author = {Rina Haddad},
+title = {Distributed Text Processing},
+publisher = {Northern Lights},
+year = {2002}
+}
+
+@inproceedings{entry_inproceedings_reproducibility,
+author = {Samuel Ortega},
+title = {Reproducible Extraction Pipelines},
+booktitle = {Workshop on Reliable NLP},
+year = {2013},
+pages = {12-20}
+}
+
+@article{entry_article_architecture,
+author = {Talia Cohen},
+title = {Architecture Patterns for Extractors},
+journal = {Systems Journal},
+year = {2008}
+}
+
+@techreport{entry_techreport_innovation,
+author = {Umar Farouk},
+title = {Innovation in Document Processing},
+institution = {Global Research Institute},
+year = {2006}
+}
+
+@booklet{entry_booklet_research,
+author = {Valeria Costa},
+title = {Research Highlights in OCR},
+year = {1998},
+note = {Conference supplement}
+}
+
+@article{entry_article_future,
+author = {Wei Zhang},
+title = {Future of Unified Extraction},
+journal = {Journal of Pipeline Research},
+year = {2024}
+}
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
+# CommonMark Sample Document
+
+This is a comprehensive CommonMark document showcasing standard markdown elements for testing and benchmarking purposes.
+
+## Introduction
+
+CommonMark is a **standardized specification** of Markdown. This document demonstrates *all* the core elements that should be properly parsed and extracted.
+
+### About Markdown
+
+Markdown is a lightweight markup language with plain-text formatting syntax. It's designed to be *easy to read* and **easy to write**.
+
+## Text Formatting
+
+You can use various text formatting options:
+
+- **Bold text** using double asterisks or underscores
+- *Italic text* using single asterisks or underscores
+- ***Bold and italic*** combined
+- `inline code` for short snippets
+
+## Lists
+
+### Unordered Lists
+
+- First item
+- Second item
+  - Nested item 1.1
+  - Nested item 1.2
+- Third item
+  - Nested item 3.1
+    - Double nested 3.1.1
+    - Double nested 3.1.2
+
+### Ordered Lists
+
+1. First step
+2. Second step
+   1. Sub-step 2.1
+   2. Sub-step 2.2
+3. Third step
+   1. Sub-step 3.1
+4. Fourth step
+
+## Code Blocks
+
+### Inline Code
+
+Use the `function_name()` method to process data.
+
+### Code Blocks
+
+```rust
+fn hello_world() {
+    println!("Hello, World!");
+}
+```
+
+```python
+def fibonacci(n):
+    if n <= 1:
+        return n
+    return fibonacci(n-1) + fibonacci(n-2)
+```
+
+```javascript
+const greeting = (name) => {
+  return `Hello, ${name}!`;
+};
+```
+
+## Links and References
+
+Visit the [CommonMark specification](https://spec.commonmark.org/) for complete documentation.
+
+You can also use [reference-style links][commonmark-spec] if preferred.
+
+[commonmark-spec]: https://spec.commonmark.org/
+
+## Blockquotes
+
+> This is a blockquote. It's useful for highlighting important information or quotes from other sources.
+
+> Blockquotes can contain **bold text** and *italic text*.
+>
+> They can also span multiple paragraphs.
+
+## Horizontal Rule
+
+---
+
+## Tables
+
+| Language | Type | Package Manager |
+|----------|------|-----------------|
+| Python | Dynamic | pip |
+| Rust | Compiled | cargo |
+| JavaScript | Dynamic | npm |
+| Java | Compiled | Maven |
+
+## Images
+
+![Alt text for image](https://example.com/image.png)
+
+## Mixed Content
+
+Here's a paragraph with multiple types of formatting. It contains **bold**, *italic*, `code`, and [links](https://example.com).
+
+### Complex Nested List
+
+1. First item with **bold** and *italic*
+   - Sub-item with `code`
+   - Another sub-item with [link](https://example.com)
+2. Second item
+   - Multiple levels
+     - Level 3 item 1
+     - Level 3 item 2 with **formatting**
+   - Back to level 2
+3. Third item
+
+## Conclusion
+
+This CommonMark document includes all standard elements:
+- Headers at multiple levels
+- Paragraphs with inline formatting
+- Unordered and ordered lists
+- Nested lists
+- Code blocks with syntax highlighting
+- Inline code
+- Links and reference-style links
+- Blockquotes
+- Horizontal rules
+- Tables
+- Mixed formatting and content
+
+The document is structured for comprehensive testing of markdown parsing and extraction capabilities.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+Stanley Cups,,
+Team,Location,Stanley Cups
+Blues,STL,1
+Flyers,PHI,2
+Maple Leafs,TOR,13
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+Name,Department,Salary,Start Date,Active
+Alice Johnson,Engineering,95000,2020-01-15,true
+Bob Smith,Marketing,75000,2019-06-01,true
+Carol White,Sales,82000,2021-03-10,false
+David Brown,Engineering,105000,2018-11-20,true
+Emily Chen,HR,70000,2022-02-28,true
+Frank Davis,Operations,88000,2020-07-15,false
+Grace Lee,Marketing,79000,2021-09-01,true
+Henry Wilson,Engineering,115000,2017-04-12,true
+Isabella Martinez,Sales,91000,2019-12-05,true
+Jack Thompson,HR,68000,2023-01-10,true
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
–¼‘O,”N—î,ZŠ
2+
²“¡‘¾˜Y,30,“Œ‹ž
3+
ŽO–؉pŽq,25,‘åã
4+
îà‹´~,35,–¼ŒÃ‰®
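This fixture appears to be a Shift-JIS encoded CSV; that is an inference from the byte patterns, since the file name is hidden in this view. Such a file shows up as mojibake when its bytes are decoded with a Western code page. A minimal Python sketch of the effect and its reversal, where the helper name `as_seen_in_cp1252` is hypothetical:

```python
def as_seen_in_cp1252(raw: bytes) -> str:
    """Decode bytes as Windows-1252, dropping bytes that code page leaves undefined."""
    return raw.decode("cp1252", errors="ignore")


header = "名前,年齢,住所"            # "name, age, address"
raw = header.encode("shift_jis")     # bytes as a Shift-JIS CSV would store them

garbled = as_seen_in_cp1252(raw)     # what a Western-code-page viewer would display
restored = raw.decode("shift_jis")   # decoding with the matching codec recovers the text
```

Because Windows-1252 leaves a few byte values undefined, the garbled form can lose bytes, so recovery is only reliable from the original bytes, not from the displayed text.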
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+Simple table with caption:
+
+| Right | Left | Center | Default |
+|------:|:-----|:------:|-------|
+| 12 | 12 | 12 | 12 |
+| 123 | 123 | 123 | 123 |
+| 1 | 1 | 1 | 1 |
+
+^ Demonstration of simple table syntax.
+
+Simple table without caption:
+
+| Right | Left | Center | Default |
+|------:|:-----|:------:|-------|
+| 12 | 12 | 12 | 12 |
+| 123 | 123 | 123 | 123 |
+| 1 | 1 | 1 | 1 |
+
+Simple table indented two spaces:
+
+  | Right | Left | Center | Default |
+  |------:|:-----|:------:|-------|
+  | 12 | 12 | 12 | 12 |
+  | 123 | 123 | 123 | 123 |
+  | 1 | 1 | 1 | 1 |
+
+^ Demonstration of simple table syntax.
+
+Multiline table with caption:
+
+| Centered Header | Left Aligned | Right Aligned | Default aligned |
+|:---------------:|:-------------|--------------:|:------------------------------------------------------|
+| First | row | 12.0 | Example of a row that spans multiple lines. |
+| Second | row | 5.0 | Here's another one. Note the blank line between rows. |
+
+^ Here's the caption. It may span multiple lines.
+
+Multiline table without caption:
+
+| Centered Header | Left Aligned | Right Aligned | Default aligned |
+|:---------------:|:-------------|--------------:|:------------------------------------------------------|
+| First | row | 12.0 | Example of a row that spans multiple lines. |
+| Second | row | 5.0 | Here's another one. Note the blank line between rows. |
+
+Table without column headers:
+
+|----:|:----|:---:|----:|
+| 12 | 12 | 12 | 12 |
+| 123 | 123 | 123 | 123 |
+| 1 | 1 | 1 | 1 |
+
+Multiline table without column headers:
+
+|:------:|:----|-----:|-----------------------------------------------------|
+| First | row | 12.0 | Example of a row that spans multiple lines. |
+| Second | row | 5.0 | Here's another one. Note the blank line between rows. |
