This document acknowledges the sources of test documents and baseline data used in the Kreuzberg project.
Test documents and reference baseline outputs derived from the Pandoc test suite:
- Source: https://github.com/jgm/pandoc
- License: GPL-2.0-or-later
- Usage: Test documents and reference baselines only (no code copied from Pandoc)
- Attribution: John MacFarlane and Pandoc contributors
- Purpose: Baseline reference testing - used to validate that our native Rust extractors work correctly on the same documents that Pandoc processes
The following test documents were copied from the Pandoc repository to /test_documents/:
- org-select-tags.org - SELECT_TAGS and EXCLUDE_TAGS testing
- pandoc-tables.org - Org Mode table formats
- pandoc-writer.org - Comprehensive Pandoc test suite in Org Mode format
- typst-reader.typ - Fibonacci sequence with mathematical formulas
- undergradmath.typ - Comprehensive undergraduate mathematics document (16KB)
- docbook-chapter.docbook - Recursive section hierarchy (7 nested levels)
- docbook-reader.docbook - Comprehensive DocBook 4.4 test suite (36KB, 1704 lines)
- docbook-xref.docbook - Cross-reference (xref) functionality testing
- jats-reader.xml - Comprehensive JATS (Z39.96) Journal Archiving test document (38KB, 1460 lines)
- test_documents/fictionbook/pandoc/ - 13 FictionBook test files including:
  - basic.fb2 - Basic FictionBook structure
  - images-embedded.fb2 - Embedded base64 images
  - math.fb2 - Mathematical content
  - meta.fb2 - Document metadata testing
  - reader/emphasis.fb2 - Text emphasis testing
  - reader/epigraph.fb2 - Epigraph/quote elements
  - reader/meta.fb2 - Document metadata and title info
  - reader/notes.fb2 - Footnotes/endnotes with cross-references
  - reader/poem.fb2 - Poem/verse structure
  - reader/titles.fb2 - Section titles and heading hierarchy
  - And others
- opml-reader.opml - OPML 2.0 outline structure (US states example)
- pandoc-writer.opml - Comprehensive Pandoc test suite in OPML format
For each test document listed above, three baseline outputs were generated using Pandoc 3.8.3:
- Plain Text (*_pandoc_baseline.txt) - Raw text content extraction
- JSON Metadata (*_pandoc_meta.json) - Full Pandoc AST with document structure and metadata
- Markdown (*_pandoc_markdown.md) - Markdown representation for format comparison
Total: 132 baseline files for 44 documents across 6 formats
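As a rough illustration of how the three outputs per document might be produced, here is a minimal sketch. It is not the project's generate_pandoc_baselines.sh. The function names `baseline_stem` and `generate_baselines` are hypothetical, and the sketch assumes that each baseline is written next to its source document, with the `*` in the patterns above standing for the document path minus its extension. The reader name is passed explicitly rather than guessed from the file extension. Only the `-f`, `-t`, and `-o` options and the `plain`, `json`, and `markdown` writers are standard Pandoc.

```shell
# Hypothetical sketch, not the project's actual script.

# Strip the final extension from a document path (assumed naming scheme).
baseline_stem() {
    printf '%s\n' "${1%.*}"
}

# Generate the three baselines for one document.
#   $1 = Pandoc reader name (for example org, typst, docbook, jats, fb2, opml)
#   $2 = path to the test document
generate_baselines() {
    stem=$(baseline_stem "$2")
    pandoc -f "$1" -t plain    -o "${stem}_pandoc_baseline.txt" "$2" &&
    pandoc -f "$1" -t json     -o "${stem}_pandoc_meta.json"    "$2" &&
    pandoc -f "$1" -t markdown -o "${stem}_pandoc_markdown.md"  "$2"
}
```

Under these assumptions, `generate_baselines fb2 test_documents/fictionbook/pandoc/basic.fb2` would write `basic_pandoc_baseline.txt`, `basic_pandoc_meta.json`, and `basic_pandoc_markdown.md` into the same directory.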
We acknowledge that Pandoc is licensed under GPL-2.0-or-later. We have:
- ✓ Used Pandoc's test documents (test data is allowed under GPL)
- ✓ Generated baseline outputs using Pandoc for comparison purposes
- ✓ NOT copied any Pandoc source code
- ✓ Implemented our extractors independently in Rust
- ✓ Used Pandoc only as a behavioral baseline for testing
Our Rust extractors are independently implemented and do not contain any GPL-licensed code from Pandoc.
Test documents and baselines can be regenerated at any time using:

    ./generate_pandoc_baselines.sh

This script processes all test documents and generates fresh baselines using the installed version of Pandoc.
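After regenerating, it can be useful to confirm that every document still has all three baselines. The helper below is a sketch of one way to check this. `missing_baselines` is a hypothetical name, not part of the project, and it assumes baselines are stored beside the document under the three suffixes listed earlier.

```shell
# Hypothetical helper: print each expected baseline file that does not exist.
#   $1 = path to a test document
# Assumes baselines sit next to the document, named <stem><suffix>.
missing_baselines() {
    stem="${1%.*}"
    for suffix in _pandoc_baseline.txt _pandoc_meta.json _pandoc_markdown.md; do
        [ -f "${stem}${suffix}" ] || printf '%s\n' "${stem}${suffix}"
    done
}
```

The function prints nothing when all three files are present, so any output from a loop over the test documents points to a baseline that needs regenerating.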
- Last Updated: December 6, 2025
- Pandoc Version Used: 3.8.3
- Baseline Generation Date: December 6, 2025