PDF-to-Markdown converts PDF documents to markdown through a multi-stage transformation pipeline. Each stage analyzes and enriches the document structure until the final markdown is produced.
PDF File
↓
PDF.js extraction → TextItems (raw text fragments with positions)
↓
Stage 1: CalculateGlobalStats → Baseline statistics (font sizes, spacing)
↓
Stage 2-5: TextItem transformations → Detect structure (headers, lists, TOC)
↓
Stage 6: ToLineItemTransformation → TextItems → LineItems (grouped by y-coordinate)
↓
Stage 7-9: LineItem transformations → Refine structure (remove duplicates, compact lines)
↓
Stage 10: ToLineItemBlockTransformation → LineItems → Blocks (paragraphs, lists, code)
↓
Stage 11-12: Block transformations → Detect code/quotes, list hierarchy
↓
Stage 13: ToTextBlocks → Blocks → Text strings
↓
Stage 14: ToMarkdown → Text → Final markdown
↓
Markdown string
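The flow above amounts to folding an ordered list of stages over a ParseResult. The following is a minimal sketch of that idea, not the project's actual runner: the Stage interface, the runPipeline helper and the toy upperCase stage are hypothetical names, and the ParseResult shape is simplified.

```typescript
// Simplified stand-ins for the real model classes (assumed shapes).
interface Page { index: number; items: unknown[] }
interface ParseResult { pages: Page[]; messages?: string[] }

// Hypothetical stage contract: take a ParseResult, return a new one.
interface Stage {
  name: string
  transform(input: ParseResult): ParseResult
}

// Run the stages in order, feeding each stage's output into the next.
function runPipeline(initial: ParseResult, stages: Stage[]): ParseResult {
  return stages.reduce((result, stage) => stage.transform(result), initial)
}

// Toy stage that upper-cases string items without mutating its input.
const upperCase: Stage = {
  name: 'UpperCase',
  transform: (input) => ({
    ...input,
    pages: input.pages.map((page) => ({
      ...page,
      items: page.items.map((item) => (typeof item === 'string' ? item.toUpperCase() : item)),
    })),
  }),
}

const input: ParseResult = { pages: [{ index: 0, items: ['hello'] }] }
const output = runPipeline(input, [upperCase])
```

Because every stage returns a fresh object, the intermediate results can be kept around for per-stage inspection without defensive copying.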
Container for transformation pipeline output. Immutable pattern: each stage returns a new ParseResult.
class ParseResult {
  pages: Page[] // Document pages
  globals?: GlobalStats // Shared statistics
  messages?: string[] // Debug messages
}
Single page with items that evolve through the pipeline.
class Page {
  index: number // 0-based page number
  items: PageItem[] // TextItem | LineItem | LineItemBlock | string
}
Raw PDF.js text fragment with positioning.
class TextItem {
  x: number; y: number; width: number; height: number // Position/size
  text: string // Content
  font: string // Font ID
  type?: BlockTypeValue // Detected type (H1, LIST, etc)
}
Grouped text items forming a single line.
class LineItem {
  x: number; y: number; width: number; height: number // Position/size
  words: Word[] // Words with formatting
  type?: BlockTypeValue // H1, H2, LIST, CODE, etc
}
Group of lines forming a semantic block.
class LineItemBlock {
  items: LineItem[] // Lines in block
  type?: BlockTypeValue // PARAGRAPH, LIST, CODE_BLOCK, etc
}
Word with markdown formatting metadata.
class Word {
  string: string // Text
  type?: WordTypeValue // LINK, FOOTNOTE
  format?: WordFormatValue // BOLD, OBLIQUE, BOLD_OBLIQUE
}
Analyzes the entire PDF for baseline statistics.
Output:
globals: {
  mostUsedHeight: number // Common font size
  mostUsedFont: string // Common font
  mostUsedDistance: number // Line spacing
  maxHeight: number // Largest font (likely title)
  fontToFormats: Map<string, string> // Font → format mapping
}
Identifies headers using multiple strategies:
- Title page detection (first page, large font)
- TOC-based detection (if TOC found, map headlines)
- Height-based categorization (font size → H1/H2/H3)
Marks items with type: H1/H2/H3/H4/H5/H6
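Height-based categorization can be pictured as ranking the distinct font heights that exceed the body-text height. The sketch below makes that concrete under stated assumptions: headlineLevels is a hypothetical helper, and the rule "tallest distinct height becomes H1" is an illustration rather than the project's actual mapping.

```typescript
type HeadlineType = 'H1' | 'H2' | 'H3' | 'H4' | 'H5' | 'H6'
const LEVELS: HeadlineType[] = ['H1', 'H2', 'H3', 'H4', 'H5', 'H6']

// Map each distinct height above the body-text height to a headline level:
// the tallest becomes H1, the next H2, and so on (at most six levels).
function headlineLevels(heights: number[], mostUsedHeight: number): Map<number, HeadlineType> {
  const distinct = Array.from(new Set(heights.filter((h) => h > mostUsedHeight)))
    .sort((a, b) => b - a)
    .slice(0, LEVELS.length)
  const result = new Map<number, HeadlineType>()
  distinct.forEach((height, i) => {
    const level = LEVELS[i]
    if (level !== undefined) result.set(height, level)
  })
  return result
}

// Body text is 10 units high; 24, 18 and 14 are candidate headline heights.
const levels = headlineLevels([10, 10, 24, 18, 10, 14, 18], 10)
```

Heights equal to mostUsedHeight are deliberately left out, so body text never receives a headline type.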
Detects bullet/numbered lists using:
- List character patterns (•, -, *, 1., a., etc)
- Indentation consistency
- Vertical spacing
Marks items with type: LIST
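The character-pattern check is the easiest of the three signals to illustrate. The following sketch assumes two regular expressions and a hypothetical looksLikeListItem helper; the project's own isListItem may use different patterns.

```typescript
// A bullet character followed by whitespace and some content.
const BULLET = /^\s*[•\-*]\s+\S/
// A short number or a single letter, then "." or ")", whitespace and content.
const NUMBERED = /^\s*(\d{1,3}|[a-zA-Z])[.)]\s+\S/

// Hypothetical helper: true when the text starts like a list item.
function looksLikeListItem(text: string): boolean {
  return BULLET.test(text) || NUMBERED.test(text)
}

const samples = ['• first', '- second', '1. third', 'a. fourth', 'Plain text', '-5 degrees']
const flags = samples.map(looksLikeListItem)
```

A pattern check alone produces false positives (for example "A. Smith" at the start of a line), which is why indentation and vertical spacing are also consulted.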
Complex TOC detection algorithm:
- Find TOC pages (title + page numbers pattern)
- Extract headline items
- Map TOC entries to actual headlines in document
- Link pages for navigation
Updates globals.tocPages
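The first step, recognising a TOC page by its "title + page number" lines, might look like the sketch below. The regular expression, the 0.5 share threshold and the names parseTocLine and looksLikeTocPage are all assumptions made for illustration; the real detection is more involved.

```typescript
interface TocEntry { title: string; page: number }

// Parse "Title ..... 12" or "Title 12" into a title and a page number.
function parseTocLine(line: string): TocEntry | undefined {
  const match = /^(.+?)[\s.]+(\d+)$/.exec(line.trim())
  if (!match) return undefined
  const [, title, page] = match
  if (title === undefined || page === undefined) return undefined
  return { title, page: Number(page) }
}

// Treat a page as a TOC candidate when enough of its lines parse as entries.
function looksLikeTocPage(lines: string[], minShare = 0.5): boolean {
  if (lines.length === 0) return false
  const hits = lines.filter((line) => parseTocLine(line) !== undefined).length
  return hits / lines.length >= minShare
}

const entry = parseTocLine('Introduction ..... 3')
const isToc = looksLikeTocPage(['Contents', 'Introduction ..... 3', '2.1 Setup ... 12'])
```

Requiring a share of matching lines, rather than a single match, keeps ordinary sentences that happen to end in a number from flagging a page.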
Converts vertical text orientation to horizontal (e.g., rotated headers).
Converts: TextItems → LineItems
Groups text items by y-coordinate proximity into lines. Sorts words left-to-right.
Uses TextItemLineGrouper class.
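Grouping by y-coordinate proximity can be sketched as follows. This is not the TextItemLineGrouper implementation: groupIntoLines, the simplified item shape and the default threshold of 2 are assumptions, as is the convention that larger y values are higher on the page.

```typescript
interface TextItem { x: number; y: number; text: string }
interface Line { y: number; items: TextItem[] }

// Group items whose y-coordinates lie within `threshold` of the line's first item.
function groupIntoLines(items: TextItem[], threshold = 2): Line[] {
  const lines: Line[] = []
  // Assumed convention: y grows upward, so descending y reads top to bottom.
  const sorted = items.slice().sort((a, b) => b.y - a.y)
  for (const item of sorted) {
    const current = lines[lines.length - 1]
    if (current !== undefined && Math.abs(current.y - item.y) <= threshold) {
      current.items.push(item)
    } else {
      lines.push({ y: item.y, items: [item] })
    }
  }
  // Within each line, order the fragments left to right.
  for (const line of lines) line.items.sort((a, b) => a.x - b.x)
  return lines
}

const grouped = groupIntoLines([
  { x: 50, y: 700, text: 'world' },
  { x: 10, y: 700.5, text: 'Hello' },
  { x: 10, y: 680, text: 'Next' },
])
const firstLine = grouped[0]?.items.map((item) => item.text).join(' ')
```

The threshold absorbs small baseline differences between fragments on the same visual line, such as those caused by mixed fonts.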
Merges adjacent lines that should be a single line (e.g., lines broken by formatting).
Removes page headers/footers that repeat across pages.
Heuristic: If a line appears on most pages at the same position, it is likely a header or footer.
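A minimal sketch of this heuristic is shown below. The 0.6 page share, the rounding of y to whole units and the replacement of digits with "#" (so that "Page 3" and "Page 4" compare equal) are assumptions for illustration, and removeRepeatedLines is a hypothetical name.

```typescript
interface PageLine { y: number; text: string }

// Key a line by rounded position and digit-normalised text.
function lineKey(line: PageLine): string {
  return `${Math.round(line.y)}|${line.text.replace(/\d+/g, '#').trim()}`
}

// Drop lines whose key occurs on at least `minShare` of all pages.
function removeRepeatedLines(pages: PageLine[][], minShare = 0.6): PageLine[][] {
  const counts = new Map<string, number>()
  for (const page of pages) {
    // Count each key at most once per page.
    const keys = new Set(page.map(lineKey))
    keys.forEach((key) => counts.set(key, (counts.get(key) ?? 0) + 1))
  }
  // Never treat a line seen on a single page as repetitive.
  const minPages = Math.max(2, Math.ceil(pages.length * minShare))
  return pages.map((page) => page.filter((line) => (counts.get(lineKey(line)) ?? 0) < minPages))
}

const bodies = ['Introduction text', 'Method details', 'Final remarks']
const samplePages: PageLine[][] = bodies.map((body, i) => [
  { y: 800, text: 'My Report' },
  { y: 400, text: body },
  { y: 20, text: `Page ${i + 1}` },
])
const cleaned = removeRepeatedLines(samplePages)
```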
(Minimal - mostly passes through)
Converts: LineItems → LineItemBlocks
Groups consecutive lines of same type into blocks.
Example:
H1 line
PARAGRAPH line 1
PARAGRAPH line 2
LIST line 1
LIST line 2
→
Block(type=H1, items=[H1 line])
Block(type=PARAGRAPH, items=[PARAGRAPH line 1, PARAGRAPH line 2])
Block(type=LIST, items=[LIST line 1, LIST line 2])
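The grouping in this example is a single pass that either extends the last block or starts a new one. A sketch, with simplified LineItem and Block shapes and a hypothetical gatherBlocks name:

```typescript
interface LineItem { type: string; text: string }
interface Block { type: string; items: LineItem[] }

// Merge consecutive lines of the same type into one block.
function gatherBlocks(lines: LineItem[]): Block[] {
  const blocks: Block[] = []
  for (const line of lines) {
    const last = blocks[blocks.length - 1]
    if (last !== undefined && last.type === line.type) {
      last.items.push(line)
    } else {
      blocks.push({ type: line.type, items: [line] })
    }
  }
  return blocks
}

const blocks = gatherBlocks([
  { type: 'H1', text: 'H1 line' },
  { type: 'PARAGRAPH', text: 'PARAGRAPH line 1' },
  { type: 'PARAGRAPH', text: 'PARAGRAPH line 2' },
  { type: 'LIST', text: 'LIST line 1' },
  { type: 'LIST', text: 'LIST line 2' },
])
const summary = blocks.map((block) => `${block.type}:${block.items.length}`).join(',')
```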
Detects code blocks using indentation heuristics.
If a block is consistently indented beyond a threshold, it is marked with type: CODE_BLOCK.
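As a sketch of the indentation test, the check below compares each line's x-coordinate with the leftmost x of the page. The names isCodeBlock and pageMinX and the default threshold of 20 units are assumptions; the project's actual threshold and reference point are not stated here.

```typescript
interface IndentedLine { x: number; text: string }

// True when every line of the block sits more than `threshold` units
// to the right of the page's leftmost text position.
function isCodeBlock(lines: IndentedLine[], pageMinX: number, threshold = 20): boolean {
  return lines.length > 0 && lines.every((line) => line.x - pageMinX > threshold)
}

const indented = isCodeBlock(
  [{ x: 60, text: 'const a = 1' }, { x: 64, text: 'return a' }],
  30,
)
const mixed = isCodeBlock(
  [{ x: 60, text: 'const a = 1' }, { x: 30, text: 'Back at the margin' }],
  30,
)
```

Requiring every line to pass keeps a paragraph with an indented first line from being classified as code.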
Calculates list nesting levels using x-coordinates.
Updates list items with their nesting level so that markdown rendering can indent each - marker accordingly.
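One way to derive nesting levels from x-coordinates is to collect the distinct indentation stops and use each item's stop index as its level. The sketch below assumes a tolerance of 1 unit and two-space markdown indentation per level; listLevels and renderList are hypothetical names.

```typescript
interface ListLine { x: number; text: string }

// Level of each x-coordinate = index of its indentation stop, left to right.
function listLevels(xs: number[], tolerance = 1): number[] {
  const stops: number[] = []
  for (const x of xs.slice().sort((a, b) => a - b)) {
    const last = stops[stops.length - 1]
    // Start a new stop only when x is clearly right of the previous one.
    if (last === undefined || x - last > tolerance) stops.push(x)
  }
  return xs.map((x) => stops.findIndex((stop) => Math.abs(stop - x) <= tolerance))
}

// Render list lines with two spaces of indentation per nesting level.
function renderList(lines: ListLine[]): string {
  const levels = listLevels(lines.map((line) => line.x))
  return lines.map((line, i) => `${'  '.repeat(levels[i] ?? 0)}- ${line.text}`).join('\n')
}

const nested = renderList([
  { x: 50, text: 'a' },
  { x: 70, text: 'b' },
  { x: 90, text: 'c' },
  { x: 50, text: 'd' },
])
```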
Converts: LineItemBlocks → Text strings
Calls BlockType.toText(block) for each block to render markdown.
Collapses text blocks to final markdown string.
Defined in BlockType.ts:
- H1-H6 - Headers (# → ######)
- PARAGRAPH - Normal text
- LIST - Bullet/numbered lists
- CODE_BLOCK - Fenced code block
- QUOTE - > quote
- TOC - Table of contents
Each block type has a toText(block) method for markdown conversion.
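The per-type rendering can be pictured as a lookup table from block type to a render function. The sketch below is an illustration only: the Block shape is simplified to plain strings, the renderers table and toMarkdown helper are hypothetical, and only a few types are covered.

```typescript
interface Block { type: string; lines: string[] }
type Renderer = (block: Block) => string

// Built at runtime so this example contains no literal code fence.
const FENCE = '`'.repeat(3)

const heading = (level: number): Renderer => (block) =>
  `${'#'.repeat(level)} ${block.lines.join(' ')}`

const paragraph: Renderer = (block) => block.lines.join(' ')

const renderers: Record<string, Renderer | undefined> = {
  H1: heading(1),
  H2: heading(2),
  PARAGRAPH: paragraph,
  LIST: (block) => block.lines.map((line) => `- ${line}`).join('\n'),
  CODE_BLOCK: (block) => [FENCE, ...block.lines, FENCE].join('\n'),
  QUOTE: (block) => block.lines.map((line) => `> ${line}`).join('\n'),
}

// Render every block, falling back to paragraph text for unknown types,
// and separate blocks with a blank line.
function toMarkdown(blocks: Block[]): string {
  return blocks.map((block) => (renderers[block.type] ?? paragraph)(block)).join('\n\n')
}

const markdown = toMarkdown([
  { type: 'H1', lines: ['Title'] },
  { type: 'PARAGRAPH', lines: ['a', 'b'] },
  { type: 'LIST', lines: ['x', 'y'] },
])
```

Keeping the rendering next to the type definition means a new block type only needs one new table entry.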
GlobalStats uses hybrid typing:
interface GlobalStats {
  // Core stats (set by CalculateGlobalStats)
  mostUsedHeight: number
  mostUsedFont: string
  mostUsedDistance: number
  maxHeight: number
  // Transformation-specific
  tocPages?: number[]
  headlineTypeToHeightRange?: Record<string, HeightRange>
  // Extension point
  [key: string]: unknown
}
Rationale: Type safety for core properties + flexibility for custom transformations.
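The practical effect of the hybrid typing is that core properties are usable directly, while extension keys come back as unknown and must be narrowed before use. A short sketch, in which the HeightRange shape, the font ID value and the tableColumnCount key are hypothetical:

```typescript
// Assumed shape; the real HeightRange may differ.
interface HeightRange { min: number; max: number }

interface GlobalStats {
  mostUsedHeight: number
  mostUsedFont: string
  mostUsedDistance: number
  maxHeight: number
  tocPages?: number[]
  headlineTypeToHeightRange?: Record<string, HeightRange>
  [key: string]: unknown
}

const globals: GlobalStats = {
  mostUsedHeight: 10,
  mostUsedFont: 'font-1',
  mostUsedDistance: 12,
  maxHeight: 24,
}

// Core property: typed as number, no narrowing needed.
const doubled = globals.mostUsedHeight * 2

// Extension key written by a hypothetical custom transformation.
globals['tableColumnCount'] = 3

// Extension keys are `unknown` when read, so narrow before use.
const raw = globals['tableColumnCount']
const columnCount = typeof raw === 'number' ? raw : 0
```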
Text analysis utilities:
- isListItem(str) - Detects list patterns
- hasUpperCaseCharacterInMiddleOfWord(str) - Identifies camelCase
- calculateWordMatchScore(str1, str2) - Fuzzy matching for TOC
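To give a feel for the last two utilities, here are simplified stand-ins. They are not the project's implementations: hasUpperCaseInMiddle handles ASCII letters only, and wordMatchScore is an assumed scoring rule (share of words from the first string found in the second, ignoring case).

```typescript
// True when an upper-case ASCII letter directly follows another word character.
function hasUpperCaseInMiddle(text: string): boolean {
  return /\B[A-Z]/.test(text)
}

// Share of words from `a` that also occur in `b`, ignoring case (0 to 1).
function wordMatchScore(a: string, b: string): number {
  const words = (s: string): string[] =>
    s.toLowerCase().split(/\s+/).filter((w) => w.length > 0)
  const wordsA = words(a)
  if (wordsA.length === 0) return 0
  const setB = new Set(words(b))
  return wordsA.filter((w) => setB.has(w)).length / wordsA.length
}

const camel = hasUpperCaseInMiddle('camelCase')
const plain = hasUpperCaseInMiddle('Hello World')
const score = wordMatchScore('Getting Started Guide', 'getting started')
```

A score between 0 and 1 lets a caller accept near matches, for instance a TOC entry whose wording differs slightly from the headline in the body.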
Position utilities:
- minXFromBlocks(blocks) - Leftmost x-coordinate
- sortByX(items) - Sort left-to-right
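Both position utilities are small enough to sketch in full. The versions below are assumptions about their behaviour: this sortByX returns a sorted copy instead of sorting in place, and this minXFromBlocks returns undefined when there are no items.

```typescript
interface Positioned { x: number }
interface Block { items: Positioned[] }

// Return a copy of the items ordered left to right.
function sortByX<T extends Positioned>(items: T[]): T[] {
  return items.slice().sort((a, b) => a.x - b.x)
}

// Smallest x-coordinate across all items of all blocks.
function minXFromBlocks(blocks: Block[]): number | undefined {
  let min: number | undefined
  for (const block of blocks) {
    for (const item of block.items) {
      if (min === undefined || item.x < min) min = item.x
    }
  }
  return min
}

const unsorted = [{ x: 30 }, { x: 10 }, { x: 20 }]
const sortedXs = sortByX(unsorted).map((item) => item.x)
const leftmost = minXFromBlocks([{ items: [{ x: 40 }, { x: 25 }] }, { items: [{ x: 32 }] }])
```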
Multi-line headline matching for TOC. Uses character-by-character comparison to find headlines across lines.
Groups TextItems into LineItems by y-coordinate proximity. Configurable threshold.
App
├── UploadView (file selection)
├── LoadingView (PDF parsing + progress)
│ ├── PDF.js extraction
│ └── Transformation pipeline
├── ResultView (markdown output)
│ ├── Markdown display
│ └── Copy/download
└── DebugView (pipeline visualization)
└── Per-stage inspection
Manages document state:
class AppState {
  metadata: Metadata // PDF info
  pages: Page[] // Document pages
  transformations: Transformation[] // Pipeline stages
}
See TESTING.md for details.
- Unit tests - Utilities, models, transformations
- Integration tests - Full pipeline with small PDFs
- Snapshot tests - Verify output consistency
- Incremental processing - Pages processed one-by-one
- Worker threads - PDF.js runs in worker (not main thread)
- Lazy rendering - Debug view only renders visible stages
- Memoization - React.memo() on pure components
- Extend the Transformation base class
- Implement the transform(parseResult) method
- Add to the pipeline in the appropriate stage
- Update GlobalStats if needed
Example:
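class DetectTables extends ToLineItemTransformation {
  transform(parseResult: ParseResult): ParseResult {
    // Detect table patterns and mark matching items with type: TABLE.
    // Placeholder: pages pass through unchanged until detection is implemented.
    const newPages = parseResult.pages
    return new ParseResult({ pages: newPages })
  }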
}
- Add to BlockType.ts
- Define the toText(block) method
- Update transformations to detect/set type
Example:
export const BlockType = {
  TABLE: {
    name: 'TABLE',
    toText(block: Block) {
      // Render markdown table
      return '| col1 | col2 |\n|---|---|\n...'
    }
  }
}
- Font detection - Uses an internal PDF.js API (_transport.commonObjs), which may break in future PDF.js versions
- Table detection - Limited support (tables are treated as paragraphs)
- Multi-column layouts - May merge columns incorrectly
- Image extraction - Positions noted, but images not embedded
- Rotated text - Partial support (VerticalToHorizontal handles some cases)
- Better table detection/rendering
- Multi-column layout handling
- Image embedding (base64 or external files)
- OCR integration for scanned PDFs
- Custom transformation plugins
- CLI version for batch processing