|
| 1 | +# 035: Text Extraction |
| 2 | + |
| 3 | +## Problem Statement |
| 4 | + |
| 5 | +Users need to extract text content from PDF pages with position information. Key use cases: |
| 6 | + |
| 7 | +1. **Search and locate** — Find text patterns (e.g., `{{ field }}`) and get their bounding boxes for replacement or annotation |
| 8 | +2. **Plain text extraction** — Get readable text for indexing, accessibility, or processing |
| 9 | +3. **Structured extraction** — Access text by line/span with font metadata for layout analysis |
| 10 | + |
| 11 | +This is a Tier 3 feature per GOALS.md, supporting search, indexing, and accessibility. |
| 12 | + |
| 13 | +## Scope |
| 14 | + |
| 15 | +### In Scope |
| 16 | + |
| 17 | +- Extract text content from page content streams |
| 18 | +- Track text positioning (bounding boxes) |
| 19 | +- Group text into lines and spans based on baseline/font |
| 20 | +- Support string and regex search with position results |
| 21 | +- Handle common text operators (Tj, TJ, ', ") |
| 22 | +- Basic Unicode mapping via ToUnicode CMaps and font encodings |
| 23 | +- Page-level and document-level extraction/search APIs |
| 24 | + |
| 25 | +### Out of Scope |
| 26 | + |
| 27 | +- Complex layout analysis (multi-column detection, tables) |
| 28 | +- Right-to-left and vertical text layout |
| 29 | +- Marked content / tagged PDF structure |
| 30 | +- Text extraction from annotations (separate feature) |
| 31 | +- OCR or image-based text |
| 32 | +- Hyphenation joining across lines |
| 33 | + |
| 34 | +## Dependencies |
| 35 | + |
| 36 | +- **Content stream parser** — Already exists at `src/content/` |
| 37 | +- **Font layer** — Already exists at `src/fonts/` with `decode()` and ToUnicode support |
| 38 | +- **Graphics state tracking** — Needs CTM (current transformation matrix) integration |
| 39 | + |
| 40 | +## Desired API |
| 41 | + |
| 42 | +### Basic Usage |
| 43 | + |
| 44 | +```typescript |
| 45 | +const pdf = await PDF.load(bytes); |
| 46 | +const page = pdf.getPage(0); |
| 47 | + |
| 48 | +// Extract all text with positions |
| 49 | +const pageText = await page.extractText(); |
| 50 | +console.log(pageText.text); // "Hello World\nSecond line..." |
| 51 | + |
| 52 | +// Access structured content |
| 53 | +for (const line of pageText.lines) { |
| 54 | + console.log(`Line at y=${line.baseline}: "${line.text}"`); |
| 55 | +} |
| 56 | +``` |
| 57 | + |
| 58 | +### Search |
| 59 | + |
| 60 | +```typescript |
| 61 | +// String search on a page |
| 62 | +const matches = await page.findText("{{ name }}"); |
| 63 | +for (const match of matches) { |
| 64 | + console.log(`Found at:`, match.bbox); // { x, y, width, height } |
| 65 | +} |
| 66 | + |
| 67 | +// Regex search |
| 68 | +const fields = await page.findText(/\{\{\s*\w+\s*\}\}/g); |
| 69 | + |
| 70 | +// Document-wide search |
| 71 | +const allMatches = await pdf.findText("invoice", { |
| 72 | + pages: [0, 1, 2], |
| 73 | + caseSensitive: false, |
| 74 | +}); |
| 75 | +``` |
| 76 | + |
| 77 | +### Template Replacement Pattern |
| 78 | + |
| 79 | +```typescript |
| 80 | +const placeholders = await pdf.findText(/\{\{\s*(\w+)\s*\}\}/g); |
| 81 | + |
| 82 | +for (const match of placeholders) { |
| 83 | + const fieldName = match.text.replace(/[{}]/g, "").trim(); // "{{ name }}" -> "name" |
| 84 | + const value = data[fieldName] ?? ""; // `data`: caller-supplied Record<string, string> |
| 85 | + |
| 86 | + const page = pdf.getPage(match.pageIndex); |
| 87 | + // Cover original text with white rectangle |
| 88 | + page.drawRectangle({ ...match.bbox, color: rgb(1, 1, 1) }); |
| 89 | + // Draw replacement text (bbox.y is the bottom edge of the box, which may sit below the baseline) |
| 90 | + page.drawText(value, { x: match.bbox.x, y: match.bbox.y, fontSize: 12 }); |
| 91 | +} |
| 92 | +``` |
| 93 | + |
| 94 | +## Types |
| 95 | + |
| 96 | +```typescript |
| 97 | +/** Rectangle in PDF coordinates (origin at bottom-left) */ |
| 98 | +interface BoundingBox { |
| 99 | + x: number; // Left edge |
| 100 | + y: number; // Bottom edge |
| 101 | + width: number; |
| 102 | + height: number; |
| 103 | +} |
| 104 | + |
| 105 | +/** Single character with position */ |
| 106 | +interface ExtractedChar { |
| 107 | + char: string; |
| 108 | + bbox: BoundingBox; |
| 109 | + fontSize: number; |
| 110 | + fontName: string; |
| 111 | + baseline: number; |
| 112 | +} |
| 113 | + |
| 114 | +/** Text span (same font/size on same line) */ |
| 115 | +interface TextSpan { |
| 116 | + text: string; |
| 117 | + bbox: BoundingBox; |
| 118 | + chars: ExtractedChar[]; |
| 119 | + fontSize: number; |
| 120 | + fontName: string; |
| 121 | +} |
| 122 | + |
| 123 | +/** Line of text (multiple spans on same baseline) */ |
| 124 | +interface TextLine { |
| 125 | + text: string; |
| 126 | + bbox: BoundingBox; |
| 127 | + spans: TextSpan[]; |
| 128 | + baseline: number; |
| 129 | +} |
| 130 | + |
| 131 | +/** Full page extraction result */ |
| 132 | +interface PageText { |
| 133 | + pageIndex: number; |
| 134 | + width: number; |
| 135 | + height: number; |
| 136 | + lines: TextLine[]; |
| 137 | + text: string; // Plain text (lines joined with \n) |
| 138 | +} |
| 139 | + |
| 140 | +/** Search match */ |
| 141 | +interface TextMatch { |
| 142 | + text: string; |
| 143 | + bbox: BoundingBox; |
| 144 | + pageIndex: number; |
| 145 | + charBoxes: BoundingBox[]; // Per-character boxes for highlighting |
| 146 | +} |
| 147 | + |
| 148 | +/** Extraction options */ |
| 149 | +interface ExtractTextOptions { |
| 150 | + /** Include individual character positions (default: true for search support) */ |
| 151 | + includeChars?: boolean; |
| 152 | +} |
| 153 | + |
| 154 | +/** Search options */ |
| 155 | +interface FindTextOptions { |
| 156 | + /** Pages to search (default: all) */ |
| 157 | + pages?: number[]; |
| 158 | + /** Case-sensitive matching (default: true) */ |
| 159 | + caseSensitive?: boolean; |
| 160 | + /** Match whole words only (default: false) */ |
| 161 | + wholeWord?: boolean; |
| 162 | +} |
| 163 | +``` |
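One way a `TextMatch` could be derived from extracted characters is to run the pattern over the joined text and take the union of the matching characters' boxes. The sketch below is only an illustration: `PositionedChar`, `unionBoxes`, and `findInChars` are hypothetical names, and it assumes every entry holds exactly one UTF-16 code unit so that string indices line up with array indices (inferred spaces would need their own entries).

```typescript
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical minimal character record for this sketch.
interface PositionedChar {
  char: string; // assumed to be a single UTF-16 code unit
  bbox: BoundingBox;
}

interface MatchResult {
  text: string;
  bbox: BoundingBox;
  charBoxes: BoundingBox[];
}

// Smallest box that contains every box in the list.
function unionBoxes(boxes: BoundingBox[]): BoundingBox {
  const left = Math.min(...boxes.map((b) => b.x));
  const bottom = Math.min(...boxes.map((b) => b.y));
  const right = Math.max(...boxes.map((b) => b.x + b.width));
  const top = Math.max(...boxes.map((b) => b.y + b.height));
  return { x: left, y: bottom, width: right - left, height: top - bottom };
}

function findInChars(chars: PositionedChar[], pattern: RegExp): MatchResult[] {
  const text = chars.map((c) => c.char).join("");
  // Force the global flag so exec() walks through every match.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags);
  const results: MatchResult[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      re.lastIndex++; // avoid looping forever on empty matches
      continue;
    }
    const charBoxes = chars.slice(m.index, m.index + m[0].length).map((c) => c.bbox);
    results.push({ text: m[0], bbox: unionBoxes(charBoxes), charBoxes });
  }
  return results;
}
```

Keeping `charBoxes` alongside the union box lets a caller highlight a match that wraps or contains uneven glyph heights without recomputing positions.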
| 164 | + |
| 165 | +## Architecture |
| 166 | + |
| 167 | +### Components |
| 168 | + |
| 169 | +``` |
| 170 | +PDFPage.extractText() |
| 171 | + │ |
| 172 | + ▼ |
| 173 | +TextExtractor |
| 174 | + │ |
| 175 | + ├─► ContentStreamParser (existing) |
| 176 | + │ |
| 177 | + ├─► TextState (tracks Tm, Tc, Tw, etc.) |
| 178 | + │ |
| 179 | + ├─► Font.decode() (existing) |
| 180 | + │ |
| 181 | + └─► LineGrouper (groups chars into lines/spans) |
| 182 | +``` |
| 183 | + |
| 184 | +### TextState |
| 185 | + |
| 186 | +Tracks text-related graphics state during content stream processing: |
| 187 | + |
| 188 | +- **Tm** — Text matrix (position/transform) |
| 189 | +- **Tlm** — Text line matrix (start of current line) |
| 190 | +- **Tf** — Current font and size |
| 191 | +- **Tc** — Character spacing |
| 192 | +- **Tw** — Word spacing |
| 193 | +- **Tz** — Horizontal scaling |
| 194 | +- **TL** — Leading (line spacing) |
| 195 | +- **Ts** — Text rise (superscript/subscript) |
| 196 | +- **CTM** — Current transformation matrix (from graphics state) |
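To show how these parameters interact, the sketch below computes the horizontal advance after a glyph using the displacement rule from the PDF specification, tx = ((w0 − Tj/1000) × Tfs + Tc + Tw) × Th. The `TextParams` interface and `glyphAdvance` function are hypothetical names for illustration, not part of the planned API, and only horizontal writing mode is covered.

```typescript
// Hypothetical subset of the text state needed to advance the text position.
interface TextParams {
  fontSize: number; // Tfs
  charSpacing: number; // Tc
  wordSpacing: number; // Tw
  horizontalScaling: number; // Tz, as a percentage (100 = normal)
}

function glyphAdvance(
  glyphWidth: number, // glyph width in thousandths of a text-space unit
  isSpace: boolean, // word spacing applies only to the single-byte code 32
  params: TextParams,
  tjAdjustment = 0, // number from a TJ array, in thousandths; subtracted
): number {
  const w0 = glyphWidth / 1000;
  const th = params.horizontalScaling / 100;
  const tw = isSpace ? params.wordSpacing : 0;
  return ((w0 - tjAdjustment / 1000) * params.fontSize + params.charSpacing + tw) * th;
}
```

Note that a positive TJ adjustment moves the next glyph left, which is why it is subtracted.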
| 197 | + |
| 198 | +### Text Operators to Handle |
| 199 | + |
| 200 | +| Operator | Description | |
| 201 | +|----------|-------------| |
| 202 | +| BT/ET | Begin/end text object | |
| 203 | +| Tf | Set font and size | |
| 204 | +| Tm | Set text matrix | |
| 205 | +| Td | Move to start of next line, offset by (tx, ty) | |
| 206 | +| TD | Same as Td, and set leading to -ty | |
| 207 | +| T* | Move to next line (using TL) | |
| 208 | +| Tc | Set character spacing | |
| 209 | +| Tw | Set word spacing | |
| 210 | +| Tz | Set horizontal scaling | |
| 211 | +| TL | Set leading | |
| 212 | +| Ts | Set text rise | |
| 213 | +| Tj | Show string | |
| 214 | +| TJ | Show strings with positioning | |
| 215 | +| ' | Move to next line and show string | |
| 216 | +| " | Set word and character spacing, move to next line, and show string | |
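The positioning operators in the table can be sketched as small updates to the text matrix and text line matrix. This is a minimal illustration, assuming matrices are stored as `[a, b, c, d, e, f]`; the `TextPositionState` class and its method names are hypothetical and may not match the eventual `TextState` design.

```typescript
// [a, b, c, d, e, f], in the same order as the operands of the Tm operator.
type Matrix = [number, number, number, number, number, number];

const identity = (): Matrix => [1, 0, 0, 1, 0, 0];
const copy = (m: Matrix): Matrix => [m[0], m[1], m[2], m[3], m[4], m[5]];

class TextPositionState {
  tm: Matrix = identity(); // text matrix
  tlm: Matrix = identity(); // text line matrix
  leading = 0; // TL

  // BT: both matrices start as identity.
  beginText(): void {
    this.tm = identity();
    this.tlm = identity();
  }

  // Tm: replace both matrices with the given one.
  setTextMatrix(m: Matrix): void {
    this.tm = copy(m);
    this.tlm = copy(m);
  }

  // Td: translate the line matrix by (tx, ty) in its own coordinate space,
  // i.e. Tlm = [1 0 0 1 tx ty] x Tlm, then restart the text matrix there.
  moveText(tx: number, ty: number): void {
    const [a, b, c, d, e, f] = this.tlm;
    this.tlm = [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f];
    this.tm = copy(this.tlm);
  }

  // TD: same as Td, but also sets the leading to -ty.
  moveTextSetLeading(tx: number, ty: number): void {
    this.leading = -ty;
    this.moveText(tx, ty);
  }

  // T*: move down by the current leading.
  nextLine(): void {
    this.moveText(0, -this.leading);
  }
}
```

With these in place, `'` reduces to `nextLine()` followed by the Tj handling, and `"` to setting Tw and Tc before doing the same.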
| 217 | + |
| 218 | +### Coordinate Transformation |
| 219 | + |
| 220 | +Character positions must account for: |
| 221 | + |
| 222 | +1. **Text matrix (Tm)** — Position within text object |
| 223 | +2. **CTM** — Page-level transformation (rotation, scaling) |
| 224 | +3. **Font metrics** — Glyph widths, ascender/descender |
| 225 | + |
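| 226 | +Final position = CTM × Tm × glyph_position (column-vector notation; the PDF specification writes the same product with row vectors, as glyph_position × Tm × CTM) |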
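As a worked illustration of the transformation, the sketch below composes the matrices using the row-vector convention of the PDF specification, where the text rendering matrix is [Tfs×Th 0 0; 0 Tfs 0; 0 Trise 1] × Tm × CTM. The helper names (`multiply`, `applyToPoint`, `textRenderingMatrix`) are hypothetical and chosen for this example only.

```typescript
// [a, b, c, d, e, f] represents the matrix rows [a b 0], [c d 0], [e f 1].
type Matrix = [number, number, number, number, number, number];

// Row-vector convention: a point transformed by multiply(m, n)
// is transformed by m first, then by n.
function multiply(m: Matrix, n: Matrix): Matrix {
  const [a1, b1, c1, d1, e1, f1] = m;
  const [a2, b2, c2, d2, e2, f2] = n;
  return [
    a1 * a2 + b1 * c2,
    a1 * b2 + b1 * d2,
    c1 * a2 + d1 * c2,
    c1 * b2 + d1 * d2,
    e1 * a2 + f1 * c2 + e2,
    e1 * b2 + f1 * d2 + f2,
  ];
}

function applyToPoint(m: Matrix, x: number, y: number): { x: number; y: number } {
  const [a, b, c, d, e, f] = m;
  return { x: a * x + c * y + e, y: b * x + d * y + f };
}

// Maps glyph space (1 unit = 1 em) to device/page space.
function textRenderingMatrix(
  fontSize: number, // Tfs
  horizontalScale: number, // Th = Tz / 100
  rise: number, // Ts
  tm: Matrix,
  ctm: Matrix,
): Matrix {
  const params: Matrix = [fontSize * horizontalScale, 0, 0, fontSize, 0, rise];
  return multiply(multiply(params, tm), ctm);
}

// Example: text at (100, 200) on a page whose CTM rotates by 90 degrees
// counterclockwise and shifts right by 612 units.
const tm: Matrix = [1, 0, 0, 1, 100, 200];
const ctm: Matrix = [0, 1, -1, 0, 612, 0];
const origin = applyToPoint(multiply(tm, ctm), 0, 0); // { x: 412, y: 100 }
```

Transforming all four corners of a glyph box this way, then taking their min/max, gives an axis-aligned `BoundingBox` even when the page is rotated.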
| 227 | + |
| 228 | +### Line Grouping Algorithm |
| 229 | + |
| 230 | +1. Group characters by baseline Y coordinate, treating baselines within a small tolerance as the same line, and order the groups top to bottom |
| 231 | +2. Within each baseline group, sort by X coordinate |
| 232 | +3. Detect spans based on font/size changes |
| 233 | +4. Join characters into text strings, inferring spaces from gaps |
| 234 | + |
| 235 | +Space detection heuristic: |
| 236 | +- If the gap between adjacent characters exceeds 0.3 × font size, insert a space |
| 237 | +- The threshold should be configurable, since PDF generators differ in how they position glyphs |
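The grouping steps and the space heuristic above can be sketched as follows. This is a simplified illustration rather than the planned `LineGrouper`: the `Glyph` shape, the `groupIntoLines` name, and the default baseline tolerance of 2 units are assumptions, and span detection by font is left out.

```typescript
// Hypothetical flattened character record for this sketch.
interface Glyph {
  char: string;
  x: number; // left edge
  width: number;
  baseline: number; // Y coordinate, PDF space (larger = higher on the page)
  fontSize: number;
}

function groupIntoLines(
  glyphs: Glyph[],
  baselineTolerance = 2, // assumed default, in PDF units
  spaceFactor = 0.3, // gap threshold as a fraction of font size
): string[] {
  // Step 1: order by baseline, top of page first, and cluster by tolerance.
  const byBaseline = [...glyphs].sort((a, b) => b.baseline - a.baseline);
  const lines: Glyph[][] = [];
  for (const g of byBaseline) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current[0].baseline - g.baseline) <= baselineTolerance) {
      current.push(g);
    } else {
      lines.push([g]);
    }
  }

  return lines.map((line) => {
    // Step 2: left-to-right within the line.
    line.sort((a, b) => a.x - b.x);
    // Step 4: join characters, inferring a space wherever the gap is wide.
    let text = "";
    line.forEach((g, i) => {
      if (i > 0) {
        const prev = line[i - 1];
        const gap = g.x - (prev.x + prev.width);
        if (gap > spaceFactor * g.fontSize) text += " ";
      }
      text += g.char;
    });
    return text;
  });
}
```

Clustering before sorting by X matters: sorting on baseline and X in a single comparator would scatter characters whose baselines differ by a fraction of a unit.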
| 238 | + |
| 239 | +## Test Plan |
| 240 | + |
| 241 | +### Unit Tests |
| 242 | + |
| 243 | +- Parse text operators correctly (Tj, TJ, Td, Tm, etc.) |
| 244 | +- Calculate character positions with various transforms |
| 245 | +- Group characters into lines/spans correctly |
| 246 | +- Space detection between words |
| 247 | +- Handle font changes mid-line |
| 248 | + |
| 249 | +### Integration Tests |
| 250 | + |
| 251 | +- Extract text from simple single-page PDF |
| 252 | +- Extract text with multiple fonts/sizes |
| 253 | +- Extract text from rotated pages |
| 254 | +- Search for literal strings |
| 255 | +- Search with regex patterns |
| 256 | +- Document-wide search across pages |
| 257 | +- Handle PDFs with missing ToUnicode (use font encoding fallback) |
| 258 | + |
| 259 | +### Fixtures Needed |
| 260 | + |
| 261 | +- `fixtures/text/simple.pdf` — Basic text content |
| 262 | +- `fixtures/text/multiline.pdf` — Multiple lines and paragraphs |
| 263 | +- `fixtures/text/fonts.pdf` — Multiple fonts and sizes |
| 264 | +- `fixtures/text/rotated.pdf` — Rotated page content |
| 265 | +- `fixtures/text/positioned.pdf` — Text with TJ positioning adjustments |
| 266 | +- `fixtures/text/template.pdf` — Document with `{{ placeholder }}` markers |
| 267 | + |
| 268 | +## Open Questions |
| 269 | + |
| 270 | +1. **Reading order**: Should we attempt to detect multi-column layouts, or just use raw position order? |
| 271 | + - *Initial approach*: Position order (left-to-right, top-to-bottom). Complex layout analysis is out of scope. |
| 272 | + |
| 273 | +2. **Whitespace handling**: How to handle multiple spaces, tabs, and form feeds? |
| 274 | + - *Initial approach*: Normalize to single spaces. Provide `preserveWhitespace` option if needed. |
| 275 | + |
| 276 | +3. **Ligatures**: How should ligature glyphs such as fi and fl be handled? |
| 277 | + - *Initial approach*: Map via ToUnicode if available; otherwise return the Unicode ligature character (e.g., U+FB01 for fi). |
| 278 | + |
| 279 | +4. **CID fonts without ToUnicode**: Some PDFs have CID fonts without ToUnicode CMaps. |
| 280 | + - *Initial approach*: Return U+FFFD (the replacement character) or an empty string, and log a warning. |
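For questions 2 and 3, a post-processing step along the following lines might be enough to start with. The `preserveWhitespace` option comes from the proposal above; the `expandLigatures` option and the `normalizeExtractedText` function are hypothetical additions for illustration. Ligature expansion here relies on Unicode NFKC normalization, which also rewrites other compatibility characters (full-width forms, superscripts), so it would likely need to be opt-in.

```typescript
interface NormalizeOptions {
  /** Keep runs of spaces, tabs, and form feeds as extracted (default: false) */
  preserveWhitespace?: boolean;
  /** Hypothetical option: expand ligature characters such as U+FB01 (default: false) */
  expandLigatures?: boolean;
}

function normalizeExtractedText(text: string, options: NormalizeOptions = {}): string {
  let result = text;
  if (options.expandLigatures) {
    // NFKC maps compatibility characters to their plain equivalents.
    result = result.normalize("NFKC");
  }
  if (!options.preserveWhitespace) {
    // Collapse runs of spaces, tabs, and form feeds; newlines are kept
    // because they separate lines in the plain-text output.
    result = result.replace(/[ \t\f]+/g, " ");
  }
  return result;
}
```

Because both steps can change string length, they would need to run on a copy of the text used for display or indexing, not on the text whose indices map to character boxes.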
| 281 | + |
| 282 | +## Risks |
| 283 | + |
| 284 | +- **Performance**: Large documents may have many text objects. Consider lazy extraction per page. |
| 285 | +- **Font encoding edge cases**: Legacy PDFs may use obscure encodings. May need fallback strategies. |
| 286 | +- **Accuracy**: Bounding boxes depend on font metrics which may be incomplete or inaccurate in some PDFs. |
| 287 | + |
| 288 | +## Implementation Phases |
| 289 | + |
| 290 | +### Phase 1: Core Extraction |
| 291 | +- TextState class |
| 292 | +- Process text operators |
| 293 | +- Character-level extraction with positions |
| 294 | + |
| 295 | +### Phase 2: Grouping |
| 296 | +- Line grouping by baseline |
| 297 | +- Span grouping by font |
| 298 | +- Space detection |
| 299 | + |
| 300 | +### Phase 3: Search |
| 301 | +- String search with bbox |
| 302 | +- Regex search |
| 303 | +- Document-wide search API |
| 304 | + |
| 305 | +### Phase 4: API Integration |
| 306 | +- `PDFPage.extractText()` |
| 307 | +- `PDFPage.findText()` |
| 308 | +- `PDF.findText()` (document-wide) |