Commit 6a5a988

feat(text): add text extraction and search API
Implement comprehensive text extraction from PDF content streams with:

- Character-level position tracking via TextState and text operators
- Line and span grouping based on baseline/font (LineGrouper)
- String and regex search with bounding box results
- Page-level (extractText, findText) and document-level APIs

Supports ToUnicode CMaps and font encodings for Unicode mapping.
1 parent 9a52eca commit 6a5a988

41 files changed, +4853 -115 lines changed
Lines changed: 308 additions & 0 deletions
@@ -0,0 +1,308 @@
# 035: Text Extraction

## Problem Statement

Users need to extract text content from PDF pages with position information. Key use cases:

1. **Search and locate**: Find text patterns (e.g., `{{ field }}`) and get their bounding boxes for replacement or annotation
2. **Plain text extraction**: Get readable text for indexing, accessibility, or processing
3. **Structured extraction**: Access text by line/span with font metadata for layout analysis

This is a Tier 3 feature per GOALS.md, supporting search, indexing, and accessibility.

## Scope

### In Scope

- Extract text content from page content streams
- Track text positioning (bounding boxes)
- Group text into lines and spans based on baseline/font
- Support string and regex search with position results
- Handle common text operators (`Tj`, `TJ`, `'`, `"`)
- Basic Unicode mapping via ToUnicode CMaps and font encodings
- Page-level and document-level extraction/search APIs

### Out of Scope

- Complex layout analysis (multi-column detection, tables)
- Right-to-left and vertical text layout
- Marked content / tagged PDF structure
- Text extraction from annotations (separate feature)
- OCR or image-based text
- Hyphenation joining across lines

## Dependencies

- **Content stream parser**: Already exists at `src/content/`
- **Font layer**: Already exists at `src/fonts/` with `decode()` and ToUnicode support
- **Graphics state tracking**: Needs CTM (current transformation matrix) integration

## Desired API

### Basic Usage

```typescript
const pdf = await PDF.load(bytes);
const page = pdf.getPage(0);

// Extract all text with positions
const pageText = await page.extractText();
console.log(pageText.text); // "Hello World\nSecond line..."

// Access structured content
for (const line of pageText.lines) {
  console.log(`Line at y=${line.baseline}: "${line.text}"`);
}
```

58+
### Search
59+
60+
```typescript
61+
// String search on a page
62+
const matches = await page.findText("{{ name }}");
63+
for (const match of matches) {
64+
console.log(`Found at:`, match.bbox); // { x, y, width, height }
65+
}
66+
67+
// Regex search
68+
const fields = await page.findText(/\{\{\s*\w+\s*\}\}/g);
69+
70+
// Document-wide search
71+
const allMatches = await pdf.findText("invoice", {
72+
pages: [0, 1, 2],
73+
caseSensitive: false,
74+
});
75+
```
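
To illustrate how a search result could carry a bounding box, the following is a minimal sketch, not the implementation in this commit. It assumes the extracted characters of a line are available as an array in reading order, one UTF-16 code unit per entry, so that an index into the joined string is also an index into the array. The names `PositionedChar`, `Match`, `unionBoxes`, and `findInChars` are hypothetical.

```typescript
// Hypothetical shapes loosely mirroring the types in this plan.
interface BoundingBox { x: number; y: number; width: number; height: number }
interface PositionedChar { char: string; bbox: BoundingBox }
interface Match { text: string; bbox: BoundingBox; charBoxes: BoundingBox[] }

/** Smallest box enclosing all given boxes (expects at least one box). */
function unionBoxes(boxes: BoundingBox[]): BoundingBox {
  const left = Math.min(...boxes.map((b) => b.x));
  const bottom = Math.min(...boxes.map((b) => b.y));
  const right = Math.max(...boxes.map((b) => b.x + b.width));
  const top = Math.max(...boxes.map((b) => b.y + b.height));
  return { x: left, y: bottom, width: right - left, height: top - bottom };
}

/** Find every match of `pattern` in a run of positioned characters. */
function findInChars(chars: PositionedChar[], pattern: string | RegExp): Match[] {
  // Index i of `text` corresponds to chars[i].
  const text = chars.map((c) => c.char).join("");

  // Literal strings are escaped; regexes are copied with the global flag forced on.
  const regex =
    typeof pattern === "string"
      ? new RegExp(pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "g")
      : new RegExp(
          pattern.source,
          pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g",
        );

  const matches: Match[] = [];
  let m: RegExpExecArray | null;
  while ((m = regex.exec(text)) !== null) {
    if (m[0].length === 0) {
      regex.lastIndex++; // Empty matches have no box; step past them
      continue;
    }
    const charBoxes = chars.slice(m.index, m.index + m[0].length).map((c) => c.bbox);
    matches.push({ text: m[0], bbox: unionBoxes(charBoxes), charBoxes });
  }
  return matches;
}
```

A match that spans several lines would need one box per line rather than a single union; this sketch only covers the single-line case.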
76+
77+
### Template Replacement Pattern
78+
79+
```typescript
80+
const placeholders = await pdf.findText(/\{\{\s*(\w+)\s*\}\}/g);
81+
82+
for (const match of placeholders) {
83+
const fieldName = match.text.replace(/[{}]/g, "").trim();
84+
const value = data[fieldName];
85+
86+
const page = pdf.getPage(match.pageIndex);
87+
// Cover original text with white rectangle
88+
page.drawRectangle({ ...match.bbox, color: rgb(1, 1, 1) });
89+
// Draw replacement text
90+
page.drawText(value, { x: match.bbox.x, y: match.bbox.y, fontSize: 12 });
91+
}
92+
```
93+
94+
## Types
95+
96+
```typescript
97+
/** Rectangle in PDF coordinates (origin at bottom-left) */
98+
interface BoundingBox {
99+
x: number; // Left edge
100+
y: number; // Bottom edge
101+
width: number;
102+
height: number;
103+
}
104+
105+
/** Single character with position */
106+
interface ExtractedChar {
107+
char: string;
108+
bbox: BoundingBox;
109+
fontSize: number;
110+
fontName: string;
111+
baseline: number;
112+
}
113+
114+
/** Text span (same font/size on same line) */
115+
interface TextSpan {
116+
text: string;
117+
bbox: BoundingBox;
118+
chars: ExtractedChar[];
119+
fontSize: number;
120+
fontName: string;
121+
}
122+
123+
/** Line of text (multiple spans on same baseline) */
124+
interface TextLine {
125+
text: string;
126+
bbox: BoundingBox;
127+
spans: TextSpan[];
128+
baseline: number;
129+
}
130+
131+
/** Full page extraction result */
132+
interface PageText {
133+
pageIndex: number;
134+
width: number;
135+
height: number;
136+
lines: TextLine[];
137+
text: string; // Plain text (lines joined with \n)
138+
}
139+
140+
/** Search match */
141+
interface TextMatch {
142+
text: string;
143+
bbox: BoundingBox;
144+
pageIndex: number;
145+
charBoxes: BoundingBox[]; // Per-character boxes for highlighting
146+
}
147+
148+
/** Extraction options */
149+
interface ExtractTextOptions {
150+
/** Include individual character positions (default: true for search support) */
151+
includeChars?: boolean;
152+
}
153+
154+
/** Search options */
155+
interface FindTextOptions {
156+
/** Pages to search (default: all) */
157+
pages?: number[];
158+
/** Case-sensitive matching (default: true) */
159+
caseSensitive?: boolean;
160+
/** Match whole words only (default: false) */
161+
wholeWord?: boolean;
162+
}
163+
```
164+
165+
## Architecture
166+
167+
### Components
168+
169+
```
170+
PDFPage.extractText()
171+
172+
173+
TextExtractor
174+
175+
├─► ContentStreamParser (existing)
176+
177+
├─► TextState (tracks Tm, Tc, Tw, etc.)
178+
179+
├─► Font.decode() (existing)
180+
181+
└─► LineGrouper (groups chars into lines/spans)
182+
```
183+
184+
### TextState
185+
186+
Tracks text-related graphics state during content stream processing:
187+
188+
- **Tm** — Text matrix (position/transform)
189+
- **Tlm** — Text line matrix (start of current line)
190+
- **Tf** — Current font and size
191+
- **Tc** — Character spacing
192+
- **Tw** — Word spacing
193+
- **Tz** — Horizontal scaling
194+
- **TL** — Leading (line spacing)
195+
- **Ts** — Text rise (superscript/subscript)
196+
- **CTM** — Current transformation matrix (from graphics state)
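
As a rough illustration of how the line-positioning part of this state could evolve, here is a minimal sketch. It is not the `TextState` class from this commit; `TextStateSketch` and its method names are hypothetical. It follows the PDF specification's row-vector convention, where `Td` sets `Tm = Tlm = translate(tx, ty) × Tlm`, `TD` additionally sets the leading to `-ty`, and `T*` behaves like `0 -TL Td`.

```typescript
// PDF matrix [a, b, c, d, e, f] in the row-vector convention of the PDF specification.
type Matrix = [number, number, number, number, number, number];

/** Returns m1 × m2, i.e. the transform that applies m1 first and then m2. */
function multiply(m1: Matrix, m2: Matrix): Matrix {
  const [a1, b1, c1, d1, e1, f1] = m1;
  const [a2, b2, c2, d2, e2, f2] = m2;
  return [
    a1 * a2 + b1 * c2,
    a1 * b2 + b1 * d2,
    c1 * a2 + d1 * c2,
    c1 * b2 + d1 * d2,
    e1 * a2 + f1 * c2 + e2,
    e1 * b2 + f1 * d2 + f2,
  ];
}

// Hypothetical, simplified state holder; matrices are treated as immutable values.
class TextStateSketch {
  tm: Matrix = [1, 0, 0, 1, 0, 0];
  tlm: Matrix = [1, 0, 0, 1, 0, 0];
  leading = 0; // TL persists across text objects

  /** BT: reset the text matrix and text line matrix to identity. */
  beginText(): void {
    this.tm = [1, 0, 0, 1, 0, 0];
    this.tlm = [1, 0, 0, 1, 0, 0];
  }

  /** Tm: set both the text matrix and the text line matrix. */
  setTextMatrix(m: Matrix): void {
    this.tm = m;
    this.tlm = m;
  }

  /** Td: move to the start of the next line, offset by (tx, ty). */
  moveLine(tx: number, ty: number): void {
    this.tlm = multiply([1, 0, 0, 1, tx, ty], this.tlm);
    this.tm = this.tlm;
  }

  /** TD: same as Td, but also sets the leading to -ty. */
  moveLineSetLeading(tx: number, ty: number): void {
    this.leading = -ty;
    this.moveLine(tx, ty);
  }

  /** T*: move to the next line using the current leading. */
  nextLine(): void {
    this.moveLine(0, -this.leading);
  }
}
```

Only `BT` resets the matrices here; parameters such as the leading are left alone, since text state parameters persist from one text object to the next.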
197+
198+
### Text Operators to Handle
199+
200+
| Operator | Description |
201+
|----------|-------------|
202+
| BT/ET | Begin/end text object |
203+
| Tf | Set font and size |
204+
| Tm | Set text matrix |
205+
| Td | Move to next line (relative) |
206+
| TD | Move and set leading |
207+
| T* | Move to next line (using TL) |
208+
| Tc | Set character spacing |
209+
| Tw | Set word spacing |
210+
| Tz | Set horizontal scaling |
211+
| TL | Set leading |
212+
| Ts | Set text rise |
213+
| Tj | Show string |
214+
| TJ | Show strings with positioning |
215+
| ' | Move to next line and show string |
216+
| " | Set spacing, move, and show string |
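
For the show-text operators, the horizontal displacement after each glyph follows the rule given in the PDF specification for horizontal writing: tx = ((w0 - Tj/1000) × Tfs + Tc + Tw) × Th, where Th is Tz divided by 100 and Tw applies only to the single-byte space character (code 32). The sketch below splits that rule into a per-glyph part and a `TJ` adjustment part. The names `SpacingState`, `glyphAdvance`, and `tjAdjustment` are hypothetical, and glyph widths are assumed to be in thousandths of a text space unit.

```typescript
// Hypothetical subset of the text state needed to advance the text matrix.
interface SpacingState {
  fontSize: number;        // Tfs
  charSpacing: number;     // Tc
  wordSpacing: number;     // Tw
  horizontalScale: number; // Tz, as a percentage (100 = normal)
}

/**
 * Horizontal advance in text space after showing one glyph.
 * `glyphWidth` is in 1/1000 text space units; `isSpace` marks character code 32
 * in a single-byte font, the only case where word spacing applies.
 */
function glyphAdvance(glyphWidth: number, state: SpacingState, isSpace: boolean): number {
  const th = state.horizontalScale / 100;
  const tw = isSpace ? state.wordSpacing : 0;
  return ((glyphWidth / 1000) * state.fontSize + state.charSpacing + tw) * th;
}

/**
 * Horizontal shift for a number inside a TJ array.
 * Positive numbers move the next glyph left, negative numbers move it right.
 */
function tjAdjustment(adjust: number, state: SpacingState): number {
  const th = state.horizontalScale / 100;
  return (-adjust / 1000) * state.fontSize * th;
}
```

Large negative `TJ` adjustments are a common way for PDF generators to produce word gaps without emitting a space character, which is why gap-based space detection is needed later in the pipeline.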
217+
218+
### Coordinate Transformation
219+
220+
Character positions must account for:
221+
222+
1. **Text matrix (Tm)** — Position within text object
223+
2. **CTM** — Page-level transformation (rotation, scaling)
224+
3. **Font metrics** — Glyph widths, ascender/descender
225+
226+
Final position = CTM × Tm × glyph_position
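
The composition above can be sketched as follows, assuming PDF-style six-element matrices and glyph metrics that have already been scaled by the font size. `applyMatrix` and `glyphPageBox` are hypothetical helper names. Because a rotated glyph is no longer axis-aligned, the sketch transforms all four corners and takes their extent.

```typescript
// PDF matrix [a, b, c, d, e, f]; a point maps to (a*x + c*y + e, b*x + d*y + f).
type Matrix = [number, number, number, number, number, number];
interface Point { x: number; y: number }
interface Box { x: number; y: number; width: number; height: number }

/** Apply a PDF matrix to a point. */
function applyMatrix(m: Matrix, p: Point): Point {
  const [a, b, c, d, e, f] = m;
  return { x: a * p.x + c * p.y + e, y: b * p.x + d * p.y + f };
}

/**
 * Axis-aligned page-space box for a glyph whose text-space extent is
 * [0, width] horizontally and [descent, ascent] vertically.
 * The text matrix is applied first, then the CTM.
 */
function glyphPageBox(
  width: number,
  ascent: number,
  descent: number,
  tm: Matrix,
  ctm: Matrix,
): Box {
  const corners: Point[] = [
    { x: 0, y: descent },
    { x: width, y: descent },
    { x: 0, y: ascent },
    { x: width, y: ascent },
  ].map((p) => applyMatrix(ctm, applyMatrix(tm, p)));

  const xs = corners.map((p) => p.x);
  const ys = corners.map((p) => p.y);
  const left = Math.min(...xs);
  const bottom = Math.min(...ys);
  return {
    x: left,
    y: bottom,
    width: Math.max(...xs) - left,
    height: Math.max(...ys) - bottom,
  };
}
```

With an identity CTM the box is just the glyph extent shifted by the text matrix; with a 90° rotation the width and height swap.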
227+
228+
### Line Grouping Algorithm
229+
230+
1. Sort characters by baseline Y coordinate (with tolerance for slight variations)
231+
2. Within each baseline group, sort by X coordinate
232+
3. Detect spans based on font/size changes
233+
4. Join characters into text strings, inferring spaces from gaps
234+
235+
Space detection heuristic:
236+
- If gap between characters > 0.3 × font size, insert space
237+
- Configurable threshold for different PDF generators
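
The steps above can be sketched roughly as follows. This is an illustration rather than the `LineGrouper` from this commit: the `Glyph` shape and `groupIntoLines` are hypothetical, span detection (step 3) is omitted for brevity, and the default baseline tolerance of 2 units is an assumed value that the plan does not specify.

```typescript
// Hypothetical minimal glyph record; the real ExtractedChar carries more fields.
interface Glyph {
  char: string;
  x: number;        // Left edge in page space
  width: number;
  baseline: number; // Baseline y in page space
  fontSize: number;
}

/**
 * Group glyphs into lines of text.
 * `baselineTolerance` (assumed default) absorbs small baseline jitter;
 * `spaceFactor` is the 0.3 × font size gap threshold from the heuristic.
 */
function groupIntoLines(
  glyphs: Glyph[],
  baselineTolerance = 2,
  spaceFactor = 0.3,
): string[] {
  // PDF y grows upward, so descending baseline order is top-to-bottom.
  const sorted = [...glyphs].sort((a, b) => b.baseline - a.baseline);

  const lines: Glyph[][] = [];
  for (const g of sorted) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current[0].baseline - g.baseline) <= baselineTolerance) {
      current.push(g);
    } else {
      lines.push([g]);
    }
  }

  return lines.map((line) => {
    line.sort((a, b) => a.x - b.x); // Left-to-right within the line
    let text = "";
    line.forEach((g, i) => {
      if (i > 0) {
        const prev = line[i - 1];
        const gap = g.x - (prev.x + prev.width);
        const alreadySpaced = prev.char === " " || g.char === " ";
        if (gap > spaceFactor * g.fontSize && !alreadySpaced) text += " ";
      }
      text += g.char;
    });
    return text;
  });
}
```

Comparing each glyph against the first glyph of the current line keeps the tolerance from drifting across a long run of slightly offset baselines.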
238+
239+
## Test Plan
240+
241+
### Unit Tests
242+
243+
- Parse text operators correctly (Tj, TJ, Td, Tm, etc.)
244+
- Calculate character positions with various transforms
245+
- Group characters into lines/spans correctly
246+
- Space detection between words
247+
- Handle font changes mid-line
248+
249+
### Integration Tests
250+
251+
- Extract text from simple single-page PDF
252+
- Extract text with multiple fonts/sizes
253+
- Extract text from rotated pages
254+
- Search for literal strings
255+
- Search with regex patterns
256+
- Document-wide search across pages
257+
- Handle PDFs with missing ToUnicode (use font encoding fallback)
258+
259+
### Fixtures Needed
260+
261+
- `fixtures/text/simple.pdf` — Basic text content
262+
- `fixtures/text/multiline.pdf` — Multiple lines and paragraphs
263+
- `fixtures/text/fonts.pdf` — Multiple fonts and sizes
264+
- `fixtures/text/rotated.pdf` — Rotated page content
265+
- `fixtures/text/positioned.pdf` — Text with TJ positioning adjustments
266+
- `fixtures/text/template.pdf` — Document with `{{ placeholder }}` markers
267+
268+
## Open Questions
269+
270+
1. **Reading order**: Should we attempt to detect multi-column layouts, or just use raw position order?
271+
- *Initial approach*: Position order (left-to-right, top-to-bottom). Complex layout analysis is out of scope.
272+
273+
2. **Whitespace handling**: How to handle multiple spaces, tabs, and form feeds?
274+
- *Initial approach*: Normalize to single spaces. Provide `preserveWhitespace` option if needed.
275+
276+
3. **Ligatures**: How to handle fi, fl ligatures in glyph names?
277+
- *Initial approach*: Map via ToUnicode if available, otherwise return ligature character.
278+
279+
4. **CID fonts without ToUnicode**: Some PDFs have CID fonts without ToUnicode CMaps.
280+
- *Initial approach*: Return replacement character or empty string. Log warning.
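
For questions 2 and 3, one possible post-processing step is sketched below. It is an assumption layered on top of the initial approaches, not part of this commit: `normalizeExtractedText` and `expandLigatures` are hypothetical names, and expanding ligature characters is shown only as an optional step a caller might want so that a search for "file" also matches text stored with the U+FB01 ligature.

```typescript
// Latin ligature code points from the Alphabetic Presentation Forms block.
const LIGATURES: Record<string, string> = {
  "\uFB00": "ff",
  "\uFB01": "fi",
  "\uFB02": "fl",
  "\uFB03": "ffi",
  "\uFB04": "ffl",
};

/** Replace ligature characters with their component letters. */
function expandLigatures(text: string): string {
  return text.replace(/[\uFB00-\uFB04]/g, (ch) => LIGATURES[ch] ?? ch);
}

/**
 * Collapse runs of spaces, tabs, and form feeds into a single space,
 * leaving line breaks alone. With `preserveWhitespace` the text is
 * returned with its original spacing.
 */
function normalizeExtractedText(raw: string, preserveWhitespace = false): string {
  const expanded = expandLigatures(raw);
  return preserveWhitespace ? expanded : expanded.replace(/[ \t\f]+/g, " ");
}
```

Note that expanding a ligature changes the string length, so any character-to-box index mapping would have to be built after expansion or account for the extra letters.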
281+
282+
## Risks
283+
284+
- **Performance**: Large documents may have many text objects. Consider lazy extraction per page.
285+
- **Font encoding edge cases**: Legacy PDFs may use obscure encodings. May need fallback strategies.
286+
- **Accuracy**: Bounding boxes depend on font metrics which may be incomplete or inaccurate in some PDFs.
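
One way to realise the lazy per-page extraction mentioned under Performance is a small cache that extracts a page only when it is first requested. The sketch below is an assumption about how that could look, not code from this commit; `LazyPageTextCache` is a hypothetical name and the extractor is injected so the class stays independent of the PDF API.

```typescript
/** Caches one in-flight or completed extraction per page index. */
class LazyPageTextCache<T> {
  private readonly cache = new Map<number, Promise<T>>();
  private readonly extract: (pageIndex: number) => Promise<T>;

  constructor(extract: (pageIndex: number) => Promise<T>) {
    this.extract = extract;
  }

  /** Extract the page on first request; later requests reuse the same promise. */
  get(pageIndex: number): Promise<T> {
    let pending = this.cache.get(pageIndex);
    if (!pending) {
      pending = this.extract(pageIndex);
      this.cache.set(pageIndex, pending);
    }
    return pending;
  }

  /** Drop a cached page, e.g. after its content stream has been modified. */
  invalidate(pageIndex: number): void {
    this.cache.delete(pageIndex);
  }
}
```

Storing the promise rather than the resolved value means two concurrent searches over the same page share a single extraction pass.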
287+
288+
## Implementation Phases
289+
290+
### Phase 1: Core Extraction
291+
- TextState class
292+
- Process text operators
293+
- Character-level extraction with positions
294+
295+
### Phase 2: Grouping
296+
- Line grouping by baseline
297+
- Span grouping by font
298+
- Space detection
299+
300+
### Phase 3: Search
301+
- String search with bbox
302+
- Regex search
303+
- Document-wide search API
304+
305+
### Phase 4: API Integration
306+
- `PDFPage.extractText()`
307+
- `PDFPage.findText()`
308+
- `PDF.findText()` (document-wide)
