Summary
docling-parse incorrectly calculates character widths when parsing PDFs that use Td (move text position) operators followed by long text strings. The resulting word bounding boxes are consistently about 17% too narrow: 5pt per character instead of the correct 6pt for Courier at 10pt, so the correct width is 1.2× the reported one.
Environment
- docling-parse version: Latest (tested with current main branch)
- Python version: 3.12
- Operating System: Linux
Problem Description
When parsing PDFs that use Td commands followed by text strings containing spaces, docling-parse produces bounding boxes that are systematically too narrow. The issue appears to be caused by using a hardcoded character width assumption (5pt) rather than reading actual font metrics from the PDF.
Expected Behavior
For a Courier 10pt font, each character should be 6.00 points wide (Courier is a monospace font with 600 units per character in standard PDF metrics, which equals 6pt at 10pt font size).
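The arithmetic behind the expected value can be sketched as follows. The helper name `glyph_width_pt` is hypothetical, and the sketch assumes the usual Type1 convention of 1000 glyph-space units per em with no horizontal scaling applied.

```python
def glyph_width_pt(width_units: float, font_size_pt: float,
                   units_per_em: float = 1000.0) -> float:
    """Convert a glyph advance from font units to points at a given font size."""
    return width_units * font_size_pt / units_per_em


# Courier: every glyph advances 600 units.
print(glyph_width_pt(600, 10))  # 6.0
# A 500-unit assumption yields the width observed in this report.
print(glyph_width_pt(500, 10))  # 5.0
```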
Actual Behavior
docling-parse calculates character widths as 5.00 points, resulting in:
- Bounding boxes that are 5/6 (≈83.3%) of their correct width
- Word positions that drift progressively left as they appear later in the text string
- A consistent 1.2× ratio between correct width and docling-parse width
Reproduction
Minimal Test PDF
I've attached test_courier_td_bug.pdf which reproduces this issue. The content stream is:
BT
/F001 10 Tf
72 720 Td
( ABC12345 COMPANY NAME INC. DOCUMENT TYPE) Tj
0 -11 Td
(ID=99999 REF1234567890123) Tj
0 -11 Td
(CUSTOMER : GENERIC STORE NAME LOCATION FROM : WESTERN) Tj
ET
With minimal font descriptor:
<<
/Type /Font
/Subtype /Type1
/BaseFont /Courier
>>
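For anyone who wants to regenerate a similar file without the attachment, here is a minimal sketch that assembles a one-page PDF around the content stream and font dictionary above. It is not the script used to produce the attached file: the object layout, the Letter-size MediaBox, the output filename, and the function name `build_test_pdf` are all assumptions, and the strings are copied exactly as shown in this report.

```python
def build_test_pdf() -> bytes:
    """Assemble a minimal single-page PDF using base Courier with no /Widths."""
    content = (
        b"BT\n/F001 10 Tf\n72 720 Td\n"
        b"( ABC12345 COMPANY NAME INC. DOCUMENT TYPE) Tj\n"
        b"0 -11 Td\n(ID=99999 REF1234567890123) Tj\n"
        b"0 -11 Td\n"
        b"(CUSTOMER : GENERIC STORE NAME LOCATION FROM : WESTERN) Tj\n"
        b"ET\n"
    )
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F001 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Courier >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"endstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    # Hypothetical output name, chosen so the attached file is not overwritten.
    with open("courier_td_sketch.pdf", "wb") as fh:
        fh.write(build_test_pdf())
```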
Key characteristics that trigger the bug:
- Standard Type1 Courier font with no explicit width table (relies on base font metrics)
- Uses Td command to set initial text position (e.g., 72 720 Td)
- Text shown with Tj command containing a single long string (90+ characters) with many embedded spaces
- Each Td command followed by Tj with text
The critical element is the long string with embedded spaces - docling-parse must walk through character-by-character to determine where each word begins and ends, using character widths to calculate positions.
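The character-by-character walk described above can be sketched like this. It is an illustration of the idea, not docling-parse's actual code: `word_boxes` is a hypothetical helper, and it assumes a monospace font, no character or word spacing, and an unscaled text matrix so that positions depend only on the per-character width.

```python
def word_boxes(text: str, x0: float, char_width_pt: float):
    """Return (word, x_start, x_end) for each space-delimited word in text."""
    boxes = []
    start = None  # index of the first character of the current word
    for i, ch in enumerate(text):
        if ch != " " and start is None:
            start = i
        elif ch == " " and start is not None:
            boxes.append((text[start:i],
                          x0 + start * char_width_pt,
                          x0 + i * char_width_pt))
            start = None
    if start is not None:  # flush a word that runs to the end of the string
        boxes.append((text[start:],
                      x0 + start * char_width_pt,
                      x0 + len(text) * char_width_pt))
    return boxes


line = " ABC12345 COMPANY NAME INC. DOCUMENT TYPE"
print(word_boxes(line, 72, 6.0)[0])  # ('ABC12345', 78.0, 126.0)
print(word_boxes(line, 72, 5.0)[0])  # ('ABC12345', 77.0, 117.0)
```

Because every word's start and end are multiples of the per-character width, a wrong width shifts each word by an amount proportional to its index in the string.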
Impact
This bug affects:
- PDFs using Type1 standard fonts (Courier, Times, Helvetica) without explicit width tables
- PDFs generated by tools that use Td commands with text strings containing spaces
- Use cases requiring accurate word-level bounding boxes (text extraction, OCR correction, layout analysis)
The error compounds with string length:
- Word at character position 10: 10pt error
- Word at character position 50: 50pt error
- Can be 100+ points off for words late in long strings
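The figures above follow from a 1pt shortfall per character for Courier at 10pt. A small sketch of that relationship, with hypothetical names and the two widths taken from this report:

```python
CORRECT_WIDTH_PT = 6.0   # Courier at 10pt
OBSERVED_WIDTH_PT = 5.0  # width docling-parse appears to use


def drift_pt(char_index: int) -> float:
    """Horizontal error for a word starting at char_index within the string."""
    return char_index * (CORRECT_WIDTH_PT - OBSERVED_WIDTH_PT)


for idx in (10, 50, 100):
    print(f"character {idx}: {drift_pt(idx)}pt too far left")
```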
Additional Test Cases
I've tested multiple fonts, all showing the same 5pt character width issue:
| Font | Expected Width (10pt) | Actual Width | Ratio |
|---|---|---|---|
| Courier | 6.00pt | 5.00pt | 1.20× |
| Courier-Bold | 6.00pt | 5.00pt | 1.20× |
| Courier-Oblique | 6.00pt | 5.00pt | 1.20× |
| Helvetica | Varies (non-monospace) | 5.00pt | Wrong |
| Times-Roman | Varies (non-monospace) | 5.00pt | Wrong |
This suggests the 500-unit fallback is being applied universally instead of using proper base font metrics.
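One possible lookup order for a fix is sketched below. This is an assumption about what a corrected resolver could look like, not a description of docling-parse internals: the function name, the table names, and the order (explicit widths, then base font metrics, then a 500-unit default) are hypothetical. The Helvetica table is a two-entry excerpt of the standard AFM metrics and would need to be completed in practice.

```python
from typing import Optional, Sequence

# All glyphs in the Courier family advance 600 units.
COURIER_FAMILY = {"Courier", "Courier-Bold", "Courier-Oblique",
                  "Courier-BoldOblique"}
# Excerpt only; a real table would cover every glyph.
HELVETICA_EXCERPT = {" ": 278, "A": 667}
DEFAULT_WIDTH = 500  # last-resort fallback


def resolve_width(base_font: str, char: str,
                  widths: Optional[Sequence[int]] = None,
                  first_char: int = 0) -> int:
    """Return the glyph advance in 1/1000 em units for one character."""
    # 1. Prefer an explicit /Widths array when the font dictionary has one.
    if widths is not None:
        idx = ord(char) - first_char
        if 0 <= idx < len(widths):
            return widths[idx]
    # 2. Otherwise use the built-in metrics of the standard base font.
    if base_font in COURIER_FAMILY:
        return 600
    if base_font == "Helvetica" and char in HELVETICA_EXCERPT:
        return HELVETICA_EXCERPT[char]
    # 3. Only fall back to the default when nothing else is known.
    return DEFAULT_WIDTH


print(resolve_width("Courier", "A"))    # 600
print(resolve_width("Helvetica", " "))  # 278
```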