
[Bug] Incorrect Character Width Calculation with Td Commands #180

@flexorRegev


test_courier_td_bug.pdf

Summary

docling-parse incorrectly calculates character widths when parsing PDFs that use Td (move text position) commands with long text strings, resulting in word bounding boxes that are consistently about 16.7% too narrow (5pt per character instead of the correct 6pt for Courier at 10pt, so the correct width is 1.2× the reported one).

Environment

  • docling-parse version: Latest (tested with current main branch)
  • Python version: 3.12
  • Operating System: Linux

Problem Description

When parsing PDFs that use Td commands followed by text strings containing spaces, docling-parse produces bounding boxes that are systematically too narrow. The issue appears to be caused by using a hardcoded character width assumption (5pt) rather than reading actual font metrics from the PDF.

Expected Behavior

For Courier at 10pt, each character should be 6.00 points wide. Courier is a monospace font whose glyphs are all 600 units wide in the standard metrics, where 1000 units equal the font size, so 600/1000 × 10pt = 6pt.
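The arithmetic above can be written out as a small sketch. The function name `glyph_width_pt` is made up for illustration and is not part of docling-parse; the only facts it relies on are the 1000-unit glyph space and the 600-unit Courier advance width.

```python
def glyph_width_pt(width_units: float, font_size_pt: float) -> float:
    """Convert a glyph width given in 1/1000 text-space units to points."""
    return width_units * font_size_pt / 1000.0


COURIER_WIDTH_UNITS = 600  # advance width of every glyph in standard Courier
SUSPECTED_FALLBACK_UNITS = 500  # width that would explain the 5pt result

expected = glyph_width_pt(COURIER_WIDTH_UNITS, 10)      # Courier at 10pt
observed = glyph_width_pt(SUSPECTED_FALLBACK_UNITS, 10)  # what is reported

print(f"expected {expected:.2f}pt, observed {observed:.2f}pt, "
      f"ratio {expected / observed:.2f}")
```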

Actual Behavior

docling-parse calculates character widths as 5.00 points, resulting in:

  • Bounding boxes that are 5/6 (≈83.3%) of their correct width
  • Word positions that drift progressively left as they appear later in the text string
  • A consistent 1.2× ratio between correct width and docling-parse width

Reproduction

Minimal Test PDF

I've attached test_courier_td_bug.pdf which reproduces this issue. The content stream is:

BT
/F001 10 Tf
72 720 Td
( ABC12345                           COMPANY NAME INC.                          DOCUMENT TYPE) Tj
0 -11 Td
(ID=99999 REF1234567890123) Tj
0 -11 Td
(CUSTOMER : GENERIC STORE NAME                     LOCATION FROM : WESTERN) Tj
ET

With a minimal font dictionary:

<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Courier
>>

Key characteristics that trigger the bug:

  • Standard Type1 Courier font with no explicit width table (relies on base font metrics)
  • Uses Td command to set initial text position (e.g., 72 720 Td)
  • Text shown with Tj command containing single long string (90+ characters) with many embedded spaces
  • Each Td command followed by Tj with text

The critical element is the long string with embedded spaces: docling-parse must walk through it character by character to determine where each word begins and ends, using character widths to calculate positions.
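To make the expected positions concrete, here is a minimal sketch of that walk for a monospace font. It is not docling-parse's implementation: `word_boxes` is a hypothetical helper, and it assumes horizontal text, no character or word spacing (Tc = Tw = 0), and an identity text matrix apart from the Td offset.

```python
import re


def word_boxes(text: str, x0: float, char_width_pt: float):
    """Return (word, x_start, x_end) for each whitespace-delimited word.

    With a monospace font, character i starts at x0 + i * char_width_pt,
    so a word's horizontal extent follows from its start and end indices.
    """
    return [
        (m.group(), x0 + m.start() * char_width_pt, x0 + m.end() * char_width_pt)
        for m in re.finditer(r"\S+", text)
    ]


# Second Tj string from the test PDF; the line starts at x = 72 (from 72 720 Td).
line = "ID=99999 REF1234567890123"

for word, x_start, x_end in word_boxes(line, x0=72.0, char_width_pt=6.0):
    print(f"{word}: x = {x_start:.1f} .. {x_end:.1f}")
```

Calling the same helper with `char_width_pt=5.0` gives the narrower, left-shifted boxes described in this report.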

Impact

This bug affects:

  • PDFs using Type1 standard fonts (Courier, Times, Helvetica) without explicit width tables
  • PDFs generated by tools that use Td commands with text strings containing spaces
  • Use cases requiring accurate word-level bounding boxes (text extraction, OCR correction, layout analysis)

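The error compounds with string length, because every preceding character adds 1pt (6pt - 5pt) of offset for Courier at 10pt:

  • Word at character position 10: 10pt error
  • Word at character position 50: 50pt error
  • 100pt or more for words past character position 100 in very long strings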
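The figures in this list follow from a one-line calculation, sketched below. The constants are the Courier 10pt values from this report, and `start_offset_error` is an illustrative name only.

```python
CORRECT_WIDTH_PT = 6.0   # Courier at 10pt
OBSERVED_WIDTH_PT = 5.0  # width docling-parse appears to use


def start_offset_error(char_index: int) -> float:
    """Horizontal error (pt) in the start of a word at the given 0-based index."""
    return char_index * (CORRECT_WIDTH_PT - OBSERVED_WIDTH_PT)


for index in (10, 50, 100):
    print(f"character position {index}: {start_offset_error(index):.0f}pt too far left")
```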

Additional Test Cases

I've tested multiple fonts, all showing the same 5pt character width issue:

Font              Expected Width (10pt)     Actual Width   Ratio
Courier           6.00pt                    5.00pt         1.20×
Courier-Bold      6.00pt                    5.00pt         1.20×
Courier-Oblique   6.00pt                    5.00pt         1.20×
Helvetica         Varies (non-monospace)    5.00pt         Wrong
Times-Roman       Varies (non-monospace)    5.00pt         Wrong

This suggests that a fallback width of 500 units (5pt at 10pt font size) is being applied universally instead of the proper base font metrics.
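One possible lookup order is sketched below, purely as an assumption about how a fix could be structured. It is not docling-parse code: the font is modelled as a plain Python dict, `glyph_width_units` and `FALLBACK_UNITS` are hypothetical names, and only the Courier family is covered because its 600-unit width is uniform. Helvetica and Times-Roman would need per-glyph tables from their standard metrics, which are omitted here.

```python
# Standard monospace base fonts: every glyph has the same advance width.
STANDARD_MONOSPACE_UNITS = {
    "Courier": 600,
    "Courier-Bold": 600,
    "Courier-Oblique": 600,
    "Courier-BoldOblique": 600,
}

FALLBACK_UNITS = 500  # hypothetical last-resort default (matches the observed 5pt)


def glyph_width_units(font: dict, char_code: int) -> float:
    """Resolve a glyph width in 1/1000 units for a simplified font dict."""
    # 1. Prefer an explicit width table when the font dictionary has one.
    widths = font.get("Widths")
    if widths is not None:
        index = char_code - font.get("FirstChar", 0)
        if 0 <= index < len(widths):
            return widths[index]
    # 2. Otherwise use the built-in metrics of a known standard font.
    base_font = font.get("BaseFont", "")
    if base_font in STANDARD_MONOSPACE_UNITS:
        return STANDARD_MONOSPACE_UNITS[base_font]
    # 3. Only then fall back to a generic default.
    return FALLBACK_UNITS


courier = {"Subtype": "Type1", "BaseFont": "Courier"}  # as in the test PDF
print(glyph_width_units(courier, ord("A")) * 10 / 1000)  # width in pt at 10pt
```

The point of the ordering is that the generic default is reached only when neither an explicit width table nor known base font metrics are available.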
