[Bug] Incorrect Character Width Calculation with Td Commands

[test_courier_td_bug.pdf](https://github.com/user-attachments/files/23555003/test_courier_td_bug.pdf)

## Summary

docling-parse miscalculates character widths when parsing PDFs that position text with `Td` (move text position) operators and show long text strings. The resulting word bounding boxes are **consistently too narrow**: 5pt per character instead of the correct 6pt for Courier at 10pt, so each box is 5/6 (≈83.3%) of its correct width and the correct width is 1.2× the reported one.

## Environment

- **docling-parse version**: Latest (tested with current main branch)
- **Python version**: 3.12
- **Operating System**: Linux

## Problem Description

When parsing PDFs that use `Td` operators followed by text strings containing spaces, docling-parse produces bounding boxes that are systematically too narrow. The issue appears to be caused by a hardcoded default character width (500 units per 1000, i.e. 5pt at a 10pt font size) being used instead of the actual font metrics.

### Expected Behavior

For a Courier 10pt font, each character should be **6.00 points wide** (Courier is a monospace font with 600 units per character in standard PDF metrics, which equals 6pt at 10pt font size).
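As a quick check of the arithmetic above, here is a minimal sketch of the unit conversion. It assumes the usual Type1 convention that glyph widths are expressed in thousandths of the font size; `glyph_advance_pt` is a hypothetical helper name, not part of docling-parse.

```python
def glyph_advance_pt(width_units: float, font_size_pt: float) -> float:
    """Convert a glyph width given in 1/1000 units into points."""
    return width_units * font_size_pt / 1000


COURIER_WIDTH_UNITS = 600  # every glyph in the standard Courier metrics

print(glyph_advance_pt(COURIER_WIDTH_UNITS, 10))  # 6.0 (expected)
print(glyph_advance_pt(500, 10))                  # 5.0 (what docling-parse reports)
```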

### Actual Behavior

docling-parse calculates character widths as **5.00 points**, resulting in:
- Bounding boxes that are 5/6 (≈83.3%) of their correct width
- Word positions that drift progressively left as they appear later in the text string
- A consistent 1.2× ratio between correct width and docling-parse width

## Reproduction

### Minimal Test PDF

I've attached `test_courier_td_bug.pdf` which reproduces this issue. The content stream is:

```pdf
BT
/F001 10 Tf
72 720 Td
( ABC12345                           COMPANY NAME INC.                          DOCUMENT TYPE) Tj
0 -11 Td
(ID=99999 REF1234567890123) Tj
0 -11 Td
(CUSTOMER : GENERIC STORE NAME                     LOCATION FROM : WESTERN) Tj
ET
```

With a minimal font dictionary:
```pdf
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Courier
>>
```

**Key characteristics that trigger the bug:**
- Standard Type1 Courier font with **no explicit width table** (relies on base font metrics)
- Uses a `Td` operator to set the initial text position (e.g., `72 720 Td`)
- Text is shown with a `Tj` operator containing a **single long string (90+ characters)** with many embedded spaces
- Each `Td` operator is followed by a `Tj` that shows text

The critical element is the **long string with embedded spaces**: to determine where each word begins and ends, docling-parse must walk through the string character by character, using character widths to calculate positions.
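To make the character-by-character walk concrete, here is a minimal sketch of how word extents follow from a per-character width. It is not docling-parse's implementation: `word_spans` is a hypothetical helper, and it assumes a monospace font, no character or word spacing (`Tc`/`Tw`), no horizontal scaling, and an unrotated text matrix.

```python
from typing import List, Tuple


def word_spans(text: str, origin_x: float, char_width_pt: float) -> List[Tuple[str, float, float]]:
    """Return (word, x_start, x_end) for each space-delimited word in a Tj string."""
    spans = []
    start = None  # index of the first character of the current word
    for i, ch in enumerate(text + " "):  # trailing sentinel space closes the last word
        if ch != " ":
            if start is None:
                start = i
        elif start is not None:
            spans.append((text[start:i],
                          origin_x + start * char_width_pt,
                          origin_x + i * char_width_pt))
            start = None
    return spans


line = "ID=99999 REF1234567890123"  # second Tj string from the test PDF
print(word_spans(line, 72.0, 6.0))
# [('ID=99999', 72.0, 120.0), ('REF1234567890123', 126.0, 222.0)]
print(word_spans(line, 72.0, 5.0))
# [('ID=99999', 72.0, 112.0), ('REF1234567890123', 117.0, 197.0)]
```

With the 5pt width, the second word starts 9pt too far left and ends 25pt short, which matches the pattern I see in the output.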


## Impact

This bug affects:
- PDFs using Type1 standard fonts (Courier, Times, Helvetica) without explicit width tables
- PDFs generated by tools that use `Td` commands with text strings containing spaces
- Use cases requiring accurate word-level bounding boxes (text extraction, OCR correction, layout analysis)

The position error grows linearly along the string, by 1pt (6pt - 5pt) for every preceding character:
- Word at character position 10: 10pt error
- Word at character position 50: 50pt error
- Words beyond character position 100 in longer strings: 100+ points off
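The figures in this list follow from a simple linear relation, sketched below under the same assumptions as the rest of this report (6pt correct width, 5pt reported width, no extra spacing). `start_offset_error_pt` is a hypothetical helper name.

```python
def start_offset_error_pt(char_index: int,
                          correct_width_pt: float = 6.0,
                          assumed_width_pt: float = 5.0) -> float:
    """Horizontal error of a word's start after `char_index` preceding characters."""
    return char_index * (correct_width_pt - assumed_width_pt)


for index in (10, 50, 90):
    print(index, start_offset_error_pt(index))
# 10 10.0
# 50 50.0
# 90 90.0
```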

## Additional Test Cases

I've tested multiple fonts, all showing the same 5pt character width issue:

| Font | Expected Width (10pt) | Actual Width | Ratio |
|------|----------------------|--------------|-------|
| Courier | 6.00pt | 5.00pt | 1.20× |
| Courier-Bold | 6.00pt | 5.00pt | 1.20× |
| Courier-Oblique | 6.00pt | 5.00pt | 1.20× |
| Helvetica | Varies (non-monospace) | 5.00pt | Wrong |
| Times-Roman | Varies (non-monospace) | 5.00pt | Wrong |

This suggests that a 500-unit fallback width (5pt at a 10pt font size) is being applied universally instead of the base font metrics defined for the standard fonts.
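One possible shape for the lookup order I would expect is sketched below: explicit widths from the PDF first, then built-in metrics for the standard fonts, and a flat default only as a last resort. This is an illustration, not docling-parse code; `width_units` and both tables are hypothetical, and the Helvetica values are a small excerpt of the standard metrics that should be verified against the AFM files before use.

```python
from typing import Dict, Optional

# The Courier family is monospace: every glyph is 600 units wide.
MONOSPACE_WIDTHS: Dict[str, int] = {
    "Courier": 600,
    "Courier-Bold": 600,
    "Courier-Oblique": 600,
    "Courier-BoldOblique": 600,
}

# Tiny excerpt of Helvetica widths, only to show that they vary per glyph.
HELVETICA_EXCERPT: Dict[str, int] = {" ": 278, "i": 222, "A": 667, "W": 944}

FALLBACK_WIDTH = 500  # last-resort guess when nothing else is known


def width_units(base_font: str, char: str,
                explicit_widths: Optional[Dict[str, int]] = None) -> int:
    """Resolve a glyph width in 1/1000 units, preferring the most specific source."""
    if explicit_widths and char in explicit_widths:
        return explicit_widths[char]      # /Widths from the PDF wins
    if base_font in MONOSPACE_WIDTHS:
        return MONOSPACE_WIDTHS[base_font]
    if base_font == "Helvetica" and char in HELVETICA_EXCERPT:
        return HELVETICA_EXCERPT[char]
    return FALLBACK_WIDTH


print(width_units("Courier", "A"))    # 600
print(width_units("Helvetica", "i"))  # 222
print(width_units("Helvetica", "W"))  # 944
```

Under this ordering a flat 5pt width at 10pt would only appear for fonts with no metrics at all, not for the standard Type1 fonts in the table above.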

