Regression in Docling-Parse 4.0.4: Certain PDFs produce empty text output

In docling-parse 4.0.3, text extraction works correctly for certain PDFs, but in 4.0.4 the same PDFs return empty or incomplete text. This is probably related to the changes in how the C++ backend parses page structures and crop/media boxes, since that is the only difference between the two versions.

* Attempts to reproduce the issue by modifying crop boxes have failed.
* A workaround is to re-save or rewrite the affected pages using `PyPDF2` or `pikepdf` (even without changing content or crop boxes), which restores proper text extraction.

**Expected behavior:**
Text extraction should succeed as in 4.0.3, even without rewriting the PDF.

**Workaround:**

```python
from PyPDF2 import PdfReader, PdfWriter
from docling.document_converter import DocumentConverter

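# Copy every page unchanged into a new file; the rewrite alone restores extraction.
reader = PdfReader("problematic.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
with open("fixed.pdf", "wb") as f:
    writer.write(f)

# Convert the rewritten file and print the first 300 characters of text.
converter = DocumentConverter()
doc = converter.convert("fixed.pdf").document
print(doc.export_to_text()[:300])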
```
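
The same re-save can be sketched with `pikepdf`, the other library mentioned above. This is a minimal sketch, not a verified fix: the helper name `resave_with_pikepdf` and the file names are hypothetical, and it assumes that simply opening and saving the file is enough, as it was with `PyPDF2`.

```python
from pathlib import Path


def resave_with_pikepdf(src, dst):
    """Open a PDF and save it unchanged to a new path (hypothetical helper)."""
    import pikepdf  # third-party; imported lazily so the helper can be defined without it

    # No content or crop boxes are modified; the file is only rewritten.
    with pikepdf.open(src) as pdf:
        pdf.save(dst)
    return Path(dst)
```

Calling `resave_with_pikepdf("problematic.pdf", "fixed.pdf")` and then converting `fixed.pdf` with docling should mirror the `PyPDF2` workaround.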

**Notes:**

* I cannot share the original documents because they are confidential. Dummy PDFs may not fully reproduce the issue, since the root cause appears to lie in subtle differences in the C++ backend’s page parsing.


