Open
Description
In docling-parse 4.0.3, text extraction works correctly for certain PDFs, but in 4.0.4 these PDFs return empty or incomplete text. This might be related to changes in the C++ backend for parsing page structures and crop/media boxes, as this is the only difference between those versions.
- Attempts to reproduce the issue by modifying crop boxes have failed.
- A workaround is to re-save or rewrite the affected pages using PyPDF2 or pikepdf (even without changing content or crop boxes), which restores proper text extraction.
Expected behavior:
Text extraction should succeed as in 4.0.3, even without rewriting the PDF.
Workaround:
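from PyPDF2 import PdfReader, PdfWriter
from docling.document_converter import DocumentConverter

# Copy every page into a fresh PDF without altering content or crop boxes
reader = PdfReader("problematic.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
with open("fixed.pdf", "wb") as f:
    writer.write(f)

# Convert the rewritten file and print the start of the extracted text
converter = DocumentConverter()
doc = converter.convert("fixed.pdf").document
print(doc.export_to_text()[:300])

Notes: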
- Cannot provide original confidential documents. Dummy PDFs may not reproduce the issue fully, as the root cause is in subtle differences in the C++ backend’s page parsing.
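Since the original files cannot be shared, one way to narrow the problem down is to report which pages lose text between the two versions without exposing their content. The sketch below is only an illustration: `find_regressed_pages`, the `min_ratio` threshold, and the sample strings are hypothetical and not part of docling or docling-parse. It assumes the per-page text has already been extracted twice (for example with 4.0.3 and 4.0.4, or from the original and the rewritten PDF) into two lists of strings.

```python
def find_regressed_pages(before, after, min_ratio=0.5):
    """Return 0-based indices of pages whose extracted text shrank noticeably.

    before / after: lists of per-page text strings from two extraction runs.
    min_ratio is an arbitrary illustrative threshold, not a docling setting.
    Pages beyond the shorter list are ignored.
    """
    regressed = []
    for index, (old, new) in enumerate(zip(before, after)):
        old_len = len(old.strip())
        new_len = len(new.strip())
        if old_len == 0:
            continue  # a page that was already empty cannot regress
        if new_len < old_len * min_ratio:
            regressed.append(index)
    return regressed


# Made-up sample data standing in for real extraction results
pages_403 = ["Invoice 2024-01 total 100 EUR", "Terms and conditions apply", ""]
pages_404 = ["Invoice 2024-01 total 100 EUR", "", ""]
print(find_regressed_pages(pages_403, pages_404))  # prints [1]
```

Only page indices and text lengths are involved, so the result can be posted in the issue even when the documents themselves are confidential.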