Docling related issues #543

During the evaluation, I found that Docling has issues with certain documents:

- Documents with atypical page dimensions cause the default Docling parser to fail with errors.
- EasyOCR occasionally misrecognizes the digit "1" as the letter "I", especially in certificate identifiers.
- Some single-page certificates come out empty because Docling's parser treats the page as an image element rather than text content. OCR does not fill the gap because, with the default config, it only merges its results with existing text elements.
- In some French documents, Docling produces unusual spacing, sometimes inserting spaces within words.

For each issue, there are specific fixes:
- Atypical dimensions: Switch the backend to pypdfium (this produces worse results overall, so use it only for affected documents).
- Empty outputs and French spacing: Force full-page OCR mode, so that Docling does not merge OCR output with the programmatic parsing results and the text comes purely from OCR.
- OCR misrecognition: Try a different OCR engine (Tesseract, RapidOCR).
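
As a rough illustration of the fixes listed above, here is a minimal sketch of how each one might be configured in Docling. The helper `build_converter` and the fix names in `KNOWN_FIXES` are hypothetical and not part of Docling; the Docling imports and option classes are the ones I believe current releases expose, so they should be checked against the installed version. `TesseractOcrOptions` also assumes the `tesserocr` bindings are available.

```python
# Hypothetical fix names used only in this sketch.
KNOWN_FIXES = ("pypdfium", "full_page_ocr", "tesseract", "rapidocr")


def build_converter(fix: str):
    """Return a Docling DocumentConverter configured for one of the fixes."""
    if fix not in KNOWN_FIXES:
        raise ValueError(f"unknown fix: {fix!r}")

    # Local imports so this module still loads where Docling is not installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        RapidOcrOptions,
        TesseractOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True
    format_kwargs = {"pipeline_options": options}

    if fix == "pypdfium":
        # Atypical dimensions: swap the PDF backend, keep default OCR settings.
        format_kwargs["backend"] = PyPdfiumDocumentBackend
    elif fix == "full_page_ocr":
        # Empty outputs and French spacing: take the text purely from OCR.
        options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)
    elif fix == "tesseract":
        # OCR misrecognition: try Tesseract instead of EasyOCR.
        options.ocr_options = TesseractOcrOptions()
    elif fix == "rapidocr":
        # OCR misrecognition: try RapidOCR instead of EasyOCR.
        options.ocr_options = RapidOcrOptions()

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(**format_kwargs)}
    )
```

Usage would look something like `build_converter("full_page_ocr").convert("certificate.pdf").document.export_to_markdown()`, where `certificate.pdf` is a placeholder path.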

I propose a multi-step approach with fallbacks, similar to the pdftotext with Tesseract strategy:
- Convert the entire dataset using "normal" Docling options
- Detect problematic documents (errors, empty outputs, suspected OCR issues, spacing anomalies)
- Reprocess affected documents with alternative configurations based on the detected issue type
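
The three steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the issue labels, the configuration names in `FALLBACKS`, the two regex heuristics, and the `convert(path, config_name)` callable that the caller would supply. The spacing heuristic only catches letter-by-letter spacing ("d o c u m e n t"); a split such as "docu ment" would need a dictionary check instead.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical heuristic: four or more single letters separated by spaces.
_SPACED_WORD = re.compile(r"\b(?:[^\W\d_] ){3,}[^\W\d_]\b")
# Hypothetical heuristic: an uppercase identifier with "I" next to a digit.
_SUSPECT_ID = re.compile(r"\b[A-Z0-9-]*(?:\dI|I\d)[A-Z0-9-]*\b")

# Issue label -> alternative configurations to try, in order.
FALLBACKS = {
    "error": ["pypdfium"],
    "empty": ["full_page_ocr"],
    "spacing": ["full_page_ocr"],
    "ocr_suspect": ["tesseract", "rapidocr"],
}


def detect_issue(text: Optional[str], error: Optional[Exception] = None) -> Optional[str]:
    """Classify a conversion result; return None when nothing looks wrong."""
    if error is not None:
        return "error"
    if not text or not text.strip():
        return "empty"
    if _SPACED_WORD.search(text):
        return "spacing"
    if _SUSPECT_ID.search(text):
        return "ocr_suspect"
    return None


def convert_with_fallback(
    path: str, convert: Callable[[str, str], str]
) -> Tuple[Optional[str], str]:
    """Convert with the default config, then retry based on the detected issue."""
    try:
        text, error = convert(path, "default"), None
    except Exception as exc:
        text, error = None, exc

    issue = detect_issue(text, error)
    for config in FALLBACKS.get(issue, []):
        try:
            retry = convert(path, config)
        except Exception:
            continue  # this fallback failed too; try the next one
        if detect_issue(retry) is None:
            return retry, config

    # No issue detected, or no fallback produced a clean result.
    return text, "default"
```

Keeping the converter behind a plain callable means the detection logic can be tried out on already-converted text before any document is reprocessed.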

