Docling related issues #543

@jborsky

During the evaluation, I've discovered that Docling has some issues with certain documents.

  • Documents with atypical page dimensions cause the default Docling parser to raise errors.
  • EasyOCR occasionally misrecognizes the digit "1" as the letter "I", especially in certificate identifiers.
  • Some single-page certificates come out empty because Docling's parser treats them as image elements rather than text content. OCR then doesn't fill in the text because, with the default config, it only merges text elements.
  • In some French documents, Docling produces unusual spacing patterns, sometimes inserting spaces within words.

For each issue, there are specific fixes:

  • Atypical dimensions: Switch the backend to pypdfium, but only for the affected documents, since it produces worse results overall.
  • Empty outputs and French spacing: Force full-page OCR mode, so Docling doesn't merge OCR with the programmatic parsing results and the text comes purely from OCR.
  • OCR misrecognition: Try different OCR engines (Tesseract, RapidOCR)
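
To make the three fixes concrete, here is a rough sketch of how the alternative configurations might be built. The issue labels and the `build_fallback_converter` helper are hypothetical names for illustration. The Docling imports and option names follow recent Docling releases as far as I know and may differ in other versions. Tesseract via `TesseractOcrOptions` also assumes `tesserocr` is installed.

```python
# Hypothetical labels for the three fallback cases described above.
FALLBACK_ISSUES = ("atypical_dimensions", "empty_or_spacing", "ocr_misread")


def build_fallback_converter(issue):
    """Return a Docling DocumentConverter configured for one fallback case."""
    if issue not in FALLBACK_ISSUES:
        raise ValueError(f"unknown issue type: {issue!r}")

    # Imported lazily so this module still loads where Docling is missing.
    # Module paths are assumed from recent Docling releases.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True
    format_kwargs = {"pipeline_options": options}

    if issue == "atypical_dimensions":
        # Swap the PDF backend only; everything else stays at defaults.
        format_kwargs["backend"] = PyPdfiumDocumentBackend
    elif issue == "empty_or_spacing":
        # Text comes purely from OCR instead of being merged with parsed text.
        options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)
    else:  # "ocr_misread"
        # Try a different OCR engine than the default one.
        options.ocr_options = TesseractOcrOptions()

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(**format_kwargs)}
    )
```

Usage would be something like `build_fallback_converter("empty_or_spacing").convert("certificate.pdf")`, where the file name is a placeholder.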

So the proposed approach could be a multi-step fallback pipeline, similar to the pdftotext with Tesseract strategy:

  • Convert the entire dataset using "normal" Docling options
  • Detect problematic documents (errors, empty outputs, suspected OCR issues, spacing anomalies)
  • Reprocess affected documents with alternative configurations based on the detected issue type
