Open
Labels
bug: Something isn't working
Description
During the evaluation, I discovered that Docling has issues with certain documents:
- Documents with atypical dimensions cause the default Docling parser to give errors.
- EasyOCR occasionally misrecognizes the digit "1" as the letter "I", especially in certificate identifiers.
- Some single-page certificates come out empty because Docling's parser treats the page as an image element rather than text content. The OCR step then does not fill in the text, because with the default config the OCR only merges its output into existing text elements.
- In some French documents, Docling produces unusual spacing patterns, sometimes inserting spaces within words.
For each issue, there are specific fixes:
- Atypical dimensions: Switch backend to pypdfium (but this produces worse results overall, so only use it for affected documents)
- Empty outputs and French spacing: Force full-page OCR mode. In this mode Docling does not merge OCR output with the programmatic parsing results, so the text comes purely from OCR.
- OCR misrecognition: Try different OCR engines (Tesseract, RapidOCR)
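As a rough sketch, the three fixes could map to Docling converter configurations like the ones below. This assumes a recent Docling release in which `PdfPipelineOptions`, `EasyOcrOptions`, `TesseractCliOcrOptions`, `PdfFormatOption` and `PyPdfiumDocumentBackend` are importable from the modules shown. The builder function names and the `ISSUE_TO_BUILDER` mapping are hypothetical, and the Docling imports are deferred into the functions so the mapping can be inspected without Docling installed.

```python
def _pdf_converter(pipeline_options, backend=None):
    """Build a DocumentConverter with custom PDF options (hypothetical helper)."""
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    kwargs = {"pipeline_options": pipeline_options}
    if backend is not None:
        kwargs["backend"] = backend
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(**kwargs)}
    )


def build_pypdfium():
    """Atypical dimensions: swap the PDF backend for pypdfium."""
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.pipeline_options import PdfPipelineOptions

    return _pdf_converter(PdfPipelineOptions(), backend=PyPdfiumDocumentBackend)


def build_full_page_ocr():
    """Empty outputs and French spacing: take the text purely from OCR."""
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions

    options = PdfPipelineOptions()
    options.do_ocr = True
    options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)
    return _pdf_converter(options)


def build_alternative_ocr():
    """OCR misrecognition: try another engine (Tesseract CLI chosen as an example)."""
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )

    options = PdfPipelineOptions()
    options.do_ocr = True
    options.ocr_options = TesseractCliOcrOptions()
    return _pdf_converter(options)


# Hypothetical issue labels mapped to the configuration that addresses them.
ISSUE_TO_BUILDER = {
    "error": build_pypdfium,
    "empty": build_full_page_ocr,
    "spacing": build_full_page_ocr,
    "ocr_confusion": build_alternative_ocr,
}
```

Since the pypdfium backend gives worse results overall, it is only built for documents that actually fail with the default parser.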
The proposed approach is therefore a multi-step process with fallbacks, similar to the pdftotext with Tesseract strategy:
- Convert the entire dataset using "normal" Docling options
- Detect problematic documents (errors, empty outputs, suspected OCR issues, spacing anomalies)
- Reprocess affected documents with alternative configurations based on the detected issue type
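The three steps above could be sketched as follows. This is a minimal illustration, not a tested implementation: converters are modelled as plain callables from a file path to extracted text, the issue labels are hypothetical, and both detection heuristics (a run of four or more single-letter tokens for spacing anomalies, a digit directly next to a capital "I" for suspected OCR confusion) are crude assumptions that would need tuning on the real dataset.

```python
import re
from typing import Callable, Dict, Iterable, Optional, Tuple

Converter = Callable[[str], str]  # file path -> extracted text

# Assumed heuristics; the thresholds are illustrative, not validated.
_SPACED_LETTERS = re.compile(r"(?:\b[^\W\d_]\s+){4,}")  # e.g. "c e r t i f ..."
_DIGIT_NEXT_TO_I = re.compile(r"\dI|I\d")               # e.g. "20I9"


def detect_issue(text: Optional[str]) -> Optional[str]:
    """Return a hypothetical issue label, or None if the text looks fine."""
    if text is None:
        return "error"
    if not text.strip():
        return "empty"
    if _SPACED_LETTERS.search(text):
        return "spacing"
    if _DIGIT_NEXT_TO_I.search(text):
        return "ocr_confusion"
    return None


def _safe_convert(convert: Converter, path: str) -> Optional[str]:
    """Run a converter and map any exception to None."""
    try:
        return convert(path)
    except Exception:
        return None


def convert_with_fallback(
    paths: Iterable[str],
    default: Converter,
    fallbacks: Dict[str, Converter],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Convert with the default config, then retry flagged documents.

    Returns a mapping from path to (final text, issue detected on first pass).
    """
    results = {}
    for path in paths:
        text = _safe_convert(default, path)
        issue = detect_issue(text)
        if issue in fallbacks:
            retry = _safe_convert(fallbacks[issue], path)
            if retry is not None:
                text = retry  # keep the fallback output only if it succeeded
        results[path] = (text, issue)
    return results
```

Keeping the detected issue alongside the final text makes it possible to review which documents were reprocessed and why.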