Most words in Romanian are missing characters #16594

@vladandrew

🔎 Search before asking

  • I have searched the PaddleOCR Docs and found no similar bug report.
  • I have searched the PaddleOCR Issues and found no similar bug report.
  • I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (问题描述)

Romanian is not fully supported: diacritics (ă, â, î, ș, ț) are not handled properly. Most words in the output are missing characters, which makes the resulting text unusable.

🏃‍♂️ Environment (运行环境)

OS: Ubuntu 24.04.1 LTS
Environment: conda
Python: Python 3.11.5
Install: pip
python -m pip install paddlepaddle==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
pip install paddleocr
RAM: 128GB
CPU: AMD Ryzen Threadripper PRO 5965WX 24-Cores
CUDA: None

🌰 Minimal Reproducible Example (最小可复现问题的Demo)

For the image below, run PaddleOCR using:

paddleocr ocr -i monitorul_oficial_sample.png --lang ro --use_doc_orientation_classify False --use_doc_unwarping False --use_textline_orientation False --save_path ./output
Image

Many words in the resulting text are extracted incorrectly because characters are missing (for example, "judector" instead of "judecător"). A snippet of the resulting JSON:

        "— judector",
        "neconstituionalitate a dispoziiilor art. 327 lit. b) i ale",
        "Gheorghe Stan",
        "— judector",
        "art. 328 alin. (3) din Codul de procedur penal, excepie",
        "Livia Doina Stanciu",
        "— judector",
        "ridicat de Gheorghe Cureleac într-o cauz penalà în care",
        "Elena-Simina Tnsescu",
        "— judector",
        "autorul excepiei a fost trimis în judecatà pentru svârirea unor",
        "Varga Attila",
        "— judector",
        "infractiuni.",
        "5. În motivarea excepiei de neconstituionalitate, autorul",

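To make the failure easier to quantify, here is a minimal sketch that tallies which Romanian diacritics actually appear in the recognized lines and which other non-ASCII letters show up in their place. It uses only the Python standard library; the `audit` helper and the `rec_texts` list are illustrative names (not part of PaddleOCR), and the sample lines are copied from the JSON snippet above.

```python
import unicodedata

# Romanian letters with diacritics: comma-below forms plus legacy cedilla forms.
ROMANIAN_DIACRITICS = set("ăâîșțĂÂÎȘȚ") | set("şţŞŢ")

# Sample lines copied from the JSON snippet in this report (hypothetical variable name).
rec_texts = [
    "neconstituionalitate a dispoziiilor art. 327 lit. b) i ale",
    "art. 328 alin. (3) din Codul de procedur penal, excepie",
    "ridicat de Gheorghe Cureleac într-o cauz penalà în care",
    "autorul excepiei a fost trimis în judecatà pentru svârirea unor",
    "5. În motivarea excepiei de neconstituionalitate, autorul",
]


def audit(lines):
    """Count Romanian diacritics seen, and collect any other non-ASCII letters."""
    seen = {}
    unexpected = {}
    for line in lines:
        # Normalize so that decomposed characters compare equal to precomposed ones.
        for ch in unicodedata.normalize("NFC", line):
            if ch in ROMANIAN_DIACRITICS:
                seen[ch] = seen.get(ch, 0) + 1
            elif ord(ch) > 127 and ch.isalpha():
                unexpected[ch] = unexpected.get(ch, 0) + 1
    return seen, unexpected


seen, unexpected = audit(rec_texts)
print("Romanian diacritics found:", seen)
print("Other non-ASCII letters:", unexpected)
```

On these sample lines, only "î", "Î", and "â" survive, while "ă", "ș", and "ț" never appear, and "à" shows up where "ă" would be expected ("penalà", "judecatà"). This may help narrow down whether the recognition model's character dictionary lacks those specific letters, though that is an assumption and not something confirmed in this report.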