Vietnamese character dictionary is missing characters (113 out of 196) #15189

@Goldiserv

Description

🔎 Search before asking

  • I have searched the PaddleOCR Docs and found no similar bug report.
  • I have searched the PaddleOCR Issues and found no similar bug report.
  • I have searched the PaddleOCR Discussions and found no similar bug report.

PR fix

This issue can be closed once PR #15204 is merged.

🐛 Bug (Problem description)

PaddleOCR recognizes lowercase Vietnamese characters but not accented uppercase ones, because ppocr/utils/dict/vi_dict.txt does not contain accented uppercase characters such as Á, À, Ả, etc.
The updated dictionary in the PR at ccb2ecb contains 196 characters and may improve PaddleOCR's ability to recognize more Vietnamese words.

Example 1) Paddle reads SOẠT as SOAT (the dot under the A is missing), which is wrong because accents in Vietnamese change the meaning of a word.
Example 2) Paddle reads BẮT as BAT.
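
To check whether a misread like the two above is purely a loss of diacritics, the expected and recognized strings can be compared after stripping combining marks. This is only a diagnostic sketch using Python's standard `unicodedata` module; the helper name `strip_diacritics` is hypothetical and not part of PaddleOCR.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (tone and vowel marks) from text.

    Hypothetical helper for diagnosis only, not a PaddleOCR function.
    """
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_diacritic_only_error(expected, recognized):
    """True if the OCR output differs from the expected text only by dropped marks."""
    return expected != recognized and strip_diacritics(expected) == recognized
```

Note that đ/Đ has no canonical decomposition, so it is left unchanged by this helper.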

For lowercase words such as điều, Paddle performs much better.
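
A rough way to list which Vietnamese letters a dictionary file lacks is sketched below. It assumes the dictionary stores one character per line, and it builds the expected set from the 12 Vietnamese vowels combined with the five tone marks, plus đ, in both cases. That set is an assumption made for illustration: it is not the PR's 196-character list, which may also include other symbols. The function names are hypothetical.

```python
import unicodedata

# The 12 Vietnamese vowel letters (normalized so each is a single code point).
BASE_VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
# No tone, then combining acute, grave, hook above, tilde, dot below.
TONE_MARKS = ["", "\u0301", "\u0300", "\u0309", "\u0303", "\u0323"]


def vietnamese_letters():
    """Return the assumed set of Vietnamese letters in both cases."""
    lower = {
        unicodedata.normalize("NFC", vowel + tone)
        for vowel in BASE_VOWELS
        for tone in TONE_MARKS
    }
    lower.add("\u0111")  # đ
    upper = {ch.upper() for ch in lower}
    return lower | upper


def missing_from_dict(dict_lines):
    """List expected letters absent from an iterable of dictionary lines."""
    present = {
        unicodedata.normalize("NFC", line.rstrip("\r\n")) for line in dict_lines
    }
    return sorted(vietnamese_letters() - present)


# Example usage (path taken from the issue text):
# with open("ppocr/utils/dict/vi_dict.txt", encoding="utf-8") as f:
#     print(missing_from_dict(f))
```

Normalizing both sides to NFC matters here, since a dictionary entry stored as a base letter plus a combining mark would otherwise not match its precomposed form.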

🏃‍♂️ Environment (Runtime environment)

N/A

🌰 Minimal Reproducible Example (Minimal demo that reproduces the problem)

N/A
