Improving OCR on hungarian text #13266

Hybeee · 2024-07-04T14:28:54Z

Hey everybody!

I've been trying to use PaddleOCR for Hungarian text detection and recognition. However, the model struggles with the language's accented characters, for example á, é, ő, ö, ú. Sometimes it recognizes them correctly, but other times it outputs ä instead, or a plain a without the accent.

I've looked at several parts of the implementation. Hungarian is supported, but the dictionary used during inference with the pretrained model is a Latin dictionary, which also contains the "other" characters mentioned above, such as ä (a with two dots on top). I am using the quickstart code example from PaddleOCR.
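A quick way to verify which accented Hungarian letters a dictionary actually covers is sketched below. It assumes the dictionary is a UTF-8 text file with one character per line; the function name `missing_hungarian_chars` and the file path in the usage comment are placeholders for illustration, not part of PaddleOCR.

```python
import unicodedata

# The 18 accented letters of the Hungarian alphabet (lower and upper case).
HUNGARIAN_ACCENTED = "áéíóöőúüűÁÉÍÓÖŐÚÜŰ"


def missing_hungarian_chars(dict_lines):
    """Return the accented Hungarian letters absent from the dictionary.

    `dict_lines` is any iterable of lines, assumed to hold one character
    per line (an open file object works).
    """
    # Normalize to NFC so that decomposed forms (base letter + combining
    # accent) compare equal to the precomposed letters listed above.
    present = {
        unicodedata.normalize("NFC", line.rstrip("\r\n")) for line in dict_lines
    }
    return [ch for ch in HUNGARIAN_ACCENTED if ch not in present]


# Usage with a dictionary file (the path is a placeholder):
# with open("path/to/dictionary.txt", encoding="utf-8") as f:
#     print(missing_hungarian_chars(f))
```

An empty result means every accented letter is present, in which case the misreadings are more likely a recognition problem than a dictionary gap.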

I'd like some advice on this. I've tried preprocessing the image, such as removing the background (which solved the problem for one specific example but decreased performance on other inputs), along with a few other preprocessing ideas. I've also thought about fine-tuning a model on a Hungarian dataset, but that would be a last resort for me, since it would be rather time-consuming.

Any advice is welcome. Thank you, and have a nice day!

GreatV · 2024-07-04T14:54:15Z

Maintainer

Here are a few suggestions that might help improve the performance:

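Custom Dictionary: Check that the dictionary file used by the model includes all Hungarian characters, and add any that are missing so the model can output them. Keep in mind that the dictionary has to match the character set the recognition model was trained with, so changing it generally means fine-tuning the model as well.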
Data Augmentation: Generate synthetic training data containing Hungarian text with special characters. Fine-tuning on this data can help the model learn to recognize these characters better without training from scratch.
Preprocessing: Since preprocessing worked for some cases, you could create a preprocessing pipeline that adapts based on the input image characteristics. For example, apply different preprocessing techniques and choose the one that yields the best OCR results.
Fine-Tuning: While time-consuming, fine-tuning a pre-trained model on a Hungarian dataset could significantly improve accuracy. You can reduce the dataset size to minimize time, focusing on samples with special characters.
Post-Processing: Implement a post-processing step to correct common misrecognized characters. For example, if the model frequently mistakes 'á' for 'ä', you can add a rule to correct this based on the context.
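
The post-processing suggestion could be sketched roughly as follows. This is a minimal illustration, not a PaddleOCR feature: it assumes you have a set of lowercase Hungarian words to check against (`vocabulary`), and the `CONFUSABLE` table is a hypothetical starting point that should be built from the errors you actually observe. A word is only changed when exactly one vocabulary word can be reached through the listed substitutions, so ambiguous cases are left alone.

```python
import re
from itertools import product

# Hypothetical confusion table: each recognized character maps to the
# characters it may really stand for (including itself where relevant).
CONFUSABLE = {
    "ä": "äá",
    "a": "aá",
    "e": "eé",
    "i": "ií",
    "o": "oóöő",
    "ö": "öő",
    "õ": "ő",
    "u": "uúüű",
    "ü": "üű",
    "û": "ű",
}


def correct_word(word, vocabulary, max_candidates=10_000):
    """Return the single vocabulary word reachable via CONFUSABLE, else `word`."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word  # already a known word, leave it untouched
    options = [CONFUSABLE.get(ch, ch) for ch in lowered]
    # Guard against a combinatorial blow-up on very long tokens.
    total = 1
    for opt in options:
        total *= len(opt)
    if total > max_candidates:
        return word
    matches = {
        candidate
        for candidate in ("".join(chars) for chars in product(*options))
        if candidate in vocabulary
    }
    if len(matches) != 1:
        return word  # no match, or ambiguous: do not guess
    fixed = matches.pop()
    # Restore the original capitalization pattern (all caps or initial cap).
    if word.isupper():
        return fixed.upper()
    if word[:1].isupper():
        return fixed[:1].upper() + fixed[1:]
    return fixed


def correct_text(text, vocabulary):
    """Apply correct_word to every word in `text`, keeping punctuation and spacing."""
    return re.sub(r"\w+", lambda m: correct_word(m.group(0), vocabulary), text)
```

For example, with a vocabulary containing "háború", the token "häboru" would be rewritten, while a token such as "kor" stays unchanged if both "kör" and "kőr" are in the vocabulary. The quality of this approach depends heavily on how complete the word list is.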
