Using multiple dictionaries with paddleocr #12747

saanvib13 · 2024-05-30T13:08:41Z

I am using PaddleOCR to detect tables in images and extract them into a CSV file. I currently use the en_dict.txt character dictionary for text detection and extraction because the text in my images is English. However, this dictionary lacks some essential symbols, such as €, that are present in the dictionaries for other languages, such as te_dict.txt. Is it possible to use multiple dictionaries together with the PaddleOCR library?

GreatV (Maintainer) · 2024-05-30T15:29:14Z

PaddleOCR does not natively support using multiple dictionaries simultaneously. However, you can modify the dictionary file to include the characters you need from other dictionaries.
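As a rough illustration of modifying the dictionary file, here is a minimal sketch that appends selected characters from a second dictionary to a copy of the first. It assumes both files hold one character per line, as the files under ppocr/utils/dict do. The function name `merge_dicts` and the output file are hypothetical and not part of PaddleOCR. Note that, as confirmed later in this thread, a recognition model has to be retrained before it can use the merged file.

```python
from pathlib import Path


def merge_dicts(base_path, extra_path, out_path, wanted=None):
    """Append characters from extra_path that are missing from base_path.

    Assumes one character per line in both files. If `wanted` is given,
    only characters in that set are copied from the extra dictionary.
    """
    def read_chars(path):
        text = Path(path).read_text(encoding="utf-8")
        # Split on "\n" only, so unusual characters are never treated as line breaks.
        lines = [line.rstrip("\r") for line in text.split("\n")]
        return [line for line in lines if line != ""]

    base = read_chars(base_path)
    seen = set(base)
    merged = list(base)  # existing characters keep their original positions
    for ch in read_chars(extra_path):
        if ch in seen:
            continue
        if wanted is not None and ch not in wanted:
            continue
        merged.append(ch)  # new characters go at the end
        seen.add(ch)

    Path(out_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged
```

For example, `merge_dicts("en_dict.txt", "te_dict.txt", "en_plus_dict.txt", wanted={"€"})` would write a copy of the English dictionary with € appended. New characters are appended rather than inserted so that the positions of the existing characters do not shift.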

saanvib13 (Author) · 2024-05-31T04:44:58Z

@GreatV Thank you for your response! If I just manually add the characters that I want from the other dictionary, can I use the model directly or is it necessary to retrain the model?

GreatV (Maintainer) · 2024-05-31T04:55:52Z

You need to retrain the model.

saanvib13 (Author) · 2024-05-31T12:05:03Z

Understood, thanks!

saanvib13 (Author) · 2024-05-31T12:10:09Z

@GreatV There is one more thing. I tried changing the dictionary from en_dict.txt to te_dict.txt:

https://github.com/PaddlePaddle/PaddleOCR/blob/main/ppocr/utils/dict/te_dict.txt

I used the following code to do so:

from paddleocr import PaddleOCR
ocr = PaddleOCR(use_gpu=True, character_dict_path='../../../PaddleOCR/ppocr/utils/dict/te_dict.txt')

This dictionary contains all the letters and symbols that I want detected and extracted. But when I ran the code, the euro symbol still wasn't detected by the model. What could be the reason? I have already preprocessed the image, and the rest of the content, such as text and numbers, is extracted accurately, but symbols such as '€' and '-' are omitted in spite of being present in the dictionary being used.

Please help me resolve this issue.

GreatV (Maintainer) · 2024-05-31T13:03:44Z

@saanvib13 Did you train a new model and then use it for OCR?

saanvib13 (Author) · 2024-06-02T04:40:39Z

@GreatV No, I am using the same pre-trained model with the dictionary provided by PaddleOCR (te_dict.txt). Even so, it does not correctly detect the characters in that dictionary.

dsiu1 · 2024-06-02T08:26:01Z

@saanvib13 The pre-trained model doesn't understand the new dictionary. When the model makes a prediction, it outputs numerical class indices, and the dictionary is used only to look up the character at each index. The model learned those indices against the dictionary it was trained with, so with a different dictionary the same indices map to different characters. That's why it doesn't work.

As @GreatV said, you need to retrain the model. Instructions are described in discussion #12302 (comment).
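To make the index lookup concrete, here is a toy sketch of greedy CTC-style decoding. It is not PaddleOCR's actual decoder: the two three-character dictionaries and the prediction sequence are made up, and reserving index 0 for the blank symbol is an assumption of this sketch.

```python
def decode(indices, charset):
    """Greedy CTC-style decoding.

    Drops blanks (index 0) and consecutive repeats, then looks each
    remaining index up in the character list.
    """
    chars = ["<blank>"] + list(charset)  # index 0 is reserved for the blank (assumption)
    out = []
    prev = None
    for i in indices:
        if i != 0 and i != prev:
            out.append(chars[i])
        prev = i
    return "".join(out)


# Two toy dictionaries; the real files are much longer.
trained_dict = ["1", "0", "€"]  # dictionary the model was trained with
swapped_dict = ["€", "1", "0"]  # a different dictionary supplied at inference time

pred = [3, 3, 0, 1, 2, 0, 2]    # made-up per-timestep class indices from the model

print(decode(pred, trained_dict))  # €100
print(decode(pred, swapped_dict))  # 0€11
```

The model's output is identical in both calls; only the lookup table differs, which is why swapping the dictionary without retraining produces the wrong characters.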

RainmakerP · 2024-06-03T13:49:45Z

@dsiu1 Just to confirm that I understood correctly: does using any of the default dictionaries (provided here https://github.com/PaddlePaddle/PaddleOCR/blob/main/ppocr/utils/dict/te_dict.txt) require retraining the PaddleOCR model?

Thanks!

saanvib13 (Author) · 2024-06-03T13:59:29Z

I have the same question as @RainmakerP. The dictionary that I am using is provided by PaddleOCR itself; it is not custom-made. Why, then, is retraining the entire model required? Please advise, @GreatV @dsiu1.

Thank you!

dsiu1 · 2024-06-03T23:02:40Z

@saanvib13 @RainmakerP
No, sorry. I thought you had added the '€' symbol to the existing dictionary. You can try setting lang='te' instead of setting the dictionary directly; it will then download the correct model and dictionary at the same time. If you're using an existing dictionary with its correct associated model, there are two reasons I can think of for why it's not detecting properly:

1. The text detection model isn't extracting the region of text that contains the € symbol. If the region isn't extracted, the text recognition model never sees it.
2. I assume you have verified that the text detection model works (probably the multilingual model). If so, the text recognition model simply hasn't seen enough sequences like €3,300 or €100. In that case you need to retrain the model and provide it with more data.
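A rough way to tell the two cases apart is to run detection on its own and then the full pipeline. The sketch below assumes the PaddleOCR 2.x Python API used earlier in this thread (the `use_gpu` argument, `ocr.ocr(...)` with `rec=False`, and results shaped as one list per image with entries like [box, (text, score)]). The helper names and the image path are hypothetical.

```python
def build_ocr(lang="te", use_gpu=True):
    """Let PaddleOCR choose the recognition model and dictionary for `lang`."""
    from paddleocr import PaddleOCR  # lazy import; requires paddleocr to be installed
    return PaddleOCR(lang=lang, use_gpu=use_gpu)


def texts_containing(result, symbol):
    """Return (text, score) pairs whose recognized text contains `symbol`.

    Assumes one list per image, each entry shaped like [box, (text, score)];
    an image with no detections may be represented as None.
    """
    hits = []
    for page in result or []:
        for box, (text, score) in page or []:
            if symbol in text:
                hits.append((text, score))
    return hits


def diagnose(image_path, symbol="€"):
    """Report how many regions were detected and which texts contain `symbol`."""
    ocr = build_ocr()
    boxes = ocr.ocr(image_path, rec=False)  # detection only
    full = ocr.ocr(image_path)              # detection + recognition
    n_regions = len(boxes[0] or [])
    return n_regions, texts_containing(full, symbol)
```

Calling `diagnose("table.png")` on an image that visibly contains € amounts and getting the expected number of regions but an empty list of hits would point to the second case (recognition) rather than the first (detection).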

If you need help understanding how the algorithm works, I found that their book was a fantastic resource. Page 24 generally summarizes how their OCR works https://paddleocr.bj.bcebos.com/ebook/Dive_into_OCR.pdf.
