Replies: 11 comments
-
PaddleOCR does not natively support using multiple dictionaries simultaneously. However, you can modify the dictionary file to include the characters you need from other dictionaries.
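As an illustration of the dictionary-editing suggestion above, here is a minimal sketch that appends selected characters from one dictionary file to another. It assumes the dictionary files hold one character per line, as the files under `ppocr/utils/dict/` do; the function name `merge_char_dicts` and the output filename are hypothetical and not part of PaddleOCR. New characters are appended at the end so that existing characters keep their positions.

```python
from pathlib import Path


def read_chars(path):
    """Read a dictionary file with one character per line, skipping empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    lines = (line.rstrip("\r") for line in text.split("\n"))
    return [line for line in lines if line != ""]


def merge_char_dicts(base_path, extra_path, out_path, wanted=None):
    """Append characters from extra_path that are missing in base_path.

    If `wanted` is given, only those characters are copied over.
    The order of the base dictionary is preserved; new entries go last.
    """
    merged = read_chars(base_path)
    seen = set(merged)
    for ch in read_chars(extra_path):
        if ch in seen:
            continue
        if wanted is not None and ch not in wanted:
            continue
        merged.append(ch)
        seen.add(ch)
    Path(out_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged


# Example call (paths are placeholders for local copies of the dictionaries):
# merge_char_dicts("en_dict.txt", "te_dict.txt", "en_plus_symbols.txt", wanted={"€"})
```

A merged file like this only changes the list of characters; a model trained with the original dictionary has no output for the appended entries.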
-
@GreatV Thank you for your response! If I just manually add the characters that I want from the other dictionary, can I use the model directly or is it necessary to retrain the model?
-
You need to retrain the model.
-
Understood, thanks!
-
@GreatV There is one more thing. I tried changing the dictionary from en_dict.txt to te_dict.txt: https://github.com/PaddlePaddle/PaddleOCR/blob/main/ppocr/utils/dict/te_dict.txt I used the following command to do so: This dictionary contains all the letters and symbols that I want to be detected and extracted. But when I executed the code, the euro symbol still wasn't detected by the model. What could be the reason? I have already preprocessed the image, and the rest of the content, such as text and numbers, is extracted accurately, but symbols such as '€' and '-' are omitted in spite of being present in the dictionary being used. Please provide your guidance and help me resolve this issue.
-
@saanvib13 Did you train a new model with that dictionary and then use it for OCR?
-
@GreatV No, I am using the same pretrained model and the dictionary provided by PaddleOCR (te_dict.txt). Even then, it is not able to detect the characters in that dictionary correctly.
-
@saanvib13 The pretrained model doesn't understand the new dictionary. When the model makes a prediction, it outputs a numerical index, and the dictionary is used to convert that index into a character. Because the model was trained with a different dictionary, the same index now maps to a different character, which is why it doesn't work. As @GreatV said, you need to retrain the model. Instructions are described in this discussion: #12302 (comment)
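To make the index-to-character point concrete, here is a small self-contained sketch of greedy CTC-style decoding. It is not PaddleOCR's actual decoder: the function name, the toy character lists, and the convention that index 0 is the blank token are assumptions for illustration. The same predicted indices are decoded once with the dictionary the model was trained on and once with a different dictionary.

```python
def ctc_greedy_decode(indices, charset):
    """Collapse repeated indices, drop blanks, and map indices to characters.

    Assumption for this sketch: index 0 is the CTC blank, so the
    dictionary's characters occupy indices 1..len(charset).
    """
    table = ["<blank>"] + list(charset)
    out = []
    prev = None
    for i in indices:
        if i != prev and i != 0:
            out.append(table[i])
        prev = i
    return "".join(out)


# Toy dictionaries (hypothetical, far smaller than real ones).
train_chars = ["a", "b", "c", "1", "2"]   # dictionary used during training
other_chars = ["€", "-", "a", "b", "c"]   # dictionary swapped in at inference

# Indices the model would emit for an image that reads "ab12".
pred = [1, 1, 0, 2, 0, 0, 4, 5]

print(ctc_greedy_decode(pred, train_chars))  # ab12
print(ctc_greedy_decode(pred, other_chars))  # €-bc
```

The model's output layer also has a fixed number of classes set at training time, so swapping in a longer dictionary cannot give it new characters to predict.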
-
Just to confirm that I understood correctly: using any of the default dictionaries (such as https://github.com/PaddlePaddle/PaddleOCR/blob/main/ppocr/utils/dict/te_dict.txt) requires retraining the PaddleOCR model? Thanks!
-
I have the same question. The dictionary that I am using is provided by PaddleOCR itself; it is not custom made. Why, then, is retraining the entire model required? Please provide your guidance @GreatV @dsiu1. Thank you!
-
@saanvib13 @RainmakerP
If you need help understanding how the algorithm works, I found that their book was a fantastic resource. Page 24 gives a general summary of how their OCR works: https://paddleocr.bj.bcebos.com/ebook/Dive_into_OCR.pdf
-
I am using the PaddleOCR model to detect tables in images and extract them into a CSV file. Currently I am using the en_dict.txt character dictionary for text detection and extraction because my images contain English text. However, this dictionary does not contain some symbols, such as €. These essential symbols are present in the dictionaries of other languages, such as te_dict.txt. Is it possible to use multiple dictionaries together with the PaddleOCR library for text detection?