Replies: 1 comment
Hello, if you only need detection and not recognition, word-level annotation is not strictly necessary; line-level or even paragraph-level annotation is acceptable. However, since PP-OCRv5 can only recognize single-line text, word-level or line-level annotations are recommended, because paragraph-level regions are difficult to recognize. In practical applications, detection and recognition are usually used together, which is why the annotations include a transcription field: it lets you build a text recognition dataset from the same labels later. Using placeholder text like "##" for every transcription will not affect detection training.

Fine-tuning the detection model to detect only specific text regions is feasible. However, PP-OCRv5 is aimed at general text scenarios and has not been tested on detecting special text regions, so whether it is better to load the pre-trained model or to train from scratch is something you will need to verify experimentally for your use case.
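For reference, here is a minimal sketch of a detection label file with line-level boxes and placeholder transcriptions, following the tab-separated format described in the PaddleOCR training docs. The image path and coordinates are made up:

```python
import json

# One image with two line-level boxes. "##" placeholder transcriptions are
# fine for detection-only training, since the detector never reads the text.
# Caveat: avoid "###" -- PaddleOCR's label loading conventionally treats
# "###" as an ignore tag and excludes such boxes from the detection loss.
annotations = [
    {"transcription": "##", "points": [[10, 20], [410, 20], [410, 60], [10, 60]]},
    {"transcription": "##", "points": [[10, 80], [390, 80], [390, 118], [10, 118]]},
]

# Detection label format: <image path> \t <JSON list of boxes>, one image per line.
with open("train_label.txt", "w", encoding="utf-8") as f:
    f.write("imgs/page_001.jpg\t" + json.dumps(annotations, ensure_ascii=False) + "\n")
```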
I'm planning to fine-tune the PaddleOCR text detection model for my specific use case. After reviewing the PaddleOCR documentation, I noticed that in the training data examples, each word within an image is labeled separately.
Is word-level annotation strictly required?
Can I instead use annotations at the line level or even paragraph level, so that the detection model learns to detect whole lines or paragraphs rather than individual words?
Why is transcription required during detection training?
Since the recognition model is responsible for reading the text, why do we need to include the transcription in the detection dataset at all? If I used placeholder text like "##" for every transcription, would the detection model still train correctly?
Can I fine-tune the detection model to detect only specific text regions?
For example, could I train it to detect only dialog bubbles or specific labels in an image, ignoring the rest of the text? Would this approach still work effectively, or would it be better to train a completely separate detection model from scratch?
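On this last question, one way to make "detect only specific regions" concrete is to derive a filtered label file from a full annotation set and fine-tune on that, starting from the pre-trained weights. Below is a hedged sketch; the `region_type` field and both file names are assumptions for illustration, not part of PaddleOCR's standard label format:

```python
import json

def keep_target_regions(src_path, dst_path, target="dialog_bubble"):
    """Rewrite a detection label file so only boxes of one region type remain.

    The `region_type` field is NOT part of the standard PaddleOCR label
    format; it is a hypothetical field your own annotation pipeline would
    have to record. Dropped boxes become background, so the fine-tuned
    detector learns to suppress all other text.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            img_path, raw = line.rstrip("\n").split("\t", 1)
            kept = [box for box in json.loads(raw)
                    if box.get("region_type") == target]
            # Images with no target regions are skipped here; keep them as
            # pure negatives only if your pipeline tolerates empty box lists.
            if kept:
                dst.write(img_path + "\t" + json.dumps(kept, ensure_ascii=False) + "\n")

keep_target_regions("train_label_full.txt", "train_label_bubbles.txt")
```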