new model or fine tune existing model #12374

alifpay · 2024-01-29T10:42:13Z

Tajik is written in Cyrillic. Do I have to create a new model from scratch, or can I fine-tune the Russian Cyrillic model?

Tajik has extra letters: ӣ, ҳ, қ, ҷ, ӯ, ғ.

Tajik and Russian are different languages.

tink2123 (Collaborator) · 2024-01-30T03:07:49Z

You can add the required characters to the bottom of the original dictionary. Then fine-tune the model based on your own data.
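A minimal sketch of the dictionary step, assuming the dictionary is a UTF-8 text file with one character per line. The function name, the file paths in the usage comment, and the choice to also add uppercase forms are illustrative assumptions, not part of the reply above.

```python
from pathlib import Path

# Extra Tajik letters listed in the question (lowercase forms).
TAJIK_EXTRA = "ӣҳқҷӯғ"


def extend_dictionary(src, dst, extra_chars):
    """Copy a one-character-per-line dictionary and append missing characters.

    Existing entries keep their order, so their indices stay unchanged.
    Returns the list of characters that were actually appended.
    """
    chars = Path(src).read_text(encoding="utf-8").splitlines()
    seen = set(chars)
    added = []
    for ch in extra_chars:
        # Assumption: uppercase forms are wanted as well as lowercase ones.
        for form in (ch, ch.upper()):
            if form not in seen:
                seen.add(form)
                added.append(form)
    Path(dst).write_text("\n".join(chars + added) + "\n", encoding="utf-8")
    return added


# Hypothetical usage (both paths are placeholders):
# extend_dictionary("cyrillic_dict.txt", "tajik_dict.txt", TAJIK_EXTRA)
```

Appending at the bottom, rather than inserting in alphabetical order, keeps the index of every existing character the same, which is presumably why the reply suggests it.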

alifpay (Author) · 2024-01-30T05:28:56Z

What do you think, is this OK?
I asked Bard, and this was its response:

Here's a step-by-step guide on fine-tuning the Russian Cyrillic model in PaddleOCR for recognizing Tajik text:

Data Preparation:

Gather Tajik Text Data: Collect a diverse dataset of text images containing the characters you want to recognize. Ensure it includes various fonts, writing styles, and lighting conditions. Aim for at least several thousand images.
Annotate the Data: Use tools like LabelImg or VGG Image Annotator to meticulously label each character within the text images. Consider using bounding boxes or text recognition labels.
Split the Data: Divide your labeled dataset into training, validation, and test sets. Typically, use 80% for training, 10% for validation, and 10% for testing.
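The 80/10/10 split described above could be sketched as follows. This assumes the annotations are already in a recognition label file with one `image_path<TAB>text` entry per line; the function name and the file name in the usage comment are hypothetical.

```python
import random


def split_labels(lines, train=0.8, val=0.1, seed=0):
    """Shuffle label lines and split them into train/val/test lists.

    Whatever is left after the train and val shares goes to the test set.
    A fixed seed keeps the split reproducible between runs.
    """
    items = list(lines)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


# Hypothetical usage (the file name is a placeholder):
# with open("rec_gt_all.txt", encoding="utf-8") as f:
#     train_set, val_set, test_set = split_labels(f.read().splitlines())
```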

Choose Your Environment:

Install Dependencies: Make sure you have PaddlePaddle (pip install paddlepaddle), PaddleOCR (pip install paddleocr), and other relevant libraries installed and configured. Choose based on your preferred environment (CPU, GPU, or TPU).
Download Pre-trained Model: Download the pre-trained Russian Cyrillic model from the PaddleOCR model zoo (https://github.com/PaddlePaddle/models): wget https://paddleocr.bj.bcebos.com/dygraph_2.0/ch_ppocr_mobile_v2_rec_chinese_lite_v1.2_infer.tar.gz

Fine-tuning Script:

Create a Python script: This script will load the dataset, define the model architecture (using the Russian model as a base), set up the training pipeline, and fine-tune the model using PaddlePaddle.
Load Data and Model: Use PaddleOCR's data loading functions to read your annotated dataset and load the pre-trained Russian model.
Define Fine-tuning Configuration: Specify hyperparameters like learning rate, optimizer, loss function, and training epochs. Adjust these based on your dataset size and complexity.
Freeze Initial Layers (Optional): Consider freezing some initial layers of the pre-trained model to retain learned features for common characters and focus training on adapting to Tajik-specific ones.
Train the Model: Run the training script to fine-tune the model on your prepared Tajik dataset. Monitor training progress through metrics like accuracy and loss on the validation set.
Save the Model: Once satisfied with the performance, save the fine-tuned model for future use.
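For what it's worth, PaddleOCR training is normally driven by `tools/train.py` with a YAML config and `-o` overrides rather than a hand-written training loop. The sketch below only builds such a command line; it does not run it. All four paths in the usage comment are placeholders, and the helper name is hypothetical. Note also that the file name in the `wget` URL quoted above starts with `ch_ppocr_mobile` and contains `chinese_lite`, so it does not look like a Cyrillic model.

```python
def build_train_command(config, pretrained, dict_path, save_dir):
    """Return the argv list for a fine-tuning run of tools/train.py.

    The command is only constructed, not executed, so the paths can be
    reviewed first. The override keys are fields of the Global section
    of a PaddleOCR recognition config.
    """
    overrides = {
        "Global.pretrained_model": pretrained,
        "Global.character_dict_path": dict_path,
        "Global.save_model_dir": save_dir,
    }
    args = ["python3", "tools/train.py", "-c", config, "-o"]
    args += ["%s=%s" % (key, value) for key, value in overrides.items()]
    return args


# Hypothetical usage (every path here is a placeholder):
# cmd = build_train_command(
#     "configs/rec/my_tajik_rec.yml",
#     "./pretrain_models/cyrillic_rec/best_accuracy",
#     "./tajik_dict.txt",
#     "./output/rec_tajik",
# )
# print(" ".join(cmd))
```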

Evaluation and Deployment:

Evaluate on Test Set: Assess the fine-tuned model's accuracy on the unseen test set to gauge its generalization ability. Consider metrics like character accuracy, word error rate, and recognition speed.
Deployment (Optional): If the performance meets your requirements, integrate the model into your application or web service for recognizing Tajik text. Ensure compatibility with your chosen environment.
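To make the evaluation metrics concrete, here is a small pure-Python sketch of exact-match accuracy and a normalized edit-distance similarity. It is a generic illustration, not PaddleOCR's own metric code, and the function names are made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def score(preds, targets):
    """Return exact-match accuracy and mean normalized edit similarity."""
    n = len(targets)
    exact = sum(p == t for p, t in zip(preds, targets))
    dist = sum(
        edit_distance(p, t) / max(len(p), len(t), 1)
        for p, t in zip(preds, targets)
    )
    return {"acc": exact / n, "norm_edit_sim": 1 - dist / n}
```

Exact-match accuracy counts a word as wrong if a single character differs, while the edit-distance score gives partial credit, so the two numbers can differ a lot on the same predictions.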

Additional Tips:

Explore Data Augmentation: Use techniques like random transformations, noise injection, and color jittering to increase the diversity of your training data and improve model robustness.
Regularize the Model: Employ techniques like dropout and L1/L2 regularization to prevent overfitting and improve generalization.
Utilize PaddleOCR Resources: Refer to the official documentation, tutorials, and community forums for specific guidance and troubleshooting related to fine-tuning models in PaddleOCR.

Remember, fine-tuning requires careful data preparation, training adjustments, and evaluation to achieve optimal accuracy for your specific needs. Consider starting with a small-scale experiment and adjust based on your findings.

rm-asif-amin · 2024-03-27T10:51:01Z

@tink2123 Hello mate,
I'm trying to train a model for the Bengali language (it's not supported by Paddle yet). Here's my approach and the outcome:

1. Training a model from scratch with the following yml config:

    Global:
      use_gpu: true
      epoch_num: 10
      log_smooth_window: 20
      print_batch_step: 10
      save_model_dir: ./output/rec_ben
      save_epoch_step: 3
      eval_batch_step:
        - 0
        - 100
      cal_metric_during_train: true
      pretrained_model: null
      checkpoints: null
      save_inference_dir: /content/inference
      use_visualdl: false
      infer_img:
      character_dict_path: /content/bn_dict.txt
      max_text_length: 25
      infer_mode: false
      use_space_char: true
    Optimizer:
      name: Adam
      beta1: 0.9
      beta2: 0.999
      lr:
        name: Cosine
        learning_rate: 0.001
      regularizer:
        name: L2
        factor: 1.0e-05
    Architecture:
      model_type: rec
      algorithm: CRNN
      Transform: null
      Backbone:
        name: MobileNetV3
        scale: 0.5
        model_name: small
        small_stride:
          - 1
          - 2
          - 2
          - 2
      Neck:
        name: SequenceEncoder
        encoder_type: rnn
        hidden_size: 48
      Head:
        name: CTCHead
        fc_decay: 1.0e-05
    Loss:
      name: CTCLoss
    PostProcess:
      name: CTCLabelDecode
    Metric:
      name: RecMetric
      main_indicator: acc
    Train:
      dataset:
        name: SimpleDataSet
        data_dir: /content/content/final_dataset/rec/train/train
        label_file_list:
          - /content/content/final_dataset/rec/train/rec_gt_train.txt
        transforms:
          - DecodeImage:
              img_mode: BGR
              channel_first: false
          - RecAug: null
          - CTCLabelEncode: null
          - RecResizeImg:
              image_shape:
                - 3
                - 32
                - 320
          - KeepKeys:
              keep_keys:
                - image
                - label
                - length
      loader:
        shuffle: true
        batch_size_per_card: 256
        drop_last: true
        num_workers: 8
    Eval:
      dataset:
        name: SimpleDataSet
        data_dir: /content/content/final_dataset/rec/test/test
        label_file_list:
          - /content/content/final_dataset/rec/test/rec_gt_test.txt
        transforms:
          - DecodeImage:
              img_mode: BGR
              channel_first: false
          - CTCLabelEncode: null
          - RecResizeImg:
              image_shape:
                - 3
                - 32
                - 320
          - KeepKeys:
              keep_keys:
                - image
                - label
                - length
      loader:
        shuffle: false
        drop_last: false
        batch_size_per_card: 256
        num_workers: 8

2. Using a dataset of ~80k one-word images. After training for 100+ epochs, the training loss looks like this:

3. The training and eval accuracy are not consistent (eval accuracy is low compared to training accuracy):

The edit distance metrics, on the other hand, show the opposite:

I've added Bengali letters to an English dictionary, and the labels seem to be encoded correctly:

Original Label: ব্যঙ্গবিদ্রূপাত্মক
Transformed Label: [157 160 139 137 157 123 152 161 126 155 122 150 159 135   0   0   0   0
   0   0   0   0   0   0   0]
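One check that might be worth running on the encoding: the array above has 14 non-zero ids, while the label as displayed appears to contain more code points than that, since Bengali conjuncts are written with a combining sign between consonants. Assuming the label encoder silently skips characters that are not in the dictionary (an assumption about the encoder, worth confirming), the sketch below lists which code points of a label a dictionary does not cover. The function names are hypothetical.

```python
import unicodedata


def uncovered_code_points(label, dictionary):
    """Return the code points of `label` that are missing from `dictionary`.

    `dictionary` is any iterable of single characters, for example the
    lines of the dictionary file with the trailing newline removed.
    """
    known = set(dictionary)
    return [ch for ch in label if ch not in known]


def describe(chars):
    """Format code points as 'U+XXXX NAME' so combining marks are visible."""
    return ["U+%04X %s" % (ord(c), unicodedata.name(c, "?")) for c in chars]


# Hypothetical usage (the path is taken from the config above):
# with open("/content/bn_dict.txt", encoding="utf-8") as f:
#     dictionary = f.read().splitlines()
# print(describe(uncovered_code_points("ব্যঙ্গবিদ্রূপাত্মক", dictionary)))
```

Comparing `len(label)` with the number of non-zero ids in the transformed label gives a quick first hint before running the full check.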

My questions are:

Is my approach right? The model seems to be learning, so why is the accuracy low?
Should I use a pre-trained model and fine-tune it (maybe one for a similar Indic language)?

I would love to know your opinion.
