Fine Tuning Detection on Custom Dataset #14602
To address your questions and issues with fine-tuning the detection model:

1. Image Size and Preprocessing

What is the smallest image size that can be input to the model?
The PaddleOCR detection models generally don't have a strict minimum image size, as they can dynamically resize images during preprocessing. However, very small images can lead to issues due to insufficient feature extraction. The smallest size should ensure that the text regions remain distinguishable after resizing and cropping.

Relevant configurations: the cropping and shrink-map transforms in the training pipeline (see the sketch below).
Recommendation: choose sizes such that text regions remain distinguishable after resizing and cropping.
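As a rough sketch of those relevant configurations (the parameter values here are the common defaults from the PP-OCRv3 detection configs, not necessarily the ones in your file), the preprocessing options that determine the effective training input size look like this:

```yaml
Train:
  dataset:
    transforms:
      - EastRandomCropData:
          size: [960, 960]     # crop size actually fed to the model during training
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8     # text regions smaller than this are ignored in the target maps
```

For a dataset of small images, the crop size and `min_text_size` are the two values most worth revisiting.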
What size does the model expect?
The model expects the input size to be a multiple of 32 due to the output stride of the backbone. For example, 640×640 or 960×960 are acceptable, while an arbitrary size is resized during preprocessing so that each side becomes a multiple of 32.
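For reference, a sketch of how the evaluation/inference resize is typically configured; `limit_side_len: 960` is an assumed example value, and `DetResizeForTest` rounds each side to a multiple of 32 after limiting the longest side:

```yaml
Eval:
  dataset:
    transforms:
      - DetResizeForTest:
          limit_side_len: 960   # cap on the longest side
          limit_type: max       # both sides are then rounded to multiples of 32
```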
2. Training Logs and Metrics

How to interpret the logs?
The logs provide detailed metrics for each training step: the loss terms, the current learning rate, and reader/batch timing and throughput statistics.

Why is `hmean` not being calculated?
`hmean` (together with precision and recall) is only computed when evaluation runs on the Eval set, which is controlled by `eval_batch_step`; the per-step training log does not include it.
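For reference, these are the two `Global` settings involved; the values shown are the usual defaults in the detection configs and may differ from yours:

```yaml
Global:
  # Evaluate every 400 iterations, starting from iteration 0. hmean, precision and
  # recall are produced by these evaluation runs, not by the per-step training log.
  eval_batch_step: [0, 400]
  # Detection configs normally keep this false; metrics are computed on the Eval set.
  cal_metric_during_train: false
```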
I'm trying to fine-tune 'en_PP-OCRv3_det_distill_train' on my custom dataset, which consists of small images.
I've prepared the dataset according to the format presented in https://paddlepaddle.github.io/PaddleOCR/latest/en/datasets/ocr_datasets.html, and I'm using the config file from https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/model_list.html with some modifications.
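For concreteness, a minimal sketch of the kind of modifications this usually involves; all paths, the learning rate and the epoch count below are illustrative placeholders, not values taken from the discussion:

```yaml
Global:
  pretrained_model: ./pretrain_models/en_PP-OCRv3_det_distill_train/best_accuracy
  epoch_num: 100
  save_model_dir: ./output/det_finetune/
  eval_batch_step: [0, 400]
  cal_metric_during_train: false

Optimizer:
  lr:
    learning_rate: 0.0005   # reduced for fine-tuning

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/
    # Each line of the label file: image path, a tab, then a JSON list of regions, e.g.
    # imgs/img_1.jpg\t[{"transcription": "HELLO", "points": [[10,10],[80,10],[80,40],[10,40]]}]
    label_file_list:
      - ./train_data/det_train_label.txt

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/
    label_file_list:
      - ./train_data/det_eval_label.txt
```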
I have a couple of questions that I haven't been able to answer by digging in the docs:
Regarding the size of the images and the preprocessing steps in the config file, there are a couple of options (EastRandomCropData and ShrinkMap) that seem to affect the size of the image that is going to be input to the model. This could affect performance, as it can mess with the images.
When fine-tuning I get the following log:

Why is `hmean` not being calculated? Setting `cal_metric_during_train` makes the training process fail on start. Setting `eval_batch_step` to a number that is actually reached during training (i.e. 40) gives me:

There are some images in my dataset that do not contain any text. I want the detection model to also learn to output nothing when there is no text in an image.
After fine-tuning another model, even with a small learning rate, performance seems to worsen, which makes me believe there is something I'm doing wrong. Is there any tutorial that covers something like this (just fine-tuning detection on a set of custom images) in a bit more depth than the ones at https://paddlepaddle.github.io/PaddleOCR/latest/en/index.html?