Fine-tune v3 detection model on handwriting images #14868
Replies: 1 comment 1 reply
To fine-tune the ch_PP-OCRv3_det model for handwriting detection, you need to adjust hyperparameters to better suit your dataset of grayscale handwriting images. Based on your dataset's characteristics (small grayscale images with a consistent height), the following modifications are recommended:

1. Adjust Image Processing and Augmentation

Since your images are grayscale, ensure that the model correctly handles single-channel images:
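A minimal sketch of the relevant transforms in the `Train.dataset.transforms` section, assuming the standard PP-OCRv3 detection config layout (exact values in your `ch_PP-OCRv3_det_cml.yml` may differ). Since the pretrained backbone expects 3-channel input, it is usually safer to keep `img_mode: BGR` (grayscale files are replicated across channels at decode time) than to switch the model to single-channel input:

```yaml
Train:
  dataset:
    transforms:
      - DecodeImage:
          img_mode: BGR        # grayscale images are expanded to 3 channels,
          channel_first: false # matching the pretrained weights' input shape
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: hwc
```

Keeping BGR decoding avoids a channel mismatch with the pretrained weights; if you do convert the model to single-channel input, the first conv layer's weights will not load cleanly.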
2. Optimize Learning Rate and Training Schedule

You have reduced the learning rate from 0.001 to 0.0001, which might be too low initially. A possible alternative:
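One option is an intermediate learning rate with a short warmup, using the cosine schedule the PP-OCRv3 configs already employ. A sketch (the value 0.0005 is a suggested midpoint between the original 0.001 and your 0.0001, not a tuned number):

```yaml
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.0005  # midpoint between 0.001 and 0.0001 for a small dataset
    warmup_epoch: 2        # brief warmup stabilizes early fine-tuning steps
  regularizer:
    name: L2
    factor: 0.00005
```

With only ~1000 images, a rate that is too low can make the fine-tuned model nearly indistinguishable from the pretrained one, which matches the symptom you describe.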
3. Fix Pre-Trained Model Path

Your current config refers to pretrained_model: ./pretrained_models/ch_PP-OCRv3_det_train/best_accuracy. If you are getting unexpected behavior, double-check the log output to confirm the pretrained weights are correctly loaded.

4. Tweak Loss Function
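The cml config trains with a distillation-style combined loss, but the underlying DB loss knobs are what you would tune for thin handwriting strokes. A hedged sketch of the standard `DBLoss` parameters (in the cml config these appear inside the distillation `CombinedLoss` rather than at the top level):

```yaml
Loss:
  name: DBLoss
  balance_loss: true       # balances positive/negative pixels for sparse strokes
  main_loss_type: DiceLoss
  alpha: 5                 # weight of the shrink-map loss
  beta: 10                 # weight of the threshold-map loss
  ohem_ratio: 3            # hard-negative mining ratio
```

For small, sparse handwriting regions, `balance_loss` and OHEM matter most, since most pixels in each crop are background.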
5. Modify Input Size for Small Text Regions

Your dataset contains small handwriting samples (~78-258 pixels in width). To improve training:
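A sketch of the crop and eval-resize settings, assuming the standard PP-OCRv3 detection transforms (the default training crop is 960×960, which is far larger than your images; the exact keys in your config may differ):

```yaml
Train:
  dataset:
    transforms:
      - EastRandomCropData:
          size: [640, 640]   # smaller crop than the 960x960 default, closer
          max_tries: 50      # to the scale of ~42px-high handwriting lines
          keep_ratio: true
Eval:
  dataset:
    transforms:
      - DetResizeForTest:
          limit_side_len: 736
          limit_type: min    # upscales small images so thin strokes survive
```

Since your images are only ~42 pixels tall, make sure resizing does not shrink them further; with `limit_type: min`, small inputs are scaled up rather than down.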
6. Experiment with Batch Size
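With roughly 1000 images, a moderate batch size gives more update steps per epoch. A sketch of the loader settings (values are starting points to experiment with, not tuned recommendations):

```yaml
Train:
  loader:
    shuffle: true
    batch_size_per_card: 8  # try 8 or 16; smaller batches give more updates
    num_workers: 4          # per epoch on a ~1000-image dataset
```

If you increase the batch size, consider scaling the learning rate up proportionally.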
Next Steps
These optimizations should help the detector adapt better to your handwriting dataset. Let me know if you encounter any issues! 🚀

Response generated by 🤖 feifei-bot | chatgpt-4o-latest
Hi Paddle Community,
I am going to fine-tune a detection model using ch_PP-OCRv3_det_cml.yml as the config and ch_PP-OCRv3_det_slim as the pretrained model. I have built my own dataset, consisting mainly of grayscale handwriting images with widths ranging from 78 to 258 pixels and a consistent height of about 42 pixels. I have balanced the image batch size and made the total dataset about 1000 images.
Which hyperparameters in the config file should I modify to suit my case? I have already fine-tuned once, mainly changing the epochs to 100 and the learning rate from 0.001 to 0.0001, but the inference result is similar to that of the original pretrained model.
Here is the config: