149 changes: 149 additions & 0 deletions configs/rec/rec_svtrnet_igtr.yml
@@ -0,0 +1,149 @@
Global:
  use_gpu: True
  epoch_num: 20
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec/svtr_igtr_base/
  save_epoch_step: 1
  # evaluation is run every 2000 iterations after the 0th iteration
  eval_batch_step: [0, 2000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_words_en/word_10.png
  # for data or label process
  character_type: en
  character_dict_path: &character_dict_path
  max_text_length: &max_text_length 25
  infer_mode: False
  use_space_char: &use_space_char False
  save_res_path: ./output/rec/predicts_svtr_igtr_base.txt


Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.99
  epsilon: 1.e-8
  weight_decay: 0.05
  no_weight_decay_name: norm pos_embed char_node_embed pos_node_embed char_pos_embed vis_pos_embed
  one_dim_param_no_weight_decay: True
  lr:
    name: Cosine
    learning_rate: 0.0005 # 4gpus 256bs
    warmup_epoch: 2

Architecture:
  model_type: rec
  algorithm: IGTR
  Transform:
  Backbone:
    name: SVTRNet2DPos
    img_size: [32, -1]
    out_char_num: 25
    out_channels: 256
    patch_merging: 'Conv'
    embed_dim: [128, 256, 384]
    depth: [6, 6, 6]
    num_heads: [4, 8, 12]
    mixer: ['ConvB','ConvB','ConvB','ConvB','ConvB','ConvB', 'ConvB','ConvB', 'Global','Global','Global','Global','Global','Global','Global','Global','Global','Global']
    local_mixer: [[5, 5], [5, 5], [5, 5]]
    last_stage: False
    prenorm: True
    use_first_sub: False
  Head:
    name: IGTRHead
    dim: 384
    num_layer: 1
    ar: False
    refine_iter: 0
    next_pred: False
    pos2d: True
    ds: True

Loss:
  name: IGTRLoss

PostProcess:
  name: IGTRLabelDecode
  character_dict_path: *character_dict_path
  use_space_char: *use_space_char

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: RatioDataSet
    ds_width: True
    padding: False
    max_ratio: 4
    data_dir_list: ['./train_data/data_lmdb_release/training/data_name1',
      './train_data/data_lmdb_release/training/data_name2']
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - IGTRLabelEncode: # Class handling label
          k: 8
          prompt_error: False
          character_dict_path: *character_dict_path
          use_space_char: *use_space_char
          max_text_length: *max_text_length
      - KeepKeys:
          keep_keys: ['image', 'label', 'prompt_pos_idx_list',
            'prompt_char_idx_list', 'ques_pos_idx_list', 'ques1_answer_list',
            'ques2_char_idx_list', 'ques2_answer_list', 'ques3_answer', 'ques4_char_num_list',
            'ques_len_list', 'ques2_len_list', 'prompt_len_list', 'length'] # dataloader will return list in this order
  sampler:
    name: RatioSampler
    scales: [[128, 32]] # w, h
    # divided_factor: ensures the width and height are divisible by the downsampling multiple
    first_bs: 256
    fix_bs: false
    divided_factor: [4, 16] # w, h
    is_training: False
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 4

Eval:
  dataset:
    name: RatioDataSet
    ds_width: True
    padding: False
    max_ratio: 4
    data_dir_list: ['./train_data/data_lmdb_release/evaluation/CUTE80',
      './train_data/data_lmdb_release/evaluation/IC13_857',
      './train_data/data_lmdb_release/evaluation/IC15_1811',
      './train_data/data_lmdb_release/evaluation/IIIT5k',
      './train_data/data_lmdb_release/evaluation/SVT',
      './train_data/data_lmdb_release/evaluation/SVTP']
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - ARLabelEncode: # Class handling label
          character_dict_path: *character_dict_path
          use_space_char: *use_space_char
          max_text_length: *max_text_length
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  sampler:
    name: RatioSampler
    scales: [[128, 32]] # w, h
    # divided_factor: ensures the width and height are divisible by the downsampling multiple
    first_bs: 256
    fix_bs: false
    divided_factor: [4, 16] # w, h
    is_training: False
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 256
    num_workers: 4
3 changes: 3 additions & 0 deletions docs/algorithm/overview.en.md
@@ -77,6 +77,8 @@ Supported text recognition algorithms (Click the link to get the tutorial):
- [x] [ParseQ](./text_recognition/algorithm_rec_parseq.md)
- [x] [CPPD](./text_recognition/algorithm_rec_cppd.en.md)
- [x] [SATRN](./text_recognition/algorithm_rec_satrn.en.md)
- [x] [IGTR](./text_recognition/algorithm_rec_igtr.en.md)


Following [DTRB](https://arxiv.org/abs/1904.01906), the training and evaluation results of the text recognition algorithms above (trained on MJSynth and SynthText, evaluated on IIIT, SVT, IC03, IC13, IC15, SVTP, CUTE) are as follows:

@@ -104,6 +106,7 @@ Refer to [DTRB](https://arxiv.org/abs/1904.01906), the training and evaluation r
|ParseQ|VIT| 91.24% | rec_vit_parseq_synth | [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/parseq/rec_vit_parseq_synth.tgz) |
|CPPD|SVTR-Base| 93.8% | rec_svtrnet_cppd_base_en | [trained model](https://paddleocr.bj.bcebos.com/CCPD/rec_svtr_cppd_base_en_train.tar) |
|SATRN|ShallowCNN| 88.05% | rec_satrn | [trained model](https://pan.baidu.com/s/10J-Bsd881bimKaclKszlaQ?pwd=lk8a) |
|IGTR|SVTR-Base| 94.78% | rec_svtr_igtr | [trained model](https://paddleocr.bj.bcebos.com/igtr/rec_svtr_igtr_train.tar) |

### 1.3 Text Super-Resolution Algorithms

2 changes: 2 additions & 0 deletions docs/algorithm/overview.md
@@ -78,6 +78,7 @@ PaddleOCR will **continue to add** support for cutting-edge algorithms and models in the OCR field, **welcome
- [x] [ParseQ](./text_recognition/algorithm_rec_parseq.md)
- [x] [CPPD](./text_recognition/algorithm_rec_cppd.md)
- [x] [SATRN](./text_recognition/algorithm_rec_satrn.md)
- [x] [IGTR](./text_recognition/algorithm_rec_igtr.md)

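Following the [DTRB](https://arxiv.org/abs/1904.01906) (3) text recognition training and evaluation protocol, the models are trained on the MJSynth and SynthText text recognition datasets and evaluated on the IIIT, SVT, IC03, IC13, IC15, SVTP, and CUTE datasets. The results are as follows: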

@@ -105,6 +106,7 @@ PaddleOCR will **continue to add** support for cutting-edge algorithms and models in the OCR field, **welcome
|ParseQ|VIT| 91.24% | rec_vit_parseq_synth | [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/parseq/rec_vit_parseq_synth.tgz) |
|CPPD|SVTR-Base| 93.8% | rec_svtrnet_cppd_base_en | [trained model](https://paddleocr.bj.bcebos.com/CCPD/rec_svtr_cppd_base_en_train.tar) |
|SATRN|ShallowCNN| 88.05% | rec_satrn | [trained model](https://pan.baidu.com/s/10J-Bsd881bimKaclKszlaQ?pwd=lk8a) |
|IGTR|SVTR-Base| 94.78% | rec_svtr_igtr | [trained model](https://paddleocr.bj.bcebos.com/igtr/rec_svtr_igtr_train.tar) |

### 1.3 Text Super-Resolution Algorithms

2 changes: 1 addition & 1 deletion docs/algorithm/text_recognition/algorithm_rec_cppd.en.md
@@ -70,7 +70,7 @@ Specifically, after the data preparation is completed, the training can be start
python3 tools/train.py -c configs/rec/rec_svtrnet_cppd_base_en.yml

# Multi GPU training, specify the gpu number through the --gpus parameter
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_svtrnet_cppd_base_en.yml
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_svtrnet_cppd_base_en.yml
```

### Evaluation
2 changes: 1 addition & 1 deletion docs/algorithm/text_recognition/algorithm_rec_cppd.md
@@ -86,7 +86,7 @@ The accuracy (%) and model files of CPPD on public scene text recognition datasets are as follows:
python3 tools/train.py -c configs/rec/rec_svtrnet_cppd_base_en.yml

# Multi-GPU training; specify the GPU IDs with the --gpus parameter
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_svtrnet_cppd_base_en.yml
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_svtrnet_cppd_base_en.yml
```

### 3.2 Evaluation
139 changes: 139 additions & 0 deletions docs/algorithm/text_recognition/algorithm_rec_igtr.en.md
@@ -0,0 +1,139 @@
---
comments: true
---

# IGTR

## 1. Introduction

Paper:
> [Instruction-Guided Scene Text Recognition](https://arxiv.org/abs/2401.17851),
> Yongkun Du, Zhineng Chen, Yuchen Su, Caiyan Jia, Yu-Gang Jiang,
> TPAMI 2025,
> Source Repository: [OpenOCR](https://github.com/Topdu/OpenOCR)

Multi-modal models have shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models cannot be trivially applied to scene text recognition (STR) due to the compositional difference between natural and text images. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency, position, etc. IGTR first devises $\left \langle condition,question,answer \right \rangle$ instruction triplets, providing rich and diverse descriptions of character attributes. To effectively learn these attributes through question-answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module and a multi-task answer head, which guides nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs from current methods considerably. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins, while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of rarely appearing and morphologically similar characters, which were previous challenges.
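
To make the triplet idea more concrete, the following is a minimal sketch of how $\left \langle condition,question,answer \right \rangle$ instructions about character position, frequency, and count could be derived from a text label. The `Instruction` class, the `build_instructions` function, and the question wording are hypothetical and chosen for illustration only; they are not the `IGTRLabelEncode` implementation and do not reproduce the paper's exact instruction set.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Instruction:
    """Hypothetical <condition, question, answer> triplet (illustrative names)."""
    condition: list  # revealed (position, character) pairs
    question: str
    answer: object


def build_instructions(label: str, k: int, seed: int = 0) -> list:
    """Build toy instructions about character position, frequency, and count."""
    rng = random.Random(seed)
    positions = list(range(len(label)))
    # Condition: reveal up to k randomly chosen characters with their positions.
    revealed = sorted(rng.sample(positions, min(k, len(label))))
    condition = [(i, label[i]) for i in revealed]

    instructions = []
    # Position questions: which character sits at each hidden position?
    for i in positions:
        if i not in revealed:
            instructions.append(
                Instruction(condition, f"character at position {i}?", label[i]))
    # Frequency questions: how often does each distinct character occur?
    for ch, count in sorted(Counter(label).items()):
        instructions.append(
            Instruction(condition, f"how many times does '{ch}' occur?", count))
    # Count question: how many characters does the text contain?
    instructions.append(
        Instruction(condition, "how many characters in total?", len(label)))
    return instructions


if __name__ == "__main__":
    for ins in build_instructions("hello", k=2):
        print(ins.condition, "|", ins.question, "->", ins.answer)
```

Changing which questions are sampled changes what the model is asked to predict, which is the intuition behind realizing different recognition pipelines with different instructions.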

The accuracy (%) and model files of IGTR on public scene text recognition datasets are as follows:

- Trained on the synthetic datasets (MJ+ST) and tested on the Common Benchmarks; both the training and test datasets are from [PARSeq](https://github.com/baudm/parseq).

| Model | IC13<br/>857 | SVT | IIIT5k<br/>3000 | IC15<br/>1811 | SVTP | CUTE80 | Avg | Config&Model&Log |
| :-----: | :----------: | :--: | :-------------: | :-----------: | :--: | :----: | :---: | :---------------------------------------------------------------------------------------------: |
| IGTR-PD | 97.6 | 95.2 | 97.6 | 88.4 | 91.6 | 95.5 | 94.30 | TODO |
| IGTR-AR | 98.6 | 95.7 | 98.2 | 88.4 | 92.4 | 95.5 | 94.78 | as above |

- Tested on Union14M-Benchmark, from [Union14M](https://github.com/Mountchicken/Union14M/).

| Model | Curve | Multi-<br/>Oriented | Artistic | Contextless | Salient | Multi-<br/>word | General | Avg | Config&Model&Log |
| :-----: | :---: | :-----------------: | :------: | :---------: | :-----: | :-------------: | :-----: | :---: | :---------------------: |
| IGTR-PD | 76.9 | 30.6 | 59.1 | 63.3 | 77.8 | 62.5 | 66.7 | 62.40 | Same as the above table |
| IGTR-AR | 78.4 | 31.9 | 61.3 | 66.5 | 80.2 | 69.3 | 67.9 | 65.07 | as above |

- Trained on Union14M-L-LMDB-Filtered training dataset.

| Model | IC13<br/>857 | SVT | IIIT5k<br/>3000 | IC15<br/>1811 | SVTP | CUTE80 | Avg | Config&Model&Log |
| :----------: | :----------: | :--: | :-------------: | :-----------: | :--: | :----: | :---: | :---------------------------------------------------------------------------------------------: |
| IGTR-PD | 97.7 | 97.7 | 98.3 | 89.8 | 93.7 | 97.9 | 95.86 | [PaddleOCR Model](https://paddleocr.bj.bcebos.com/igtr/rec_svtr_igtr_train.tar) |
| IGTR-AR | 98.1 | 98.4 | 98.7 | 90.5 | 94.9 | 98.3 | 96.48 | as above |
| IGTR-PD-60ep | 97.9 | 98.3 | 99.2 | 90.8 | 93.7 | 97.6 | 96.24 | TODO|
| IGTR-AR-60ep | 98.4 | 98.1 | 99.3 | 91.5 | 94.3 | 97.6 | 96.54 | as above |
| IGTR-PD-PT | 98.6 | 98.0 | 99.1 | 91.7 | 96.8 | 99.0 | 97.20 | TODO |
| IGTR-AR-PT | 98.8 | 98.3 | 99.2 | 92.0 | 96.8 | 99.0 | 97.34 | as above |

| Model | Curve | Multi-<br/>Oriented | Artistic | Contextless | Salient | Multi-<br/>word | General | Avg | Config&Model&Log |
| :----------: | :---: | :-----------------: | :------: | :---------: | :-----: | :-------------: | :-----: | :---: | :---------------------: |
| IGTR-PD | 88.1 | 89.9 | 74.2 | 80.3 | 82.8 | 79.2 | 83.0 | 82.51 | Same as the above table |
| IGTR-AR | 90.4 | 91.2 | 77.0 | 82.4 | 84.7 | 84.0 | 84.4 | 84.86 | as above |
| IGTR-PD-60ep | 90.0 | 92.1 | 77.5 | 82.8 | 86.0 | 83.0 | 84.8 | 85.18 | Same as the above table |
| IGTR-AR-60ep | 91.0 | 93.0 | 78.7 | 84.6 | 87.3 | 84.8 | 85.6 | 86.43 | as above |
| IGTR-PD-PT | 92.4 | 92.1 | 80.7 | 83.6 | 87.7 | 86.9 | 85.0 | 86.92 | Same as the above table |
| IGTR-AR-PT | 93.0 | 92.9 | 81.3 | 83.4 | 88.6 | 88.7 | 85.6 | 87.65 | as above |

- Trained and tested on the Chinese dataset, from the [Chinese Benchmark](https://github.com/FudanVI/benchmarking-chinese-text-recognition).

| Model | Scene | Web | Document | Handwriting | Avg | Config&Model&Log |
| :---------: | :---: | :--: | :------: | :---------: | :---: | :---------------------------------------------------------------------------------------------: |
| IGTR-PD | 73.1 | 74.8 | 98.6 | 52.5 | 74.75 | |
| IGTR-AR | 75.1 | 76.4 | 98.7 | 55.3 | 76.37 | |
| IGTR-PD-TS | 73.5 | 75.9 | 98.7 | 54.5 | 75.65 | TODO |
| IGTR-AR-TS | 75.6 | 77.0 | 98.8 | 57.3 | 77.17 | as above |
| IGTR-PD-Aug | 79.5 | 80.0 | 99.4 | 58.9 | 79.45 | TODO |
| IGTR-AR-Aug | 82.0 | 81.7 | 99.5 | 63.8 | 81.74 | as above |
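
The `Avg` column in the Chinese benchmark table above appears to be the unweighted mean of the four per-subset accuracies. This is an inference from the numbers, not something the table states, and a few rows (for example IGTR-AR-Aug) differ by 0.01, presumably because the published averages were computed from unrounded scores. A minimal sketch of the check, using values copied from the table:

```python
# Per-subset accuracies (%) copied from the Chinese benchmark table:
# Scene, Web, Document, Handwriting.
CHINESE_BENCHMARK = {
    "IGTR-PD": [73.1, 74.8, 98.6, 52.5],
    "IGTR-PD-TS": [73.5, 75.9, 98.7, 54.5],
    "IGTR-PD-Aug": [79.5, 80.0, 99.4, 58.9],
}


def unweighted_mean(scores):
    """Average the subset accuracies without weighting by subset size."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, scores in CHINESE_BENCHMARK.items():
        print(f"{name}: {unweighted_mean(scores):.2f}")
```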

Download all configs, models, and logs from [OpenOCR](https://github.com/Topdu/OpenOCR/blob/main/configs/rec/igtr/readme.md), then convert them to PaddleOCR model files.

## 2. Environment

Please refer to ["Environment Preparation"](../../ppocr/environment.en.md) to configure the PaddleOCR environment, and refer to ["Project Clone"](../../ppocr/blog/clone.en.md) to clone the project code.

### Dataset Preparation

- [English dataset download](https://github.com/baudm/parseq)

- [Union14M-L-LMDB-Filtered download](https://github.com/Topdu/OpenOCR/blob/main/docs/svtrv2.md#downloading-datasets)

- [Chinese dataset download](https://github.com/fudanvi/benchmarking-chinese-text-recognition#download)

## 3. Model Training / Evaluation / Prediction

Please refer to [Text Recognition Tutorial](../../ppocr/model_train/recognition.en.md). PaddleOCR modularizes the code, and training different recognition models only requires **changing the configuration file**.

### Training

Specifically, after the data preparation is completed, the training can be started. The training command is as follows:

```bash linenums="1"
# Single GPU training (long training period, not recommended)
python3 tools/train.py -c configs/rec/rec_svtrnet_igtr.yml

# Multi-GPU training; specify the GPU IDs with the --gpus parameter
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_svtrnet_igtr.yml
```
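
The `configs/rec/rec_svtrnet_igtr.yml` file used above sets `lr.name: Cosine`, `learning_rate: 0.0005`, `warmup_epoch: 2`, and `epoch_num: 20`. The sketch below shows one common formulation of such a schedule, linear warmup followed by cosine decay. The function name `lr_at_step` and the `steps_per_epoch` value are assumptions for illustration, and the schedule actually produced by PaddleOCR may differ in detail (for example in how the warmup steps are counted against the cosine period).

```python
import math


def lr_at_step(step, base_lr, total_steps, warmup_steps):
    """Linear warmup from 0 to base_lr, then cosine decay towards 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    steps_per_epoch = 1000  # hypothetical; depends on dataset and batch size
    epoch_num, warmup_epoch, base_lr = 20, 2, 0.0005  # values from the config
    total = epoch_num * steps_per_epoch
    warmup = warmup_epoch * steps_per_epoch
    for epoch in (0, 1, 2, 10, 20):
        lr = lr_at_step(epoch * steps_per_epoch, base_lr, total, warmup)
        print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```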

### Evaluation

You can download the model files and configuration files provided for `IGTR` ([download link](https://paddleocr.bj.bcebos.com/igtr/rec_svtr_igtr_train.tar)) and evaluate them with the following commands:

```bash linenums="1"
# Download the tar archive containing the model files and configuration files of IGTR-B and extract it
wget https://paddleocr.bj.bcebos.com/igtr/rec_svtr_igtr_train.tar && tar xf rec_svtr_igtr_train.tar
# GPU evaluation
python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec_svtrnet_igtr.yml -o Global.pretrained_model=./rec_svtr_igtr_train/best_model
```
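
The config reports `acc` as the main indicator of `RecMetric`. As a rough illustration of what a word-level accuracy measures, the sketch below counts a prediction as correct only when the whole string matches its label. The function `word_accuracy` and its `case_sensitive` flag are hypothetical; any text normalization that `RecMetric` itself applies is not reproduced here.

```python
def word_accuracy(predictions, targets, case_sensitive=False):
    """Fraction of samples whose predicted string exactly matches the label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = 0
    for pred, gt in zip(predictions, targets):
        if not case_sensitive:
            pred, gt = pred.lower(), gt.lower()
        correct += pred == gt  # True counts as 1, False as 0
    return correct / len(targets)


if __name__ == "__main__":
    preds = ["Hello", "word", "paddle"]
    labels = ["hello", "world", "paddle"]
    print(f"acc = {word_accuracy(preds, labels):.4f}")
```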

### Prediction

```bash linenums="1"
python3 tools/infer_rec.py -c configs/rec/rec_svtrnet_igtr.yml -o Global.infer_img='./doc/imgs_words/word_10.png' Global.pretrained_model=./rec_svtr_igtr_train/best_model
```
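
The `-o` arguments in the commands above use dotted keys such as `Global.pretrained_model=...` to override entries of the YAML config. The sketch below illustrates the general idea of merging such overrides into a nested dictionary. The helpers `parse_value` and `apply_overrides` are hypothetical and are not PaddleOCR's own option parser.

```python
import ast


def parse_value(text):
    """Interpret numbers, booleans, and lists; fall back to the raw string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, overrides):
    """Apply 'A.b.c=value' style overrides to a nested dict in place."""
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections when they are missing.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config


if __name__ == "__main__":
    cfg = {"Global": {"epoch_num": 20, "pretrained_model": None}}
    apply_overrides(cfg, [
        "Global.pretrained_model=./rec_svtr_igtr_train/best_model",
        "Global.infer_img=./doc/imgs_words/word_10.png",
    ])
    print(cfg)
```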

## 4. Inference and Deployment

### 4.1 Python Inference

Coming soon.

### 4.2 C++ Inference

Not supported

### 4.3 Serving

Not supported

### 4.4 More

Not supported

## Citation

```bibtex
@article{Du2025IGTR,
title = {Instruction-Guided Scene Text Recognition},
author = {Du, Yongkun and Chen, Zhineng and Su, Yuchen and Jia, Caiyan and Jiang, Yu-Gang},
journal = {IEEE Trans. Pattern Anal. Mach. Intell.},
year = {2025},
url = {https://arxiv.org/abs/2401.17851}
}
```