Loss becomes NaN when training on a self-built dataset
#14985
Replies: 1 comment
A loss of NaN is generally caused by one of a few classes of problems. For the NaN you are seeing while training PytorchOCR v4 on your self-built dataset, and given your configuration file and description, here are detailed troubleshooting suggestions:
1. Problems with the data itself
2. Model / configuration problems
3. Training hyperparameter problems
4. Debugging suggestions
5. Suggested troubleshooting steps, to try in order
6. Additional suggestions: if you suspect NRTRHead is the cause, you can temporarily remove that head and train with only the CTC branch, to confirm whether the NaN comes from the NRTRLoss computation. To check whether your character set matches the dictionary, you can refer to the ppocr/utils/char_dict_filter.py tool to automatically filter out characters that are not in the dictionary (a standalone check is sketched below). If the problem persists, open a GitHub Discussion or Issue with data samples and your loading logic. Good luck, and feel free to follow up with further error details.

Response generated by 🤖 feifei-bot | chatgpt-4o-latest
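As a concrete way to run the character-set check suggested above, here is a minimal standalone sketch that does not rely on the char_dict_filter.py tool mentioned in the reply. It assumes the usual PaddleOCR/PytorchOCR label-file format of one "image_path<TAB>label" entry per line; the dictionary path, label file path, max_text_length and use_space_char values are copied from the config further down and may need adjusting for your setup.

# Standalone sanity check for a PPOCR-style label file: characters outside the
# dictionary, empty labels, and labels longer than max_text_length.
DICT_PATH = "torchocr/utils/ppocr_keys_v1.txt"                    # character_dict_path from the config
LABEL_PATH = "/root/lanyun-tmp/PytorchOCR/datasets/datasets.txt"  # label_file_list entry from the config
MAX_TEXT_LENGTH = 54                                              # max_text_length from the config
USE_SPACE_CHAR = True                                             # use_space_char from the config

with open(DICT_PATH, encoding="utf-8") as f:
    charset = {line.rstrip("\n") for line in f}
if USE_SPACE_CHAR:
    charset.add(" ")

with open(LABEL_PATH, encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            print(f"line {lineno}: malformed entry: {line!r}")
            continue
        _, label = parts
        if not label:
            print(f"line {lineno}: empty label")
        if len(label) > MAX_TEXT_LENGTH:
            print(f"line {lineno}: label length {len(label)} exceeds max_text_length {MAX_TEXT_LENGTH}")
        bad = sorted({c for c in label if c not in charset})
        if bad:
            print(f"line {lineno}: characters not in the dict: {''.join(bad)}")

Any line flagged here is a likely NaN candidate, especially empty labels and out-of-dictionary characters fed to the CTC/NRTR losses.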
Hi, I have run into a problem while training with the PyTorch port of PP-OCRv4 (https://github.com/WenmuZhou/PytorchOCR). I am using a custom dataset, built by compositing single-character images into text lines. During training the model's loss becomes NaN. Before this, I trained on a large amount of data synthesized from TTF font files and never saw this problem.
The images below are samples from my synthesized dataset.
My configuration file is:
Global:
  device: gpu
  epoch_num: 200
  log_smooth_window: 20
  print_batch_step: 10
  output_dir: ./ch_PP-OCRv4_rec_train/student.pth
  eval_epoch_step: [0, 1]
  cal_metric_during_train: true
  pretrained_model:
  checkpoints:
  use_tensorboard: true
  infer_mode: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: &character_dict_path torchocr/utils/ppocr_keys_v1.txt
  max_text_length: &max_text_length 54
  use_space_char: &use_space_char true

Export:
  export_dir:
  export_shape: [ 1, 3, 48, 1000 ]
  dynamic_axes: [ 0, 2, 3 ]

Optimizer:
  name: Adam
  lr: 0.001
  weight_decay: 3.0e-05

LRScheduler:
  name: CosineAnnealingLR
  warmup_epoch: 5

Architecture:
  model_type: rec
  algorithm: SVTR_HGNet
  Transform:
  Backbone:
    name: PPHGNet_small
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 120
            depth: 2
            hidden_dims: 120
            kernel_size: [1, 3]
            use_guide: True
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: *max_text_length

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - NRTRLoss:

PostProcess:
  name: CTCLabelDecode
  character_dict_path: *character_dict_path
  use_space_char: *use_space_char

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: MultiScaleDataSet
    ds_width: false
    data_dir: ./PytorchOCR/datasets/val.txt
    ext_op_transform_idx: 1
    label_file_list:
      - /root/lanyun-tmp/PytorchOCR/datasets/datasets.txt
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - RecConAug:
          prob: 0.5
          ext_data_num: 2
          image_shape: [48, 1000, 3]
          max_text_length: *max_text_length
      - RecAug:
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            - valid_ratio
  sampler:
    name: MultiScaleSampler
    scales: [[1000, 32], [1000, 48], [1000, 64]]
    first_bs: &bs 89
    fix_bs: false
    divided_factor: [8, 16] # w, h
    is_training: True
  loader:
    shuffle: true
    batch_size_per_card: *bs
    drop_last: true
    num_workers: 8

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /root/lanyun-tmp/PytorchOCR/datasets/
    label_file_list:
      - ./PytorchOCR/datasets/val.txt
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - RecResizeImg:
          image_shape: [3, 48, 1400]
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 100
    num_workers: 4
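If the label files pass the dictionary and length checks, one way to find which batch actually triggers the problem is a NaN guard around the training step. The sketch below is written against a plain PyTorch loop rather than PytorchOCR's actual trainer: model, loss_fn, optimizer and train_loader are placeholders for whatever your setup provides, the loss callable is assumed to return a scalar tensor, and the batch is assumed to be a dict carrying the fields kept by KeepKeys.

import torch


def find_first_nan_batch(model, loss_fn, optimizer, train_loader, max_norm=5.0):
    # Sketch of a NaN guard around a generic training step; not PytorchOCR's trainer.
    torch.autograd.set_detect_anomaly(True)  # slower, but reports the op that produced NaN/Inf
    for step, batch in enumerate(train_loader):
        preds = model(batch["image"])          # assumes a dict-style batch with the KeepKeys fields
        loss = loss_fn(preds, batch)           # assumes the loss callable returns a scalar tensor
        if not torch.isfinite(loss):
            # Stop on the first bad batch so the offending synthesized lines can be inspected.
            print(f"non-finite loss at step {step}: {loss.item()}")
            print("label_ctc in this batch:", batch.get("label_ctc"))
            return batch
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # optional safeguard against exploding gradients
        optimizer.step()
    return None

Returning the offending batch lets you save or display the exact synthesized images and labels that produced the NaN, which usually narrows the problem down to a handful of lines in the label file.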