Training on a dataset with ch_PP-OCRv4_rec fails with: Out of memory error on GPU 0. Cannot allocate 129.394531MB memory on GPU 0, 23.611938GB memory has been allocated and available memory is only 31.687500MB. #12284
Replies: 8 comments
-
Were any other tasks running on the GPU before you started?
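One way to answer this before launching training is to ask `nvidia-smi` which compute processes currently hold memory on the card. The Python sketch below is only an illustration: `parse_compute_apps` and `list_gpu_processes` are hypothetical helper names, and it assumes `nvidia-smi` is on the PATH and accepts the `--query-compute-apps` query.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' rows (no header, no units) into tuples.

    used_memory is in MiB; it becomes None when the driver reports a
    non-numeric value such as '[N/A]'.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid, used = parts
        rows.append((int(pid), int(used) if used.isdigit() else None))
    return rows


def list_gpu_processes():
    """Return (pid, used_mib) tuples, or None if nvidia-smi cannot be run."""
    cmd = [
        "nvidia-smi",
        "--query-compute-apps=pid,used_memory",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_compute_apps(result.stdout)


processes = list_gpu_processes()
if processes is None:
    print("nvidia-smi is not available on this machine")
elif not processes:
    print("no compute processes are holding GPU memory")
else:
    for pid, used_mib in processes:
        print(f"pid {pid} holds {used_mib} MiB")
```

An empty list before training starts would suggest the memory is being taken by the training job itself rather than by a leftover process.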
-
No other tasks. I have run it many times and the result is always the same.
-
Try Paddle version 2.5.2.
-
Hi, did you manage to solve this? I have run into the same problem, and I have two 24G cards.
-
With version 2.8.1 and ch_PP-OCRv4_det_teacher.yml the same problem occurs, and repeated attempts all fail the same way. The environment is the GPU T4 x2 provided by Kaggle.
-
I get this problem on a 3090 as well.
-
I get this problem on a 1660 as well.
-
Has anyone solved this? I ran into the same problem today.
-
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 129.394531MB memory on GPU 0, 23.611938GB memory has been allocated and available memory is only 31.687500MB.
Please check whether there is any other process using GPU 0.
If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is
export FLAGS_use_cuda_managed_memory=false
.(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)
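The three figures in the message can be put on one scale to see how little headroom was left. The sketch below is a hypothetical helper (`parse_paddle_oom` is not part of Paddle) and assumes the message keeps exactly the wording shown above; it only converts units on the reported numbers.

```python
import re

# Conversion factors to MB for the two units that appear in the message.
_UNIT_TO_MB = {"MB": 1.0, "GB": 1024.0}

_PATTERN = re.compile(
    r"Cannot allocate (?P<req>[\d.]+)(?P<req_u>MB|GB) memory on GPU (?P<gpu>\d+), "
    r"(?P<alloc>[\d.]+)(?P<alloc_u>MB|GB) memory has been allocated and "
    r"available memory is only (?P<avail>[\d.]+)(?P<avail_u>MB|GB)"
)


def parse_paddle_oom(message):
    """Extract requested, allocated and available memory (in MB), or None."""
    m = _PATTERN.search(message)
    if m is None:
        return None

    def to_mb(value_key, unit_key):
        return float(m.group(value_key)) * _UNIT_TO_MB[m.group(unit_key)]

    return {
        "gpu": int(m.group("gpu")),
        "requested_mb": to_mb("req", "req_u"),
        "allocated_mb": to_mb("alloc", "alloc_u"),
        "available_mb": to_mb("avail", "avail_u"),
    }


message = (
    "Out of memory error on GPU 0. Cannot allocate 129.394531MB memory on GPU 0, "
    "23.611938GB memory has been allocated and available memory is only 31.687500MB."
)
info = parse_paddle_oom(message)
shortfall = info["requested_mb"] - info["available_mb"]
print(f"GPU {info['gpu']}: allocated {info['allocated_mb']:.1f} MB, "
      f"available {info['available_mb']:.1f} MB, "
      f"request exceeds available memory by {shortfall:.1f} MB")
```

In this report the allocated figure is almost the entire 24G card, so the failing 129 MB request is simply the first allocation that no longer fits.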
My ch_PP-OCRv4_rec.yml settings:
Global:
  debug: false
  use_gpu: true
  epoch_num: 20
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec_ppocr_v4
  save_epoch_step: 3
  eval_batch_step: [0, 100]
  cal_metric_during_train: true
  pretrained_model: ./pretrained_models/ch_PP-OCRv4_rec_train/student
  checkpoints:
  save_inference_dir:
  use_visualdl: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: ppocr/utils/ppocr_keys_v1.txt
  max_text_length: &max_text_length 25
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./output/rec/predicts_ppocrv3.txt
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.0001
    warmup_epoch: 2
  regularizer:
    name: L2
    factor: 3.0e-05
Architecture:
  model_type: rec
  algorithm: SVTR_LCNet
  Transform:
  Backbone:
    name: PPLCNetV3
    scale: 0.95
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 120
            depth: 2
            hidden_dims: 120
            kernel_size: [1, 3]
            use_guide: True
          Head:
            fc_decay: 0.00001
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: *max_text_length
Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - NRTRLoss:
PostProcess:
  name: CTCLabelDecode
Metric:
  name: RecMetric
  main_indicator: acc
Train:
  dataset:
    name: MultiScaleDataSet
    ds_width: false
    data_dir: ./train_data/train
    ext_op_transform_idx: 1
    label_file_list:
      - ./train_data/rec/train.txt
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - RecConAug:
          prob: 0.5
          ext_data_num: 2
          image_shape: [ 48, 320, 3 ]
  sampler:
    name: MultiScaleSampler
    scales: [[320, 32], [320, 48], [320, 64]]
    first_bs: &bs 192
    fix_bs: false
    divided_factor: [8, 16] # w, h
    is_training: True
  loader:
    shuffle: true
    batch_size_per_card: 2
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/val
    label_file_list:
      - ./train_data/rec/val.txt
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - RecResizeImg:
          image_shape: [3, 48, 320]
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 1
    num_workers: 4
Why does my 24G of GPU memory fill up immediately? I cannot run anything at all.
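One setting that may be worth checking in the config above: it has `first_bs: 192` under `sampler` next to `batch_size_per_card: 2` under `loader`. As far as I understand PaddleOCR's `MultiScaleSampler` (this is an assumption, not something confirmed in this thread), the training batch size comes from `first_bs`, and with `fix_bs: false` it is rescaled per scale so that width * height * batch size stays roughly constant; the loader value would then not be what limits memory. The sketch below re-implements that arithmetic in plain Python to show the batch sizes this config would imply under that assumption. `multiscale_batch_sizes` is a hypothetical name, not a PaddleOCR function.

```python
def multiscale_batch_sizes(scales, first_bs, fix_bs=False, divided_factor=(8, 16)):
    """Return (width, height, batch_size) per scale.

    Assumed behaviour: sizes are rounded down to multiples of divided_factor
    (w, h), and the batch size is scaled so that w * h * batch_size matches
    the first scale unless fix_bs is set.
    """
    w_div, h_div = divided_factor
    widths = [(w // w_div) * w_div for w, _ in scales]
    heights = [(h // h_div) * h_div for _, h in scales]

    base_w, base_h = scales[0]
    base_elements = base_w * base_h * first_bs

    result = []
    for w, h in zip(widths, heights):
        bs = first_bs if fix_bs else max(1, int(base_elements / (h * w)))
        result.append((w, h, bs))
    return result


# Values copied from the sampler section of the config in this post.
pairs = multiscale_batch_sizes([[320, 32], [320, 48], [320, 64]], first_bs=192)

for w, h, bs in pairs:
    # float32 input tensor only; activations and gradients are far larger
    # and are not estimated here.
    input_mib = bs * 3 * h * w * 4 / 2**20
    print(f"scale {w}x{h}: batch size {bs}, input tensor {input_mib:.1f} MiB")
```

If that reading is right, `first_bs` rather than `batch_size_per_card` would be the value to lower when testing; whether that resolves the error reported here is not established in the thread.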