Fine-tuning ch_PP-OCRv4_det_server_train: out of memory when evaluating the model during training #16428

ly03240921 · 2024-08-27T12:12:23Z

🔎 Search before asking

I have searched the PaddleOCR Docs and found no similar bug report.
I have searched the PaddleOCR Issues and found no similar bug report.
I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Problem description)

[2024/08/27 19:14:23] ppocr INFO: epoch: [5/500], global_step: 10, lr: 0.001000, loss: 2.168079, loss_shrink_maps: 1.022120, loss_threshold_maps: 0.760488, loss_binary_maps: 0.204714, loss_cbn: 0.204714, avg_reader_cost: 0.03694 s, avg_batch_cost: 0.04500 s, avg_samples: 0.12, ips: 2.66682 samples/s, eta: 0:41:51, max_mem_reserved: 13909 MB, max_mem_allocated: 11894 MB
eval model::   0%|                                                                                                                                          | 0/4 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 257, in <module>
    main(config, device, logger, vdl_writer, seed)
  File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 209, in main
    program.train(
  File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 452, in train
    cur_metric = eval(
  File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 622, in eval
    preds = model(images)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/architectures/base_model.py", line 99, in forward
    x = self.head(x, targets=data)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 145, in forward
    cbn_maps = self.cbn_layer(self.up_conv(f), shrink_maps, None)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 127, in forward
    out = self.last_1(self.last_3(outf))
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/backbones/det_mobilenet_v3.py", line 186, in forward
    x = self.conv(x)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/conv.py", line 715, in forward
    out = F.conv._conv_nd(
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/functional/conv.py", line 128, in _conv_nd
    pre_bias = _C_ops.conv2d(
MemoryError: 

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::eager_api_conv2d(_object*, _object*, _object*)
1   conv2d_ad_func(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, std::string, std::vector<int, std::allocator<int> >, int, std::string)
2   paddle::experimental::conv2d(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::string const&, std::vector<int, std::allocator<int> > const&, int, std::string const&)
3   void phi::ConvCudnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::string const&, std::vector<int, std::allocator<int> > const&, int, std::string const&, phi::DenseTensor*)
4   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
6   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7   paddle::memory::allocation::Allocator::Allocate(unsigned long)
8   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::Allocator::Allocate(unsigned long)
12  paddle::memory::allocation::Allocator::Allocate(unsigned long)
13  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
14  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
15  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 1. Cannot allocate 3.158203GB memory on GPU 1, 13.315369GB memory has been allocated and available memory is only 2.386902GB.

Please check whether there is any other process using GPU 1.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 
 (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:86)
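
For scale, the figures in the error summary can be checked with a few lines of arithmetic. This sketch uses only the three numbers from the message. Two things are assumptions: that the failed allocation was a float32 tensor (suggested by the `ConvCudnnKernel<float, ...>` frame in the C++ traceback), and that the allocator reports GB as 2**30 bytes.

```python
# Figures copied from the error summary above.
requested_gb = 3.158203   # size of the allocation that failed
allocated_gb = 13.315369  # memory already allocated on GPU 1
available_gb = 2.386902   # memory still available on GPU 1

# How far short the GPU was for this single allocation.
shortfall_gb = requested_gb - available_gb

# Assumption: 1 GB = 2**30 bytes and 4 bytes per float32 element.
BYTES_PER_GB = 2**30
elements = int(requested_gb * BYTES_PER_GB / 4)

print(f"shortfall: {shortfall_gb:.3f} GB")
print(f"float32 elements in the failed allocation: {elements:,}")
```

If that reading is right, a single convolution output of this size at eval batch size 1 points toward a very large input image rather than a large batch.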

🏃‍♂️ Environment (Runtime environment)

PaddlePaddle-gpu: 2.6; PaddleOCR: 2.8; RAM: 16G

🌰 Minimal Reproducible Example (minimal demo that reproduces the problem)

python tools/train.py -c configs/det/ch_PP-OCRv4/ch_PP-OCRv4_det_teacher.yml

alanxinn · 2024-08-28T00:08:06Z

There isn't enough GPU memory; reduce the batch size.

ly03240921 · 2024-08-28T00:48:22Z

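> There isn't enough GPU memory; reduce the batch size.

The training batch size is 8, and training runs without problems. The eval batch_size is 1, but evaluation fails. Evaluation runs every 1000 steps during training, and that is when it fails with "out of GPU memory". The first 1000 training steps all run normally.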
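
One way to see why a smaller batch can still need more memory: in a convolutional detector, activation memory grows roughly with the number of input pixels per forward pass, not with the batch size alone. The sketch below compares the two cases. The 640x640 training crop and the 4000x3000 evaluation image are illustrative assumptions, not values reported in this thread.

```python
def pixels_per_batch(batch_size: int, height: int, width: int) -> int:
    """Number of input pixels the network sees in one forward pass."""
    return batch_size * height * width

# Assumption: training images are cropped to a fixed 640x640 patch.
train_pixels = pixels_per_batch(batch_size=8, height=640, width=640)

# Assumption: one evaluation image is fed at its full 4000x3000 resolution.
eval_pixels = pixels_per_batch(batch_size=1, height=3000, width=4000)

ratio = eval_pixels / train_pixels
print(f"train: {train_pixels:,} px, eval: {eval_pixels:,} px, ratio: {ratio:.2f}x")
```

Training also keeps activations for the backward pass, so the comparison is not exact, but it shows that a batch size of 1 does not by itself guarantee a small memory footprint.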

alanxinn · 2024-08-28T08:26:35Z

Have you tried changing the step interval between evaluations? Make it smaller.

ly03240921 · 2024-08-28T09:17:49Z

> Have you tried changing the step interval between evaluations? Make it smaller.

I've already changed it to 10 and it still doesn't help; it now evaluates every 10 steps.

alanxinn · 2024-08-28T09:22:42Z

Check whether it is host RAM or GPU memory that ran out, and try setting the batch size to 4. I don't know whether that will help, though; I haven't run into this problem before.
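
To tell host RAM from GPU memory, GPU usage can be polled while evaluation runs. The sketch below is one way to do that, assuming `nvidia-smi` is on the PATH. The helper names are made up for illustration, and the sample text is invented so the parsing part runs without a GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text: str) -> dict[int, tuple[int, int]]:
    """Map GPU index -> (used MiB, total MiB) from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage

def read_gpu_memory() -> dict[int, tuple[int, int]]:
    """Run nvidia-smi once and return current memory use per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)

# Parse a made-up sample so the sketch runs without a GPU.
sample = "0, 1024, 16384\n1, 13909, 16384\n"
print(parse_gpu_memory(sample))
```

Calling `read_gpu_memory()` in a loop from a second terminal while training reaches the eval step would show whether GPU memory climbs right before the crash.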

ly03240921 · 2024-08-29T03:31:27Z

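> Check whether it is host RAM or GPU memory that ran out, and try setting the batch size to 4. I don't know whether that will help, though; I haven't run into this problem before.

It is GPU memory that ran out. Changing the training batch size didn't help either. After training finished, running detection on images with tools/infer_det.py also reported out of GPU memory, which I really don't understand...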
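
Because in-training evaluation and `tools/infer_det.py` both fail while training itself runs, the image size used at test time is worth checking. The sketch below is a simplified stand-in for a limit-side resize step, not PaddleOCR's actual code. The function name, the rounding to multiples of 32, and the example sizes are all assumptions made for illustration. It shows how a `max` limit caps a large image while a `min` limit leaves it at full size.

```python
def limit_side_resize(height: int, width: int, limit_side_len: int,
                      limit_type: str) -> tuple[int, int]:
    """Return (height, width) after scaling so one side meets the limit.

    limit_type="max": shrink if the longer side exceeds the limit.
    limit_type="min": enlarge if the shorter side is below the limit.
    """
    if limit_type == "max":
        longer = max(height, width)
        ratio = limit_side_len / longer if longer > limit_side_len else 1.0
    elif limit_type == "min":
        shorter = min(height, width)
        ratio = limit_side_len / shorter if shorter < limit_side_len else 1.0
    else:
        raise ValueError(f"unknown limit_type: {limit_type}")

    # Assumption: sizes are rounded to a multiple of 32, at least 32.
    new_h = max(int(round(height * ratio / 32) * 32), 32)
    new_w = max(int(round(width * ratio / 32) * 32), 32)
    return new_h, new_w

# Hypothetical 3000x4000 evaluation image.
print(limit_side_resize(3000, 4000, 960, "max"))
print(limit_side_resize(3000, 4000, 736, "min"))
```

With these example numbers the `max` setting feeds the network roughly 0.68 million pixels, while the `min` setting feeds it about 12 million.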

alanxinn · 2024-08-29T09:18:05Z

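> > Check whether it is host RAM or GPU memory that ran out, and try setting the batch size to 4. I don't know whether that will help, though; I haven't run into this problem before.
>
> It is GPU memory that ran out. Changing the training batch size didn't help either. After training finished, running detection on images with tools/infer_det.py also reported out of GPU memory, which I really don't understand...

Paddle sometimes has odd bugs. Maybe try reinstalling the training environment and see (doge).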

nissansz · 2024-10-26T00:21:35Z

Which version of paddlepaddle should be installed? I always set it to 1 and still run out of GPU memory.

759325100 · 2024-12-03T08:26:56Z

I ran into the same problem and don't know how to solve it.

kerry-weic · 2024-12-19T02:10:35Z

I have a similar problem.

actorUser · 2025-01-22T08:59:17Z

I ran into the same problem and don't know how to solve it. When recognizing images on CPU, memory keeps growing until it finally crashes. Please provide a way to free the memory.

DarrenZhangug · 2025-02-06T11:56:50Z

I ran into the same problem. The GPU is a 1050 Ti, which is a bit old, but with nearly 4 GB of memory it shouldn't fail to run. The thread count and batch_size are both set very small.
`
Out of memory error on GPU 0. Cannot allocate 160.000000MB memory on GPU 0, 3.861304GB memory has been allocated and available memory is only 142.025001MB.

Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU.
If no, please decrease the batch size of your model.
(at ..\paddle\fluid\memory\allocation\cuda_allocator.cc:86)
`

nissansz · 2025-02-06T12:48:32Z

What language are you training on? What training samples are you using?

Kyo1234567 · 2025-02-15T08:35:09Z

I also ran into an almost identical problem, which is still unresolved as well: #14633. It is the same model, and the OOM also happens during eval.

lyj201644070230 · 2025-03-06T03:04:44Z

Has this been solved? I ran into the same problem. I'm using a 4090 and changed the batch size to 2, but GPU memory still runs out during the evaluation stage.

ly03240921 · 2025-03-06T03:10:44Z

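> Has this been solved? I ran into the same problem. I'm using a 4090 and changed the batch size to 2, but GPU memory still runs out during the evaluation stage.

No. I converted it to an inference model and tested that instead... I run detection and recognition together with tools/infer/predict_system.py and look at the results.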
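
The workaround described here (export the checkpoint, then test through the inference scripts) can be sketched as two commands. The sketch only builds them as argument lists and runs nothing. Every path except the config file is a placeholder, and the `--det_limit_side_len` and `--det_limit_type` values are assumptions chosen for illustration, so check them against your PaddleOCR version.

```python
import shlex

def export_cmd(config: str, checkpoint: str, save_dir: str) -> list[str]:
    """Arguments for exporting a trained checkpoint to an inference model."""
    return [
        "python", "tools/export_model.py",
        "-c", config,
        "-o",
        f"Global.pretrained_model={checkpoint}",
        f"Global.save_inference_dir={save_dir}",
    ]

def predict_cmd(image_dir: str, det_dir: str, rec_dir: str,
                limit_side_len: int = 960) -> list[str]:
    """Arguments for running detection + recognition on a folder of images."""
    return [
        "python", "tools/infer/predict_system.py",
        f"--image_dir={image_dir}",
        f"--det_model_dir={det_dir}",
        f"--rec_model_dir={rec_dir}",
        # Assumption: cap the longer image side to bound memory use.
        f"--det_limit_side_len={limit_side_len}",
        "--det_limit_type=max",
    ]

# All paths except the config file are placeholders.
export = export_cmd(
    "configs/det/ch_PP-OCRv4/ch_PP-OCRv4_det_teacher.yml",
    "./output/det/best_accuracy",
    "./inference/det",
)
predict = predict_cmd("./imgs", "./inference/det", "./inference/rec")

print(shlex.join(export))
print(shlex.join(predict))
```

Printing the commands first makes it easy to review the paths before pasting them into a shell.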

github-actions[bot] · 2025-09-04T02:03:41Z

This issue is stale because it has been open for 90 days with no activity.
