
Memory usage keeps growing without limit when training PP-OCRv5 rec model with PaddleOCR v3.2.0 on PaddlePaddle 3.1.0/3.2.0 #16613

Description

@qyhou

🔎 Search before asking

  • I have searched the PaddleOCR Docs and found no similar bug report.
  • I have searched the PaddleOCR Issues and found no similar bug report.
  • I have searched the PaddleOCR Discussions and found no similar bug report.

๐Ÿ› Bug (้—ฎ้ข˜ๆ่ฟฐ)

When training the PP-OCRv5 rec model with PaddleOCR v3.2.0 inside Docker, I noticed that host memory (RAM) usage keeps increasing without limit on the following PaddlePaddle versions:

  • PaddlePaddle 3.1.0 (paddlepaddle/paddle:3.1.0-gpu-cuda12.9-cudnn9.9, paddlepaddle/paddle:3.1.0-gpu-cuda12.6-cudnn9.5)
  • PaddlePaddle 3.2.0 (paddlepaddle/paddle:3.2.0-gpu-cuda12.9-cudnn9.9, paddlepaddle/paddle:3.2.0-gpu-cuda12.6-cudnn9.5)

The same training setup works normally (no memory leak) on PaddlePaddle 3.0.0 (paddlepaddle/paddle:3.0.0-gpu-cuda12.6-cudnn9.5-trt10.5).
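
For what it's worth, the growth is easy to quantify from inside the training process. Below is a minimal sketch, assuming psutil is installed in the container (pip install psutil); the helper name and logging interval are illustrative, not part of PaddleOCR:

```python
# Minimal sketch for quantifying the leak, assuming psutil is available.
# log_rss and the logging interval are illustrative, not PaddleOCR APIs.
import os

import psutil

_proc = psutil.Process(os.getpid())

def log_rss(step, every=1000):
    # Print this worker's resident set size (RSS) in GB every `every` steps.
    if step % every == 0:
        rss_gb = _proc.memory_info().rss / 1024**3
        print(f"[step {step}] RSS: {rss_gb:.2f} GB")
```

Calling something like this from the training loop should show RSS climbing steadily across epochs on an affected version, instead of plateauing as reported on 3.0.0.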

๐Ÿƒโ€โ™‚๏ธ Environment (่ฟ่กŒ็Žฏๅขƒ)

PaddleOCR v3.2.0 release
PaddlePaddle 3.1.0/3.2.0 docker images
OS: Ubuntu 24.04.3 LTS
CPU: Intel(R) Xeon(R) Platinum 8469C
GPU: H20 (96 GB)
Memory: 1.0 TB

🌰 Minimal Reproducible Example

python -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c xxx.yml

Training data: ~10M samples (3×48×320)
Memory: eventually occupies the full 1 TB of RAM after 15 epochs
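
To help narrow down where the leak lives, one diagnostic (a sketch under assumptions, not code from this repo) is to iterate the data pipeline alone, with no model, and watch whether RSS still grows. The synthetic dataset below is a placeholder that just mirrors the 3×48×320 sample shape from this report:

```python
# Hypothetical isolation test: run only the data pipeline (no model, no
# optimizer) and watch RSS. If memory still grows here, the leak is likely
# in the DataLoader path; if it stays flat, suspect the training step.
import os

import numpy as np
import psutil
from paddle.io import DataLoader, Dataset


class SyntheticRecDataset(Dataset):
    # Placeholder stand-in for the real ~10M-sample rec dataset.
    def __init__(self, num_samples=200_000):
        super().__init__()
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        img = np.random.rand(3, 48, 320).astype("float32")  # 3x48x320, as in the report
        label = np.zeros([25], dtype="int64")  # dummy label
        return img, label


proc = psutil.Process(os.getpid())
loader = DataLoader(SyntheticRecDataset(), batch_size=128, num_workers=4)

for step, (img, label) in enumerate(loader):
    if step % 100 == 0:
        print(f"step {step}: RSS {proc.memory_info().rss / 1024**3:.2f} GB")
```

If RSS stays flat in this loop but grows during real training, the training step itself, rather than the data readers, would be the next thing to bisect between 3.0.0 and 3.1.0.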
