Can the PaddlePaddle cloud machines be used for distributed training?
Please provide the complete information below so the problem can be located quickly
System environment: PaddlePaddle cloud BML Codelab
Version: Paddle: 2.2.2 PaddleOCR:
Command: python3 -m paddle.distributed.launch --log_dir=./log/ --ips="10.32.116.210,10.32.167.53" --gpus="0,1" tools/train.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml
Full error output:
INFO 2024-03-08 14:53:46,608 launch_utils.py:532] details abouts PADDLE_TRAINER_ENDPOINTS can be found in ./log//endpoints.log, and detail running logs maybe found in ./log//workerlog.0
launch proc_id:2480 idx:0
launch proc_id:2485 idx:1
I0308 14:53:49.568403 2480 gen_comm_id_helper.cc:190] Server listening on: 10.32.167.53:6070 successful.
INFO 2024-03-08 14:53:52,677 launch_utils.py:320] terminate process group gid:2480
INFO 2024-03-08 14:53:56,680 launch_utils.py:341] terminate all the procs
ERROR 2024-03-08 14:53:56,680 launch_utils.py:604] ABORT!!! Out of all 4 trainers, the trainer process with rank=[3] was aborted. Please check its log.
INFO 2024-03-08 14:54:00,681 launch_utils.py:341] terminate all the procs
INFO 2024-03-08 14:54:00,682 launch.py:311] Local processes completed.
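To help read the ABORT line above, here is a minimal sketch of which node and GPU each trainer rank might correspond to. It assumes, and the log does not confirm this, that the launcher numbers trainers in `--ips` order and then in `--gpus` order. `rank_layout` is a hypothetical helper written for illustration; it is not part of Paddle.

```python
from itertools import product


def rank_layout(ips, gpus):
    """Map global trainer rank -> (node ip, gpu id).

    Assumption: ranks are assigned node by node in --ips order,
    and within a node in --gpus order.
    """
    return dict(enumerate(product(ips, gpus)))


# Values copied from the launch command in the post.
ips = "10.32.116.210,10.32.167.53".split(",")
gpus = "0,1".split(",")

layout = rank_layout(ips, gpus)
for rank, (ip, gpu) in layout.items():
    print(f"rank {rank}: node {ip}, gpu {gpu}")
```

Under that assumption, 2 nodes with 2 GPUs each give the 4 trainers reported in the log, and rank 3 would be GPU 1 on 10.32.167.53, the same address that appears in the "Server listening on" line. If so, the log for the second local process (possibly ./log/workerlog.1) may be the one worth checking first.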