Error when deploying multi-GPU with Docker #12287
Unanswered
q465414859 asked this question in Q&A
Replies: 3 comments
-
NCCL must be installed correctly for multi-GPU communication to work.
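One quick way to test this suggestion inside the container is to ask the dynamic loader whether an NCCL shared library is visible. The following is a minimal sketch, not an official PaddlePaddle check: the helper name `find_nccl` is hypothetical, and it only looks at the default library search path, so an NCCL copy bundled elsewhere (for example inside a Python package) can still be missed.

```python
import ctypes
import ctypes.util
from typing import Optional


def find_nccl() -> Optional[str]:
    """Return the name of an NCCL shared library, or None if none is found."""
    # find_library consults the system linker configuration (ldconfig on Linux).
    name = ctypes.util.find_library("nccl")
    if name is not None:
        return name
    # Fall back to the conventional NCCL 2 soname; a successful load
    # confirms the library is present and loadable.
    try:
        ctypes.CDLL("libnccl.so.2")
        return "libnccl.so.2"
    except OSError:
        return None


if __name__ == "__main__":
    lib = find_nccl()
    if lib:
        print(f"NCCL library found: {lib}")
    else:
        print("No NCCL library found on the default search path")
```

A `None` result is only a hint that NCCL may be missing; a non-`None` result does not prove that multi-GPU communication works.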
-
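I already had it installed when I was not using Docker and hit the same problem. I left a comment on GitHub and was told to use Docker, but the error is still there with Docker.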
-
We recommend using the official PaddlePaddle Docker image: https://www.paddlepaddle.org.cn/
-
Please provide the following information to quickly locate the problem
root@ecs-1cf4:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9e2548b1330a registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 "/opt/nvidia/nvidia_…" 35 hours ago Up 2 seconds 22/tcp paddle
root@ecs-1cf4:~# docker exec -it 9e2548b1330a /bin/bahs
OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bahs": stat /bin/bahs: no such file or directory: unknown
root@ecs-1cf4:~# docker exec -it 9e2548b1330a /bin/bash
λ 9e2548b1330a /home ls
ccache-4.8.2/ cmake-3.18.0-Linux-x86_64/ log/ test.py
λ 9e2548b1330a /home python test.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0415 01:15:44.325572 50 program_interpreter.cc:212] New Executor is Running.
W0415 01:15:44.326015 50 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0415 01:15:44.327044 50 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0415 01:15:46.750312 50 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
grep: grep: warning: GREP_OPTIONS is deprecated; please use an alias or scriptwarning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
Running verify PaddlePaddle program ...
I0415 01:15:50.691897 125 program_interpreter.cc:212] New Executor is Running.
W0415 01:15:50.692445 125 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0415 01:15:50.693403 125 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0415 01:15:50.697796 126 program_interpreter.cc:212] New Executor is Running.
W0415 01:15:50.698231 126 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0415 01:15:50.699101 126 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0415 01:15:53.014739 126 interpreter_util.cc:624] Standalone Executor is Used.
I0415 01:15:53.020294 125 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-15 01:15:53,026] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-15 01:15:53,026] [ WARNING] install_check.py:297 -
Original Error is:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/test.py", line 2, in <module>
paddle.utils.run_check()
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 604, in spawn
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
PaddlePaddle works well on 1 GPU.
[2024-04-15 01:15:53,031] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-15 01:15:53,031] [ WARNING] install_check.py:297 -
Original Error is:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/test.py", line 2, in <module>
paddle.utils.run_check()
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 604, in spawn
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
[2024-04-15 01:15:53,654] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-15 01:15:53,654] [ WARNING] install_check.py:297 -
Original Error is: Process 0 terminated with exit code 1.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "/home/test.py", line 2, in <module>
paddle.utils.run_check()
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 614, in spawn
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 423, in join
self._throw_exception(error_index)
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 435, in _throw_exception
raise Exception(
Exception: Process 0 terminated with exit code 1.
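The "Original Error" in the log above points at Python's multiprocessing bootstrap rather than at NCCL. The traceback shows each spawned child re-running /home/test.py through `runpy.run_path`, reaching `paddle.utils.run_check()` at module level again, and then failing in `_check_not_importing_main` when it tries to spawn further processes. A likely fix, assuming test.py contains only `import paddle` followed by the check call (the file itself is not shown in the thread), is to move the call under a main guard. The `main` function name and the import check below are illustrative additions, not part of the original script.

```python
# test.py: the same installation check, with the call under a main guard.
import importlib.util


def main() -> bool:
    # Skip gracefully where PaddlePaddle is absent, so this sketch runs anywhere.
    if importlib.util.find_spec("paddle") is None:
        print("paddle is not installed; nothing to check")
        return False
    import paddle

    # run_check() starts worker processes for the multi-GPU test.
    paddle.utils.run_check()
    return True


# Spawned children re-import this file under a different module name,
# so this block is skipped in them and the check is not started again.
if __name__ == "__main__":
    main()
```

With the guard in place, only the parent process calls `run_check()`. If the multi-GPU check still fails afterwards, the GPU visibility and NCCL causes listed in the warning become the next things to look at.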