Error when deploying on multiple GPUs with Docker #12287

q465414859 · 2024-04-15T01:20:26Z

Please provide the following information to quickly locate the problem

System Environment: Linux, Docker, CUDA 12.0
Version: Paddle: PaddleOCR:
Related components: nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6
Command Code: paddle.utils.run_check()
Complete Error Message:
root@ecs-1cf4:# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9e2548b1330a registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 "/opt/nvidia/nvidia_…" 35 hours ago Up 2 seconds 22/tcp paddle
root@ecs-1cf4:# docker exec -it 9e2548b1330a /bin/bahs
OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bahs": stat /bin/bahs: no such file or directory: unknown
root@ecs-1cf4:~# docker exec -it 9e2548b1330a /bin/bash
λ 9e2548b1330a /home ls
ccache-4.8.2/ cmake-3.18.0-Linux-x86_64/ log/ test.py
λ 9e2548b1330a /home python test.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0415 01:15:44.325572 50 program_interpreter.cc:212] New Executor is Running.
W0415 01:15:44.326015 50 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0415 01:15:44.327044 50 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0415 01:15:46.750312 50 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
grep: grep: warning: GREP_OPTIONS is deprecated; please use an alias or scriptwarning: GREP_OPTIONS is deprecated; please use an alias or script

Running verify PaddlePaddle program ...
Running verify PaddlePaddle program ...
I0415 01:15:50.691897 125 program_interpreter.cc:212] New Executor is Running.
W0415 01:15:50.692445 125 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0415 01:15:50.693403 125 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0415 01:15:50.697796 126 program_interpreter.cc:212] New Executor is Running.
W0415 01:15:50.698231 126 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0415 01:15:50.699101 126 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0415 01:15:53.014739 126 interpreter_util.cc:624] Standalone Executor is Used.
I0415 01:15:53.020294 125 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-15 01:15:53,026] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:

There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-15 01:15:53,026] [ WARNING] install_check.py:297 -
Original Error is:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
```
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
```

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/test.py", line 2, in <module>
paddle.utils.run_check()
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 604, in spawn
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

PaddlePaddle works well on 1 GPU.
[2024-04-15 01:15:53,031] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:

There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-15 01:15:53,031] [ WARNING] install_check.py:297 -
Original Error is:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
```
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
```

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/test.py", line 2, in <module>
paddle.utils.run_check()
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 604, in spawn
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

[2024-04-15 01:15:53,654] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:

There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-15 01:15:53,654] [ WARNING] install_check.py:297 -
Original Error is: Process 0 terminated with exit code 1.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "/home/test.py", line 2, in <module>
paddle.utils.run_check()
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 614, in spawn
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 423, in join
self._throw_exception(error_index)
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 435, in _throw_exception
raise Exception(
Exception: Process 0 terminated with exit code 1.
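
Note on the traceback above: the RuntimeError comes from Python's `multiprocessing` module rather than from NCCL itself. `paddle.distributed.spawn` starts its workers through `popen_spawn_posix.py`, so each worker re-imports `/home/test.py`, and line 2 of that file calls `paddle.utils.run_check()` at module top level again. Assuming `test.py` contains only `import paddle` and that call, moving the call under an `if __name__ == '__main__':` guard may remove this particular error; whether the multi-GPU check then passes is not something the log can confirm. The sketch below reproduces the same failure and the guarded form with the standard library only. The names `worker`, `run_script`, `UNGUARDED`, and `GUARDED` are hypothetical, and PaddlePaddle is not involved.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for the original test.py: the process is
# created at module top level, so a spawned child that re-imports
# the file tries to start a process again and fails.
UNGUARDED = textwrap.dedent("""
    import multiprocessing as mp

    def worker():
        pass

    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker)
    p.start()
    p.join()
    raise SystemExit(p.exitcode)
""")

# Same logic, but process creation only happens when the file is
# executed as the main script, not when a child re-imports it.
GUARDED = textwrap.dedent("""
    import multiprocessing as mp

    def worker():
        pass

    def main():
        ctx = mp.get_context("spawn")
        p = ctx.Process(target=worker)
        p.start()
        p.join()
        raise SystemExit(p.exitcode)

    if __name__ == "__main__":
        main()
""")


def run_script(source):
    """Write source to a temporary file, run it, return (returncode, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "check.py")
        with open(path, "w") as f:
            f.write(source)
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=120,
        )
        return result.returncode, result.stderr


if __name__ == "__main__":
    for label, source in (("unguarded", UNGUARDED), ("guarded", GUARDED)):
        code, err = run_script(source)
        print(label, "exit code:", code)
        print("bootstrapping error seen:", "bootstrapping phase" in err)
```

Applied to the script in this thread, the guarded form would place `paddle.utils.run_check()` inside a `main()` function that is called only under `if __name__ == "__main__":`.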

tink2123 (Collaborator) · 2024-04-16T08:36:00Z

NVIDIA-NCCL2 is not installed correctly on your system.

NCCL must be installed correctly before multi-GPU communication can work.

q465414859 (Author) · 2024-04-16T11:56:44Z

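NVIDIA-NCCL2 is not installed correctly on your system.

NCCL must be installed correctly before multi-GPU communication can work.

I had NCCL installed when I was not using Docker and hit the same problem. When I asked on GitHub, I was told to use Docker, and the problem persists with Docker as well.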

tink2123 (Collaborator) · 2024-04-21T15:30:57Z

We recommend using the official PaddlePaddle Docker image: https://www.paddlepaddle.org.cn/
