Multi-GPU training with deepmd-kit-2.0.0.b4: each GPU creates 4 processes #982
Unanswered
tonystarkiss asked this question in Q&A
Replies: 1 comment · 3 replies
- Why are 4 threads created per GPU? Is there something wrong?
-
I run with the following command, but each GPU creates 4 processes. How can I solve this?
CUDA_VISIBLE_DEVICES=4,5,6,7 horovodrun -np 4 /data/deepmd-kit/bin/dp train --mpi-log=workers input.json
the logs:
[1,3]:4. Check if another program is using the same GPU by executing
[1,3]:   nvidia-smi. The usage of GPUs is controlled by the CUDA_VISIBLE_DEVICES
[1,3]:   environment variable (current value: 4,5,6,7).
[1,1]:DEEPMD INFO built training
[1,1]:2021-08-16 12:53:29.228785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]:pciBusID: 0000:85:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
[1,1]:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]:2021-08-16 12:53:29.255893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[1,1]:2021-08-16 12:53:29.255989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]:2021-08-16 12:53:29.256022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
[1,1]:2021-08-16 12:53:29.256030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
[1,1]:2021-08-16 12:53:29.258110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13270 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0)
[1,0]:DEEPMD INFO built training
[1,0]:2021-08-16 12:53:29.414985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]:pciBusID: 0000:85:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
[1,0]:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]:2021-08-16 12:53:29.420238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[1,0]:2021-08-16 12:53:29.420436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]:2021-08-16 12:53:29.420470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
[1,0]:2021-08-16 12:53:29.420496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
[1,0]:2021-08-16 12:53:29.425876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13270 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0)
[1,1]:DEEPMD INFO initialize model from scratch
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
[1,0]:DEEPMD INFO initialize model from scratch
[1,1]:DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[50161,1],3]
Exit code: 1
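Note that in the log above, rank [1,0] and rank [1,1] both report the same pciBusID (0000:85:00.0), which suggests the four workers are not being pinned to distinct GPUs and are instead piling onto one card. A minimal sketch of per-rank GPU pinning, assuming the launcher exposes a local-rank environment variable (Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK; other launchers use different names, so treat the variable name here as an assumption):

```python
import os

def pin_gpu_for_rank(env=os.environ):
    """Restrict this worker to one GPU out of the visible set, chosen by
    its local rank. Hypothetical helper for illustration only -- the env
    var name is launcher-specific (Open MPI shown here)."""
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    gpus = env.get("CUDA_VISIBLE_DEVICES", "").split(",")
    # Re-export only this rank's GPU, so the framework sees a single device.
    env["CUDA_VISIBLE_DEVICES"] = gpus[local_rank % len(gpus)]
    return env["CUDA_VISIBLE_DEVICES"]
```

If something like this runs before TensorFlow initializes, each worker sees exactly one physical GPU, renumbered as logical device 0 (which is why every rank logs "Adding visible gpu devices: 0" regardless of which card it holds).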