Multi-GPU training with deepmd-kit-2.0.0.b4: each GPU creates 4 processes #982
Unanswered
tonystarkiss asked this question in Q&A
Replies: 1 comment · 3 replies
- Why are 4 threads created per GPU? Is there something wrong?
-
I run with the following command, but each GPU creates 4 processes. How can I solve this?
CUDA_VISIBLE_DEVICES=4,5,6,7 horovodrun -np 4 /data/deepmd-kit/bin/dp train --mpi-log=workers input.json
the logs:
[1,3]:4. Check if another program is using the same GPU by executing
[1,3]:   nvidia-smi. The usage of GPUs is controlled by the CUDA_VISIBLE_DEVICES
[1,3]:   environment variable (current value: 4,5,6,7).
[1,1]:DEEPMD INFO built training
[1,1]:2021-08-16 12:53:29.228785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]:pciBusID: 0000:85:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
[1,1]:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]:2021-08-16 12:53:29.255893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[1,1]:2021-08-16 12:53:29.255989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]:2021-08-16 12:53:29.256022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
[1,1]:2021-08-16 12:53:29.256030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
[1,1]:2021-08-16 12:53:29.258110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13270 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0)
[1,0]:DEEPMD INFO built training
[1,0]:2021-08-16 12:53:29.414985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]:pciBusID: 0000:85:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
[1,0]:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]:2021-08-16 12:53:29.420238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[1,0]:2021-08-16 12:53:29.420436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]:2021-08-16 12:53:29.420470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
[1,0]:2021-08-16 12:53:29.420496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
[1,0]:2021-08-16 12:53:29.425876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13270 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0)
[1,1]:DEEPMD INFO initialize model from scratch
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
[1,0]:DEEPMD INFO initialize model from scratch
[1,1]:DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[50161,1],3]
Exit code: 1
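Note that in the log above, rank [1,0] and rank [1,1] both report the same pciBusID (0000:85:00.0), which suggests the four workers are not being pinned to distinct GPUs and are instead piling onto one card. A minimal sketch of per-rank GPU pinning, assuming the launcher exposes a local-rank environment variable (Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK; other launchers use different names, so treat the variable name here as an assumption):

```python
import os

def pin_gpu_for_rank(env=os.environ):
    """Restrict this worker to one GPU out of the visible set, chosen by
    its local rank. Hypothetical helper for illustration only -- the env
    var name is launcher-specific (Open MPI shown here)."""
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    gpus = env.get("CUDA_VISIBLE_DEVICES", "").split(",")
    # Re-export only this rank's GPU, so the framework sees a single device.
    env["CUDA_VISIBLE_DEVICES"] = gpus[local_rank % len(gpus)]
    return env["CUDA_VISIBLE_DEVICES"]
```

If something like this runs before TensorFlow initializes, each worker sees exactly one physical GPU, renumbered as logical device 0 (which is why every rank logs "Adding visible gpu devices: 0" regardless of which card it holds).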