-
Hello, I'm reaching out about my recent attempts to speed up training by using two GPUs. Following the provided guidelines, I tested the GPU performance and wrote a detailed description of the problems I ran into below, which I believe may need your expertise. I've also attached the log files, as they may offer insight into the root cause. Your help in resolving this would be greatly appreciated.
Description
Version v2.2.10
I tested the GPU performance using the following commands:
## GPUx1
export CUDA_VISIBLE_DEVICES=0; horovodrun -np 1 dp train --mpi-log=workers input.json > GPUx1_v2.2.10.log 2>&1
## GPUx2
export CUDA_VISIBLE_DEVICES=0,1; horovodrun -np 2 dp train --mpi-log=workers input.json > GPUx2_v2.2.10.log 2>&1
The program correctly detects the GPUs, but the training time does not seem right.
The time costs for using 1 GPU and 2 GPUs are almost the same (the exact values are in the attached log files).
Version v3.3.0a0
## GPUx2
export CUDA_VISIBLE_DEVICES=0,1; horovodrun -np 2 dp train --mpi-log=workers input.json > GPUx2_v3.3.0a.log 2>&1
Segmentation faults occur after the training task is initiated when I use this command.
Environments
Below are the systems and environments that I used.
OS and GPUs
I have two GPUs.
DeepMD
I installed deepmd v2.2.10 and v3.3.0a0 in offline mode by executing the following command:
bash deepmd-kit-x.x.x-cudaxxx-Linux-x86_64.sh
Additionally, I have provided the dependencies reported by horovodrun below. The results of the horovodrun check are consistent across both DeepMD versions, v2.2.10 and v3.3.0a0.
$ horovodrun --check-build
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[ ] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[ ] Gloo
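For completeness, a minimal sanity check along the lines of the sketch below can confirm that each of the two ranks sees the GPUs and gets pinned to its own device. This is a generic Horovod/TensorFlow pattern rather than anything DeePMD-specific, and the file name `check_ranks.py` is just an assumed placeholder; it would be launched with `horovodrun -np 2 python check_ranks.py`.

```python
# check_ranks.py -- minimal sanity check for the Horovod + TensorFlow setup.
# Launch with: horovodrun -np 2 python check_ranks.py
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# With CUDA_VISIBLE_DEVICES=0,1 each process initially sees both GPUs; pin
# this rank to the GPU matching its local rank, the usual Horovod pattern.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

print(f"rank {hvd.rank()} of {hvd.size()}, local_rank {hvd.local_rank()}, "
      f"{len(gpus)} GPU(s) visible before pinning")
```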
-
Parallel training applies data parallelism, which is equivalent to increasing the batch size, so the time per batch is expected to remain the same. Your result shows that the communication cost is low. xref: #1470
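To make that concrete, here is a back-of-the-envelope sketch; the batch size, step count, and dataset size below are made-up numbers, not values taken from the poster's input.json.

```python
# Back-of-the-envelope illustration of data parallelism with made-up numbers.
def epochs_covered(num_workers: int,
                   per_worker_batch: int = 32,    # assumed batch_size
                   numb_steps: int = 100_000,     # assumed training steps
                   n_frames: int = 6_400_000) -> float:
    # Each of the num_workers ranks processes its own batch every step, so the
    # effective batch size is num_workers * per_worker_batch while the wall
    # time per step stays roughly constant (plus a small communication cost).
    effective_batch = num_workers * per_worker_batch
    return numb_steps * effective_batch / n_frames

for n in (1, 2):
    print(f"{n} GPU(s): ~{epochs_covered(n):.1f} epochs in the same wall time")
# 1 GPU(s): ~0.5 epochs in the same wall time
# 2 GPU(s): ~1.0 epochs in the same wall time
```

In other words, two workers do not halve the wall time for a fixed number of steps; they double how much data is covered in that time.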
The epochs double.