-
Hello, I'm reaching out about my recent attempts to speed up training by using two GPUs. Following the provided guidelines, I tested the GPU performance and wrote a detailed description of the problems I ran into below, which I believe may need your expertise. I've also attached the log files, as they may offer insight into the root cause. Your help in resolving this would be greatly appreciated.
Description
Version v2.2.10
I tested the GPU performance using the following commands:
## GPUx1
export CUDA_VISIBLE_DEVICES=0; horovodrun -np 1 dp train --mpi-log=workers input.json > GPUx1_v2.2.10.log 2>&1
## GPUx2
export CUDA_VISIBLE_DEVICES=0,1; horovodrun -np 2 dp train --mpi-log=workers input.json > GPUx2_v2.2.10.log 2>&1
The program correctly detects the GPUs, but the training time does not seem right.
The time costs for using 1 GPU and 2 GPUs are almost the same (the exact values are in the attached log files).
Version v3.3.0a0
## GPUx2
export CUDA_VISIBLE_DEVICES=0,1; horovodrun -np 2 dp train --mpi-log=workers input.json > GPUx2_v3.3.0a.log 2>&1
Segmentation faults occur after the training task is initiated when I use this command.
Environments
Below are the systems and environments that I used.
OS and GPUs
I have two GPUs.
DeepMD
I installed deepmd v2.2.10 and v3.3.0a0 in offline mode by executing the following command:
bash deepmd-kit-x.x.x-cudaxxx-Linux-x86_64.sh
Additionally, I have provided the dependencies reported by horovodrun below. The results of the horovodrun check are consistent across both DeepMD versions, v2.2.10 and v3.3.0a0.
$ horovodrun --check-build
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[ ] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[ ] Gloo
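For completeness, a minimal sanity check along the lines of the sketch below can confirm that each of the two ranks sees the GPUs and gets pinned to its own device. This is a generic Horovod/TensorFlow pattern rather than anything DeePMD-specific, and the file name `check_ranks.py` is just an assumed placeholder; it would be launched with `horovodrun -np 2 python check_ranks.py`.

```python
# check_ranks.py -- minimal sanity check for the Horovod + TensorFlow setup.
# Launch with: horovodrun -np 2 python check_ranks.py
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# With CUDA_VISIBLE_DEVICES=0,1 each process initially sees both GPUs; pin
# this rank to the GPU matching its local rank, the usual Horovod pattern.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

print(f"rank {hvd.rank()} of {hvd.size()}, local_rank {hvd.local_rank()}, "
      f"{len(gpus)} GPU(s) visible before pinning")
```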
-
Parallel training applies data parallelism, which is equivalent to increasing the batch size, so the time per batch is expected to remain the same. Your result shows that the communication cost is low. xref: #1470
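To make that concrete, here is a back-of-the-envelope sketch; the batch size, step count, and dataset size below are made-up numbers, not values taken from the poster's input.json.

```python
# Back-of-the-envelope illustration of data parallelism with made-up numbers.
def epochs_covered(num_workers: int,
                   per_worker_batch: int = 32,    # assumed batch_size
                   numb_steps: int = 100_000,     # assumed training steps
                   n_frames: int = 6_400_000) -> float:
    # Each of the num_workers ranks processes its own batch every step, so the
    # effective batch size is num_workers * per_worker_batch while the wall
    # time per step stays roughly constant (plus a small communication cost).
    effective_batch = num_workers * per_worker_batch
    return numb_steps * effective_batch / n_frames

for n in (1, 2):
    print(f"{n} GPU(s): ~{epochs_covered(n):.1f} epochs in the same wall time")
# 1 GPU(s): ~0.5 epochs in the same wall time
# 2 GPU(s): ~1.0 epochs in the same wall time
```

In other words, two workers do not halve the wall time for a fixed number of steps; they double how much data is covered in that time.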
The epochs double.