DataParallel parallel efficiency #2852
Unanswered
DL-WallModel asked this question in Q&A
-
Those are some interesting findings, thanks for digging into this. I also tested it on my end. There is indeed an overhead due to GPU synchronization and communication, but this overhead should be negligible.
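To see where the extra time actually goes, the PyTorch profiler can break a single step down into the compute kernels and the copy/broadcast work that DataParallel adds on top of them. Below is a minimal sketch, assuming a `model` wrapped in torch_geometric's `DataParallel`, a `DataListLoader` called `loader`, and a placeholder `nll_loss`:

```python
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

# Profile one training step. `model` and `loader` stand in for the
# DataParallel model and DataListLoader discussed in this thread.
data_list = next(iter(loader))
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(data_list)         # scatter + replicate + per-GPU forward + gather
    y = torch.cat([data.y for data in data_list]).to(out.device)
    F.nll_loss(out, y).backward()  # placeholder loss
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

Copy and broadcast entries in the table correspond to the per-step cost of DataParallel itself, on top of the actual compute kernels. The much larger first-epoch time reported below is usually one-off warm-up (lazy CUDA context creation on each GPU, allocator and cuDNN initialization) rather than any parallel preprocessing.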
-
Hi!
I used DataParallel to parallelize an existing sequential NN, following the example in pytorch_geometric/examples/multi_gpu/data_parallel.py. The training process seems to work well, since the final accuracy is the same in the sequential and the parallel execution. However, the parallel efficiency is very poor: the parallel training takes up to 30% longer than the sequential one. The only difference between the executions is the number of GPUs used; the code is exactly the same.
I made some checks to look for a possible bug, but apparently everything is correct. First of all, I checked whether the forward pass was running in parallel, and it seems that it is: I printed the batch size seen by each GPU. Taking into account that the batch size assigned in the DataListLoader was 32, these are the results:
The output in the sequential execution was:
while the output in the parallel execution with 4 GPUs was:
So, it seems that the data is properly distributed.
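For reference, here is a sketch of that kind of setup, following the structure of the data_parallel.py example. The two-layer GCN, the ENZYMES dataset, and the hyperparameters are placeholders for the actual model and data; the print in forward is what reveals the per-replica batch size:

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataListLoader
from torch_geometric.nn import DataParallel, GCNConv, global_mean_pool


class Net(torch.nn.Module):
    # Placeholder two-layer GCN standing in for the actual sequential model.
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.lin = torch.nn.Linear(hidden_channels, num_classes)

    def forward(self, data):
        # Under DataParallel each replica only receives its share of the
        # 32-graph batch, so this shows the per-GPU batch size.
        print(f'{data.x.device}: {data.num_graphs} graphs')
        x = F.relu(self.conv1(data.x, data.edge_index))
        x = F.relu(self.conv2(x, data.edge_index))
        x = global_mean_pool(x, data.batch)
        return F.log_softmax(self.lin(x), dim=-1)


# Placeholder graph-classification dataset.
dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')

# DataListLoader yields plain Python lists of Data objects; DataParallel
# scatters each list across the visible GPUs and gathers the outputs on cuda:0.
loader = DataListLoader(dataset, batch_size=32, shuffle=True)
model = DataParallel(Net(dataset.num_features, 64, dataset.num_classes)).to('cuda:0')
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
```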
Therefore, I computed the total running time within an epoch of each step of the training loop: forward pass, loss computation, backward pass, and optimizer step. Since DataParallel only parallelizes at the module level, I expected a reduction in time only in the forward pass. However, I obtained substantial increases in all of them:
sequential run:
parallel run:
Actually, in the first epoch, the increase in the parallel forward-pass time is huge. From the second epoch onwards, it is approximately double that of the sequential run. Is the first epoch doing any sort of parallel preprocessing?
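For reference, per-phase times like the ones above can be measured with explicit synchronization. CUDA kernels are launched asynchronously, so without torch.cuda.synchronize() around each measurement, work queued during the forward pass can end up being billed to the loss, backward, or optimizer timings. A minimal sketch, reusing the placeholder model, loader, and optimizer from the earlier sketch (the nll_loss is again a placeholder):

```python
import time
import torch
import torch.nn.functional as F

def timed(fn, *args):
    # Synchronize before and after so that earlier asynchronous CUDA work is
    # not attributed to this phase, and this phase's work is fully counted.
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args)
    torch.cuda.synchronize()
    return result, time.perf_counter() - start

t_fwd = t_loss = t_bwd = t_opt = 0.0
for data_list in loader:                      # one epoch
    optimizer.zero_grad()
    out, dt = timed(model, data_list)         # forward pass (scatter/gather included)
    t_fwd += dt
    y = torch.cat([data.y for data in data_list]).to(out.device)
    loss, dt = timed(F.nll_loss, out, y)      # placeholder loss
    t_loss += dt
    _, dt = timed(loss.backward)
    t_bwd += dt
    _, dt = timed(optimizer.step)
    t_opt += dt
print(f'forward {t_fwd:.3f}s  loss {t_loss:.3f}s  backward {t_bwd:.3f}s  optim {t_opt:.3f}s')
```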
Taking into account that the forward function seems to work in parallel and the final accuracy is correct, is this behavior expected, or must there be an unnoticed bug? Could it be attributable to communication overhead?
Thanks in advance for your help!