DataParallel parallel efficiency #2852
Unanswered
DL-WallModel asked this question in Q&A
-
Those are some interesting findings, thanks for digging into this. I also tested it on my end. There is indeed an overhead due to GPU synchronization and communication, but this overhead should be negligible.
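To see where the extra time actually goes, the PyTorch profiler can break a single step down into the compute kernels and the copy/broadcast work that DataParallel adds on top of them. Below is a minimal sketch, assuming a `model` wrapped in torch_geometric's `DataParallel`, a `DataListLoader` called `loader`, and a placeholder `nll_loss`:

```python
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

# Profile one training step. `model` and `loader` stand in for the
# DataParallel model and DataListLoader discussed in this thread.
data_list = next(iter(loader))
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(data_list)         # scatter + replicate + per-GPU forward + gather
    y = torch.cat([data.y for data in data_list]).to(out.device)
    F.nll_loss(out, y).backward()  # placeholder loss
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

Copy and broadcast entries in the table correspond to the per-step cost of DataParallel itself, on top of the actual compute kernels. The much larger first-epoch time reported below is usually one-off warm-up (lazy CUDA context creation on each GPU, allocator and cuDNN initialization) rather than any parallel preprocessing.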
-
Hi!
I used DataParallel to parallelize an existing sequential NN, following the example in pytorch_geometric/examples/multi_gpu/data_parallel.py. The training process seems to work well, since the final accuracy is the same in the sequential and the parallel execution. However, the parallel efficiency is very poor: the parallel training takes up to 30% longer than the sequential one. The only difference between the executions is the number of GPUs used; the code is exactly the same.
I made some checks to look for a possible bug, but apparently everything is correct. First of all, I checked whether the forward pass was running in parallel, and it seems that it is: I printed the batch size seen by each GPU. Taking into account that the batch size assigned in the DataListLoader was 32, these are the results:
The output in the sequential execution was:
while the output in the parallel execution with 4 GPUs was:
So, it seems that the data is properly distributed.
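For reference, here is a sketch of that kind of setup, following the structure of the data_parallel.py example. The two-layer GCN, the ENZYMES dataset, and the hyperparameters are placeholders for the actual model and data; the print in forward is what reveals the per-replica batch size:

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataListLoader
from torch_geometric.nn import DataParallel, GCNConv, global_mean_pool


class Net(torch.nn.Module):
    # Placeholder two-layer GCN standing in for the actual sequential model.
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.lin = torch.nn.Linear(hidden_channels, num_classes)

    def forward(self, data):
        # Under DataParallel each replica only receives its share of the
        # 32-graph batch, so this shows the per-GPU batch size.
        print(f'{data.x.device}: {data.num_graphs} graphs')
        x = F.relu(self.conv1(data.x, data.edge_index))
        x = F.relu(self.conv2(x, data.edge_index))
        x = global_mean_pool(x, data.batch)
        return F.log_softmax(self.lin(x), dim=-1)


# Placeholder graph-classification dataset.
dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')

# DataListLoader yields plain Python lists of Data objects; DataParallel
# scatters each list across the visible GPUs and gathers the outputs on cuda:0.
loader = DataListLoader(dataset, batch_size=32, shuffle=True)
model = DataParallel(Net(dataset.num_features, 64, dataset.num_classes)).to('cuda:0')
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
```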
Therefore, I computed the total running time within an epoch of each step of the training loop: forward pass, loss computation, backward pass, and optimizer step. Since DataParallel only parallelizes at the module level, I expected a reduction in time only in the forward pass. However, I obtained substantial increases in all of them:
sequential run:
parallel run:
Actually, in the first epoch, the increase in the parallel forward-pass time is huge. From the second epoch onwards, it is approximately double that of the sequential run. Is the first epoch doing any sort of parallel preprocessing?
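For reference, per-phase times like the ones above can be measured with explicit synchronization. CUDA kernels are launched asynchronously, so without torch.cuda.synchronize() around each measurement, work queued during the forward pass can end up being billed to the loss, backward, or optimizer timings. A minimal sketch, reusing the placeholder model, loader, and optimizer from the earlier sketch (the nll_loss is again a placeholder):

```python
import time
import torch
import torch.nn.functional as F

def timed(fn, *args):
    # Synchronize before and after so that earlier asynchronous CUDA work is
    # not attributed to this phase, and this phase's work is fully counted.
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args)
    torch.cuda.synchronize()
    return result, time.perf_counter() - start

t_fwd = t_loss = t_bwd = t_opt = 0.0
for data_list in loader:                      # one epoch
    optimizer.zero_grad()
    out, dt = timed(model, data_list)         # forward pass (scatter/gather included)
    t_fwd += dt
    y = torch.cat([data.y for data in data_list]).to(out.device)
    loss, dt = timed(F.nll_loss, out, y)      # placeholder loss
    t_loss += dt
    _, dt = timed(loss.backward)
    t_bwd += dt
    _, dt = timed(optimizer.step)
    t_opt += dt
print(f'forward {t_fwd:.3f}s  loss {t_loss:.3f}s  backward {t_bwd:.3f}s  optim {t_opt:.3f}s')
```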
Taking into account that the forward function seems to work in parallel and the final accuracy is correct, is this behavior expected, or must there be an unnoticed bug? Could it be attributable to communication overhead?
Thanks in advance for your help!