DDP MultiGPU Training doesn't reduce training time #18187
Unanswered
AlejandroTL asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Hello!
I want to do multi-GPU training with a model. I have a node with 4 GPUs. Training with just 1 GPU, each epoch takes 9 hours. Training with 4 GPUs, each epoch also takes 9 hours. There is no reduction whatsoever, neither in the number of batches per epoch nor in the wall-clock time.
The way I am calling the trainer is:
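Roughly equivalent, simplified sketch (the model/dataloader names and the epoch count are placeholders, not my exact script):

```python
import lightning.pytorch as pl  # PyTorch Lightning 2.0.x

model = MyLitModel()  # placeholder LightningModule

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                                   # one DDP process per GPU on the node
    strategy="ddp_find_unused_parameters_true",  # instead of plain "ddp", see below
    max_epochs=50,                               # placeholder
)
trainer.fit(model, train_dataloaders=train_loader)
```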
Lines I see in the logs when training with 4 GPUs are:
How can I actually check whether I am indeed working with 4 GPUs? I know that my system can see them.
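For example, something along these lines is what I mean by checking (sketch):

```python
import torch
import torch.distributed as dist

print(torch.cuda.device_count())  # prints 4, so the node's GPUs are visible to PyTorch

# Inside the LightningModule (e.g. in on_train_start), once DDP is initialised:
if dist.is_available() and dist.is_initialized():
    print(dist.get_world_size(), dist.get_rank())  # I would expect world_size == 4
```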
Finally, I use `ddp_find_unused_parameters_true` instead of `ddp` because I use a `torch.nn.Embedding`, and not every minibatch retrieves all indices, which apparently causes some problems.
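The situation is roughly the following (simplified sketch, hypothetical names):

```python
import torch
import torch.nn as nn

class ItemEncoder(nn.Module):
    def __init__(self, num_items: int, dim: int):
        super().__init__()
        self.embedding = nn.Embedding(num_items, dim)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # A minibatch only contains a subset of the possible indices, so only part
        # of the embedding table is touched in each step; with plain "ddp" this
        # apparently leads to unused-parameter / gradient errors, hence
        # strategy="ddp_find_unused_parameters_true".
        return self.embedding(item_ids)
```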
Torch version: `2.0.1+cu117`
PyTorch Lightning version: `2.0.6`
I solved this problem, but now my script gets stuck at the following point while constructing the DDP process:
I have seen some GitHub issues with similar problems, but no solution was found. Any ideas?
Thanks!!!