Training not proceeding #14016
Unanswered
kad99kev asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Hello,
I am trying to train models on 2 GPUs. When I set the training strategy to dp, training hangs after the first epoch and the second epoch never begins. When I use ddp instead, training does not start at all. I am currently using v1.7.0.
Here is my DataLoader for reference.
And my Trainer for reference
Any help would be appreciated, thank you!
Edit:
Additionally, when I force a single GPU (`devices=1`), training gets stuck during the validation of epoch 0.
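For readers comparing setups, the three configurations described in this post can be sketched as Trainer keyword arguments. This is an illustrative sketch only, not the Trainer from the post: `trainer_kwargs` and `CONFIGS` are hypothetical names, `accelerator="gpu"` is an assumption, and any other arguments the real Trainer uses are unknown and left out.

```python
from typing import Any, Dict, Optional


def trainer_kwargs(strategy: Optional[str], devices: int) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for a Lightning Trainer."""
    # Assumption: training runs on GPUs, so accelerator is set to "gpu".
    kwargs: Dict[str, Any] = {"accelerator": "gpu", "devices": devices}
    # Only pass a strategy when more than one device is requested.
    if strategy is not None and devices > 1:
        kwargs["strategy"] = strategy
    return kwargs


# The three setups from the post, with the reported symptom as a comment.
CONFIGS = {
    "dp, 2 GPUs": trainer_kwargs("dp", 2),    # hangs after the first epoch
    "ddp, 2 GPUs": trainer_kwargs("ddp", 2),  # training never starts
    "single GPU": trainer_kwargs(None, 1),    # hangs in validation of epoch 0
}

if __name__ == "__main__":
    for name, kwargs in CONFIGS.items():
        print(name, kwargs)
    # With Lightning installed, each dict could be passed as
    # pytorch_lightning.Trainer(**kwargs).
```

Since the single-GPU run also hangs, it is likely the simplest of the three to debug first, because it involves no multi-GPU strategy at all.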