Why does distributed training get stuck here and not move? #15266
-
I have requested two GPUs on a SLURM cluster for distributed training, but the program does not move. When I use only one GPU, the model trains normally.
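For context, the setup that hangs typically looks like the sketch below. The toy model, data, and Trainer arguments are my own assumptions, not taken from the original post; the point is that the Trainer launches one DDP process per GPU and the two ranks wait on each other during NCCL initialization.

```python
# Minimal sketch of a two-GPU DDP run with PyTorch Lightning on one SLURM node.
# Model and data are toy placeholders (assumption, not from the original post).
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def main():
    data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,          # the two GPUs requested from SLURM
        strategy="ddp",     # uses the NCCL backend by default on GPU
        max_epochs=1,
    )
    trainer.fit(ToyModel(), DataLoader(data, batch_size=8))


if __name__ == "__main__":
    main()
```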
-
I guess it deadlocked when creating a missing folder?
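If that guess is right, a common workaround is to create directories only on rank 0 and make the other ranks wait at a barrier. A minimal sketch (not the poster's code, and assuming the hang really comes from ranks racing on the same output folder):

```python
# Create the output folder on rank 0 only, then synchronize all ranks.
import os
import torch.distributed as dist


def setup_output_dir(path: str) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        os.makedirs(path, exist_ok=True)   # only one process touches the filesystem
    if dist.is_initialized():
        dist.barrier()                     # others wait until the folder exists
```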
-
When I switch the communication backend from 'nccl' to 'gloo', it works. I don't know what the root cause is, but I hope this helps you.
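For anyone who wants to try the same switch, here is a sketch of how the backend can be selected; the `DDPStrategy` argument is an assumption based on recent PyTorch Lightning versions, not something confirmed in this thread:

```python
# Sketch: use the Gloo process-group backend instead of the default NCCL.
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    # Default for GPU DDP is "nccl"; Gloo skips the NCCL handshake that hangs here.
    strategy=DDPStrategy(process_group_backend="gloo"),
)

# In plain PyTorch the equivalent would be:
#   torch.distributed.init_process_group(backend="gloo")
```

Note that Gloo is generally slower than NCCL for GPU collectives, so this is more of a diagnostic workaround than a fix.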
-
I am having the same issue. Did you find out what the problem was?
-
Similar issue, but a little different: the distributed setup starts successfully, and then training gets stuck.