NODE_RANK causes DDP jobs to hang at initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
#5798
Unanswered
ajtao asked this question in DDP / multi-GPU / multi-node
Hello,
In my compute cluster, all PyTorch Lightning code hangs when using more than 1 GPU. It hangs right at "initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8".
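For context, a stripped-down script along these lines is enough to hit the hang on our cluster once NODE_RANK is set in the environment. The model and Trainer flags below are just a generic sketch (1.1-era Trainer API), not my actual training code:

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random tensors, just enough to drive the training loop."""

    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        # Any loss will do; the point is only to reach DDP initialization.
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=8,             # anything > 1 hangs for me; 8 matches "MEMBER: 1/8"
        accelerator="ddp",  # 1.1-era Trainer flag for the DDP backend
        max_epochs=1,
    )
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))
```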
Some relevant stats:
I have found that training does work if I unset NODE_RANK.
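In practice the workaround is simply unset NODE_RANK in the shell before launching, or equivalently, at the top of the training script (the in-script variant is just my own convenience, not something Lightning documents):

```python
import os

# Dropping NODE_RANK from the environment before the Trainer is constructed
# is what unblocks multi-GPU training on this cluster; it is equivalent to
# running `unset NODE_RANK` in the shell before launching the job.
os.environ.pop("NODE_RANK", None)
```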
By instrumenting the pytorch-lightning code, I have observed that use_torchelastic_ddp is selected for all ranks.
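A cruder check than instrumenting Lightning itself, but one anyone can drop into a script, is to print the distributed-related environment variables each process sees and compare runs with and without NODE_RANK set:

```python
import os

# Print the environment each rank actually sees, to compare the runs
# with and without NODE_RANK exported.
_VARS = ("NODE_RANK", "LOCAL_RANK", "GROUP_RANK", "RANK",
         "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
print("pid", os.getpid(), {v: os.environ.get(v) for v in _VARS})
```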
My questions:
What's your environment?