Large(er) scale multi-node / multi-gpu issue #10025
proutrc asked this question in DDP / multi-GPU / multi-node
We are running into an issue with DDP initialization past a certain threshold, specifically at 14 nodes on our cluster. The cluster is LSF-based (Summit at OLCF), and each of our nodes has 6 GPUs.
Things initialize and run properly at <14 nodes, but initialization just hangs for runs at >=14 nodes (84+ GPUs).
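For context, the Trainer is configured roughly like this (a minimal sketch with hypothetical values; the real training script differs, and argument names follow pytorch-lightning 1.4.x):

```python
import pytorch_lightning as pl

# Sketch of the multi-node DDP setup at this scale (hypothetical values).
# With pytorch-lightning 1.4.x the relevant arguments are `num_nodes`,
# `gpus` (per node), and `accelerator="ddp"`.
trainer = pl.Trainer(
    num_nodes=14,        # hangs at >= 14 nodes on our cluster
    gpus=6,              # 6 GPUs per Summit node
    accelerator="ddp",
    max_epochs=1,
)
```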
I am curious whether anyone has seen anything like this with pytorch-lightning? I am currently using pytorch 1.9 and pytorch-lightning 1.4.8. Below are a couple of relevant snippets from the output at initialization (13 nodes and 14 nodes, respectively):
The second one, trying to initialize 84 GPUs across 14 nodes, just hangs here until the job time runs out. It looks like there may be an issue with the global ranks and how they are being set up once we cross that threshold.
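To narrow it down, one option is to take pytorch-lightning out of the picture and drive the rendezvous directly with torch.distributed. A minimal sketch, assuming jsrun exports the JSM_NAMESPACE_* variables and that MASTER_ADDR/MASTER_PORT point at the first node in the allocation (adjust the variable names for your launcher):

```python
import os
import torch
import torch.distributed as dist

# Standalone rendezvous check, independent of pytorch-lightning.
# Rank/world-size variables below are assumptions for a jsrun launch.
rank = int(os.environ["JSM_NAMESPACE_RANK"])
world_size = int(os.environ["JSM_NAMESPACE_SIZE"])
local_rank = int(os.environ["JSM_NAMESPACE_LOCAL_RANK"])

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    init_method="env://",   # uses MASTER_ADDR / MASTER_PORT
    world_size=world_size,
    rank=rank,
)

# A single all_reduce forces every rank through the rendezvous; if this
# also hangs at 84 ranks, the problem is below pytorch-lightning.
t = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(t)
print(f"rank {rank}/{world_size} ok, sum={t.item()}", flush=True)
dist.destroy_process_group()
```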
Has anyone seen this?