Why would my Lightning script not be able to access all GPUs available on each requested node? #9456
Unanswered · asked by amorehead in DDP / multi-GPU / multi-node · 0 replies
Hello, everyone. I am using ORNL's Summit compute cluster (an LSF-based system built on the PowerPC/POWER9 architecture) to run distributed deep learning training with Lightning. Currently, I cannot request more than 8 compute nodes without hitting "CUDA: GPU unavailable" errors on arbitrary requested nodes during DDP/NCCL startup. I have looked for physical limits on the number of GPUs or GPU nodes a job can request, but I cannot find anything relevant in Summit's documentation. Moreover, I have heard of others training on Summit with Lightning such that their jobs scaled to hundreds of nodes without a hitch (in the past, so this is likely still possible).

Has anyone encountered GPU-unavailable issues when increasing the number of DDP nodes past a certain threshold? In my case, I can train without any errors on 8 nodes with 6 GPUs each; anything beyond 8 nodes triggers the CUDA GPU waiting error.
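For context, one sanity check I would run before blaming Lightning is to confirm that the LSF allocation actually contains the expected number of distinct hosts before DDP starts. `LSB_HOSTS` is a real LSF environment variable (one entry per allocated slot), but the helper below and the example host names are hypothetical, a minimal sketch rather than anything Summit-specific:

```python
import os

def unique_hosts(lsb_hosts: str) -> list[str]:
    """Return the distinct host names in an LSF LSB_HOSTS string.

    LSB_HOSTS lists one entry per allocated slot, so the same host
    usually appears many times; deduplicate while preserving order.
    """
    seen: list[str] = []
    for host in lsb_hosts.split():
        if host not in seen:
            seen.append(host)
    return seen

if __name__ == "__main__":
    # Fabricated allocation string for illustration only; in a real
    # job you would read os.environ.get("LSB_HOSTS", "") instead.
    example = "batch1 node1 node1 node2 node2 node3"
    print(unique_hosts(example))  # ['batch1', 'node1', 'node2', 'node3']
```

If the number of distinct compute hosts reported here already disagrees with the node count you requested (for example, beyond 8 nodes), the problem is likely in the job allocation or launch step rather than in Lightning's DDP setup.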