Why would my Lightning script not be able to access all GPUs available on each requested node? #9456
Unanswered · asked by amorehead in DDP / multi-GPU / multi-node · 0 replies
Hello, everyone. I am using ORNL's Summit compute cluster (an LSF-based system built on the PowerPC/POWER9 architecture) to run distributed deep learning training with Lightning. Currently, I cannot request more than 8 compute nodes without hitting "CUDA: GPU unavailable" errors on arbitrary requested nodes during DDP/NCCL startup. I have looked for physical limits on the number of GPUs or GPU nodes a job can request, but I cannot find anything relevant in Summit's documentation. Moreover, I have heard of others training on Summit with Lightning such that their jobs scaled to hundreds of nodes without a hitch (in the past, so this is likely still possible).

Has anyone encountered GPU-unavailable issues when increasing the number of DDP nodes past a certain threshold? In my case, I can train without any errors on 8 nodes with 6 GPUs each; anything beyond 8 nodes triggers the CUDA GPU waiting error.
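For context, one sanity check I would run before blaming Lightning is to confirm that the LSF allocation actually contains the expected number of distinct hosts before DDP starts. `LSB_HOSTS` is a real LSF environment variable (one entry per allocated slot), but the helper below and the example host names are hypothetical, a minimal sketch rather than anything Summit-specific:

```python
import os

def unique_hosts(lsb_hosts: str) -> list[str]:
    """Return the distinct host names in an LSF LSB_HOSTS string.

    LSB_HOSTS lists one entry per allocated slot, so the same host
    usually appears many times; deduplicate while preserving order.
    """
    seen: list[str] = []
    for host in lsb_hosts.split():
        if host not in seen:
            seen.append(host)
    return seen

if __name__ == "__main__":
    # Fabricated allocation string for illustration only; in a real
    # job you would read os.environ.get("LSB_HOSTS", "") instead.
    example = "batch1 node1 node1 node2 node2 node3"
    print(unique_hosts(example))  # ['batch1', 'node1', 'node2', 'node3']
```

If the number of distinct compute hosts reported here already disagrees with the node count you requested (for example, beyond 8 nodes), the problem is likely in the job allocation or launch step rather than in Lightning's DDP setup.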