Choose which GPU goes to which node in DDP (Tesla K80 hangs) #9282
Unanswered
roman-vygon asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
I think you need to use a Lightning version above 1.3.8 if you are also using a newer PyTorch version (1.8+, I believe). The barrier now receives device ids, so this should be fixed.
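For context, a minimal sketch of what that fix amounts to — this is not Lightning's actual code, and it assumes PyTorch >= 1.8 with the NCCL backend and a LOCAL_RANK variable set by the launcher:

```python
import os

import torch
import torch.distributed as dist


def ddp_setup():
    # LOCAL_RANK is set by the launcher (e.g. torchrun or Lightning's DDP spawner).
    local_rank = int(os.environ["LOCAL_RANK"])
    # Bind this process to its own GPU before any collective call; otherwise
    # every rank may end up issuing NCCL calls against GPU 0.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # Since PyTorch 1.8 the barrier can be told which device this rank owns.
    # Per the comment above, newer Lightning passes device_ids here, while older
    # versions called barrier() bare, which is one way such setups can hang.
    dist.barrier(device_ids=[local_rank])
```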
Hi! I've been experiencing a similar problem to pytorch/pytorch#1637: I have a machine with two Tesla K80s, which show up as 4 devices in nvidia-smi.
I'm trying to train a NeMo framework example model, and it trains fine if I use
CUDA_VISIBLE_DEVICES=1,3 python train.py
or
CUDA_VISIBLE_DEVICES=0,2 python train.py
or even
CUDA_VISIBLE_DEVICES=0,3 python train.py
The problem appears when I pair GPU 0 with GPU 1, or GPU 2 with GPU 3, or (and this is what I'm actually trying to achieve) when I run DDP with all 4 devices.
I thought of working around this with num_nodes=2 gpus=2, so that the first node would use GPUs 1 and 3, and the second one GPUs 0 and 2. But the training hangs, probably because Lightning still uses the pairs 0-1 and 2-3. How can I change the devices that get assigned to each node?
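A minimal sketch of the single-node part of this (not the NeMo train.py itself; LitModel below is just a placeholder name): Lightning's Trainer accepts an explicit list of device indices instead of a count, which is how a 1+3 or 0+2 pairing can be forced on one node.

```python
import pytorch_lightning as pl

# "LitModel" stands in for the actual NeMo LightningModule used in train.py.
# Passing a list of indices (rather than gpus=2) lets the two DDP processes
# use devices 1 and 3 instead of the default 0-1 / 2-3 grouping.
trainer = pl.Trainer(
    gpus=[1, 3],        # explicit device indices, not a count
    accelerator="ddp",  # Lightning 1.3.x-era flag selecting DistributedDataParallel
)
# trainer.fit(LitModel())
```

(For a true multi-node launch, one common approach is to export a different CUDA_VISIBLE_DEVICES on each node before starting the script, since device indices are interpreted relative to what is visible on that node.)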
The training hangs with a message:
Internally this is due to some p2p problem, which I can't solve right now. Here is the output of cuda/p2pBandwidthLatencyTest:
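For reference, a small sketch (an addition, not part of the original post) that queries the same peer-to-peer reachability from PyTorch; pairs reporting no P2P here should line up with the broken links in the bandwidth/latency test output:

```python
import torch

# Print which device pairs report peer-to-peer (P2P) access.
# Pairs that print "no P2P" are the ones likely to hang under NCCL;
# setting NCCL_P2P_DISABLE=1 in the environment is a common workaround to try.
num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src == dst:
            continue
        reachable = torch.cuda.can_device_access_peer(src, dst)
        print(f"GPU {src} -> GPU {dst}: {'P2P ok' if reachable else 'no P2P'}")
```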