Choose which GPU goes to which node in DDP (Tesla K80 hangs) #9282
Unanswered
roman-vygon asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
I think you need to use a Lightning version above 1.3.8 if you are also using a newer PyTorch version (1.8+, I believe). The barrier now receives device ids, so this should be fixed.
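For context, a minimal sketch of what that fix amounts to — this is not Lightning's actual code, and it assumes PyTorch >= 1.8 with the NCCL backend and a LOCAL_RANK variable set by the launcher:

```python
import os

import torch
import torch.distributed as dist


def ddp_setup():
    # LOCAL_RANK is set by the launcher (e.g. torchrun or Lightning's DDP spawner).
    local_rank = int(os.environ["LOCAL_RANK"])
    # Bind this process to its own GPU before any collective call; otherwise
    # every rank may end up issuing NCCL calls against GPU 0.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # Since PyTorch 1.8 the barrier can be told which device this rank owns.
    # Per the comment above, newer Lightning passes device_ids here, while older
    # versions called barrier() bare, which is one way such setups can hang.
    dist.barrier(device_ids=[local_rank])
```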
Hi! I've been experiencing a similar problem to pytorch/pytorch#1637: I have a machine with two Tesla K80s, which show up as 4 devices in nvidia-smi.
I'm trying to train a NeMo framework example model, and it trains fine if I use
CUDA_VISIBLE_DEVICES=1,3 python train.py
or
CUDA_VISIBLE_DEVICES=0,2 python train.py
or even
CUDA_VISIBLE_DEVICES=0,3 python train.py
The problem appears when I pair GPU 0 with GPU 1, or GPU 2 with GPU 3, or (and this is what I'm actually trying to achieve) when I run DDP with all 4 devices.
I thought of working around this with num_nodes=2 gpus=2, so that the first node would use GPUs 1 and 3, and the second one GPUs 0 and 2. But the training hangs, probably because Lightning still uses the pairs 0-1 and 2-3. How can I change the devices that get assigned to each node?
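A minimal sketch of the single-node part of this (not the NeMo train.py itself; LitModel below is just a placeholder name): Lightning's Trainer accepts an explicit list of device indices instead of a count, which is how a 1+3 or 0+2 pairing can be forced on one node.

```python
import pytorch_lightning as pl

# "LitModel" stands in for the actual NeMo LightningModule used in train.py.
# Passing a list of indices (rather than gpus=2) lets the two DDP processes
# use devices 1 and 3 instead of the default 0-1 / 2-3 grouping.
trainer = pl.Trainer(
    gpus=[1, 3],        # explicit device indices, not a count
    accelerator="ddp",  # Lightning 1.3.x-era flag selecting DistributedDataParallel
)
# trainer.fit(LitModel())
```

(For a true multi-node launch, one common approach is to export a different CUDA_VISIBLE_DEVICES on each node before starting the script, since device indices are interpreted relative to what is visible on that node.)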
The training hangs with a message:
Internally this is due to some p2p problem, which I can't solve right now. Here is the output of cuda/p2pBandwidthLatencyTest:
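For reference, a small sketch (an addition, not part of the original post) that queries the same peer-to-peer reachability from PyTorch; pairs reporting no P2P here should line up with the broken links in the bandwidth/latency test output:

```python
import torch

# Print which device pairs report peer-to-peer (P2P) access.
# Pairs that print "no P2P" are the ones likely to hang under NCCL;
# setting NCCL_P2P_DISABLE=1 in the environment is a common workaround to try.
num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src == dst:
            continue
        reachable = torch.cuda.can_device_access_peer(src, dst)
        print(f"GPU {src} -> GPU {dst}: {'P2P ok' if reachable else 'no P2P'}")
```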