
NCCL timeout while doing multi-GPU training #20832

@aditya-sanas

Bug description

I am getting an NCCL timeout while training the model on multiple GPUs. The training usually runs for about 40k epochs and then fails with the error below:

[rank2]:[E513 13:25:57.714781669 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank1]:[E513 13:25:57.714777142 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank0]:[E513 13:25:57.714786872 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
[rank2]:[E513 13:25:57.715039178 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715055330 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715061964 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E513 13:25:57.715068256 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715087695 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715092955 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank0]:[E513 13:25:57.715112458 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715120352 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank1]:[E513 13:25:57.715130246 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E513 13:25:57.715136289 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E513 13:25:57.715137895 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715143973 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E513 13:25:57.716347552 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6aa9ca0446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6aaafa5672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6aaafacab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6aaafae51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6af7bd95c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6afacee609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6afaab9353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E513 13:25:57.716872543 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f75e5bd2446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f75e6ed7672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f75e6edeab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f75e6ee051d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f7633b0b5c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f7636c20609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f76369eb353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E513 13:25:57.716878046 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f35c3c446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f36f41672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f36f48ab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f36f4a51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f8f83b755c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f8f86c8a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8f86a55353 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
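
For context, the Timeout(ms)=1800000 in the watchdog messages is torch.distributed's default 30-minute collective timeout, and the log shows rank 0 timing out on work 532341 (_ALLGATHER_BASE) while ranks 1 and 2 are still waiting on work 532339 (BROADCAST), so the ranks appear to have drifted onto different collectives. As a point of reference only (not the reporter's script), below is a minimal sketch of how the timeout can be raised and extra NCCL/c10d diagnostics enabled through Lightning's DDPStrategy; the DDPStrategy and Trainer arguments are the standard Lightning 2.x API, while the concrete values (4 devices, a 3-hour timeout) are illustrative assumptions:

# Hedged sketch, not the reporter's script: widen the collective timeout and
# turn on NCCL / torch.distributed diagnostics for a Lightning DDP run.
import os
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Verbose NCCL and c10d logging helps show which rank stalls before the watchdog fires.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                                         # illustrative; matches the 4x T4 node listed below
    strategy=DDPStrategy(timeout=timedelta(hours=3)),  # default is 30 min, i.e. the 1800000 ms in the log
)
# trainer.fit(model, datamodule)  # model/datamodule stand in for whatever the actual run uses

Raising the timeout only widens the window; if one rank is genuinely stuck (for example on a rank-dependent code path or an uneven last batch), the collective will still hang and the watchdog will simply fire later.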

What version are you seeing the problem on?

v2.4

Reproduced in studio

No response

How to reproduce the bug

Error messages and logs

# Error messages and logs here please

Environment

Current environment
Collecting environment information...
PyTorch version: 2.4.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.27.0
Libc version: glibc-2.31

Python version: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1040-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4

Nvidia driver version: 535.230.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:                        7
CPU MHz:                         2499.996
BogoMIPS:                        4999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       768 KiB
L1i cache:                       768 KiB
L2 cache:                        24 MiB
L3 cache:                        35.8 MiB
NUMA node0 CPU(s):               0-47
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu11==11.11.3.6
[pip3] nvidia-cuda-cupti-cu11==11.8.87
[pip3] nvidia-cuda-nvrtc-cu11==11.8.89
[pip3] nvidia-cuda-runtime-cu11==11.8.89
[pip3] nvidia-cudnn-cu11==9.1.0.70
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-curand-cu11==10.3.0.86
[pip3] nvidia-cusolver-cu11==11.4.1.48
[pip3] nvidia-cusparse-cu11==11.7.5.86
[pip3] nvidia-nccl-cu11==2.20.5
[pip3] nvidia-nvtx-cu11==11.8.86
[pip3] onnx==1.17.0
[pip3] pytorch-lightning==2.4.0
[pip3] torch==2.4.1+cu118
[pip3] torchaudio==2.4.1+cu118
[pip3] torchmetrics==1.6.0
[pip3] torchvision==0.19.1
[pip3] triton==3.0.0

More info

No response

cc @justusschock

Labels

bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), repro needed (The issue is missing a reproducible example), ver: 2.4.x