
NCCL timeout while doing multi-GPU training #20832

@aditya-sanas

Bug description

I am getting an NCCL timeout while training the model on multiple GPUs. The training usually runs for about 40k epochs and then fails with the error below:

[rank2]:[E513 13:25:57.714781669 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank1]:[E513 13:25:57.714777142 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank0]:[E513 13:25:57.714786872 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
[rank2]:[E513 13:25:57.715039178 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715055330 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715061964 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E513 13:25:57.715068256 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715087695 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715092955 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank0]:[E513 13:25:57.715112458 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715120352 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank1]:[E513 13:25:57.715130246 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E513 13:25:57.715136289 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E513 13:25:57.715137895 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715143973 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E513 13:25:57.716347552 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6aa9ca0446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6aaafa5672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6aaafacab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6aaafae51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6af7bd95c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6afacee609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6afaab9353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E513 13:25:57.716872543 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f75e5bd2446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f75e6ed7672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f75e6edeab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f75e6ee051d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f7633b0b5c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f7636c20609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f76369eb353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E513 13:25:57.716878046 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f35c3c446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f36f41672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f36f48ab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f36f4a51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f8f83b755c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f8f86c8a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8f86a55353 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
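
For context, the Timeout(ms)=1800000 in the watchdog messages is torch.distributed's default 30-minute collective timeout, and the log shows rank 0 timing out on work 532341 (_ALLGATHER_BASE) while ranks 1 and 2 are still waiting on work 532339 (BROADCAST), so the ranks appear to have drifted onto different collectives. As a point of reference only (not the reporter's script), below is a minimal sketch of how the timeout can be raised and extra NCCL/c10d diagnostics enabled through Lightning's DDPStrategy; the DDPStrategy and Trainer arguments are the standard Lightning 2.x API, while the concrete values (4 devices, a 3-hour timeout) are illustrative assumptions:

# Hedged sketch, not the reporter's script: widen the collective timeout and
# turn on NCCL / torch.distributed diagnostics for a Lightning DDP run.
import os
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Verbose NCCL and c10d logging helps show which rank stalls before the watchdog fires.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                                         # illustrative; matches the 4x T4 node listed below
    strategy=DDPStrategy(timeout=timedelta(hours=3)),  # default is 30 min, i.e. the 1800000 ms in the log
)
# trainer.fit(model, datamodule)  # model/datamodule stand in for whatever the actual run uses

Raising the timeout only widens the window; if one rank is genuinely stuck (for example on a rank-dependent code path or an uneven last batch), the collective will still hang and the watchdog will simply fire later.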

What version are you seeing the problem on?

v2.4

Reproduced in studio

No response

How to reproduce the bug

Error messages and logs

# Error messages and logs here please

Environment

Current environment
Collecting environment information...
PyTorch version: 2.4.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.27.0
Libc version: glibc-2.31

Python version: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1040-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4

Nvidia driver version: 535.230.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:                        7
CPU MHz:                         2499.996
BogoMIPS:                        4999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       768 KiB
L1i cache:                       768 KiB
L2 cache:                        24 MiB
L3 cache:                        35.8 MiB
NUMA node0 CPU(s):               0-47
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu11==11.11.3.6
[pip3] nvidia-cuda-cupti-cu11==11.8.87
[pip3] nvidia-cuda-nvrtc-cu11==11.8.89
[pip3] nvidia-cuda-runtime-cu11==11.8.89
[pip3] nvidia-cudnn-cu11==9.1.0.70
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-curand-cu11==10.3.0.86
[pip3] nvidia-cusolver-cu11==11.4.1.48
[pip3] nvidia-cusparse-cu11==11.7.5.86
[pip3] nvidia-nccl-cu11==2.20.5
[pip3] nvidia-nvtx-cu11==11.8.86
[pip3] onnx==1.17.0
[pip3] pytorch-lightning==2.4.0
[pip3] torch==2.4.1+cu118
[pip3] torchaudio==2.4.1+cu118
[pip3] torchmetrics==1.6.0
[pip3] torchvision==0.19.1
[pip3] triton==3.0.0

More info

No response

cc @justusschock

Labels

bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), repro needed (The issue is missing a reproducible example), ver: 2.4.x