enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered' #20757

@Yingshu-Li

Bug description

When I train the model with DDP on two 5090 GPUs, the following error occurs:

enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'

With a single GPU, training starts successfully.

What version are you seeing the problem on?

v2.5

How to reproduce the bug

I set the following environment variables to track down the error:
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
export NCCL_P2P_DISABLE=1
export CUDA_LAUNCH_BLOCKING=1
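
When the training script is started from Python rather than from a shell, the same debug settings can be applied in-process. This is a minimal sketch, not part of Lightning or PyTorch: `DEBUG_ENV` and `apply_debug_env` are hypothetical names, and the sketch assumes it runs before `torch` is imported and before any CUDA or NCCL work starts.

```python
import faulthandler
import os

# Debug settings copied from the exports in this report.
# Environment variables are always strings.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",
    "PYTHONFAULTHANDLER": "1",
    "NCCL_P2P_DISABLE": "1",
    "CUDA_LAUNCH_BLOCKING": "1",
}


def apply_debug_env(env=None):
    """Write the debug variables into `env` (default: os.environ) and return it."""
    target = os.environ if env is None else env
    target.update(DEBUG_ENV)
    return target


if __name__ == "__main__":
    apply_debug_env()
    # PYTHONFAULTHANDLER is only read at interpreter startup, so setting it
    # here affects child processes only. Enable the handler explicitly for
    # the current process.
    faulthandler.enable()
```

Passing a plain dict instead of `os.environ` makes it possible to build an environment for `subprocess.run(..., env=...)` without touching the current process.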

Error messages and logs

user-MH53-G40-001:10225:10225 [0] NCCL INFO Bootstrap: Using enp1s0f0np0:10.70.59.78<0>
user-MH53-G40-001:10225:10225 [0] NCCL INFO cudaDriverVersion 12080
user-MH53-G40-001:10225:10225 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
user-MH53-G40-001:10225:10225 [0] NCCL INFO Comm config Blocking set to 1
user-MH53-G40-001:10225:10549 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
user-MH53-G40-001:10225:10549 [0] NCCL INFO NET/IB : No device found.
user-MH53-G40-001:10225:10549 [0] NCCL INFO NET/IB : Using [RO]; OOB enp1s0f0np0:10.70.59.78<0>
user-MH53-G40-001:10225:10549 [0] NCCL INFO NET/Socket : Using [0]enp1s0f0np0:10.70.59.78<0>
user-MH53-G40-001:10225:10549 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
user-MH53-G40-001:10225:10549 [0] NCCL INFO Using network Socket
user-MH53-G40-001:10225:10549 [0] NCCL INFO ncclCommInitRankConfig comm 0x1ca810c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 81000 commId 0xb5061c20698dde7e - Init START
user-MH53-G40-001:10225:10549 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
user-MH53-G40-001:10225:10549 [0] NCCL INFO Bootstrap timings total 0.020280 (create 0.000026, send 0.000094, recv 0.019841, ring 0.000048, delay 0.000000)
user-MH53-G40-001:10225:10549 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 1
user-MH53-G40-001:10225:10549 [0] NCCL INFO comm 0x1ca810c0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
user-MH53-G40-001:10225:10549 [0] NCCL INFO Channel 00/04 : 0 1
user-MH53-G40-001:10225:10549 [0] NCCL INFO Channel 01/04 : 0 1
user-MH53-G40-001:10225:10549 [0] NCCL INFO Channel 02/04 : 0 1
user-MH53-G40-001:10225:10549 [0] NCCL INFO Channel 03/04 : 0 1
user-MH53-G40-001:10225:10549 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
user-MH53-G40-001:10225:10549 [0] NCCL INFO P2P Chunksize set to 131072
user-MH53-G40-001:10225:10549 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
user-MH53-G40-001:10225:10554 [0] NCCL INFO [Proxy Service] Device 0 CPU core 22
user-MH53-G40-001:10225:10556 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 34
user-MH53-G40-001:10225:10549 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
user-MH53-G40-001:10225:10549 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
user-MH53-G40-001:10225:10549 [0] NCCL INFO CC Off, workFifoBytes 1048576
user-MH53-G40-001:10225:10549 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
user-MH53-G40-001:10225:10549 [0] NCCL INFO ncclCommInitRankConfig comm 0x1ca810c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 81000 commId 0xb5061c20698dde7e - Init COMPLETE
user-MH53-G40-001:10225:10549 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.55 (kernels 0.52, alloc 0.00, bootstrap 0.02, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.00, rest 0.00)
user-MH53-G40-001:10225:10558 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
user-MH53-G40-001:10225:10558 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
user-MH53-G40-001:10225:10558 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
user-MH53-G40-001:10225:10558 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
user-MH53-G40-001:10225:10558 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

[2025-04-27 20:37:53] user-MH53-G40-001:10225:10225 [0] enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
user-MH53-G40-001:10225:10225 [0] NCCL INFO group.cc:241 -> 1
user-MH53-G40-001:10225:10225 [0] NCCL INFO group.cc:478 -> 1
user-MH53-G40-001:10225:10225 [0] NCCL INFO group.cc:581 -> 1
user-MH53-G40-001:10225:10225 [0] NCCL INFO enqueue.cc:2299 -> 1
Fatal Python error: Aborted

Thread 0x0000742af2ffd6c0 (most recent call first):
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 324 in wait
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/queue.py", line 180 in get
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x0000742c9edde6c0 (most recent call first):
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/fabric/strategies/launchers/subprocess_script.py", line 204 in run
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x0000742ca7fff6c0 (most recent call first):
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 324 in wait
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 607 in wait
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x0000742ec57cc600 (most recent call first):
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/torch/distributed/utils.py", line 322 in _sync_params_and_buffers
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/torch/distributed/utils.py", line 311 in _sync_module_states
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 837 in __init__
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 195 in _setup_model
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 283 in configure_ddp
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 171 in setup
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 963 in _run
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580 in _fit_impl
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105 in launch
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43 in _call_and_handle_interrupt
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544 in fit
  File "/mnt/sda/yingshu/R2GenGPT/train.py", line 44 in train
  File "/mnt/sda/yingshu/R2GenGPT/train.py", line 51 in main
  File "/mnt/sda/yingshu/R2GenGPT/train.py", line 55 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy._lib._ccallback_c, scipy.signal._sigtools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy._lib._uarray._uarray, scipy.signal._max_len_seq_inner, scipy.signal._upfirdn_apply, scipy.signal._spline, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, 
scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, scipy.signal._sosfilt, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.signal._peak_finding_utils, PIL._imaging, PIL._imagingft, regex._regex, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, simsimd, stringzilla, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, 
pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, psutil._psutil_linux, psutil._psutil_posix, google._upb._message (total: 164)
Fatal Python error: Aborted

Thread 0x0000766a215fe6c0 (most recent call first):
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 324 in wait
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 607 in wait
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x0000766c3a667740 (most recent call first):
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/torch/distributed/utils.py", line 322 in _sync_params_and_buffers
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/torch/distributed/utils.py", line 311 in _sync_module_states
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 837 in __init__
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 195 in _setup_model
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 283 in configure_ddp
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 171 in setup
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 963 in _run
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580 in _fit_impl
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105 in launch
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43 in _call_and_handle_interrupt
  File "/home/yingshu/.conda/envs/r2gen/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544 in fit
  File "/mnt/sda/yingshu/R2GenGPT/train.py", line 44 in train
  File "/mnt/sda/yingshu/R2GenGPT/train.py", line 51 in main
  File "/mnt/sda/yingshu/R2GenGPT/train.py", line 55 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy._lib._ccallback_c, scipy.signal._sigtools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy._lib._uarray._uarray, scipy.signal._max_len_seq_inner, scipy.signal._upfirdn_apply, scipy.signal._spline, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, 
scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, scipy.signal._sosfilt, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.signal._peak_finding_utils, PIL._imaging, PIL._imagingft, regex._regex, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, simsimd, stringzilla, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, 
pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, psutil._psutil_linux, psutil._psutil_posix (total: 163)

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.5.1.post0
#- PyTorch Version (e.g., 2.5): 2.7.0+cu128
#- Python version (e.g., 3.12): 3.10
#- OS (e.g., Linux): ubuntu
#- CUDA/cuDNN version: 12.8
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
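
The NCCL log above also reports version strings (`cudaDriverVersion 12080`, `NCCL version 2.26.2+cuda12.2`) that can be read next to the environment fields. Whether any of them is related to the failure is not established in this report. The sketch only extracts them from a log for side-by-side comparison; `summarize_nccl_log` is a hypothetical helper, and the regular expressions assume the `NCCL INFO` line format shown in this log.

```python
import re

# Patterns match the two NCCL INFO lines that carry version information.
NCCL_VERSION_RE = re.compile(r"NCCL INFO NCCL version (\S+)")
DRIVER_RE = re.compile(r"NCCL INFO cudaDriverVersion (\d+)")


def summarize_nccl_log(text):
    """Return the NCCL build string and CUDA driver version found in a log."""
    nccl = NCCL_VERSION_RE.search(text)
    driver = DRIVER_RE.search(text)
    summary = {
        "nccl_version": nccl.group(1) if nccl else None,
        "cuda_driver": None,
    }
    if driver:
        raw = int(driver.group(1))
        # CUDA encodes versions as 1000 * major + 10 * minor.
        summary["cuda_driver"] = f"{raw // 1000}.{(raw % 1000) // 10}"
    return summary


sample = """\
user-MH53-G40-001:10225:10225 [0] NCCL INFO cudaDriverVersion 12080
user-MH53-G40-001:10225:10225 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
"""
print(summarize_nccl_log(sample))
# {'nccl_version': '2.26.2+cuda12.2', 'cuda_driver': '12.8'}
```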

More info

No response

Labels: bug, needs triage, ver: 2.5.x