NCCL backend fails during multi-node, multi-GPU training #20306

@raketenolli

Bug description

I set up a training run on a Slurm cluster, requesting 2 nodes with 4 GPUs each. During initialization, I ran into the behavior described in the PyTorch issue "Unexpected behavior (times out) of all_gather_into_tensor with subgroups".

Apparently, this has not been solved at the PyTorch or NCCL level, but there is a workaround (described in this post on that same issue).

If fixing the underlying problem outright is not possible, how and where could this workaround be implemented in PyTorch Lightning?
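One place such a workaround might hook in is a custom strategy that runs extra code in every process before Lightning initializes the process group. The sketch below is an assumption, not an official Lightning recipe: `PreInitHookMixin`, `build_patched_strategy` and `pre_init_hook` are hypothetical names, and the hook body is left to the caller because the workaround itself is only described in the linked PyTorch issue. It relies on `DDPStrategy.setup_environment()` being the point where the distributed connection is set up, which appears to hold for 2.4 but should be verified against the installed version.

```python
from typing import Callable, Optional


class PreInitHookMixin:
    """Runs a user-supplied hook before the parent class sets up the environment.

    Hypothetical helper: meant to be mixed into a Lightning strategy so the
    hook executes in each process before the process group is created.
    """

    # Set on the instance after construction; None means "no hook".
    pre_init_hook: Optional[Callable[[], None]] = None

    def setup_environment(self) -> None:
        if self.pre_init_hook is not None:
            self.pre_init_hook()
        # Delegate to the next class in the MRO (e.g. DDPStrategy).
        super().setup_environment()


def build_patched_strategy(hook: Callable[[], None]):
    """Return a DDPStrategy that calls `hook` before distributed setup.

    The import is done lazily so this module can be loaded without Lightning.
    """
    from pytorch_lightning.strategies import DDPStrategy

    class PatchedDDPStrategy(PreInitHookMixin, DDPStrategy):
        pass

    strategy = PatchedDDPStrategy()
    strategy.pre_init_hook = hook
    return strategy
```

The returned object could then be passed as `Trainer(strategy=build_patched_strategy(my_hook), num_nodes=2, devices=4)`, where `my_hook` is whatever the workaround requires. If the workaround has to run after the process group exists rather than before it, the hook call would need to move below the `super().setup_environment()` line.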

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please
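For anyone trying to capture logs for this kind of timeout, more verbose output can usually be requested through environment variables that NCCL and PyTorch read at startup. The following is a minimal sketch: the variable names (`NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, `TORCH_DISTRIBUTED_DEBUG`) are documented by those projects, while the chosen values and the helper `enable_distributed_debug_logging` are illustrative assumptions. It has to run in every process before the process group is created, or the same variables can be exported in the Slurm batch script instead.

```python
import os

# Documented debug variables; the values are picked for illustration only.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",               # NCCL log verbosity
    "NCCL_DEBUG_SUBSYS": "INIT,COLL",   # limit NCCL logs to init and collectives
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra checks/logging in torch.distributed
}


def enable_distributed_debug_logging(env=os.environ):
    """Set each debug variable unless it is already set.

    Returns a dict of the variables that were actually applied, so values
    exported by the job script are never overwritten.
    """
    applied = {}
    for key, value in DEBUG_ENV.items():
        if key not in env:
            env[key] = value
            applied[key] = value
    return applied
```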

Environment

I'm working on a Slurm cluster with 2 head nodes (no GPUs), 6 compute nodes (configuration below), and NFS-mounted data storage.

<details>
  <summary>Current environment</summary>

* CUDA:
        - GPU:
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
        - available:         True
        - version:           12.1
* Lightning:
        - lightning-utilities: 0.11.7
        - pytorch-lightning: 2.4.0
        - torch:             2.4.1+cu121
        - torchmetrics:      1.4.2
        - torchvision:       0.19.1+cu121
* Packages:
        - absl-py:           2.1.0
        - aiohappyeyeballs:  2.4.0
        - aiohttp:           3.10.5
        - aiosignal:         1.3.1
        - albucore:          0.0.16
        - albumentations:    1.4.15
        - annotated-types:   0.7.0
        - async-timeout:     4.0.3
        - attrs:             24.2.0
        - certifi:           2024.8.30
        - charset-normalizer: 3.3.2
        - contourpy:         1.3.0
        - cycler:            0.12.1
        - eval-type-backport: 0.2.0
        - filelock:          3.13.1
        - fonttools:         4.53.1
        - frozenlist:        1.4.1
        - fsspec:            2024.2.0
        - future:            1.0.0
        - geopandas:         1.0.1
        - grpcio:            1.66.1
        - huggingface-hub:   0.25.0
        - idna:              3.10
        - imageio:           2.35.1
        - imgaug:            0.4.0
        - jinja2:            3.1.3
        - joblib:            1.4.2
        - kiwisolver:        1.4.7
        - lazy-loader:       0.4
        - lightning-utilities: 0.11.7
        - markdown:          3.7
        - matplotlib:        3.9.2
        - mpmath:            1.3.0
        - msgpack:           1.1.0
        - multidict:         6.1.0
        - networkx:          3.2.1
        - numpy:             1.26.3
        - nvidia-cublas-cu12: 12.1.3.1
        - nvidia-cuda-cupti-cu12: 12.1.105
        - nvidia-cuda-nvrtc-cu12: 12.1.105
        - nvidia-cuda-runtime-cu12: 12.1.105
        - nvidia-cudnn-cu12: 9.1.0.70
        - nvidia-cufft-cu12: 11.0.2.54
        - nvidia-curand-cu12: 10.3.2.106
        - nvidia-cusolver-cu12: 11.4.5.107
        - nvidia-cusparse-cu12: 12.1.0.106
        - nvidia-nccl-cu12:  2.20.5
        - nvidia-nvjitlink-cu12: 12.1.105
        - nvidia-nvtx-cu12:  12.1.105
        - opencv-python:     4.10.0.84
        - opencv-python-headless: 4.10.0.84
        - packaging:         24.1
        - pandas:            2.2.2
        - pillow:            10.2.0
        - pip:               22.3.1
        - protobuf:          5.28.1
        - pydantic:          2.9.2
        - pydantic-core:     2.23.4
        - pyogrio:           0.9.0
        - pyparsing:         3.1.4
        - pyproj:            3.6.1
        - python-dateutil:   2.9.0.post0
        - pytorch-lightning: 2.4.0
        - pytz:              2024.2
        - pyyaml:            6.0.2
        - requests:          2.32.3
        - s2sphere:          0.2.5
        - safetensors:       0.4.5
        - scikit-image:      0.24.0
        - scikit-learn:      1.5.2
        - scipy:             1.14.1
        - setuptools:        65.5.0
        - shapely:           2.0.6
        - six:               1.16.0
        - sympy:             1.12
        - tensorboard:       2.17.1
        - tensorboard-data-server: 0.7.2
        - threadpoolctl:     3.5.0
        - tifffile:          2024.8.30
        - timm:              1.0.9
        - torch:             2.4.1+cu121
        - torchmetrics:      1.4.2
        - torchvision:       0.19.1+cu121
        - tqdm:              4.66.5
        - triton:            3.0.0
        - typing-extensions: 4.9.0
        - tzdata:            2024.1
        - urllib3:           2.2.3
        - werkzeug:          3.0.4
        - yarl:              1.11.1
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.10.9
        - release:           5.15.0-50-generic
        - version:           #56~20.04.1-Ubuntu SMP Tue Sep 27 15:51:29 UTC 2022

</details>

More info

No response

cc @justusschock @lantiga
