NCCL backend fails during multi-node, multi-GPU training #20306

### Bug description

I set up a training run on a Slurm cluster, specifying 2 nodes with 4 GPUs each. During initialization, I ran into the behavior described in the PyTorch issue [Unexpected behavior (times out) of all_gather_into_tensor with subgroups](https://github.com/pytorch/pytorch/issues/134006#top).

Apparently, this issue has not been solved at the PyTorch or NCCL level, but there is a workaround (described in [this comment](https://github.com/pytorch/pytorch/issues/134006#issuecomment-2300041017) on that same issue).

How and where could this workaround be implemented in PyTorch Lightning, if fixing the underlying problem outright is not possible?

### What version are you seeing the problem on?

v2.4

### How to reproduce the bug

_No response_

### Error messages and logs

```
# Error messages and logs here please
```
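
No log output is attached above. One way to collect it on a rerun is to turn on NCCL's and `torch.distributed`'s own diagnostics before the process group is created. The snippet below is only a sketch: `enable_nccl_diagnostics` and `NCCL_DIAGNOSTIC_ENV` are hypothetical names made up for illustration, and the chosen subsystems (`INIT,NET`) are an assumption about what is relevant to an initialization timeout. The environment variables themselves (`NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, `TORCH_DISTRIBUTED_DEBUG`) are the documented ones.

```python
import os

# Diagnostic settings; the specific values are an assumption, adjust as needed.
# NCCL_DEBUG / NCCL_DEBUG_SUBSYS are read by NCCL,
# TORCH_DISTRIBUTED_DEBUG is read by torch.distributed.
NCCL_DIAGNOSTIC_ENV = {
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,NET",
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
}


def enable_nccl_diagnostics(env=None):
    """Hypothetical helper: set diagnostic variables in `env` (default: os.environ).

    Values that are already set (for example exported in the sbatch script)
    are left untouched.
    """
    target = os.environ if env is None else env
    for key, value in NCCL_DIAGNOSTIC_ENV.items():
        target.setdefault(key, value)
    return target
```

For this to have any effect, it would need to run in every rank's process before the `Trainer` starts the distributed setup, for example at the top of the training script launched via `srun`.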


### Environment

I'm working on a Slurm cluster with 2 head nodes (no GPUs), 6 compute nodes (configuration below), and NFS-mounted data storage.

```
<details>
  <summary>Current environment</summary>

* CUDA:
        - GPU:
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
        - available:         True
        - version:           12.1
* Lightning:
        - lightning-utilities: 0.11.7
        - pytorch-lightning: 2.4.0
        - torch:             2.4.1+cu121
        - torchmetrics:      1.4.2
        - torchvision:       0.19.1+cu121
* Packages:
        - absl-py:           2.1.0
        - aiohappyeyeballs:  2.4.0
        - aiohttp:           3.10.5
        - aiosignal:         1.3.1
        - albucore:          0.0.16
        - albumentations:    1.4.15
        - annotated-types:   0.7.0
        - async-timeout:     4.0.3
        - attrs:             24.2.0
        - certifi:           2024.8.30
        - charset-normalizer: 3.3.2
        - contourpy:         1.3.0
        - cycler:            0.12.1
        - eval-type-backport: 0.2.0
        - filelock:          3.13.1
        - fonttools:         4.53.1
        - frozenlist:        1.4.1
        - fsspec:            2024.2.0
        - future:            1.0.0
        - geopandas:         1.0.1
        - grpcio:            1.66.1
        - huggingface-hub:   0.25.0
        - idna:              3.10
        - imageio:           2.35.1
        - imgaug:            0.4.0
        - jinja2:            3.1.3
        - joblib:            1.4.2
        - kiwisolver:        1.4.7
        - lazy-loader:       0.4
        - lightning-utilities: 0.11.7
        - markdown:          3.7
        - matplotlib:        3.9.2
        - mpmath:            1.3.0
        - msgpack:           1.1.0
        - multidict:         6.1.0
        - networkx:          3.2.1
        - numpy:             1.26.3
        - nvidia-cublas-cu12: 12.1.3.1
        - nvidia-cuda-cupti-cu12: 12.1.105
        - nvidia-cuda-nvrtc-cu12: 12.1.105
        - nvidia-cuda-runtime-cu12: 12.1.105
        - nvidia-cudnn-cu12: 9.1.0.70
        - nvidia-cufft-cu12: 11.0.2.54
        - nvidia-curand-cu12: 10.3.2.106
        - nvidia-cusolver-cu12: 11.4.5.107
        - nvidia-cusparse-cu12: 12.1.0.106
        - nvidia-nccl-cu12:  2.20.5
        - nvidia-nvjitlink-cu12: 12.1.105
        - nvidia-nvtx-cu12:  12.1.105
        - opencv-python:     4.10.0.84
        - opencv-python-headless: 4.10.0.84
        - packaging:         24.1
        - pandas:            2.2.2
        - pillow:            10.2.0
        - pip:               22.3.1
        - protobuf:          5.28.1
        - pydantic:          2.9.2
        - pydantic-core:     2.23.4
        - pyogrio:           0.9.0
        - pyparsing:         3.1.4
        - pyproj:            3.6.1
        - python-dateutil:   2.9.0.post0
        - pytorch-lightning: 2.4.0
        - pytz:              2024.2
        - pyyaml:            6.0.2
        - requests:          2.32.3
        - s2sphere:          0.2.5
        - safetensors:       0.4.5
        - scikit-image:      0.24.0
        - scikit-learn:      1.5.2
        - scipy:             1.14.1
        - setuptools:        65.5.0
        - shapely:           2.0.6
        - six:               1.16.0
        - sympy:             1.12
        - tensorboard:       2.17.1
        - tensorboard-data-server: 0.7.2
        - threadpoolctl:     3.5.0
        - tifffile:          2024.8.30
        - timm:              1.0.9
        - torch:             2.4.1+cu121
        - torchmetrics:      1.4.2
        - torchvision:       0.19.1+cu121
        - tqdm:              4.66.5
        - triton:            3.0.0
        - typing-extensions: 4.9.0
        - tzdata:            2024.1
        - urllib3:           2.2.3
        - werkzeug:          3.0.4
        - yarl:              1.11.1
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.10.9
        - release:           5.15.0-50-generic
        - version:           #56~20.04.1-Ubuntu SMP Tue Sep 27 15:51:29 UTC 2022

</details>
```

### More info

_No response_

cc @justusschock @lantiga
