
Slurm multi-node works fine but multi-GPU doesn't #20438


@atifkhanncl

Bug description

I am training a sample model that works fine on multiple GPUs as long as each GPU sits on a different node. As soon as I allocate more than one GPU per node, the job fails with:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'
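
As a sanity check, this is the kind of per-task diagnostic I would run under the same srun allocation (my own sketch; binding by SLURM_LOCALID and the ROCm visibility variables are assumptions about this cluster, not something Lightning sets). My unconfirmed suspicion is that two ranks on the same node end up pointed at the same device, which would match the 'invalid device pointer' failure:

import os
import torch


def report_gpu_visibility() -> None:
    # What does this SLURM task see? (ROCm honours HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES)
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))
    print(
        f"host={os.uname().nodename} "
        f"SLURM_PROCID={os.environ.get('SLURM_PROCID')} "
        f"SLURM_LOCALID={local_rank} "
        f"HIP_VISIBLE_DEVICES={os.environ.get('HIP_VISIBLE_DEVICES')} "
        f"ROCR_VISIBLE_DEVICES={os.environ.get('ROCR_VISIBLE_DEVICES')} "
        f"device_count={torch.cuda.device_count()}"
    )
    if torch.cuda.device_count() > 0:
        # Bind this task to "its" GPU before any collective runs (assumption: one GPU per local rank)
        torch.cuda.set_device(local_rank % torch.cuda.device_count())
        print(f"current_device={torch.cuda.current_device()} name={torch.cuda.get_device_name()}")


if __name__ == "__main__":
    report_gpu_visibility()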

What version are you seeing the problem on?

v2.4

How to reproduce the bug

Python training script:

from pytorch_lightning.demos.boring_classes import BoringModel, BoringDataModule
from pytorch_lightning import Trainer
import os


def main():
    print(
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK', 0)}, SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, SLURM_NTASKS_PER_NODE={os.environ.get('SLURM_NTASKS_PER_NODE')}"
    )
    model = BoringModel()
    datamodule = BoringDataModule()
    trainer = Trainer(max_epochs=100, devices=2, num_nodes=4)
    print(f"trainer.num_devices: {trainer.num_devices}")
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    main()

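For reference, here is a small check of what Lightning's SLURM plugin reports for each task (a sketch only; it assumes the same srun launch and the SLURMEnvironment plugin shipped with pytorch_lightning):

from pytorch_lightning.plugins.environments import SLURMEnvironment

# Diagnostic sketch: confirm Lightning detects the SLURM allocation and which ranks it derives.
if SLURMEnvironment.detect():
    env = SLURMEnvironment()
    print(
        f"node_rank={env.node_rank()} global_rank={env.global_rank()} "
        f"local_rank={env.local_rank()} world_size={env.world_size()}"
    )
else:
    print("SLURMEnvironment not detected; Lightning would fall back to another launcher")

With --nodes=4 and --ntasks-per-node=2 I would expect local_rank to alternate between 0 and 1 on every node and world_size to be 8.
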

Slurm sbatch.sh file:

#!/bin/bash 
#SBATCH --job-name=rocm_DDP_lightining
#SBATCH --nodes=4
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=96g
#SBATCH --output=/mnt/jobOutput/sample.out
#SBATCH --error=/mnt/jobErrors/sample.err
#SBATCH --time=0-02:00:00
#SBATCH --cpus-per-task=10
#SBATCH --partition=rocm
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
srun python /mnt/sample_lightning.py

Error messages and logs

Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends. (printed once per rank, 8 times in total)
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
You are using a CUDA device ('AMD Instinct MI50/MI60') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 109, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 18, in main
[rank0]:     trainer.fit(model, dm)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 938, in _run
[rank0]:     self.__setup_profiler()
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1071, in __setup_profiler
[rank0]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]:                                                                             ^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1233, in log_dir
[rank0]:     dirpath = self.strategy.broadcast(dirpath)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank0]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid device pointer'

.
.
.
[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'
srun: error: clust1-rocm-6: task 5: Exited with exit code 1
srun: error: clust1-rocm-3: task 1: Exited with exit code 1
srun: error: clust1-rocm-4: task 3: Exited with exit code 1
srun: error: clust1-rocm-8: task 7: Exited with exit code 1
srun: error: clust1-rocm-6: task 4: Exited with exit code 1
srun: error: clust1-rocm-4: task 2: Exited with exit code 1
srun: error: clust1-rocm-8: task 6: Exited with exit code 1
srun: error: clust1-rocm-3: task 0: Exited with exit code 1
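
Since the traceback ends in the first collective (broadcast inside broadcast_object_list), a Lightning-free reproducer like the sketch below, launched with the same sbatch file, might show whether plain torch.distributed over NCCL/RCCL also fails with two ranks per node (the MASTER_ADDR/MASTER_PORT wiring from SLURM variables is only an illustration):

import os

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    # Assumption: SLURM_LAUNCH_NODE_IPADDR is reachable from all nodes; adjust the
    # rendezvous address/port if the cluster needs an explicit hostname.
    os.environ.setdefault("MASTER_ADDR", os.environ.get("SLURM_LAUNCH_NODE_IPADDR", "127.0.0.1"))
    os.environ.setdefault("MASTER_PORT", "29500")

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Same kind of collective that fails in the Lightning run
    t = torch.full((1,), float(rank), device="cuda")
    dist.broadcast(t, src=0)
    print(f"rank {rank}: broadcast ok, value={t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()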

Environment

- PyTorch Lightning Version: 2.4.0
- PyTorch Version: 2.3.1+rocm5.7
- Python version: 3.11.0
- OS: Linux 4.18.0-372.32.1.el8_6.x86_64 (RHEL)
- CUDA/ROCm version: ROCm 5.7
- GPU models and configuration: AMD Instinct MI50/MI60
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

cc @justusschock @lantiga
