
Slurm multi-node works fine but multi-GPU doesn't #20438


@atifkhanncl

Bug description

I am training a sample model that works fine on multiple GPUs as long as each GPU sits on a different node. As soon as I allocate more than one GPU per node, the job fails with:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'
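
As a sanity check, this is the kind of per-task diagnostic I would run under the same srun allocation (my own sketch; binding by SLURM_LOCALID and the ROCm visibility variables are assumptions about this cluster, not something Lightning sets). My unconfirmed suspicion is that two ranks on the same node end up pointed at the same device, which would match the 'invalid device pointer' failure:

import os
import torch


def report_gpu_visibility() -> None:
    # What does this SLURM task see? (ROCm honours HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES)
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))
    print(
        f"host={os.uname().nodename} "
        f"SLURM_PROCID={os.environ.get('SLURM_PROCID')} "
        f"SLURM_LOCALID={local_rank} "
        f"HIP_VISIBLE_DEVICES={os.environ.get('HIP_VISIBLE_DEVICES')} "
        f"ROCR_VISIBLE_DEVICES={os.environ.get('ROCR_VISIBLE_DEVICES')} "
        f"device_count={torch.cuda.device_count()}"
    )
    if torch.cuda.device_count() > 0:
        # Bind this task to "its" GPU before any collective runs (assumption: one GPU per local rank)
        torch.cuda.set_device(local_rank % torch.cuda.device_count())
        print(f"current_device={torch.cuda.current_device()} name={torch.cuda.get_device_name()}")


if __name__ == "__main__":
    report_gpu_visibility()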

What version are you seeing the problem on?

v2.4

How to reproduce the bug

Python training script:

from pytorch_lightning.demos.boring_classes import BoringModel, BoringDataModule
from pytorch_lightning import Trainer
import os


def main():
    print(
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK', 0)}, SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, SLURM_NTASKS_PER_NODE={os.environ.get('SLURM_NTASKS_PER_NODE')}"
    )
    model = BoringModel()
    datamodule = BoringDataModule()
    trainer = Trainer(max_epochs=100, devices=2, num_nodes=4)
    print(f"trainer.num_devices: {trainer.num_devices}")
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    main()

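For reference, here is a small check of what Lightning's SLURM plugin reports for each task (a sketch only; it assumes the same srun launch and the SLURMEnvironment plugin shipped with pytorch_lightning):

from pytorch_lightning.plugins.environments import SLURMEnvironment

# Diagnostic sketch: confirm Lightning detects the SLURM allocation and which ranks it derives.
if SLURMEnvironment.detect():
    env = SLURMEnvironment()
    print(
        f"node_rank={env.node_rank()} global_rank={env.global_rank()} "
        f"local_rank={env.local_rank()} world_size={env.world_size()}"
    )
else:
    print("SLURMEnvironment not detected; Lightning would fall back to another launcher")

With --nodes=4 and --ntasks-per-node=2 I would expect local_rank to alternate between 0 and 1 on every node and world_size to be 8.
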

Slurm sbatch.sh file:

#!/bin/bash 
#SBATCH --job-name=rocm_DDP_lightining
#SBATCH --nodes=4
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=96g
#SBATCH --output=/mnt/jobOutput/sample.out
#SBATCH --error=/mnt/jobErrors/sample.err
#SBATCH --time=0-02:00:00
#SBATCH --cpus-per-task=10
#SBATCH --partition=rocm
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
srun python /mnt/sample_lightning.py

Error messages and logs

Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends. (printed once per rank, 8 times in total)
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
You are using a CUDA device ('AMD Instinct MI50/MI60') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 109, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 18, in main
[rank0]:     trainer.fit(model, dm)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 938, in _run
[rank0]:     self.__setup_profiler()
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1071, in __setup_profiler
[rank0]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]:                                                                             ^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1233, in log_dir
[rank0]:     dirpath = self.strategy.broadcast(dirpath)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank0]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid device pointer'

.
.
.
[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'
srun: error: clust1-rocm-6: task 5: Exited with exit code 1
srun: error: clust1-rocm-3: task 1: Exited with exit code 1
srun: error: clust1-rocm-4: task 3: Exited with exit code 1
srun: error: clust1-rocm-8: task 7: Exited with exit code 1
srun: error: clust1-rocm-6: task 4: Exited with exit code 1
srun: error: clust1-rocm-4: task 2: Exited with exit code 1
srun: error: clust1-rocm-8: task 6: Exited with exit code 1
srun: error: clust1-rocm-3: task 0: Exited with exit code 1
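
Since the traceback ends in the first collective (broadcast inside broadcast_object_list), a Lightning-free reproducer like the sketch below, launched with the same sbatch file, might show whether plain torch.distributed over NCCL/RCCL also fails with two ranks per node (the MASTER_ADDR/MASTER_PORT wiring from SLURM variables is only an illustration):

import os

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    # Assumption: SLURM_LAUNCH_NODE_IPADDR is reachable from all nodes; adjust the
    # rendezvous address/port if the cluster needs an explicit hostname.
    os.environ.setdefault("MASTER_ADDR", os.environ.get("SLURM_LAUNCH_NODE_IPADDR", "127.0.0.1"))
    os.environ.setdefault("MASTER_PORT", "29500")

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Same kind of collective that fails in the Lightning run
    t = torch.full((1,), float(rank), device="cuda")
    dist.broadcast(t, src=0)
    print(f"rank {rank}: broadcast ok, value={t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()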

Environment

- PyTorch Lightning Version: 2.4.0
- PyTorch Version: 2.3.1+rocm5.7
- Python version: 3.11.0
- OS: Linux 4.18.0-372.32.1.el8_6.x86_64 (RHEL)
- CUDA/ROCm version: ROCm 5.7
- GPU models and configuration: AMD Instinct MI50/MI60
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

cc @justusschock @lantiga
