Problem in multi-gpu training #20264

@xizaoqu

Bug description

Hi, my problem is that even though my environment has multiple GPUs, training only runs on one GPU. Could you help me?

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1                                                                          
----------------------------------------------------------------------------------------------------                           
distributed_backend=nccl                                                                                                       
All distributed processes registered. Starting with 1 processes 
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1,2]
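
The mismatch in the log above (one registered process, two visible GPUs) can be made explicit with a small check. This is only a minimal sketch: `parse_world_size` and `parse_visible_devices` are hypothetical helper names, not part of Lightning, and they assume the log lines keep the format shown here.

```python
import re


def parse_world_size(log_line: str) -> int:
    """Return n from a 'MEMBER: i/n' field in a Lightning startup line."""
    match = re.search(r"MEMBER:\s*(\d+)/(\d+)", log_line)
    if match is None:
        raise ValueError("no 'MEMBER: i/n' field found")
    return int(match.group(2))


def parse_visible_devices(log_line: str) -> list:
    """Return the GPU indices listed after 'CUDA_VISIBLE_DEVICES:'."""
    match = re.search(r"CUDA_VISIBLE_DEVICES:\s*\[([^\]]*)\]", log_line)
    if match is None:
        raise ValueError("no 'CUDA_VISIBLE_DEVICES: [...]' field found")
    return [int(tok) for tok in match.group(1).split(",") if tok.strip()]


# The two lines below are copied from the log in this report.
init_line = "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1"
device_line = "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1,2]"

world_size = parse_world_size(init_line)
visible = parse_visible_devices(device_line)

print(f"processes registered: {world_size}, GPUs visible: {len(visible)}")
if world_size < len(visible):
    print("fewer processes were started than GPUs are visible")
```

For this log the check reports one registered process against two visible GPUs, which is the symptom described above.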

What version are you seeing the problem on?

v2.1

How to reproduce the bug

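The training script:
    import pytorch_lightning as pl

    # model and data_module are defined elsewhere in the script
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,          # request two GPUs
        max_epochs=100,
        strategy="ddp",     # one process per device
    )

    trainer.fit(model, data_module)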
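
Before the `Trainer` is constructed, it may help to confirm how many GPUs the Python process can actually see. The following is a minimal sketch under stated assumptions: `gpu_preflight` is a hypothetical helper name (not a Lightning API), and the device count is only filled in when `torch` can be imported.

```python
import os


def gpu_preflight(requested_devices: int) -> dict:
    """Collect basic GPU-visibility facts for comparison with `devices=`."""
    report = {
        "requested_devices": requested_devices,
        # None means the variable is unset, so all GPUs are visible to CUDA.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_device_count": None,
    }
    try:
        import torch  # imported lazily so the helper also runs without torch

        report["torch_device_count"] = torch.cuda.device_count()
    except ImportError:
        pass  # leave the count as None when torch is not installed
    return report


report = gpu_preflight(requested_devices=2)
for key, value in report.items():
    print(f"{key}: {value}")
```

If `torch_device_count` is lower than `requested_devices`, the limitation lies in the environment rather than in the `Trainer` arguments; if it matches, the single-process startup has to come from somewhere else.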

Error messages and logs

# Error messages and logs here please

Environment

Current environment
python                    3.9.19               h955ad1f_1  
pytorch-lightning         2.0.0                    pypi_0    pypi
pyyaml                    6.0.2                    pypi_0    pypi
readline                  8.2                  h5eee18b_0  
requests                  2.32.3                   pypi_0    pypi
setuptools                72.1.0           py39h06a4308_0  
sqlite                    3.45.3               h5eee18b_0  
sympy                     1.13.2                   pypi_0    pypi
tk                        8.6.14               h39e8969_0  
torch                     2.0.0+cu118              pypi_0    pypi
torchaudio                2.0.0+cu118              pypi_0    pypi
torchmetrics              1.4.1                    pypi_0    pypi
torchvision               0.15.0+cu118             pypi_0    pypi

More info

No response


Labels

bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.1.x
