Abnormally slow both single-gpu & DDP training, what is the problem here? #20702

@Mollylulu

Bug description

I just adapted my training code to the Lightning framework for convenient DDP model training, but it now runs almost 10 times slower than my previous manual torch DDP training; the speed is shown below. I have no idea what is wrong here. Could anyone help me figure out what may cause this problem and how to fix it?

[screenshot: training speed comparison]

I have set:

pl.Trainer(
    accelerator='gpu',
    devices=num_gpus,              # placeholder for the actual GPU count
    strategy='ddp',
    sync_batchnorm=True,
    deterministic=True,
    gradient_clip_val=clip_value,  # placeholder for the actual clip value
)
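
For what it's worth, two of these flags are known throughput costs: deterministic=True forces deterministic (often slower) CUDA kernels, and sync_batchnorm=True adds an extra all-reduce for every BatchNorm layer on every step. Below is a minimal sketch to test whether they account for the slowdown, assuming a hypothetical `model` (a LightningModule) and `train_loader` in place of the real ones:

import pytorch_lightning as pl

# Short timing run with the potentially expensive options disabled;
# `model` and `train_loader` are placeholders for the actual objects.
trainer = pl.Trainer(
    accelerator='gpu',
    devices=2,               # placeholder GPU count
    strategy='ddp',
    sync_batchnorm=False,    # skip the per-step BatchNorm all-reduce
    deterministic=False,     # allow faster non-deterministic kernels
    max_steps=100,           # a short run is enough to compare speed
    profiler='simple',       # prints a per-hook timing report at the end
)
trainer.fit(model, train_dataloaders=train_loader)

If this short run is still slow, the 'simple' profiler report should show whether the time goes to the dataloader, the optimizer step, or elsewhere.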

What version are you seeing the problem on?

v2.3

How to reproduce the bug

Error messages and logs

# Error messages and logs here please

Environment

#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

cc @justusschock @lantiga
