Pytorch lightning not training with multi-gpu setup. #20577

@jspsiy

Bug description

I'm developing a model. I'm afraid I cannot share the code, but I can say that training works on CPU. Switching to GPU, however, makes it hang, and it never even enters the training_step function.

It gets stuck before even entering epoch 0.

What version are you seeing the problem on?

v2.5

How to reproduce the bug

Train with these settings:

    trainer = Trainer(
        accelerator="gpu", # change to cpu and it will work
        devices=2,
        precision="16",
        strategy=DeepSpeedStrategy(stage=2),
        max_epochs=100,
        callbacks=[checkpoint_callback],
        num_sanity_val_steps=0,  # ✅ Disable sanity check
        enable_progress_bar=True)

In a Linux terminal I run the command: CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train_parallel.py
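
Since the script is started through torchrun, each worker process should receive RANK, LOCAL_RANK and WORLD_SIZE in its environment. A minimal sketch, not taken from the original train_parallel.py, that summarises those variables so it is visible whether both workers actually started; the helper name `describe_launch_env` is hypothetical:

```python
import os

# Environment variables that torchrun sets for every worker process.
TORCHRUN_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def describe_launch_env(env=None):
    """Return a one-line summary of the torchrun variables seen by this process."""
    env = os.environ if env is None else env
    # Variables that are missing are reported as "unset" instead of raising.
    return " ".join(f"{name}={env.get(name, 'unset')}" for name in TORCHRUN_VARS)
```

Calling `print(describe_launch_env(), flush=True)` at the very top of the training script would print one line per worker. If only one line appears, or the values are "unset", the hang may come from the launch itself rather than from the model.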

Error messages and logs

No error message, just stuck.
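
Because the run produces no output at all, one way to find out where each process is waiting is Python's standard `faulthandler` module, which can dump every thread's stack after a timeout. The following is a sketch under that assumption and is not part of the original script; the helper names and the 300 second default are hypothetical:

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=300.0, stream=None):
    """Dump the stack of every thread to `stream` each `timeout_s` seconds.

    The dump repeats until disable_hang_dump() is called, so a process that
    hangs before the first training step still reports where it is waiting.
    """
    stream = sys.stderr if stream is None else stream
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_hang_dump():
    """Cancel the periodic dump, e.g. once training has clearly started."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` before the trainer is created would make each worker print its stack traces to stderr while it is stuck, which could be attached to this report in place of a log.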

Environment

  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce GTX 1650
    • available: True
    • version: 12.4
  • Lightning:
    • lightning-utilities: 0.12.0
    • open-clip-torch: 2.30.0
    • pytorch-lightning: 1.5.0
    • torch: 2.6.0
    • torchmetrics: 1.6.1
    • torchvision: 0.21.0
  • Packages:
    • absl-py: 2.1.0
    • accelerate: 1.3.0
    • addict: 2.4.0
    • aiofiles: 23.2.1
    • aiohappyeyeballs: 2.4.4
    • aiohttp: 3.11.12
    • aiosignal: 1.3.2
    • annotated-types: 0.7.0
    • antlr4-python3-runtime: 4.9.3
    • anyio: 4.8.0
    • asttokens: 3.0.0
    • async-timeout: 5.0.1
    • attrs: 25.1.0
    • autocommand: 2.2.2
    • backports.tarfile: 1.2.0
    • blinker: 1.9.0
    • certifi: 2025.1.31
    • charset-normalizer: 3.4.1
    • click: 8.1.8
    • coloredlogs: 15.0.1
    • comm: 0.2.2
    • configargparse: 1.7
    • contourpy: 1.3.0
    • cycler: 0.12.1
    • dash: 2.18.2
    • dash-core-components: 2.0.0
    • dash-html-components: 2.0.0
    • dash-table: 5.0.0
    • dataclasses-json: 0.6.7
    • decorator: 5.1.1
    • deepspeed: 0.16.3
    • deprecated: 1.2.18
    • diffusers: 0.24.0
    • docker-pycreds: 0.4.0
    • einops: 0.8.0
    • eval-type-backport: 0.2.2
    • exceptiongroup: 1.2.2
    • executing: 2.2.0
    • fastapi: 0.115.8
    • fastjsonschema: 2.21.1
    • ffmpy: 0.5.0
    • filelock: 3.17.0
    • flask: 3.0.3
    • flatbuffers: 25.1.24
    • fonttools: 4.55.8
    • frozenlist: 1.5.0
    • fsspec: 2025.2.0
    • ftfy: 6.3.1
    • future: 1.0.0
    • gitdb: 4.0.12
    • gitpython: 3.1.44
    • gradio: 4.44.1
    • gradio-client: 1.3.0
    • grpcio: 1.70.0
    • h11: 0.14.0
    • hjson: 3.1.0
    • httpcore: 1.0.7
    • httpx: 0.28.1
    • huggingface-hub: 0.28.1
    • humanfriendly: 10.0
    • idna: 3.10
    • imageio: 2.37.0
    • importlib-metadata: 8.6.1
    • importlib-resources: 6.5.2
    • inflect: 7.3.1
    • ipdb: 0.13.13
    • ipython: 8.18.1
    • ipywidgets: 8.1.5
    • itsdangerous: 2.2.0
    • jaraco.collections: 5.1.0
    • jaraco.context: 5.3.0
    • jaraco.functools: 4.0.1
    • jaraco.text: 3.12.1
    • jaxtyping: 0.2.36
    • jedi: 0.19.2
    • jinja2: 3.1.5
    • joblib: 1.4.2
    • jsonschema: 4.23.0
    • jsonschema-specifications: 2024.10.1
    • jupyter-core: 5.7.2
    • jupyterlab-widgets: 3.0.13
    • kiwisolver: 1.4.7
    • lightning-utilities: 0.12.0
    • lpips: 0.1.4
    • markdown: 3.7
    • markdown-it-py: 3.0.0
    • markupsafe: 2.1.5
    • marshmallow: 3.26.1
    • matplotlib: 3.9.4
    • matplotlib-inline: 0.1.7
    • mdurl: 0.1.2
    • more-itertools: 10.3.0
    • mpmath: 1.3.0
    • msgpack: 1.1.0
    • multidict: 6.1.0
    • mypy-extensions: 1.0.0
    • narwhals: 1.25.1
    • nbformat: 5.10.4
    • nest-asyncio: 1.6.0
    • networkx: 3.2.1
    • ninja: 1.11.1.3
    • numpy: 2.0.2
    • nvdiffrast: 0.3.3
    • nvidia-cublas-cu12: 12.4.5.8
    • nvidia-cuda-cupti-cu12: 12.4.127
    • nvidia-cuda-nvrtc-cu12: 12.4.127
    • nvidia-cuda-runtime-cu12: 12.4.127
    • nvidia-cudnn-cu12: 9.1.0.70
    • nvidia-cufft-cu12: 11.2.1.3
    • nvidia-curand-cu12: 10.3.5.147
    • nvidia-cusolver-cu12: 11.6.1.9
    • nvidia-cusparse-cu12: 12.3.1.170
    • nvidia-cusparselt-cu12: 0.6.2
    • nvidia-ml-py: 12.570.86
    • nvidia-nccl-cu12: 2.21.5
    • nvidia-nvjitlink-cu12: 12.4.127
    • nvidia-nvtx-cu12: 12.4.127
    • omegaconf: 2.3.0
    • onnxruntime: 1.19.2
    • open-clip-torch: 2.30.0
    • open3d: 0.19.0
    • opencv-python: 4.11.0.86
    • orjson: 3.10.15
    • packaging: 24.2
    • pandas: 2.2.3
    • parso: 0.8.4
    • pexpect: 4.9.0
    • pillow: 10.4.0
    • pip: 25.0
    • platformdirs: 4.3.6
    • plotly: 6.0.0
    • prompt-toolkit: 3.0.50
    • propcache: 0.2.1
    • protobuf: 5.29.3
    • psutil: 6.1.1
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.3
    • py-cpuinfo: 9.0.0
    • pydantic: 2.10.6
    • pydantic-core: 2.27.2
    • pydeprecate: 0.3.1
    • pydub: 0.25.1
    • pygltflib: 1.16.3
    • pygments: 2.19.1
    • pymeshlab: 2023.12.post2
    • pyparsing: 3.2.1
    • pyquaternion: 0.9.9
    • python-dateutil: 2.9.0.post0
    • python-multipart: 0.0.20
    • pytorch-lightning: 1.5.0
    • pytz: 2025.1
    • pyyaml: 6.0.2
    • referencing: 0.36.2
    • regex: 2024.11.6
    • requests: 2.32.3
    • retrying: 1.3.4
    • rich: 13.9.4
    • rm-anime-bg: 0.2.0
    • rpds-py: 0.22.3
    • ruff: 0.9.4
    • safetensors: 0.5.2
    • scikit-learn: 1.6.1
    • scipy: 1.13.1
    • semantic-version: 2.10.0
    • sentry-sdk: 2.20.0
    • setproctitle: 1.3.4
    • setuptools: 75.8.0
    • shellingham: 1.5.4
    • six: 1.17.0
    • smmap: 5.0.2
    • sniffio: 1.3.1
    • stack-data: 0.6.3
    • starlette: 0.45.3
    • sympy: 1.13.1
    • tensorboard: 2.18.0
    • tensorboard-data-server: 0.7.2
    • tensorboardx: 1.8
    • threadpoolctl: 3.5.0
    • timm: 1.0.14
    • tokenizers: 0.21.0
    • tomli: 2.2.1
    • tomlkit: 0.12.0
    • torch: 2.6.0
    • torchmetrics: 1.6.1
    • torchvision: 0.21.0
    • tqdm: 4.67.1
    • traitlets: 5.14.3
    • transformers: 4.48.2
    • trimesh: 4.6.1
    • triton: 3.2.0
    • typeguard: 4.3.0
    • typer: 0.15.1
    • typing-extensions: 4.12.2
    • typing-inspect: 0.9.0
    • tzdata: 2025.1
    • urllib3: 2.3.0
    • uvicorn: 0.34.0
    • wandb: 0.19.6
    • wcwidth: 0.2.13
    • websockets: 12.0
    • werkzeug: 3.0.6
    • wheel: 0.45.1
    • widgetsnbextension: 4.0.13
    • wrapt: 1.17.2
    • xformers: 0.0.29.post2
    • yarl: 1.18.3
    • zipp: 3.21.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.9.21
    • release: 6.2.0-37-generic
    • version: #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2

More info

No response

Labels: bug, needs triage, ver: 2.5.x
