Training not starting in multidevice setup #20411

@stonelazy

Bug description

Training hangs and never starts when using multiple devices. The problem does not occur with a single device.

What version are you seeing the problem on?

v2.1, v2.3

How to reproduce the bug

https://colab.research.google.com/drive/1R9tX9vZQrrmbHYY34c2NE2u_4KZ0hOip?usp=sharing

Error messages and logs

Detailed logs are available in the same Colab notebook linked above: https://colab.research.google.com/drive/1R9tX9vZQrrmbHYY34c2NE2u_4KZ0hOip?usp=sharing

Environment

More info

Current environment
  • CUDA:
    • GPU:
      • NVIDIA L40S
      • NVIDIA L40S
      • NVIDIA L40S
      • NVIDIA L40S
      • NVIDIA L40S
      • NVIDIA L40S
      • NVIDIA L40S
      • NVIDIA L40S
    • available: True
    • version: 12.1
  • Lightning:
    • lightning-utilities: 0.11.8
    • pytorch-lightning: 2.0.0
    • torch: 2.3.1+cu121
    • torch-poly-lr-decay: 0.0.1
    • torchaudio: 2.3.1+cu121
    • torchdata: 0.7.1
    • torchmetrics: 1.5.2
  • Packages:
    • absl-py: 2.1.0
    • aiohappyeyeballs: 2.4.3
    • aiohttp: 2.3.10
    • aiohttp-retry: 2.9.1
    • aiosignal: 1.3.1
    • alabaster: 1.0.0
    • alembic: 1.14.0
    • altair: 5.4.1
    • amqp: 5.2.0
    • annotated-types: 0.7.0
    • antlr4-python3-runtime: 4.9.3
    • anyio: 4.6.2.post1
    • appdirs: 1.4.4
    • arrow: 1.3.0
    • astroid: 2.15.8
    • asttokens: 2.4.1
    • async-timeout: 5.0.1
    • asyncssh: 2.18.0
    • atpublic: 5.0
    • attrs: 24.2.0
    • audio-events-classification: 0.1.6
    • audio-metrics: 0.2.2
    • audioread: 3.0.1
    • autocommand: 2.2.2
    • autopage: 0.5.2
    • babel: 2.16.0
    • backports.tarfile: 1.2.0
    • billiard: 4.2.1
    • black: 24.10.0
    • blinker: 1.9.0
    • blis: 0.7.11
    • boto3: 1.35.57
    • botocore: 1.35.57
    • braceexpand: 0.1.7
    • bravado: 11.0.3
    • bravado-core: 6.1.1
    • cachetools: 5.5.0
    • catalogue: 2.0.10
    • cdifflib: 1.2.6
    • celery: 5.4.0
    • certifi: 2024.8.30
    • cffi: 1.17.1
    • cfgv: 3.4.0
    • chardet: 5.2.0
    • charset-normalizer: 3.4.0
    • clang-format: 15.0.7
    • click: 8.1.7
    • click-default-group: 1.2.4
    • click-didyoumean: 0.3.1
    • click-plugins: 1.1.1
    • click-repl: 0.3.0
    • cliff: 4.7.0
    • cloudpathlib: 0.20.0
    • cmaes: 0.11.1
    • cmd2: 2.5.4
    • collection: 0.1.6
    • colorama: 0.4.6
    • coloredlogs: 15.0.1
    • colorlog: 6.9.0
    • confection: 0.1.5
    • configobj: 5.0.9
    • contourpy: 1.3.0
    • coverage: 7.6.4
    • cryptography: 43.0.3
    • ctcdecode: 1.0.8
    • cycler: 0.12.1
    • cymem: 2.0.8
    • cython: 3.0.11
    • cytoolz: 1.0.0
    • decorator: 5.1.1
    • deprecated: 1.2.14
    • dictdiffer: 0.9.0
    • dill: 0.3.9
    • dirhash: 0.2.1
    • diskcache: 5.6.3
    • distlib: 0.3.9
    • distro: 1.9.0
    • docker: 7.1.0
    • docutils: 0.21.2
    • dpath: 2.2.0
    • dulwich: 0.22.5
    • dvc: 2.41.1
    • dvc-data: 0.29.0
    • dvc-http: 2.30.2
    • dvc-objects: 0.14.1
    • dvc-render: 0.0.17
    • dvc-stratus: 0.3.2
    • dvc-studio-client: 0.21.0
    • dvc-task: 0.1.9
    • dvclive: 2.0.2
    • edit-distance: 1.0.6
    • editdistance: 0.8.1
    • en-core-web-sm: 3.7.1
    • exceptiongroup: 1.2.2
    • execnet: 2.1.1
    • executing: 2.1.0
    • fastapi: 0.99.1
    • fastdtw: 0.3.4
    • ffmpeg-python: 0.2.0
    • filelock: 3.16.1
    • flatbuffers: 24.3.25
    • flatten-dict: 0.4.2
    • flufl-lock: 8.1.0
    • fonttools: 4.54.1
    • fqdn: 1.5.1
    • frozenlist: 1.5.0
    • fsspec: 2024.10.0
    • ftfy: 5.9
    • funcy: 2.0
    • future: 1.0.0
    • gevent: 24.10.3
    • gitdb: 4.0.11
    • gitpython: 3.1.43
    • grandalf: 0.6
    • greenlet: 3.1.1
    • grequests: 0.7.0
    • grpcio: 1.67.1
    • gunicorn: 22.0.0
    • h11: 0.14.0
    • huggingface-hub: 0.26.2
    • humanfriendly: 10.0
    • hydra-core: 1.3.2
    • identify: 2.6.1
    • idna: 3.10
    • idna-ssl: 1.1.0
    • imageio: 2.36.0
    • imagesize: 1.4.1
    • importlib-metadata: 6.11.0
    • importlib-resources: 6.4.5
    • indic-nlp-library: 0.92
    • infinibatch: 0.1.0
    • inflect: 7.4.0
    • iniconfig: 2.0.0
    • ipython: 8.29.0
    • isoduration: 20.11.0
    • isort: 5.13.2
    • iterative-telemetry: 0.0.6
    • jaraco.collections: 5.1.0
    • jaraco.context: 5.3.0
    • jaraco.functools: 4.0.1
    • jaraco.text: 3.12.1
    • jedi: 0.19.1
    • jellyfish: 1.1.0
    • jinja2: 3.1.4
    • jmespath: 1.0.1
    • joblib: 1.4.2
    • jsonformatter: 0.3.2
    • jsonpointer: 3.0.0
    • jsonref: 1.1.0
    • jsonschema: 4.23.0
    • jsonschema-specifications: 2023.7.1
    • kiwisolver: 1.4.7
    • kombu: 5.4.2
    • langcodes: 3.4.1
    • language-data: 1.2.0
    • lazy-object-proxy: 1.10.0
    • librosa: 0.9.2
    • lightning-utilities: 0.11.8
    • limits: 3.13.0
    • llvmlite: 0.43.0
    • mako: 1.3.6
    • marisa-trie: 1.2.1
    • markdown: 3.7
    • markdown-it-py: 3.0.0
    • markupsafe: 3.0.2
    • matplotlib: 3.9.2
    • matplotlib-inline: 0.1.7
    • mccabe: 0.7.0
    • mdurl: 0.1.2
    • monotonic: 1.6
    • more-itertools: 10.5.0
    • morfessor: 2.0.6
    • mpmath: 1.3.0
    • msgpack: 1.1.0
    • multidict: 6.1.0
    • murmurhash: 1.0.10
    • mypy: 0.961
    • mypy-extensions: 1.0.0
    • nanotime: 0.5.2
    • narwhals: 1.13.3
    • nemo-text-processing: 0.2.0rc0
    • neptune-client: 0.16.19
    • networkx: 3.4.2
    • nltk: 3.9.1
    • nodeenv: 1.9.1
    • numba: 0.60.0
    • numpy: 1.23.5
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 8.9.2.26
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-ml-py: 12.535.161
    • nvidia-nccl-cu12: 2.20.5
    • nvidia-nvjitlink-cu12: 12.6.77
    • nvidia-nvtx-cu12: 12.1.105
    • nvitop: 1.3.2
    • oauthlib: 3.2.2
    • omegaconf: 2.3.0
    • onnx: 1.17.0
    • onnxconverter-common: 1.14.0
    • onnxruntime-gpu: 1.17.1
    • openai-whisper: 20231117
    • optuna: 2.10.1
    • packaging: 24.2
    • pandas: 2.2.3
    • parso: 0.8.4
    • pathspec: 0.9.0
    • pbr: 6.1.0
    • pesq: 0.0.4
    • pexpect: 4.9.0
    • pillow: 10.4.0
    • pip: 24.3.1
    • platformdirs: 4.3.6
    • pluggy: 1.5.0
    • pooch: 1.8.2
    • pre-commit: 4.0.1
    • preshed: 3.0.9
    • prettytable: 3.12.0
    • prompt-toolkit: 3.0.48
    • propcache: 0.2.0
    • protobuf: 3.20.2
    • psutil: 5.9.8
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.3
    • pyarrow: 18.0.0
    • pybind11: 2.13.6
    • pycparser: 2.22
    • pycryptodome: 3.21.0
    • pydantic: 1.10.19
    • pydantic-core: 2.23.4
    • pydeck: 0.9.1
    • pydot: 3.0.2
    • pygit2: 1.16.0
    • pygments: 2.18.0
    • pygtrie: 2.5.0
    • pyjwt: 2.9.0
    • pylint: 2.17.7
    • pylint-protobuf: 0.20.2
    • pyloudnorm: 0.1.1
    • pynini: 2.1.5
    • pyparsing: 3.2.0
    • pyperclip: 1.9.0
    • pyphen: 0.17.0
    • pyroomacoustics: 0.5.0
    • pysptk: 0.2.2
    • pystoi: 0.3.3
    • pystratus: 0.2.4
    • pytest: 8.3.3
    • pytest-cov: 6.0.0
    • pytest-mock: 3.14.0
    • pytest-xdist: 3.6.1
    • python-dateutil: 2.9.0.post0
    • python-dotenv: 1.0.1
    • python-magic: 0.4.27
    • python-multipart: 0.0.7
    • pytorch-lightning: 2.0.0
    • pytz: 2024.2
    • pyvad: 0.2.0
    • pywavelets: 1.7.0
    • pyworld: 0.3.4
    • pyyaml: 6.0.2
    • redis: 5.2.0
    • referencing: 0.30.2
    • regex: 2024.11.6
    • registrable: 0.0.4
    • requests: 2.32.3
    • requests-oauthlib: 2.0.0
    • resampy: 0.4.3
    • rfc3339-validator: 0.1.4
    • rfc3986-validator: 0.1.1
    • rich: 13.9.4
    • rpds-py: 0.21.0
    • ruamel.yaml: 0.18.6
    • ruamel.yaml.clib: 0.2.12
    • s3transfer: 0.10.3
    • sacremoses: 0.1.1
    • safetensors: 0.4.5
    • scantree: 0.0.2
    • scikit-image: 0.19.3
    • scikit-learn: 1.5.1
    • scipy: 1.14.1
    • scmrepo: 0.1.5
    • setuptools: 75.3.0
    • shellingham: 1.5.4
    • shortuuid: 1.0.13
    • shtab: 1.7.1
    • simplejson: 3.19.3
    • six: 1.16.0
    • skl2onnx: 1.17.0
    • slowapi: 0.1.9
    • smart-open: 7.0.5
    • smmap: 5.0.1
    • sniffio: 1.3.1
    • snowballstemmer: 2.2.0
    • soundfile: 0.10.3.post1
    • sox: 1.5.0
    • spacy: 3.7.6
    • spacy-legacy: 3.0.12
    • sphinx: 8.1.3
    • sphinx-argparse: 0.5.2
    • sphinx-rtd-theme: 3.0.1
    • sphinxcontrib-applehelp: 2.0.0
    • sphinxcontrib-devhelp: 2.0.0
    • sphinxcontrib-htmlhelp: 2.1.0
    • sphinxcontrib-jquery: 4.1
    • sphinxcontrib-jsmath: 1.0.1
    • sphinxcontrib-qthelp: 2.0.0
    • sphinxcontrib-serializinghtml: 2.0.0
    • sqlalchemy: 2.0.36
    • sqlite-fts4: 1.0.3
    • sqlite-utils: 3.37
    • srsly: 2.4.8
    • srt: 3.5.3
    • stack-data: 0.6.3
    • starlette: 0.27.0
    • stevedore: 5.3.0
    • streamlit: 1.29.0
    • swagger-spec-validator: 3.0.4
    • sympy: 1.13.3
    • tabulate: 0.9.0
    • taskipy: 1.14.0
    • tenacity: 8.5.0
    • tensorboard: 2.18.0
    • tensorboard-data-server: 0.7.2
    • termcolor: 2.5.0
    • textacy: 0.12.0
    • thinc: 8.2.5
    • threadpoolctl: 3.5.0
    • tifffile: 2024.9.20
    • tiktoken: 0.8.0
    • tokenize-rt: 6.1.0
    • tokenizers: 0.13.3
    • toml: 0.10.2
    • tomli: 2.0.2
    • tomlkit: 0.13.2
    • toolz: 0.11.2
    • torch: 2.3.1+cu121
    • torch-poly-lr-decay: 0.0.1
    • torchaudio: 2.3.1+cu121
    • torchdata: 0.7.1
    • torchmetrics: 1.5.2
    • tornado: 6.4.1
    • tqdm: 4.67.0
    • traitlets: 5.14.3
    • transformers: 4.28.0
    • triton: 2.3.1
    • typeguard: 4.4.1
    • typer: 0.13.0
    • types-python-dateutil: 2.9.0.20241003
    • typing-extensions: 4.12.2
    • tzdata: 2024.2
    • tzlocal: 5.2
    • uri-template: 1.3.0
    • urllib3: 1.26.19
    • uvicorn: 0.24.0.post1
    • validators: 0.34.0
    • vine: 5.1.0
    • virtualenv: 20.27.1
    • voluptuous: 0.15.2
    • wasabi: 1.1.3
    • watchdog: 6.0.0
    • wcwidth: 0.2.13
    • weasel: 0.4.1
    • webcolors: 24.8.0
    • webdataset: 0.2.86
    • webrtcvad: 2.0.10
    • websocket-client: 1.8.0
    • werkzeug: 3.1.3
    • wget: 3.2
    • wheel: 0.44.0
    • wrapt: 1.16.0
    • xls-r-sqa: 0.1.0
    • yarl: 1.17.1
    • youtube-dl: 2021.2.22
    • zc.lockfile: 3.0.post1
    • zipp: 3.20.2
    • zope.event: 5.0
    • zope.interface: 7.1.1
  • System:
