Closed
Labels
bug, needs triage, ver: 2.1.x, ver: 2.3.x
Description
Bug description
Training hangs and never starts when using multiple devices. The same code runs without issue on a single device.
What version are you seeing the problem on?
v2.1, v2.3
How to reproduce the bug
https://colab.research.google.com/drive/1R9tX9vZQrrmbHYY34c2NE2u_4KZ0hOip?usp=sharing
Error messages and logs
Detailed logs are available in the same Colab notebook linked above: https://colab.research.google.com/drive/1R9tX9vZQrrmbHYY34c2NE2u_4KZ0hOip?usp=sharing
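For anyone trying to narrow down where a multi-device run stalls, a minimal sketch of turning on more verbose distributed logging is shown here. It is not taken from the notebook: the helper name `apply_debug_env` and the choice of variables are assumptions for illustration. `NCCL_DEBUG`, `NCCL_P2P_DISABLE`, and `TORCH_DISTRIBUTED_DEBUG` are standard NCCL / PyTorch environment variables, but whether they reveal or work around this particular hang is untested.

```python
import os

# Assumed diagnostic settings; none of these come from the linked notebook.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",                 # have NCCL log its setup and chosen transports
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra consistency checks in torch.distributed
}


def apply_debug_env(disable_p2p: bool = False) -> dict:
    """Set the debug variables in os.environ and return what was applied.

    Hypothetical helper: call it before the Trainer is created so that
    the worker processes inherit the settings.
    """
    env = dict(DEBUG_ENV)
    if disable_p2p:
        # Optional experiment: force NCCL to avoid GPU peer-to-peer transfers.
        env["NCCL_P2P_DISABLE"] = "1"
    os.environ.update(env)
    return env


if __name__ == "__main__":
    applied = apply_debug_env(disable_p2p=True)
    for key, value in sorted(applied.items()):
        print(f"{key}={value}")
```

If the run proceeds with `disable_p2p=True` but hangs without it, that would point toward GPU peer-to-peer communication rather than the training code, though this is only one possible explanation.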
Environment
More info
Current environment
- CUDA:
- GPU:
- NVIDIA L40S
- NVIDIA L40S
- NVIDIA L40S
- NVIDIA L40S
- NVIDIA L40S
- NVIDIA L40S
- NVIDIA L40S
- NVIDIA L40S
- available: True
- version: 12.1
- Lightning:
- lightning-utilities: 0.11.8
- pytorch-lightning: 2.0.0
- torch: 2.3.1+cu121
- torch-poly-lr-decay: 0.0.1
- torchaudio: 2.3.1+cu121
- torchdata: 0.7.1
- torchmetrics: 1.5.2
- Packages:
- absl-py: 2.1.0
- aiohappyeyeballs: 2.4.3
- aiohttp: 2.3.10
- aiohttp-retry: 2.9.1
- aiosignal: 1.3.1
- alabaster: 1.0.0
- alembic: 1.14.0
- altair: 5.4.1
- amqp: 5.2.0
- annotated-types: 0.7.0
- antlr4-python3-runtime: 4.9.3
- anyio: 4.6.2.post1
- appdirs: 1.4.4
- arrow: 1.3.0
- astroid: 2.15.8
- asttokens: 2.4.1
- async-timeout: 5.0.1
- asyncssh: 2.18.0
- atpublic: 5.0
- attrs: 24.2.0
- audio-events-classification: 0.1.6
- audio-metrics: 0.2.2
- audioread: 3.0.1
- autocommand: 2.2.2
- autopage: 0.5.2
- babel: 2.16.0
- backports.tarfile: 1.2.0
- billiard: 4.2.1
- black: 24.10.0
- blinker: 1.9.0
- blis: 0.7.11
- boto3: 1.35.57
- botocore: 1.35.57
- braceexpand: 0.1.7
- bravado: 11.0.3
- bravado-core: 6.1.1
- cachetools: 5.5.0
- catalogue: 2.0.10
- cdifflib: 1.2.6
- celery: 5.4.0
- certifi: 2024.8.30
- cffi: 1.17.1
- cfgv: 3.4.0
- chardet: 5.2.0
- charset-normalizer: 3.4.0
- clang-format: 15.0.7
- click: 8.1.7
- click-default-group: 1.2.4
- click-didyoumean: 0.3.1
- click-plugins: 1.1.1
- click-repl: 0.3.0
- cliff: 4.7.0
- cloudpathlib: 0.20.0
- cmaes: 0.11.1
- cmd2: 2.5.4
- collection: 0.1.6
- colorama: 0.4.6
- coloredlogs: 15.0.1
- colorlog: 6.9.0
- confection: 0.1.5
- configobj: 5.0.9
- contourpy: 1.3.0
- coverage: 7.6.4
- cryptography: 43.0.3
- ctcdecode: 1.0.8
- cycler: 0.12.1
- cymem: 2.0.8
- cython: 3.0.11
- cytoolz: 1.0.0
- decorator: 5.1.1
- deprecated: 1.2.14
- dictdiffer: 0.9.0
- dill: 0.3.9
- dirhash: 0.2.1
- diskcache: 5.6.3
- distlib: 0.3.9
- distro: 1.9.0
- docker: 7.1.0
- docutils: 0.21.2
- dpath: 2.2.0
- dulwich: 0.22.5
- dvc: 2.41.1
- dvc-data: 0.29.0
- dvc-http: 2.30.2
- dvc-objects: 0.14.1
- dvc-render: 0.0.17
- dvc-stratus: 0.3.2
- dvc-studio-client: 0.21.0
- dvc-task: 0.1.9
- dvclive: 2.0.2
- edit-distance: 1.0.6
- editdistance: 0.8.1
- en-core-web-sm: 3.7.1
- exceptiongroup: 1.2.2
- execnet: 2.1.1
- executing: 2.1.0
- fastapi: 0.99.1
- fastdtw: 0.3.4
- ffmpeg-python: 0.2.0
- filelock: 3.16.1
- flatbuffers: 24.3.25
- flatten-dict: 0.4.2
- flufl-lock: 8.1.0
- fonttools: 4.54.1
- fqdn: 1.5.1
- frozenlist: 1.5.0
- fsspec: 2024.10.0
- ftfy: 5.9
- funcy: 2.0
- future: 1.0.0
- gevent: 24.10.3
- gitdb: 4.0.11
- gitpython: 3.1.43
- grandalf: 0.6
- greenlet: 3.1.1
- grequests: 0.7.0
- grpcio: 1.67.1
- gunicorn: 22.0.0
- h11: 0.14.0
- huggingface-hub: 0.26.2
- humanfriendly: 10.0
- hydra-core: 1.3.2
- identify: 2.6.1
- idna: 3.10
- idna-ssl: 1.1.0
- imageio: 2.36.0
- imagesize: 1.4.1
- importlib-metadata: 6.11.0
- importlib-resources: 6.4.5
- indic-nlp-library: 0.92
- infinibatch: 0.1.0
- inflect: 7.4.0
- iniconfig: 2.0.0
- ipython: 8.29.0
- isoduration: 20.11.0
- isort: 5.13.2
- iterative-telemetry: 0.0.6
- jaraco.collections: 5.1.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jedi: 0.19.1
- jellyfish: 1.1.0
- jinja2: 3.1.4
- jmespath: 1.0.1
- joblib: 1.4.2
- jsonformatter: 0.3.2
- jsonpointer: 3.0.0
- jsonref: 1.1.0
- jsonschema: 4.23.0
- jsonschema-specifications: 2023.7.1
- kiwisolver: 1.4.7
- kombu: 5.4.2
- langcodes: 3.4.1
- language-data: 1.2.0
- lazy-object-proxy: 1.10.0
- librosa: 0.9.2
- lightning-utilities: 0.11.8
- limits: 3.13.0
- llvmlite: 0.43.0
- mako: 1.3.6
- marisa-trie: 1.2.1
- markdown: 3.7
- markdown-it-py: 3.0.0
- markupsafe: 3.0.2
- matplotlib: 3.9.2
- matplotlib-inline: 0.1.7
- mccabe: 0.7.0
- mdurl: 0.1.2
- monotonic: 1.6
- more-itertools: 10.5.0
- morfessor: 2.0.6
- mpmath: 1.3.0
- msgpack: 1.1.0
- multidict: 6.1.0
- murmurhash: 1.0.10
- mypy: 0.961
- mypy-extensions: 1.0.0
- nanotime: 0.5.2
- narwhals: 1.13.3
- nemo-text-processing: 0.2.0rc0
- neptune-client: 0.16.19
- networkx: 3.4.2
- nltk: 3.9.1
- nodeenv: 1.9.1
- numba: 0.60.0
- numpy: 1.23.5
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 8.9.2.26
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-ml-py: 12.535.161
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.6.77
- nvidia-nvtx-cu12: 12.1.105
- nvitop: 1.3.2
- oauthlib: 3.2.2
- omegaconf: 2.3.0
- onnx: 1.17.0
- onnxconverter-common: 1.14.0
- onnxruntime-gpu: 1.17.1
- openai-whisper: 20231117
- optuna: 2.10.1
- packaging: 24.2
- pandas: 2.2.3
- parso: 0.8.4
- pathspec: 0.9.0
- pbr: 6.1.0
- pesq: 0.0.4
- pexpect: 4.9.0
- pillow: 10.4.0
- pip: 24.3.1
- platformdirs: 4.3.6
- pluggy: 1.5.0
- pooch: 1.8.2
- pre-commit: 4.0.1
- preshed: 3.0.9
- prettytable: 3.12.0
- prompt-toolkit: 3.0.48
- propcache: 0.2.0
- protobuf: 3.20.2
- psutil: 5.9.8
- ptyprocess: 0.7.0
- pure-eval: 0.2.3
- pyarrow: 18.0.0
- pybind11: 2.13.6
- pycparser: 2.22
- pycryptodome: 3.21.0
- pydantic: 1.10.19
- pydantic-core: 2.23.4
- pydeck: 0.9.1
- pydot: 3.0.2
- pygit2: 1.16.0
- pygments: 2.18.0
- pygtrie: 2.5.0
- pyjwt: 2.9.0
- pylint: 2.17.7
- pylint-protobuf: 0.20.2
- pyloudnorm: 0.1.1
- pynini: 2.1.5
- pyparsing: 3.2.0
- pyperclip: 1.9.0
- pyphen: 0.17.0
- pyroomacoustics: 0.5.0
- pysptk: 0.2.2
- pystoi: 0.3.3
- pystratus: 0.2.4
- pytest: 8.3.3
- pytest-cov: 6.0.0
- pytest-mock: 3.14.0
- pytest-xdist: 3.6.1
- python-dateutil: 2.9.0.post0
- python-dotenv: 1.0.1
- python-magic: 0.4.27
- python-multipart: 0.0.7
- pytorch-lightning: 2.0.0
- pytz: 2024.2
- pyvad: 0.2.0
- pywavelets: 1.7.0
- pyworld: 0.3.4
- pyyaml: 6.0.2
- redis: 5.2.0
- referencing: 0.30.2
- regex: 2024.11.6
- registrable: 0.0.4
- requests: 2.32.3
- requests-oauthlib: 2.0.0
- resampy: 0.4.3
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.9.4
- rpds-py: 0.21.0
- ruamel.yaml: 0.18.6
- ruamel.yaml.clib: 0.2.12
- s3transfer: 0.10.3
- sacremoses: 0.1.1
- safetensors: 0.4.5
- scantree: 0.0.2
- scikit-image: 0.19.3
- scikit-learn: 1.5.1
- scipy: 1.14.1
- scmrepo: 0.1.5
- setuptools: 75.3.0
- shellingham: 1.5.4
- shortuuid: 1.0.13
- shtab: 1.7.1
- simplejson: 3.19.3
- six: 1.16.0
- skl2onnx: 1.17.0
- slowapi: 0.1.9
- smart-open: 7.0.5
- smmap: 5.0.1
- sniffio: 1.3.1
- snowballstemmer: 2.2.0
- soundfile: 0.10.3.post1
- sox: 1.5.0
- spacy: 3.7.6
- spacy-legacy: 3.0.12
- sphinx: 8.1.3
- sphinx-argparse: 0.5.2
- sphinx-rtd-theme: 3.0.1
- sphinxcontrib-applehelp: 2.0.0
- sphinxcontrib-devhelp: 2.0.0
- sphinxcontrib-htmlhelp: 2.1.0
- sphinxcontrib-jquery: 4.1
- sphinxcontrib-jsmath: 1.0.1
- sphinxcontrib-qthelp: 2.0.0
- sphinxcontrib-serializinghtml: 2.0.0
- sqlalchemy: 2.0.36
- sqlite-fts4: 1.0.3
- sqlite-utils: 3.37
- srsly: 2.4.8
- srt: 3.5.3
- stack-data: 0.6.3
- starlette: 0.27.0
- stevedore: 5.3.0
- streamlit: 1.29.0
- swagger-spec-validator: 3.0.4
- sympy: 1.13.3
- tabulate: 0.9.0
- taskipy: 1.14.0
- tenacity: 8.5.0
- tensorboard: 2.18.0
- tensorboard-data-server: 0.7.2
- termcolor: 2.5.0
- textacy: 0.12.0
- thinc: 8.2.5
- threadpoolctl: 3.5.0
- tifffile: 2024.9.20
- tiktoken: 0.8.0
- tokenize-rt: 6.1.0
- tokenizers: 0.13.3
- toml: 0.10.2
- tomli: 2.0.2
- tomlkit: 0.13.2
- toolz: 0.11.2
- torch: 2.3.1+cu121
- torch-poly-lr-decay: 0.0.1
- torchaudio: 2.3.1+cu121
- torchdata: 0.7.1
- torchmetrics: 1.5.2
- tornado: 6.4.1
- tqdm: 4.67.0
- traitlets: 5.14.3
- transformers: 4.28.0
- triton: 2.3.1
- typeguard: 4.4.1
- typer: 0.13.0
- types-python-dateutil: 2.9.0.20241003
- typing-extensions: 4.12.2
- tzdata: 2024.2
- tzlocal: 5.2
- uri-template: 1.3.0
- urllib3: 1.26.19
- uvicorn: 0.24.0.post1
- validators: 0.34.0
- vine: 5.1.0
- virtualenv: 20.27.1
- voluptuous: 0.15.2
- wasabi: 1.1.3
- watchdog: 6.0.0
- wcwidth: 0.2.13
- weasel: 0.4.1
- webcolors: 24.8.0
- webdataset: 0.2.86
- webrtcvad: 2.0.10
- websocket-client: 1.8.0
- werkzeug: 3.1.3
- wget: 3.2
- wheel: 0.44.0
- wrapt: 1.16.0
- xls-r-sqa: 0.1.0
- yarl: 1.17.1
- youtube-dl: 2021.2.22
- zc.lockfile: 3.0.post1
- zipp: 3.20.2
- zope.event: 5.0
- zope.interface: 7.1.1
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.6
- release: 6.8.0-47-generic
- version: #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 2 16:16:55 UTC 2