Closed
Labels
bug, needs triage, ver: 2.5.x
Description
Bug description
I am developing a model whose code I cannot share. Training works on CPU, but after switching to GPU the run hangs and never enters the training_step function.
It hangs before epoch 0 even starts.
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Train with these settings:
trainer = Trainer(
    accelerator="gpu",  # change to "cpu" and it will work
    devices=2,
    precision="16",
    strategy=DeepSpeedStrategy(stage=2),
    max_epochs=100,
    callbacks=[checkpoint_callback],
    num_sanity_val_steps=0,  # ✅ Disable sanity check
    enable_progress_bar=True,
)
In a Linux terminal I run the command: CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train_parallel.py
Error messages and logs
No error message, just stuck.
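Since the hang produces no output, one way to see where each process is stuck is to have Python dump its stack traces after a timeout. The following is a minimal sketch using the standard-library faulthandler module. The helper name arm_hang_watchdog, the log file naming, and the 300-second default are assumptions for illustration and are not part of the original report. It relies on the RANK environment variable that torchrun sets for each worker.

```python
import faulthandler
import os


def arm_hang_watchdog(timeout_s: float = 300.0, log_dir: str = "."):
    """Dump all thread stack traces to a per-rank file every `timeout_s` seconds.

    Hypothetical helper: call it at the top of the training script, before
    trainer.fit(). The returned file object must stay open for as long as
    the watchdog is armed.
    """
    # torchrun exports RANK for every worker; fall back to "0" otherwise.
    rank = os.environ.get("RANK", "0")
    path = os.path.join(log_dir, f"hang_rank{rank}.log")
    log_file = open(path, "w")
    # repeat=True keeps dumping at each interval until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def disarm_hang_watchdog(log_file) -> None:
    """Cancel the pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

If the run is still hanging after the timeout, each hang_rank*.log file should contain the Python call stack of that rank, which can show whether the processes are waiting in distributed setup, data loading, or elsewhere.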
Environment
- CUDA:
- GPU:
- NVIDIA GeForce RTX 3090
- NVIDIA GeForce RTX 3090
- NVIDIA GeForce GTX 1650
- available: True
- version: 12.4
- Lightning:
- lightning-utilities: 0.12.0
- open-clip-torch: 2.30.0
- pytorch-lightning: 1.5.0
- torch: 2.6.0
- torchmetrics: 1.6.1
- torchvision: 0.21.0
- Packages:
- absl-py: 2.1.0
- accelerate: 1.3.0
- addict: 2.4.0
- aiofiles: 23.2.1
- aiohappyeyeballs: 2.4.4
- aiohttp: 3.11.12
- aiosignal: 1.3.2
- annotated-types: 0.7.0
- antlr4-python3-runtime: 4.9.3
- anyio: 4.8.0
- asttokens: 3.0.0
- async-timeout: 5.0.1
- attrs: 25.1.0
- autocommand: 2.2.2
- backports.tarfile: 1.2.0
- blinker: 1.9.0
- certifi: 2025.1.31
- charset-normalizer: 3.4.1
- click: 8.1.8
- coloredlogs: 15.0.1
- comm: 0.2.2
- configargparse: 1.7
- contourpy: 1.3.0
- cycler: 0.12.1
- dash: 2.18.2
- dash-core-components: 2.0.0
- dash-html-components: 2.0.0
- dash-table: 5.0.0
- dataclasses-json: 0.6.7
- decorator: 5.1.1
- deepspeed: 0.16.3
- deprecated: 1.2.18
- diffusers: 0.24.0
- docker-pycreds: 0.4.0
- einops: 0.8.0
- eval-type-backport: 0.2.2
- exceptiongroup: 1.2.2
- executing: 2.2.0
- fastapi: 0.115.8
- fastjsonschema: 2.21.1
- ffmpy: 0.5.0
- filelock: 3.17.0
- flask: 3.0.3
- flatbuffers: 25.1.24
- fonttools: 4.55.8
- frozenlist: 1.5.0
- fsspec: 2025.2.0
- ftfy: 6.3.1
- future: 1.0.0
- gitdb: 4.0.12
- gitpython: 3.1.44
- gradio: 4.44.1
- gradio-client: 1.3.0
- grpcio: 1.70.0
- h11: 0.14.0
- hjson: 3.1.0
- httpcore: 1.0.7
- httpx: 0.28.1
- huggingface-hub: 0.28.1
- humanfriendly: 10.0
- idna: 3.10
- imageio: 2.37.0
- importlib-metadata: 8.6.1
- importlib-resources: 6.5.2
- inflect: 7.3.1
- ipdb: 0.13.13
- ipython: 8.18.1
- ipywidgets: 8.1.5
- itsdangerous: 2.2.0
- jaraco.collections: 5.1.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jaxtyping: 0.2.36
- jedi: 0.19.2
- jinja2: 3.1.5
- joblib: 1.4.2
- jsonschema: 4.23.0
- jsonschema-specifications: 2024.10.1
- jupyter-core: 5.7.2
- jupyterlab-widgets: 3.0.13
- kiwisolver: 1.4.7
- lightning-utilities: 0.12.0
- lpips: 0.1.4
- markdown: 3.7
- markdown-it-py: 3.0.0
- markupsafe: 2.1.5
- marshmallow: 3.26.1
- matplotlib: 3.9.4
- matplotlib-inline: 0.1.7
- mdurl: 0.1.2
- more-itertools: 10.3.0
- mpmath: 1.3.0
- msgpack: 1.1.0
- multidict: 6.1.0
- mypy-extensions: 1.0.0
- narwhals: 1.25.1
- nbformat: 5.10.4
- nest-asyncio: 1.6.0
- networkx: 3.2.1
- ninja: 1.11.1.3
- numpy: 2.0.2
- nvdiffrast: 0.3.3
- nvidia-cublas-cu12: 12.4.5.8
- nvidia-cuda-cupti-cu12: 12.4.127
- nvidia-cuda-nvrtc-cu12: 12.4.127
- nvidia-cuda-runtime-cu12: 12.4.127
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.2.1.3
- nvidia-curand-cu12: 10.3.5.147
- nvidia-cusolver-cu12: 11.6.1.9
- nvidia-cusparse-cu12: 12.3.1.170
- nvidia-cusparselt-cu12: 0.6.2
- nvidia-ml-py: 12.570.86
- nvidia-nccl-cu12: 2.21.5
- nvidia-nvjitlink-cu12: 12.4.127
- nvidia-nvtx-cu12: 12.4.127
- omegaconf: 2.3.0
- onnxruntime: 1.19.2
- open-clip-torch: 2.30.0
- open3d: 0.19.0
- opencv-python: 4.11.0.86
- orjson: 3.10.15
- packaging: 24.2
- pandas: 2.2.3
- parso: 0.8.4
- pexpect: 4.9.0
- pillow: 10.4.0
- pip: 25.0
- platformdirs: 4.3.6
- plotly: 6.0.0
- prompt-toolkit: 3.0.50
- propcache: 0.2.1
- protobuf: 5.29.3
- psutil: 6.1.1
- ptyprocess: 0.7.0
- pure-eval: 0.2.3
- py-cpuinfo: 9.0.0
- pydantic: 2.10.6
- pydantic-core: 2.27.2
- pydeprecate: 0.3.1
- pydub: 0.25.1
- pygltflib: 1.16.3
- pygments: 2.19.1
- pymeshlab: 2023.12.post2
- pyparsing: 3.2.1
- pyquaternion: 0.9.9
- python-dateutil: 2.9.0.post0
- python-multipart: 0.0.20
- pytorch-lightning: 1.5.0
- pytz: 2025.1
- pyyaml: 6.0.2
- referencing: 0.36.2
- regex: 2024.11.6
- requests: 2.32.3
- retrying: 1.3.4
- rich: 13.9.4
- rm-anime-bg: 0.2.0
- rpds-py: 0.22.3
- ruff: 0.9.4
- safetensors: 0.5.2
- scikit-learn: 1.6.1
- scipy: 1.13.1
- semantic-version: 2.10.0
- sentry-sdk: 2.20.0
- setproctitle: 1.3.4
- setuptools: 75.8.0
- shellingham: 1.5.4
- six: 1.17.0
- smmap: 5.0.2
- sniffio: 1.3.1
- stack-data: 0.6.3
- starlette: 0.45.3
- sympy: 1.13.1
- tensorboard: 2.18.0
- tensorboard-data-server: 0.7.2
- tensorboardx: 1.8
- threadpoolctl: 3.5.0
- timm: 1.0.14
- tokenizers: 0.21.0
- tomli: 2.2.1
- tomlkit: 0.12.0
- torch: 2.6.0
- torchmetrics: 1.6.1
- torchvision: 0.21.0
- tqdm: 4.67.1
- traitlets: 5.14.3
- transformers: 4.48.2
- trimesh: 4.6.1
- triton: 3.2.0
- typeguard: 4.3.0
- typer: 0.15.1
- typing-extensions: 4.12.2
- typing-inspect: 0.9.0
- tzdata: 2025.1
- urllib3: 2.3.0
- uvicorn: 0.34.0
- wandb: 0.19.6
- wcwidth: 0.2.13
- websockets: 12.0
- werkzeug: 3.0.6
- wheel: 0.45.1
- widgetsnbextension: 4.0.13
- wrapt: 1.17.2
- xformers: 0.0.29.post2
- yarl: 1.18.3
- zipp: 3.21.0
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.21
- release: 6.2.0-37-generic
- version: #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2
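The report is filed against v2.5, while the listing above shows pytorch-lightning: 1.5.0. A quick way to check which distributions are actually installed in the active environment is sketched below with the standard-library importlib.metadata module. The function name installed_versions and the default package list are assumptions for illustration.

```python
from importlib import metadata


def installed_versions(names=("lightning", "pytorch-lightning", "torch", "deepspeed")):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the same environment that launches train_parallel.py shows which Lightning distribution the script will import.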
More info
No response