Open
Labels
bug (Something isn't working), strategy: fsdp (Fully Sharded Data Parallel), trainer: validate, ver: 2.3.x
Description
Bug description

I'm running validation where each rank can have a different amount of data, and validation hangs.
In one run, rank 5 ran out of data after batch 50 while the other ranks still had data, and the program hung.
I'm using the FSDP strategy to train an LLM. I'm not sure why validation batches are synchronized across ranks.
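One possible reading of the hang, offered as an assumption rather than a confirmed diagnosis: with FSDP, parameters are sharded across ranks, so a forward pass typically involves collective communication that every rank must join. If that holds here, a rank that runs out of batches stops joining, and the remaining ranks wait on it. The pure-Python sketch below only models that bookkeeping. `first_stalled_step` and the batch counts are hypothetical and are not part of Lightning or PyTorch.

```python
from typing import Dict, List, Optional


def first_stalled_step(batches_per_rank: List[int]) -> Optional[Dict[str, object]]:
    """Return the first step that some ranks cannot join while others still can.

    Assumption: every validation step issues at least one collective that all
    ranks must enter, as a sharded forward pass would.
    """
    shortest = min(batches_per_rank)
    longest = max(batches_per_rank)
    if shortest == longest:
        return None  # every rank finishes together, so nothing stalls
    # Steps are 0-indexed, so step `shortest` is the first one that the
    # shortest rank(s) have no batch for.
    finished = [rank for rank, n in enumerate(batches_per_rank) if n == shortest]
    waiting = [rank for rank, n in enumerate(batches_per_rank) if n > shortest]
    return {"step": shortest, "finished_ranks": finished, "waiting_ranks": waiting}


# Eight ranks; rank 5 has fewer batches than the rest (illustrative numbers).
report = first_stalled_step([60, 60, 60, 60, 60, 51, 60, 60])
print(report)
```

Under that assumption, the ranks listed in `waiting_ranks` would block at the reported step until the process group times out.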
What version are you seeing the problem on?
v2.3
v2.5.0.post0
How to reproduce the bug
def validation_step(self, input_dict: Dict, batch_idx: int) -> Dict[str, torch.Tensor]:
    pred_dict: Dict[str, torch.Tensor] = self.model(input_dict)
    loss_dict: Dict[str, torch.Tensor] = (
        self.model.get_loss_dict(gt_dict=input_dict, pred_dict=pred_dict)
    )
    self.log_dict(
        {f"val_loss/{key}": loss for key, loss in loss_dict.items()},
        # In validation step, all logs are aggregated and logged at the epoch end.
        on_step=False,
        on_epoch=True,
        sync_dist=True,
    )
    return pred_dict
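A workaround sometimes used for uneven shards is to give every rank the same number of validation samples by padding the shorter shards and marking the padded entries so they can be left out of metrics. The sketch below is a minimal, framework-free illustration of that idea, not something this issue confirms as the fix. `pad_to_longest` is a hypothetical helper, and in a real job each rank would need to learn the longest shard length some other way (for example from dataset metadata), since it only sees its own shard.

```python
from typing import List, Tuple


def pad_to_longest(shards: List[List[int]]) -> List[List[Tuple[int, bool]]]:
    """Pad every rank's index list to the length of the longest one.

    Each entry is (sample_index, is_real). Padded entries repeat the rank's
    own samples in order and carry is_real=False so they can be masked out.
    """
    target = max(len(shard) for shard in shards)
    padded = []
    for shard in shards:
        if not shard:
            raise ValueError("cannot pad an empty shard: nothing to repeat")
        entries = [(idx, True) for idx in shard]
        # Wrap around the shard's own samples until the target length is reached.
        for i in range(target - len(shard)):
            entries.append((shard[i % len(shard)], False))
        padded.append(entries)
    return padded


# Two ranks with uneven shards (illustrative sample indices).
result = pad_to_longest([[0, 1, 2, 3], [4, 5]])
for rank, entries in enumerate(result):
    print(rank, entries)
```

With equal lengths, every rank takes the same number of steps. The `is_real` flag is there so that losses from padded samples can be skipped when logging, otherwise the epoch average would count some samples twice.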
Error messages and logs
No error message. NCCL timeout after 10 minutes.
Environment
Current environment
- CUDA:
- GPU:
- NVIDIA L20
- available: True
- version: 12.4
- Lightning:
- efficientnet-pytorch: 0.7.1
- lightning: 2.3.0
- lightning-thunder: 0.1.0
- lightning-utilities: 0.11.2
- pytorch-lightning: 2.5.0.post0
- pytorch-quantization: 2.1.2
- pytorch-triton: 3.0.0+a9bc1a364
- torch: 2.5.1+cu124
- torch-scatter: 2.1.2+pt25cu124
- torchcodec: 0.1.1+cu124
- torchdata: 0.7.1a0
- torchmetrics: 1.6.1
- torchvision: 0.20.1+cu124
- xpilot-lightning-mini: 2.3.0
- Packages:
- absl-py: 2.1.0
- accelerate: 1.2.1
- adjusttext: 1.1.1
- aiofiles: 22.1.0
- aiohttp: 3.9.3
- aiosignal: 1.3.1
- aiosqlite: 0.20.0
- albumentations: 1.1.0
- alembic: 1.14.0
- aliyun-python-sdk-core: 2.16.0
- aliyun-python-sdk-kms: 2.16.5
- annotated-types: 0.6.0
- anyio: 3.7.1
- argon2-cffi: 23.1.0
- argon2-cffi-bindings: 21.2.0
- arrow: 1.3.0
- astral: 3.2
- astroid: 2.15.8
- asttokens: 2.4.1
- astunparse: 1.6.3
- async-timeout: 4.0.3
- attrs: 23.2.0
- audioread: 3.0.1
- av: 12.3.0
- babel: 2.16.0
- bcrypt: 4.2.1
- beautifulsoup4: 4.12.3
- bleach: 6.1.0
- blessed: 1.20.0
- blinker: 1.9.0
- blis: 0.7.11
- blosc2: 2.7.1
- cachetools: 5.3.3
- casadi: 3.6.5
- catalogue: 2.0.10
- cattrs: 24.1.2
- ccimport: 0.4.4
- certifi: 2024.2.2
- cffi: 1.16.0
- charset-normalizer: 3.3.2
- click: 8.1.7
- cloudpathlib: 0.16.0
- cloudpickle: 2.2.1
- cmake: 3.29.0.1
- coloredlogs: 15.0.1
- colorlog: 6.9.0
- comm: 0.2.2
- confection: 0.1.4
- contextlib2: 21.6.0
- contourpy: 1.2.1
- crcmod: 1.7
- cryptography: 44.0.0
- cuda-python: 12.4.0rc7+3.ge75c8a9.dirty
- cudf: 24.2.0
- cudnn: 1.1.2
- cugraph: 24.2.0
- cugraph-dgl: 24.2.0
- cugraph-service-client: 24.2.0
- cugraph-service-server: 24.2.0
- cuml: 24.2.0
- cumm-cu120: 0.4.11
- cupy-cuda12x: 13.0.0
- cycler: 0.12.1
- cymem: 2.0.8
- cython: 3.0.10
- dask: 2024.1.1
- dask-cuda: 24.2.0
- dask-cudf: 24.2.0
- databricks-cli: 0.18.0
- dbus-python: 1.2.18
- debugpy: 1.8.1
- decorator: 5.1.1
- decord: 0.6.0
- deepspeed: 0.15.0
- defusedxml: 0.7.1
- deprecated: 1.2.15
- descartes: 1.1.0
- dill: 0.3.9
- distributed: 2024.1.1
- distro: 1.7.0
- dm-tree: 0.1.8
- dnspython: 2.7.0
- docker: 6.1.3
- docstring-parser: 0.16
- dotmap: 1.3.30
- duckdb: 0.8.1
- durationpy: 0.9
- easydict: 1.11
- editor: 1.6.6
- efficientnet-pytorch: 0.7.1
- einops: 0.7.0
- elasticsearch: 7.17.12
- elasticsearch-dsl: 7.4.1
- engineering-notation: 0.10.0
- entrypoints: 0.4
- exceptiongroup: 1.2.0
- execnet: 2.0.2
- executing: 2.0.1
- expecttest: 0.1.3
- fastjsonschema: 2.19.1
- fastrlock: 0.8.2
- ffmpeg-python: 0.2.0
- filelock: 3.13.3
- filterpy: 1.4.5
- fire: 0.7.0
- flash-attn: 2.7.2.post1
- flask: 2.3.3
- flatbuffers: 24.12.23
- fonttools: 4.51.0
- fqdn: 1.5.1
- frozenlist: 1.4.1
- fsspec: 2022.11.0
- ftfy: 6.2.3
- func-timeout: 4.3.5
- future: 1.0.0
- fuyao: 4.5.1.post2
- fuyao-all: 3.8.17
- fuyao-skinny: 0.1.10
- gast: 0.5.4
- gitdb: 4.0.12
- gitpython: 3.1.44
- google-auth: 2.29.0
- google-auth-oauthlib: 0.4.6
- graphsurgeon: 0.4.6
- greenlet: 3.1.1
- grpcio: 1.62.1
- gunicorn: 20.1.0
- guppy3: 3.1.3
- h5py: 3.12.1
- hjson: 3.1.0
- huggingface-hub: 0.27.1
- humanfriendly: 10.0
- hypothesis: 5.35.1
- idna: 3.6
- igraph: 0.11.4
- imageio: 2.36.1
- importlib-metadata: 4.13.0
- iniconfig: 2.0.0
- inquirer: 3.4.0
- intel-openmp: 2021.4.0
- ipcqueue: 0.9.7
- ipykernel: 6.29.4
- ipython: 8.21.0
- ipython-genutils: 0.2.0
- ipywidgets: 8.1.5
- isoduration: 20.11.0
- isort: 5.13.2
- itsdangerous: 2.2.0
- jedi: 0.19.1
- jinja2: 3.1.3
- jmespath: 0.10.0
- joblib: 1.3.2
- json5: 0.9.24
- jsonpointer: 3.0.0
- jsonschema: 4.21.1
- jsonschema-specifications: 2023.12.1
- jupyter: 1.1.1
- jupyter-client: 6.2.0
- jupyter-console: 6.6.3
- jupyter-core: 5.7.2
- jupyter-enterprise-gateway: 3.2.2
- jupyter-events: 0.11.0
- jupyter-server: 1.24.0
- jupyter-server-fileid: 0.9.3
- jupyter-server-ydoc: 0.8.0
- jupyter-tensorboard: 0.2.0
- jupyter-ydoc: 0.2.5
- jupyterlab: 3.6.6
- jupyterlab-pygments: 0.3.0
- jupyterlab-server: 2.27.3
- jupyterlab-widgets: 3.0.13
- jupytext: 1.16.1
- kafka-python: 2.0.2
- kiwisolver: 1.4.5
- kubernetes: 31.0.0
- langcodes: 3.3.0
- lark: 1.1.9
- lazy-loader: 0.4
- lazy-object-proxy: 1.10.0
- librosa: 0.10.1
- lightning: 2.3.0
- lightning-thunder: 0.1.0
- lightning-utilities: 0.11.2
- llvmlite: 0.42.0
- locket: 1.0.0
- looseversion: 1.3.0
- lz4: 4.3.2
- mako: 1.3.8
- markdown: 3.6
- markdown-it-py: 3.0.0
- markupsafe: 2.1.5
- masksdk: 1.2.1
- matplotlib: 3.8.4
- matplotlib-inline: 0.1.6
- mccabe: 0.7.0
- mdit-py-plugins: 0.4.0
- mdurl: 0.1.2
- mistune: 3.0.2
- mkl: 2021.1.1
- mkl-devel: 2021.1.1
- mkl-include: 2021.1.1
- mlflow: 1.29.0
- mock: 5.1.0
- mpmath: 1.3.0
- msgpack: 1.0.8
- multidict: 6.0.5
- multiscaledeformableattention: 1.0
- murmurhash: 1.0.10
- mypy-extensions: 1.0.0
- namedatomiclock: 1.1.3
- nbclassic: 1.1.0
- nbclient: 0.10.0
- nbconvert: 7.16.3
- nbformat: 5.10.4
- ndindex: 1.9.2
- nest-asyncio: 1.6.0
- networkx: 3.4.2
- ninja: 1.11.1.1
- notebook: 6.4.10
- notebook-shim: 0.2.4
- numba: 0.59.0+1.g20ae2b56c
- numexpr: 2.10.2
- numpy: 1.24.4
- nuscenes-devkit: 1.1.9
- nvfuser: 0.1.6a0+a684e2a
- nvidia-cublas-cu12: 12.4.5.8
- nvidia-cuda-cupti-cu12: 12.4.127
- nvidia-cuda-nvrtc-cu12: 12.4.127
- nvidia-cuda-runtime-cu12: 12.4.127
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.2.1.3
- nvidia-curand-cu12: 10.3.5.147
- nvidia-cusolver-cu12: 11.6.1.9
- nvidia-cusparse-cu12: 12.3.1.170
- nvidia-dali-cuda120: 1.36.0
- nvidia-ml-py: 12.560.30
- nvidia-nccl-cu12: 2.21.5
- nvidia-nvimgcodec-cu12: 0.2.0.7
- nvidia-nvjitlink-cu12: 12.4.127
- nvidia-nvtx-cu12: 12.4.127
- nvidia-pyindex: 1.0.9
- nvsmi: 0.4.2
- nvtx: 0.2.5
- oauthlib: 3.2.2
- onnx: 1.16.0
- onnxruntime: 1.16.0
- opencv: 4.7.0
- opencv-python: 4.5.5.62
- opencv-python-headless: 4.5.5.62
- opt-einsum: 3.3.0
- optree: 0.11.0
- orjson: 3.8.7
- oss-middle-layer: 1.9.9
- oss2: 2.16.0
- ossfs: 2023.1.0
- packaging: 21.3
- pandas: 1.5.3
- pandocfilters: 1.5.1
- paramiko: 3.5.0
- parso: 0.8.4
- partd: 1.4.1
- pccm: 0.4.16
- peft: 0.13.2
- pexpect: 4.9.0
- pika: 1.3.2
- pillow: 10.3.0
- pip: 24.0
- pipdeptree: 2.13.0
- platformdirs: 4.2.0
- pluggy: 1.4.0
- ply: 3.11
- polars: 0.18.6
- polygraphy: 0.49.8
- pooch: 1.8.1
- portalocker: 3.1.1
- preshed: 3.0.9
- presto-python-client: 0.8.4
- prettytable: 3.10.0
- prometheus-client: 0.20.0
- prometheus-flask-exporter: 0.23.1
- prompt-toolkit: 3.0.43
- protobuf: 3.20.3
- psutil: 5.9.4
- psycopg2-binary: 2.9.10
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- py-cpuinfo: 9.0.0
- py-spy: 0.3.14
- pyarrow: 14.0.1
- pyasn1: 0.6.0
- pyasn1-modules: 0.4.0
- pybind11: 2.12.0
- pybind11-global: 2.12.0
- pycls: 0.1.1
- pycocotools: 2.0.8
- pycparser: 2.22
- pycryptodome: 3.21.0
- pycryptodomex: 3.21.0
- pydantic: 2.6.4
- pydantic-core: 2.16.3
- pygments: 2.17.2
- pygobject: 3.42.1
- pyjwt: 2.10.1
- pylibcugraph: 24.2.0
- pylibcugraphops: 24.2.0
- pylibraft: 24.2.0
- pylint: 2.17.7
- pylint-exit: 1.2.0
- pymongo: 4.10.1
- pymysql: 1.0.2
- pynacl: 1.5.0
- pynvjitlink: 0.1.13
- pynvml: 11.4.1
- pyparsing: 3.1.2
- pypcd: 0.1.1
- pyquaternion: 0.9.9
- pyrasite: 2.0.1
- pytest: 8.1.1
- pytest-flakefinder: 1.1.0
- pytest-rerunfailures: 14.0
- pytest-shard: 0.1.2
- pytest-xdist: 3.5.0
- python-dateutil: 2.9.0.post0
- python-hostlist: 1.23.0
- python-json-logger: 3.2.1
- python-lzf: 0.2.6
- python-rapidjson: 1.8
- pytorch-lightning: 2.5.0.post0
- pytorch-quantization: 2.1.2
- pytorch-triton: 3.0.0+a9bc1a364
- pytz: 2022.7.1
- pyyaml: 6.0.1
- pyzmq: 24.0.1
- qudida: 0.0.4
- querystring-parser: 1.2.4
- raft-dask: 24.2.0
- rapids-dask-dependency: 24.2.0a0
- ratelimiter: 1.2.0.post0
- readchar: 4.2.1
- redis: 5.2.1
- referencing: 0.34.0
- regex: 2023.12.25
- requests: 2.31.0
- requests-oauthlib: 2.0.0
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.7.1
- rmm: 24.2.0
- rpds-py: 0.18.0
- rsa: 4.9
- ruamel.yaml: 0.18.10
- ruamel.yaml.clib: 0.2.12
- runs: 1.2.2
- safetensors: 0.5.2
- schema: 0.7.5
- scikit-image: 0.25.0
- scikit-learn: 1.2.0
- scipy: 1.12.0
- seaborn: 0.12.2
- send2trash: 1.8.2
- sentencepiece: 0.2.0
- setuptools: 68.2.2
- sh: 2.0.7
- shapely: 1.8.1
- simplejson: 3.19.3
- six: 1.16.0
- smart-open: 6.4.0
- smmap: 5.0.2
- sniffio: 1.3.1
- sortedcontainers: 2.4.0
- soundfile: 0.12.1
- soupsieve: 2.5
- soxr: 0.3.7
- spacy: 3.7.4
- spacy-legacy: 3.0.12
- spacy-loggers: 1.0.5
- spconv-cu120: 2.3.6
- sphinx-glpi-theme: 0.6
- sqlalchemy: 1.4.54
- sqlparse: 0.5.3
- srsly: 2.4.8
- ssh-import-id: 5.11
- stack-data: 0.6.3
- sympy: 1.13.1
- tables: 3.10.1
- tabulate: 0.9.0
- tbb: 2021.12.0
- tblib: 3.0.0
- tenacity: 9.0.0
- tensorboard: 2.17.0
- tensorboard-data-server: 0.7.2
- tensorboard-plugin-wit: 1.8.1
- tensorboardx: 2.6.2.2
- tensorrt: 8.6.3
- termcolor: 2.5.0
- terminado: 0.18.1
- texttable: 1.7.0
- thinc: 8.2.3
- threadpoolctl: 3.3.0
- thriftpy2: 0.4.17
- tifffile: 2024.12.12
- timm: 0.5.4
- tinycss2: 1.2.1
- tk-tools: 0.16.0
- tokenizers: 0.20.4
- toml: 0.10.2
- tomli: 2.0.1
- tomlkit: 0.13.2
- toolz: 0.12.1
- torch: 2.5.1+cu124
- torch-scatter: 2.1.2+pt25cu124
- torchcodec: 0.1.1+cu124
- torchdata: 0.7.1a0
- torchmetrics: 1.6.1
- torchvision: 0.20.1+cu124
- tornado: 6.4
- tqdm: 4.66.2
- traitlets: 5.9.0
- transformers: 4.46.2
- treelite: 4.0.0
- triton: 3.1.0
- typed-argument-parser: 1.10.1
- typer: 0.9.4
- types-dataclasses: 0.6.6
- types-python-dateutil: 2.9.0.20241206
- typing-extensions: 4.10.0
- typing-inspect: 0.9.0
- ucx-py: 0.36.0
- uff: 0.6.9
- uri-template: 1.3.0
- urllib3: 1.26.18
- wasabi: 1.1.2
- watchdog: 6.0.0
- wcwidth: 0.2.13
- weasel: 0.3.4
- webcolors: 24.11.1
- webencodings: 0.5.1
- websocket-client: 1.8.0
- werkzeug: 3.0.2
- wheel: 0.43.0
- widgetsnbextension: 4.0.13
- wrapt: 1.17.0
- xcompress: 5.0.1
- xdata-dataloader: 3.1.63
- xdoctest: 1.0.2
- xfoundation: 0.0.6.dev2
- xgboost: 2.0.3
- xmod: 1.8.1
- xpilot-lightning-mini: 2.3.0
- y-py: 0.6.2
- yacs: 0.1.8
- yarl: 1.9.4
- yarn-api-client: 1.0.3
- ypy-websocket: 0.8.4
- zict: 3.0.0
- zipp: 3.17.0
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.12
- release: 5.10.134-17.3.al8.x86_64
- version: #1 SMP Thu Oct 31 14:29:57 CST 2024
More info
No response
cc @lantiga