Validation stuck when trainers have different data size #20561

@btian

Bug description

Image

I'm running validation where each trainer (rank) can have a different amount of data, and validation gets stuck.

In the example above, rank 5 ran out of data after batch 50 while the other ranks still had data, and the program hung.

I'm using the FSDP strategy to train an LLM. I'm not sure why validation batches are synchronized across ranks.
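
One possible explanation, offered as an assumption rather than a confirmed diagnosis: FSDP shards parameters across ranks, so each forward pass involves collective communication that every rank has to join. If one rank exhausts its dataloader early, the remaining ranks would block in their next forward pass until NCCL times out. A minimal sketch of one workaround is to compute a batch count that all ranks share and cap validation at it. The helper names and sample counts below are hypothetical; in a real job the per-rank counts would have to be gathered across processes, for example with `torch.distributed.all_gather_object`.

```python
import math
from typing import List


def batches_per_rank(
    samples_per_rank: List[int], batch_size: int, drop_last: bool = False
) -> List[int]:
    """Return how many batches each rank's dataloader would yield."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # Incomplete trailing batches are discarded.
        return [n // batch_size for n in samples_per_rank]
    # Incomplete trailing batches are kept.
    return [math.ceil(n / batch_size) for n in samples_per_rank]


def shared_batch_budget(batch_counts: List[int]) -> int:
    """Largest number of batches that every rank can run without starving."""
    if not batch_counts:
        raise ValueError("batch_counts must not be empty")
    return min(batch_counts)


# Hypothetical sample counts: 8 ranks, rank 5 holds less data than the rest.
samples = [1000] * 5 + [800] + [1000] * 2
counts = batches_per_rank(samples, batch_size=16)
budget = shared_batch_budget(counts)
print(counts, budget)
```

The shared budget could then be passed as an integer `limit_val_batches` to the `Trainer`, at the cost of skipping the surplus batches on the larger ranks. Padding the smaller ranks up to the maximum is the alternative if every sample must be evaluated.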

What version are you seeing the problem on?

v2.3
v2.5.0.post0

How to reproduce the bug

from typing import Dict

import torch

# Method of the user's LightningModule (class definition omitted in the report).
def validation_step(self, input_dict: Dict, batch_idx: int) -> Dict[str, torch.Tensor]:
    pred_dict: Dict[str, torch.Tensor] = self.model(input_dict)
    loss_dict: Dict[str, torch.Tensor] = self.model.get_loss_dict(
        gt_dict=input_dict, pred_dict=pred_dict
    )

    self.log_dict(
        {f"val_loss/{key}": loss for key, loss in loss_dict.items()},
        # In the validation step, all logs are aggregated and logged at epoch end.
        on_step=False,
        on_epoch=True,
        sync_dist=True,
    )

    return pred_dict

Error messages and logs

No error message is printed. The run hangs until NCCL times out after 10 minutes.

Environment

Current environment
  • CUDA:
    - GPU:
        - NVIDIA L20
    - available: True
    - version: 12.4
  • Lightning:
    - efficientnet-pytorch: 0.7.1
    - lightning: 2.3.0
    - lightning-thunder: 0.1.0
    - lightning-utilities: 0.11.2
    - pytorch-lightning: 2.5.0.post0
    - pytorch-quantization: 2.1.2
    - pytorch-triton: 3.0.0+a9bc1a364
    - torch: 2.5.1+cu124
    - torch-scatter: 2.1.2+pt25cu124
    - torchcodec: 0.1.1+cu124
    - torchdata: 0.7.1a0
    - torchmetrics: 1.6.1
    - torchvision: 0.20.1+cu124
    - xpilot-lightning-mini: 2.3.0
  • Packages:
    - absl-py: 2.1.0
    - accelerate: 1.2.1
    - adjusttext: 1.1.1
    - aiofiles: 22.1.0
    - aiohttp: 3.9.3
    - aiosignal: 1.3.1
    - aiosqlite: 0.20.0
    - albumentations: 1.1.0
    - alembic: 1.14.0
    - aliyun-python-sdk-core: 2.16.0
    - aliyun-python-sdk-kms: 2.16.5
    - annotated-types: 0.6.0
    - anyio: 3.7.1
    - argon2-cffi: 23.1.0
    - argon2-cffi-bindings: 21.2.0
    - arrow: 1.3.0
    - astral: 3.2
    - astroid: 2.15.8
    - asttokens: 2.4.1
    - astunparse: 1.6.3
    - async-timeout: 4.0.3
    - attrs: 23.2.0
    - audioread: 3.0.1
    - av: 12.3.0
    - babel: 2.16.0
    - bcrypt: 4.2.1
    - beautifulsoup4: 4.12.3
    - bleach: 6.1.0
    - blessed: 1.20.0
    - blinker: 1.9.0
    - blis: 0.7.11
    - blosc2: 2.7.1
    - cachetools: 5.3.3
    - casadi: 3.6.5
    - catalogue: 2.0.10
    - cattrs: 24.1.2
    - ccimport: 0.4.4
    - certifi: 2024.2.2
    - cffi: 1.16.0
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - cloudpathlib: 0.16.0
    - cloudpickle: 2.2.1
    - cmake: 3.29.0.1
    - coloredlogs: 15.0.1
    - colorlog: 6.9.0
    - comm: 0.2.2
    - confection: 0.1.4
    - contextlib2: 21.6.0
    - contourpy: 1.2.1
    - crcmod: 1.7
    - cryptography: 44.0.0
    - cuda-python: 12.4.0rc7+3.ge75c8a9.dirty
    - cudf: 24.2.0
    - cudnn: 1.1.2
    - cugraph: 24.2.0
    - cugraph-dgl: 24.2.0
    - cugraph-service-client: 24.2.0
    - cugraph-service-server: 24.2.0
    - cuml: 24.2.0
    - cumm-cu120: 0.4.11
    - cupy-cuda12x: 13.0.0
    - cycler: 0.12.1
    - cymem: 2.0.8
    - cython: 3.0.10
    - dask: 2024.1.1
    - dask-cuda: 24.2.0
    - dask-cudf: 24.2.0
    - databricks-cli: 0.18.0
    - dbus-python: 1.2.18
    - debugpy: 1.8.1
    - decorator: 5.1.1
    - decord: 0.6.0
    - deepspeed: 0.15.0
    - defusedxml: 0.7.1
    - deprecated: 1.2.15
    - descartes: 1.1.0
    - dill: 0.3.9
    - distributed: 2024.1.1
    - distro: 1.7.0
    - dm-tree: 0.1.8
    - dnspython: 2.7.0
    - docker: 6.1.3
    - docstring-parser: 0.16
    - dotmap: 1.3.30
    - duckdb: 0.8.1
    - durationpy: 0.9
    - easydict: 1.11
    - editor: 1.6.6
    - efficientnet-pytorch: 0.7.1
    - einops: 0.7.0
    - elasticsearch: 7.17.12
    - elasticsearch-dsl: 7.4.1
    - engineering-notation: 0.10.0
    - entrypoints: 0.4
    - exceptiongroup: 1.2.0
    - execnet: 2.0.2
    - executing: 2.0.1
    - expecttest: 0.1.3
    - fastjsonschema: 2.19.1
    - fastrlock: 0.8.2
    - ffmpeg-python: 0.2.0
    - filelock: 3.13.3
    - filterpy: 1.4.5
    - fire: 0.7.0
    - flash-attn: 2.7.2.post1
    - flask: 2.3.3
    - flatbuffers: 24.12.23
    - fonttools: 4.51.0
    - fqdn: 1.5.1
    - frozenlist: 1.4.1
    - fsspec: 2022.11.0
    - ftfy: 6.2.3
    - func-timeout: 4.3.5
    - future: 1.0.0
    - fuyao: 4.5.1.post2
    - fuyao-all: 3.8.17
    - fuyao-skinny: 0.1.10
    - gast: 0.5.4
    - gitdb: 4.0.12
    - gitpython: 3.1.44
    - google-auth: 2.29.0
    - google-auth-oauthlib: 0.4.6
    - graphsurgeon: 0.4.6
    - greenlet: 3.1.1
    - grpcio: 1.62.1
    - gunicorn: 20.1.0
    - guppy3: 3.1.3
    - h5py: 3.12.1
    - hjson: 3.1.0
    - huggingface-hub: 0.27.1
    - humanfriendly: 10.0
    - hypothesis: 5.35.1
    - idna: 3.6
    - igraph: 0.11.4
    - imageio: 2.36.1
    - importlib-metadata: 4.13.0
    - iniconfig: 2.0.0
    - inquirer: 3.4.0
    - intel-openmp: 2021.4.0
    - ipcqueue: 0.9.7
    - ipykernel: 6.29.4
    - ipython: 8.21.0
    - ipython-genutils: 0.2.0
    - ipywidgets: 8.1.5
    - isoduration: 20.11.0
    - isort: 5.13.2
    - itsdangerous: 2.2.0
    - jedi: 0.19.1
    - jinja2: 3.1.3
    - jmespath: 0.10.0
    - joblib: 1.3.2
    - json5: 0.9.24
    - jsonpointer: 3.0.0
    - jsonschema: 4.21.1
    - jsonschema-specifications: 2023.12.1
    - jupyter: 1.1.1
    - jupyter-client: 6.2.0
    - jupyter-console: 6.6.3
    - jupyter-core: 5.7.2
    - jupyter-enterprise-gateway: 3.2.2
    - jupyter-events: 0.11.0
    - jupyter-server: 1.24.0
    - jupyter-server-fileid: 0.9.3
    - jupyter-server-ydoc: 0.8.0
    - jupyter-tensorboard: 0.2.0
    - jupyter-ydoc: 0.2.5
    - jupyterlab: 3.6.6
    - jupyterlab-pygments: 0.3.0
    - jupyterlab-server: 2.27.3
    - jupyterlab-widgets: 3.0.13
    - jupytext: 1.16.1
    - kafka-python: 2.0.2
    - kiwisolver: 1.4.5
    - kubernetes: 31.0.0
    - langcodes: 3.3.0
    - lark: 1.1.9
    - lazy-loader: 0.4
    - lazy-object-proxy: 1.10.0
    - librosa: 0.10.1
    - lightning: 2.3.0
    - lightning-thunder: 0.1.0
    - lightning-utilities: 0.11.2
    - llvmlite: 0.42.0
    - locket: 1.0.0
    - looseversion: 1.3.0
    - lz4: 4.3.2
    - mako: 1.3.8
    - markdown: 3.6
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.5
    - masksdk: 1.2.1
    - matplotlib: 3.8.4
    - matplotlib-inline: 0.1.6
    - mccabe: 0.7.0
    - mdit-py-plugins: 0.4.0
    - mdurl: 0.1.2
    - mistune: 3.0.2
    - mkl: 2021.1.1
    - mkl-devel: 2021.1.1
    - mkl-include: 2021.1.1
    - mlflow: 1.29.0
    - mock: 5.1.0
    - mpmath: 1.3.0
    - msgpack: 1.0.8
    - multidict: 6.0.5
    - multiscaledeformableattention: 1.0
    - murmurhash: 1.0.10
    - mypy-extensions: 1.0.0
    - namedatomiclock: 1.1.3
    - nbclassic: 1.1.0
    - nbclient: 0.10.0
    - nbconvert: 7.16.3
    - nbformat: 5.10.4
    - ndindex: 1.9.2
    - nest-asyncio: 1.6.0
    - networkx: 3.4.2
    - ninja: 1.11.1.1
    - notebook: 6.4.10
    - notebook-shim: 0.2.4
    - numba: 0.59.0+1.g20ae2b56c
    - numexpr: 2.10.2
    - numpy: 1.24.4
    - nuscenes-devkit: 1.1.9
    - nvfuser: 0.1.6a0+a684e2a
    - nvidia-cublas-cu12: 12.4.5.8
    - nvidia-cuda-cupti-cu12: 12.4.127
    - nvidia-cuda-nvrtc-cu12: 12.4.127
    - nvidia-cuda-runtime-cu12: 12.4.127
    - nvidia-cudnn-cu12: 9.1.0.70
    - nvidia-cufft-cu12: 11.2.1.3
    - nvidia-curand-cu12: 10.3.5.147
    - nvidia-cusolver-cu12: 11.6.1.9
    - nvidia-cusparse-cu12: 12.3.1.170
    - nvidia-dali-cuda120: 1.36.0
    - nvidia-ml-py: 12.560.30
    - nvidia-nccl-cu12: 2.21.5
    - nvidia-nvimgcodec-cu12: 0.2.0.7
    - nvidia-nvjitlink-cu12: 12.4.127
    - nvidia-nvtx-cu12: 12.4.127
    - nvidia-pyindex: 1.0.9
    - nvsmi: 0.4.2
    - nvtx: 0.2.5
    - oauthlib: 3.2.2
    - onnx: 1.16.0
    - onnxruntime: 1.16.0
    - opencv: 4.7.0
    - opencv-python: 4.5.5.62
    - opencv-python-headless: 4.5.5.62
    - opt-einsum: 3.3.0
    - optree: 0.11.0
    - orjson: 3.8.7
    - oss-middle-layer: 1.9.9
    - oss2: 2.16.0
    - ossfs: 2023.1.0
    - packaging: 21.3
    - pandas: 1.5.3
    - pandocfilters: 1.5.1
    - paramiko: 3.5.0
    - parso: 0.8.4
    - partd: 1.4.1
    - pccm: 0.4.16
    - peft: 0.13.2
    - pexpect: 4.9.0
    - pika: 1.3.2
    - pillow: 10.3.0
    - pip: 24.0
    - pipdeptree: 2.13.0
    - platformdirs: 4.2.0
    - pluggy: 1.4.0
    - ply: 3.11
    - polars: 0.18.6
    - polygraphy: 0.49.8
    - pooch: 1.8.1
    - portalocker: 3.1.1
    - preshed: 3.0.9
    - presto-python-client: 0.8.4
    - prettytable: 3.10.0
    - prometheus-client: 0.20.0
    - prometheus-flask-exporter: 0.23.1
    - prompt-toolkit: 3.0.43
    - protobuf: 3.20.3
    - psutil: 5.9.4
    - psycopg2-binary: 2.9.10
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.2
    - py-cpuinfo: 9.0.0
    - py-spy: 0.3.14
    - pyarrow: 14.0.1
    - pyasn1: 0.6.0
    - pyasn1-modules: 0.4.0
    - pybind11: 2.12.0
    - pybind11-global: 2.12.0
    - pycls: 0.1.1
    - pycocotools: 2.0.8
    - pycparser: 2.22
    - pycryptodome: 3.21.0
    - pycryptodomex: 3.21.0
    - pydantic: 2.6.4
    - pydantic-core: 2.16.3
    - pygments: 2.17.2
    - pygobject: 3.42.1
    - pyjwt: 2.10.1
    - pylibcugraph: 24.2.0
    - pylibcugraphops: 24.2.0
    - pylibraft: 24.2.0
    - pylint: 2.17.7
    - pylint-exit: 1.2.0
    - pymongo: 4.10.1
    - pymysql: 1.0.2
    - pynacl: 1.5.0
    - pynvjitlink: 0.1.13
    - pynvml: 11.4.1
    - pyparsing: 3.1.2
    - pypcd: 0.1.1
    - pyquaternion: 0.9.9
    - pyrasite: 2.0.1
    - pytest: 8.1.1
    - pytest-flakefinder: 1.1.0
    - pytest-rerunfailures: 14.0
    - pytest-shard: 0.1.2
    - pytest-xdist: 3.5.0
    - python-dateutil: 2.9.0.post0
    - python-hostlist: 1.23.0
    - python-json-logger: 3.2.1
    - python-lzf: 0.2.6
    - python-rapidjson: 1.8
    - pytorch-lightning: 2.5.0.post0
    - pytorch-quantization: 2.1.2
    - pytorch-triton: 3.0.0+a9bc1a364
    - pytz: 2022.7.1
    - pyyaml: 6.0.1
    - pyzmq: 24.0.1
    - qudida: 0.0.4
    - querystring-parser: 1.2.4
    - raft-dask: 24.2.0
    - rapids-dask-dependency: 24.2.0a0
    - ratelimiter: 1.2.0.post0
    - readchar: 4.2.1
    - redis: 5.2.1
    - referencing: 0.34.0
    - regex: 2023.12.25
    - requests: 2.31.0
    - requests-oauthlib: 2.0.0
    - rfc3339-validator: 0.1.4
    - rfc3986-validator: 0.1.1
    - rich: 13.7.1
    - rmm: 24.2.0
    - rpds-py: 0.18.0
    - rsa: 4.9
    - ruamel.yaml: 0.18.10
    - ruamel.yaml.clib: 0.2.12
    - runs: 1.2.2
    - safetensors: 0.5.2
    - schema: 0.7.5
    - scikit-image: 0.25.0
    - scikit-learn: 1.2.0
    - scipy: 1.12.0
    - seaborn: 0.12.2
    - send2trash: 1.8.2
    - sentencepiece: 0.2.0
    - setuptools: 68.2.2
    - sh: 2.0.7
    - shapely: 1.8.1
    - simplejson: 3.19.3
    - six: 1.16.0
    - smart-open: 6.4.0
    - smmap: 5.0.2
    - sniffio: 1.3.1
    - sortedcontainers: 2.4.0
    - soundfile: 0.12.1
    - soupsieve: 2.5
    - soxr: 0.3.7
    - spacy: 3.7.4
    - spacy-legacy: 3.0.12
    - spacy-loggers: 1.0.5
    - spconv-cu120: 2.3.6
    - sphinx-glpi-theme: 0.6
    - sqlalchemy: 1.4.54
    - sqlparse: 0.5.3
    - srsly: 2.4.8
    - ssh-import-id: 5.11
    - stack-data: 0.6.3
    - sympy: 1.13.1
    - tables: 3.10.1
    - tabulate: 0.9.0
    - tbb: 2021.12.0
    - tblib: 3.0.0
    - tenacity: 9.0.0
    - tensorboard: 2.17.0
    - tensorboard-data-server: 0.7.2
    - tensorboard-plugin-wit: 1.8.1
    - tensorboardx: 2.6.2.2
    - tensorrt: 8.6.3
    - termcolor: 2.5.0
    - terminado: 0.18.1
    - texttable: 1.7.0
    - thinc: 8.2.3
    - threadpoolctl: 3.3.0
    - thriftpy2: 0.4.17
    - tifffile: 2024.12.12
    - timm: 0.5.4
    - tinycss2: 1.2.1
    - tk-tools: 0.16.0
    - tokenizers: 0.20.4
    - toml: 0.10.2
    - tomli: 2.0.1
    - tomlkit: 0.13.2
    - toolz: 0.12.1
    - torch: 2.5.1+cu124
    - torch-scatter: 2.1.2+pt25cu124
    - torchcodec: 0.1.1+cu124
    - torchdata: 0.7.1a0
    - torchmetrics: 1.6.1
    - torchvision: 0.20.1+cu124
    - tornado: 6.4
    - tqdm: 4.66.2
    - traitlets: 5.9.0
    - transformers: 4.46.2
    - treelite: 4.0.0
    - triton: 3.1.0
    - typed-argument-parser: 1.10.1
    - typer: 0.9.4
    - types-dataclasses: 0.6.6
    - types-python-dateutil: 2.9.0.20241206
    - typing-extensions: 4.10.0
    - typing-inspect: 0.9.0
    - ucx-py: 0.36.0
    - uff: 0.6.9
    - uri-template: 1.3.0
    - urllib3: 1.26.18
    - wasabi: 1.1.2
    - watchdog: 6.0.0
    - wcwidth: 0.2.13
    - weasel: 0.3.4
    - webcolors: 24.11.1
    - webencodings: 0.5.1
    - websocket-client: 1.8.0
    - werkzeug: 3.0.2
    - wheel: 0.43.0
    - widgetsnbextension: 4.0.13
    - wrapt: 1.17.0
    - xcompress: 5.0.1
    - xdata-dataloader: 3.1.63
    - xdoctest: 1.0.2
    - xfoundation: 0.0.6.dev2
    - xgboost: 2.0.3
    - xmod: 1.8.1
    - xpilot-lightning-mini: 2.3.0
    - y-py: 0.6.2
    - yacs: 0.1.8
    - yarl: 1.9.4
    - yarn-api-client: 1.0.3
    - ypy-websocket: 0.8.4
    - zict: 3.0.0
    - zipp: 3.17.0
  • System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.10.12
    - release: 5.10.134-17.3.al8.x86_64
    - version: #1 SMP Thu Oct 31 14:29:57 CST 2024

More info

No response

cc @lantiga
