Labels: bug (Something isn't working), distributed (Generic distributed-related topic), environment: slurm, ver: 2.4.x
Description
Bug description
I set up a training run on a Slurm cluster, specifying 2 nodes with 4 GPUs each. During initialization, I hit the known PyTorch issue "Unexpected behavior (times out) of all_gather_into_tensor with subgroups": the `all_gather_into_tensor` call on a subgroup times out.
Apparently, this issue has not been solved at the PyTorch or NCCL level, but there is a workaround (described in a post on that same issue).
How/where could this workaround be implemented in PyTorch Lightning, if outright solving the underlying problem is not possible?
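For context, here is a minimal sketch of the collective pattern the linked PyTorch issue describes. This is an illustration only, not my actual training code, and it assumes 8 ranks launched via `srun` (2 nodes x 4 GPUs each) with `MASTER_ADDR`/`MASTER_PORT` already set in the environment:

```python
import os
import torch
import torch.distributed as dist


def main():
    # Rank/world-size bookkeeping from the Slurm environment (assumed layout).
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Split the 8 ranks into two subgroups of 4 ranks each.
    # Every rank must call new_group for every subgroup, even for
    # subgroups it does not belong to.
    groups = [dist.new_group(ranks=[0, 1, 2, 3]),
              dist.new_group(ranks=[4, 5, 6, 7])]
    my_group = groups[rank // 4]

    inp = torch.ones(8, device="cuda") * rank
    out = torch.empty(8 * 4, device="cuda")  # group size (4) * input size (8)

    # This is the call that reportedly times out when issued on a
    # subgroup instead of the default (global) process group.
    dist.all_gather_into_tensor(out, inp, group=my_group)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```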
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
I'm working on a Slurm cluster with 2 head nodes (no GPUs), 6 compute nodes (see configuration below), and NFS-mounted data storage.
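For reference, a minimal sketch (assumption, not my exact script) of how the job described above would typically be configured in Lightning, relying on Lightning's automatic detection of the Slurm environment:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,       # 4 GPUs per node
    num_nodes=2,     # 2 compute nodes
    strategy="ddp",  # SLURMEnvironment is picked up automatically under srun
)
# trainer.fit(MyLightningModule(), datamodule=my_datamodule)  # placeholders
```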
<details>
<summary>Current environment</summary>
* CUDA:
- GPU:
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- available: True
- version: 12.1
* Lightning:
- lightning-utilities: 0.11.7
- pytorch-lightning: 2.4.0
- torch: 2.4.1+cu121
- torchmetrics: 1.4.2
- torchvision: 0.19.1+cu121
* Packages:
- absl-py: 2.1.0
- aiohappyeyeballs: 2.4.0
- aiohttp: 3.10.5
- aiosignal: 1.3.1
- albucore: 0.0.16
- albumentations: 1.4.15
- annotated-types: 0.7.0
- async-timeout: 4.0.3
- attrs: 24.2.0
- certifi: 2024.8.30
- charset-normalizer: 3.3.2
- contourpy: 1.3.0
- cycler: 0.12.1
- eval-type-backport: 0.2.0
- filelock: 3.13.1
- fonttools: 4.53.1
- frozenlist: 1.4.1
- fsspec: 2024.2.0
- future: 1.0.0
- geopandas: 1.0.1
- grpcio: 1.66.1
- huggingface-hub: 0.25.0
- idna: 3.10
- imageio: 2.35.1
- imgaug: 0.4.0
- jinja2: 3.1.3
- joblib: 1.4.2
- kiwisolver: 1.4.7
- lazy-loader: 0.4
- lightning-utilities: 0.11.7
- markdown: 3.7
- matplotlib: 3.9.2
- mpmath: 1.3.0
- msgpack: 1.1.0
- multidict: 6.1.0
- networkx: 3.2.1
- numpy: 1.26.3
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.1.105
- nvidia-nvtx-cu12: 12.1.105
- opencv-python: 4.10.0.84
- opencv-python-headless: 4.10.0.84
- packaging: 24.1
- pandas: 2.2.2
- pillow: 10.2.0
- pip: 22.3.1
- protobuf: 5.28.1
- pydantic: 2.9.2
- pydantic-core: 2.23.4
- pyogrio: 0.9.0
- pyparsing: 3.1.4
- pyproj: 3.6.1
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.4.0
- pytz: 2024.2
- pyyaml: 6.0.2
- requests: 2.32.3
- s2sphere: 0.2.5
- safetensors: 0.4.5
- scikit-image: 0.24.0
- scikit-learn: 1.5.2
- scipy: 1.14.1
- setuptools: 65.5.0
- shapely: 2.0.6
- six: 1.16.0
- sympy: 1.12
- tensorboard: 2.17.1
- tensorboard-data-server: 0.7.2
- threadpoolctl: 3.5.0
- tifffile: 2024.8.30
- timm: 1.0.9
- torch: 2.4.1+cu121
- torchmetrics: 1.4.2
- torchvision: 0.19.1+cu121
- tqdm: 4.66.5
- triton: 3.0.0
- typing-extensions: 4.9.0
- tzdata: 2024.1
- urllib3: 2.2.3
- werkzeug: 3.0.4
- yarl: 1.11.1
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.9
- release: 5.15.0-50-generic
- version: #56~20.04.1-Ubuntu SMP Tue Sep 27 15:51:29 UTC 2022
</details>
More info
No response