Merged
Changes from 1 commit
10 changes: 10 additions & 0 deletions docs/software/communication/nccl.md
@@ -22,6 +22,16 @@ While the container engine sets these automatically when using the NCCL hook, th

[_Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms_](https://arxiv.org/abs/2507.04786v2) contains detailed information about NCCL algorithms and protocols, which can be helpful for deciding if your application could benefit from an alternative configuration.

!!! warning "NCCL watchdog timeout or hanging process"
    In some cases, which are still under investigation, NCCL may hang, leaving the process stuck or raising a watchdog timeout error.
    If this happens, we recommend disabling Slingshot eager messages with the following workaround:
    ```console
    # Disable eager messages to avoid NCCL timeouts
    export FI_CXI_RDZV_GET_MIN=0
    export FI_CXI_RDZV_THRESHOLD=0
    export FI_CXI_RDZV_EAGER_SIZE=0
    ```
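
To confirm that the workaround is actually in effect inside a job, a minimal sketch such as the following can be sourced into the job script. The function name `check_rdzv_workaround` is hypothetical and not part of any tool; it only checks that the three variables from the workaround above are set to `0` in the current environment.

```bash
# Hypothetical helper (not provided by NCCL, libfabric, or the container engine):
# reports whether the eager-message workaround variables are all set to 0.
check_rdzv_workaround() {
    rc=0
    for var in FI_CXI_RDZV_GET_MIN FI_CXI_RDZV_THRESHOLD FI_CXI_RDZV_EAGER_SIZE; do
        # Look up the value of the variable whose name is stored in $var
        eval "value=\${$var:-}"
        if [ "$value" = "0" ]; then
            echo "$var=0 (ok)"
        else
            echo "$var=${value:-<unset>} (expected 0)"
            rc=1
        fi
    done
    # Return non-zero if any variable is missing or has a different value
    return $rc
}
```

Calling `check_rdzv_workaround` before launching the application prints one line per variable and returns a non-zero status if any of them is unset or differs from `0`.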

!!! warning "Using NCCL with uenvs"
    The environment variables listed above are not set automatically when using uenvs.
