From e9fd8a5e296d22b000f57c6dddeba05aeeb4bb6b Mon Sep 17 00:00:00 2001
From: Jonathan Coles
Date: Thu, 30 Oct 2025 07:31:35 +0100
Subject: [PATCH 1/2] Recommend disabling eager messages to avoid NCCL watchdog timeouts.

---
 docs/software/communication/nccl.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md
index c3f338b1..4ae90fef 100644
--- a/docs/software/communication/nccl.md
+++ b/docs/software/communication/nccl.md
@@ -22,6 +22,16 @@ While the container engine sets these automatically when using the NCCL hook, th
 
 [_Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms_](https://arxiv.org/abs/2507.04786v2) contains detailed information about NCCL algorithms and protocols, which can be helpful for deciding if your application could benefit from an alternative configuration.
 
+!!! warning "NCCL watchdog timeout or hanging process"
+    In some cases, still under investigation, NCCL may hang resulting in a stuck process or a watchdog timeout error.
+    In this scenario, we recommend disabling Slingshot eager messages with the following workaround:
+    ```console
+    # Disable eager messages to avoid NCCL timeouts
+    export FI_CXI_RDZV_GET_MIN=0
+    export FI_CXI_RDZV_THRESHOLD=0
+    export FI_CXI_RDZV_EAGER_SIZE=0
+    ```
+
 !!! warning "Using NCCL with uenvs"
     The environment variables listed above are not set automatically when using uenvs.
 

From c09223d9ff31cb7bc41a8168e2abd4fd0b51c4a6 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Thu, 30 Oct 2025 09:09:27 +0100
Subject: [PATCH 2/2] Update docs/software/communication/nccl.md

---
 docs/software/communication/nccl.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md
index 4ae90fef..fb96928a 100644
--- a/docs/software/communication/nccl.md
+++ b/docs/software/communication/nccl.md
@@ -25,7 +25,7 @@ While the container engine sets these automatically when using the NCCL hook, th
 !!! warning "NCCL watchdog timeout or hanging process"
     In some cases, still under investigation, NCCL may hang resulting in a stuck process or a watchdog timeout error.
     In this scenario, we recommend disabling Slingshot eager messages with the following workaround:
-    ```console
+    ```bash
     # Disable eager messages to avoid NCCL timeouts
     export FI_CXI_RDZV_GET_MIN=0
     export FI_CXI_RDZV_THRESHOLD=0