Commit 84ac17c

Use one source of truth for NCCL environment variables
1 parent c885c33 commit 84ac17c

3 files changed: 5 additions and 22 deletions

docs/software/communication/nccl.md

Lines changed: 1 addition & 13 deletions
@@ -17,21 +17,9 @@ The environment variables described below must be set to ensure that NCCL uses t
 While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL:
 
 ```bash
-export NCCL_NET="AWS Libfabric" # (1)!
-export NCCL_NET_GDR_LEVEL=PHB # (2)!
-export FI_CXI_DEFAULT_CQ_SIZE=131072 # (3)!
-export FI_CXI_DEFAULT_TX_SIZE=32768
-export FI_CXI_DISABLE_HOST_REGISTER=1
-export FI_CXI_RX_MATCH_MODE=software
-export FI_MR_CACHE_MONITOR=userfaultfd
-export MPICH_GPU_SUPPORT_ENABLED=0 # (4)!
+--8<-- "docs/software/communication/nccl_env_vars"
 ```
 
-1. This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. If the plugin can not be found, applications will fail to start. With the default value, applications would instead fall back to e.g. TCP, which would be significantly slower than with the plugin. [More information about `NCCL_NET`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net).
-2. Use GPU Direct RDMA when GPU and NIC are on the same NUMA node. [More information about `NCCL_NET_GDR_LEVEL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net-gdr-level-formerly-nccl-ib-gdr-level).
-3. This and the other `FI` (libfabric) environment variables have been found to give the best performance on the Alps network across a wide range of applications. Specific applications may perform better with other values.
-4. Disable GPU-aware MPI explicitly, to avoid potential deadlocks between MPI and NCCL.
-
 !!! warning "Using NCCL with uenvs"
     The environment variables listed above are not set automatically when using uenvs.
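The shared snippet file `docs/software/communication/nccl_env_vars` is referenced here but is not part of this diff, so its content is not shown. As a rough sketch only, assuming the file simply mirrors the exports removed from `nccl.md` in this hunk, it might look like the following. The real file may differ; for instance, the block removed from `pytorch.md` also set `NCCL_CROSS_NIC=1` and did not set `MPICH_GPU_SUPPORT_ENABLED`.

```bash
# Hypothetical content of docs/software/communication/nccl_env_vars.
# Assumption: mirrors the exports removed from nccl.md; the actual file is not shown in this commit.
export NCCL_NET="AWS Libfabric"            # force the libfabric plugin instead of falling back to e.g. TCP
export NCCL_NET_GDR_LEVEL=PHB              # GPU Direct RDMA when GPU and NIC share a NUMA node
export FI_CXI_DEFAULT_CQ_SIZE=131072       # libfabric (CXI) tuning values taken from the removed lines
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_RX_MATCH_MODE=software
export FI_MR_CACHE_MONITOR=userfaultfd
export MPICH_GPU_SUPPORT_ENABLED=0         # disable GPU-aware MPI to avoid deadlocks between MPI and NCCL
```

Keeping the exports in one file means both documentation pages render the same values, and a uenv user could in principle `source` a copy of the file directly.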

docs/software/ml/pytorch.md

Lines changed: 2 additions & 8 deletions
@@ -355,14 +355,8 @@ export CUDA_CACHE_DISABLE=1 # (7)!
 ############################################
 # NCCL and Fabric environment variables #
 ############################################
-export NCCL_NET="AWS Libfabric" # (8)!
-export NCCL_NET_GDR_LEVEL=PHB
-export NCCL_CROSS_NIC=1
-export FI_CXI_DISABLE_HOST_REGISTER=1
-export FI_MR_CACHE_MONITOR=userfaultfd
-export FI_CXI_DEFAULT_CQ_SIZE=131072
-export FI_CXI_DEFAULT_TX_SIZE=32768
-export FI_CXI_RX_MATCH_MODE=software
+# (8)!
+--8<-- "docs/software/communication/nccl_env_vars"
 
 # (9)!
 # (10)!

mkdocs.yml

Lines changed: 2 additions & 1 deletion
@@ -182,7 +182,8 @@ markdown_extensions:
         - name: mermaid
           class: mermaid
           format: !!python/name:pymdownx.superfences.fence_code_format
-  - pymdownx.snippets
+  - pymdownx.snippets:
+      check_paths: true
   - pymdownx.highlight:
       anchor_linenums: true
       line_spans: __span
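With `check_paths: true`, the snippets extension fails the build when an included file cannot be found, rather than silently rendering nothing. The same kind of check can be approximated outside of MkDocs. The sketch below is not part of this commit: `check_snippets` is a hypothetical helper that only handles the single-line `--8<-- "path"` form used in this diff, assumes paths contain no whitespace, and assumes it is run from the repository root so that paths resolve the same way as in the diff.

```bash
# Hypothetical helper, not part of the repository: report snippet includes
# whose target file does not exist. Returns non-zero if any are missing.
check_snippets() {
    missing=0
    for docs_file in "$@"; do
        # Pull the quoted path out of lines of the form: --8<-- "path"
        paths=$(sed -n 's/^[[:space:]]*--8<--[[:space:]]*"\([^"]*\)".*$/\1/p' "$docs_file")
        for snippet_path in $paths; do
            if [ ! -f "$snippet_path" ]; then
                echo "missing snippet: $snippet_path (referenced in $docs_file)"
                missing=1
            fi
        done
    done
    return "$missing"
}
```

A possible invocation would be `check_snippets docs/software/communication/nccl.md docs/software/ml/pytorch.md`, for example as a quick pre-commit check before the full documentation build.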
