Commit 8871039

Use one source of truth for NCCL environment variables (#152)
* Use one source of truth for NCCL environment variables
* Add nccl_env_vars file
1 parent dea3ec1 commit 8871039

4 files changed, +25 -22 lines changed

docs/software/communication/nccl.md

Lines changed: 1 addition & 13 deletions
@@ -17,21 +17,9 @@ The environment variables described below must be set to ensure that NCCL uses t
 While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL:
 
 ```bash
-export NCCL_NET="AWS Libfabric" # (1)!
-export NCCL_NET_GDR_LEVEL=PHB # (2)!
-export FI_CXI_DEFAULT_CQ_SIZE=131072 # (3)!
-export FI_CXI_DEFAULT_TX_SIZE=32768
-export FI_CXI_DISABLE_HOST_REGISTER=1
-export FI_CXI_RX_MATCH_MODE=software
-export FI_MR_CACHE_MONITOR=userfaultfd
-export MPICH_GPU_SUPPORT_ENABLED=0 # (4)!
+--8<-- "docs/software/communication/nccl_env_vars"
 ```
 
-1. This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. If the plugin can not be found, applications will fail to start. With the default value, applications would instead fall back to e.g. TCP, which would be significantly slower than with the plugin. [More information about `NCCL_NET`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net).
-2. Use GPU Direct RDMA when GPU and NIC are on the same NUMA node. [More information about `NCCL_NET_GDR_LEVEL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net-gdr-level-formerly-nccl-ib-gdr-level).
-3. This and the other `FI` (libfabric) environment variables have been found to give the best performance on the Alps network across a wide range of applications. Specific applications may perform better with other values.
-4. Disable GPU-aware MPI explicitly, to avoid potential deadlocks between MPI and NCCL.
-
 !!! warning "Using NCCL with uenvs"
     The environment variables listed above are not set automatically when using uenvs.
 

docs/software/communication/nccl_env_vars

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# This forces NCCL to use the libfabric plugin, enabling full use of the
+# Slingshot network. If the plugin can not be found, applications will fail to
+# start. With the default value, applications would instead fall back to e.g.
+# TCP, which would be significantly slower than with the plugin. More information
+# about `NCCL_NET` can be found at:
+# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net
+export NCCL_NET="AWS Libfabric"
+# Use GPU Direct RDMA when GPU and NIC are on the same NUMA node. More
+# information about `NCCL_NET_GDR_LEVEL` can be found at:
+# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net-gdr-level-formerly-nccl-ib-gdr-level
+export NCCL_NET_GDR_LEVEL=PHB
+export NCCL_CROSS_NIC=1
+# These `FI` (libfabric) environment variables have been found to give the best
+# performance on the Alps network across a wide range of applications. Specific
+# applications may perform better with other values.
+export FI_CXI_DEFAULT_CQ_SIZE=131072
+export FI_CXI_DEFAULT_TX_SIZE=32768
+export FI_CXI_DISABLE_HOST_REGISTER=1
+export FI_CXI_RX_MATCH_MODE=software
+export FI_MR_CACHE_MONITOR=userfaultfd
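
Since the new file contains only comments and `export` lines, it can presumably also be sourced directly from a job script, not only included through the snippet syntax. The sketch below is illustrative and not part of the commit: `check_nccl_env` is a hypothetical helper that reports which of the variables from the file are missing from the current environment.

```bash
# Hypothetical helper (not part of this commit): report which of the
# variables exported by nccl_env_vars are unset or empty.
check_nccl_env() {
    local missing=0 name
    for name in NCCL_NET NCCL_NET_GDR_LEVEL NCCL_CROSS_NIC \
                FI_CXI_DEFAULT_CQ_SIZE FI_CXI_DEFAULT_TX_SIZE \
                FI_CXI_DISABLE_HOST_REGISTER FI_CXI_RX_MATCH_MODE \
                FI_MR_CACHE_MONITOR; do
        # printenv only sees exported variables, which is what NCCL
        # and libfabric will see in child processes.
        if [ -z "$(printenv "$name")" ]; then
            echo "unset: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run from the repository root, `. docs/software/communication/nccl_env_vars && check_nccl_env` would be expected to succeed, assuming the file is sourced unmodified.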

docs/software/ml/pytorch.md

Lines changed: 2 additions & 8 deletions
@@ -355,14 +355,8 @@ export CUDA_CACHE_DISABLE=1 # (7)!
 ############################################
 # NCCL and Fabric environment variables #
 ############################################
-export NCCL_NET="AWS Libfabric" # (8)!
-export NCCL_NET_GDR_LEVEL=PHB
-export NCCL_CROSS_NIC=1
-export FI_CXI_DISABLE_HOST_REGISTER=1
-export FI_MR_CACHE_MONITOR=userfaultfd
-export FI_CXI_DEFAULT_CQ_SIZE=131072
-export FI_CXI_DEFAULT_TX_SIZE=32768
-export FI_CXI_RX_MATCH_MODE=software
+# (8)!
+--8<-- "docs/software/communication/nccl_env_vars"
 
 # (9)!
 # (10)!

mkdocs.yml

Lines changed: 2 additions & 1 deletion
@@ -182,7 +182,8 @@ markdown_extensions:
         - name: mermaid
           class: mermaid
           format: !!python/name:pymdownx.superfences.fence_code_format
-  - pymdownx.snippets
+  - pymdownx.snippets:
+      check_paths: true
   - pymdownx.highlight:
       anchor_linenums: true
       line_spans: __span
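
With `check_paths: true`, `pymdownx.snippets` fails the build when a referenced snippet file cannot be found, rather than silently inserting nothing. A rough pre-build check along the same lines can be sketched in shell. `check_snippet_paths` is a hypothetical helper, not part of this commit. It assumes that snippet paths are relative to the repository root (the extension's default `base_path`) and that each include sits on one line in the `--8<-- "path"` form.

```bash
# Hypothetical helper (not part of this commit): list snippet includes in
# docs/*.md whose target file does not exist.
# $1: repository root; snippet paths are assumed to be relative to it.
check_snippet_paths() {
    local root="$1" missing
    missing=$(
        # Extract every `--8<-- "path"` include, keep only the path,
        # and print the paths that do not resolve to a regular file.
        grep -rhoE --include='*.md' -e '--8<-- "[^"]+"' "$root/docs" |
            sed -E 's/.*"([^"]+)"/\1/' |
            sort -u |
            while IFS= read -r path; do
                [ -f "$root/$path" ] || printf '%s\n' "$path"
            done
    )
    if [ -n "$missing" ]; then
        echo "$missing" | sed 's/^/missing snippet file: /' >&2
        return 1
    fi
    return 0
}
```

Calling `check_snippet_paths .` from the repository root before `mkdocs build` would surface a misspelled include path early. The build itself remains the authoritative check, since this sketch ignores other snippet forms such as block includes.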
