
Commit 64db5bf

Merge pull request #1 from boeschf/expand-communication

perf variables

2 parents: 259fd4b + 30901d1


docs/software/communication/nccl.md

13 additions, 2 deletions
````diff
@@ -9,15 +9,26 @@ It is commonly used in machine learning frameworks, but traditional scientific a
 To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used.
 With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it.
 
-While the container engine does this automatically, regardless of application, the following environment variable should always be set when using NCCL:
+While the container engine does this automatically, regardless of application, the following environment variables should always be set when using NCCL:
 
 ```bash
-export NCCL_NET_PLUGIN="ofi"
+export NCCL_NET="AWS Libfabric"
 ```
 
 This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network.
 Conversely, if the plugin can not be found, applications will fail to start instead of falling back to e.g. TCP, which would be significantly slower than with the plugin.
 
+For optimal performance, the following environment variables should also be set (these are set automatically by the container engine):
+
+```bash
+export NCCL_NET_GDR_LEVEL=PHB
+export FI_CXI_DISABLE_HOST_REGISTER=1
+export FI_MR_CACHE_MONITOR=userfaultfd
+export FI_CXI_DEFAULT_CQ_SIZE=131072
+export FI_CXI_DEFAULT_TX_SIZE=32768
+export FI_CXI_RX_MATCH_MODE=software
+```
+
 !!! warning "GPU-aware MPI with NCCL"
     Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi).
     Unless care is taken to ensure that the two methods of communication are not used concurrently, we recommend not using GPU-aware MPI with NCCL.
````
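For non-containerized runs, the two snippets in the diff combine naturally into a single job script. Below is a minimal sketch assuming a Slurm setup; the job name, node count, and application path are placeholders, while the exported variables are exactly those from the documentation change:

```bash
#!/bin/bash
#SBATCH --job-name=nccl-job   # placeholder job name
#SBATCH --nodes=2             # placeholder node count

# Required: force NCCL onto the libfabric plugin (fails fast if the
# plugin is missing, rather than silently falling back to TCP).
export NCCL_NET="AWS Libfabric"

# Performance settings from the documentation (the container engine
# sets these automatically).
export NCCL_NET_GDR_LEVEL=PHB
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_RX_MATCH_MODE=software

srun ./your_nccl_app          # placeholder NCCL-based application
```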

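Since a missing plugin aborts startup rather than silently degrading, it can also be worth confirming which network backend NCCL selected. A small sketch using NCCL's standard debug logging; `NCCL_DEBUG` is a stock NCCL variable rather than part of the configuration above, and the exact log text varies by NCCL version:

```bash
# Enable NCCL's initialization logging and inspect which network it reports.
export NCCL_DEBUG=INFO

# 'your_nccl_app' is a placeholder; look for a libfabric/OFI network in the
# output rather than a socket-based fallback.
./your_nccl_app 2>&1 | grep -i "NCCL INFO NET"
```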