It is commonly used in machine learning frameworks, but traditional scientific applications can also benefit from NCCL.
To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used.
With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it.

While the container engine does this automatically, regardless of application, the following environment variable should always be set when using NCCL:

```bash
export NCCL_NET="AWS Libfabric"
```

This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network.
Conversely, if the plugin cannot be found, applications will fail to start instead of falling back to e.g. TCP, which would be significantly slower than with the plugin.
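To confirm which network backend NCCL actually selected, its built-in debug logging can be used. A minimal sketch, assuming an application launched with `srun`; the binary name `./my_nccl_app` is a placeholder, while `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL variables:

```bash
# Sketch: check at startup which network backend NCCL selected.
# With NCCL_DEBUG=INFO, NCCL logs initialization details, including a
# "Using network ..." line naming the active network plugin.
export NCCL_NET="AWS Libfabric"
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # limit log volume to init/network messages

srun ./my_nccl_app 2>&1 | grep "Using network"   # ./my_nccl_app is a placeholder
```

Since `NCCL_DEBUG=INFO` is verbose, it is best enabled only while diagnosing setup problems and removed for production runs.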

For optimal performance, the following environment variables should also be set (these are set automatically by the container engine):

```bash
export NCCL_NET_GDR_LEVEL=PHB
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_RX_MATCH_MODE=software
```
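Putting both sets of variables together, a batch script might look like the following sketch. This is illustrative only: the node and task counts and the application name `./my_nccl_app` are placeholder assumptions, not prescribed values.

```bash
#!/bin/bash
# Illustrative sketch of a Slurm batch script for an NCCL application;
# node/task counts and the binary name are placeholders.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Required: force NCCL to use the libfabric plugin (fails fast if missing).
export NCCL_NET="AWS Libfabric"

# Recommended for performance (set automatically by the container engine):
export NCCL_NET_GDR_LEVEL=PHB
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_RX_MATCH_MODE=software

srun ./my_nccl_app   # placeholder application
```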

!!! warning "GPU-aware MPI with NCCL"
    Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi).
    Unless care is taken to ensure that the two methods of communication are not used concurrently, we recommend not using GPU-aware MPI with NCCL.