Commit 4b7ae6b

Add more examples and warnings about aws ofi nccl plugin not loading correctly
1 parent 9ae6744 commit 4b7ae6b

1 file changed

+36
-2
lines changed

docs/software/communication/nccl.md

Lines changed: 36 additions & 2 deletions
@@ -42,5 +42,39 @@ export MPICH_GPU_SUPPORT_ENABLED=0 # (3)
4242
Note that this option may be set to `1` by default on some Alps clusters.
4343
See [the Cray MPICH documentation][ref-communication-cray-mpich] for more details on GPU-aware MPI with Cray MPICH.
4444

45-
!!! todo
46-
More options?
45+
!!! warning "`invalid usage` error with `NCCL_NET="AWS Libfabric"`"
46+
If you are getting error messages such as:
47+
```console
48+
nid006352: Test NCCL failure common.cu:958 'invalid usage (run with NCCL_DEBUG=WARN for details)
49+
```
50+
this may be due to the plugin not being found by NCCL.
51+
If this is the case, running the application with the recommended `NCCL_DEBUG=WARN` should print something similar to the following:
52+
```console
53+
nid006352:34157:34217 [1] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
54+
```
55+
When using uenvs like `prgenv-gnu`, make sure you are using either the `default` view, which loads `aws-ofi-nccl` automatically, or the `modules` view with the `aws-ofi-nccl` module loaded via `module load aws-ofi-nccl`.
56+
If the plugin is found correctly, running the application with `NCCL_DEBUG=INFO` should print:
57+
```console
58+
nid006352:34610:34631 [0] NCCL INFO Using network AWS Libfabric
59+
```
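The two log messages quoted in this warning can also be checked mechanically. The following is a minimal sketch, assuming the NCCL debug output of a run was captured to a file; the function name `nccl_plugin_status` and the file name `nccl.log` are hypothetical and not part of NCCL or the uenv tooling.

```shell
#!/usr/bin/env bash
# Classify a captured NCCL debug log using the messages quoted above.
# Assumption: the log was produced with NCCL_DEBUG=INFO (or WARN), e.g.
#   NCCL_DEBUG=INFO srun ./my_app 2>&1 | tee nccl.log   # my_app is a placeholder
nccl_plugin_status() {
    local log="$1"
    # Check the failure message first: a failing run may contain other INFO lines.
    if grep -qF "NCCL WARN Error: network AWS Libfabric not found" "$log"; then
        echo "not-found"
    elif grep -qF "NCCL INFO Using network AWS Libfabric" "$log"; then
        echo "loaded"
    else
        # Neither message present: the log level may have been too low to tell.
        echo "unknown"
    fi
}
```

Calling `nccl_plugin_status nccl.log` prints `loaded`, `not-found`, or `unknown`. An `unknown` result most likely means the run did not use `NCCL_DEBUG=INFO`, so the success message was never printed.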
60+
61+
!!! warning "`NCCL_NET_PLUGIN="ofi"` with uenvs"
62+
When using uenvs, do not set `NCCL_NET_PLUGIN="ofi"` instead of, or in addition to, `NCCL_NET="AWS Libfabric"`.
63+
If you do, your application will fail to start since NCCL will:
64+
65+
1. fail to find the plugin because of the name of the shared library in the uenv, and
66+
2. prefer `NCCL_NET_PLUGIN` over `NCCL_NET`, so it will fail to find the plugin even if `NCCL_NET="AWS Libfabric"` is correctly set.
67+
68+
When both environment variables are set, the error message printed with `NCCL_DEBUG=WARN` looks similar to the one printed when the plugin isn't available:
69+
```console
70+
nid006365:179857:179897 [1] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
71+
```
72+
73+
With `NCCL_DEBUG=INFO`, NCCL will print:
74+
```console
75+
nid006365:180142:180163 [0] NCCL INFO NET/Plugin: Could not find: ofi libnccl-net-ofi.so. Using internal network plugin.
76+
...
77+
nid006365:180142:180163 [0] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
78+
```
79+
80+
If you set only `NCCL_NET="ofi"`, NCCL may silently fail to load the plugin and fall back to the default implementation.
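The advice in this warning can be turned into a pre-launch check. The following is a minimal sketch, assuming a job script running inside a uenv; the function name `check_nccl_net_env` is hypothetical, and the check covers only the two variables discussed here.

```shell
#!/usr/bin/env bash
# Report NCCL network settings that conflict with the guidance above.
# check_nccl_net_env is a hypothetical helper, not part of NCCL or uenv.
check_nccl_net_env() {
    local status=0
    # NCCL_NET_PLUGIN should not be set at all when using a uenv.
    if [ -n "${NCCL_NET_PLUGIN:-}" ]; then
        echo "NCCL_NET_PLUGIN is set to '${NCCL_NET_PLUGIN}'; unset it when using a uenv" >&2
        status=1
    fi
    # NCCL_NET should name the network exactly as it appears in the NCCL log.
    if [ "${NCCL_NET:-}" != "AWS Libfabric" ]; then
        echo "NCCL_NET is '${NCCL_NET:-}'; expected 'AWS Libfabric'" >&2
        status=1
    fi
    return "$status"
}

# Example use in a job script (my_app is a placeholder):
#   unset NCCL_NET_PLUGIN
#   export NCCL_NET="AWS Libfabric"
#   check_nccl_net_env && srun ./my_app
```

The check returns non-zero when either variable is off. Used as in the commented example, the job stops before launch instead of failing inside NCCL or quietly falling back to another network.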
