Commit ea89fe0

Expand NCCL and RCCL pages
1 parent 7a871b3 commit ea89fe0

3 files changed

+20
-3
lines changed

docs/software/communication/nccl.md

Lines changed: 15 additions & 3 deletions
@@ -4,7 +4,19 @@
44
[NCCL](https://developer.nvidia.com/nccl) is an optimized inter-GPU communication library for NVIDIA GPUs.
55
It is commonly used in machine learning frameworks, but traditional scientific applications can also benefit from NCCL.
66

7+
## Using NCCL
8+
9+
NCCL needs the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin to use the Slingshot network on Alps.
10+
With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it.
11+
12+
While the container engine does this automatically, the following environment variable should always be set when using NCCL, regardless of the application:
13+
14+
```bash
export NCCL_NET_PLUGIN="ofi"
```
17+
18+
This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network.
19+
If the plugin cannot be found, applications will fail to start instead of falling back to, e.g., TCP, which would be significantly slower than using the plugin.
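
As a minimal sketch of how a job script might guard against a missing setting, the snippet below exports the variable and checks it before anything is launched. The function name `check_nccl_plugin` is hypothetical and is not part of NCCL or the container engine; it only inspects the environment and does not verify that the plugin itself can be loaded.

```bash
# Force NCCL to use the libfabric (OFI) network plugin.
export NCCL_NET_PLUGIN="ofi"

# Hypothetical helper: return non-zero unless NCCL_NET_PLUGIN is set to "ofi".
check_nccl_plugin() {
    if [ "${NCCL_NET_PLUGIN:-}" != "ofi" ]; then
        echo "NCCL_NET_PLUGIN is '${NCCL_NET_PLUGIN:-unset}', expected 'ofi'" >&2
        return 1
    fi
}

# Stop here, before launching the application, if the variable is wrong.
check_nccl_plugin || exit 1
```

Calling the helper at the top of a batch script makes a misconfigured environment visible before any compute time is spent.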
20+
721
!!! todo
8-
- high level description
9-
- libfabric/aws-ofi-nccl plugin
10-
- configuration options
22+
More options?

docs/software/communication/rccl.md

Lines changed: 4 additions & 0 deletions
@@ -8,3 +8,7 @@ It provides equivalent functionality to [NCCL][ref-communication-nccl] for AMD G
88
- high level description
99
- libfabric/aws-ofi-rccl plugin
1010
- configuration options
11+
12+
!!! info
13+
RCCL uses many of the same [configuration options as NCCL](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html), with the `NCCL` prefix, not `RCCL`.
14+
Refer to the NCCL documentation to tune RCCL.
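
A small sketch of what this means in practice: the debug level for an RCCL application is set through the `NCCL_DEBUG` variable described in the NCCL documentation. The helper `list_nccl_vars` is hypothetical and simply prints the `NCCL_`-prefixed variables currently exported, which can help confirm what an RCCL application will see.

```bash
# RCCL reads NCCL-prefixed variables; NCCL_DEBUG is documented in the NCCL user guide.
export NCCL_DEBUG="INFO"

# Hypothetical helper: list every exported variable that starts with NCCL_.
list_nccl_vars() {
    env | grep '^NCCL_' | sort
}

list_nccl_vars
```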

docs/software/container-engine.md

Lines changed: 1 addition & 0 deletions
@@ -533,6 +533,7 @@ Container hooks let you customize container behavior to fit system-specific need
533533
### AWS OFI NCCL Hook 
534534

535535
The [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) is a software extension that allows the [NCCL](https://developer.nvidia.com/nccl) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
536+
See also [NCCL][ref-communication-nccl] and [libfabric][ref-communication-libfabric] for more information on using these libraries on Alps.
536537

537538
The Container Engine includes a hook program to inject the AWS OFI NCCL plugin in containers; since the plugin must also be compatible with the GPU programming software stack being used, the `com.hooks.aws_ofi_nccl.variant` annotation is used to specify a plugin variant suitable for a given container image.
538539
At the time of writing, four plugin variants are configured: `cuda11` and `cuda12` (to be used on NVIDIA GPU nodes), and `rocm5` and `rocm6` (to be used on AMD GPU nodes alongside RCCL).
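
As an illustration of selecting a variant, the sketch below writes a small environment definition file from the shell. It assumes the file is TOML with an `[annotations]` table, which is not shown in this section; the image path and file name are placeholders, and the hook may need further annotations that are not covered here.

```bash
# Write a hypothetical environment definition file into a temporary directory.
edf_dir="$(mktemp -d)"
edf_file="$edf_dir/nccl-example.toml"

cat > "$edf_file" <<'EOF'
# Placeholder image path; replace with a real image.
image = "/path/to/image.sqsh"

[annotations]
# Select the plugin variant matching the CUDA 12 software stack in the image.
com.hooks.aws_ofi_nccl.variant = "cuda12"
EOF

cat "$edf_file"
```

On AMD GPU nodes the same annotation would take one of the `rocm5` or `rocm6` values instead.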
