docs/software/container-engine.md
Lines changed: 9 additions & 10 deletions
@@ -511,24 +511,23 @@ The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which
com.hooks.aws_ofi_nccl.variant = "cuda12" # (1)
```
-1. `com.hooks.aws_ofi_nccl.variant` may vary depending on vClusters.
+1. `com.hooks.aws_ofi_nccl.variant` may vary depending on vClusters. Details below.
The [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) is a software extension that allows the [NCCL](https://developer.nvidia.com/nccl) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
Also see [NCCL][ref-communication-nccl] and [libfabric][ref-communication-libfabric] for more information on using the libraries on Alps.
The Container Engine includes a hook program to inject the AWS OFI NCCL plugin in containers; since the plugin must also be compatible with the GPU programming software stack being used, the `com.hooks.aws_ofi_nccl.variant` annotation is used to specify a plugin variant suitable for a given container image.
At the moment of writing, 4 plugin variants are configured: `cuda11`, `cuda12` (to be used on NVIDIA GPU nodes), `rocm5`, and `rocm6` (to be used on AMD GPU nodes alongside RCCL).
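
For illustration only (this sketch is not part of the diff), an EDF that selects one of the variants listed above might look like the following. The `com.hooks.aws_ofi_nccl.variant` key comes from the snippet earlier in this hunk; the `image` value is a hypothetical placeholder, and the `[annotations]` table and the `com.hooks.aws_ofi_nccl.enabled` key are assumptions made by analogy with the `com.hooks.cxi.enabled` annotation, so check them against the Container Engine documentation for your vCluster.

```toml
# Hypothetical placeholder image; substitute an image built for your GPU stack.
image = "example.registry/org/rccl-app:latest"

[annotations]
# Assumed enabling annotation, by analogy with com.hooks.cxi.enabled.
com.hooks.aws_ofi_nccl.enabled = "true"
# One of the four variants named above: cuda11, cuda12, rocm5, rocm6.
# rocm6 is intended for AMD GPU nodes alongside RCCL.
com.hooks.aws_ofi_nccl.variant = "rocm6"
```

The variant should match the GPU programming stack inside the image, not only the node type: a CUDA 12 image on NVIDIA GPU nodes would use `cuda12` instead.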
-For example, the following EDF enables the hook and uses it to mount the plugin in a CUDA 11 image: