Commit 3adeba2

Update hooks
1 parent 8e3ad89 commit 3adeba2

1 file changed, +9 -10 lines changed

docs/software/container-engine.md

Lines changed: 9 additions & 10 deletions
@@ -511,24 +511,23 @@ The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which
 com.hooks.aws_ofi_nccl.variant = "cuda12" # (1)
 ```
 
-1. `com.hooks.aws_ofi_nccl.variant` may vary depending on vClusters.
+1. `com.hooks.aws_ofi_nccl.variant` may vary depending on vClusters. Details below.
 
 The [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) is a software extension that allows the [NCCL](https://developer.nvidia.com/nccl) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
 Also see [NCCL][ref-communication-nccl] and [libfabric][ref-communication-libfabric] for more information on using the libraries on Alps.
 
 The Container Engine includes a hook program to inject the AWS OFI NCCL plugin in containers; since the plugin must also be compatible with the GPU programming software stack being used, the `com.hooks.aws_ofi_nccl.variant` annotation is used to specify a plugin variant suitable for a given container image.
 At the moment of writing, 4 plugin variants are configured: `cuda11`, `cuda12` (to be used on NVIDIA GPU nodes), `rocm5`, and `rocm6` (to be used on AMD GPU nodes alongside RCCL).
-For example, the following EDF enables the hook and uses it to mount the plugin in a CUDA 11 image:
 
-```bash
-image = "nvcr.io#nvidia/pytorch:22.12-py3"
-mounts = ["/capstor/scratch/cscs/amadonna:/capstor/scratch/cscs/amadonna"]
-entrypoint = false
+!!! example "EDF for the NGC PyTorch 22.12 image with Cuda 11
+```bash
+image = "nvcr.io#nvidia/pytorch:22.12-py3"
+mounts = ["/capstor/scratch/cscs/${USER}:/capstor/scratch/cscs/${USER}"]
 
-[annotations]
-com.hooks.aws_ofi_nccl.enabled = "true"
-com.hooks.aws_ofi_nccl.variant = "cuda11"
-```
+[annotations]
+com.hooks.aws_ofi_nccl.enabled = "true"
+com.hooks.aws_ofi_nccl.variant = "cuda11"
+```
 
 The AWS OFI NCCL hook also takes care of the following aspects:
