
Commit fc74a30

Update resource-hook.md
1 parent 62e6324 commit fc74a30

1 file changed: +31 −37 lines

docs/software/container-engine/resource-hook.md

@@ -31,7 +31,7 @@ This can be done in multiple ways in TOML: for example, both of the following us
 
 * An assignment can implicitly define subtables if the key you assign is a dotted list. As a reference, see the examples made earlier in this section, where assigning a string to the `com.hooks.ssh.enabled` attribute within the `[annotations]` table is exactly equivalent to assigning to the `enabled` attribute within the `[annotations.com.hooks.ssh]` subtable.
 
-* Attributes can be added to a table only in one place in the TOML file. In other words, each table must be defined in a single square bracket section. For example, Case 3 in the example below is invalid because the `ssh` table was doubly defined both in the `[annotations]` and in the `[annotations.com.hooks.ssh]` sections. See the [TOML format](https://toml.io/en/) spec for more details.
+* Attributes can be added to a table only in one place in the TOML file. In other words, each table must be defined in a single square-bracket section. For example, in the invalid example below, the `ssh` table is defined twice: once in the `[annotations]` section and once in the `[annotations.com.hooks.ssh]` section. See the [TOML format](https://toml.io/en/) spec for more details.
 
 ```bash title="Valid"
 [annotations.com.hooks.ssh]
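To make the equivalence between dotted keys and explicit subtables concrete, here is a sketch of the two forms (either form alone is valid; defining the same table both ways in one file is exactly the doubly-defined case described as invalid above, so Form B is shown commented out):

```toml
# Form A: dotted key inside the [annotations] table
[annotations]
com.hooks.ssh.enabled = "true"

# Form B: explicit subtable, exactly equivalent to Form A
# [annotations.com.hooks.ssh]
# enabled = "true"
```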
@@ -124,28 +124,25 @@ Container hooks let you customize container behavior to fit system-specific need
 [](){#ref-ce-cxi-hook}
 ### HPE Slingshot interconnect
 
-!!! tip
-    On most vClusters, the CXI hook for Slingshot connectivity is enabled implicitly by default or by other hooks.
-    Therefore, entering the enabling annotation in the EDF is unnecessary in many cases.
-
-!!! note "Required annotation"
-    ```console
-    com.hooks.cxi.enabled = "true"
-    ```
+```bash title="Required annotation"
+com.hooks.cxi.enabled = "true"
+```
 
 The Container Engine provides a hook to allow containers relying on [libfabric](https://ofiwg.github.io/libfabric/) to leverage the HPE Slingshot 11 high-speed interconnect.
 This component is commonly referred to as the "CXI hook", taking its name from the CXI libfabric provider required to interface with Slingshot 11.
 The hook bind-mounts the host's custom libfabric library into the container, together with all the required dependency libraries and devices.
 
 If a libfabric library is already present in the container filesystem (for example, provided by the image), it is replaced with its host counterpart; otherwise, the host libfabric is simply added to the container.
 
-!!! note
-    Due to the nature of Slingshot and the mechanism implemented by the CXI hook, container applications need to use a communication library which supports libfabric in order to benefit from usage of the hook.
+The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which can be defined in the EDF.
 
-!!! note
-    Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details).
+!!! tip
+    On most vClusters, the CXI hook for Slingshot connectivity is enabled implicitly by default or by other hooks.
+    Therefore, entering the enabling annotation in the EDF is unnecessary in many cases.
 
-The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which can be defined in the EDF.
+!!! note
+    * Due to the nature of Slingshot and the mechanism implemented by the CXI hook, container applications need to use a communication library that supports libfabric in order to benefit from the hook.
+    * Libfabric support might have to be enabled at compilation time (as is the case for some MPI implementations, such as MPICH and OpenMPI) or may be dynamically available at runtime (as with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details).
 
 ??? example "Comparison between with and without the CXI hook"
     * Without the CXI hook
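Putting it together, a minimal EDF that enables the CXI hook explicitly might look like the following sketch (the image reference is a hypothetical placeholder, not taken from this page):

```toml
# Hypothetical minimal EDF; replace the image with one available on your vCluster
image = "library/ubuntu:24.04"

[annotations]
com.hooks.cxi.enabled = "true"
```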
@@ -225,13 +222,12 @@ The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which
 [](){#ref-ce-aws-ofi-hook}
 ### AWS OFI NCCL Hook
 
-!!! note "Required annotation"
-    ```console
-    com.hooks.aws_ofi_nccl.enabled = "true"
-    com.hooks.aws_ofi_nccl.variant = "cuda12" # (1)
-    ```
+```bash title="Required annotation"
+com.hooks.aws_ofi_nccl.enabled = "true"
+com.hooks.aws_ofi_nccl.variant = "cuda12" # (1)
+```
 
-    1. `com.hooks.aws_ofi_nccl.variant` may vary depending on vClusters. Details below.
+1. `com.hooks.aws_ofi_nccl.variant` may vary depending on the vCluster. Details below.
 
 The [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) is a software extension that allows the [NCCL](https://developer.nvidia.com/nccl) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
 Also see [NCCL][ref-communication-nccl] and [libfabric][ref-communication-libfabric] for more information on using the libraries on Alps.
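As a sketch, an EDF for an NCCL-based workload could combine both annotations as follows (the image name is a hypothetical placeholder; choose the `variant` value configured for your vCluster, as the surrounding text explains):

```toml
# Hypothetical EDF for an NCCL-based container; the image is a placeholder
image = "nvcr.io#nvidia/pytorch:24.01-py3"

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda12"  # pick the variant matching your vCluster
```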
@@ -258,13 +254,12 @@ At the moment of writing, 4 plugin variants are configured: `cuda11`, `cuda12`
 [](){#ref-ce-ssh-hook}
 ### SSH Hook
 
-!!! note "Required annotation"
-    ```console
-    com.hooks.ssh.enabled = "true"
-    com.hooks.ssh.authorize_ssh_key = "<public-key>" # (1)
-    ```
+```bash title="Required annotation"
+com.hooks.ssh.enabled = "true"
+com.hooks.ssh.authorize_ssh_key = "<public-key>" # (1)
+```
 
-    1. Replace `<public-key>` with your SSH public key.
+1. Replace `<public-key>` with your SSH public key.
 
 !!! warning
     The `srun` command launching an SSH-connectable container **should set the `--pty` option** in order for the hook to initialize properly.
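For instance (the EDF name `my-ssh-edf` is a hypothetical placeholder), a container that simply sleeps while accepting SSH connections could be launched with `--pty` set as the warning above requires:

```console
$ srun --environment=my-ssh-edf --pty sleep infinity
```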
@@ -279,6 +274,13 @@ After the container starts, it is possible to get a remote shell inside the cont
 
 By default, the server started by the SSH hook listens on port 15263, but this setting can be controlled through the `com.hooks.ssh.port` annotation in the EDF.
 
+!!! note
+    The container must be **writable** (the default) to use the SSH hook.
+
+!!! info
+    In order to establish connections through the Visual Studio Code [Remote - SSH](https://code.visualstudio.com/docs/remote/ssh) extension, the `scp` program must be available inside the container.
+    This is required to transfer and set up the VS Code Server inside the remote container.
+
 !!! example "Logging into a sleeping container via SSH"
     * On the cluster
     ```console
@@ -297,19 +299,11 @@ By default, the server started by the SSH hook listens on port 15263, but this s
     ssh -p 15263 <host-of-container>
     ```
 
-!!! note
-    The container must be **writable** (default) to use the SSH hook.
-
-!!! info
-    In order to establish connections through Visual Studio Code [Remote - SSH](https://code.visualstudio.com/docs/remote/ssh) extension, the `scp` program must be available inside the container.
-    This is required to send and establish the VS Code Server into the remote container.
-
 ### NVIDIA CUDA MPS Hook
 
-!!! note "Require annotation"
-    ```console
-    com.hooks.nvidia_cuda_mps.enabled = "true"
-    ```
+```bash title="Required annotation"
+com.hooks.nvidia_cuda_mps.enabled = "true"
+```
 
 On several Alps vClusters, NVIDIA GPUs by default operate in "Exclusive process" mode, that is, the CUDA driver is configured to allow only one process at a time to use a given GPU.
 For example, on a node with 4 GPUs, a maximum of 4 CUDA processes can run at the same time.
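As an illustration (the EDF name and application binary are hypothetical, and the exact command shape may differ), with the MPS hook enabled multiple CUDA processes can share each GPU, so more ranks than GPUs can be launched on a node, e.g. 16 ranks sharing 4 GPUs:

```console
$ srun -N1 -n16 --environment=my-mps-edf ./cuda_app
```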

0 commit comments
