docs/software/container-engine/resource-hook.md
Lines changed: 31 additions & 37 deletions
Original file line number
Diff line number
Diff line change
@@ -31,7 +31,7 @@ This can be done in multiple ways in TOML: for example, both of the following us
31
31
32
32
* An assignment can implicitly define subtables if the key you assign is a dotted list. As a reference, see the examples made earlier in this section, where assigning a string to the `com.hooks.ssh.enabled` attribute within the `[annotations]` table is exactly equivalent to assigning to the `enabled` attribute within the `[annotations.com.hooks.ssh]` subtable.
33
33
34
-
* Attributes can be added to a table only in one place in the TOML file. In other words, each table must be defined in a single square bracket section. For example, Case 3 in the example below is invalid because the `ssh` table was doubly defined both in the `[annotations]` and in the `[annotations.com.hooks.ssh]` sections. See the [TOML format](https://toml.io/en/) spec for more details.
34
+
* Attributes can be added to a table only in one place in the TOML file. In other words, each table must be defined in a single square bracket section. For example, in the invalid example below, the `ssh` table was doubly defined both in the `[annotations]` and in the `[annotations.com.hooks.ssh]` sections. See the [TOML format](https://toml.io/en/) spec for more details.
35
35
36
36
```bash title="Valid"
37
37
[annotations.com.hooks.ssh]
@@ -124,28 +124,25 @@ Container hooks let you customize container behavior to fit system-specific need
124
124
[](){#ref-ce-cxi-hook}
125
125
### HPE Slingshot interconnect
126
126
127
-
!!! tip
128
-
On most vClusters, the CXI hook for Slingshot connectivity is enabled implicitly by default or by other hooks.
129
-
Therefore, entering the enabling annotation in the EDF is unnecessary in many cases.
130
-
131
-
!!! note "Required annotation"
132
-
```console
133
-
com.hooks.cxi.enabled = "true"
134
-
```
127
+
```bash title="Required annotation"
128
+
com.hooks.cxi.enabled = "true"
129
+
```
135
130
136
131
The Container Engine provides a hook to allow containers relying on [libfabric](https://ofiwg.github.io/libfabric/) to leverage the HPE Slingshot 11 high-speed interconnect.
137
132
This component is commonly referred to as the "CXI hook", taking its name from the CXI libfabric provider required to interface with Slingshot 11.
138
133
The hook leverages bind-mounting the custom host libfabric library into the container (in addition to all the required dependency libraries and devices as well).
139
134
140
135
If a libfabric library is already present in the container filesystem (for example, it's provided by the image), it is replaced with its host counterpart, otherwise the host libfabric is just added to the container.
141
136
142
-
!!! note
143
-
Due to the nature of Slingshot and the mechanism implemented by the CXI hook, container applications need to use a communication library which supports libfabric in order to benefit from usage of the hook.
137
+
The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which can be defined in the EDF.
144
138
145
-
!!! note
146
-
Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details).
139
+
!!! tip
140
+
On most vClusters, the CXI hook for Slingshot connectivity is enabled implicitly by default or by other hooks.
141
+
Therefore, entering the enabling annotation in the EDF is unnecessary in many cases.
147
142
148
-
The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which can be defined in the EDF.
143
+
!!! note
144
+
* Due to the nature of Slingshot and the mechanism implemented by the CXI hook, container applications need to use a communication library which supports libfabric in order to benefit from usage of the hook.
145
+
* Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details).
149
146
150
147
??? example "Comparison between with and without the CXI hook"
151
148
* Without the CXI hook
@@ -225,13 +222,12 @@ The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which
225
222
[](){#ref-ce-aws-ofi-hook}
226
223
### AWS OFI NCCL Hook
227
224
228
-
!!! note "Required annotation"
229
-
```console
230
-
com.hooks.aws_ofi_nccl.enabled = "true"
231
-
com.hooks.aws_ofi_nccl.variant = "cuda12" # (1)
232
-
```
225
+
```bash title="Required annotation"
226
+
com.hooks.aws_ofi_nccl.enabled = "true"
227
+
com.hooks.aws_ofi_nccl.variant = "cuda12" # (1)
228
+
```
233
229
234
-
1. `com.hooks.aws_ofi_nccl.variant` may vary depending on vClusters. Details below.
230
+
1. `com.hooks.aws_ofi_nccl.variant` may vary depending on vClusters. Details below.
235
231
236
232
The [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) is a software extension that allows the [NCCL](https://developer.nvidia.com/nccl) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
237
233
Also see [NCCL][ref-communication-nccl] and [libfabric][ref-communication-libfabric] for more information on using the libraries on Alps.
@@ -258,13 +254,12 @@ At the moment of writing, 4 plugin variants are configured: `cuda11`, `cuda12`
1. Replace `<public-key>` with your SSH public key.
262
+
1. Replace `<public-key>` with your SSH public key.
268
263
269
264
!!! warning
270
265
The `srun` command launching an SSH-connectable container **should set the `--pty` option** in order for the hook to initialize properly.
@@ -279,6 +274,13 @@ After the container starts, it is possible to get a remote shell inside the cont
279
274
280
275
By default, the server started by the SSH hook listens to port 15263, but this setting can be controlled through the `com.hooks.ssh.port` annotation in the EDF.
281
276
277
+
!!! note
278
+
The container must be **writable** (default) to use the SSH hook.
279
+
280
+
!!! info
281
+
In order to establish connections through Visual Studio Code [Remote - SSH](https://code.visualstudio.com/docs/remote/ssh) extension, the `scp` program must be available inside the container.
282
+
This is required to send and establish the VS Code Server into the remote container.
283
+
282
284
!!! example "Logging into a sleeping container via SSH"
283
285
* On the cluster
284
286
```console
@@ -297,19 +299,11 @@ By default, the server started by the SSH hook listens to port 15263, but this s
297
299
ssh -p 15263 <host-of-container>
298
300
```
299
301
300
-
!!! note
301
-
The container must be **writable** (default) to use the SSH hook.
302
-
303
-
!!! info
304
-
In order to establish connections through Visual Studio Code [Remote - SSH](https://code.visualstudio.com/docs/remote/ssh) extension, the `scp` program must be available inside the container.
305
-
This is required to send and establish the VS Code Server into the remote container.
306
-
307
302
### NVIDIA CUDA MPS Hook
308
303
309
-
!!! note "Require annotation"
310
-
```console
311
-
com.hooks.nvidia_cuda_mps.enabled = "true"
312
-
```
304
+
```bash title="Require annotation"
305
+
com.hooks.nvidia_cuda_mps.enabled = "true"
306
+
```
313
307
314
308
On several Alps vClusters, NVIDIA GPUs by default operate in "Exclusive process" mode, that is, the CUDA driver is configured to allow only one process at a time to use a given GPU.
315
309
For example, on a node with 4 GPUs, a maximum of 4 CUDA processes can run at the same time.