Commit f2f5f13 (1 parent b4034e3)

Update 'Accessing native...'

File tree

1 file changed (+48, -40 lines)


docs/software/container-engine.md

Lines changed: 48 additions & 40 deletions
@@ -335,47 +335,55 @@ This can be done in multiple ways in TOML: for example, both of the following us
 ### NVIDIA GPUs
 
 The Container Engine leverages components from the NVIDIA Container Toolkit to expose NVIDIA GPU devices inside containers.
-GPU device files are always mounted in containers, and the NVIDIA driver user space components are mounted if the `NVIDIA_VISIBLE_DEVICES` environment variable is set and is neither empty nor `void`. `NVIDIA_VISIBLE_DEVICES` is already set in container images officially provided by NVIDIA to enable all GPUs available on the host system. Such images are frequently used to containerize CUDA applications, either directly or as a base for custom images, thus in many cases no action is required to access GPUs.
-For example, on a cluster with 4 GH200 devices per compute node:
+GPU device files are always mounted in containers, and the NVIDIA driver user space components are mounted if the `NVIDIA_VISIBLE_DEVICES` environment variable is set and is neither empty nor `void`.
+`NVIDIA_VISIBLE_DEVICES` is already set in container images officially provided by NVIDIA to enable all GPUs available on the host system.
+Such images are frequently used to containerize CUDA applications, either directly or as a base for custom images, thus in many cases no action is required to access GPUs.
 
-```bash
-> cat .edf/cuda12.5.1.toml
-image = "nvidia/cuda:12.5.1-devel-ubuntu24.04"
-
-> srun --environment=cuda12.5.1 nvidia-smi
-Thu Oct 26 17:59:36 2023
-+------------------------------------------------------------------------------------+
-| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.5 |
-|--------------------------------------+----------------------+----------------------+
-| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
-| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
-| | | MIG M. |
-|======================================+======================+======================|
-| 0 GH200 120GB On | 00000009:01:00.0 Off | 0 |
-| N/A 24C P0 89W / 900W | 37MiB / 97871MiB | 0% E. Process |
-| | | Disabled |
-+--------------------------------------+----------------------+----------------------+
-| 1 GH200 120GB On | 00000019:01:00.0 Off | 0 |
-| N/A 24C P0 87W / 900W | 37MiB / 97871MiB | 0% E. Process |
-| | | Disabled |
-+--------------------------------------+----------------------+----------------------+
-| 2 GH200 120GB On | 00000029:01:00.0 Off | 0 |
-| N/A 24C P0 83W / 900W | 37MiB / 97871MiB | 0% E. Process |
-| | | Disabled |
-+--------------------------------------+----------------------+----------------------+
-| 3 GH200 120GB On | 00000039:01:00.0 Off | 0 |
-| N/A 24C P0 85W / 900W | 37MiB / 97871MiB | 0% E. Process |
-| | | Disabled |
-+--------------------------------------+----------------------+----------------------+
-
-+------------------------------------------------------------------------------------+
-| Processes: |
-| GPU GI CI PID Type Process name GPU Memory |
-| ID ID Usage |
-|====================================================================================|
-| No running processes found |
-+------------------------------------------------------------------------------------+
-```
+!!! example "Cluster with 4 GH200 devices per node"
+    ```bash
+    $ cat <<EOF >cuda12.5.1.toml  # (1)
+    > image = "nvidia/cuda:12.5.1-devel-ubuntu24.04"
+    > EOF
+
+    $ cat cuda12.5.1.toml
+    image = "nvidia/cuda:12.5.1-devel-ubuntu24.04"
+
+    $ srun --environment=./cuda12.5.1.toml nvidia-smi
+    Thu Oct 26 17:59:36 2023
+    +------------------------------------------------------------------------------------+
+    | NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.5 |
+    |--------------------------------------+----------------------+----------------------+
+    | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
+    | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
+    | | | MIG M. |
+    |======================================+======================+======================|
+    | 0 GH200 120GB On | 00000009:01:00.0 Off | 0 |
+    | N/A 24C P0 89W / 900W | 37MiB / 97871MiB | 0% E. Process |
+    | | | Disabled |
+    +--------------------------------------+----------------------+----------------------+
+    | 1 GH200 120GB On | 00000019:01:00.0 Off | 0 |
+    | N/A 24C P0 87W / 900W | 37MiB / 97871MiB | 0% E. Process |
+    | | | Disabled |
+    +--------------------------------------+----------------------+----------------------+
+    | 2 GH200 120GB On | 00000029:01:00.0 Off | 0 |
+    | N/A 24C P0 83W / 900W | 37MiB / 97871MiB | 0% E. Process |
+    | | | Disabled |
+    +--------------------------------------+----------------------+----------------------+
+    | 3 GH200 120GB On | 00000039:01:00.0 Off | 0 |
+    | N/A 24C P0 85W / 900W | 37MiB / 97871MiB | 0% E. Process |
+    | | | Disabled |
+    +--------------------------------------+----------------------+----------------------+
+
+    +------------------------------------------------------------------------------------+
+    | Processes: |
+    | GPU GI CI PID Type Process name GPU Memory |
+    | ID ID Usage |
+    |====================================================================================|
+    | No running processes found |
+    +------------------------------------------------------------------------------------+
+    ```
+
+    1. Creating `cuda12.5.1.toml` in the current folder.
 
 It is possible to use environment variables to control which capabilities of the NVIDIA driver are enabled inside containers.
 Additionally, the NVIDIA Container Toolkit can enforce specific constraints for the container, for example, on versions of the CUDA runtime or driver, or on the architecture of the GPUs.
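As a sketch of the closing paragraph's point (not part of this commit), an EDF could set NVIDIA Container Toolkit variables to select devices, enable specific driver capabilities, and enforce constraints. `NVIDIA_VISIBLE_DEVICES`, `NVIDIA_DRIVER_CAPABILITIES`, and `NVIDIA_REQUIRE_CUDA` are standard toolkit variables; the `[env]` table used to pass them into the container is an assumption about the EDF format, so check the EDF reference for the exact key name.

```toml
# Hypothetical EDF sketch: the [env] table for container environment
# variables is assumed, not confirmed by this commit.
image = "nvidia/cuda:12.5.1-devel-ubuntu24.04"

[env]
# Expose only the first two GPUs instead of every device on the node.
NVIDIA_VISIBLE_DEVICES = "0,1"
# Enable only the compute (CUDA) and utility (nvidia-smi) capabilities.
NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
# Refuse to start on hosts whose driver cannot support CUDA 12.5.
NVIDIA_REQUIRE_CUDA = "cuda>=12.5"
```

The variable semantics follow the NVIDIA Container Toolkit documentation regardless of how a given EDF format spells its environment section.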
