
Commit 95d8ee2

add troubleshooting guide for losing GPU access error (#186)
* add troubleshooting guide for losing GPU access error

Signed-off-by: Abigail McCarthy <[email protected]>
1 parent: f98163b

File tree

2 files changed: +48, -2 lines


container-toolkit/install-guide.md

Lines changed: 0 additions & 1 deletion
@@ -18,7 +18,6 @@ For information about installing the driver with a package manager, refer to
the [_NVIDIA Driver Installation Quickstart Guide_](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html).
Alternatively, you can install the driver by [downloading](https://www.nvidia.com/en-us/drivers/) a `.run` installer.

(installing-with-apt)=

The only change in this file removes an extra blank line before the `(installing-with-apt)=` anchor.

### With `apt`: Ubuntu, Debian

container-toolkit/troubleshooting.md

Lines changed: 48 additions & 1 deletion
@@ -119,4 +119,51 @@ Without this option, you might observe this error when running GPU containers:
``Failed to initialize NVML: Insufficient Permissions``.
However, using this option disables SELinux separation in the container and the container is executed in an unconfined type.
Review the SELinux policies on your system.

## Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error"

When using the NVIDIA Container Runtime Hook (i.e., the Docker `--gpus` flag or the NVIDIA Container Runtime in `legacy` mode) to inject requested GPUs and driver libraries into a container, the hook makes modifications to the container, including setting up cgroup access, without the low-level runtime (e.g., `runc`) being aware of these changes.
The result is that updates to the container may remove access to the requested GPUs.

When the container loses access to the GPU, you will see the following error message in the console output:

```console
Failed to initialize NVML: Unknown Error
```

The message may differ depending on the type of application that is running in the container.

The container needs to be deleted once the issue occurs.
When it is restarted, either manually or automatically if you are using a container orchestration platform, it regains access to the GPU.

### Affected environments

On certain systems, this behavior is not limited to *explicit* container updates, such as adjusting the CPU and memory limits for a container.
On systems where `systemd` is used to manage the cgroups of the container, reloading the `systemd` unit files (`systemctl daemon-reload`) is sufficient to trigger a container update and cause a loss of GPU access.

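The following is a minimal reproduction sketch, assuming Docker with the `--gpus` flag and `systemd` managing the container cgroups; the image tag and container name are illustrative:

```console
# Start a GPU container and confirm that the driver is visible.
$ docker run -d --name cuda-test --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity
$ docker exec cuda-test nvidia-smi -L

# Reload the systemd unit files; on affected systems this triggers a container update.
$ sudo systemctl daemon-reload

# The container has now lost access to the GPU.
$ docker exec cuda-test nvidia-smi
Failed to initialize NVML: Unknown Error

# Delete and recreate the container to regain access to the GPU.
$ docker rm -f cuda-test
$ docker run -d --name cuda-test --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity
```
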
### Mitigations and Workarounds

```{warning}
Certain `runc` versions show similar behavior with the `systemd` cgroup driver when the `/dev/char` symlinks for the required devices are missing on the system.
Refer to [GitHub discussion #1133](https://github.com/NVIDIA/nvidia-container-toolkit/discussions/1133) for more details about this issue.
Note that the behavior persisted even if the device nodes were requested on the command line.
Newer `runc` versions do not show this behavior, and newer NVIDIA driver versions ensure that the required symlinks are present, reducing the likelihood of this specific issue occurring with affected `runc` versions.
```
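As a quick check for the condition described in this warning, you can verify whether the `/dev/char` symlinks for the NVIDIA device nodes exist on the host; this is a sketch, and the exact entries depend on the driver version and the installed GPUs:

```console
# If this prints no entries, the /dev/char symlinks for the NVIDIA devices are missing.
$ ls -l /dev/char/ | grep nvidia
```

If the symlinks are missing, newer NVIDIA Container Toolkit releases include an `nvidia-ctk system create-dev-char-symlinks` command that can create them; check `nvidia-ctk system --help` to confirm that it is available in your version.
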
Use the following workarounds to prevent containers from losing access to the requested GPUs when a `systemctl daemon-reload` command is run:

* Explicitly request the device nodes associated with the requested GPU(s), as well as any control device nodes, when starting the container.
  For the Docker CLI, this is done by adding the relevant `--device` flags.
  In the case of the NVIDIA Kubernetes Device Plugin, the `compatWithCPUManager=true` [Helm option](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#setting-other-helm-chart-values) ensures the same thing.
* Use the Container Device Interface (CDI) to inject devices into a container.
  When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config.
  This means that even if the container is updated, it still has access to the required devices.
* For Docker, use `cgroupfs` as the cgroup driver for containers.
  This ensures that the container does not lose access to devices when `systemctl daemon-reload` is run.
  This approach does not change the behavior for explicit container updates, and a container still loses access to devices in that case.

Example sketches of these workarounds are shown below.
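A sketch of the explicit device node workaround with the Docker CLI; the device node list is illustrative and depends on your system (for example, additional per-GPU nodes such as `/dev/nvidia1`, or `/dev/nvidia-modeset`, may also be required):

```console
$ docker run -d --name cuda-test --gpus all \
    --device /dev/nvidia0 \
    --device /dev/nvidiactl \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia-uvm-tools \
    nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity
```

For the NVIDIA Kubernetes Device Plugin, the equivalent is configured at deployment time with the `compatWithCPUManager=true` Helm value described in the list above.
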
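A sketch of the CDI workaround, assuming the `nvidia-ctk` CLI is installed and a CDI-capable engine such as Podman is used; the output path and device name follow common defaults and are illustrative:

```console
# Generate a CDI specification for the available GPUs, including the required device nodes.
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names defined in the generated specification.
$ nvidia-ctk cdi list

# Request a device by its CDI name; the device nodes become part of the container config,
# so a later container update does not remove GPU access.
$ podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
```
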
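A sketch of the `cgroupfs` cgroup driver workaround for Docker; note that this is a daemon-wide setting, and the `daemon.json` contents shown here are illustrative and should be merged with your existing configuration:

```console
# Configure the Docker daemon to use the cgroupfs cgroup driver.
$ cat /etc/docker/daemon.json
{
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

# Restart Docker and confirm the active cgroup driver.
$ sudo systemctl restart docker
$ docker info | grep -i "cgroup driver"
 Cgroup Driver: cgroupfs
```
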
