container-toolkit/troubleshooting.md

Without this option, you might observe this error when running GPU containers:
``Failed to initialize NVML: Insufficient Permissions``.
However, using this option disables SELinux separation in the container, and the container is
executed in an unconfined type.
Review the SELinux policies on your system.

## Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error"

When using the NVIDIA Container Runtime Hook (i.e. the Docker `--gpus` flag or
the NVIDIA Container Runtime in `legacy` mode) to inject requested GPUs and driver
libraries into a container, the hook makes modifications to the container, including setting up
cgroup access, without the low-level runtime (e.g. `runc`) being aware of these changes.
As a result, updates to the container may remove access to the requested GPUs.

When the container loses access to the GPU, you will see the following error message in the console output:

```console
Failed to initialize NVML: Unknown Error
```

The message may differ depending on the type of application that is running in
the container.

The container needs to be deleted once the issue occurs.
When it is restarted, manually or automatically depending on whether you are using a container orchestration platform, it will regain access to the GPU.

### Affected environments

On certain systems, this behavior is not limited to *explicit* container updates
such as adjusting CPU and memory limits for a container.
On systems where `systemd` is used to manage the cgroups of the container, reloading the `systemd` unit files (`systemctl daemon-reload`) is sufficient to trigger container updates and cause a loss of GPU access.
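
For illustration only, a minimal reproduction on an affected system might look like the following.
The image and container names are placeholders; any CUDA-capable image started with the `--gpus` flag will do.

```console
$ docker run -d --rm --gpus all --name gpu-test nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity
$ docker exec gpu-test nvidia-smi -L   # initially lists the requested GPU(s)
$ sudo systemctl daemon-reload         # triggers a container update via systemd
$ docker exec gpu-test nvidia-smi      # the container has now lost access to the GPU
Failed to initialize NVML: Unknown Error
```
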
### Mitigations and Workarounds

```{warning}
Certain `runc` versions show similar behavior with the `systemd` cgroup driver when `/dev/char` symlinks for the required devices are missing on the system.
Refer to [GitHub discussion #1133](https://github.com/NVIDIA/nvidia-container-toolkit/discussions/1133) for more details about this issue.
Note that the behavior persisted even when device nodes were requested on the command line.
Newer `runc` versions do not show this behavior, and newer NVIDIA driver versions ensure that the required symlinks are present, reducing the likelihood of this issue occurring with affected `runc` versions.
```
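
If the `/dev/char` symlinks are missing, they can be created manually. As a sketch, assuming the NVIDIA Container Toolkit CLI (`nvidia-ctk`) is installed:

```console
$ sudo nvidia-ctk system create-dev-char-symlinks --create-all
```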

Use the following workarounds to prevent containers from losing access to requested GPUs when a `systemctl daemon-reload` command is run:

* Explicitly request the device nodes associated with the requested GPU(s) and any control device nodes when starting the container.
  For the Docker CLI, this is done by adding the relevant `--device` flags (see the device-node sketch after this list).
  In the case of the NVIDIA Kubernetes Device Plugin, the `compatWithCPUManager=true` [Helm option](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#setting-other-helm-chart-values) ensures the same thing.
* Use the Container Device Interface (CDI) to inject devices into a container (see the CDI sketch after this list).
  When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config.
  This means that even if the container is updated, it will still have access to the required devices.
* For Docker, use `cgroupfs` as the cgroup driver for containers (see the cgroup-driver sketch after this list).
  This ensures that the container does not lose access to devices when `systemctl daemon-reload` is run.
  This approach does not change the behavior for explicit container updates, and a container will still lose access to devices in that case.
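
The following is a minimal device-node sketch of the Docker CLI approach. The image name is a placeholder, and the exact set of device nodes varies by system (for example, `/dev/nvidia-modeset` is not always present); `/dev/nvidia0` corresponds to the first GPU.

```console
$ docker run --rm --gpus all \
    --device /dev/nvidia0 \
    --device /dev/nvidiactl \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia-uvm-tools \
    nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```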
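
The following CDI sketch assumes the NVIDIA Container Toolkit is installed and that the container engine has CDI support enabled (for Docker, this requires a recent release with the CDI feature turned on):

```console
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ nvidia-ctk cdi list
$ docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
```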
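
The following cgroup-driver sketch assumes Docker reads its configuration from `/etc/docker/daemon.json`; the `native.cgroupdriver` exec option switches the driver, and the daemon must be restarted for the change to take effect:

```console
$ cat /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
$ sudo systemctl restart docker
```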