
Commit 85408ef

add troubleshooting guide for losing GPU access error
Signed-off-by: Abigail McCarthy <[email protected]>
1 parent a1cb216 commit 85408ef


container-toolkit/troubleshooting.md

Lines changed: 278 additions & 1 deletion
@@ -119,4 +119,281 @@ Without this option, you might observe this error when running GPU containers:
``Failed to initialize NVML: Insufficient Permissions``.
However, using this option disables SELinux separation in the container and the container is executed
in an unconfined type.
Review the SELinux policies on your system.

## Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error"

Under specific conditions, containerized GPU workloads may suddenly lose access to their GPUs.
This situation occurs when `systemd` is used to manage the cgroups of the container and it is triggered to reload any unit files that reference NVIDIA GPUs (for example, by something as simple as a `systemctl daemon-reload`).

When the container loses access to the GPU, you will see the following error message in the console output:

```console
Failed to initialize NVML: Unknown Error
```

The container needs to be deleted once the issue occurs.
When it is restarted (manually or automatically, depending on whether you are using a container orchestration platform), it will regain access to the GPU.

The issue originates from the fact that recent versions of `runc` require that symlinks be present under `/dev/char` to any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not provide a means for them to be created automatically.
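
To see whether these symlinks exist on a node, you can list `/dev/char` and filter for NVIDIA entries; on an affected system no NVIDIA device symlinks appear. This is an illustrative check, not part of any official tooling:

```console
$ ls -l /dev/char/ | grep -i nvidia
```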

A fix will be present in the next patch release of all supported NVIDIA GPU drivers.

### Affected environments

You may be affected by this issue if you use `runc` and enable `systemd` cgroup management in the high-level container runtime.

```{note}
If the system is NOT using `systemd` to manage `cgroups`, then it is NOT subject to this issue.
```

Below is a full list of affected environments:

- Docker environments using `containerd` / `runc` with the following configurations:
  - `cgroup driver` enabled with `systemd`.
    For example, the parameter `"exec-opts": ["native.cgroupdriver=systemd"]` set in `/etc/docker/daemon.json`.
  - A newer Docker version where `systemd` cgroup management is the default, such as on Ubuntu 22.04.

To check if Docker uses systemd cgroup management, run the following command (the output below indicates that the systemd cgroup driver is enabled):

```console
$ docker info
...
Cgroup Driver: systemd
Cgroup Version: 1
```
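
If you only need the driver value, `docker info` also accepts a Go-template filter. A quick check, assuming a reasonably recent Docker release:

```console
$ docker info --format '{{.CgroupDriver}}'
systemd
```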

- K8s environments using `containerd` / `runc` with the following configuration:
  - `SystemdCgroup = true` in the containerd configuration file (usually located in `/etc/containerd/config.toml`) as shown below:

```console
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  ...
  SystemdCgroup = true
```

To check if containerd uses systemd cgroup management, issue the following command:

```console
$ sudo crictl info
```

*Example output:*

```output
...
"runtimes": {
  "nvidia": {
    "runtimeType": "io.containerd.runc.v2",
    ...
    "options": {
      "BinaryName": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
      ...
      "ShimCgroup": "",
      "SystemdCgroup": true
```
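
Because `crictl info` prints a JSON document, you can narrow the output down to the relevant field with a simple `grep` (an illustrative shortcut; the exact indentation of the output may differ):

```console
$ sudo crictl info | grep -i systemdcgroup
"SystemdCgroup": true
```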

- K8s environments (including OpenShift) using `cri-o` / `runc` with the following configuration:
  - `cgroup_manager` set to `systemd` in the cri-o configuration file (usually located in `/etc/crio/crio.conf` or `/etc/crio/crio.conf.d/00-default`) as shown below (sample from an OpenShift cluster):

```console
[crio.runtime]
...
cgroup_manager = "systemd"

hooks_dir = [
    "/etc/containers/oci/hooks.d",
    "/run/containers/oci/hooks.d",
    "/usr/share/containers/oci/hooks.d",
]
```
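
To check which cgroup manager cri-o is configured with, you can search the configuration files listed above directly. Output similar to the following indicates that systemd cgroup management is enabled (an illustrative check):

```console
$ sudo grep -R cgroup_manager /etc/crio/
/etc/crio/crio.conf:cgroup_manager = "systemd"
```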

Podman environments use `crun` by default and are not subject to this issue unless `runc` is configured as the low-level container runtime.
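
To confirm which low-level runtime Podman is using, you can query it directly. The command below assumes a Podman release that supports the `--format` Go template shown:

```console
$ podman info --format '{{.Host.OCIRuntime.Name}}'
crun
```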

### How to check if you are affected

You can use the following steps to confirm that your system is affected. After you implement one of the workarounds (mentioned in the next section), you can repeat the steps to confirm that the error is no longer reproducible.

#### For Docker environments

1. Run a test container:

   ```console
   $ docker run -d --rm --runtime=nvidia --gpus all \
       --device=/dev/nvidia-uvm \
       --device=/dev/nvidia-uvm-tools \
       --device=/dev/nvidia-modeset \
       --device=/dev/nvidiactl \
       --device=/dev/nvidia0 \
       nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"

   bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
   ```

   Make sure to mount the different devices as shown above. They are needed to narrow the problem down to this specific issue.

   If your system has more than one GPU, append an additional `--device` mount for each GPU to the command above. Example for a system with two GPUs:

   ```console
   $ docker run -d --rm --runtime=nvidia --gpus all \
       ...
       --device=/dev/nvidia0 \
       --device=/dev/nvidia1 \
       ...
   ```

1. Check the logs from the container:

   ```console
   $ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
   ```

   *Example output:*

   ```output
   GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
   GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
   ```

1. Initiate a daemon-reload:

   ```console
   $ sudo systemctl daemon-reload
   ```

1. Check the logs from the container again:

   ```console
   $ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
   ```

   *Example output:*

   ```output
   GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
   GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
   GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
   GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
   Failed to initialize NVML: Unknown Error
   Failed to initialize NVML: Unknown Error
   ```
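
Once you have confirmed the error, stop the test container. Because it was started with `--rm`, stopping it also removes it:

```console
$ docker stop bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
```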

#### For Kubernetes environments

1. Run a test pod:

   ```console
   $ cat nvidia-smi-loop.yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: cuda-nvidia-smi-loop
   spec:
     restartPolicy: OnFailure
     containers:
     - name: cuda
       image: "nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04"
       command: ["/bin/sh", "-c"]
       args: ["while true; do nvidia-smi -L; sleep 5; done"]
       resources:
         limits:
           nvidia.com/gpu: 1

   $ kubectl apply -f nvidia-smi-loop.yaml
   ```

1. Check the logs from the pod:

   ```console
   $ kubectl logs cuda-nvidia-smi-loop
   ```

   *Example output:*

   ```output
   GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
   GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
   ```

1. Initiate a `daemon-reload`:

   ```console
   $ sudo systemctl daemon-reload
   ```

1. Check the logs from the pod again:

   ```console
   $ kubectl logs cuda-nvidia-smi-loop
   ```

   *Example output:*

   ```output
   GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
   GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
   Failed to initialize NVML: Unknown Error
   Failed to initialize NVML: Unknown Error
   ```
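
After confirming the error, delete the test pod so that it can be recreated with GPU access:

```console
$ kubectl delete pod cuda-nvidia-smi-loop
```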

### Workarounds

The following workarounds are available for both standalone Docker environments and Kubernetes environments.

#### For Docker environments

The recommended workaround for Docker environments is to **use the `nvidia-ctk` utility.**
The NVIDIA Container Toolkit v1.12.0 and later includes this utility for creating symlinks in `/dev/char` for all possible NVIDIA device nodes required for using GPUs in containers.
This can be run as follows:

1. Run `nvidia-ctk`:

   ```console
   $ sudo nvidia-ctk system create-dev-char-symlinks \
       --create-all
   ```

   In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case, the command should be modified to:

   ```console
   $ sudo nvidia-ctk system create-dev-char-symlinks \
       --create-all \
       --driver-root={{NVIDIA_DRIVER_ROOT}}
   ```

   Here, `{{NVIDIA_DRIVER_ROOT}}` is the path at which the NVIDIA GPU Driver container installs the NVIDIA GPU driver and creates the NVIDIA device nodes.

1. Configure this command to run at boot on each node where GPUs will be used in containers.
   The command requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.

   A simple `udev` rule to enforce this can be seen below:

   ```console
   # This will create /dev/char symlinks to all device nodes
   ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
   ```

   A good place to install this rule is `/lib/udev/rules.d/71-nvidia-dev-char.rules`. An alternative that does not rely on `udev` is sketched after these steps.
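
As an alternative to the `udev` rule, the same command can be run from a one-shot `systemd` unit at boot. The sketch below is illustrative only; the unit name and ordering are assumptions, and you must still ensure the NVIDIA kernel modules are loaded before it runs:

```console
# /etc/systemd/system/nvidia-dev-char-symlinks.service (hypothetical unit name)
[Unit]
Description=Create /dev/char symlinks for NVIDIA device nodes
# Assumes the NVIDIA kernel modules are loaded by this point
After=systemd-modules-load.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now nvidia-dev-char-symlinks.service`.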

Some additional workarounds for Docker environments:

- **Explicitly disabling systemd cgroup management in Docker.**
  - Set the parameter `"exec-opts": ["native.cgroupdriver=cgroupfs"]` in the `/etc/docker/daemon.json` file and restart Docker (see the example snippet after this list).
- **Downgrading to `docker.io` packages where `systemd` is not the default `cgroup` manager.**
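
For the first option above, a minimal `/etc/docker/daemon.json` showing only the relevant key is sketched below; merge it with your existing configuration rather than replacing the file, and then restart Docker (for example, with `sudo systemctl restart docker`):

```console
{
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```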

#### For K8s environments

The recommended workaround is to deploy GPU Operator 22.9.2 or later to automatically fix the issue on all K8s nodes of the cluster.
The fix is integrated inside the validator pod, which runs when a new node is deployed and at every reboot of the node.

Some additional workarounds for Kubernetes environments:

- For deployments using the standalone k8s-device-plugin (that is, not through the GPU Operator), installing a `udev` rule as described in the previous section works around this issue. Be sure to pass the correct `{{NVIDIA_DRIVER_ROOT}}` in cases where the driver container is also in use.

- Explicitly disabling `systemd` cgroup management in `containerd` or `cri-o` (a containerd example is shown after this list):
  - Remove the parameter `cgroup_manager = "systemd"` from the `cri-o` configuration file (usually located in `/etc/crio/crio.conf` or `/etc/crio/crio.conf.d/00-default`) and restart `cri-o`.
- Downgrading to a version of the `containerd.io` package where `systemd` is not the default `cgroup` manager (and not overriding that default).
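
For containerd, disabling systemd cgroup management amounts to setting `SystemdCgroup = false` for the runtime entry shown earlier in this guide and restarting containerd. A minimal sketch, assuming the same `nvidia` runtime entry in `/etc/containerd/config.toml`:

```console
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  SystemdCgroup = false
```

Restart containerd afterwards, for example with `sudo systemctl restart containerd`.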
