When using the NVIDIA Container Runtime Hook (i.e. the Docker `--gpus` flag or
the NVIDIA Container Runtime in `legacy` mode) to inject requested GPUs and driver
libraries into a container, the hook makes modifications to the container, including setting up cgroup access, without the low-level runtime (e.g. `runc`) being aware of these changes.
The result is that updates to the container may remove access to the requested GPUs.

When the container loses access to the GPU, you will see the following error message from the console output:

### Affected environments

On certain systems this behavior is not limited to *explicit* container updates
such as adjusting CPU and Memory limits for a container.
On systems where `systemd` is used to manage the cgroups of the container, reloading the `systemd` unit files (`systemctl daemon-reload`) is sufficient to trigger container updates and cause a loss of GPU access.

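
As a rough illustration of the trigger on an affected system (the container name, image, and exact failure output below are assumptions, not taken from this guide):

```bash
# Start a GPU container using the legacy hook path (--gpus).
docker run -d --rm --gpus all --name gpu-test ubuntu sleep infinity
docker exec gpu-test nvidia-smi   # GPUs are visible

# Reloading the systemd unit files is enough to trigger a container update.
sudo systemctl daemon-reload

docker exec gpu-test nvidia-smi   # may now fail with an NVML initialization error
```
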
### Mitigations and Workarounds
```{warning}
Certain `runc` versions show similar behavior with the `systemd` cgroup driver when `/dev/char` symlinks for the required devices are missing on the system.
Refer to [GitHub discussion #1133](https://github.com/NVIDIA/nvidia-container-toolkit/discussions/1133) for more details about this issue.
Note that this behavior persisted even when device nodes were requested on the command line.
Newer `runc` versions do not show this behavior, and newer NVIDIA driver versions ensure that the required symlinks are present, reducing the likelihood of this issue occurring with affected `runc` versions.
```
Use the following workarounds to prevent containers from losing access to requested GPUs when a `systemctl daemon-reload` command is run:

* For Docker, use `cgroupfs` as the cgroup driver for containers. To do this, update the `/etc/docker/daemon.json` file to include:

  ```json
  {
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
  }
  ```

  Then restart Docker by running `systemctl restart docker`.
  This ensures that the container does not lose access to devices when `systemctl daemon-reload` is run.
  This approach does not change the behavior for explicit container updates; a container will still lose access to devices in that case.

* Explicitly request the device nodes associated with the requested GPU(s) and any control device nodes when starting the container.
  For the Docker CLI, this is done by adding the relevant `--device` flags, as shown in the first sketch after this list.
  In the case of the NVIDIA Kubernetes Device Plugin, the `compatWithCPUManager=true` [Helm option](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#setting-other-helm-chart-values) ensures the same behavior.

* Use the Container Device Interface (CDI) to inject devices into a container.
  When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config (see the second sketch after this list).
  This means that even if the container is updated, it will still have access to the required devices.

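
For illustration, a minimal sketch of the `--device` workaround, assuming a single GPU exposed as `/dev/nvidia0` (the device node names, GPU count, and image are placeholders to adapt to your system):

```bash
# Request the GPU through the hook *and* pass the device nodes explicitly so
# that the low-level runtime keeps them in the container's device cgroup rules.
docker run --rm --gpus all \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  ubuntu nvidia-smi
```

And a minimal sketch of the CDI-based approach, assuming the `nvidia-ctk` CLI is installed and using Podman as an example of a CDI-enabled runtime:

```bash
# Generate a CDI specification describing the GPUs and driver files on the host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Request devices by their CDI names; the device nodes are part of the generated
# container edits, so later container updates do not remove access to them.
# On SELinux-enabled systems you may also need --security-opt=label=disable.
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
```
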