
Commit e6c1de5

elezar authored, with co-authors chenopis and a-mccarthy
Improve container-toolkit troubleshooting docs (#214)
* Improve container-toolkit troubleshooting docs

Signed-off-by: Evan Lezar <[email protected]>

* Update note syntax

Co-authored-by: Abigail McCarthy <[email protected]>
Signed-off-by: Andrew Chen <[email protected]>

---------

Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Andrew Chen <[email protected]>
Co-authored-by: Andrew Chen <[email protected]>
Co-authored-by: Abigail McCarthy <[email protected]>
1 parent 95d8ee2 commit e6c1de5

2 files changed: +23 -11 lines changed

container-toolkit/install-guide.md

Lines changed: 6 additions & 0 deletions
@@ -18,6 +18,12 @@ For information about installing the driver with a package manager, refer to
 the [_NVIDIA Driver Installation Quickstart Guide_](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html).
 Alternatively, you can install the driver by [downloading](https://www.nvidia.com/en-us/drivers/) a `.run` installer.
 
+```{note}
+There is a [known issue](troubleshooting.md#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error) on systems
+where the `systemd` cgroup driver is used that causes containers to lose access to requested GPUs when
+`systemctl daemon-reload` is run. Please see the troubleshooting documentation for more information.
+```
+
 (installing-with-apt)=
 
 ### With `apt`: Ubuntu, Debian
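
As an aside on the note added above: a quick way to check which cgroup driver a host is using, sketched here under the assumption that Docker is the container engine, is to query `docker info`:

```bash
# Prints the cgroup driver Docker is configured with ("systemd" or "cgroupfs").
docker info | grep -i "cgroup driver"
```

Hosts that report `systemd` here are the ones the note applies to.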

container-toolkit/troubleshooting.md

Lines changed: 17 additions & 11 deletions
@@ -126,7 +126,7 @@ Review the SELinux policies on your system.
 
 When using the NVIDIA Container Runtime Hook (i.e. the Docker `--gpus` flag or
 the NVIDIA Container Runtime in `legacy` mode) to inject requested GPUs and driver
-libraries into a container, the hook makes modifications, including setting up cgroup access, to the container without the low-level runtime (e.g. `runc`) being aware of these changes.
+libraries into a container, the hook makes modifications, including setting up cgroup access, to the container without the low-level runtime (e.g. `runc`) being aware of these changes.
 The result is that updates to the container may remove access to the requested GPUs.
 
 When the container loses access to the GPU, you will see the following error message from the console output:
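
For readers of this hunk, the failure mode typically plays out as below; this is only an illustrative sketch, and the CUDA image tag and container name are assumptions rather than anything prescribed by the docs:

```bash
# Start a long-running container with GPUs injected via the legacy hook (--gpus).
docker run -d --rm --gpus all --name gpu-test \
    nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity

# GPU access works initially.
docker exec gpu-test nvidia-smi

# On affected systems, reloading the systemd unit files triggers a container
# update that drops the cgroup device access set up by the hook.
sudo systemctl daemon-reload

# The same command now fails with "Failed to initialize NVML: Unknown Error".
docker exec gpu-test nvidia-smi
```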
@@ -144,26 +144,32 @@ When it is restarted, manually or automatically depending if you are using a con
 ### Affected environments
 
 On certain systems this behavior is not limited to *explicit* container updates
-such as adjusting CPU and Memory limits for a container.
+such as adjusting CPU and Memory limits for a container.
 On systems where `systemd` is used to manage the cgroups of the container, reloading the `systemd` unit files (`systemctl daemon-reload`) is sufficient to trigger container updates and cause a loss of GPU access.
 
 ### Mitigations and Workarounds
 
 ```{warning}
-Certain `runc` versions show similar behavior with the `systemd` cgroup driver when `/dev/char` symlinks for the required devices are missing on the system.
+Certain `runc` versions show similar behavior with the `systemd` cgroup driver when `/dev/char` symlinks for the required devices are missing on the system.
 Refer to [GitHub discussion #1133](https://github.com/NVIDIA/nvidia-container-toolkit/discussions/1133) for more details around this issue.
-It should be noted that the behavior persisted even if device nodes were requested on the command line.
+It should be noted that the behavior persisted even if device nodes were requested on the command line.
 Newer `runc` versions do not show this behavior and newer NVIDIA driver versions ensure that the required symlinks are present, reducing the likelihood of the specific issue occurring for affected `runc` versions.
 ```
 
 Use the following workarounds to prevent containers from losing access to requested GPUs when a `systemctl daemon-reload` command is run:
 
-* Explicitly request the device nodes associated with the requested GPU(s) and any control device nodes when starting the container.
-For the Docker CLI, this is done by adding the relevant `--device` flags.
+* For Docker, use cgroupfs as the cgroup driver for containers. To do this, update the `/etc/docker/daemon.json` file to include:
+```json
+{
+  "exec-opts": ["native.cgroupdriver=cgroupfs"]
+}
+```
+and restart Docker by running `systemctl restart docker`.
+This will ensure that the container will not lose access to devices when `systemctl daemon-reload` is run.
+This approach does not change the behavior for explicit container updates, and a container will still lose access to devices in this case.
+* Explicitly request the device nodes associated with the requested GPU(s) and any control device nodes when starting the container.
+For the Docker CLI, this is done by adding the relevant `--device` flags.
 In the case of the NVIDIA Kubernetes Device Plugin, the `compatWithCPUManager=true` [Helm option](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#setting-other-helm-chart-values) will ensure the same thing.
-* Use the Container Device Interface (CDI) to inject devices into a container.
-When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config.
+* Use the Container Device Interface (CDI) to inject devices into a container.
+When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config.
 This means that even if the container is updated it will still have access to the required devices.
-* For Docker, use cgroupfs as the cgroup driver for containers.
-This will ensure that the container will not lose access to devices when `systemctl daemon-reload` is run.
-This approach does not change the behavior for explicit container updates and a container will still lose access to devices in this case.
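
To make the workarounds in the hunk above concrete, here are short sketches. For the `--device` approach, a Docker CLI invocation could look like the following; the device node paths and the image tag are illustrative assumptions and depend on the GPUs and driver on the host:

```bash
# Request GPUs via the legacy hook and also pass the device nodes explicitly,
# so updates to the container keep the cgroup device rules intact.
docker run --rm --gpus all \
    --device /dev/nvidia0 \
    --device /dev/nvidiactl \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia-uvm-tools \
    nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

For the Kubernetes Device Plugin option mentioned in the list, the Helm value would be set roughly as follows; the repository alias, release name, and namespace are illustrative:

```bash
# Add the device plugin Helm repository (see the k8s-device-plugin README).
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update

# Install or upgrade the plugin with compatWithCPUManager enabled.
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin --create-namespace \
    --set compatWithCPUManager=true
```

And for the CDI-based workaround, a typical flow is sketched below, assuming a CDI-enabled runtime such as Podman (or a Docker release with CDI support) and the default spec location:

```bash
# Generate a CDI specification describing the GPUs and driver files on the host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Inject the GPUs by their CDI name; the device nodes are part of the container
# config, so they survive container updates such as a systemctl daemon-reload.
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
```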

0 commit comments
