
Commit e6c1de5

elezar authored, with co-authors chenopis and a-mccarthy
Improve container-toolkit troubleshooting docs (#214)
* Improve container-toolkit troubleshooting docs

Signed-off-by: Evan Lezar <[email protected]>

* Update note syntax

Co-authored-by: Abigail McCarthy <[email protected]>
Signed-off-by: Andrew Chen <[email protected]>

---------

Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Andrew Chen <[email protected]>
Co-authored-by: Andrew Chen <[email protected]>
Co-authored-by: Abigail McCarthy <[email protected]>
1 parent 95d8ee2 commit e6c1de5

2 files changed: +23 -11 lines changed

container-toolkit/install-guide.md

Lines changed: 6 additions & 0 deletions
@@ -18,6 +18,12 @@ For information about installing the driver with a package manager, refer to
 the [_NVIDIA Driver Installation Quickstart Guide_](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html).
 Alternatively, you can install the driver by [downloading](https://www.nvidia.com/en-us/drivers/) a `.run` installer.
 
+```{note}
+There is a [known issue](troubleshooting.md#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error) on systems
+where the `systemd` cgroup driver is used that causes containers to lose access to requested GPUs when
+`systemctl daemon-reload` is run. Please see the troubleshooting documentation for more information.
+```
+
 (installing-with-apt)=
 
 ### With `apt`: Ubuntu, Debian
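
As an aside on the note added above: a quick way to check which cgroup driver a host is using, sketched here under the assumption that Docker is the container engine, is to query `docker info`:

```bash
# Prints the cgroup driver Docker is configured with ("systemd" or "cgroupfs").
docker info | grep -i "cgroup driver"
```

Hosts that report `systemd` here are the ones the note applies to.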

container-toolkit/troubleshooting.md

Lines changed: 17 additions & 11 deletions
@@ -126,7 +126,7 @@ Review the SELinux policies on your system.
 
 When using the NVIDIA Container Runtime Hook (i.e. the Docker `--gpus` flag or
 the NVIDIA Container Runtime in `legacy` mode) to inject requested GPUs and driver
-libraries into a container, the hook makes modifications, including setting up cgroup access, to the container without the low-level runtime (e.g. `runc`) being aware of these changes.
+libraries into a container, the hook makes modifications, including setting up cgroup access, to the container without the low-level runtime (e.g. `runc`) being aware of these changes.
 The result is that updates to the container may remove access to the requested GPUs.
 
 When the container loses access to the GPU, you will see the following error message from the console output:
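
For readers of this hunk, the failure mode typically plays out as below; this is only an illustrative sketch, and the CUDA image tag and container name are assumptions rather than anything prescribed by the docs:

```bash
# Start a long-running container with GPUs injected via the legacy hook (--gpus).
docker run -d --rm --gpus all --name gpu-test \
    nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity

# GPU access works initially.
docker exec gpu-test nvidia-smi

# On affected systems, reloading the systemd unit files triggers a container
# update that drops the cgroup device access set up by the hook.
sudo systemctl daemon-reload

# The same command now fails with "Failed to initialize NVML: Unknown Error".
docker exec gpu-test nvidia-smi
```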
@@ -144,26 +144,32 @@ When it is restarted, manually or automatically depending if you are using a con
 ### Affected environments
 
 On certain systems this behavior is not limited to *explicit* container updates
-such as adjusting CPU and Memory limits for a container.
+such as adjusting CPU and Memory limits for a container.
 On systems where `systemd` is used to manage the cgroups of the container, reloading the `systemd` unit files (`systemctl daemon-reload`) is sufficient to trigger container updates and cause a loss of GPU access.
 
 ### Mitigations and Workarounds
 
 ```{warning}
-Certain `runc` versions show similar behavior with the `systemd` cgroup driver when `/dev/char` symlinks for the required devices are missing on the system.
+Certain `runc` versions show similar behavior with the `systemd` cgroup driver when `/dev/char` symlinks for the required devices are missing on the system.
 Refer to [GitHub discussion #1133](https://github.com/NVIDIA/nvidia-container-toolkit/discussions/1133) for more details around this issue.
-It should be noted that the behavior persisted even if device nodes were requested on the command line.
+It should be noted that the behavior persisted even if device nodes were requested on the command line.
 Newer `runc` versions do not show this behavior and newer NVIDIA driver versions ensure that the required symlinks are present, reducing the likelihood of the specific issue occurring for affected `runc` versions.
 ```
 
 Use the following workarounds to prevent containers from losing access to requested GPUs when a `systemctl daemon-reload` command is run:
 
-* Explicitly request the device nodes associated with the requested GPU(s) and any control device nodes when starting the container.
-For the Docker CLI, this is done by adding the relevant `--device` flags.
+* For Docker, use cgroupfs as the cgroup driver for containers. To do this, update the `/etc/docker/daemon.json` file to include:
+```json
+{
+  "exec-opts": ["native.cgroupdriver=cgroupfs"]
+}
+```
+and restart Docker by running `systemctl restart docker`.
+This will ensure that the container will not lose access to devices when `systemctl daemon-reload` is run.
+This approach does not change the behavior for explicit container updates, and a container will still lose access to devices in this case.
+* Explicitly request the device nodes associated with the requested GPU(s) and any control device nodes when starting the container.
+For the Docker CLI, this is done by adding the relevant `--device` flags.
 In the case of the NVIDIA Kubernetes Device Plugin, the `compatWithCPUManager=true` [Helm option](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#setting-other-helm-chart-values) will ensure the same thing.
-* Use the Container Device Interface (CDI) to inject devices into a container.
-When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config.
+* Use the Container Device Interface (CDI) to inject devices into a container.
+When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config.
 This means that even if the container is updated it will still have access to the required devices.
-* For Docker, use cgroupfs as the cgroup driver for containers.
-This will ensure that the container will not lose access to devices when `systemctl daemon-reload` is run.
-This approach does not change the behavior for explicit container updates and a container will still lose access to devices in this case.
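
To make the workarounds in the hunk above concrete, here are short sketches. For the `--device` approach, a Docker CLI invocation could look like the following; the device node paths and the image tag are illustrative assumptions and depend on the GPUs and driver on the host:

```bash
# Request GPUs via the legacy hook and also pass the device nodes explicitly,
# so updates to the container keep the cgroup device rules intact.
docker run --rm --gpus all \
    --device /dev/nvidia0 \
    --device /dev/nvidiactl \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia-uvm-tools \
    nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

For the Kubernetes Device Plugin option mentioned in the list, the Helm value would be set roughly as follows; the repository alias, release name, and namespace are illustrative:

```bash
# Add the device plugin Helm repository (see the k8s-device-plugin README).
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update

# Install or upgrade the plugin with compatWithCPUManager enabled.
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin --create-namespace \
    --set compatWithCPUManager=true
```

And for the CDI-based workaround, a typical flow is sketched below, assuming a CDI-enabled runtime such as Podman (or a Docker release with CDI support) and the default spec location:

```bash
# Generate a CDI specification describing the GPUs and driver files on the host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Inject the GPUs by their CDI name; the device nodes are part of the container
# config, so they survive container updates such as a systemctl daemon-reload.
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
```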

0 commit comments
