This guide covers making NVIDIA GPUs available to Kubernetes pods -- from driver installation to the device plugin that exposes GPUs as schedulable resources. This was the most trial-and-error-heavy part of the entire setup.
| GPU | VRAM | Role |
|---|---|---|
| RTX 5090 | 32 GB | LoRA training (AI-Toolkit) |
| RTX 3080 | 10 GB | Inference / generation (ComfyUI) |
| RTX 2070 SUPER | 8 GB | Inference / generation (ComfyUI) |

| Component | Version |
|---|---|
| NVIDIA Driver | 590.48.01 |
| CUDA | 12.8 |
| containerd | 1.7.28 |
| NVIDIA Device Plugin (nvdp) | 0.17.1 |

The open-source nouveau driver conflicts with NVIDIA's proprietary driver. It must be blacklisted before installing the NVIDIA driver.
Create /etc/modprobe.d/blacklist-nouveau.conf:
blacklist nouveau
options nouveau modeset=0
Regenerate initramfs and reboot:
sudo update-initramfs -u
sudo reboot

The config is stored at system/modprobe.d/blacklist-nouveau.conf.
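The file creation step can also be scripted so that repeated runs leave an already-correct file alone. The following is a minimal sketch, not part of any NVIDIA tooling: the function name `write_nouveau_blacklist` and its optional directory argument are hypothetical, and writing to /etc/modprobe.d requires root.

```shell
# Sketch: write the nouveau blacklist file only when its content differs.
# write_nouveau_blacklist and its optional directory argument are
# hypothetical names used for illustration.
write_nouveau_blacklist() {
  local dir="${1:-/etc/modprobe.d}"
  local file="$dir/blacklist-nouveau.conf"
  local want='blacklist nouveau
options nouveau modeset=0'

  # Skip the write when the file already has exactly the desired content.
  if [ -f "$file" ] && [ "$(cat "$file")" = "$want" ]; then
    echo "unchanged"
    return 0
  fi

  printf '%s\n' "$want" > "$file" && echo "written"
}
```

Run as root, `write_nouveau_blacklist` prints `written` the first time and `unchanged` afterwards; the initramfs only needs regenerating after a `written` result.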
After reboot, verify nouveau is not loaded:
lsmod | grep nouveau
# Should return nothing

Install the NVIDIA driver 590.48.01 with CUDA 12.8 support. Follow NVIDIA's official instructions for Debian, or install from the .run file:
# Verify installation
nvidia-smi

Expected output should show all three GPUs:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:0A:00.0 Off |                  N/A |
|   1  NVIDIA GeForce RTX 2070 SUPER  Off |   00000000:0B:00.0 Off |                  N/A |
|   2  NVIDIA GeForce RTX 5090        Off |   00000000:01:00.0 Off |                  N/A |
+-----------------------------------------+------------------------+----------------------+
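Checking the table by eye works for three GPUs, but the same check can be scripted. The sketch below counts the rows produced by nvidia-smi's CSV query mode and compares them with an expected number. The function names `count_gpus` and `check_gpu_count` are hypothetical; the functions read from stdin so they can be tried without a GPU present.

```shell
# Sketch: compare the number of GPUs the driver reports with the number expected.
# Input is one CSV row per GPU on stdin. Function names are hypothetical.
count_gpus() {
  # Count non-empty lines; grep exits non-zero on a count of 0, hence "|| true".
  grep -c '[^[:space:]]' || true
}

check_gpu_count() {
  local expected="$1" actual
  actual=$(count_gpus)
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $actual GPUs visible"
  else
    echo "mismatch: expected $expected, found $actual" >&2
    return 1
  fi
}
```

On the host this would be used as `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader | check_gpu_count 3`.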
The NVIDIA Container Toolkit lets container runtimes access GPUs. It includes nvidia-ctk, the tool that configures containerd and generates CDI specs.
# Add the NVIDIA Container Toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

This is a critical detail that many guides get wrong. There are two ways to configure the NVIDIA runtime in containerd:
- Available runtime -- NVIDIA is registered but not used unless explicitly requested
- Default runtime -- Every container automatically uses the NVIDIA runtime
We set NVIDIA as the default runtime. Every container is then launched through the NVIDIA runtime, so the NVIDIA Device Plugin can detect and allocate GPUs without requiring a RuntimeClass on every pod spec.
The key line in config.toml:
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"

You can configure this with nvidia-ctk:
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

Or edit /etc/containerd/config.toml directly. The full configuration is at system/containerd/config.toml.
If NVIDIA is only an available (non-default) runtime, you need to create a RuntimeClass and reference it in every pod spec. With it as the default runtime, containerd uses it for all containers automatically. The NVIDIA Device Plugin then manages which containers actually get GPU access through the nvidia.com/gpu resource and NVIDIA_VISIBLE_DEVICES environment variable.
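Because a missing or overwritten default runtime is easy to overlook, it can help to check the setting before restarting containerd. The following sketch does a plain text match on the config file rather than parsing TOML, so it assumes `default_runtime_name` appears once on its own line. The helper names `default_runtime_of` and `is_nvidia_default` are hypothetical.

```shell
# Sketch: report which default runtime a containerd config declares.
# Plain text match, not a TOML parser. Helper names are hypothetical.
default_runtime_of() {
  local config="${1:-/etc/containerd/config.toml}"
  # Print the quoted value of the first default_runtime_name assignment.
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$config" | head -n 1
}

is_nvidia_default() {
  [ "$(default_runtime_of "$1")" = "nvidia" ]
}
```

For example, `is_nvidia_default && echo "nvidia is the default runtime"` checks /etc/containerd/config.toml; pass a path as the first argument to check a different file, such as the copy kept at system/containerd/config.toml.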
CDI (Container Device Interface) is the modern way to expose devices to containers. It replaces the older --gpus flag approach with a standardized specification.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Verify the CDI spec was created:

nvidia-ctk cdi list

This should list all three GPUs with their CDI identifiers.
Even though we set NVIDIA as the default runtime, creating a RuntimeClass is good practice for explicit documentation in pod specs:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

kubectl apply -f nvidia-runtimeclass.yaml

Pods can then reference this RuntimeClass in their spec:

spec:
  runtimeClassName: nvidia

In our setup this is optional because the default runtime is already NVIDIA, but it makes the intent explicit.
The device plugin is what actually makes GPUs visible to the Kubernetes scheduler as allocatable resources (nvidia.com/gpu). Without it, Kubernetes has no idea GPUs exist.
Before settling on the standalone device plugin, we tried the NVIDIA GPU Operator. The GPU Operator is NVIDIA's all-in-one solution: it bundles the driver, container toolkit, device plugin, and monitoring into a single Helm chart with an operator managing everything.
Why it failed for us:
The GPU Operator assumes it manages the entire NVIDIA stack. It tries to install its own driver containers, its own container toolkit, and its own device plugin. On a system where we had already installed the driver and toolkit at the host level, the Operator conflicted with the existing installation. Driver version mismatches between host and operator containers caused init container crashes. The Operator's validator pods failed health checks because they detected the host-installed driver but could not reconcile it with their expected state.
After debugging for several hours, the pragmatic choice was clear: uninstall the GPU Operator and use the standalone device plugin instead.
# Uninstall the GPU Operator (if you went down this path)
helm uninstall gpu-operator -n gpu-operator
kubectl delete namespace gpu-operator

| Aspect | GPU Operator | Standalone Device Plugin |
|---|---|---|
| Scope | Full stack: driver, toolkit, plugin, monitoring | Device plugin only |
| Best for | Fresh clusters where NVIDIA manages everything | Systems with existing host-level driver/toolkit |
| Complexity | Higher -- operator manages many components | Lower -- single DaemonSet |
| Flexibility | Less -- fights host-level installations | More -- works alongside existing setup |
| Our choice | Tried first, failed | Used successfully |

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
--version 0.17.1 \
--namespace nvidia-device-plugin \
--create-namespace

After the device plugin DaemonSet is running, check that the node reports GPU resources:

kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'

Expected output:
"3"
You can also describe the node to see GPU details:
kubectl describe node zosmaai | grep -A 10 "Allocatable:" | grep nvidia

Expected output:

nvidia.com/gpu: 3
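The device plugin can take a little while to register after the DaemonSet starts, so a one-off check may run too early. The sketch below polls until the node advertises the expected count. Only `get_gpu_count` talks to the cluster; `wait_for_gpus` and its arguments (node name, expected count, attempts, delay in seconds) are hypothetical, and the escaped dots in the JSONPath expression are an assumption about how kubectl addresses the `nvidia.com/gpu` key.

```shell
# Sketch: poll until a node advertises the expected number of GPUs.
# wait_for_gpus and its arguments are hypothetical names for illustration.
get_gpu_count() {
  # The dots in the resource name are escaped for kubectl's JSONPath.
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

wait_for_gpus() {
  local node="$1" expected="$2" attempts="${3:-30}" delay="${4:-2}"
  local i=1 count=""
  while [ "$i" -le "$attempts" ]; do
    # An unreachable cluster or missing resource is treated as "no GPUs yet".
    count=$(get_gpu_count "$node" 2>/dev/null) || count=""
    if [ "${count:-0}" = "$expected" ]; then
      echo "node $node reports $count GPUs"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out: node $node reports ${count:-0} GPUs, expected $expected" >&2
  return 1
}
```

With the node name used in this guide, `wait_for_gpus zosmaai 3` waits up to roughly a minute with the default settings.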
Run a quick test pod to confirm GPU access from inside a container:
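# The --limits flag was removed from `kubectl run` in recent releases; set the GPU limit through --overrides instead
kubectl run gpu-test --rm -it --restart=Never \
--image=nvidia/cuda:12.8.0-base-ubuntu22.04 \
--override-type=strategic \
--overrides='{"spec":{"containers":[{"name":"gpu-test","resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
-- nvidia-smi

This should print the nvidia-smi output from inside the pod, showing one GPU allocated to it.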
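The same test can be expressed as a manifest, which is easier to keep in version control and does not depend on `kubectl run` flags. The sketch below writes a one-shot pod that requests a single GPU and runs nvidia-smi. The function name `write_gpu_test_manifest` and the default file name are hypothetical; the image tag is the one used above.

```shell
# Sketch: write a one-shot GPU test pod manifest to a file.
# write_gpu_test_manifest and the default file name are hypothetical.
write_gpu_test_manifest() {
  local out="${1:-gpu-test.yaml}"
  cat > "$out" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: nvidia/cuda:12.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

After `write_gpu_test_manifest gpu-test.yaml`, apply the file with `kubectl apply -f gpu-test.yaml`, read the result with `kubectl logs gpu-test` once the pod has completed, and clean up with `kubectl delete pod gpu-test`.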
| File | Location | Purpose |
|---|---|---|
| system/containerd/config.toml | /etc/containerd/config.toml | containerd config with NVIDIA as default runtime |
| system/modprobe.d/blacklist-nouveau.conf | /etc/modprobe.d/blacklist-nouveau.conf | Blacklist nouveau driver |
| CDI spec (generated) | /etc/cdi/nvidia.yaml | CDI device specification for all GPUs |

"Failed to initialize NVML" in pods: The NVIDIA runtime is not configured as default, or containerd was not restarted after configuration changes. Verify default_runtime_name = "nvidia" in config.toml and restart containerd.
Device plugin pod in CrashLoopBackOff: Check that the NVIDIA driver is loaded on the host (nvidia-smi works) and that the kubelet device plugin directory (/var/lib/kubelet/device-plugins) is accessible to the pod. The device plugin needs both.
"0" GPUs allocatable: The device plugin is not running or cannot detect GPUs. Check device plugin logs: kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin.
With GPUs available to Kubernetes, the next step is GPU Assignment to control which workload gets which GPU.