## Bug Report
The v1.13 CDI switch (siderolabs/extensions#996 + #12909) enabled `enable_cdi=true` in containerd and moved to gpu-operator. Following the NVIDIA GPU docs, pods requesting `nvidia.com/gpu` resource limits fail with:

```
error running createContainer hook #0: fork/exec /usr/bin/nvidia-ctk: no such file or directory
```
The docs-suggested test (`runtimeClassName: nvidia` without resource limits) passes fine.
## Root Cause

Two different code paths exist:

- **Without resource limits:** the `nvidia-container-runtime` OCI wrapper handles GPU access directly. No CDI spec is involved. This is what the docs test and CI (`internal/integration/api/extensions_nvidia.go`) exercise.
- **With `nvidia.com/gpu` resource limits:** the device plugin generates a CDI spec at `/run/cdi/k8s.device-plugin.nvidia.com-gpu.json`. containerd (with `enable_cdi=true` from feat: enable container device interface #12909) reads this spec and injects OCI hooks. The device plugin defaults all hook paths to `/usr/bin/nvidia-ctk` (source). On Talos, the binary is at `/usr/local/bin/nvidia-ctk` (installed by `nvidia-container-toolkit-lts`). The hooks execute on the host, so the host path must be correct.
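To make the path mismatch concrete, here is a minimal, hypothetical excerpt of a CDI spec in the shape the device plugin writes (the real file lists devices and more container edits), plus a quick way to surface the hook paths it references:

```shell
# Hypothetical, simplified CDI spec excerpt; the real file is
# /run/cdi/k8s.device-plugin.nvidia.com-gpu.json and is larger.
cat > /tmp/cdi-sample.json <<'EOF'
{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "containerEdits": {
    "hooks": [
      { "hookName": "createContainer", "path": "/usr/bin/nvidia-ctk" }
    ]
  }
}
EOF
# List every hook path in the spec; on an affected node this shows
# /usr/bin/nvidia-ctk, which does not exist on Talos.
grep -o '"path": "[^"]*"' /tmp/cdi-sample.json
```

Running the same `grep` against the real spec on an affected node is a quick way to confirm which binary path the hooks will try to execute.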
`/usr/local/bin/` is the standard path for all Talos extensions; this can't change, and the extensions validator rejects symlinks in `/usr/bin/` (see siderolabs/extensions#1017).
## Reproduction

```shell
# Docs test — PASSES (no CDI involvement)
kubectl run nvidia-test --restart=Never -ti --rm \
  --image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}'

# Standard K8s GPU scheduling — FAILS
kubectl run nvidia-smi --rm -it --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04 \
  --overrides='{"spec":{"runtimeClassName":"nvidia","containers":[{"name":"nvidia-smi","image":"nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
```

## Verified Fix
Setting `NVIDIA_CDI_HOOK_PATH` on the device plugin resolves the issue. The documented helm command should include:
```shell
helm upgrade --wait --install -n gpu-operator gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set hostPaths.driverInstallDir=/usr/local/glibc/usr/lib \
  --set 'devicePlugin.env[0].name=NVIDIA_CDI_HOOK_PATH' \
  --set 'devicePlugin.env[0].value=/usr/local/bin/nvidia-cdi-hook'
```

After applying this fix, the device plugin's CDI spec at `/run/cdi/k8s.device-plugin.nvidia.com-gpu.json` correctly references `/usr/local/bin/nvidia-cdi-hook`, and pods with `nvidia.com/gpu` resource limits work.
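For anyone managing the chart declaratively, the same settings can be kept in a values file; this sketch simply restates the `--set` flags above (the file name `values.yaml` is arbitrary):

```yaml
# values.yaml — equivalent to the --set flags above
driver:
  enabled: false
toolkit:
  enabled: false
hostPaths:
  driverInstallDir: /usr/local/glibc/usr/lib
devicePlugin:
  env:
    - name: NVIDIA_CDI_HOOK_PATH
      value: /usr/local/bin/nvidia-cdi-hook
```

It would then be applied with `helm upgrade --wait --install -n gpu-operator gpu-operator nvidia/gpu-operator -f values.yaml`.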
The CI test (`internal/integration/api/extensions_nvidia.go`) also needs:

- The `NVIDIA_CDI_HOOK_PATH` env var in `testdata/nvidia-gpu-operator.yaml`
- A test case that requests `nvidia.com/gpu` resource limits (not just `runtimeClassName: nvidia`)
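As a sketch, the additional test case could deploy a pod like the following, which mirrors the failing `kubectl run` reproduction above as a manifest (the pod name is illustrative):

```yaml
# Sketch of a test pod exercising the CDI path: it requests a
# nvidia.com/gpu resource limit, unlike the existing runtimeClassName-only test.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-limits
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```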
## Environment

- Talos: v1.13.0-beta.0
- gpu-operator: v25.10.1
- Extensions: `nvidia-open-gpu-kernel-modules-lts` + `nvidia-container-toolkit-lts` (driver 580.126.20)
- GPU: NVIDIA RTX 3090