gpu-operator device plugin CDI spec has wrong hook path on Talos v1.13 #13021

@ormandj

Description

Bug Report

Talos v1.13 enabled enable_cdi=true in containerd (siderolabs/extensions#996 + #12909) and switched the documented NVIDIA GPU setup to gpu-operator. Following the NVIDIA GPU docs, pods requesting nvidia.com/gpu resource limits fail with:

error running createContainer hook #0: fork/exec /usr/bin/nvidia-ctk: no such file or directory

The docs-suggested test (runtimeClassName: nvidia without resource limits) passes fine.
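For context, the containerd configuration that activates the failing code path looks roughly like the following (a sketch; the exact plugin key and spec directories depend on the containerd version Talos ships):

```toml
# containerd CRI plugin config enabling CDI (per #12909).
# Plugin key shown for containerd 2.x; containerd 1.7 uses
# [plugins."io.containerd.grpc.v1.cri"] instead.
[plugins."io.containerd.cri.v1.runtime"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
```

With this enabled, containerd resolves nvidia.com/gpu device requests against CDI specs found in those directories.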

Root Cause

Two different code paths exist:

  • Without resource limits: The nvidia-container-runtime OCI wrapper handles GPU access directly. No CDI spec is involved. This is what the docs test and CI (internal/integration/api/extensions_nvidia.go) exercise.

  • With nvidia.com/gpu resource limits: The device plugin generates a CDI spec at /run/cdi/k8s.device-plugin.nvidia.com-gpu.json. containerd (with enable_cdi=true from feat: enable container device interface #12909) reads this spec and injects OCI hooks. The device plugin defaults all hook paths to /usr/bin/nvidia-ctk (per its source). On Talos, the binary lives at /usr/local/bin/nvidia-ctk (installed by nvidia-container-toolkit-lts). The hooks execute on the host, so the host path must be correct.

/usr/local/bin/ is the standard path for all Talos extensions, so the binary location cannot change on the Talos side; the extensions validator also rejects symlinks in /usr/bin/ (see siderolabs/extensions#1017).
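To make the failure concrete, the hook entry in the generated CDI spec looks roughly like this (a sketch, abridged; exact fields and args vary with the device plugin version):

```json
{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "containerEdits": {
    "hooks": [
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": ["nvidia-ctk", "hook", "update-ldcache"]
      }
    ]
  }
}
```

containerd executes that path on the host, where /usr/bin/nvidia-ctk does not exist on Talos, producing the fork/exec error above.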

Reproduction

# Docs test — PASSES (no CDI involvement)
kubectl run nvidia-test --restart=Never -ti --rm \
  --image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}'

# Standard K8s GPU scheduling — FAILS
kubectl run nvidia-smi --rm -it --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04 \
  --overrides='{"spec":{"runtimeClassName":"nvidia","containers":[{"name":"nvidia-smi","image":"nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

Verified Fix

Setting NVIDIA_CDI_HOOK_PATH on the device plugin resolves the issue. The documented helm command should include:

helm upgrade --wait --install -n gpu-operator gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set hostPaths.driverInstallDir=/usr/local/glibc/usr/lib \
  --set 'devicePlugin.env[0].name=NVIDIA_CDI_HOOK_PATH' \
  --set 'devicePlugin.env[0].value=/usr/local/bin/nvidia-cdi-hook'

After applying this fix, the device plugin's CDI spec at /run/cdi/k8s.device-plugin.nvidia.com-gpu.json correctly references /usr/local/bin/nvidia-cdi-hook, and pods with nvidia.com/gpu resource limits work.
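For clusters that manage the device plugin manifest directly rather than through the gpu-operator chart, the equivalent change is an env var on the device plugin container (a sketch; container name and surrounding spec vary by deployment):

```yaml
# Excerpt from the nvidia-device-plugin container spec
env:
  - name: NVIDIA_CDI_HOOK_PATH
    value: /usr/local/bin/nvidia-cdi-hook
```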

The CI test (internal/integration/api/extensions_nvidia.go) also needs:

  1. The NVIDIA_CDI_HOOK_PATH env var in testdata/nvidia-gpu-operator.yaml
  2. A test case that requests nvidia.com/gpu resource limits (not just runtimeClassName: nvidia)
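A minimal pod for the second test case might look like the following (image matches the reproduction above; names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-limits
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```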

Environment

  • Talos: v1.13.0-beta.0
  • gpu-operator: v25.10.1
  • Extensions: nvidia-open-gpu-kernel-modules-lts + nvidia-container-toolkit-lts (driver 580.126.20)
  • GPU: NVIDIA RTX 3090
