gpu-operator device plugin CDI spec has wrong hook path on Talos v1.13 #13021

@ormandj

Description

Bug Report

Talos v1.13 enabled enable_cdi=true in containerd (siderolabs/extensions#996 + #12909) and switched the documented NVIDIA GPU setup to gpu-operator. Following the NVIDIA GPU docs, pods requesting nvidia.com/gpu resource limits fail with:

error running createContainer hook #0: fork/exec /usr/bin/nvidia-ctk: no such file or directory

The docs-suggested test (runtimeClassName: nvidia without resource limits) passes fine.
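For context, the containerd configuration that activates the failing code path looks roughly like the following (a sketch; the exact plugin key and spec directories depend on the containerd version Talos ships):

```toml
# containerd CRI plugin config enabling CDI (per #12909).
# Plugin key shown for containerd 2.x; containerd 1.7 uses
# [plugins."io.containerd.grpc.v1.cri"] instead.
[plugins."io.containerd.cri.v1.runtime"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
```

With this enabled, containerd resolves nvidia.com/gpu device requests against CDI specs found in those directories.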

Root Cause

Two different code paths exist:

  • Without resource limits: The nvidia-container-runtime OCI wrapper handles GPU access directly. No CDI spec is involved. This is what the docs test and CI (internal/integration/api/extensions_nvidia.go) exercise.

  • With nvidia.com/gpu resource limits: The device plugin generates a CDI spec at /run/cdi/k8s.device-plugin.nvidia.com-gpu.json. containerd (with enable_cdi=true from feat: enable container device interface #12909) reads this spec and injects OCI hooks. The device plugin defaults all hook paths to /usr/bin/nvidia-ctk (per its source). On Talos, the binary lives at /usr/local/bin/nvidia-ctk (installed by nvidia-container-toolkit-lts). The hooks execute on the host, so the host path must be correct.

/usr/local/bin/ is the standard path for all Talos extensions, so the binary location cannot change on the Talos side; the extensions validator also rejects symlinks in /usr/bin/ (see siderolabs/extensions#1017).
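To make the failure concrete, the hook entry in the generated CDI spec looks roughly like this (a sketch, abridged; exact fields and args vary with the device plugin version):

```json
{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "containerEdits": {
    "hooks": [
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": ["nvidia-ctk", "hook", "update-ldcache"]
      }
    ]
  }
}
```

containerd executes that path on the host, where /usr/bin/nvidia-ctk does not exist on Talos, producing the fork/exec error above.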

Reproduction

# Docs test — PASSES (no CDI involvement)
kubectl run nvidia-test --restart=Never -ti --rm \
  --image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}'

# Standard K8s GPU scheduling — FAILS
kubectl run nvidia-smi --rm -it --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04 \
  --overrides='{"spec":{"runtimeClassName":"nvidia","containers":[{"name":"nvidia-smi","image":"nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

Verified Fix

Setting NVIDIA_CDI_HOOK_PATH on the device plugin resolves the issue. The documented helm command should include:

helm upgrade --wait --install -n gpu-operator gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set hostPaths.driverInstallDir=/usr/local/glibc/usr/lib \
  --set 'devicePlugin.env[0].name=NVIDIA_CDI_HOOK_PATH' \
  --set 'devicePlugin.env[0].value=/usr/local/bin/nvidia-cdi-hook'

After applying this fix, the device plugin's CDI spec at /run/cdi/k8s.device-plugin.nvidia.com-gpu.json correctly references /usr/local/bin/nvidia-cdi-hook, and pods with nvidia.com/gpu resource limits work.
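For clusters that manage the device plugin manifest directly rather than through the gpu-operator chart, the equivalent change is an env var on the device plugin container (a sketch; container name and surrounding spec vary by deployment):

```yaml
# Excerpt from the nvidia-device-plugin container spec
env:
  - name: NVIDIA_CDI_HOOK_PATH
    value: /usr/local/bin/nvidia-cdi-hook
```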

The CI test (internal/integration/api/extensions_nvidia.go) also needs:

  1. The NVIDIA_CDI_HOOK_PATH env var in testdata/nvidia-gpu-operator.yaml
  2. A test case that requests nvidia.com/gpu resource limits (not just runtimeClassName: nvidia)
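A minimal pod for the second test case might look like the following (image matches the reproduction above; names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-limits
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```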

Environment

  • Talos: v1.13.0-beta.0
  • gpu-operator: v25.10.1
  • Extensions: nvidia-open-gpu-kernel-modules-lts + nvidia-container-toolkit-lts (driver 580.126.20)
  • GPU: NVIDIA RTX 3090
