nvidia-operator-validator pod toolkit-validation RunContainerError #1982

@caleb-brewer

Description

Describe the bug
When we installed the GPU operator it came up fine. We then had a power loss on our servers, and once they powered back on, all of the nvidia-operator-validator pods went into CrashLoopBackOff with the toolkit-validation init container failing to start.

To Reproduce
We haven't been able to reproduce this; our cluster is in an odd state after the power loss.

Environment (please provide the following information):

  • GPU Operator Version: 25.10.1
  • OS: RHEL 8.9
  • Kernel Version: Linux version 4.18.0-513.5.1.el8_9.x86_64
  • Container Runtime Version: containerd://2.1.4-k3s2
  • Kubernetes Distro and Version: rke2 v1.33.5

Here is the extra values file we pass to the Helm chart at install time (a sketch of the corresponding install command follows the values):

dcgmExporter:
  env:
  - name: DCGM_EXPORTER_INTERVAL
    value: "3000"
  serviceMonitor:
    additionalLabels:
      release: prometheus-operator
    enabled: true
    interval: 3s
driver:
  enabled: false
gds:
  enabled: false
node-feature-discovery:
  prometheus:
    enabled: true
    labels:
      release: prometheus-operator
toolkit:
  enabled: false
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
NAME                                                          READY   STATUS                   RESTARTS          AGE
gpu-feature-discovery-5drhb                                   0/1     Init:0/1                 0                 16h
gpu-feature-discovery-896b2                                   0/1     Init:0/1                 0                 16h
gpu-feature-discovery-8kxnk                                   0/1     Init:0/1                 0                 16h
gpu-feature-discovery-f999p                                   0/1     Init:0/1                 0                 16h
gpu-feature-discovery-hhs2l                                   0/1     Init:0/1                 0                 16h
gpu-feature-discovery-k76mh                                   0/1     Init:0/1                 0                 16h
gpu-feature-discovery-pdn6f                                   0/1     Init:0/1                 0                 16h
gpu-feature-discovery-tbg6x                                   0/1     Init:0/1                 0                 16h
gpu-feature-discovery-x7t74                                   0/1     Init:0/1                 0                 16h
gpu-operator-6b7854bc77-r5p9f                                 1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-gc-74dd579c7f-wp99c       1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-master-5645495d9c-9d49h   1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-2qc4n              1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-6t55l              1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-756vt              1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-fw86j              1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-jwrmm              1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-ncdjv              1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-nxzsc              1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-z9jfs              1/1     Running                  0                 16h
gpu-operator-node-feature-discovery-worker-zhql2              1/1     Running                  0                 16h
nvidia-dcgm-exporter-4dp6x                                    0/1     Init:0/1                 0                 16h
nvidia-dcgm-exporter-4stqg                                    0/1     Init:0/1                 0                 16h
nvidia-dcgm-exporter-7flft                                    0/1     Init:0/1                 0                 16h
nvidia-dcgm-exporter-bx8gn                                    0/1     Init:0/1                 0                 16h
nvidia-dcgm-exporter-fqskw                                    0/1     Init:0/1                 0                 16h
nvidia-dcgm-exporter-g7fp6                                    0/1     Init:0/1                 0                 16h
nvidia-dcgm-exporter-mtlkt                                    0/1     Init:0/1                 0                 16h
nvidia-dcgm-exporter-v8kdd                                    0/1     Init:0/1                 0                 16h
nvidia-dcgm-exporter-x8s8x                                    0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-5pzck                          0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-7fzld                          0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-9kj9p                          0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-bt9gr                          0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-d85hc                          0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-hnhv8                          0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-jd6vs                          0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-lrfv4                          0/1     Init:0/1                 0                 16h
nvidia-device-plugin-daemonset-mts92                          0/1     Init:0/1                 0                 16h
nvidia-operator-validator-djn6w                               0/1     Init:CrashLoopBackOff    194 (4m10s ago)   16h
nvidia-operator-validator-dn42j                               0/1     Init:CrashLoopBackOff    194 (4m50s ago)   16h
nvidia-operator-validator-frh8m                               0/1     Init:CrashLoopBackOff    194 (3m23s ago)   16h
nvidia-operator-validator-ht5lb                               0/1     Init:CrashLoopBackOff    194 (4m33s ago)   16h
nvidia-operator-validator-jvgnf                               0/1     Init:CrashLoopBackOff    194 (3m20s ago)   16h
nvidia-operator-validator-pzpvj                               0/1     Init:CrashLoopBackOff    194 (4m17s ago)   16h
nvidia-operator-validator-tcjd8                               0/1     Init:CrashLoopBackOff    195 (22s ago)     16h
nvidia-operator-validator-vd95x                               0/1     Init:RunContainerError   195 (4s ago)      16h
nvidia-operator-validator-zkvhv                               0/1     Init:CrashLoopBackOff    194 (4m16s ago)   16h
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
gpu-feature-discovery                        9         9         0       9            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       16h
gpu-operator-node-feature-discovery-worker   9         9         9       9            9           <none>                                                                 16h
nvidia-dcgm-exporter                         9         9         0       9            0           nvidia.com/gpu.deploy.dcgm-exporter=true                               16h
nvidia-device-plugin-daemonset               9         9         0       9            0           nvidia.com/gpu.deploy.device-plugin=true                               16h
nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   16h
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 16h
nvidia-operator-validator                    9         9         0       9            0           nvidia.com/gpu.deploy.operator-validator=true                          16h
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod -n gpu-operator-resources nvidia-operator-validator-djn6w
Name:                 nvidia-operator-validator-djn6w
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-operator-validator
Node:                 treachery/192.168.3.50
Start Time:           Tue, 09 Dec 2025 16:54:27 -0600
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=cdff56f56
                      helm.sh/chart=gpu-operator-v25.10.1
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 0d2a73527f1e9cf9623c3c7e64a5432626308b338985a008f65f6091b3e24024
                      cni.projectcalico.org/podIP: 10.42.9.3/32
                      cni.projectcalico.org/podIPs: 10.42.9.3/32
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "10.42.9.3"
                            ],
                            "default": true,
                            "dns": {}
                        }]
Status:               Pending
IP:                   10.42.9.3
IPs:
  IP:           10.42.9.3
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  containerd://6db57c35b7f2946541b0f1b88d845085dcb5bc603ac9aeac83de19051c8eddb9
    Image:         nvcr.io/nvidia/gpu-operator:v25.10.1
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:26af0c4ad1efb8f3a49df4cdfe4269c999b9094fdc5cbeb442913b544259994b
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 09 Dec 2025 16:54:41 -0600
      Finished:     Tue, 09 Dec 2025 16:54:42 -0600
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator-resources (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
  toolkit-validation:
    Container ID:  containerd://cc370c2676b377c604826afdb0307b9c56a8b53eb09089a221e641657ea373bf
    Image:         nvcr.io/nvidia/gpu-operator:v25.10.1
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:26af0c4ad1efb8f3a49df4cdfe4269c999b9094fdc5cbeb442913b544259994b
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=all
      Exit Code:    128
      Started:      Wed, 31 Dec 1969 18:00:00 -0600
      Finished:     Wed, 10 Dec 2025 09:08:16 -0600
    Ready:          False
    Restart Count:  195
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/gpu-operator:v25.10.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator-resources (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/gpu-operator:v25.10.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/gpu-operator:v25.10.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator-resources (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/gpu-operator:v25.10.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/gpu-operator:v25.10.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; while true; do sleep 86400; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 False
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:
  kube-api-access-r5fll:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  3m33s (x4473 over 16h)  kubelet  Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-djn6w_gpu-operator-resources(bed91771-a066-4ab5-9164-9f76f292054e)
  Normal   Pulled   13s (x196 over 16h)     kubelet  Container image "nvcr.io/nvidia/gpu-operator:v25.10.1" already present on machine
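
The StartError message above shows that the nvidia runtime class cannot resolve the CDI device management.nvidia.com/gpu=all when creating the toolkit-validation container. A hedged sketch of commands one could run on an affected node to check whether the CDI spec files are present (the directories are the common CDI defaults, and the presence of nvidia-ctk on the host is an assumption):

ls -l /etc/cdi /var/run/cdi     # standard CDI spec directories
nvidia-ctk cdi list             # lists the CDI devices the NVIDIA Container Toolkit can resolve
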
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
    We set driver.enabled: false, so there is no driver container; nvidia-smi was run directly on one of the cluster hosts:
Wed Dec 10 09:09:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:2A:00.0 Off |                    0 |
|  0%   37C    P8             15W /  300W |      14MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     Off |   00000000:3D:00.0 Off |                    0 |
|  0%   38C    P8             15W /  300W |      14MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10073      G   /usr/libexec/Xorg                               4MiB |
|    1   N/A  N/A     10073      G   /usr/libexec/Xorg                               4MiB |
+-----------------------------------------------------------------------------------------+
  • containerd logs journalctl -u containerd > containerd.log
    -- Logs begin at Mon 2025-12-08 15:42:02 CST, end at Wed 2025-12-10 09:09:54 CST. --
    -- No entries --
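
The journal is empty presumably because RKE2 runs its own embedded containerd rather than a systemd containerd unit; a hedged guess at where the embedded containerd log lives (the path assumes the default RKE2 data directory):

cat /var/lib/rancher/rke2/agent/containerd/containerd.log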
