Describe the bug
When we installed the GPU Operator, it came up fine. We then had a power loss on our servers, and when they powered back on, every nvidia-operator-validator pod went into CrashLoopBackOff with the toolkit-validation init container failing.
To Reproduce
We haven't been able to reproduce this; our cluster is in an odd state after the power loss.
Environment (please provide the following information):
- GPU Operator Version: 25.10.1
- OS: RHEL 8.9
- Kernel Version: Linux version 4.18.0-513.5.1.el8_9.x86_64
- Container Runtime Version: containerd://2.1.4-k3s2
- Kubernetes Distro and Version: rke2 v1.33.5
Here is the extra values file we pass to the Helm chart at install time:
dcgmExporter:
  env:
    - name: DCGM_EXPORTER_INTERVAL
      value: "3000"
  serviceMonitor:
    additionalLabels:
      release: prometheus-operator
    enabled: true
    interval: 3s
driver:
  enabled: false
gds:
  enabled: false
node-feature-discovery:
  prometheus:
    enabled: true
    labels:
      release: prometheus-operator
toolkit:
  enabled: false
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
- kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-5drhb 0/1 Init:0/1 0 16h
gpu-feature-discovery-896b2 0/1 Init:0/1 0 16h
gpu-feature-discovery-8kxnk 0/1 Init:0/1 0 16h
gpu-feature-discovery-f999p 0/1 Init:0/1 0 16h
gpu-feature-discovery-hhs2l 0/1 Init:0/1 0 16h
gpu-feature-discovery-k76mh 0/1 Init:0/1 0 16h
gpu-feature-discovery-pdn6f 0/1 Init:0/1 0 16h
gpu-feature-discovery-tbg6x 0/1 Init:0/1 0 16h
gpu-feature-discovery-x7t74 0/1 Init:0/1 0 16h
gpu-operator-6b7854bc77-r5p9f 1/1 Running 0 16h
gpu-operator-node-feature-discovery-gc-74dd579c7f-wp99c 1/1 Running 0 16h
gpu-operator-node-feature-discovery-master-5645495d9c-9d49h 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-2qc4n 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-6t55l 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-756vt 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-fw86j 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-jwrmm 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-ncdjv 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-nxzsc 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-z9jfs 1/1 Running 0 16h
gpu-operator-node-feature-discovery-worker-zhql2 1/1 Running 0 16h
nvidia-dcgm-exporter-4dp6x 0/1 Init:0/1 0 16h
nvidia-dcgm-exporter-4stqg 0/1 Init:0/1 0 16h
nvidia-dcgm-exporter-7flft 0/1 Init:0/1 0 16h
nvidia-dcgm-exporter-bx8gn 0/1 Init:0/1 0 16h
nvidia-dcgm-exporter-fqskw 0/1 Init:0/1 0 16h
nvidia-dcgm-exporter-g7fp6 0/1 Init:0/1 0 16h
nvidia-dcgm-exporter-mtlkt 0/1 Init:0/1 0 16h
nvidia-dcgm-exporter-v8kdd 0/1 Init:0/1 0 16h
nvidia-dcgm-exporter-x8s8x 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-5pzck 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-7fzld 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-9kj9p 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-bt9gr 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-d85hc 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-hnhv8 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-jd6vs 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-lrfv4 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-mts92 0/1 Init:0/1 0 16h
nvidia-operator-validator-djn6w 0/1 Init:CrashLoopBackOff 194 (4m10s ago) 16h
nvidia-operator-validator-dn42j 0/1 Init:CrashLoopBackOff 194 (4m50s ago) 16h
nvidia-operator-validator-frh8m 0/1 Init:CrashLoopBackOff 194 (3m23s ago) 16h
nvidia-operator-validator-ht5lb 0/1 Init:CrashLoopBackOff 194 (4m33s ago) 16h
nvidia-operator-validator-jvgnf 0/1 Init:CrashLoopBackOff 194 (3m20s ago) 16h
nvidia-operator-validator-pzpvj 0/1 Init:CrashLoopBackOff 194 (4m17s ago) 16h
nvidia-operator-validator-tcjd8 0/1 Init:CrashLoopBackOff 195 (22s ago) 16h
nvidia-operator-validator-vd95x 0/1 Init:RunContainerError 195 (4s ago) 16h
nvidia-operator-validator-zkvhv 0/1 Init:CrashLoopBackOff 194 (4m16s ago) 16h
- kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 9 9 0 9 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 16h
gpu-operator-node-feature-discovery-worker 9 9 9 9 9 <none> 16h
nvidia-dcgm-exporter 9 9 0 9 0 nvidia.com/gpu.deploy.dcgm-exporter=true 16h
nvidia-device-plugin-daemonset 9 9 0 9 0 nvidia.com/gpu.deploy.device-plugin=true 16h
nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 16h
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 16h
nvidia-operator-validator 9 9 0 9 0 nvidia.com/gpu.deploy.operator-validator=true 16h
- If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod -n gpu-operator-resources nvidia-operator-validator-djn6w
Name: nvidia-operator-validator-djn6w
Namespace: gpu-operator-resources
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-operator-validator
Node: treachery/192.168.3.50
Start Time: Tue, 09 Dec 2025 16:54:27 -0600
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=cdff56f56
helm.sh/chart=gpu-operator-v25.10.1
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 0d2a73527f1e9cf9623c3c7e64a5432626308b338985a008f65f6091b3e24024
cni.projectcalico.org/podIP: 10.42.9.3/32
cni.projectcalico.org/podIPs: 10.42.9.3/32
k8s.v1.cni.cncf.io/network-status:
[{
"name": "k8s-pod-network",
"ips": [
"10.42.9.3"
],
"default": true,
"dns": {}
}]
Status: Pending
IP: 10.42.9.3
IPs:
IP: 10.42.9.3
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: containerd://6db57c35b7f2946541b0f1b88d845085dcb5bc603ac9aeac83de19051c8eddb9
Image: nvcr.io/nvidia/gpu-operator:v25.10.1
Image ID: nvcr.io/nvidia/gpu-operator@sha256:26af0c4ad1efb8f3a49df4cdfe4269c999b9094fdc5cbeb442913b544259994b
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 09 Dec 2025 16:54:41 -0600
Finished: Tue, 09 Dec 2025 16:54:42 -0600
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
OPERATOR_NAMESPACE: gpu-operator-resources (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-dir (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
toolkit-validation:
Container ID: containerd://cc370c2676b377c604826afdb0307b9c56a8b53eb09089a221e641657ea373bf
Image: nvcr.io/nvidia/gpu-operator:v25.10.1
Image ID: nvcr.io/nvidia/gpu-operator@sha256:26af0c4ad1efb8f3a49df4cdfe4269c999b9094fdc5cbeb442913b544259994b
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=all
Exit Code: 128
Started: Wed, 31 Dec 1969 18:00:00 -0600
Finished: Wed, 10 Dec 2025 09:08:16 -0600
Ready: False
Restart Count: 195
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/gpu-operator:v25.10.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator-resources (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/gpu-operator:v25.10.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/gpu-operator:v25.10.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator-resources (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/gpu-operator:v25.10.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/gpu-operator:v25.10.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; while true; do sleep 86400; done
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5fll (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-dir:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-r5fll:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m33s (x4473 over 16h) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-djn6w_gpu-operator-resources(bed91771-a066-4ab5-9164-9f76f292054e)
Normal Pulled 13s (x196 over 16h) kubelet Container image "nvcr.io/nvidia/gpu-operator:v25.10.1" already present on machine
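Given the "unresolvable CDI devices management.nvidia.com/gpu=all" error above, a quick check on an affected node is whether the CDI spec the container toolkit generates still exists; this is only a diagnostic sketch and assumes nvidia-ctk from the NVIDIA Container Toolkit is available on the host:
# list the CDI specs the toolkit can resolve; management.nvidia.com/gpu should appear here
nvidia-ctk cdi list
# CDI spec files typically live in one of these directories; /var/run is tmpfs on RHEL 8,
# so a spec written only there would not survive the power loss/reboot
ls -l /etc/cdi /var/run/cdi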
- If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container:
kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
This command was run on one of the cluster hosts; we don't have a driver container since we deploy with driver.enabled=false.
Wed Dec 10 09:09:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 Off | 00000000:2A:00.0 Off | 0 |
| 0% 37C P8 15W / 300W | 14MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A40 Off | 00000000:3D:00.0 Off | 0 |
| 0% 38C P8 15W / 300W | 14MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 10073 G /usr/libexec/Xorg 4MiB |
| 1 N/A N/A 10073 G /usr/libexec/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
- containerd logs
journalctl -u containerd > containerd.log
-- Logs begin at Mon 2025-12-08 15:42:02 CST, end at Wed 2025-12-10 09:09:54 CST. --
-- No entries --
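rke2 ships its own embedded containerd rather than using a host containerd systemd unit, which presumably explains the empty journal above; the embedded containerd logs to a file instead (path assumes the default rke2 data-dir):
# containerd log for the embedded rke2 containerd (default data-dir)
cat /var/lib/rancher/rke2/agent/containerd/containerd.log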