Description
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
After updating to the latest operator and reconfiguring it to run with CDI, nvidia-smi shows the correct number of GPUs assigned to the pod, but NVIDIA_VISIBLE_DEVICES is set to void. This breaks, for example, GPU usage reporting into clear-ml. It happens on all GPU nodes (RTX Titan, RTX 6000 Ada, L40s...).
To Reproduce
- Setup RKE2 - https://docs.rke2.io/add-ons/gpu_operators?GPUoperator=v25.10.x
- Install gpu-operator with the values below (CDI is left at its default, i.e. enabled):
node-feature-discovery:
  worker:
    priorityClassName: system-node-critical
  master:
    priorityClassName: system-node-critical
driver:
  upgradePolicy:
    # autoUpgrade (default=true): Switch which enables / disables the driver upgrade controller.
    autoUpgrade: false
  usePrecompiled: false
  version: "580.105.08"
toolkit:
  env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
operator:
  upgradeCRD: true
dcgmExporter:
  config:
    name: custom-dcgm-exporter-metrics
    create: true
    data: |-
      .... (shortened, as it's not relevant)
- Request 1 GPU for a pod (it doesn't matter whether you specify containerRuntimeClass: nvidia or leave it empty, as the RKE2 docs suggest)
- NVIDIA_VISIBLE_DEVICES is set to void (CUDA_VISIBLE_DEVICES is not present at all)
Expected behavior
NVIDIA_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES is set to the correct device ID(s).
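To make the observed and expected states easier to compare, here is a minimal sketch that can be run inside the pod to classify the two variables. The function name classify_visible_devices and its category labels are hypothetical and only for illustration; this is not how clear-ml or the GPU Operator inspects the environment. The special values it recognizes (all, none, void, empty) are the ones documented for NVIDIA_VISIBLE_DEVICES by the NVIDIA Container Toolkit.

```python
import os
from typing import Optional


def classify_visible_devices(value: Optional[str]) -> str:
    """Classify a *_VISIBLE_DEVICES value (hypothetical helper, labels are illustrative)."""
    if value is None:
        return "unset"
    stripped = value.strip()
    if stripped == "":
        return "empty"
    if stripped in ("void", "none", "all"):
        # Special keywords rather than device identifiers.
        return stripped
    # Anything else is treated as a comma-separated list of indices or UUIDs.
    return "explicit"


if __name__ == "__main__":
    for name in ("NVIDIA_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        raw = os.environ.get(name)
        print(f"{name}={raw!r} -> {classify_visible_devices(raw)}")
```

With the behavior described in this report, the script would classify NVIDIA_VISIBLE_DEVICES as void and CUDA_VISIBLE_DEVICES as unset, whereas the expected behavior corresponds to explicit for at least one of them.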
Environment (please provide the following information):
- GPU Operator Version: v25.10.1
- OS: Ubuntu 22.04 LTS
- Kernel Version: 5.15.0-163-generic
- Container Runtime Version: v2.1.5-k3s1
- Kubernetes Distro and Version: RKE2 v1.34.2+rke2r1
Information to attach (optional if deemed irrelevant)
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]