NVIDIA_VISIBLE_DEVICES is set to void but nvidia-smi shows correct gpu assigned by k8s #1994

Description

@mmolisch

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
After updating to the latest operator and reconfiguring it to run with CDI, I can see the correct number of GPUs being assigned via nvidia-smi, but NVIDIA_VISIBLE_DEVICES is set to void. This breaks e.g. reporting of GPU usage into ClearML. It happens on all GPU nodes (RTX Titan, RTX 6000 Ada, L40S, ...).
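
For reference, the symptom can be confirmed from outside the workload pod like this (pod and namespace names are placeholders, not taken from the report):

# Inspect the GPUs and the injected environment inside the affected workload pod.
kubectl exec -n <namespace> <pod-name> -- nvidia-smi -L
kubectl exec -n <namespace> <pod-name> -- env | grep -i visible_devices
# Observed on the affected nodes: NVIDIA_VISIBLE_DEVICES=void, while
# CUDA_VISIBLE_DEVICES is not set at all.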

To Reproduce

  1. Set up RKE2 following https://docs.rke2.io/add-ons/gpu_operators?GPUoperator=v25.10.x
  2. Install the gpu-operator with the following values (otherwise defaults, running with CDI):
node-feature-discovery:
  worker:
    priorityClassName: system-node-critical
  master:
    priorityClassName: system-node-critical
driver:
  upgradePolicy:
    # autoUpgrade (default=true): Switch which enables / disables the driver upgrade controller.
    autoUpgrade: false
  usePrecompiled: false
  version: "580.105.08"
toolkit:
  env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
operator:
  upgradeCRD: true
dcgmExporter:
  config:
    name: custom-dcgm-exporter-metrics
    create: true
    data: |-
       .... (shortened, as it's not relevant)
  3. Request 1 GPU for a pod (it doesn't matter whether you set runtimeClassName: nvidia or leave it empty, as the RKE2 docs say); a minimal reproducer is sketched below
  4. NVIDIA_VISIBLE_DEVICES is set to void (CUDA_VISIBLE_DEVICES is not present at all)
[Screenshot: pod environment showing NVIDIA_VISIBLE_DEVICES=void]
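
A minimal reproducer for steps 3-4 might look like the sketch below (pod name, image, and command are illustrative assumptions, not taken from the report; drop runtimeClassName: nvidia to test the other variant):

# Sketch: request one GPU and dump the relevant environment variables.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-env-test
spec:
  runtimeClassName: nvidia      # optional per the RKE2 docs
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["sh", "-c", "nvidia-smi -L && env | grep -i visible_devices"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Once the pod has completed, kubectl logs gpu-env-test should show both the nvidia-smi device list and the NVIDIA_VISIBLE_DEVICES value described above.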

Expected behavior
NVIDIA_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES) is set to the correct device ID(s)
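
For comparison, in a working (non-CDI) setup the device plugin normally injects the assigned device(s) into the container environment; the values below are purely illustrative:

# Illustrative expected environment (the actual value is the index or UUID of
# the GPU(s) assigned by the device plugin, depending on its device list strategy):
#   NVIDIA_VISIBLE_DEVICES=0
#   or
#   NVIDIA_VISIBLE_DEVICES=GPU-<uuid>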

Environment (please provide the following information):

  • GPU Operator Version: v25.10.1
  • OS: Ubuntu 22.04 LTS
  • Kernel Version: 5.15.0-163-generic
  • Container Runtime Version: v2.1.5-k3s1
  • Kubernetes Distro and Version: RKE2 v1.34.2+rke2r1

Information to attach (optional if deemed irrelevant)

  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
[Screenshot: nvidia-smi output from the driver container]

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
