Description
Summary
dcgm-exporter fails to run reliably as a DaemonSet in Kubernetes when DCGM_EXPORTER_KUBERNETES is enabled.
The container crashes with: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
Even when a ServiceAccount exists and is referenced, the token is not mounted or is ignored, causing the exporter to crash repeatedly.
Environment
Kubernetes version: v1.28.15
OS: Ubuntu 22.04
GPU nodes: NVIDIA DGX A100 (SXM4 40GB)
Driver: 535.xx
CUDA: 12.2
MIG mode: Enabled (mixed)
Container runtime: containerd
Helm chart: dcgm-exporter
Image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
Expected Behavior
The dcgm-exporter DaemonSet should:
- Start successfully on all GPU nodes
- Collect GPU / MIG metrics
- Optionally map GPU usage to Kubernetes pods when DCGM_EXPORTER_KUBERNETES=true is set (assumed RBAC for this is sketched after this list)
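For context, my working assumption is that pod mapping makes the exporter talk to the API server with its ServiceAccount, which would explain why it reads the token. Under that assumption, the RBAC I would expect to need looks roughly like the sketch below; the object names, namespace, and verbs are my guesses, not something documented by the chart.

```yaml
# Sketch only: assumed read-only pod access for DCGM_EXPORTER_KUBERNETES=true.
# Names, namespace, and verbs are guesses, not taken from the dcgm-exporter chart.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dcgm-exporter-read-pods
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dcgm-exporter-read-pods
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dcgm-exporter-read-pods
subjects:
  - kind: ServiceAccount
    name: dcgm-exporter
    namespace: gpu-monitoring   # assumed namespace; adjust to the release namespace
```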
Actual Behavior
- Pods enter CrashLoopBackOff
- Logs consistently show:
  ERROR msg="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
- This happens even when (for comparison, the expected token mount is sketched below):
  - A ServiceAccount exists
  - The ServiceAccount is explicitly set in the Helm values
  - automountServiceAccountToken is enabled
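For comparison, this is roughly what I would expect the kubelet to inject into the pod spec when automounting works on v1.28. It is a sketch of a typical rendered pod, not output from the failing pod; the volume name suffix and expirationSeconds vary.

```yaml
# Typical auto-injected ServiceAccount token mount on Kubernetes v1.28 (sketch, values vary).
volumeMounts:
  - name: kube-api-access-xxxxx          # placeholder name; the real suffix is generated
    mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    readOnly: true
volumes:
  - name: kube-api-access-xxxxx
    projected:
      sources:
        - serviceAccountToken:
            path: token
            expirationSeconds: 3607
        - configMap:
            name: kube-root-ca.crt
            items:
              - key: ca.crt
                path: ca.crt
        - downwardAPI:
            items:
              - path: namespace
                fieldRef:
                  fieldPath: metadata.namespace
```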
Pod Logs
time=2026-01-29T07:46:24Z level=INFO msg="Starting dcgm-exporter"
time=2026-01-29T07:46:31Z level=INFO msg="DCGM successfully initialized!"
time=2026-01-29T07:46:32Z level=INFO msg="Collecting DCP Metrics"
time=2026-01-29T07:46:32Z level=ERROR msg="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
Helm Values Used
arguments:
  - "-c=1000"
  - "--collectors=nvml,dcgm"
extraEnv:
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
nodeSelector:
  nvidia.com/gpu.present: "true"
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
serviceAccount:
  create: false
  name: dcgm-exporter
automountServiceAccountToken: true
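Since serviceAccount.create is false, the chart expects the ServiceAccount to already exist. A minimal manifest for it would look like the sketch below; the namespace is an assumption on my side and has to match the Helm release namespace.

```yaml
# Pre-existing ServiceAccount referenced by the values above (sketch; namespace is assumed).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring   # assumed; must match the Helm release namespace
automountServiceAccountToken: true
```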
Additional Observations
- nvidia-dcgm.service is running correctly on the host: systemctl status nvidia-dcgm shows Active: active (running)
- /var/lib/kubelet/pod-resources is mounted correctly
- GPU labels (nvidia.com/gpu.present=true) are present
- Node has no taints
- Issue occurs on multiple nodes
- Disabling Kubernetes mode (DCGM_EXPORTER_KUBERNETES=false) avoids the token error, but then pod-level metrics are unavailable
Questions
- How can I get this pod up and running?
- Is DCGM_EXPORTER_KUBERNETES=true strictly required for the dcgm-exporter DaemonSet?
- Is there a supported way to run dcgm-exporter in Kubernetes without requiring a ServiceAccount token?
- Is Kubernetes v1.28 officially supported for pod-level GPU metrics?
Thank you
Thanks for maintaining dcgm-exporter — any guidance on the correct configuration or confirmation of a bug would be appreciated.