DCGM Exporter DaemonSet crashes in Kubernetes due to missing ServiceAccount token (/var/run/secrets/.../token) #624

@MepHist2721Y

Description

Summary

dcgm-exporter fails to run reliably as a DaemonSet in Kubernetes when DCGM_EXPORTER_KUBERNETES is enabled.
The container crashes with: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
Even when a ServiceAccount exists and is referenced, the token is not mounted or is ignored, causing the exporter to crash repeatedly.

Environment

Kubernetes version: v1.28.15
OS: Ubuntu 22.04
GPU nodes: NVIDIA DGX A100 (SXM4 40GB)
Driver: 535.xx
CUDA: 12.2
MIG mode: Enabled (mixed)
Container runtime: containerd
Helm chart: dcgm-exporter
Image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04

Expected Behavior

The dcgm-exporter DaemonSet should:

  • Start successfully on all GPU nodes
  • Collect GPU / MIG metrics
  • Optionally map GPU usage to Kubernetes pods when DCGM_EXPORTER_KUBERNETES=true

Actual Behavior

  • Pods enter CrashLoopBackOff

  • Logs consistently show: ERROR msg="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"

  • This happens even when:
    - A ServiceAccount exists
    - The ServiceAccount is explicitly set in Helm values
    - automountServiceAccountToken is enabled
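
To check whether the token is really absent from the running pod spec (rather than mounted but ignored), the pod JSON can be inspected for the projected token volume. The following is a minimal sketch, not part of dcgm-exporter: `token_mount_report` is a hypothetical helper name, and it assumes its input is the parsed output of `kubectl get pod <pod-name> -n <namespace> -o json`.

```python
# Hypothetical diagnostic helper (not part of dcgm-exporter or the Helm chart).
# Input: the parsed JSON of one pod, e.g. from `kubectl get pod <pod> -o json`.

TOKEN_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def token_mount_report(pod: dict) -> dict:
    """Summarize whether a pod spec carries a projected ServiceAccount token."""
    spec = pod.get("spec") or {}

    # Volumes that project a serviceAccountToken source. Kubernetes normally
    # injects one of these, named kube-api-access-<suffix>, when automount is on.
    token_volumes = {
        vol["name"]
        for vol in spec.get("volumes") or []
        for src in (vol.get("projected") or {}).get("sources") or []
        if "serviceAccountToken" in src
    }

    # Containers that mount one of those volumes at the standard token directory.
    containers_with_token = [
        c["name"]
        for c in spec.get("containers") or []
        for m in c.get("volumeMounts") or []
        if m.get("name") in token_volumes and m.get("mountPath") == TOKEN_DIR
    ]

    return {
        "serviceAccountName": spec.get("serviceAccountName"),
        # None means the field is unset in the pod spec.
        "automountServiceAccountToken": spec.get("automountServiceAccountToken"),
        "token_volumes": sorted(token_volumes),
        "containers_with_token": containers_with_token,
    }
```

If `containers_with_token` comes back empty for a crashing pod, the token was never mounted. If the exporter container is listed, the mount exists and the failure lies elsewhere.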

Pod Logs

time=2026-01-29T07:46:24Z level=INFO msg="Starting dcgm-exporter"
time=2026-01-29T07:46:31Z level=INFO msg="DCGM successfully initialized!"
time=2026-01-29T07:46:32Z level=INFO msg="Collecting DCP Metrics"
time=2026-01-29T07:46:32Z level=ERROR msg="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"

Helm Values used

arguments:
  - "-c=1000"
  - "--collectors=nvml,dcgm"

extraEnv:
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"

nodeSelector:
  nvidia.com/gpu.present: "true"

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

serviceAccount:
  create: false
  name: dcgm-exporter

automountServiceAccountToken: true
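
One thing that may be worth ruling out, given `serviceAccount.create: false`: the pre-existing `dcgm-exporter` ServiceAccount object can carry its own `automountServiceAccountToken` setting, and it is not confirmed here that the chart passes the top-level `automountServiceAccountToken` value through to the pod template. In Kubernetes, the pod spec field takes precedence over the ServiceAccount field, and the token is mounted by default when neither is set. The sketch below encodes that rule. `effective_automount` is a hypothetical helper name. Its inputs are assumed to be `spec.template.spec` from `kubectl get daemonset <name> -o json` and the parsed output of `kubectl get serviceaccount dcgm-exporter -o json`.

```python
# Hypothetical helper illustrating how Kubernetes resolves token automounting.
# Not part of dcgm-exporter; inputs are plain dicts parsed from kubectl JSON.
from typing import Optional


def effective_automount(pod_spec: dict, service_account: Optional[dict]) -> bool:
    """Return True if a ServiceAccount token would be auto-mounted."""
    # 1. An explicit value in the pod spec wins.
    pod_value = pod_spec.get("automountServiceAccountToken")
    if pod_value is not None:
        return bool(pod_value)

    # 2. Otherwise the ServiceAccount object's own setting applies.
    sa_value = (service_account or {}).get("automountServiceAccountToken")
    if sa_value is not None:
        return bool(sa_value)

    # 3. With neither set, Kubernetes mounts the token by default.
    return True
```

For example, a ServiceAccount created with `automountServiceAccountToken: false` combined with a pod template that leaves the field unset would give `False`, which would match the missing token file.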

Additional Observations

  • nvidia-dcgm.service is running correctly on the host:

    systemctl status nvidia-dcgm
    Active: active (running)
    
    
  • /var/lib/kubelet/pod-resources is mounted correctly

  • GPU labels (nvidia.com/gpu.present=true) are present

  • Node has no taints

  • Issue occurs on multiple nodes

  • Disabling Kubernetes mode:

    DCGM_EXPORTER_KUBERNETES=false

    avoids the token error, but then pod-level metrics are unavailable

Questions

  • How can I get these pods up and running?
  • Is DCGM_EXPORTER_KUBERNETES=true strictly required for dcgm-exporter DaemonSet?
  • Is there a supported way to run dcgm-exporter in Kubernetes without requiring a ServiceAccount token?
  • Is Kubernetes v1.28 officially supported for pod-level GPU metrics?

Thank you

Thanks for maintaining dcgm-exporter — any guidance on the correct configuration or confirmation of a bug would be appreciated.

@NVIDIA
@kubernetes
@helm
@prometheus
@grafana
