Description
Summary
dcgm-exporter fails to run reliably as a DaemonSet in Kubernetes when DCGM_EXPORTER_KUBERNETES is enabled.
The container crashes with: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
Even when a ServiceAccount exists and is referenced, the token is not mounted or is ignored, causing the exporter to crash repeatedly.
Environment
Kubernetes version: v1.28.15
OS: Ubuntu 22.04
GPU nodes: NVIDIA DGX A100 (SXM4 40GB)
Driver: 535.xx
CUDA: 12.2
MIG mode: Enabled (mixed)
Container runtime: containerd
Helm chart: dcgm-exporter
Image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
Expected Behavior
The dcgm-exporter DaemonSet should:
- Start successfully on all GPU nodes
- Collect GPU / MIG metrics
- Optionally map GPU usage to Kubernetes pods when DCGM_EXPORTER_KUBERNETES=true is set (assumed RBAC for this is sketched after this list)
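For context, my working assumption is that pod mapping makes the exporter talk to the API server with its ServiceAccount, which would explain why it reads the token. Under that assumption, the RBAC I would expect to need looks roughly like the sketch below; the object names, namespace, and verbs are my guesses, not something documented by the chart.

```yaml
# Sketch only: assumed read-only pod access for DCGM_EXPORTER_KUBERNETES=true.
# Names, namespace, and verbs are guesses, not taken from the dcgm-exporter chart.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dcgm-exporter-read-pods
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dcgm-exporter-read-pods
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dcgm-exporter-read-pods
subjects:
  - kind: ServiceAccount
    name: dcgm-exporter
    namespace: gpu-monitoring   # assumed namespace; adjust to the release namespace
```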
Actual Behavior
- Pods enter CrashLoopBackOff
- Logs consistently show:
  ERROR msg="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
- This happens even when (for comparison, the expected token mount is sketched below):
  - A ServiceAccount exists
  - The ServiceAccount is explicitly set in the Helm values
  - automountServiceAccountToken is enabled
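For comparison, this is roughly what I would expect the kubelet to inject into the pod spec when automounting works on v1.28. It is a sketch of a typical rendered pod, not output from the failing pod; the volume name suffix and expirationSeconds vary.

```yaml
# Typical auto-injected ServiceAccount token mount on Kubernetes v1.28 (sketch, values vary).
volumeMounts:
  - name: kube-api-access-xxxxx          # placeholder name; the real suffix is generated
    mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    readOnly: true
volumes:
  - name: kube-api-access-xxxxx
    projected:
      sources:
        - serviceAccountToken:
            path: token
            expirationSeconds: 3607
        - configMap:
            name: kube-root-ca.crt
            items:
              - key: ca.crt
                path: ca.crt
        - downwardAPI:
            items:
              - path: namespace
                fieldRef:
                  fieldPath: metadata.namespace
```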
Pod Logs
time=2026-01-29T07:46:24Z level=INFO msg="Starting dcgm-exporter"
time=2026-01-29T07:46:31Z level=INFO msg="DCGM successfully initialized!"
time=2026-01-29T07:46:32Z level=INFO msg="Collecting DCP Metrics"
time=2026-01-29T07:46:32Z level=ERROR msg="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
Helm Values Used
arguments:
  - "-c=1000"
  - "--collectors=nvml,dcgm"
extraEnv:
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
nodeSelector:
  nvidia.com/gpu.present: "true"
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
serviceAccount:
  create: false
  name: dcgm-exporter
automountServiceAccountToken: true
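Since serviceAccount.create is false, the chart expects the ServiceAccount to already exist. A minimal manifest for it would look like the sketch below; the namespace is an assumption on my side and has to match the Helm release namespace.

```yaml
# Pre-existing ServiceAccount referenced by the values above (sketch; namespace is assumed).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring   # assumed; must match the Helm release namespace
automountServiceAccountToken: true
```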
Additional Observations
- nvidia-dcgm.service is running correctly on the host: systemctl status nvidia-dcgm shows Active: active (running)
- /var/lib/kubelet/pod-resources is mounted correctly
- GPU labels (nvidia.com/gpu.present=true) are present
- Node has no taints
- Issue occurs on multiple nodes
- Disabling Kubernetes mode (DCGM_EXPORTER_KUBERNETES=false) avoids the token error, but then pod-level metrics are unavailable
Questions
- How can I get this pod up and running?
- Is DCGM_EXPORTER_KUBERNETES=true strictly required for the dcgm-exporter DaemonSet?
- Is there a supported way to run dcgm-exporter in Kubernetes without requiring a ServiceAccount token?
- Is Kubernetes v1.28 officially supported for pod-level GPU metrics?
Thank you
Thanks for maintaining dcgm-exporter — any guidance on the correct configuration or confirmation of a bug would be appreciated.