Add per-process GPU metrics for time-sharing and MIG #594

Open

krystiancastai wants to merge 2 commits into NVIDIA:main from krystiancastai:feature/mig-time-sharing-support

Conversation


@krystiancastai krystiancastai commented Dec 4, 2025

Add per-process GPU metrics for time-sharing and MIG environments

Problem

When using GPU time-sharing (multiple pods sharing a single GPU or MIG instance), DCGM Exporter metrics were inaccurate:

Regular GPU time-sharing:
Device-level metrics were duplicated to each pod, making it impossible to determine actual per-pod resource consumption:

DCGM_FI_DEV_GPU_UTIL{pod="gpu-workload-1",...} 99
DCGM_FI_DEV_GPU_UTIL{pod="gpu-workload-2",...} 99
DCGM_FI_DEV_GPU_UTIL{pod="gpu-workload-3",...} 99
DCGM_FI_DEV_GPU_UTIL{pod="gpu-workload-4",...} 99

MIG + time-sharing:
Only one arbitrary pod was reported per MIG instance; the other pods sharing the same MIG instance were missing from the metrics entirely.

Solution

This PR adds per-process GPU metrics collection using NVML APIs, enabling per-pod metrics in time-sharing scenarios (a simplified sketch of the flow follows the list):

  • Uses nvmlDeviceGetComputeRunningProcesses to get per-process memory usage
  • Uses nvmlDeviceGetProcessUtilization to get per-process SM utilization (regular GPUs only)
  • Maps process PIDs to Kubernetes pods via cgroup
  • Supports both regular GPUs and MIG instances
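
The snippet below is a minimal, hypothetical sketch of that flow using the go-nvml bindings: query per-process memory and SM utilization from NVML, then resolve each PID to a pod UID by parsing /proc/<pid>/cgroup. The function names, cgroup regex, and error handling are illustrative assumptions, not the code added by this PR; the real implementation also has to handle MIG handles, cgroup v1/v2 layout differences, and the exporter's metric pipeline.

```go
// Hypothetical sketch, not the PR's code: per-process GPU usage grouped by pod.
package main

import (
	"fmt"
	"os"
	"regexp"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// podUIDRe matches the pod UID embedded in kubepods cgroup paths
// (e.g. "...-pod<uid>.slice" or ".../pod<uid>"); exact layouts vary by distro (assumption).
var podUIDRe = regexp.MustCompile(`pod([0-9a-fA-F]{8}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{12})`)

// podUIDForPID reads /proc/<pid>/cgroup (requires hostPID) and extracts the pod UID.
func podUIDForPID(pid uint32) (string, bool) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", false
	}
	if m := podUIDRe.FindSubmatch(data); m != nil {
		return string(m[1]), true
	}
	return "", false
}

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	dev, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}

	// Per-process framebuffer usage (available for regular GPUs and MIG instances).
	procs, ret := dev.GetComputeRunningProcesses()
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	fbPerPod := map[string]uint64{}
	for _, p := range procs {
		if uid, ok := podUIDForPID(p.Pid); ok {
			fbPerPod[uid] += p.UsedGpuMemory // bytes
		}
	}
	for uid, b := range fbPerPod {
		fmt.Printf("pod %s fb_used_mib=%d\n", uid, b/(1024*1024))
	}

	// Per-process SM utilization (regular GPUs only; NVML does not expose it for MIG).
	if samples, ret := dev.GetProcessUtilization(0); ret == nvml.SUCCESS {
		for _, s := range samples {
			if uid, ok := podUIDForPID(s.Pid); ok {
				fmt.Printf("pod %s sm_util=%d\n", uid, s.SmUtil)
			}
		}
	}
}
```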

Configuration

The changes in this PR only affect the code path taken when the --kubernetes-virtual-gpus flag is enabled (or the KUBERNETES_VIRTUAL_GPUS=true environment variable is set). When this flag is disabled, the existing behavior remains unchanged.

Required environment variables:

  • KUBERNETES_VIRTUAL_GPUS=true
  • DCGM_EXPORTER_KUBERNETES=true
  • DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID=true
  • DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE — must match your cloud provider:
    • GKE (GCP): device-name
    • EKS (AWS): uid

Required pod spec settings (an illustrative fragment follows the list):

  • hostPID: true
  • securityContext.privileged: true
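
As a rough illustration of how these settings fit together, here is a hypothetical fragment of the exporter's pod spec expressed with Kubernetes client-go types. The container name, image reference, and the chosen GPU_ID_TYPE value are assumptions; adapt them to your cluster.

```go
// Hypothetical fragment only; adapt names, image, and values to your deployment.
package deploy

import corev1 "k8s.io/api/core/v1"

func exporterPodSpec() corev1.PodSpec {
	privileged := true
	return corev1.PodSpec{
		HostPID: true, // required to read /proc/<pid>/cgroup of GPU workload processes
		Containers: []corev1.Container{{
			Name:            "dcgm-exporter",
			Image:           "nvcr.io/nvidia/k8s/dcgm-exporter", // pick an appropriate tag
			SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
			Env: []corev1.EnvVar{
				{Name: "KUBERNETES_VIRTUAL_GPUS", Value: "true"},
				{Name: "DCGM_EXPORTER_KUBERNETES", Value: "true"},
				{Name: "DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID", Value: "true"},
				// "device-name" on GKE, "uid" on EKS (see the list above).
				{Name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE", Value: "device-name"},
			},
		}},
	}
}
```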

Per-Process Metrics by GPU Type

Regular GPUs (Time-Sharing)

Both GPU utilization and memory metrics are available per-process:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | Per-process SM utilization (%) |
| DCGM_FI_DEV_FB_USED | Per-process memory (MiB) |

The metrics provide a clear hierarchy:

| Metric Type | Labels | Description |
| --- | --- | --- |
| Device total | No pod labels | Total utilization/memory for the GPU |
| Per-pod | pod="...", vgpu="..." | Individual pod's usage on that GPU |

Example output

GPU Utilization:

# Device total
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-..."} 98

# Per-pod breakdown
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-1",vgpu="0"} 6
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-2",vgpu="8"} 31
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-3",vgpu="3"} 61
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-4",vgpu="5"} 0

Memory Usage:

# Device total
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-..."} 1194

# Per-pod breakdown
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-1",vgpu="0"} 620
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-2",vgpu="8"} 108
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-3",vgpu="3"} 108
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-4",vgpu="5"} 358

MIG Instances (Time-Sharing)

Only memory metrics are available per-process (an NVML limitation: SM utilization is not exposed for MIG instances):

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_FB_USED | Per-process memory (MiB) |

The metrics provide a clear hierarchy:

| Metric Type | Labels | Description |
| --- | --- | --- |
| MIG instance total | GPU_I_ID="8" (no pod labels) | Total memory for the MIG instance |
| Per-pod within MIG | GPU_I_ID="8", pod="...", vgpu="..." | Individual pod's memory usage within that MIG instance |

Example output

MIG instance 8 shared by 3 pods:

# MIG instance total
DCGM_FI_DEV_FB_USED{GPU_I_ID="8",...} 642

# Per-pod breakdown within MIG instance 8
DCGM_FI_DEV_FB_USED{GPU_I_ID="8",pod="gpu-workload-1",vgpu="3",...} 82
DCGM_FI_DEV_FB_USED{GPU_I_ID="8",pod="gpu-workload-2",vgpu="11",...} 478
DCGM_FI_DEV_FB_USED{GPU_I_ID="8",pod="gpu-workload-3",vgpu="1",...} 82

Bug Fixes

AWS EKS MIG support: Fixed MIG device ID parsing for AWS EKS environments. The NVIDIA device plugin on EKS reports MIG device IDs with a ::N suffix (e.g., MIG-xxx::7), which previously caused the pod-to-device mapping to fail.
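
A hypothetical sketch of that normalization (the helper name and package are assumptions; the PR's actual parsing may differ):

```go
package example

import "strings"

// normalizeMIGDeviceID strips an EKS-style "::N" suffix (e.g. "MIG-xxx::7" -> "MIG-xxx")
// so the ID can be matched against the device seen by the exporter. Sketch only.
func normalizeMIGDeviceID(id string) string {
	if i := strings.Index(id, "::"); i >= 0 {
		return id[:i]
	}
	return id
}
```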


Notes

  • Device-level metrics are always emitted alongside per-pod metrics
  • Pods with no active GPU processes show 0 for both metrics

@nvvfedorov nvvfedorov requested a review from Copilot December 4, 2025 17:48

Copilot AI left a comment

Pull request overview

This PR adds MIG time-sharing support with per-process metrics collection. It enables tracking GPU utilization and memory usage on a per-pod basis for both regular GPUs and MIG devices in Kubernetes environments with time-shared GPU workloads.

Key changes:

  • Implements per-process metrics collection for GPU utilization and framebuffer memory usage
  • Adds PID-to-pod mapping using cgroup information to associate GPU processes with Kubernetes pods
  • Extends NVML provider with methods to query process-level metrics for both regular and MIG GPUs

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| internal/pkg/transformation/process_metrics.go | Core logic for collecting and organizing per-process GPU metrics |
| internal/pkg/transformation/process_metrics_test.go | Comprehensive test coverage for process metrics collection |
| internal/pkg/transformation/pidmapper.go | Linux implementation for mapping PIDs to pods via cgroup parsing |
| internal/pkg/transformation/pidmapper_stub.go | Non-Linux stub implementation of the PID mapper |
| internal/pkg/transformation/pidmapper_test.go | Tests for PID-to-pod mapping functionality |
| internal/pkg/transformation/kubernetes.go | Integration of per-process metrics into the pod mapping workflow |
| internal/pkg/transformation/kubernetes_test.go | Tests for per-process metrics integration with the Kubernetes pod mapper |
| internal/pkg/transformation/const.go | Added metric name constants for per-process metrics |
| internal/pkg/nvmlprovider/types.go | Extended NVML interface with per-process query methods |
| internal/pkg/nvmlprovider/provider.go | Implementation of per-process metrics collection using NVML |
| internal/mocks/pkg/nvmlprovider/mock_client.go | Mock implementations for the new NVML methods |
| go.mod | Added cgroups v3 dependency for PID-to-pod mapping |

Comments suppressed due to low confidence (1)

internal/pkg/transformation/kubernetes_test.go:1

  • Assertion expects 'test-pod' but the test setup at line 388 uses pod0 which has Name 'pod0'. This will cause the test to fail.

@nvvfedorov
Collaborator

@krystiancastai , Please sign your commit, as it is required by the CONTRIBUTING.md

@nvvfedorov
Collaborator

@krystiancastai, Add a description explaining what problem the PR solves or fixes.

Collaborator

@nvvfedorov nvvfedorov left a comment

@krystiancastai, Thank you for your contribution. The code in general looks good to me. However, because this is a new feature, please add a description of the feature, what problem it solves, the configuration, and any other relevant details. Also, please don't forget to sign off your commit and update the year in the license headers for the new files.

@krystiancastai krystiancastai marked this pull request as draft December 5, 2025 08:01
@krystiancastai
Author

> @krystiancastai, Add a description explaining what problem the PR solves or fixes.

@nvvfedorov Thanks for the comments! My intention was to publish this merge request in draft mode, as it's missing a description with detailed examples of what was changed, and I still see a few minor things in the code that need fixing. I'll let you know once I've made the updates.

@krystiancastai krystiancastai changed the title Add MIG time-sharing support with per-process metrics Add per-process GPU metrics for time-sharing and MIG Dec 8, 2025
Signed-off-by: Krystian Bednarczuk <krystian@cast.ai>
@krystiancastai krystiancastai force-pushed the feature/mig-time-sharing-support branch 2 times, most recently from e1102f5 to 012200b on December 8, 2025 14:00
Signed-off-by: Krystian Bednarczuk <krystian@cast.ai>
@krystiancastai krystiancastai force-pushed the feature/mig-time-sharing-support branch from 012200b to 0871d70 on December 8, 2025 16:12
@krystiancastai krystiancastai marked this pull request as ready for review December 8, 2025 16:21
@glowkey
Collaborator

glowkey commented Dec 10, 2025

Thanks for creating this PR! We are planning to test and validate this MR for our next major release in January 2026.

@krystiancastai
Author

> Thanks for creating this PR! We are planning to test and validate this MR for our next major release in January 2026.

Hey @glowkey , hope you're doing well! Have you had a chance to test this MR? Is there anything I can do to help move things forward?

@glowkey
Collaborator

glowkey commented Jan 29, 2026

Apologies that we did not have time to review and test this for the January release. We had to prioritize the GPU bind/unbind functionality. I imagine that this MR will now need to be updated and tested with that behavior. When a GPU unbind is detected, all NVML handles must be released, and nvmlInit() then needs to be called again when a GPU bind is detected. You can see this behavior in app.go/handleGpuTopologyChange(). Let us know if you have any questions.
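
For reference, a hedged sketch of the shutdown/re-init pattern described above, using the go-nvml bindings; this is not the exporter's handleGpuTopologyChange code, and the function and callback names are assumptions.

```go
package example

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// reinitNVML illustrates the unbind/bind handling: the caller drops any cached
// nvml.Device handles, NVML is shut down, and it is initialized again once the
// GPU is bound. Sketch only, not the exporter's implementation.
func reinitNVML(releaseHandles func()) error {
	releaseHandles()
	if ret := nvml.Shutdown(); ret != nvml.SUCCESS {
		return fmt.Errorf("nvml shutdown: %s", nvml.ErrorString(ret))
	}
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("nvml init: %s", nvml.ErrorString(ret))
	}
	return nil
}
```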
