Add per-process GPU metrics for time-sharing and MIG #594

Open

krystiancastai wants to merge 2 commits into NVIDIA:main from krystiancastai:feature/mig-time-sharing-support

Conversation


@krystiancastai krystiancastai commented Dec 4, 2025

Add per-process GPU metrics for time-sharing and MIG environments

Problem

When using GPU time-sharing (multiple pods sharing a single GPU or MIG instance), DCGM Exporter metrics were inaccurate:

Regular GPU time-sharing:
Device-level metrics were duplicated to each pod, making it impossible to determine actual per-pod resource consumption:

DCGM_FI_DEV_GPU_UTIL{pod="gpu-workload-1",...} 99
DCGM_FI_DEV_GPU_UTIL{pod="gpu-workload-2",...} 99
DCGM_FI_DEV_GPU_UTIL{pod="gpu-workload-3",...} 99
DCGM_FI_DEV_GPU_UTIL{pod="gpu-workload-4",...} 99

MIG + time-sharing:
Only one arbitrary pod was reported per MIG instance; the other pods sharing the same MIG instance were missing from the metrics entirely.

Solution

This PR adds per-process GPU metrics collection using NVML APIs, enabling per-pod metrics in time-sharing scenarios (a simplified sketch of the flow follows the list):

  • Uses nvmlDeviceGetComputeRunningProcesses to get per-process memory usage
  • Uses nvmlDeviceGetProcessUtilization to get per-process SM utilization (regular GPUs only)
  • Maps process PIDs to Kubernetes pods via cgroup
  • Supports both regular GPUs and MIG instances
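
The snippet below is a minimal, hypothetical sketch of that flow using the go-nvml bindings: query per-process memory and SM utilization from NVML, then resolve each PID to a pod UID by parsing /proc/<pid>/cgroup. The function names, cgroup regex, and error handling are illustrative assumptions, not the code added by this PR; the real implementation also has to handle MIG handles, cgroup v1/v2 layout differences, and the exporter's metric pipeline.

```go
// Hypothetical sketch, not the PR's code: per-process GPU usage grouped by pod.
package main

import (
	"fmt"
	"os"
	"regexp"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// podUIDRe matches the pod UID embedded in kubepods cgroup paths
// (e.g. "...-pod<uid>.slice" or ".../pod<uid>"); exact layouts vary by distro (assumption).
var podUIDRe = regexp.MustCompile(`pod([0-9a-fA-F]{8}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{12})`)

// podUIDForPID reads /proc/<pid>/cgroup (requires hostPID) and extracts the pod UID.
func podUIDForPID(pid uint32) (string, bool) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", false
	}
	if m := podUIDRe.FindSubmatch(data); m != nil {
		return string(m[1]), true
	}
	return "", false
}

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	dev, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}

	// Per-process framebuffer usage (available for regular GPUs and MIG instances).
	procs, ret := dev.GetComputeRunningProcesses()
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	fbPerPod := map[string]uint64{}
	for _, p := range procs {
		if uid, ok := podUIDForPID(p.Pid); ok {
			fbPerPod[uid] += p.UsedGpuMemory // bytes
		}
	}
	for uid, b := range fbPerPod {
		fmt.Printf("pod %s fb_used_mib=%d\n", uid, b/(1024*1024))
	}

	// Per-process SM utilization (regular GPUs only; NVML does not expose it for MIG).
	if samples, ret := dev.GetProcessUtilization(0); ret == nvml.SUCCESS {
		for _, s := range samples {
			if uid, ok := podUIDForPID(s.Pid); ok {
				fmt.Printf("pod %s sm_util=%d\n", uid, s.SmUtil)
			}
		}
	}
}
```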

Configuration

The changes in this PR only affect the code path taken when the --kubernetes-virtual-gpus flag is enabled (or the KUBERNETES_VIRTUAL_GPUS=true environment variable is set). When this flag is disabled, the existing behavior remains unchanged.

Required environment variables:

  • KUBERNETES_VIRTUAL_GPUS=true
  • DCGM_EXPORTER_KUBERNETES=true
  • DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID=true
  • DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE — must match your cloud provider:
    • GKE (GCP): device-name
    • EKS (AWS): uid

Required pod spec settings (an illustrative fragment follows the list):

  • hostPID: true
  • securityContext.privileged: true
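
As a rough illustration of how these settings fit together, here is a hypothetical fragment of the exporter's pod spec expressed with Kubernetes client-go types. The container name, image reference, and the chosen GPU_ID_TYPE value are assumptions; adapt them to your cluster.

```go
// Hypothetical fragment only; adapt names, image, and values to your deployment.
package deploy

import corev1 "k8s.io/api/core/v1"

func exporterPodSpec() corev1.PodSpec {
	privileged := true
	return corev1.PodSpec{
		HostPID: true, // required to read /proc/<pid>/cgroup of GPU workload processes
		Containers: []corev1.Container{{
			Name:            "dcgm-exporter",
			Image:           "nvcr.io/nvidia/k8s/dcgm-exporter", // pick an appropriate tag
			SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
			Env: []corev1.EnvVar{
				{Name: "KUBERNETES_VIRTUAL_GPUS", Value: "true"},
				{Name: "DCGM_EXPORTER_KUBERNETES", Value: "true"},
				{Name: "DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID", Value: "true"},
				// "device-name" on GKE, "uid" on EKS (see the list above).
				{Name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE", Value: "device-name"},
			},
		}},
	}
}
```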

Per-Process Metrics by GPU Type

Regular GPUs (Time-Sharing)

Both GPU utilization and memory metrics are available per-process:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | Per-process SM utilization (%) |
| DCGM_FI_DEV_FB_USED | Per-process memory (MiB) |

The metrics provide a clear hierarchy:

| Metric Type | Labels | Description |
| --- | --- | --- |
| Device total | No pod labels | Total utilization/memory for the GPU |
| Per-pod | pod="...", vgpu="..." | Individual pod's usage on that GPU |

Example output

GPU Utilization:

# Device total
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-..."} 98

# Per-pod breakdown
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-1",vgpu="0"} 6
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-2",vgpu="8"} 31
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-3",vgpu="3"} 61
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-4",vgpu="5"} 0

Memory Usage:

# Device total
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-..."} 1194

# Per-pod breakdown
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-1",vgpu="0"} 620
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-2",vgpu="8"} 108
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-3",vgpu="3"} 108
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-6a16b9c2-...",pod="gpu-workload-4",vgpu="5"} 358

MIG Instances (Time-Sharing)

Only memory metrics are available per-process (an NVML limitation: SM utilization is not exposed for MIG instances):

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_FB_USED | Per-process memory (MiB) |

The metrics provide a clear hierarchy:

| Metric Type | Labels | Description |
| --- | --- | --- |
| MIG instance total | GPU_I_ID="8" (no pod labels) | Total memory for the MIG instance |
| Per-pod within MIG | GPU_I_ID="8", pod="...", vgpu="..." | Individual pod's memory usage within that MIG instance |

Example output

MIG instance 8 shared by 3 pods:

# MIG instance total
DCGM_FI_DEV_FB_USED{GPU_I_ID="8",...} 642

# Per-pod breakdown within MIG instance 8
DCGM_FI_DEV_FB_USED{GPU_I_ID="8",pod="gpu-workload-1",vgpu="3",...} 82
DCGM_FI_DEV_FB_USED{GPU_I_ID="8",pod="gpu-workload-2",vgpu="11",...} 478
DCGM_FI_DEV_FB_USED{GPU_I_ID="8",pod="gpu-workload-3",vgpu="1",...} 82

Bug Fixes

AWS EKS MIG support: Fixed MIG device ID parsing for AWS EKS environments. The NVIDIA device plugin on EKS reports MIG device IDs with a ::N suffix (e.g., MIG-xxx::7), which previously caused the pod-to-device mapping to fail.
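
A hypothetical sketch of that normalization (the helper name and package are assumptions; the PR's actual parsing may differ):

```go
package example

import "strings"

// normalizeMIGDeviceID strips an EKS-style "::N" suffix (e.g. "MIG-xxx::7" -> "MIG-xxx")
// so the ID can be matched against the device seen by the exporter. Sketch only.
func normalizeMIGDeviceID(id string) string {
	if i := strings.Index(id, "::"); i >= 0 {
		return id[:i]
	}
	return id
}
```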


Notes

  • Device-level metrics are always emitted alongside per-pod metrics
  • Pods with no active GPU processes show 0 for both metrics

@nvvfedorov nvvfedorov requested a review from Copilot December 4, 2025 17:48

Copilot AI left a comment

Pull request overview

This PR adds MIG time-sharing support with per-process metrics collection. It enables tracking GPU utilization and memory usage on a per-pod basis for both regular GPUs and MIG devices in Kubernetes environments with time-shared GPU workloads.

Key changes:

  • Implements per-process metrics collection for GPU utilization and framebuffer memory usage
  • Adds PID-to-pod mapping using cgroup information to associate GPU processes with Kubernetes pods
  • Extends NVML provider with methods to query process-level metrics for both regular and MIG GPUs

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| internal/pkg/transformation/process_metrics.go | Core logic for collecting and organizing per-process GPU metrics |
| internal/pkg/transformation/process_metrics_test.go | Comprehensive test coverage for process metrics collection |
| internal/pkg/transformation/pidmapper.go | Linux implementation for mapping PIDs to pods via cgroup parsing |
| internal/pkg/transformation/pidmapper_stub.go | Non-Linux stub implementation of the PID mapper |
| internal/pkg/transformation/pidmapper_test.go | Tests for PID-to-pod mapping functionality |
| internal/pkg/transformation/kubernetes.go | Integration of per-process metrics into the pod mapping workflow |
| internal/pkg/transformation/kubernetes_test.go | Tests for per-process metrics integration with the Kubernetes pod mapper |
| internal/pkg/transformation/const.go | Added metric name constants for per-process metrics |
| internal/pkg/nvmlprovider/types.go | Extended NVML interface with per-process query methods |
| internal/pkg/nvmlprovider/provider.go | Implementation of per-process metrics collection using NVML |
| internal/mocks/pkg/nvmlprovider/mock_client.go | Mock implementations for the new NVML methods |
| go.mod | Added cgroups v3 dependency for PID-to-pod mapping |

Comments suppressed due to low confidence (1)

internal/pkg/transformation/kubernetes_test.go:1

  • Assertion expects 'test-pod' but the test setup at line 388 uses pod0 which has Name 'pod0'. This will cause the test to fail.

@nvvfedorov
Collaborator

@krystiancastai , Please sign your commit, as it is required by the CONTRIBUTING.md

@nvvfedorov
Collaborator

@krystiancastai, Add a description explaining what problem the PR solves or fixes.

Collaborator

@nvvfedorov nvvfedorov left a comment

@krystiancastai, Thank you for your contribution. The code in general looks good to me. However, because this is a new feature, please add a description of the feature, what problem it solves, the configuration, and any other relevant details. Also, please don't forget to sign off your commit and update the year in the license headers for the new files.

@krystiancastai krystiancastai marked this pull request as draft December 5, 2025 08:01
@krystiancastai
Author

> @krystiancastai, Add a description explaining what problem the PR solves or fixes.

@nvvfedorov Thanks for the comments! My intention was to publish this merge request in draft mode, as it's missing a description with detailed examples of what was changed, and I still see a few minor things in the code that need fixing. I'll let you know once I've made the updates.

@krystiancastai krystiancastai changed the title Add MIG time-sharing support with per-process metrics Add per-process GPU metrics for time-sharing and MIG Dec 8, 2025
Signed-off-by: Krystian Bednarczuk <krystian@cast.ai>
@krystiancastai krystiancastai force-pushed the feature/mig-time-sharing-support branch 2 times, most recently from e1102f5 to 012200b on December 8, 2025 14:00
Signed-off-by: Krystian Bednarczuk <krystian@cast.ai>
@krystiancastai krystiancastai force-pushed the feature/mig-time-sharing-support branch from 012200b to 0871d70 on December 8, 2025 16:12
@krystiancastai krystiancastai marked this pull request as ready for review December 8, 2025 16:21
@glowkey
Collaborator

glowkey commented Dec 10, 2025

Thanks for creating this PR! We are planning to test and validate this MR for our next major release in January 2026.

@krystiancastai
Author

> Thanks for creating this PR! We are planning to test and validate this MR for our next major release in January 2026.

Hey @glowkey , hope you're doing well! Have you had a chance to test this MR? Is there anything I can do to help move things forward?

@glowkey
Collaborator

glowkey commented Jan 29, 2026

Apologies that we did not have time to review and test this for the January release. We had to prioritize the GPU bind/unbind functionality. I imagine that this MR will now need to be updated and tested with that behavior. When a GPU unbind is detected, all NVML handles must be released, and nvmlInit() then needs to be called again when a GPU bind is detected. You can see this behavior in app.go/handleGpuTopologyChange(). Let us know if you have any questions.
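
For reference, a hedged sketch of the shutdown/re-init pattern described above, using the go-nvml bindings; this is not the exporter's handleGpuTopologyChange code, and the function and callback names are assumptions.

```go
package example

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// reinitNVML illustrates the unbind/bind handling: the caller drops any cached
// nvml.Device handles, NVML is shut down, and it is initialized again once the
// GPU is bound. Sketch only, not the exporter's implementation.
func reinitNVML(releaseHandles func()) error {
	releaseHandles()
	if ret := nvml.Shutdown(); ret != nvml.SUCCESS {
		return fmt.Errorf("nvml shutdown: %s", nvml.ErrorString(ret))
	}
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("nvml init: %s", nvml.ErrorString(ret))
	}
	return nil
}
```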
