Add per-process GPU metrics for time-sharing and MIG #594
krystiancastai wants to merge 2 commits into NVIDIA:main
Conversation
Pull request overview
This PR adds MIG time-sharing support with per-process metrics collection. It enables tracking GPU utilization and memory usage on a per-pod basis for both regular GPUs and MIG devices in Kubernetes environments with time-shared GPU workloads.
Key changes:
- Implements per-process metrics collection for GPU utilization and framebuffer memory usage
- Adds PID-to-pod mapping using cgroup information to associate GPU processes with Kubernetes pods (a sketch of the idea follows this list)
- Extends NVML provider with methods to query process-level metrics for both regular and MIG GPUs
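A minimal sketch of the cgroup-based idea, assuming kubelet-managed cgroup paths that embed the pod UID; the helper and regex below are illustrative only (the PR itself pulls in the cgroups v3 library for this):

```go
// Sketch: derive a pod UID for a PID by parsing /proc/<pid>/cgroup.
// Illustrative only; not the PR's pidmapper implementation.
package pidmap

import (
	"fmt"
	"os"
	"regexp"
	"strings"
)

// Kubelet-managed cgroup paths usually embed the pod UID, e.g.
// .../kubepods-burstable-pod1234abcd_5678_90ef_....slice/... (systemd driver)
// or .../kubepods/burstable/pod1234abcd-5678-90ef-.../... (cgroupfs driver).
var podUIDRe = regexp.MustCompile(`pod([0-9a-fA-F]{8}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{12})`)

// PodUIDForPID returns the pod UID owning the given PID, or an error if none is found.
func PodUIDForPID(pid uint32) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	if m := podUIDRe.FindStringSubmatch(string(data)); m != nil {
		// The systemd cgroup driver uses "_" in place of "-"; normalize to the UID form.
		return strings.ReplaceAll(m[1], "_", "-"), nil
	}
	return "", fmt.Errorf("no pod UID found in cgroup for pid %d", pid)
}
```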
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| internal/pkg/transformation/process_metrics.go | Core logic for collecting and organizing per-process GPU metrics |
| internal/pkg/transformation/process_metrics_test.go | Comprehensive test coverage for process metrics collection |
| internal/pkg/transformation/pidmapper.go | Linux implementation for mapping PIDs to pods via cgroup parsing |
| internal/pkg/transformation/pidmapper_stub.go | Non-Linux stub implementation of PID mapper |
| internal/pkg/transformation/pidmapper_test.go | Tests for PID-to-pod mapping functionality |
| internal/pkg/transformation/kubernetes.go | Integration of per-process metrics into pod mapping workflow |
| internal/pkg/transformation/kubernetes_test.go | Tests for per-process metrics integration with Kubernetes pod mapper |
| internal/pkg/transformation/const.go | Added metric name constants for per-process metrics |
| internal/pkg/nvmlprovider/types.go | Extended NVML interface with per-process query methods |
| internal/pkg/nvmlprovider/provider.go | Implementation of per-process metrics collection using NVML |
| internal/mocks/pkg/nvmlprovider/mock_client.go | Mock implementations for new NVML methods |
| go.mod | Added cgroups v3 dependency for PID-to-pod mapping |
Comments suppressed due to low confidence (1)
internal/pkg/transformation/kubernetes_test.go:1
- Assertion expects 'test-pod' but the test setup at line 388 uses pod0 which has Name 'pod0'. This will cause the test to fail.
@krystiancastai, please sign your commit, as required by CONTRIBUTING.md.
@krystiancastai, add a description explaining what problem the PR solves or fixes.
@krystiancastai, thank you for your contribution. The code in general looks good to me. However, because this is a new feature, please add a description of the feature, what problem it solves, the configuration, and any other relevant details. Also, please don't forget to sign off your commit and update the year in the license headers for new files.
@nvvfedorov Thanks for the comments! My intention was to publish this merge request in draft mode, as it's missing a description with detailed examples of what was changed, and I still see a few minor things in the code that need fixing. I'll let you know once I've made the updates.
Signed-off-by: Krystian Bednarczuk <krystian@cast.ai>
Force-pushed e1102f5 to 012200b
Signed-off-by: Krystian Bednarczuk <krystian@cast.ai>
Force-pushed 012200b to 0871d70
Thanks for creating this PR! We are planning to test and validate this MR for our next major release in January 2026.
Hey @glowkey, hope you're doing well! Have you had a chance to test this MR? Is there anything I can do to help move things forward?
Apologies that we did not have time to review and test this for the January release. We had to prioritize the GPU bind/unbind functionality. I imagine that this MR will now need to be updated and tested with that behavior. When a GPU unbind is detected, all NVML handles must be released, and nvmlInit() needs to be called again when a GPU bind is detected. You can see this behavior in app.go/handleGpuTopologyChange(). Let us know if you have any questions.
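A minimal sketch of that release/re-init pattern with the go-nvml bindings; `onGpuTopologyChange` and its `bound` flag are hypothetical, and the real handling lives in app.go/handleGpuTopologyChange():

```go
// Sketch: release NVML on GPU unbind and re-initialize it on bind.
// Assumes the go-nvml bindings; the hook and its flag are illustrative only.
package topology

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// onGpuTopologyChange is a hypothetical hook invoked when a bind/unbind event is detected.
func onGpuTopologyChange(bound bool) {
	if !bound {
		// GPU unbind: drop all NVML handles by shutting the library down.
		if ret := nvml.Shutdown(); ret != nvml.SUCCESS {
			log.Printf("nvml shutdown failed: %s", nvml.ErrorString(ret))
		}
		return
	}
	// GPU bind: re-initialize NVML before acquiring new device handles.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Printf("nvml init failed: %s", nvml.ErrorString(ret))
	}
}
```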
Add per-process GPU metrics for time-sharing and MIG environments
Problem
When using GPU time-sharing (multiple pods sharing a single GPU or MIG instance), DCGM Exporter metrics were inaccurate:
Regular GPU time-sharing:
Device-level metrics were duplicated to each pod, making it impossible to determine actual per-pod resource consumption:
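For illustration only (pod names and values below are hypothetical, not taken from the PR), two pods sharing one GPU would both be reported with the same device-level value:

```
DCGM_FI_DEV_GPU_UTIL{gpu="0", namespace="default", pod="pod-a"} 80
DCGM_FI_DEV_GPU_UTIL{gpu="0", namespace="default", pod="pod-b"} 80
```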
MIG + time-sharing:
Only one arbitrary pod was reported per MIG instance; the other pods sharing the same MIG instance were missing from the metrics entirely.
Solution
This PR adds per-process GPU metrics collection using NVML APIs, enabling per-pod metrics in time-sharing scenarios:
- `nvmlDeviceGetComputeRunningProcesses` to get per-process memory usage
- `nvmlDeviceGetProcessUtilization` to get per-process SM utilization (regular GPUs only); a usage sketch follows this list
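As a minimal sketch (not the PR's actual implementation), here is how these two calls can be combined per device using the go-nvml bindings; `collectPerProcess`, `processSample`, and the timestamp handling are illustrative assumptions:

```go
// Sketch: query per-process framebuffer usage and SM utilization for one device.
// Uses github.com/NVIDIA/go-nvml/pkg/nvml; illustrative only.
package perprocess

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

type processSample struct {
	pid        uint32
	usedMemory uint64 // framebuffer bytes in use by the process
	smUtil     uint32 // SM utilization percent (not available on MIG)
}

func collectPerProcess(dev nvml.Device, lastSeenTs uint64) ([]processSample, error) {
	// Per-process framebuffer usage; works for regular GPUs and MIG instances.
	procs, ret := dev.GetComputeRunningProcesses()
	if ret != nvml.SUCCESS {
		return nil, fmt.Errorf("GetComputeRunningProcesses: %s", nvml.ErrorString(ret))
	}
	samples := make(map[uint32]*processSample, len(procs))
	for _, p := range procs {
		samples[p.Pid] = &processSample{pid: p.Pid, usedMemory: p.UsedGpuMemory}
	}

	// Per-process SM utilization since lastSeenTs; regular GPUs only, so a
	// failure here (e.g. on MIG instances) is tolerated.
	if utils, ret := dev.GetProcessUtilization(lastSeenTs); ret == nvml.SUCCESS {
		for _, u := range utils {
			if s, ok := samples[u.Pid]; ok {
				s.smUtil = u.SmUtil
			}
		}
	}

	out := make([]processSample, 0, len(samples))
	for _, s := range samples {
		out = append(out, *s)
	}
	return out, nil
}
```

In this sketch, the caller would persist `lastSeenTs` between scrapes so each utilization query only covers the interval since the previous one.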
Configuration
The changes in this PR only affect the code path when the `--kubernetes-virtual-gpus` flag is enabled (or the `KUBERNETES_VIRTUAL_GPUS=true` environment variable is set). When this flag is disabled, the existing behavior remains unchanged.
Required environment variables:
- `KUBERNETES_VIRTUAL_GPUS=true`
- `DCGM_EXPORTER_KUBERNETES=true`
- `DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID=true`
- `DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE`: must match your cloud provider (`device-name` or `uid`)
Required pod spec settings:
- `hostPID=true`
- `securityContext.privileged=true`
Per-Process Metrics by GPU Type
Regular GPUs (Time-Sharing)
Both GPU utilization and memory metrics are available per-process:
- `DCGM_FI_DEV_GPU_UTIL`
- `DCGM_FI_DEV_FB_USED`
The metrics provide a clear hierarchy:
pod="...", vgpu="..."Example output
GPU Utilization:
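Illustrative lines only; pod names, values, and the exact label set are hypothetical and depend on your configuration:

```
# Device-level utilization (no pod labels)
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 75
# Per-process utilization attributed to each pod sharing the GPU
DCGM_FI_DEV_GPU_UTIL{gpu="0", namespace="default", pod="pod-a", vgpu="0"} 50
DCGM_FI_DEV_GPU_UTIL{gpu="0", namespace="default", pod="pod-b", vgpu="1"} 25
```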
Memory Usage:
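Again illustrative only (hypothetical pods and values, memory in MiB):

```
# Device-level framebuffer usage (no pod labels)
DCGM_FI_DEV_FB_USED{gpu="0"} 12000
# Per-process framebuffer usage per pod
DCGM_FI_DEV_FB_USED{gpu="0", namespace="default", pod="pod-a", vgpu="0"} 8000
DCGM_FI_DEV_FB_USED{gpu="0", namespace="default", pod="pod-b", vgpu="1"} 4000
```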
MIG Instances (Time-Sharing)
Only memory metrics are available per-process (an NVML limitation: SM utilization is not available for MIG devices):
- `DCGM_FI_DEV_FB_USED`
The metrics provide a clear hierarchy:
GPU_I_ID="8"(no pod labels)GPU_I_ID="8", pod="...", vgpu="..."Example output
MIG instance 8 with 3 pods sharing:
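Illustrative lines only (hypothetical pods and values):

```
# MIG-instance-level framebuffer usage (no pod labels)
DCGM_FI_DEV_FB_USED{gpu="0", GPU_I_ID="8"} 9000
# Per-process framebuffer usage for the three pods sharing instance 8
DCGM_FI_DEV_FB_USED{gpu="0", GPU_I_ID="8", namespace="default", pod="pod-a", vgpu="0"} 4000
DCGM_FI_DEV_FB_USED{gpu="0", GPU_I_ID="8", namespace="default", pod="pod-b", vgpu="1"} 3000
DCGM_FI_DEV_FB_USED{gpu="0", GPU_I_ID="8", namespace="default", pod="pod-c", vgpu="2"} 2000
```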
Bug Fixes
AWS EKS MIG support: Fixed MIG device ID parsing for AWS EKS environments. The NVIDIA device plugin on EKS reports MIG device IDs with a `::N` suffix (e.g., `MIG-xxx::7`), which previously caused pod-to-device mapping to fail.
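A minimal sketch of the kind of normalization this implies; `normalizeMIGDeviceID` is a hypothetical helper, not the PR's actual parsing code:

```go
// Sketch: normalize an EKS-style MIG device ID such as "MIG-xxx::7" to "MIG-xxx".
package miginfo

import "strings"

// normalizeMIGDeviceID strips the "::<replica>" suffix that the NVIDIA device
// plugin on EKS appends, so the ID matches what NVML reports.
func normalizeMIGDeviceID(id string) string {
	if i := strings.Index(id, "::"); i >= 0 {
		return id[:i]
	}
	return id
}
```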
Notes
- `0` for both metrics