
Add PodMapper with informer-based caching for Kubernetes integration. #626

Open

jaeeyoungkim wants to merge 12 commits into NVIDIA:main from whatap:main

Conversation

@jaeeyoungkim

PR: PodMapper Performance Optimization (Eliminating Repetitive API Calls via SharedInformer)

1. Problem Analysis

The existing PodMapper implementation had structural inefficiencies that caused significant performance degradation and high load on the Kubernetes API server during every metric collection cycle (Scrape).

A. The Repetitive API Call Problem

  • Mechanism: Inside the Process() function, the toDeviceToPod method iterated over all discovered pods and, for each one, called getPodMetadata, which executed a synchronous API request: p.Client.CoreV1().Pods(...).Get(...) (see the sketch below).
  • Impact: If there are 50 pods on a node, the exporter makes 50 sequential synchronous HTTP requests to the API server. This causes the response time to increase linearly with the number of pods.
  • Consequence: High latency in metric collection and unnecessary traffic flooding the API server.
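
For reference, a simplified sketch of that per-pod call pattern; the type and method names are approximations, not the exact code in the repository:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podMapper is a minimal stand-in for the exporter's PodMapper; only the
// API client field is shown here.
type podMapper struct {
	Client kubernetes.Interface
}

// getPodMetadata mirrors the old pattern: one blocking round trip to the
// API server for every pod, on every scrape.
func (p *podMapper) getPodMetadata(ctx context.Context, namespace, name string) (*corev1.Pod, error) {
	return p.Client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
}
```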

B. Ineffective Short-term Caching

  • Mechanism: The previous metadataCache was defined as a local variable within the function scope. It was created at the start of a scrape and destroyed immediately after.
  • Impact: While it prevented duplicate calls for the same pod within a single request, it failed to persist data across different scrape intervals (e.g., every 15 seconds).
  • Consequence: The exporter had "amnesia," forcing it to re-fetch static metadata (like UID and Labels) from the API server every single time.

C. Blocking I/O & Reliability Risks

  • Mechanism: All API calls were synchronous.
  • Impact: Any latency or downtime in the Kubernetes API server would directly block the dcgm-exporter, potentially causing timeouts in Prometheus scrapes.

2. Proposed Solution

We introduced the Kubernetes SharedInformer pattern to fundamentally resolve these issues by decoupling data retrieval from data access.

A. SharedInformer & Lister

  • Change: Instead of querying the API server on demand, we now maintain a local memory cache (Store) that is kept up-to-date by watching for real-time events (Watch) from the API server.
  • Benefit: Pod metadata lookup is now a memory operation (podLister.Get), reducing access time from milliseconds (network I/O) to nanoseconds (memory access).
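
A minimal sketch of the informer-backed lookup described above, assuming the standard client-go informer and lister packages (identifiers here are illustrative, not the exact names used in this PR):

```go
package sketch

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	listerscorev1 "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// newPodLister builds a shared informer factory, waits for the initial
// cache sync, and returns a lister backed by the local in-memory store.
func newPodLister(client kubernetes.Interface, stopCh <-chan struct{}) (listerscorev1.PodLister, error) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)

	// Requesting the informer before Start registers it with the factory.
	podInformer := factory.Core().V1().Pods().Informer()
	podLister := factory.Core().V1().Pods().Lister()

	factory.Start(stopCh)
	if !cache.WaitForCacheSync(stopCh, podInformer.HasSynced) {
		return nil, fmt.Errorf("timed out waiting for pod informer cache to sync")
	}
	return podLister, nil
}

// lookupPod reads from the local store; no network round trip is involved.
func lookupPod(lister listerscorev1.PodLister, namespace, name string) (*corev1.Pod, error) {
	return lister.Pods(namespace).Get(name)
}
```

After the initial List/Watch, the store is kept current by watch events, so subsequent lookups never touch the API server.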

B. Background Synchronization

  • Change: The heavy lifting of mapping devices to pods is moved to a background goroutine. The Process() function now simply acquires a read lock (RLock) and reads the pre-computed map.
  • Benefit: The scrape response time is now constant (O(1)), regardless of the number of pods.
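
The locking pattern can be sketched as follows; PodInfo, the sync interval, and the recomputation callback are placeholders, and only the RWMutex read/write split reflects the change described here:

```go
package sketch

import (
	"sync"
	"time"
)

// PodInfo stands in for the exporter's pod metadata type.
type PodInfo struct {
	Name, Namespace, Container string
}

type podCache struct {
	mu          sync.RWMutex
	deviceToPod map[string]PodInfo
}

// runSync recomputes the device-to-pod map in a background goroutine and
// swaps it in under a write lock.
func (c *podCache) runSync(stopCh <-chan struct{}, compute func() map[string]PodInfo) {
	ticker := time.NewTicker(10 * time.Second) // interval is arbitrary for this sketch
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			m := compute()
			c.mu.Lock()
			c.deviceToPod = m
			c.mu.Unlock()
		}
	}
}

// lookup is what Process() effectively does now: a read under RLock,
// independent of the number of pods on the node.
func (c *podCache) lookup(device string) (PodInfo, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	info, ok := c.deviceToPod[device]
	return info, ok
}
```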

C. Node-Level Filtering

  • Change: We utilize the NODE_NAME environment variable to create a FieldSelector.
  • Benefit: The Informer only watches pods on the specific node where the exporter is running, minimizing memory footprint and network usage.
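
A sketch of how such node scoping is typically wired with client-go options; the exact option plumbing in this PR may differ:

```go
package sketch

import (
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newNodeScopedFactory restricts the pod informer to pods scheduled on the
// local node, identified via the NODE_NAME environment variable.
func newNodeScopedFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	nodeName := os.Getenv("NODE_NAME") // injected through the Downward API (spec.nodeName)
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Second,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			// Only list/watch pods whose spec.nodeName matches this node.
			opts.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", nodeName).String()
		}),
	)
}
```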

3. Key Changes

internal/pkg/transformation/types.go

  • Added fields for SharedInformerFactory, PodLister, and RWMutex to manage the cache and concurrency.
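
Roughly, the added fields look like the following; the field names are approximations and the real struct in types.go carries additional existing members:

```go
package sketch

import (
	"sync"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	listerscorev1 "k8s.io/client-go/listers/core/v1"
)

// PodInfo stands in for the exporter's existing pod metadata type.
type PodInfo struct {
	Name, Namespace, Container string
}

// PodMapper (abridged): only the cache- and concurrency-related fields
// described in this PR are shown.
type PodMapper struct {
	Client          kubernetes.Interface
	informerFactory informers.SharedInformerFactory
	podLister       listerscorev1.PodLister

	mu          sync.RWMutex       // guards deviceToPod
	deviceToPod map[string]PodInfo // pre-computed mapping read by Process()
}
```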

internal/pkg/transformation/kubernetes.go

  • NewPodMapper: Initializes the Informer with node filtering.
  • Run / Stop: Manages the lifecycle of the Informer and the background sync loop (a lifecycle sketch follows after this list).
  • createPodInfo: Replaced p.Client.Get (API Call) with p.podLister.Get (Cache Lookup).
  • Process: Refactored to read from the thread-safe deviceToPod cache instead of re-computing mappings.
  • Note: Disabled the process-based mapping correction block (approx. lines 810-870) due to missing dependencies, ensuring a stable build.
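
A lifecycle sketch corresponding to the Run/Stop description above (illustrative; the helper names and the sync loop body are placeholders):

```go
package sketch

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// podMapper carries only what this lifecycle sketch needs.
type podMapper struct {
	informerFactory informers.SharedInformerFactory
	stopCh          chan struct{}
}

// Run starts the informer, waits for the initial cache sync, and launches
// the background device-to-pod sync loop.
func (p *podMapper) Run() error {
	p.stopCh = make(chan struct{})

	// Requesting the informer before Start ensures the factory runs it.
	podInformer := p.informerFactory.Core().V1().Pods().Informer()
	p.informerFactory.Start(p.stopCh)

	if !cache.WaitForCacheSync(p.stopCh, podInformer.HasSynced) {
		return fmt.Errorf("timed out waiting for pod informer cache to sync")
	}
	go p.syncLoop(p.stopCh)
	return nil
}

// Stop terminates the informer and the background loop.
func (p *podMapper) Stop() {
	if p.stopCh != nil {
		close(p.stopCh)
	}
}

// syncLoop is a placeholder for the background recomputation goroutine.
func (p *podMapper) syncLoop(stopCh <-chan struct{}) { <-stopCh }
```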

internal/pkg/server/server.go

  • Integrated PodMapper into the server's lifecycle to ensure the Informer starts and stops with the application.

dcgm-exporter.yaml

  • Added NODE_NAME to the container environment variables using the Kubernetes Downward API (spec.nodeName).

4. Improvements

  • Zero API Load: Eliminates API calls during the scrape cycle (except for the initial sync and minimal watch events).
  • Performance: Drastically reduced scrape duration; performance is now stable and predictable.
  • Reliability: Metric collection continues seamlessly even if the API server becomes temporarily unavailable, serving data from the local cache.

@glowkey
Collaborator

glowkey commented Feb 3, 2026

Hi @jaeeyoungkim, thanks for the MR! We generally request that MRs of this size come with additional tests, or at least that the existing tests don't break. Take a look at 'make test-main' and 'tests/e2e/README.md' for more information.

@jaeeyoungkim force-pushed the main branch 2 times, most recently from 1ce02b1 to b6ca8c6 on February 5, 2026 at 10:36
@jaeeyoungkim
Author

jaeeyoungkim commented Feb 5, 2026

Hi @glowkey,

I have completed the stress test on real hardware. The results clearly demonstrate that this optimization is critical for scalability, especially in MIG environments.

1. Test Environment

  • Hardware: 8x NVIDIA A100 GPUs
  • MIG Configuration: 56 slices (configured as 1g.5gb profile)
  • Test Image: public.ecr.aws/whatap/dcgm-exporter:4.5.1-4.8.0-pr-test-ubuntu22.04
  • Workload: Deployed 40+ GPU Pods simultaneously to simulate high concurrency.

2. Performance Comparison

| Metric | Original Implementation | Optimized Version (PR #626) |
| --- | --- | --- |
| CPU Usage | > 10 Cores (High Load) | 2-3 Cores (Stable) |
| Stability | Frequent Hangs & Scrape Timeouts | No Hangs, Consistent Scrape Times |
| API Load | O(N) calls causing latency | O(1) Cache lookup (Zero API load during scrape) |

3. Conclusion & Impact

With the original image, the exporter frequently hung and consumed excessive CPU resources due to the synchronous API calls per pod.
The optimized version using SharedInformer maintained stability even under heavy load.

I believe this structural change resolves the root causes of the following reported issues:

Next Step

  • I am currently running the automated E2E tests (tests/e2e/README.md) as requested. I will post the final confirmation once they are complete.

@jaeeyoungkim
Author

jaeeyoungkim commented Feb 6, 2026

Hi @glowkey,

I have successfully completed the automated E2E tests (tests/e2e/README.md) using the optimized image.
All 22 specs passed without any failures.

1. E2E Test Summary

  • Image: public.ecr.aws/whatap/dcgm-exporter:4.5.1-4.8.0-pr-test-6-ubuntu22.04
  • Result: SUCCESS! -- 22 Passed | 0 Failed | 0 Pending | 0 Skipped
  • Duration: ~294 seconds
  • Log File: Please see the attached file (e2e_test_result.log) for the full execution logs.

2. Updates to E2E Test Suite

I included a few improvements in tests/e2e/e2e_suite_test.go to ensure the tests are robust across different environments (e.g., Multi-GPU, MIG).

With both the hardware stress test and the automated E2E suite passing, I believe this PR is ready for review.
Please let me know if there is anything else needed.

e2e_test_result.log
