Add PodMapper with informer-based caching for Kubernetes integration (#626)
jaeeyoungkim wants to merge 12 commits into NVIDIA:main
Conversation
Hi @jaeeyoungkim, thanks for the MR! We generally request that MRs of this size come with additional tests, or at least that the existing tests don't break. Take a look at 'make test-main' and 'tests/e2e/README.md' for more information.
Force-pushed from 1ce02b1 to b6ca8c6
Hi @glowkey, I have successfully completed the stress test on a real hardware environment. The results clearly demonstrate that this optimization is critical for scalability, especially in MIG environments.

1. Test Environment
2. Performance Comparison
3. Conclusion & Impact

With the original image, the exporter frequently hung and consumed excessive CPU resources due to the synchronous API calls per pod. I believe this structural change resolves the root causes of the following reported issues:
Next Step
…mproving reliability on multi-GPU nodes.
Hi @glowkey, I have successfully completed the automated E2E tests.

1. E2E Test Summary
2. Updates to E2E Test Suite

I included a few improvements in the E2E test suite. With both the hardware stress test and the automated E2E suite passing, I believe this PR is ready for review.
PR: PodMapper Performance Optimization (Eliminating Repetitive API Calls via SharedInformer)
1. Problem Analysis
The existing PodMapper implementation had structural inefficiencies that caused significant performance degradation and high load on the Kubernetes API server during every metric collection cycle (scrape).

A. The Repetitive API Call Problem
On every call to the Process() function, the toDeviceToPod method iterated through all discovered pods. For each pod, it called getPodMetadata, which executed a synchronous API request: p.Client.CoreV1().Pods(...).Get(...).

B. Ineffective Short-term Caching
The metadataCache was defined as a local variable within the function scope. It was created at the start of a scrape and destroyed immediately after, so nothing was ever reused across scrapes.

C. Blocking I/O & Reliability Risks
These synchronous, blocking API calls sat on the critical path of dcgm-exporter, potentially causing timeouts in Prometheus scrapes.

2. Proposed Solution
We introduced the Kubernetes SharedInformer pattern to fundamentally resolve these issues by decoupling data retrieval from data access.
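The decoupling described above can be sketched with a small, dependency-free model. The real implementation uses client-go's SharedInformerFactory and a PodLister; in this sketch a plain map guarded by a sync.RWMutex stands in for the informer's Store, and a ticker-driven goroutine stands in for the background sync. Field names such as deviceToPod mirror the PR, but the types and the `list` callback are illustrative assumptions, not the actual code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PodMapper caches the device-to-pod mapping so that Process never has to
// call the API server. Simplified model: device ID -> pod name.
type PodMapper struct {
	mu          sync.RWMutex
	deviceToPod map[string]string
	stopCh      chan struct{}
}

func NewPodMapper() *PodMapper {
	return &PodMapper{
		deviceToPod: make(map[string]string),
		stopCh:      make(chan struct{}),
	}
}

// Run starts the background sync loop, analogous to starting the informer
// and periodically recomputing the mapping from its local store. The list
// callback stands in for reading pods from that store.
func (p *PodMapper) Run(list func() map[string]string, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			p.sync(list())
			select {
			case <-p.stopCh:
				return
			case <-ticker.C:
			}
		}
	}()
}

// Stop terminates the background sync loop.
func (p *PodMapper) Stop() { close(p.stopCh) }

// sync replaces the cached map under the write lock.
func (p *PodMapper) sync(m map[string]string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.deviceToPod = m
}

// Process reads the pre-computed map under a read lock: no network I/O,
// no per-pod API calls, just a memory access.
func (p *PodMapper) Process(device string) (string, bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	pod, ok := p.deviceToPod[device]
	return pod, ok
}

func main() {
	p := NewPodMapper()
	list := func() map[string]string { return map[string]string{"gpu-0": "train-job-abc"} }
	p.Run(list, 10*time.Millisecond)
	time.Sleep(50 * time.Millisecond) // allow the first sync to run
	pod, ok := p.Process("gpu-0")
	fmt.Println(pod, ok)
	p.Stop()
}
```

The key property is that scrape-path reads (Process) and API-server traffic (the sync loop) are fully decoupled: slow or failing API calls can delay cache freshness but can no longer block a Prometheus scrape.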
A. SharedInformer & Lister
- The Informer maintains a local in-memory cache (Store) that is kept up-to-date by watching for real-time events (Watch) from the API server.
- Pod lookups now hit this cache through the Lister (podLister.Get), reducing access time from milliseconds (network I/O) to nanoseconds (memory access).

B. Background Synchronization
A background loop keeps the deviceToPod map up to date, so the Process() function now simply acquires a read lock (RLock) and reads the pre-computed map.

C. Node-Level Filtering
The Informer watches only the pods scheduled on the local node: it reads the NODE_NAME environment variable to create a FieldSelector.
- internal/pkg/transformation/types.go: Added SharedInformerFactory, PodLister, and RWMutex to manage the cache and concurrency.
- internal/pkg/transformation/kubernetes.go:
  - NewPodMapper: Initializes the Informer with node filtering.
  - Run/Stop: Manages the lifecycle of the Informer and the background sync loop.
  - createPodInfo: Replaced p.Client.Get (API call) with p.podLister.Get (cache lookup).
  - Process: Refactored to read from the thread-safe deviceToPod cache instead of re-computing mappings.
- internal/pkg/server/server.go: Integrated PodMapper into the server's lifecycle to ensure the Informer starts and stops with the application.
- dcgm-exporter.yaml: Added NODE_NAME to the container environment variables using the Kubernetes Downward API (spec.nodeName).

4. Improvements