[CONTP-1198] Extract GPU device id from k8s container runtime #45152
Conversation
Static quality checks ✅
Please find below the results from static quality gates.
Successful checks
Info: 16 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: ff0abbe
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -0.87 | [-3.84, +2.09] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | otlp_ingest_logs | memory utilization | +0.85 | [+0.74, +0.96] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.46 | [+0.40, +0.52] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.37 | [-1.10, +1.84] | 1 | Logs, bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.18 | [+0.11, +0.26] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.18 | [-0.04, +0.40] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.10 | [-0.13, +0.33] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.05 | [-0.34, +0.44] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.04 | [-0.37, +0.45] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.03 | [-0.04, +0.10] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.03 | [-0.02, +0.07] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.08, +0.09] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.14, +0.12] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.02 | [-0.15, +0.11] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.03 | [-0.19, +0.13] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.04 | [-0.53, +0.45] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.11 | [-0.16, -0.07] | 1 | Logs, bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | -0.17 | [-0.40, +0.05] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.19 | [-0.23, -0.15] | 1 | Logs, bounds checks dashboard |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.22 | [-0.28, -0.17] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.58 | [-0.64, -0.52] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.70 | [-0.85, -0.55] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -0.87 | [-3.84, +2.09] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -0.96 | [-1.17, -0.74] | 1 | Logs, bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
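Read as code, the three criteria above reduce to a small predicate. The following is a minimal sketch, not the Regression Detector's actual implementation; `isRegression`, the struct, and its field names are all hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// experimentResult holds the summary statistics for one experiment.
// All names are hypothetical, chosen to mirror the table columns above.
type experimentResult struct {
	deltaMeanPct  float64 // estimated Δ mean %
	ciLow, ciHigh float64 // 90.00% confidence interval on Δ mean %
	markedErratic bool    // whether the experiment's configuration marks it "erratic"
}

// isRegression applies the three criteria: the effect size meets the
// tolerance, the confidence interval excludes zero, and the experiment
// is not marked erratic.
func isRegression(r experimentResult, tolerancePct float64) bool {
	bigEnough := math.Abs(r.deltaMeanPct) >= tolerancePct
	ciExcludesZero := r.ciLow > 0 || r.ciHigh < 0
	return bigEnough && ciExcludesZero && !r.markedErratic
}

func main() {
	// docker_containers_cpu from the table: Δ mean % = -0.87, CI [-3.84, +2.09].
	r := experimentResult{deltaMeanPct: -0.87, ciLow: -3.84, ciHigh: 2.09}
	fmt.Println(isRegression(r, 5.0)) // false: effect below tolerance, CI contains zero
}
```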
Replicate Execution Details
We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.
Note: In the tables below we bucket failures by experiment, variant, and failure type. For each bucket we list the replicate indexes that failed, with an annotation signifying how many times each replicate failed with the given failure mode. In the example below, the baseline variant of the experiment named experiment_with_failures had two replicates that failed by OOM kills: replicate 0 failed 8 executions and replicate 1 failed 6 executions, all with the same failure mode.
| Experiment | Variant | Replicates | Failure | Logs | Debug Dashboard |
|---|---|---|---|---|---|
| experiment_with_failures | baseline | 0 (x8) 1 (x6) | Oom killed | | Debug Dashboard |
The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.
❌ Retried Profiling Replicate Execution Failures (target internal profiling)
Note: Profiling replicas may still be executing. See the debug dashboard for up-to-date status.
| Experiment | Variant | Replicates | Failure | Debug Dashboard |
|---|---|---|---|---|
| quality_gate_idle_all_features | baseline | 11 (x3) | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | comparison | 11 (x3) | Oom killed | Debug Dashboard |
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
What does this PR do?
Enables GPU device extraction from container runtime configuration (the `NVIDIA_VISIBLE_DEVICES` environment variable) for Kubernetes workloads, with UUID validation to detect and handle user overrides.

Changes
- **Add GPU utility functions** (`comp/core/workloadmeta/collectors/util/gpu_util.go`); see the sketch after this list:
  - `ExtractGPUDeviceIDsFromEnvMap()` - extract GPU IDs from an env var map (containerd)
  - `ExtractGPUDeviceIDsFromEnvVars()` - extract GPU IDs from an env var slice (docker)
  - `IsGPUUUID()` - validate the NVIDIA GPU/MIG UUID format
  - `ShouldExtractGPUDeviceIDsFromConfig()` - environment detection (ECS/K8s only)
- **Update containerd collector** (`comp/core/workloadmeta/collectors/internal/containerd/container_builder.go`) to populate `GPUDeviceIDs` from container spec env vars
- **Refactor docker collector** (`comp/core/workloadmeta/collectors/internal/docker/docker.go`) to use the shared util extract function
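As a rough illustration of how the map-based helper could work -- a minimal sketch assuming comma-separated device IDs, ignoring the ECS/K8s split handled by `ShouldExtractGPUDeviceIDsFromConfig()`, and reusing `IsGPUUUID()` from the validation sketch further below; the real code in `gpu_util.go` may differ:

```go
// Sketch only: the real helper lives in
// comp/core/workloadmeta/collectors/util/gpu_util.go and may differ.
package util

import "strings"

// ExtractGPUDeviceIDsFromEnvMap pulls GPU device IDs out of a container's
// env var map (the shape the containerd collector sees). It returns nil
// unless every entry is a valid NVIDIA GPU/MIG UUID, so overrides such as
// "all" or bare indices fall back to the PodResources API.
func ExtractGPUDeviceIDsFromEnvMap(env map[string]string) []string {
	raw, ok := env["NVIDIA_VISIBLE_DEVICES"]
	if !ok || raw == "" {
		return nil
	}
	ids := strings.Split(raw, ",")
	for i := range ids {
		ids[i] = strings.TrimSpace(ids[i])
		if !IsGPUUUID(ids[i]) { // defined in the validation sketch below
			return nil // non-canonical value: defer to the PodResources API
		}
	}
	return ids
}
```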
Motivation
Background: How GPU device mapping works

In Kubernetes, the NVIDIA device plugin handles GPU allocation:
- The pod requests the `nvidia.com/gpu` resource.
- The device plugin's `Allocate()` API selects GPU(s) and returns UUID(s).
- The runtime sets `NVIDIA_VISIBLE_DEVICES=GPU-uuid` at container runtime (not in the pod spec).

Why this change
- `NVIDIA_VISIBLE_DEVICES` is what the NVIDIA container runtime actually uses to determine GPU visibility.
- Users can set `NVIDIA_VISIBLE_DEVICES` in their pod spec with values like `all`, `0`, or `none`. In those non-canonical cases, the agent validates the value and falls back to the PodResources API.

GPU UUID Validation
In Kubernetes, the NVIDIA device plugin sets `NVIDIA_VISIBLE_DEVICES` to GPU UUIDs. However, users can override this in their pod specs. The UUID validation detects these overrides:

Accepted UUID formats:
- `GPU-aec058b1-c18e-236e-c14d-49d2990fda0f`
- `MIG-aec058b1-c18e-236e-c14d-49d2990fda0f`
- `MIG-GPU-aec058b1-.../0/0`

Rejected override values:
- `all`
- `none`, `void`
- `0`, `1`, `0,1`

Note: ECS does not validate UUIDs because users cannot override env vars set by the ECS agent. A sketch of the validation follows.
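A minimal sketch of the check, assuming a regexp-based implementation that matches the accepted shapes above and nothing else (the real `IsGPUUUID()` may be implemented differently):

```go
package util

import "regexp"

// Sketch only: the real IsGPUUUID() in gpu_util.go may use other patterns.
var (
	// GPU-aec058b1-c18e-236e-c14d-49d2990fda0f
	gpuUUIDRe = regexp.MustCompile(`^GPU-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$`)
	// MIG-aec058b1-c18e-236e-c14d-49d2990fda0f
	migUUIDRe = regexp.MustCompile(`^MIG-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$`)
	// Legacy MIG form: MIG-GPU-<gpu-uuid>/<gpu instance>/<compute instance>
	migLegacyRe = regexp.MustCompile(`^MIG-GPU-[0-9a-f-]+/\d+/\d+$`)
)

// IsGPUUUID reports whether s looks like an NVIDIA GPU or MIG UUID.
// Override values such as "all", "none", "void", or bare indices
// ("0", "1", "0,1") match none of the patterns, so it returns false.
func IsGPUUUID(s string) bool {
	return gpuUUIDRe.MatchString(s) ||
		migUUIDRe.MatchString(s) ||
		migLegacyRe.MatchString(s)
}
```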
GPU Discovery Priority
1. `GPUDeviceIDs` (runtime)
2. Process environment (`/proc/PID/environ`)
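A sketch of that ordering with a hypothetical helper (`resolveGPUDeviceIDs` is illustrative, not a function from this PR):

```go
// Hypothetical helper showing the priority order; it belongs with the
// sketches above and is not code from this PR. The runtime-populated
// field wins; /proc/<pid>/environ is only consulted when it is nil.
func resolveGPUDeviceIDs(runtimeIDs []string, procEnviron map[string]string) []string {
	if len(runtimeIDs) > 0 {
		return runtimeIDs // 1. GPUDeviceIDs extracted from the runtime config
	}
	return ExtractGPUDeviceIDsFromEnvMap(procEnviron) // 2. /proc/PID/environ
}
```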
Testing

Test Environment: EKS (Kubernetes + containerd)
Setup:
Test Case 1: Normal GPU pod
Verification - Agent workload-list:
Verification - Agent logs:
Result: GPU device extracted from container runtime config (`NVIDIA_VISIBLE_DEVICES` in the containerd spec). The NVIDIA device plugin sets this env var via the `Allocate()` API.

Test Case 2: User override with `NVIDIA_VISIBLE_DEVICES=all`
NVIDIA_VISIBLE_DEVICES=allVerification - Agent workload-list:
Verification - Agent logs:
Result: Agent detected that `all` is not a valid UUID → returned `nil` for `GPUDeviceIDs` → fell back to the PodResources API for the correct GPU assignment. Note: no `GPU Device IDs` section in workload-list (the field is nil).