🐛 Fix KEDA APIService race condition in nightly E2E by clubanderson · Pull Request #761 · llm-d/llm-d-workload-variant-autoscaler

clubanderson · 2026-02-18T21:12:36Z

Summary

Add background APIService guard loop that re-patches v1beta1.external.metrics.k8s.io every 10s if KEDA reclaims it
Guard starts when KEDA is detected during deploy_prometheus_adapter(), stops during cleanup
Does not modify KEDA itself — just ensures the APIService always points to prometheus-adapter
Increase external metrics API test timeout from 5min to 10min
Increase scale-to-zero test timeout from 5min to 10min
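
The guard loop described above could look roughly like the following. This is a minimal sketch, not the code merged in install.sh: the function names `apiservice_backend_field`, `guard_once`, and `guard_loop` are hypothetical, and the merge patch is just one way to repoint the APIService.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names are illustrative, not taken from install.sh.
APISERVICE="v1beta1.external.metrics.k8s.io"

# Print one field (name or namespace) of the Service backing the APIService.
# Prints an empty string if the lookup fails.
apiservice_backend_field() {
    kubectl get apiservice "$APISERVICE" \
        -o jsonpath="{.spec.service.$1}" 2>/dev/null || echo ""
}

# One guard iteration: re-patch only when the backend differs from the target.
guard_once() {
    local monitoring_ns="$1" current_svc current_ns
    current_svc=$(apiservice_backend_field name)
    current_ns=$(apiservice_backend_field namespace)
    if [ "$current_svc" != "prometheus-adapter" ] || [ "$current_ns" != "$monitoring_ns" ]; then
        echo "[apiservice-guard] drift detected (now: $current_svc/$current_ns), re-patching"
        kubectl patch apiservice "$APISERVICE" --type=merge \
            -p "{\"spec\":{\"service\":{\"name\":\"prometheus-adapter\",\"namespace\":\"$monitoring_ns\"}}}" \
            >/dev/null
    fi
}

# Loop forever, checking every 10 seconds by default (interval is the second argument).
guard_loop() {
    while true; do
        guard_once "$1" || true
        sleep "${2:-10}"
    done
}
```

Running `guard_loop "$MONITORING_NAMESPACE" &` would start it in the background. Splitting out `guard_once` keeps a single iteration easy to exercise on its own.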

Problem

The KEDA operator continuously reconciles the v1beta1.external.metrics.k8s.io APIService back to keda-metrics-apiserver, reverting any patch within ~2 minutes. The existing one-shot patch in install.sh is therefore overwritten, and the HPA receives 404 errors when it queries wva_desired_replicas. As a result, the nightly E2E tests fail intermittently on OCP clusters with KEDA installed.

KEDA has no built-in option to disable external metrics registration (issue #470, open since 2019).
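
To check for the symptom described above, one option is to query the external metrics API directly for the metric. The sketch below assumes the metric is requested per namespace through the standard external metrics path. The function names and the example namespace are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative and not part of install.sh.

# Return 0 if the external metrics API serves the named metric in the namespace.
external_metric_available() {
    local ns="$1" metric="$2"
    kubectl get --raw \
        "/apis/external.metrics.k8s.io/v1beta1/namespaces/${ns}/${metric}" \
        >/dev/null 2>&1
}

# Report the result in a form suitable for CI logs.
report_external_metric() {
    if external_metric_available "$1" "$2"; then
        echo "external metric $2 is served in namespace $1"
    else
        echo "external metric $2 is NOT served in namespace $1 (APIService may point elsewhere)"
        return 1
    fi
}
```

For example, `report_external_metric my-namespace wva_desired_replicas` would be expected to fail while the APIService points at keda-metrics-apiserver, where `my-namespace` stands in for the namespace under test.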

Test plan

Trigger WVA nightly on OCP — should no longer fail with KEDA APIService 404s
Verify guard output shows re-patching in workflow logs when KEDA reclaims
Verify guard stops cleanly during cleanup (no orphaned processes)
Run multiple consecutive E2E runs to confirm stability

KEDA operator continuously reconciles v1beta1.external.metrics.k8s.io APIService back to keda-metrics-apiserver within ~2 minutes of any patch. This causes persistent 404 errors when HPA queries external metrics for wva_desired_replicas, breaking scale-up tests.

Add a background APIService guard loop that checks every 10 seconds and re-patches the APIService to prometheus-adapter if KEDA reclaims it. The guard starts when KEDA is detected during deploy and stops cleanly during cleanup, without modifying KEDA itself.

Also increase external metrics and scale-to-zero test timeouts from 5 to 10 minutes for additional margin on busy clusters.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

github-actions · 2026-02-18T21:15:14Z

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource	Total	Allocated	Available
GPUs	50	12	38

Cluster	Value
Nodes	16 (7 with GPUs)
Total CPU	993 cores
Total Memory	10383 Gi
GPUs required	4 (min) / 6 (recommended)

Copilot

Pull request overview

This PR aims to stabilize OpenShift nightly E2E runs on clusters with KEDA installed by preventing the v1beta1.external.metrics.k8s.io APIService from being reconciled away from prometheus-adapter, and by extending E2E timeouts to tolerate slower external-metrics readiness / scale-to-zero convergence.

Changes:

Add a background “APIService guard” loop in deploy/install.sh that periodically re-patches v1beta1.external.metrics.k8s.io back to prometheus-adapter.
Increase the external-metrics accessibility assertion timeout in the ShareGPT scale-up OpenShift E2E test from 5m to 10m.
Increase the scale-to-zero “recommend zero replicas” polling window from 5m to 10m.
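
The timeout changes above widen a polling window: the tests keep retrying a check until it passes or a deadline expires. The E2E tests themselves are Go. The following is only a shell analogue of the same pattern, with a hypothetical helper name `poll_until`.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating a poll-until-deadline pattern; not from the PR.

# Retry a command until it succeeds or the deadline passes.
# Usage: poll_until <timeout_seconds> <interval_seconds> <command> [args...]
poll_until() {
    local timeout="$1" interval="$2" deadline
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

With this shape, moving from a 5-minute to a 10-minute window changes only the first argument, for example `poll_until 600 10 some_check`, where `some_check` is any command that returns 0 once the condition holds.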

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`deploy/install.sh`	Adds a background guard process intended to keep external-metrics APIService pointed at Prometheus Adapter and stops it during undeploy/cleanup paths.
`test/e2e-openshift/sharegpt_scaleup_test.go`	Extends external metrics API availability polling timeout to reduce flakes.
`test/e2e-openshift/scale_to_zero_test.go`	Extends scale-to-zero polling deadline to accommodate slower environments.

Copilot · 2026-02-18T21:18:13Z

deploy/install.sh

        fi
+
+        # Start background guard to prevent KEDA from reclaiming the APIService.
+        # KEDA's operator continuously reconciles the APIService back to its own
+        # metrics server within ~2 minutes of any patch. The guard re-patches it
+        # every 10 seconds without modifying KEDA itself.
+        start_apiservice_guard "$MONITORING_NAMESPACE"


start_apiservice_guard is started whenever the APIService exists, even if it already points at prometheus-adapter/$MONITORING_NAMESPACE. If the guard is only intended to mitigate KEDA’s reconciliation, consider starting it only when the APIService is observed pointing somewhere else (or when it’s specifically pointing to keda-metrics-apiserver) to avoid running an always-on background loop on clusters that don’t need it.

Suggested change

-        fi
-
-        # Start background guard to prevent KEDA from reclaiming the APIService.
-        # KEDA's operator continuously reconciles the APIService back to its own
-        # metrics server within ~2 minutes of any patch. The guard re-patches it
-        # every 10 seconds without modifying KEDA itself.
-        start_apiservice_guard "$MONITORING_NAMESPACE"
+        # Start background guard to prevent KEDA from reclaiming the APIService.
+        # KEDA's operator continuously reconciles the APIService back to its own
+        # metrics server within ~2 minutes of any patch. The guard re-patches it
+        # every 10 seconds without modifying KEDA itself.
+        start_apiservice_guard "$MONITORING_NAMESPACE"
+        fi

Copilot · 2026-02-18T21:18:13Z

deploy/install.sh

+        # Start background guard to prevent KEDA from reclaiming the APIService.
+        # KEDA's operator continuously reconciles the APIService back to its own
+        # metrics server within ~2 minutes of any patch. The guard re-patches it
+        # every 10 seconds without modifying KEDA itself.
+        start_apiservice_guard "$MONITORING_NAMESPACE"


start_apiservice_guard is started here, but the normal deployment path never calls stop_apiservice_guard before main exits. In a non-interactive script, this background loop can outlive the script and continue patching the cluster unexpectedly (and may interfere with later CI cleanup that deletes the adapter/namespace). Consider giving the guard an explicit lifecycle (e.g., start only for E2E runs and add a corresponding stop action invoked by CI cleanup, or stop it automatically on script exit when persistence isn’t required).
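
One way to give the guard the explicit lifecycle suggested here is an EXIT trap, so the background process cannot outlive the script. This is a sketch under assumptions: `start_guard_bg` and `stop_guard_bg` are hypothetical names, not the `start_apiservice_guard` / `stop_apiservice_guard` functions in install.sh, and it only fits runs where the guard is not meant to persist after the script ends.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names below are illustrative, not the functions in install.sh.
GUARD_PID=""

# Start any long-running command in the background and remember its PID.
start_guard_bg() {
    "$@" &
    GUARD_PID=$!
}

# Stop the background process if it is still running; safe to call more than once.
stop_guard_bg() {
    if [ -n "$GUARD_PID" ] && kill -0 "$GUARD_PID" 2>/dev/null; then
        kill "$GUARD_PID" 2>/dev/null || true
        # Reap the child so it does not linger as a zombie.
        wait "$GUARD_PID" 2>/dev/null || true
    fi
    GUARD_PID=""
    return 0
}

# Ensure the guard never outlives the script, whether it exits normally or on error.
trap stop_guard_bg EXIT
```

If the guard must keep running after deployment (for example, for the duration of an E2E run driven by a separate process), the trap would be the wrong tool and an explicit stop step in CI cleanup would be needed instead.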

Copilot · 2026-02-18T21:18:13Z

deploy/install.sh

+    APISERVICE_GUARD_PID=$!
+    echo "$APISERVICE_GUARD_PID" > /tmp/apiservice-guard.pid
+    log_success "APIService guard started (PID: $APISERVICE_GUARD_PID)"


The guard PID is written to a fixed path (/tmp/apiservice-guard.pid). This can collide across concurrent runs on the same runner and can become stale; since PIDs are reused, later stop_apiservice_guard invocations risk killing an unrelated process. Consider making the pidfile unique per run (e.g., include $$ / a unique suffix) and validating the process identity before killing.
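
A sketch of the two mitigations mentioned here: a per-run pidfile created with `mktemp`, and an identity check before killing. The helper names and the idea of storing a command-line marker next to the PID are assumptions for illustration, not what install.sh does.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and the marker scheme are illustrative only.

# Create a per-run pidfile instead of a fixed /tmp path.
new_pidfile() {
    mktemp "${TMPDIR:-/tmp}/apiservice-guard.XXXXXX"
}

# Record the PID together with a marker expected in the process command line.
# Usage: write_pidfile <file> <pid> <marker>
write_pidfile() {
    printf '%s %s\n' "$2" "$3" > "$1"
}

# Print the command line of a PID (Linux /proc first, ps as a fallback).
proc_cmdline() {
    if [ -r "/proc/$1/cmdline" ]; then
        tr '\0' ' ' < "/proc/$1/cmdline"
    else
        ps -p "$1" -o args= 2>/dev/null
    fi
}

# Kill the recorded process only if its command line still contains the marker.
# Returns 1 without killing anything when the pidfile looks stale.
stop_from_pidfile() {
    local file="$1" pid marker
    [ -f "$file" ] || return 0
    read -r pid marker < "$file"
    rm -f "$file"
    if [ -z "$pid" ] || [ -z "$marker" ]; then
        return 1
    fi
    case "$(proc_cmdline "$pid")" in
        *"$marker"*) kill "$pid" 2>/dev/null || true ;;
        *) echo "stale pidfile: PID $pid does not match '$marker', not killing" >&2
           return 1 ;;
    esac
}
```

A command-line match is a heuristic rather than a guarantee. Comparing the process start time as well would make accidental matches after PID reuse less likely.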

Copilot · 2026-02-18T21:18:14Z

deploy/install.sh

+            current_ns=$(kubectl get apiservice v1beta1.external.metrics.k8s.io \
+                -o jsonpath='{.spec.service.namespace}' 2>/dev/null || echo "")
+            if [ "$current_svc" != "prometheus-adapter" ] || [ "$current_ns" != "$monitoring_ns" ]; then
+                echo "[apiservice-guard] KEDA reclaimed APIService (now: $current_svc/$current_ns), re-patching to prometheus-adapter/$monitoring_ns"


This log line hard-codes “KEDA reclaimed APIService”, but the condition triggers for any mismatch (e.g., a different external-metrics provider or an empty service/namespace when the APIService is temporarily unavailable). Consider making the message provider-agnostic (e.g., “APIService drift detected”) and include the observed service/namespace without attributing it to KEDA unless KEDA is explicitly detected.

Suggested change

-                echo "[apiservice-guard] KEDA reclaimed APIService (now: $current_svc/$current_ns), re-patching to prometheus-adapter/$monitoring_ns"
+                echo "[apiservice-guard] APIService drift detected: v1beta1.external.metrics.k8s.io now points to $current_svc/$current_ns; re-patching to prometheus-adapter/$monitoring_ns"

mamy-CS

/lgtm

Copilot AI review requested due to automatic review settings February 18, 2026 21:12

Copilot started reviewing on behalf of clubanderson February 18, 2026 21:13

Copilot AI reviewed Feb 18, 2026

mamy-CS approved these changes Feb 18, 2026

clubanderson merged commit f881df1 into main Feb 18, 2026
21 of 22 checks passed

clubanderson deleted the fix/keda-apiservice-resilience branch February 18, 2026 21:21

clubanderson mentioned this pull request Feb 19, 2026

🌱 Guide improvement: workload-autoscaling — align guide content with nightly E2E test patterns llm-d/llm-d#809

Open

clubanderson merged 1 commit into main from fix/keda-apiservice-resilience
