🐛 Fix KEDA APIService race condition in nightly E2E #761

Merged
clubanderson merged 1 commit into main from fix/keda-apiservice-resilience
Feb 18, 2026

Conversation

@clubanderson

Summary

  • Add background APIService guard loop that re-patches v1beta1.external.metrics.k8s.io every 10s if KEDA reclaims it
  • Guard starts when KEDA is detected during deploy_prometheus_adapter(), stops during cleanup
  • Does not modify KEDA itself — just ensures the APIService always points to prometheus-adapter
  • Increase external metrics API test timeout from 5min to 10min
  • Increase scale-to-zero test timeout from 5min to 10min
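
The guard described in the list above could look roughly like the following. This is a minimal sketch, not the code merged in deploy/install.sh: the function names `apiservice_points_to_adapter`, `guard_once`, and `guard_loop` and the exact patch payload are assumptions made for illustration.

```shell
# Hypothetical sketch of an APIService guard. Function names and the patch
# payload are assumptions, not the implementation in deploy/install.sh.

APISERVICE="v1beta1.external.metrics.k8s.io"

# Succeeds when the APIService points at prometheus-adapter in the given namespace.
apiservice_points_to_adapter() {
    local monitoring_ns="$1" svc ns
    svc=$(kubectl get apiservice "$APISERVICE" \
        -o jsonpath='{.spec.service.name}' 2>/dev/null || echo "")
    ns=$(kubectl get apiservice "$APISERVICE" \
        -o jsonpath='{.spec.service.namespace}' 2>/dev/null || echo "")
    [ "$svc" = "prometheus-adapter" ] && [ "$ns" = "$monitoring_ns" ]
}

# One guard iteration: re-patch only when the APIService has drifted.
guard_once() {
    local monitoring_ns="$1"
    if ! apiservice_points_to_adapter "$monitoring_ns"; then
        kubectl patch apiservice "$APISERVICE" --type=merge \
            -p "{\"spec\":{\"service\":{\"name\":\"prometheus-adapter\",\"namespace\":\"$monitoring_ns\"}}}"
    fi
}

# Repeat the check every `interval` seconds (default 10) until killed.
guard_loop() {
    local monitoring_ns="$1" interval="${2:-10}"
    while true; do
        # A failed patch should not end the loop; retry on the next tick.
        guard_once "$monitoring_ns" || true
        sleep "$interval"
    done
}
```

Under these assumptions the loop would be started in the background with something like `guard_loop "$MONITORING_NAMESPACE" 10 &`, and the resulting PID kept so that cleanup can stop it.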

Problem

The KEDA operator continuously reconciles the v1beta1.external.metrics.k8s.io APIService back to keda-metrics-apiserver within ~2 minutes of any patch. The existing one-shot patch in install.sh is therefore overwritten, and the HPA receives 404 errors when it queries wva_desired_replicas. As a result, the nightly E2E tests fail intermittently on OCP clusters with KEDA installed.

KEDA has no built-in option to disable external metrics registration (issue #470, open since 2019).
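
To reproduce the reconciliation described above, one option is to patch the APIService once and then poll until it stops pointing at the expected service. The helper below is a hypothetical diagnostic, not part of this PR; the name `wait_for_drift` and its parameters are assumptions.

```shell
# Hypothetical diagnostic: poll the APIService until its backing service is no
# longer the expected one, and report roughly how long that took.
# Usage: wait_for_drift <expected-service> [timeout-seconds] [interval-seconds]
wait_for_drift() {
    local expected_svc="$1" timeout="${2:-300}" interval="${3:-5}"
    local elapsed=0 svc
    while [ "$elapsed" -lt "$timeout" ]; do
        # An empty result (APIService temporarily unavailable) also counts as drift here.
        svc=$(kubectl get apiservice v1beta1.external.metrics.k8s.io \
            -o jsonpath='{.spec.service.name}' 2>/dev/null || echo "")
        if [ "$svc" != "$expected_svc" ]; then
            echo "drift after ~${elapsed}s: service is now '${svc}'"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "no drift observed within ${timeout}s"
    return 1
}
```

For example, `wait_for_drift prometheus-adapter 300 5` run right after the one-shot patch would be expected to report drift within a few minutes on a cluster where KEDA reclaims the APIService.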

Test plan

  • Trigger WVA nightly on OCP — should no longer fail with KEDA APIService 404s
  • Verify guard output shows re-patching in workflow logs when KEDA reclaims
  • Verify guard stops cleanly during cleanup (no orphaned processes)
  • Run multiple consecutive E2E runs to confirm stability

KEDA operator continuously reconciles v1beta1.external.metrics.k8s.io
APIService back to keda-metrics-apiserver within ~2 minutes of any
patch. This causes persistent 404 errors when HPA queries external
metrics for wva_desired_replicas, breaking scale-up tests.

Add a background APIService guard loop that checks every 10 seconds
and re-patches the APIService to prometheus-adapter if KEDA reclaims
it. The guard starts when KEDA is detected during deploy and stops
cleanly during cleanup, without modifying KEDA itself.

Also increase external metrics and scale-to-zero test timeouts from
5 to 10 minutes for additional margin on busy clusters.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings February 18, 2026 21:12
@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource   Total   Allocated   Available
GPUs       50      12          38

Cluster          Value
Nodes            16 (7 with GPUs)
Total CPU        993 cores
Total Memory     10383 Gi
GPUs required    4 (min) / 6 (recommended)

Copilot AI left a comment

Pull request overview

This PR aims to stabilize OpenShift nightly E2E runs on clusters with KEDA installed by preventing the v1beta1.external.metrics.k8s.io APIService from being reconciled away from prometheus-adapter, and by extending E2E timeouts to tolerate slower external-metrics readiness / scale-to-zero convergence.

Changes:

  • Add a background “APIService guard” loop in deploy/install.sh that periodically re-patches v1beta1.external.metrics.k8s.io back to prometheus-adapter.
  • Increase the external-metrics accessibility assertion timeout in the ShareGPT scale-up OpenShift E2E test from 5m to 10m.
  • Increase the scale-to-zero “recommend zero replicas” polling window from 5m to 10m.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File                                         Description
deploy/install.sh                            Adds a background guard process intended to keep external-metrics APIService pointed at Prometheus Adapter and stops it during undeploy/cleanup paths.
test/e2e-openshift/sharegpt_scaleup_test.go  Extends external metrics API availability polling timeout to reduce flakes.
test/e2e-openshift/scale_to_zero_test.go     Extends scale-to-zero polling deadline to accommodate slower environments.

Comment on lines 1074 to +1080
fi

# Start background guard to prevent KEDA from reclaiming the APIService.
# KEDA's operator continuously reconciles the APIService back to its own
# metrics server within ~2 minutes of any patch. The guard re-patches it
# every 10 seconds without modifying KEDA itself.
start_apiservice_guard "$MONITORING_NAMESPACE"
Copilot AI Feb 18, 2026

start_apiservice_guard is started whenever the APIService exists, even if it already points at prometheus-adapter/$MONITORING_NAMESPACE. If the guard is only intended to mitigate KEDA’s reconciliation, consider starting it only when the APIService is observed pointing somewhere else (or when it’s specifically pointing to keda-metrics-apiserver) to avoid running an always-on background loop on clusters that don’t need it.

Suggested change
- fi
- # Start background guard to prevent KEDA from reclaiming the APIService.
- # KEDA's operator continuously reconciles the APIService back to its own
- # metrics server within ~2 minutes of any patch. The guard re-patches it
- # every 10 seconds without modifying KEDA itself.
- start_apiservice_guard "$MONITORING_NAMESPACE"
+ # Start background guard to prevent KEDA from reclaiming the APIService.
+ # KEDA's operator continuously reconciles the APIService back to its own
+ # metrics server within ~2 minutes of any patch. The guard re-patches it
+ # every 10 seconds without modifying KEDA itself.
+ start_apiservice_guard "$MONITORING_NAMESPACE"
+ fi

Copilot uses AI. Check for mistakes.
Comment on lines +1076 to +1080
# Start background guard to prevent KEDA from reclaiming the APIService.
# KEDA's operator continuously reconciles the APIService back to its own
# metrics server within ~2 minutes of any patch. The guard re-patches it
# every 10 seconds without modifying KEDA itself.
start_apiservice_guard "$MONITORING_NAMESPACE"
Copilot AI Feb 18, 2026

start_apiservice_guard is started here, but the normal deployment path never calls stop_apiservice_guard before main exits. In a non-interactive script, this background loop can outlive the script and continue patching the cluster unexpectedly (and may interfere with later CI cleanup that deletes the adapter/namespace). Consider giving the guard an explicit lifecycle (e.g., start only for E2E runs and add a corresponding stop action invoked by CI cleanup, or stop it automatically on script exit when persistence isn’t required).
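
One way to give the guard the explicit lifecycle suggested above is to tie it to the script's exit with a `trap`. The sketch below is illustrative only: `start_guard`, `stop_guard`, and `GUARD_PID` are hypothetical names, and this variant only fits runs where the guard does not need to outlive the script.

```shell
# Hypothetical lifecycle sketch: the background guard is stopped automatically
# when the script exits. Names are illustrative, not those used in install.sh.
GUARD_PID=""

# Run the given command in the background and remember its PID.
start_guard() {
    "$@" &
    GUARD_PID=$!
}

# Stop the guard if it is still running, then reap it so no zombie remains.
stop_guard() {
    if [ -n "$GUARD_PID" ] && kill -0 "$GUARD_PID" 2>/dev/null; then
        kill "$GUARD_PID" 2>/dev/null || true
        wait "$GUARD_PID" 2>/dev/null || true
    fi
    GUARD_PID=""
}

# Ensure the guard never outlives this script on a normal exit.
trap stop_guard EXIT
```

If the guard must keep running after install.sh returns (for example, for the duration of an E2E job), this approach does not apply as written, and a separate stop action invoked by CI cleanup would be needed instead.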

Comment on lines +170 to +172
APISERVICE_GUARD_PID=$!
echo "$APISERVICE_GUARD_PID" > /tmp/apiservice-guard.pid
log_success "APIService guard started (PID: $APISERVICE_GUARD_PID)"
Copilot AI Feb 18, 2026

The guard PID is written to a fixed path (/tmp/apiservice-guard.pid). This can collide across concurrent runs on the same runner and can become stale; since PIDs are reused, later stop_apiservice_guard invocations risk killing an unrelated process. Consider making the pidfile unique per run (e.g., include $$ / a unique suffix) and validating the process identity before killing.
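
A rough sketch of both ideas in this comment, a per-run pidfile and an identity check before killing, follows. All helper names are hypothetical, and the marker string used to recognise the guard process is an assumption that would have to match however the guard is actually launched.

```shell
# Hypothetical helpers: per-run pidfile plus a process identity check.

# One pidfile per script invocation ($$ is the PID of the running script).
guard_pidfile() {
    echo "${TMPDIR:-/tmp}/apiservice-guard.$$.pid"
}

# Succeeds only if the PID is alive and its command line contains the marker.
pid_matches() {
    local pid="$1" marker="$2" cmd
    cmd=$(ps -p "$pid" -o args= 2>/dev/null) || return 1
    case "$cmd" in
        *"$marker"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Kill the recorded PID only when it still looks like the guard, then remove the pidfile.
stop_guard_from_pidfile() {
    local pidfile="$1" marker="$2" pid
    [ -f "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    if pid_matches "$pid" "$marker"; then
        kill "$pid" 2>/dev/null || true
    fi
    rm -f "$pidfile"
}
```

With this shape, a stale pidfile whose PID has been reused by an unrelated process is cleaned up without signalling that process. Note that a guard started as a backgrounded shell function typically shows the parent script's command line in `ps`, so the marker would likely be the script name rather than the function name.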

current_ns=$(kubectl get apiservice v1beta1.external.metrics.k8s.io \
-o jsonpath='{.spec.service.namespace}' 2>/dev/null || echo "")
if [ "$current_svc" != "prometheus-adapter" ] || [ "$current_ns" != "$monitoring_ns" ]; then
echo "[apiservice-guard] KEDA reclaimed APIService (now: $current_svc/$current_ns), re-patching to prometheus-adapter/$monitoring_ns"
Copilot AI Feb 18, 2026

This log line hard-codes “KEDA reclaimed APIService”, but the condition triggers for any mismatch (e.g., a different external-metrics provider or an empty service/namespace when the APIService is temporarily unavailable). Consider making the message provider-agnostic (e.g., “APIService drift detected”) and include the observed service/namespace without attributing it to KEDA unless KEDA is explicitly detected.

Suggested change
- echo "[apiservice-guard] KEDA reclaimed APIService (now: $current_svc/$current_ns), re-patching to prometheus-adapter/$monitoring_ns"
+ echo "[apiservice-guard] APIService drift detected: v1beta1.external.metrics.k8s.io now points to $current_svc/$current_ns; re-patching to prometheus-adapter/$monitoring_ns"

@mamy-CS mamy-CS left a comment

/lgtm

@clubanderson clubanderson merged commit f881df1 into main Feb 18, 2026
21 of 22 checks passed
@clubanderson clubanderson deleted the fix/keda-apiservice-resilience branch February 18, 2026 21:21

2 participants