🐛 fix: resolve KEDA APIService conflict for external metrics in nightly E2E#721
🐛 fix: resolve KEDA APIService conflict for external metrics in nightly E2E#721clubanderson wants to merge 4 commits intomainfrom
Conversation
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
There was a problem hiding this comment.
Pull request overview
Updates the deployment script to address OpenShift nightly E2E failures by aligning model defaults, making WVA metrics endpoint security configurable, and resolving external metrics API routing conflicts when KEDA is preinstalled.
Changes:
- Update
DEFAULT_MODEL_IDtoQwen/Qwen3-0.6Bto match current llm-d defaults. - Add
WVA_METRICS_SECUREenv var and wire it to Helm valuewva.metrics.secure. - After Prometheus Adapter deploy, detect/patch
v1beta1.external.metrics.k8s.ioAPIService to point toprometheus-adapterto avoid KEDA conflicts.
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
| # Apply HTTPRoute with correct resource name references. | ||
| # The static httproute.yaml uses resource names matching the helmfile's default | ||
| # RELEASE_NAME_POSTFIX (e.g. "workload-autoscaler"). When RELEASE_NAME_POSTFIX | ||
| # is overridden (e.g. in CI), gateway and InferencePool names change, so we | ||
| # must template the HTTPRoute references to match the actual deployed resources. | ||
| if [ -f httproute.yaml ]; then | ||
| local rn="${RELEASE_NAME_POSTFIX:-}" | ||
| if [ -n "$rn" ]; then | ||
| local gw_name="infra-${rn}-inference-gateway" | ||
| local pool_name="gaie-${rn}" | ||
| log_info "Applying HTTPRoute (gateway=$gw_name, pool=$pool_name)" | ||
| yq eval " | ||
| .spec.parentRefs[0].name = \"${gw_name}\" | | ||
| .spec.rules[0].backendRefs[0].name = \"${pool_name}\" | ||
| " httproute.yaml | kubectl apply -f - -n ${LLMD_NS} | ||
| else | ||
| kubectl apply -f httproute.yaml -n ${LLMD_NS} | ||
| fi |
There was a problem hiding this comment.
The HTTPRoute templating logic references RELEASE_NAME_POSTFIX environment variable, but this variable is not defined anywhere in install.sh or the workflow files. This means the condition will always be false (rn will always be empty), and the code will always execute line 805 (direct apply without templating). If RELEASE_NAME_POSTFIX is intended to be set by external callers or CI workflows, this should be documented. Otherwise, consider removing the conditional logic or documenting where this variable should be set.
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
|
Tests broken here |
b4f4a0d to
bd4b41e
Compare
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
_typos.toml
Outdated
| # Short code fragments / false positives | ||
| ot = "ot" | ||
| vas = "vas" | ||
| abd = "abd" |
There was a problem hiding this comment.
The "abd" entry appears to be added without any corresponding usage in the codebase. After searching the repository, "abd" does not appear in any Go, YAML, or shell script files. This suggests it may have been added preemptively for a word that doesn't actually exist in the PR changes, or it's a typo in the typos configuration itself. Consider removing this entry unless there's a specific need that isn't apparent from the changes.
| abd = "abd" |
There was a problem hiding this comment.
@clubanderson this is not generic enough. PTAL: #725
|
|
||
| # Model and SLO Configuration | ||
| DEFAULT_MODEL_ID=${DEFAULT_MODEL_ID:-"Qwen/Qwen3-32B"} | ||
| DEFAULT_MODEL_ID=${DEFAULT_MODEL_ID:-"Qwen/Qwen3-0.6B"} |
There was a problem hiding this comment.
The change from "Qwen/Qwen3-32B" to "Qwen/Qwen3-0.6B" reduces the default model size from 32B to 0.6B parameters. While this may be intentional for testing efficiency or resource constraints, this change is not documented in the PR description. Consider adding a note explaining the rationale for this change, especially since it significantly affects the default behavior of deployments.
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
|
Added fix for Gateway API CRD installation on OpenShift (#726). On OpenShift, the Ingress Operator manages Gateway API CRDs via a ValidatingAdmissionPolicy, blocking external CRD v1.4.0 installation. This commit skips the base Gateway API CRDs on OpenShift and only installs the GAIE CRDs (InferencePool, InferenceModel) directly. |
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
|
/ok-to-test |
|
🚀 E2E tests triggered by /ok-to-test |
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
|
@lionelvillard can you ptal - lots of good changes to make the e2e and nightly more reliable/resilient. when this is approved, the nightly for wva should/will work |
50af64a to
9df2a1d
Compare
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
5fafcf8 to
4c6a853
Compare
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
…update 1. KEDA APIService conflict: The nightly E2E installs KEDA which registers an APIService for external metrics (v1beta1.external.metrics.k8s.io). This conflicts with the metrics-server, causing failures. Fix: add a pre-install cleanup step to remove stale KEDA APIService registrations. 2. Scale-to-zero status update: The controller used Status().Patch with client.MergeFrom which computes a JSON merge patch. When numReplicas=0 (Go zero value), the diff omits it from the patch, but the CRD schema requires numReplicas in desiredOptimizedAlloc. Fix: switch both status update call sites to Status().Update() which sends the full object. Fixes: #731 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Andrew Anderson <andy@clubanderson.com>
4c6a853 to
919e372
Compare
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
CI Analysis: e2e failures are NOT regressionsWhat our PR changesThe controller's Why e2e tests fail (pre-existing, not caused by this PR)OpenShift e2e (
KIND e2e (
Evidence that these are pre-existing:
Verified
|
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
KIND e2e: Defer scale-to-zero enablement until after load test completes. Previously, scale-to-zero was enabled in BeforeAll, causing the system to scale to 0 before load started — the saturation engine can't operate with no pods to measure KV cache/queue metrics. OpenShift e2e: Add graceful Skip when scale-to-zero metrics are unavailable. The vllm:request_success_total recording rule may not exist in all environments, so hard-failing on metric absence is inappropriate. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Andrew Anderson <andy@clubanderson.com>
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
| // We detect this by polling with a timeout shorter than the full retention period | ||
| // and gracefully skip if scale-to-zero cannot be validated. |
There was a problem hiding this comment.
The comment says the polling timeout is "shorter than the full retention period", but the code waits retention period (3m) + buffer (2m) = 5m, which is longer than the configured retentionPeriod. Please adjust the comment to match the actual timeout logic (or adjust the timeout if the intent is different).
| // We detect this by polling with a timeout shorter than the full retention period | |
| // and gracefully skip if scale-to-zero cannot be validated. | |
| // We detect this by polling with a timeout slightly longer than the configured | |
| // retention period (retentionPeriod + safety buffer) and gracefully skip if | |
| // scale-to-zero cannot be validated. |
| // Patch status — use fullDesiredAllocPatchBase to ensure the complete | ||
| // desiredOptimizedAlloc object is always included in the merge patch. | ||
| // Without this, MergeFrom only includes changed fields within the struct, | ||
| // and the CRD validates the partial patch — rejecting it when required | ||
| // fields (numReplicas, accelerator) are absent. See: #731 | ||
| if err := r.Status().Patch(ctx, &va, client.MergeFrom(fullDesiredAllocPatchBase(originalVA, &va))); err != nil { |
There was a problem hiding this comment.
This change introduces a new patching strategy (fullDesiredAllocPatchBase) to avoid CRD validation failures from partial merge patches. There are existing controller tests in this package; please add a regression test that reproduces the validation error scenario (e.g., when desiredOptimizedAlloc is present and only nested fields change) and asserts Reconcile succeeds and status is updated.
| echo "=== Deployment conditions ===" | ||
| kubectl get deployment ms-inference-scheduling-llm-d-modelservice-decode -n "$LLMD_NAMESPACE" -o jsonpath='{.status.conditions}' | jq . || true | ||
| echo "=== Recent events ===" | ||
| kubectl get events -n "$LLMD_NAMESPACE" --sort-by='.lastTimestamp' | tail -20 |
There was a problem hiding this comment.
The rollout status check is wrapped so that failures only print diagnostics and the step still exits successfully. This can let the workflow proceed to e2e tests even when the model deployment never became Ready, producing harder-to-diagnose downstream failures. Consider failing the job when rollout status times out (or making the behavior explicit via an input/flag).
| kubectl get events -n "$LLMD_NAMESPACE" --sort-by='.lastTimestamp' | tail -20 | |
| kubectl get events -n "$LLMD_NAMESPACE" --sort-by='.lastTimestamp' | tail -20 | |
| exit 1 |
| // Delete and recreate to ensure the update is picked up | ||
| err := k8sClient.CoreV1().ConfigMaps(controllerNamespace).Delete(ctx, scaleToZeroConfigMapName, metav1.DeleteOptions{}) | ||
| Expect(client.IgnoreNotFound(err)).NotTo(HaveOccurred()) | ||
| _, err = k8sClient.CoreV1().ConfigMaps(controllerNamespace).Create(ctx, scaleToZeroCMUpdate, metav1.CreateOptions{}) | ||
| Expect(err).NotTo(HaveOccurred(), "Should be able to create scale-to-zero ConfigMap with feature enabled") |
There was a problem hiding this comment.
The test enables scale-to-zero by deleting and recreating the ConfigMap. Since the controller watches ConfigMap Update events, an in-place Update/Patch should be sufficient and avoids a transient NotFound window where the controller may clear namespace config due to the delete event. Consider updating the existing ConfigMap instead of delete+create to reduce flakiness.
| // Delete and recreate to ensure the update is picked up | |
| err := k8sClient.CoreV1().ConfigMaps(controllerNamespace).Delete(ctx, scaleToZeroConfigMapName, metav1.DeleteOptions{}) | |
| Expect(client.IgnoreNotFound(err)).NotTo(HaveOccurred()) | |
| _, err = k8sClient.CoreV1().ConfigMaps(controllerNamespace).Create(ctx, scaleToZeroCMUpdate, metav1.CreateOptions{}) | |
| Expect(err).NotTo(HaveOccurred(), "Should be able to create scale-to-zero ConfigMap with feature enabled") | |
| // Update the existing ConfigMap in-place to ensure the change is picked up | |
| existingCM, err := k8sClient.CoreV1().ConfigMaps(controllerNamespace).Get(ctx, scaleToZeroConfigMapName, metav1.GetOptions{}) | |
| if err != nil { | |
| // If the ConfigMap does not exist yet, create it | |
| if client.IgnoreNotFound(err) == nil { | |
| _, err = k8sClient.CoreV1().ConfigMaps(controllerNamespace).Create(ctx, scaleToZeroCMUpdate, metav1.CreateOptions{}) | |
| Expect(err).NotTo(HaveOccurred(), "Should be able to create scale-to-zero ConfigMap with feature enabled") | |
| } else { | |
| Expect(err).NotTo(HaveOccurred(), "Unexpected error fetching scale-to-zero ConfigMap") | |
| } | |
| } else { | |
| existingCM.Data = scaleToZeroCMUpdate.Data | |
| _, err = k8sClient.CoreV1().ConfigMaps(controllerNamespace).Update(ctx, existingCM, metav1.UpdateOptions{}) | |
| Expect(err).NotTo(HaveOccurred(), "Should be able to update scale-to-zero ConfigMap with feature enabled") | |
| } |
| MinimumReplicas = 0 | ||
|
|
||
| _, _ = fmt.Fprintf(GinkgoWriter, "Scale-to-zero enabled in ConfigMap. Waiting for controller to pick up change...\n") | ||
| time.Sleep(10 * time.Second) // Brief pause for ConfigMap watch to trigger |
There was a problem hiding this comment.
Using a fixed sleep to wait for the controller to pick up the ConfigMap change is prone to flakiness across clusters. Prefer polling for an observable effect (e.g., wait until the controller has observed the new ConfigMap resourceVersion or until VA/HPA behavior reflects the new setting) using Eventually with a timeout.
| time.Sleep(10 * time.Second) // Brief pause for ConfigMap watch to trigger | |
| By("waiting for the updated scale-to-zero ConfigMap to be observable") | |
| Eventually(func(g Gomega) { | |
| cm, err := k8sClient.CoreV1().ConfigMaps(controllerNamespace).Get(ctx, scaleToZeroConfigMapName, metav1.GetOptions{}) | |
| g.Expect(err).NotTo(HaveOccurred(), "Should be able to fetch updated scale-to-zero ConfigMap") | |
| // Verify that the ConfigMap contains the updated retention period configuration | |
| g.Expect(cm.Data).To(HaveKeyWithValue("config.yaml", | |
| ContainSubstring(fmt.Sprintf("retention_period: %s", retentionPeriodShort))), | |
| ) | |
| }, 1*time.Minute, 5*time.Second).Should(Succeed()) |
The scale-to-zero KIND e2e test was failing because it jumped straight from resource creation to the scale-up test without waiting for Prometheus to discover and scrape KV cache metrics from the new pods. Add a "waiting for metrics pipeline to be ready" step in BeforeAll (matching the pattern used by the limiter test) that waits up to 5 minutes for DesiredOptimizedAlloc to be populated before proceeding. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Andrew Anderson <andy@clubanderson.com>
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
…KIND) Apply the same graceful Skip pattern used in the OpenShift e2e test to the KIND e2e scale-to-zero test. When the Prometheus recording rule for vllm:request_success_total is not deployed, the enforcer keeps current replicas instead of scaling to zero. This is a pre-existing infrastructure gap, not a test bug — gracefully skip instead of fail. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Andrew Anderson <andy@clubanderson.com>
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
| # Model and SLO Configuration | ||
| DEFAULT_MODEL_ID=${DEFAULT_MODEL_ID:-"Qwen/Qwen3-32B"} | ||
| DEFAULT_MODEL_ID=${DEFAULT_MODEL_ID:-"Qwen/Qwen3-0.6B"} | ||
| MODEL_ID=${MODEL_ID:-"unsloth/Meta-Llama-3.1-8B"} |
There was a problem hiding this comment.
DEFAULT_MODEL_ID was changed to a much smaller model (Qwen3-0.6B). This is a user-facing behavior change for anyone running deploy/install.sh without overriding MODEL_ID, and it isn’t mentioned in the PR description (which focuses on metrics/gateway fixes). Consider reverting this default change (and setting the smaller model only in CI via env), or explicitly documenting the new default and the rationale in the PR/docs.
| kubectl patch apiservice v1beta1.external.metrics.k8s.io --type=merge -p "{ | ||
| \"spec\": { | ||
| \"insecureSkipTLSVerify\": true, | ||
| \"service\": { | ||
| \"name\": \"prometheus-adapter\", | ||
| \"namespace\": \"$MONITORING_NAMESPACE\" | ||
| } | ||
| } | ||
| }" && log_success "APIService patched to use Prometheus Adapter" \ |
There was a problem hiding this comment.
The APIService conflict fix patches v1beta1.external.metrics.k8s.io with insecureSkipTLSVerify: true unconditionally. That weakens TLS verification even on clusters where the API aggregation layer is correctly configured (e.g., with a valid caBundle). Consider patching only .spec.service.name/.spec.service.namespace (and possibly preserving the existing TLS fields), or at least leaving insecureSkipTLSVerify unchanged unless it’s required for Prometheus Adapter in this environment.
| MinimumReplicas = 0 | ||
|
|
||
| _, _ = fmt.Fprintf(GinkgoWriter, "Scale-to-zero enabled in ConfigMap. Waiting for controller to pick up change...\n") | ||
| time.Sleep(10 * time.Second) // Brief pause for ConfigMap watch to trigger |
There was a problem hiding this comment.
A fixed time.Sleep(10 * time.Second) after recreating the scale-to-zero ConfigMap can make the test flaky (if the controller takes longer to observe the change) and slows successful runs. Prefer an Eventually that waits until the controller has observed the updated config (e.g., by checking a status/condition change or other observable signal) instead of sleeping a constant duration.
| time.Sleep(10 * time.Second) // Brief pause for ConfigMap watch to trigger |
| if !scaledToZero { | ||
| va := &v1alpha1.VariantAutoscaling{} | ||
| err = crClient.Get(ctx, client.ObjectKey{ | ||
| Namespace: namespace, | ||
| Name: name, | ||
| }, va) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| _, _ = fmt.Fprintf(GinkgoWriter, "\nScale-to-zero did not occur after waiting 5 minutes\n") | ||
| _, _ = fmt.Fprintf(GinkgoWriter, "Final NumReplicas: %d\n", va.Status.DesiredOptimizedAlloc.NumReplicas) | ||
| _, _ = fmt.Fprintf(GinkgoWriter, "VA Conditions:\n") | ||
| for _, c := range va.Status.Conditions { | ||
| _, _ = fmt.Fprintf(GinkgoWriter, " %s: %s (reason: %s, message: %s)\n", | ||
| c.Type, c.Status, c.Reason, c.Message) | ||
| } | ||
|
|
||
| scaleToZeroMetricsWorking = false | ||
| Skip("Scale-to-zero did not take effect within timeout. " + | ||
| "This is likely because the Prometheus recording rule for " + | ||
| "vllm:request_success_total is not deployed. Standard vLLM exposes " + | ||
| "vllm_request_success_total (underscore notation) but WVA queries " + | ||
| "vllm:request_success_total (colon notation, requires recording rules).") |
There was a problem hiding this comment.
This test now Skip()s if scale-to-zero doesn’t occur within 5 minutes, assuming the only cause is a missing vllm:request_success_total recording rule. That can mask real regressions (controller/enforcer bugs, HPA issues, config not applied) as “skipped”. Consider only skipping when the observed VA status/conditions indicate the specific expected metrics error (and failing otherwise), so genuine scale-to-zero regressions still fail the suite.
| if !scaledToZero { | ||
| // Scale-to-zero didn't happen — likely because the Prometheus recording rule | ||
| // vllm:request_success_total is not deployed. Standard vLLM exposes | ||
| // vllm_request_success_total (underscore notation), and recording rules | ||
| // are needed to transform it to the colon notation that WVA queries. | ||
| va := &v1alpha1.VariantAutoscaling{} | ||
| err := crClient.Get(ctx, client.ObjectKey{ | ||
| Namespace: llmDNamespace, | ||
| Name: vaName, | ||
| }, va) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| _, _ = fmt.Fprintf(GinkgoWriter, "\nScale-to-zero did not occur after waiting 5 minutes\n") | ||
| _, _ = fmt.Fprintf(GinkgoWriter, "Final NumReplicas: %d\n", va.Status.DesiredOptimizedAlloc.NumReplicas) | ||
| _, _ = fmt.Fprintf(GinkgoWriter, "VA Conditions:\n") | ||
| for _, c := range va.Status.Conditions { | ||
| _, _ = fmt.Fprintf(GinkgoWriter, " %s: %s (reason: %s, message: %s)\n", | ||
| c.Type, c.Status, c.Reason, c.Message) | ||
| } | ||
|
|
||
| scaleToZeroMetricsWorking = false | ||
| Skip("Scale-to-zero did not take effect within timeout. " + | ||
| "This is likely because the Prometheus recording rule for " + | ||
| "vllm:request_success_total is not deployed. Standard vLLM exposes " + | ||
| "vllm_request_success_total (underscore notation) but WVA queries " + | ||
| "vllm:request_success_total (colon notation, requires recording rules).") |
There was a problem hiding this comment.
Similar to the saturation-based test: this OpenShift scale-to-zero check Skip()s when scale-to-zero doesn’t happen within the timeout, attributing it to missing vllm:request_success_total recording rules. This can hide real functional regressions as skipped tests. Consider gating the skip on a clear signal in VA conditions/messages that the metric is missing, and failing for other causes.
Summary
Fixes the WVA nightly E2E test failures on OpenShift by resolving three independent issues in the metrics pipeline and gateway routing.
Fixes included
KEDA APIService conflict — OpenShift clusters have KEDA pre-installed, which owns
v1beta1.external.metrics.k8s.io. Thedeploy_prometheus_adapter()function now detects when the APIService points to KEDA and patches it to redirect to Prometheus Adapter.ServiceMonitor scheme mismatch — When
wva.metrics.secure=false(used in nightly CI), the WVA controller serves HTTP on its metrics endpoint, but the ServiceMonitor was hardcoded to use HTTPS with bearer token auth. Made the ServiceMonitor template conditional on.Values.wva.metrics.secure.HTTPRoute name mismatch (root cause of health check failure) — The static
httproute.yamluses hardcoded resource names matching the helmfile's defaultRELEASE_NAME_POSTFIX(workload-autoscaler). When the nightly CI overrides this toworkload-autoscaling(the guide directory name), the deployed Gateway and InferencePool get different names. The HTTPRoute'sparentRefdoesn't match any Gateway, so no routes bind, and all gateway requests return HTTP 404. Fix: useyqto template the HTTPRoute names based on the actualRELEASE_NAME_POSTFIXbefore applying.Health check diagnostic improvement — Replaced
curl -sf(which suppresses HTTP status codes) with explicit HTTP status capture, making future gateway routing issues much easier to diagnose.Test results progression
Test plan