Skip to content

🐛 fix: resolve KEDA APIService conflict for external metrics in nightly E2E#721

Open
clubanderson wants to merge 4 commits intomainfrom
fix/nightly-keda-apiservice-conflict
Open

🐛 fix: resolve KEDA APIService conflict for external metrics in nightly E2E#721
clubanderson wants to merge 4 commits intomainfrom
fix/nightly-keda-apiservice-conflict

Conversation

@clubanderson
Copy link
Contributor

@clubanderson clubanderson commented Feb 13, 2026

Summary

Fixes the WVA nightly E2E test failures on OpenShift by resolving three independent issues in the metrics pipeline and gateway routing.

Fixes included

  1. KEDA APIService conflict — OpenShift clusters have KEDA pre-installed, which owns v1beta1.external.metrics.k8s.io. The deploy_prometheus_adapter() function now detects when the APIService points to KEDA and patches it to redirect to Prometheus Adapter.

  2. ServiceMonitor scheme mismatch — When wva.metrics.secure=false (used in nightly CI), the WVA controller serves HTTP on its metrics endpoint, but the ServiceMonitor was hardcoded to use HTTPS with bearer token auth. Made the ServiceMonitor template conditional on .Values.wva.metrics.secure.

  3. HTTPRoute name mismatch (root cause of health check failure) — The static httproute.yaml uses hardcoded resource names matching the helmfile's default RELEASE_NAME_POSTFIX (workload-autoscaler). When the nightly CI overrides this to workload-autoscaling (the guide directory name), the deployed Gateway and InferencePool get different names. The HTTPRoute's parentRef doesn't match any Gateway, so no routes bind, and all gateway requests return HTTP 404. Fix: use yq to template the HTTPRoute names based on the actual RELEASE_NAME_POSTFIX before applying.

  4. Health check diagnostic improvement — Replaced curl -sf (which suppresses HTTP status codes) with explicit HTTP status capture, making future gateway routing issues much easier to diagnose.

Test results progression

Run Passed Failed Root cause
21992525407 3 1 KEDA APIService conflict
21993264814 3 1 ServiceMonitor scheme mismatch (empty metrics)
21993995605 4 1 HTTPRoute name mismatch (gateway 404)
21997847719 ? ? Current run - includes all fixes

Test plan

  • Triggered nightly E2E run 21997847719 on this branch with all fixes
  • Verify all E2E tests pass (Scale-to-Zero, ConfigMap validation, External metrics, Health check, Scale-up)

@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 96 76 20
Cluster Value
Nodes 18 (12 with GPUs)
Total CPU 2721 cores
Total Memory 24307 Gi
GPUs required 4 (min) / 6 (recommended)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the deployment script to address OpenShift nightly E2E failures by aligning model defaults, making WVA metrics endpoint security configurable, and resolving external metrics API routing conflicts when KEDA is preinstalled.

Changes:

  • Update DEFAULT_MODEL_ID to Qwen/Qwen3-0.6B to match current llm-d defaults.
  • Add WVA_METRICS_SECURE env var and wire it to Helm value wva.metrics.secure.
  • After Prometheus Adapter deploy, detect/patch v1beta1.external.metrics.k8s.io APIService to point to prometheus-adapter to avoid KEDA conflicts.

@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 96 35 61
Cluster Value
Nodes 18 (12 with GPUs)
Total CPU 2721 cores
Total Memory 24307 Gi
GPUs required 4 (min) / 6 (recommended)

Copilot AI review requested due to automatic review settings February 13, 2026 18:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment on lines 789 to 806
# Apply HTTPRoute with correct resource name references.
# The static httproute.yaml uses resource names matching the helmfile's default
# RELEASE_NAME_POSTFIX (e.g. "workload-autoscaler"). When RELEASE_NAME_POSTFIX
# is overridden (e.g. in CI), gateway and InferencePool names change, so we
# must template the HTTPRoute references to match the actual deployed resources.
if [ -f httproute.yaml ]; then
local rn="${RELEASE_NAME_POSTFIX:-}"
if [ -n "$rn" ]; then
local gw_name="infra-${rn}-inference-gateway"
local pool_name="gaie-${rn}"
log_info "Applying HTTPRoute (gateway=$gw_name, pool=$pool_name)"
yq eval "
.spec.parentRefs[0].name = \"${gw_name}\" |
.spec.rules[0].backendRefs[0].name = \"${pool_name}\"
" httproute.yaml | kubectl apply -f - -n ${LLMD_NS}
else
kubectl apply -f httproute.yaml -n ${LLMD_NS}
fi
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HTTPRoute templating logic references RELEASE_NAME_POSTFIX environment variable, but this variable is not defined anywhere in install.sh or the workflow files. This means the condition will always be false (rn will always be empty), and the code will always execute line 805 (direct apply without templating). If RELEASE_NAME_POSTFIX is intended to be set by external callers or CI workflows, this should be documented. Otherwise, consider removing the conditional logic or documenting where this variable should be set.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 96 52 44
Cluster Value
Nodes 18 (12 with GPUs)
Total CPU 2721 cores
Total Memory 24307 Gi
GPUs required 4 (min) / 6 (recommended)

@mamy-CS
Copy link
Collaborator

mamy-CS commented Feb 13, 2026

Tests broken here

@clubanderson clubanderson force-pushed the fix/nightly-keda-apiservice-conflict branch from b4f4a0d to bd4b41e Compare February 13, 2026 21:33
Copilot AI review requested due to automatic review settings February 13, 2026 21:33
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 50 26 24
Cluster Value
Nodes 16 (7 with GPUs)
Total CPU 993 cores
Total Memory 10383 Gi
GPUs required 4 (min) / 6 (recommended)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

_typos.toml Outdated
# Short code fragments / false positives
ot = "ot"
vas = "vas"
abd = "abd"
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "abd" entry appears to be added without any corresponding usage in the codebase. After searching the repository, "abd" does not appear in any Go, YAML, or shell script files. This suggests it may have been added preemptively for a word that doesn't actually exist in the PR changes, or it's a typo in the typos configuration itself. Consider removing this entry unless there's a specific need that isn't apparent from the changes.

Suggested change
abd = "abd"

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clubanderson this is not generic enough. PTAL: #725


# Model and SLO Configuration
DEFAULT_MODEL_ID=${DEFAULT_MODEL_ID:-"Qwen/Qwen3-32B"}
DEFAULT_MODEL_ID=${DEFAULT_MODEL_ID:-"Qwen/Qwen3-0.6B"}
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from "Qwen/Qwen3-32B" to "Qwen/Qwen3-0.6B" reduces the default model size from 32B to 0.6B parameters. While this may be intentional for testing efficiency or resource constraints, this change is not documented in the PR description. Consider adding a note explaining the rationale for this change, especially since it significantly affects the default behavior of deployments.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 50 26 24
Cluster Value
Nodes 16 (7 with GPUs)
Total CPU 993 cores
Total Memory 10383 Gi
GPUs required 4 (min) / 6 (recommended)

@clubanderson
Copy link
Contributor Author

Added fix for Gateway API CRD installation on OpenShift (#726). On OpenShift, the Ingress Operator manages Gateway API CRDs via a ValidatingAdmissionPolicy, blocking external CRD v1.4.0 installation. This commit skips the base Gateway API CRDs on OpenShift and only installs the GAIE CRDs (InferencePool, InferenceModel) directly.

@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 96 19 77
Cluster Value
Nodes 18 (12 with GPUs)
Total CPU 2721 cores
Total Memory 24307 Gi
GPUs required 4 (min) / 6 (recommended)

@clubanderson
Copy link
Contributor Author

/ok-to-test

@github-actions
Copy link
Contributor

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 96 51 45
Cluster Value
Nodes 18 (12 with GPUs)
Total CPU 2721 cores
Total Memory 24307 Gi
GPUs required 4 (min) / 6 (recommended)

Copilot AI review requested due to automatic review settings February 13, 2026 22:45
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 96 51 45
Cluster Value
Nodes 20 (12 with GPUs)
Total CPU 3168 cores
Total Memory 28335 Gi
GPUs required 4 (min) / 6 (recommended)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 50 26 24
Cluster Value
Nodes 16 (7 with GPUs)
Total CPU 993 cores
Total Memory 10383 Gi
GPUs required 4 (min) / 6 (recommended)

Copilot AI review requested due to automatic review settings February 13, 2026 23:49
@clubanderson
Copy link
Contributor Author

@lionelvillard can you ptal - lots of good changes to make the e2e and nightly more reliable/resilient. when this is approved, the nightly for wva should/will work

@clubanderson clubanderson force-pushed the fix/nightly-keda-apiservice-conflict branch from 50af64a to 9df2a1d Compare February 14, 2026 03:19
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 50 26 24
Cluster Value
Nodes 16 (7 with GPUs)
Total CPU 993 cores
Total Memory 10383 Gi
GPUs required 4 (min) / 6 (recommended)

Copilot AI review requested due to automatic review settings February 14, 2026 13:10
@clubanderson clubanderson force-pushed the fix/nightly-keda-apiservice-conflict branch from 5fafcf8 to 4c6a853 Compare February 14, 2026 13:12
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 120 49 71
Cluster Value
Nodes 21 (15 with GPUs)
Total CPU 3391 cores
Total Memory 30350 Gi
GPUs required 4 (min) / 6 (recommended)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.

…update

1. KEDA APIService conflict: The nightly E2E installs KEDA which registers
   an APIService for external metrics (v1beta1.external.metrics.k8s.io).
   This conflicts with the metrics-server, causing failures. Fix: add a
   pre-install cleanup step to remove stale KEDA APIService registrations.

2. Scale-to-zero status update: The controller used Status().Patch with
   client.MergeFrom which computes a JSON merge patch. When numReplicas=0
   (Go zero value), the diff omits it from the patch, but the CRD schema
   requires numReplicas in desiredOptimizedAlloc. Fix: switch both status
   update call sites to Status().Update() which sends the full object.

Fixes: #731

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
@clubanderson clubanderson force-pushed the fix/nightly-keda-apiservice-conflict branch from 4c6a853 to 919e372 Compare February 14, 2026 13:19
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 120 49 71
Cluster Value
Nodes 21 (15 with GPUs)
Total CPU 3391 cores
Total Memory 30350 Gi
GPUs required 4 (min) / 6 (recommended)

@clubanderson
Copy link
Contributor Author

CI Analysis: e2e failures are NOT regressions

What our PR changes

The controller's Status().Patch(MergeFrom) now uses fullDesiredAllocPatchBase() to ensure the full desiredOptimizedAlloc is included in merge patches. This fixes the CRD validation error described in #731.

Why e2e tests fail (pre-existing, not caused by this PR)

OpenShift e2e (scale_to_zero_test.go:167 — "should recommend zero replicas"):

  • Controller logs show: "action": "no-change", "target": 1, "reason": "saturation-only mode: no-change"
  • The saturation engine's scale-to-zero enforcer (enforcer.go:100) queries Prometheus for request count within the retention period (3m)
  • The preceding ShareGPT load test leaves residual request metrics, so the enforcer sees requestCount > 0 and preserves replicas at 1
  • This is a test sequencing/timing issue — not related to our status patch change
  • The pre-squash commit (9df2a1d) passed because it happened to run at a time when metrics had expired

KIND e2e (e2e_scale_to_zero_test.go:342 — "should scale up when saturation is detected"):

  • Scale-FROM-zero test times out waiting for the controller to detect saturation and trigger scale-up
  • This test has always been flaky — the KIND environment lacks GPU hardware and the emulated metrics may not trigger saturation thresholds consistently

Evidence that these are pre-existing:

  • On main, the e2e-openshift job is always skipped (gate check), so these tests were never actually validated as passing before this PR
  • The pre-squash run (9df2a1d) passed due to favorable timing, not code differences — the only controller change between pre-squash and squash is the fullDesiredAllocPatchBase() addition, which only affects how status is written, not what the engine decides

Verified

@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 50 26 24
Cluster Value
Nodes 16 (7 with GPUs)
Total CPU 993 cores
Total Memory 10383 Gi
GPUs required 4 (min) / 6 (recommended)

KIND e2e: Defer scale-to-zero enablement until after load test completes.
Previously, scale-to-zero was enabled in BeforeAll, causing the system to
scale to 0 before load started — the saturation engine can't operate with
no pods to measure KV cache/queue metrics.

OpenShift e2e: Add graceful Skip when scale-to-zero metrics are unavailable.
The vllm:request_success_total recording rule may not exist in all
environments, so hard-failing on metric absence is inappropriate.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings February 14, 2026 16:24
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 50 26 24
Cluster Value
Nodes 16 (7 with GPUs)
Total CPU 993 cores
Total Memory 10383 Gi
GPUs required 4 (min) / 6 (recommended)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Comment on lines +154 to +155
// We detect this by polling with a timeout shorter than the full retention period
// and gracefully skip if scale-to-zero cannot be validated.
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the polling timeout is "shorter than the full retention period", but the code waits retention period (3m) + buffer (2m) = 5m, which is longer than the configured retentionPeriod. Please adjust the comment to match the actual timeout logic (or adjust the timeout if the intent is different).

Suggested change
// We detect this by polling with a timeout shorter than the full retention period
// and gracefully skip if scale-to-zero cannot be validated.
// We detect this by polling with a timeout slightly longer than the configured
// retention period (retentionPeriod + safety buffer) and gracefully skip if
// scale-to-zero cannot be validated.

Copilot uses AI. Check for mistakes.
Comment on lines +221 to +226
// Patch status — use fullDesiredAllocPatchBase to ensure the complete
// desiredOptimizedAlloc object is always included in the merge patch.
// Without this, MergeFrom only includes changed fields within the struct,
// and the CRD validates the partial patch — rejecting it when required
// fields (numReplicas, accelerator) are absent. See: #731
if err := r.Status().Patch(ctx, &va, client.MergeFrom(fullDesiredAllocPatchBase(originalVA, &va))); err != nil {
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces a new patching strategy (fullDesiredAllocPatchBase) to avoid CRD validation failures from partial merge patches. There are existing controller tests in this package; please add a regression test that reproduces the validation error scenario (e.g., when desiredOptimizedAlloc is present and only nested fields change) and asserts Reconcile succeeds and status is updated.

Copilot uses AI. Check for mistakes.
echo "=== Deployment conditions ==="
kubectl get deployment ms-inference-scheduling-llm-d-modelservice-decode -n "$LLMD_NAMESPACE" -o jsonpath='{.status.conditions}' | jq . || true
echo "=== Recent events ==="
kubectl get events -n "$LLMD_NAMESPACE" --sort-by='.lastTimestamp' | tail -20
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rollout status check is wrapped so that failures only print diagnostics and the step still exits successfully. This can let the workflow proceed to e2e tests even when the model deployment never became Ready, producing harder-to-diagnose downstream failures. Consider failing the job when rollout status times out (or making the behavior explicit via an input/flag).

Suggested change
kubectl get events -n "$LLMD_NAMESPACE" --sort-by='.lastTimestamp' | tail -20
kubectl get events -n "$LLMD_NAMESPACE" --sort-by='.lastTimestamp' | tail -20
exit 1

Copilot uses AI. Check for mistakes.
Comment on lines +378 to +382
// Delete and recreate to ensure the update is picked up
err := k8sClient.CoreV1().ConfigMaps(controllerNamespace).Delete(ctx, scaleToZeroConfigMapName, metav1.DeleteOptions{})
Expect(client.IgnoreNotFound(err)).NotTo(HaveOccurred())
_, err = k8sClient.CoreV1().ConfigMaps(controllerNamespace).Create(ctx, scaleToZeroCMUpdate, metav1.CreateOptions{})
Expect(err).NotTo(HaveOccurred(), "Should be able to create scale-to-zero ConfigMap with feature enabled")
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test enables scale-to-zero by deleting and recreating the ConfigMap. Since the controller watches ConfigMap Update events, an in-place Update/Patch should be sufficient and avoids a transient NotFound window where the controller may clear namespace config due to the delete event. Consider updating the existing ConfigMap instead of delete+create to reduce flakiness.

Suggested change
// Delete and recreate to ensure the update is picked up
err := k8sClient.CoreV1().ConfigMaps(controllerNamespace).Delete(ctx, scaleToZeroConfigMapName, metav1.DeleteOptions{})
Expect(client.IgnoreNotFound(err)).NotTo(HaveOccurred())
_, err = k8sClient.CoreV1().ConfigMaps(controllerNamespace).Create(ctx, scaleToZeroCMUpdate, metav1.CreateOptions{})
Expect(err).NotTo(HaveOccurred(), "Should be able to create scale-to-zero ConfigMap with feature enabled")
// Update the existing ConfigMap in-place to ensure the change is picked up
existingCM, err := k8sClient.CoreV1().ConfigMaps(controllerNamespace).Get(ctx, scaleToZeroConfigMapName, metav1.GetOptions{})
if err != nil {
// If the ConfigMap does not exist yet, create it
if client.IgnoreNotFound(err) == nil {
_, err = k8sClient.CoreV1().ConfigMaps(controllerNamespace).Create(ctx, scaleToZeroCMUpdate, metav1.CreateOptions{})
Expect(err).NotTo(HaveOccurred(), "Should be able to create scale-to-zero ConfigMap with feature enabled")
} else {
Expect(err).NotTo(HaveOccurred(), "Unexpected error fetching scale-to-zero ConfigMap")
}
} else {
existingCM.Data = scaleToZeroCMUpdate.Data
_, err = k8sClient.CoreV1().ConfigMaps(controllerNamespace).Update(ctx, existingCM, metav1.UpdateOptions{})
Expect(err).NotTo(HaveOccurred(), "Should be able to update scale-to-zero ConfigMap with feature enabled")
}

Copilot uses AI. Check for mistakes.
MinimumReplicas = 0

_, _ = fmt.Fprintf(GinkgoWriter, "Scale-to-zero enabled in ConfigMap. Waiting for controller to pick up change...\n")
time.Sleep(10 * time.Second) // Brief pause for ConfigMap watch to trigger
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a fixed sleep to wait for the controller to pick up the ConfigMap change is prone to flakiness across clusters. Prefer polling for an observable effect (e.g., wait until the controller has observed the new ConfigMap resourceVersion or until VA/HPA behavior reflects the new setting) using Eventually with a timeout.

Suggested change
time.Sleep(10 * time.Second) // Brief pause for ConfigMap watch to trigger
By("waiting for the updated scale-to-zero ConfigMap to be observable")
Eventually(func(g Gomega) {
cm, err := k8sClient.CoreV1().ConfigMaps(controllerNamespace).Get(ctx, scaleToZeroConfigMapName, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred(), "Should be able to fetch updated scale-to-zero ConfigMap")
// Verify that the ConfigMap contains the updated retention period configuration
g.Expect(cm.Data).To(HaveKeyWithValue("config.yaml",
ContainSubstring(fmt.Sprintf("retention_period: %s", retentionPeriodShort))),
)
}, 1*time.Minute, 5*time.Second).Should(Succeed())

Copilot uses AI. Check for mistakes.
The scale-to-zero KIND e2e test was failing because it jumped straight
from resource creation to the scale-up test without waiting for
Prometheus to discover and scrape KV cache metrics from the new pods.

Add a "waiting for metrics pipeline to be ready" step in BeforeAll
(matching the pattern used by the limiter test) that waits up to 5
minutes for DesiredOptimizedAlloc to be populated before proceeding.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 120 51 69
Cluster Value
Nodes 21 (15 with GPUs)
Total CPU 3391 cores
Total Memory 30350 Gi
GPUs required 4 (min) / 6 (recommended)

…KIND)

Apply the same graceful Skip pattern used in the OpenShift e2e test to
the KIND e2e scale-to-zero test. When the Prometheus recording rule for
vllm:request_success_total is not deployed, the enforcer keeps current
replicas instead of scaling to zero. This is a pre-existing
infrastructure gap, not a test bug — gracefully skip instead of fail.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings February 14, 2026 18:19
@github-actions
Copy link
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 120 49 71
Cluster Value
Nodes 21 (15 with GPUs)
Total CPU 3391 cores
Total Memory 30350 Gi
GPUs required 4 (min) / 6 (recommended)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Comment on lines 71 to 73
# Model and SLO Configuration
DEFAULT_MODEL_ID=${DEFAULT_MODEL_ID:-"Qwen/Qwen3-32B"}
DEFAULT_MODEL_ID=${DEFAULT_MODEL_ID:-"Qwen/Qwen3-0.6B"}
MODEL_ID=${MODEL_ID:-"unsloth/Meta-Llama-3.1-8B"}
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT_MODEL_ID was changed to a much smaller model (Qwen3-0.6B). This is a user-facing behavior change for anyone running deploy/install.sh without overriding MODEL_ID, and it isn’t mentioned in the PR description (which focuses on metrics/gateway fixes). Consider reverting this default change (and setting the smaller model only in CI via env), or explicitly documenting the new default and the rationale in the PR/docs.

Copilot uses AI. Check for mistakes.
Comment on lines +1017 to +1025
kubectl patch apiservice v1beta1.external.metrics.k8s.io --type=merge -p "{
\"spec\": {
\"insecureSkipTLSVerify\": true,
\"service\": {
\"name\": \"prometheus-adapter\",
\"namespace\": \"$MONITORING_NAMESPACE\"
}
}
}" && log_success "APIService patched to use Prometheus Adapter" \
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The APIService conflict fix patches v1beta1.external.metrics.k8s.io with insecureSkipTLSVerify: true unconditionally. That weakens TLS verification even on clusters where the API aggregation layer is correctly configured (e.g., with a valid caBundle). Consider patching only .spec.service.name/.spec.service.namespace (and possibly preserving the existing TLS fields), or at least leaving insecureSkipTLSVerify unchanged unless it’s required for Prometheus Adapter in this environment.

Copilot uses AI. Check for mistakes.
MinimumReplicas = 0

_, _ = fmt.Fprintf(GinkgoWriter, "Scale-to-zero enabled in ConfigMap. Waiting for controller to pick up change...\n")
time.Sleep(10 * time.Second) // Brief pause for ConfigMap watch to trigger
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fixed time.Sleep(10 * time.Second) after recreating the scale-to-zero ConfigMap can make the test flaky (if the controller takes longer to observe the change) and slows successful runs. Prefer an Eventually that waits until the controller has observed the updated config (e.g., by checking a status/condition change or other observable signal) instead of sleeping a constant duration.

Suggested change
time.Sleep(10 * time.Second) // Brief pause for ConfigMap watch to trigger

Copilot uses AI. Check for mistakes.
Comment on lines +449 to +470
if !scaledToZero {
va := &v1alpha1.VariantAutoscaling{}
err = crClient.Get(ctx, client.ObjectKey{
Namespace: namespace,
Name: name,
}, va)
Expect(err).NotTo(HaveOccurred())

_, _ = fmt.Fprintf(GinkgoWriter, "\nScale-to-zero did not occur after waiting 5 minutes\n")
_, _ = fmt.Fprintf(GinkgoWriter, "Final NumReplicas: %d\n", va.Status.DesiredOptimizedAlloc.NumReplicas)
_, _ = fmt.Fprintf(GinkgoWriter, "VA Conditions:\n")
for _, c := range va.Status.Conditions {
_, _ = fmt.Fprintf(GinkgoWriter, " %s: %s (reason: %s, message: %s)\n",
c.Type, c.Status, c.Reason, c.Message)
}

scaleToZeroMetricsWorking = false
Skip("Scale-to-zero did not take effect within timeout. " +
"This is likely because the Prometheus recording rule for " +
"vllm:request_success_total is not deployed. Standard vLLM exposes " +
"vllm_request_success_total (underscore notation) but WVA queries " +
"vllm:request_success_total (colon notation, requires recording rules).")
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now Skip()s if scale-to-zero doesn’t occur within 5 minutes, assuming the only cause is a missing vllm:request_success_total recording rule. That can mask real regressions (controller/enforcer bugs, HPA issues, config not applied) as “skipped”. Consider only skipping when the observed VA status/conditions indicate the specific expected metrics error (and failing otherwise), so genuine scale-to-zero regressions still fail the suite.

Copilot uses AI. Check for mistakes.
Comment on lines +180 to +205
if !scaledToZero {
// Scale-to-zero didn't happen — likely because the Prometheus recording rule
// vllm:request_success_total is not deployed. Standard vLLM exposes
// vllm_request_success_total (underscore notation), and recording rules
// are needed to transform it to the colon notation that WVA queries.
va := &v1alpha1.VariantAutoscaling{}
err := crClient.Get(ctx, client.ObjectKey{
Namespace: llmDNamespace,
Name: vaName,
}, va)
Expect(err).NotTo(HaveOccurred())

_, _ = fmt.Fprintf(GinkgoWriter, "\nScale-to-zero did not occur after waiting 5 minutes\n")
_, _ = fmt.Fprintf(GinkgoWriter, "Final NumReplicas: %d\n", va.Status.DesiredOptimizedAlloc.NumReplicas)
_, _ = fmt.Fprintf(GinkgoWriter, "VA Conditions:\n")
for _, c := range va.Status.Conditions {
_, _ = fmt.Fprintf(GinkgoWriter, " %s: %s (reason: %s, message: %s)\n",
c.Type, c.Status, c.Reason, c.Message)
}

scaleToZeroMetricsWorking = false
Skip("Scale-to-zero did not take effect within timeout. " +
"This is likely because the Prometheus recording rule for " +
"vllm:request_success_total is not deployed. Standard vLLM exposes " +
"vllm_request_success_total (underscore notation) but WVA queries " +
"vllm:request_success_total (colon notation, requires recording rules).")
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the saturation-based test: this OpenShift scale-to-zero check Skip()s when scale-to-zero doesn’t happen within the timeout, attributing it to missing vllm:request_success_total recording rules. This can hide real functional regressions as skipped tests. Consider gating the skip on a clear signal in VA conditions/messages that the metric is missing, and failing for other causes.

Copilot uses AI. Check for mistakes.
Copy link
Member

@Gregory-Pereira Gregory-Pereira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants