🐛 Fix nightly E2E: update DEFAULT_MODEL_ID and add WVA_METRICS_SECURE by clubanderson · Pull Request #720 · llm-d/llm-d-workload-variant-autoscaler

clubanderson · 2026-02-13T15:30:40Z

Summary

Two fixes for the nightly E2E tests on OpenShift:

1. Update DEFAULT_MODEL_ID from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B
   - The llm-d repo changed its default model, but install.sh still had the old value
   - The yq model replacement silently failed: vLLM served Qwen/Qwen3-0.6B while WVA queried for unsloth/Meta-Llama-3.1-8B metrics
   - Result: MetricsAvailable=False, DesiredOptimizedAlloc empty, Scale-to-Zero tests fail
2. Add WVA_METRICS_SECURE env var (default: true) wired to wva.metrics.secure Helm value
   - When set to false, disables bearer token auth on the WVA /metrics endpoint
   - Needed for OpenShift user-workload-monitoring where the controller-manager SA token auth fails with "Token does not match server's copy"
   - Without this, Prometheus can't scrape wva_desired_replicas → external metrics API empty → ShareGPT test fails
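
The env-var-to-Helm wiring in the second fix can be sketched roughly as below. This is an illustration, not the actual install.sh change: the helper name `metrics_secure_flag` is hypothetical, and only `WVA_METRICS_SECURE`, its default of `true`, and the `wva.metrics.secure` value name come from the description above.

```shell
#!/bin/sh
# Sketch only: metrics_secure_flag is a hypothetical helper, not taken from install.sh.

# Default to secure (bearer token auth enabled) unless the caller overrides it.
WVA_METRICS_SECURE="${WVA_METRICS_SECURE:-true}"

# Print the key=value pair to hand to `helm --set`, rejecting anything
# other than the two boolean strings so a typo cannot slip through.
metrics_secure_flag() {
  case "$WVA_METRICS_SECURE" in
    true|false)
      printf '%s\n' "wva.metrics.secure=${WVA_METRICS_SECURE}"
      ;;
    *)
      echo "WVA_METRICS_SECURE must be true or false, got: ${WVA_METRICS_SECURE}" >&2
      return 1
      ;;
  esac
}

# Example use (release and chart names are placeholders):
#   helm upgrade --install <release> <chart> --set "$(metrics_secure_flag)"
```

Validating the value before passing it on is a design choice of this sketch; the PR itself only states that the variable is forwarded via a `--set` flag.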

Before fix: 1 Passed, 2 Failed, 22 Skipped
After model ID fix only: 3 Passed, 1 Failed, 21 Skipped
Expected after both fixes: 4+ Passed, 0 Failed

Companion PR: llm-d/llm-d-infra#20 (sets WVA_METRICS_SECURE=false in reusable workflow)

Note: Nightly workflow temporarily points to llm-d-infra@fix/nightly-metrics-secure. Will revert to @main after llm-d-infra#20 merges.

Test plan

Trigger nightly from this branch — all core tests should pass
Verify MetricsAvailable=True in metrics pipeline verification step
Verify wva_desired_replicas is accessible via external metrics API
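
The last check in the test plan could be done by hand along these lines. A minimal sketch: the helper name `external_metric_path` and the namespace in the usage comment are assumptions; the path layout is the standard Kubernetes external metrics API route.

```shell
#!/bin/sh
# Sketch only: the helper name and the namespace used below are assumptions.

# Build the raw API path for an external metric in a given namespace.
external_metric_path() {
  ns="$1"
  metric="$2"
  printf '/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s\n' "$ns" "$metric"
}

# Example use against a live cluster (namespace is a placeholder):
#   kubectl get --raw "$(external_metric_path my-namespace wva_desired_replicas)"
```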

Two fixes for nightly E2E tests:

1. Update DEFAULT_MODEL_ID from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B to match the llm-d repo's current default. The stale value caused the yq model replacement to silently fail, resulting in vLLM serving the wrong model and MetricsAvailable=False.
2. Add WVA_METRICS_SECURE env var (default: true) wired to wva.metrics.secure helm value. When set to false, disables bearer token auth on the /metrics endpoint. Needed for OpenShift user-workload-monitoring where the SA token auth fails.

Temporarily points nightly workflow to llm-d-infra branch fix/nightly-metrics-secure (PR #20) for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>

github-actions · 2026-02-13T15:33:18Z

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource	Total	Allocated	Available
GPUs	50	26	24

Cluster	Value
Nodes	16 (7 with GPUs)
Total CPU	993 cores
Total Memory	10383 Gi
GPUs required	4 (min) / 6 (recommended)

Copilot

Pull request overview

This PR fixes two issues affecting nightly E2E tests on OpenShift. The first fix updates the DEFAULT_MODEL_ID to match the llm-d repository's current default model, ensuring that the yq model replacement logic works correctly. The second fix adds a WVA_METRICS_SECURE environment variable to disable bearer token authentication on the WVA /metrics endpoint, which is needed for OpenShift's user-workload-monitoring to successfully scrape metrics. The PR also temporarily references a companion PR branch in the workflow configuration.

Changes:

Update DEFAULT_MODEL_ID from "Qwen/Qwen3-32B" to "Qwen/Qwen3-0.6B" to align with llm-d repository defaults
Add WVA_METRICS_SECURE environment variable (default: true) to control metrics endpoint authentication, wired through to Helm chart
Temporarily reference llm-d-infra@fix/nightly-metrics-secure branch until companion PR merges
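
The first change addresses a replacement that failed without any error. One way to make that class of failure visible is a pre-check that the expected model ID is actually present before substituting it. The sketch below is not part of this PR: `require_model_id` is a hypothetical helper, and it uses `grep` for a literal match rather than the yq logic in install.sh.

```shell
#!/bin/sh
# Sketch only: require_model_id is hypothetical and not taken from install.sh.

# Fail loudly when the expected model ID does not occur in the file, so a
# stale default surfaces as an error instead of a silent no-op replacement.
require_model_id() {
  file="$1"
  expected="$2"
  if ! grep -qF -- "$expected" "$file"; then
    echo "model ID '$expected' not found in $file; the default may be stale" >&2
    return 1
  fi
}

# Example use before a replacement step (file name is a placeholder):
#   require_model_id values.yaml "$DEFAULT_MODEL_ID" || exit 1
```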

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
deploy/install.sh	Updates DEFAULT_MODEL_ID constant and adds WVA_METRICS_SECURE environment variable with default value, wiring it to Helm chart via --set flag
.github/workflows/nightly-e2e-openshift.yaml	Temporarily references fix/nightly-metrics-secure branch of llm-d-infra reusable workflow instead of @main

Copilot · 2026-02-13T15:34:26Z

.github/workflows/nightly-e2e-openshift.yaml

 jobs:
  nightly:
-    uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@main
+    uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@fix/nightly-metrics-secure


This temporary reference to the fix/nightly-metrics-secure branch deviates from the established pattern of referencing reusable workflows at @main. While documented in the PR description as temporary, this creates a merge dependency where this PR cannot be merged and tested properly until llm-d-infra#20 is merged first. Consider either: 1) merging llm-d-infra#20 first and updating this to @main, or 2) keeping this at @main and accepting that the nightly workflow may fail until llm-d-infra#20 is merged. The current approach blocks independent verification of these fixes.

Suggested change

-    uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@fix/nightly-metrics-secure
+    uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@main

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>

github-actions · 2026-02-13T15:41:28Z

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource	Total	Allocated	Available
GPUs	50	26	24

Cluster	Value
Nodes	16 (7 with GPUs)
Total CPU	993 cores
Total Memory	10383 Gi
GPUs required	4 (min) / 6 (recommended)

Three fixes for the nightly E2E test failures:

1. DEFAULT_MODEL_ID: Update from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B to match the llm-d repo's current default model. The stale value caused yq replacement to silently fail, making WVA query wrong model metrics.
2. WVA_METRICS_SECURE: Add env var to control bearer token auth on the WVA /metrics endpoint. OpenShift's user-workload-monitoring cannot authenticate with the controller-manager SA token.
3. KEDA APIService conflict: On clusters with KEDA, the v1beta1.external.metrics.k8s.io APIService points to KEDA's metrics server, which only serves ScaledObject metrics. After deploying Prometheus Adapter, detect and patch the APIService to point to Prometheus Adapter instead.

Supersedes #720.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
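
The detect-and-patch step for the KEDA conflict might look roughly like this. A sketch under stated assumptions: the helper names are hypothetical, and the Prometheus Adapter service name and namespace in the usage comment are placeholders, not values taken from #721. The merge patch here only changes the service name and namespace; other APIService fields are left as they are.

```shell
#!/bin/sh
# Sketch only: helper names, and the adapter service name and namespace in the
# usage example, are assumptions rather than values from the actual fix.

APISERVICE="v1beta1.external.metrics.k8s.io"

# Build a JSON merge patch that repoints the APIService at a given Service.
apiservice_patch() {
  svc_name="$1"
  svc_ns="$2"
  printf '{"spec":{"service":{"name":"%s","namespace":"%s"}}}\n' "$svc_name" "$svc_ns"
}

# Return 0 when the APIService currently targets a different Service name.
needs_repoint() {
  current="$1"
  desired="$2"
  [ "$current" != "$desired" ]
}

# Example use against a live cluster:
#   current=$(kubectl get apiservice "$APISERVICE" -o jsonpath='{.spec.service.name}')
#   if needs_repoint "$current" prometheus-adapter; then
#     kubectl patch apiservice "$APISERVICE" --type=merge \
#       -p "$(apiservice_patch prometheus-adapter <adapter-namespace>)"
#   fi
```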

clubanderson · 2026-02-13T15:53:25Z

Superseded by #721 which also includes the KEDA APIService conflict fix.


Copilot AI review requested due to automatic review settings February 13, 2026 15:30

Copilot started reviewing on behalf of clubanderson February 13, 2026 15:31

Copilot AI reviewed Feb 13, 2026


chore: revert nightly workflow to @main (infra PR #20 merged)

638d3fc

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>

clubanderson mentioned this pull request Feb 13, 2026

🐛 fix: resolve KEDA APIService conflict for external metrics in nightly E2E #721

Merged

2 tasks

clubanderson closed this Feb 13, 2026

clubanderson wants to merge 2 commits into main from fix/nightly-model-id-and-metrics-auth
