🐛 Fix nightly E2E: update DEFAULT_MODEL_ID and add WVA_METRICS_SECURE#720
clubanderson wants to merge 2 commits into main
Two fixes for nightly E2E tests:
1. Update DEFAULT_MODEL_ID from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B to match the llm-d repo's current default. The stale value caused the yq model replacement to silently fail, resulting in vLLM serving the wrong model and MetricsAvailable=False.
2. Add WVA_METRICS_SECURE env var (default: true) wired to wva.metrics.secure helm value. When set to false, disables bearer token auth on the /metrics endpoint. Needed for OpenShift user-workload-monitoring where the SA token auth fails.
Temporarily points nightly workflow to llm-d-infra branch fix/nightly-metrics-secure (PR #20) for testing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
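The silent replacement failure described in the first fix is the kind of problem a pre-check can surface. The following is a minimal sketch, not the logic in deploy/install.sh: `replace_model_id` is a hypothetical helper, the values file layout is invented for illustration, and it uses grep and bash string substitution instead of yq so that it runs without extra tools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: swap one model ID for another in a file, and fail
# loudly when the expected default is not present (instead of doing nothing).
replace_model_id() {
  local file="$1" old="$2" new="$3"
  if ! grep -qF -- "$old" "$file"; then
    echo "ERROR: expected model ID '$old' not found in $file" >&2
    return 1
  fi
  local content
  content="$(cat "$file")"
  # Quoting "$old" makes bash treat it as a literal string, not a glob pattern.
  printf '%s\n' "${content//"$old"/$new}" > "$file"
}

# Demo with a temporary file; the key names below are illustrative only.
tmp="$(mktemp)"
printf 'modelArtifacts:\n  uri: hf://Qwen/Qwen3-0.6B\n' > "$tmp"

# Current default present: the replacement succeeds.
replace_model_id "$tmp" "Qwen/Qwen3-0.6B" "unsloth/Meta-Llama-3.1-8B" && echo "replaced"

# Stale default: the helper reports the mismatch instead of silently skipping.
replace_model_id "$tmp" "Qwen/Qwen3-32B" "unsloth/Meta-Llama-3.1-8B" || echo "stale default detected"
```

With a guard like this, a drifted DEFAULT_MODEL_ID would stop the install step instead of leaving vLLM on a model that WVA does not query.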
GPU Pre-flight Check ✅ GPUs are available for e2e-openshift tests. Proceeding with deployment.
Pull request overview
This PR fixes two issues affecting nightly E2E tests on OpenShift. The first fix updates the DEFAULT_MODEL_ID to match the llm-d repository's current default model, ensuring that the yq model replacement logic works correctly. The second fix adds a WVA_METRICS_SECURE environment variable to disable bearer token authentication on the WVA /metrics endpoint, which is needed for OpenShift's user-workload-monitoring to successfully scrape metrics. The PR also temporarily references a companion PR branch in the workflow configuration.
Changes:
- Update DEFAULT_MODEL_ID from "Qwen/Qwen3-32B" to "Qwen/Qwen3-0.6B" to align with llm-d repository defaults
- Add WVA_METRICS_SECURE environment variable (default: true) to control metrics endpoint authentication, wired through to Helm chart
- Temporarily reference llm-d-infra@fix/nightly-metrics-secure branch until companion PR merges
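The second change can be pictured as an environment variable with a default that is passed through to Helm. This is a hedged sketch of that pattern rather than the actual contents of deploy/install.sh: the validation step, the `HELM_ARGS` array, and the release and chart placeholders are assumptions; only the variable name, its default, and the `wva.metrics.secure` value come from the PR.

```shell
#!/usr/bin/env bash
# Default to secure metrics (bearer token auth on /metrics) unless overridden.
WVA_METRICS_SECURE="${WVA_METRICS_SECURE:-true}"

# Reject anything other than true/false so a typo cannot silently change auth.
case "$WVA_METRICS_SECURE" in
  true|false) ;;
  *) echo "ERROR: WVA_METRICS_SECURE must be 'true' or 'false'" >&2; exit 1 ;;
esac

# Collect Helm arguments in an array; wva.metrics.secure is the value named in the PR.
HELM_ARGS=(--set "wva.metrics.secure=${WVA_METRICS_SECURE}")

# Printed rather than executed; <release> and <chart> are placeholders.
echo "helm upgrade --install <release> <chart> ${HELM_ARGS[*]}"
```

Running it as `WVA_METRICS_SECURE=false ./script.sh` would print the same command with `wva.metrics.secure=false`, which is the setting the companion workflow change is described as using.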
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| deploy/install.sh | Updates DEFAULT_MODEL_ID constant and adds WVA_METRICS_SECURE environment variable with default value, wiring it to Helm chart via --set flag |
| .github/workflows/nightly-e2e-openshift.yaml | Temporarily references fix/nightly-metrics-secure branch of llm-d-infra reusable workflow instead of @main |
  jobs:
    nightly:
-     uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@main
+     uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@fix/nightly-metrics-secure
This temporary reference to the fix/nightly-metrics-secure branch deviates from the established pattern of referencing reusable workflows at @main. While documented in the PR description as temporary, this creates a merge dependency where this PR cannot be merged and tested properly until llm-d-infra#20 is merged first. Consider either: 1) merging llm-d-infra#20 first and updating this to @main, or 2) keeping this at @main and accepting that the nightly workflow may fail until llm-d-infra#20 is merged. The current approach blocks independent verification of these fixes.
Suggested change:
-     uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@fix/nightly-metrics-secure
+     uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@main
Three fixes for the nightly E2E test failures:
1. DEFAULT_MODEL_ID: Update from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B to match the llm-d repo's current default model. The stale value caused yq replacement to silently fail, making WVA query wrong model metrics.
2. WVA_METRICS_SECURE: Add env var to control bearer token auth on the WVA /metrics endpoint. OpenShift's user-workload-monitoring cannot authenticate with the controller-manager SA token.
3. KEDA APIService conflict: On clusters with KEDA, the v1beta1.external.metrics.k8s.io APIService points to KEDA's metrics server, which only serves ScaledObject metrics. After deploying Prometheus Adapter, detect and patch the APIService to point to Prometheus Adapter instead.
Supersedes #720.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
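The detect-and-patch step for the KEDA APIService conflict might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from #721: the Prometheus Adapter service name and namespace, the example KEDA service name, and the helpers `needs_patch` and `build_patch` are all hypothetical. The kubectl commands are printed or shown in comments so the sketch runs without a cluster.

```shell
#!/usr/bin/env bash
APISERVICE="v1beta1.external.metrics.k8s.io"
# Assumed names; the real Prometheus Adapter service and namespace may differ.
ADAPTER_SERVICE="${ADAPTER_SERVICE:-prometheus-adapter}"
ADAPTER_NAMESPACE="${ADAPTER_NAMESPACE:-monitoring}"

# Returns 0 when the APIService is backed by a service other than the adapter.
needs_patch() {
  local current_service="$1"
  [ -n "$current_service" ] && [ "$current_service" != "$ADAPTER_SERVICE" ]
}

# Builds the merge patch that repoints the APIService at Prometheus Adapter.
build_patch() {
  printf '{"spec":{"service":{"name":"%s","namespace":"%s"}}}' \
    "$ADAPTER_SERVICE" "$ADAPTER_NAMESPACE"
}

# On a live cluster the current backend could be read with:
#   kubectl get apiservice "$APISERVICE" -o jsonpath='{.spec.service.name}'
# Example value only; the actual KEDA service name depends on how it was installed.
current="keda-metrics-apiserver"

if needs_patch "$current"; then
  # Printed rather than executed so the sketch runs without a cluster.
  echo "kubectl patch apiservice $APISERVICE --type merge -p '$(build_patch)'"
fi
```

Checking the current backend first keeps the step idempotent: on clusters without KEDA, or after a previous run, nothing is patched.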
Superseded by #721 which also includes the KEDA APIService conflict fix.
Summary
Two fixes for the nightly E2E tests on OpenShift:
- Update DEFAULT_MODEL_ID from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B
  - vLLM served Qwen/Qwen3-0.6B while WVA queried for unsloth/Meta-Llama-3.1-8B metrics
  - Result: MetricsAvailable=False, DesiredOptimizedAlloc empty, Scale-to-Zero tests fail
- Add WVA_METRICS_SECURE env var (default: true) wired to wva.metrics.secure Helm value
  - When set to false, disables bearer token auth on the WVA /metrics endpoint
  - With auth enabled, user-workload-monitoring cannot scrape wva_desired_replicas → external metrics API empty → ShareGPT test fails
Before fix: 1 Passed, 2 Failed, 22 Skipped
After model ID fix only: 3 Passed, 1 Failed, 21 Skipped
Expected after both fixes: 4+ Passed, 0 Failed
Companion PR: llm-d/llm-d-infra#20 (sets WVA_METRICS_SECURE=false in reusable workflow)
Test plan
- MetricsAvailable=True in metrics pipeline verification step
- wva_desired_replicas is accessible via external metrics API
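The second test plan item can be checked by hand against the external metrics API. A small sketch, assuming a hypothetical namespace `llm-d-sim` and a hypothetical helper `external_metric_path`; the command is printed rather than run so the snippet works without cluster access.

```shell
#!/usr/bin/env bash
# Builds the raw external metrics API path for a namespaced metric.
external_metric_path() {
  local namespace="$1" metric="$2"
  printf '/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s' "$namespace" "$metric"
}

# "llm-d-sim" is a hypothetical namespace used only for illustration.
path="$(external_metric_path "llm-d-sim" "wva_desired_replicas")"

# Printed rather than executed so the sketch runs without a cluster.
echo "kubectl get --raw \"$path\""
```

On a cluster, an empty `items` list in the response would suggest the metric is still not reaching the external metrics API.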