🐛 Fix nightly E2E: update DEFAULT_MODEL_ID and add WVA_METRICS_SECURE #720

Closed
clubanderson wants to merge 2 commits into main from fix/nightly-model-id-and-metrics-auth

@clubanderson
Contributor

Summary

Two fixes for the nightly E2E tests on OpenShift:

  1. Update DEFAULT_MODEL_ID from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B

    • The llm-d repo changed its default model, but install.sh still had the old value
    • The yq model replacement silently failed — vLLM served Qwen/Qwen3-0.6B while WVA queried for unsloth/Meta-Llama-3.1-8B metrics
    • Result: MetricsAvailable=False, DesiredOptimizedAlloc empty, Scale-to-Zero tests fail
  2. Add WVA_METRICS_SECURE env var (default: true) wired to wva.metrics.secure Helm value

    • When set to false, disables bearer token auth on the WVA /metrics endpoint
    • Needed for OpenShift user-workload-monitoring where the controller-manager SA token auth fails with "Token does not match server's copy"
    • Without this, Prometheus can't scrape wva_desired_replicas → external metrics API empty → ShareGPT test fails
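
The env-var-to-Helm wiring in item 2 might look roughly like the sketch below. This is not the actual content of deploy/install.sh: the function name `wva_metrics_helm_args` and the input validation are assumptions added for illustration; only the `WVA_METRICS_SECURE` variable, its default of `true`, and the `wva.metrics.secure` Helm value come from the description above.

```shell
#!/bin/sh
# Sketch only: not the actual contents of deploy/install.sh.

# Default to authenticated metrics unless the caller overrides it.
WVA_METRICS_SECURE="${WVA_METRICS_SECURE:-true}"

# Print the Helm argument that carries the setting.
# Rejecting anything other than true/false is an assumption of this
# sketch, so that a typo cannot silently change the auth behavior.
wva_metrics_helm_args() {
  case "$1" in
    true|false)
      printf '%s\n' "--set wva.metrics.secure=$1"
      ;;
    *)
      echo "WVA_METRICS_SECURE must be true or false, got: $1" >&2
      return 1
      ;;
  esac
}

# Hypothetical usage inside an install script (RELEASE and CHART are
# placeholder names, not taken from the PR):
#   helm upgrade --install "$RELEASE" "$CHART" \
#     $(wva_metrics_helm_args "$WVA_METRICS_SECURE")
```

Keeping the default at `true` means existing deployments stay authenticated; only a caller that explicitly exports `WVA_METRICS_SECURE=false` (as the companion workflow change does) turns bearer token auth off.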

Before fix: 1 Passed, 2 Failed, 22 Skipped
After model ID fix only: 3 Passed, 1 Failed, 21 Skipped
Expected after both fixes: 4+ Passed, 0 Failed

Companion PR: llm-d/llm-d-infra#20 (sets WVA_METRICS_SECURE=false in reusable workflow)

Note: Nightly workflow temporarily points to llm-d-infra@fix/nightly-metrics-secure. Will revert to @main after llm-d-infra#20 merges.

Test plan

  • Trigger nightly from this branch — all core tests should pass
  • Verify MetricsAvailable=True in metrics pipeline verification step
  • Verify wva_desired_replicas is accessible via external metrics API
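
The last two checks in the test plan could be scripted along these lines. This is a sketch under assumptions: the resource kind `variantautoscaling` and the helper names are guesses for illustration, not confirmed by this PR. The `kubectl get -o jsonpath` and `kubectl get --raw` calls and the external metrics API path layout are standard Kubernetes.

```shell
#!/bin/sh
# Sketch only: helper names and the resource kind are assumptions.

# Print the status ("True"/"False") of the MetricsAvailable condition.
#   $1 = resource name, $2 = namespace
metrics_available_status() {
  kubectl get variantautoscaling "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="MetricsAvailable")].status}'
}

# Build the raw external metrics API path for a metric.
#   $1 = namespace, $2 = metric name
external_metric_path() {
  printf '/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s\n' "$1" "$2"
}

# Hypothetical usage against a live cluster:
#   metrics_available_status my-variant my-namespace
#   kubectl get --raw "$(external_metric_path my-namespace wva_desired_replicas)"
```

An empty `items` list from the second command would suggest that Prometheus is still not scraping the WVA /metrics endpoint, or that another component is serving the external metrics API.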

Two fixes for nightly E2E tests:

1. Update DEFAULT_MODEL_ID from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B
   to match the llm-d repo's current default. The stale value caused
   the yq model replacement to silently fail, resulting in vLLM
   serving the wrong model and MetricsAvailable=False.

2. Add WVA_METRICS_SECURE env var (default: true) wired to
   wva.metrics.secure helm value. When set to false, disables
   bearer token auth on the /metrics endpoint. Needed for OpenShift
   user-workload-monitoring where the SA token auth fails.

Temporarily points nightly workflow to llm-d-infra branch
fix/nightly-metrics-secure (PR #20) for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
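
The silent replacement failure described in item 1 of the commit message above is the kind of problem a guard can surface early. The sketch below is an illustration only: it uses grep and awk on a plain file instead of the yq call in install.sh, and the function name `replace_model_id` is hypothetical. It refuses to continue when the expected default model ID is not present in the file.

```shell
#!/bin/sh
# Sketch only: uses grep/awk rather than the yq call in install.sh.

# Replace one model ID with another in a file, failing loudly if the
# expected default is absent (the "silent failure" case).
#   $1 = file, $2 = expected default model ID, $3 = model ID to deploy
replace_model_id() {
  rmi_file="$1"; rmi_old="$2"; rmi_new="$3"
  if ! grep -qF -- "$rmi_old" "$rmi_file"; then
    echo "ERROR: '$rmi_old' not found in $rmi_file; upstream default may have changed" >&2
    return 1
  fi
  rmi_tmp="$(mktemp)" || return 1
  # Literal (non-regex) replacement, so '.' and '/' in model IDs are safe.
  awk -v old="$rmi_old" -v new="$rmi_new" '
    {
      out = ""
      while ((i = index($0, old)) > 0) {
        out = out substr($0, 1, i - 1) new
        $0 = substr($0, i + length(old))
      }
      print out $0
    }' "$rmi_file" > "$rmi_tmp" || { rm -f "$rmi_tmp"; return 1; }
  # Write back in place to keep the original file permissions.
  cat "$rmi_tmp" > "$rmi_file"
  rm -f "$rmi_tmp"
}
```

With a check like this, a stale DEFAULT_MODEL_ID would stop the install with an error instead of leaving vLLM on a different model than the one WVA queries.
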
Copilot AI review requested due to automatic review settings February 13, 2026 15:30
@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource | Total | Allocated | Available
GPUs | 50 | 26 | 24

Cluster | Value
Nodes | 16 (7 with GPUs)
Total CPU | 993 cores
Total Memory | 10383 Gi
GPUs required | 4 (min) / 6 (recommended)

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes two issues affecting nightly E2E tests on OpenShift. The first fix updates the DEFAULT_MODEL_ID to match the llm-d repository's current default model, ensuring that the yq model replacement logic works correctly. The second fix adds a WVA_METRICS_SECURE environment variable to disable bearer token authentication on the WVA /metrics endpoint, which is needed for OpenShift's user-workload-monitoring to successfully scrape metrics. The PR also temporarily references a companion PR branch in the workflow configuration.

Changes:

  • Update DEFAULT_MODEL_ID from "Qwen/Qwen3-32B" to "Qwen/Qwen3-0.6B" to align with llm-d repository defaults
  • Add WVA_METRICS_SECURE environment variable (default: true) to control metrics endpoint authentication, wired through to Helm chart
  • Temporarily reference llm-d-infra@fix/nightly-metrics-secure branch until companion PR merges

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
deploy/install.sh | Updates DEFAULT_MODEL_ID constant and adds WVA_METRICS_SECURE environment variable with default value, wiring it to Helm chart via --set flag
.github/workflows/nightly-e2e-openshift.yaml | Temporarily references fix/nightly-metrics-secure branch of llm-d-infra reusable workflow instead of @main

  jobs:
    nightly:
-     uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@main
+     uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@fix/nightly-metrics-secure

Copilot AI Feb 13, 2026

This temporary reference to the fix/nightly-metrics-secure branch deviates from the established pattern of referencing reusable workflows at @main. While documented in the PR description as temporary, this creates a merge dependency where this PR cannot be merged and tested properly until llm-d-infra#20 is merged first. Consider either: 1) merging llm-d-infra#20 first and updating this to @main, or 2) keeping this at @main and accepting that the nightly workflow may fail until llm-d-infra#20 is merged. The current approach blocks independent verification of these fixes.

Suggested change
- uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@fix/nightly-metrics-secure
+ uses: llm-d/llm-d-infra/.github/workflows/reusable-nightly-e2e-openshift.yaml@main

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>

clubanderson added a commit that referenced this pull request Feb 13, 2026
Three fixes for the nightly E2E test failures:

1. DEFAULT_MODEL_ID: Update from Qwen/Qwen3-32B to Qwen/Qwen3-0.6B to
   match the llm-d repo's current default model. The stale value caused
   yq replacement to silently fail, making WVA query wrong model metrics.

2. WVA_METRICS_SECURE: Add env var to control bearer token auth on the
   WVA /metrics endpoint. OpenShift's user-workload-monitoring cannot
   authenticate with the controller-manager SA token.

3. KEDA APIService conflict: On clusters with KEDA, the
   v1beta1.external.metrics.k8s.io APIService points to KEDA's metrics
   server, which only serves ScaledObject metrics. After deploying
   Prometheus Adapter, detect and patch the APIService to point to
   Prometheus Adapter instead.

Supersedes #720.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
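
The detect-and-patch step in item 3 of the commit message above could be sketched as follows. This is not the code from #721: the function names and the example Service name and namespace (`prometheus-adapter`, `monitoring`) are hypothetical, and the sketch repoints whenever the backend is not already the adapter rather than checking specifically for KEDA. The `kubectl get apiservice` and `kubectl patch --type merge` calls are standard.

```shell
#!/bin/sh
# Sketch only: names other than the APIService are assumptions.

APISERVICE="v1beta1.external.metrics.k8s.io"

# Print the name of the Service currently backing the external metrics API.
current_metrics_backend() {
  kubectl get apiservice "$APISERVICE" -o jsonpath='{.spec.service.name}'
}

# Build a merge patch that repoints the APIService.
#   $1 = Service name, $2 = Service namespace
apiservice_patch() {
  printf '{"spec":{"service":{"name":"%s","namespace":"%s"}}}\n' "$1" "$2"
}

# Repoint the APIService unless it already targets the adapter.
#   $1 = adapter Service name, $2 = adapter Service namespace
repoint_external_metrics() {
  rem_name="$1"; rem_ns="$2"
  rem_backend="$(current_metrics_backend)"
  if [ "$rem_backend" = "$rem_name" ]; then
    echo "external metrics already served by $rem_name"
    return 0
  fi
  echo "repointing $APISERVICE from '$rem_backend' to '$rem_name'"
  kubectl patch apiservice "$APISERVICE" --type merge \
    -p "$(apiservice_patch "$rem_name" "$rem_ns")"
}

# Hypothetical usage after deploying Prometheus Adapter:
#   repoint_external_metrics prometheus-adapter monitoring
```

Only one Service can back a given APIService, so on a cluster that also relies on KEDA's external metrics this patch would take them away from KEDA; that trade-off is worth confirming before running it outside a test cluster.
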
@clubanderson
Contributor Author

Superseded by #721 which also includes the KEDA APIService conflict fix.

clubanderson added a commit that referenced this pull request Feb 14, 2026