Add ok-to-test gate, use cluster HF token for e2e tests, fix istio issues, run wva against 2 different stacks simultaneously #451
Conversation
Pull request overview
This PR implements a security gate for OpenShift E2E tests and migrates HuggingFace token management from GitHub secrets to cluster secrets. The gate requires /ok-to-test approval from maintainers before running GPU-intensive tests on external PRs, while allowing automatic runs for trusted contributors.
Key Changes:
- Added approval workflow that posts instructions on new PRs and validates `/ok-to-test` commands
- Modified E2E workflow to support both workflow_call (from gate) and workflow_dispatch triggers
- Replaced GitHub secret-based HF_TOKEN with cluster secret retrieval from llm-d-hf-token
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| .github/workflows/ci-e2e-openshift-gate.yaml | New gate workflow implementing /ok-to-test approval mechanism with permission checks and PR instruction posting |
| .github/workflows/ci-e2e-openshift.yaml | Updated to use workflow_call trigger, unified input handling, and cluster-based HF token retrieval |
Force-pushed from 783e71e to 5119cf1.
    re := regexp.MustCompile(`[^a-z0-9-]`)
    result = re.ReplaceAllString(result, "")
    // Trim leading/trailing hyphens
    result = strings.Trim(result, "-")
The function doesn't handle the case where the sanitized result is empty (e.g., if the input contains only special characters). Consider adding validation to return an error or default value when the result is empty to prevent silent failures.
Suggested change:

    result = strings.Trim(result, "-")
    // If everything was stripped out, fall back to a safe default name
    if result == "" {
        result = "default"
    }
    rm -f kubectl.sha256
    # Install oc (OpenShift CLI)
    curl -LO "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz"
    curl -fsSL --retry 3 --retry-delay 5 -O "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz"
The OpenShift CLI (oc) is downloaded without any checksum verification. This creates a supply chain security risk. Consider adding checksum verification similar to kubectl, or use a pinned version with known checksum.
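For illustration, one possible hardened form of the download step (this assumes the mirror publishes a `sha256sum.txt` listing next to the tarball; verify the exact filename and entry format before adopting):

```bash
# Download the client tarball and its published checksum list, then verify before extracting.
BASE="https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable"
curl -fsSL --retry 3 --retry-delay 5 -O "${BASE}/openshift-client-linux.tar.gz"
curl -fsSL --retry 3 --retry-delay 5 -O "${BASE}/sha256sum.txt"   # assumed filename
grep ' openshift-client-linux.tar.gz$' sha256sum.txt | sha256sum -c -
tar -xzf openshift-client-linux.tar.gz oc
```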
    rm -f openshift-client-linux.tar.gz kubectl README.md
    # Install helm
    curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    curl -fsSL --retry 3 --retry-delay 5 https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Executing a script directly from the internet via pipe to bash is a security risk. The script content could change between runs or be compromised. Consider pinning to a specific commit hash or downloading and verifying the script before execution.
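As a sketch of that approach (the commit hash and checksum below are placeholders to be filled in with reviewed values, not values taken from this PR):

```bash
# Pin the installer script to a reviewed commit and verify its checksum before running it.
HELM_SCRIPT_COMMIT="<reviewed-commit-sha>"      # placeholder
HELM_SCRIPT_SHA256="<expected-sha256>"          # placeholder
curl -fsSL --retry 3 --retry-delay 5 -o get-helm-3 \
  "https://raw.githubusercontent.com/helm/helm/${HELM_SCRIPT_COMMIT}/scripts/get-helm-3"
echo "${HELM_SCRIPT_SHA256}  get-helm-3" | sha256sum -c -
bash ./get-helm-3
```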
    KUBECTL_VERSION="v1.31.0"
    echo "Installing kubectl version: $KUBECTL_VERSION"
    curl -fsSL --retry 3 --retry-delay 5 -o kubectl "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
The comment states 'Pinned 2025-12' but it's currently December 2025. Consider updating the comment to reflect when this pinning decision was actually made, or use a more generic phrase like 'Last updated: 2025-12' to avoid confusion about whether this is a future date.
| It("should create and run parallel load generation jobs", func() { | ||
| By("cleaning up any existing jobs") | ||
| deleteParallelLoadJobs(ctx, jobBaseName, model.namespace, numLoadWorkers) | ||
| time.Sleep(2 * time.Second) |
Replace hardcoded sleep with a conditional wait or increase timeout on the subsequent Eventually block. Using fixed sleeps can make tests flaky and slower than necessary. The Eventually block starting at line 273 already provides proper waiting logic for endpoints to be ready.
Suggested change: remove the `time.Sleep(2 * time.Second)` call.
    _ = k8sClient.BatchV1().Jobs(model.namespace).Delete(ctx, healthCheckJobName, metav1.DeleteOptions{
        PropagationPolicy: &backgroundPropagation,
    })
    time.Sleep(2 * time.Second)
Replace hardcoded sleeps with eventual consistency checks. These arbitrary 2-second waits can cause test flakiness. Consider using Eventually blocks to wait for the actual state you need (e.g., job deletion completion) rather than fixed time periods.
    metav1.ListOptions{
        LabelSelector: fmt.Sprintf("experiment=%s", jobBaseName),
    })
    time.Sleep(2 * time.Second)
Replace hardcoded sleeps with eventual consistency checks. These arbitrary 2-second waits can cause test flakiness. Consider using Eventually blocks to wait for the actual state you need (e.g., job deletion completion) rather than fixed time periods.
Suggested change (replacing the `time.Sleep`):

    Eventually(func(g Gomega) {
        jobList, err := k8sClient.BatchV1().Jobs(model.namespace).List(ctx, metav1.ListOptions{
            LabelSelector: fmt.Sprintf("experiment=%s", jobBaseName),
        })
        g.Expect(err).NotTo(HaveOccurred(), "Should be able to list load generation jobs for cleanup")
        g.Expect(len(jobList.Items)).To(BeZero(), "All previous load generation jobs should be deleted before starting new ones")
    }, 2*time.Minute, 5*time.Second).Should(Succeed())
    # Clean up orphaned PR-specific namespaces from previous runs
    # Pattern: llm-d-inference-scheduler-pr-* and llm-d-autoscaler-pr-*
    echo "Cleaning up orphaned PR-specific namespaces..."
    for ns in $(kubectl get ns -o name | grep -E 'llm-d-(inference-scheduler|autoscaler|autoscaling)-pr-' | cut -d/ -f2); do
The namespace pattern includes 'autoscaling' but the actual namespace pattern used in the workflow is 'autoscaler' (line 141). This inconsistency could cause the cleanup to miss orphaned namespaces with the 'autoscaling' pattern.
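For example, if only the `-autoscaler-` form is ever created by the workflow, the two patterns could be kept in sync roughly like this (the loop body is illustrative, not the actual cleanup step; keep the `autoscaling` alternative only if older runs may have created such namespaces):

```bash
# Match only the namespace prefixes the workflow actually creates.
for ns in $(kubectl get ns -o name | grep -E 'llm-d-(inference-scheduler|autoscaler)-pr-' | cut -d/ -f2); do
  kubectl delete ns "$ns" --ignore-not-found --wait=false
done
```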
    apiVersion: inference.networking.k8s.io/v1
    kind: InferencePool
The InferencePool resource definition is duplicated in the fix_namespace function for both namespaces. Consider extracting this into a template or using a Kubernetes manifest file to avoid duplication and reduce maintenance burden.
| By("waiting for vllm-service endpoints to exist") | ||
| Eventually(func(g Gomega) { | ||
| endpoints, err := k8sClient.CoreV1().Endpoints(model.namespace).Get(ctx, vllmServiceName, metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred(), "vllm-service endpoints should exist") | ||
| g.Expect(endpoints.Subsets).NotTo(BeEmpty(), "vllm-service should have endpoints") | ||
|
|
||
| readyCount := 0 | ||
| for _, subset := range endpoints.Subsets { | ||
| readyCount += len(subset.Addresses) | ||
| } | ||
| _, _ = fmt.Fprintf(GinkgoWriter, "%s has %d ready endpoints\n", vllmServiceName, readyCount) | ||
| g.Expect(readyCount).To(BeNumerically(">", 0), "vllm-service should have at least one ready endpoint") | ||
| }, 5*time.Minute, 10*time.Second).Should(Succeed()) |
The code attempts to get endpoints for vllmServiceName which might be empty (lines 251-256 set it conditionally). If vllmServiceName is empty, this will fail with an unclear error. Add a check to skip this validation if vllmServiceName is empty, or make it a required field.
| By("waiting for vllm-service endpoints to exist") | |
| Eventually(func(g Gomega) { | |
| endpoints, err := k8sClient.CoreV1().Endpoints(model.namespace).Get(ctx, vllmServiceName, metav1.GetOptions{}) | |
| g.Expect(err).NotTo(HaveOccurred(), "vllm-service endpoints should exist") | |
| g.Expect(endpoints.Subsets).NotTo(BeEmpty(), "vllm-service should have endpoints") | |
| readyCount := 0 | |
| for _, subset := range endpoints.Subsets { | |
| readyCount += len(subset.Addresses) | |
| } | |
| _, _ = fmt.Fprintf(GinkgoWriter, "%s has %d ready endpoints\n", vllmServiceName, readyCount) | |
| g.Expect(readyCount).To(BeNumerically(">", 0), "vllm-service should have at least one ready endpoint") | |
| }, 5*time.Minute, 10*time.Second).Should(Succeed()) | |
| if vllmServiceName == "" { | |
| By("skipping vllm-service endpoints check because vllmServiceName is empty") | |
| } else { | |
| By("waiting for vllm-service endpoints to exist") | |
| Eventually(func(g Gomega) { | |
| endpoints, err := k8sClient.CoreV1().Endpoints(model.namespace).Get(ctx, vllmServiceName, metav1.GetOptions{}) | |
| g.Expect(err).NotTo(HaveOccurred(), "vllm-service endpoints should exist") | |
| g.Expect(endpoints.Subsets).NotTo(BeEmpty(), "vllm-service should have endpoints") | |
| readyCount := 0 | |
| for _, subset := range endpoints.Subsets { | |
| readyCount += len(subset.Addresses) | |
| } | |
| _, _ = fmt.Fprintf(GinkgoWriter, "%s has %d ready endpoints\n", vllmServiceName, readyCount) | |
| g.Expect(readyCount).To(BeNumerically(">", 0), "vllm-service should have at least one ready endpoint") | |
| }, 5*time.Minute, 10*time.Second).Should(Succeed()) | |
| } |
    for modelID, modelVAs := range modelGroups {
    for groupKey, modelVAs := range modelGroups {
        // The groupKey is "modelID|namespace" - extract actual modelID from VAs
        // All VAs in the group have the same modelID and namespace
Accessing modelVAs[0] without checking if the slice is empty could cause a panic. Although the grouping logic should ensure non-empty groups, add a defensive check to verify len(modelVAs) > 0 before accessing the first element.
Suggested change:

    // All VAs in the group have the same modelID and namespace
    if len(modelVAs) == 0 {
        logger.V(logging.DEBUG).Info("Skipping empty model group",
            "groupKey", groupKey)
        continue
    }
    // Attempts to resolve the target model variant using scaleTargetRef
    scaleTargetName := va.Spec.ScaleTargetRef.Name
    if scaleTargetName == "" {
        // Fallback to VA name for backwards compatibility
The fallback to VA name for backwards compatibility is undocumented in the CRD or user-facing documentation. Add a comment explaining when this fallback is used (legacy VAs without scaleTargetRef) and consider logging a deprecation warning to encourage users to migrate to the explicit scaleTargetRef field.
Suggested change:

    // Fallback to VA name for backwards compatibility:
    // - This is used for legacy VariantAutoscaling resources that do not specify spec.scaleTargetRef.name.
    // - New configurations should set spec.scaleTargetRef.name explicitly and not rely on this implicit fallback.
    logger.Info("Using VariantAutoscaling name as fallback for spec.scaleTargetRef.name; this behavior is deprecated and may be removed in a future release. Please set spec.scaleTargetRef.name explicitly.",
        "variantAutoscaling", va.Name,
        "namespace", va.Namespace,
    )
    if len(allDecisions) > 0 {
        logger.Info("Applying scaling decisions",
            "totalDecisions", len(allDecisions))
        if err := e.applySaturationDecisions(ctx, allDecisions, vaMap); err != nil {
            logger.Error(err, "Failed to apply saturation decisions")
            return err
        }
    } else {
        logger.Info("No scaling decisions to apply")
        logger.Info("No scaling decisions to apply, updating VA status with metrics")
    }
The if-else block at lines 232-237 only logs messages but doesn't affect control flow - the applySaturationDecisions call happens regardless. Consider simplifying by removing the conditional and using a single log statement like `logger.Info("Applying decisions and updating VA status", "totalDecisions", len(allDecisions))` to reduce code complexity.
Suggested change (a single log statement in place of the conditional logging):

    logger.Info("Applying decisions and updating VA status",
        "totalDecisions", len(allDecisions))
Force-pushed from c4a6a04 to aa0475f.
Pull request overview
Copilot reviewed 46 out of 46 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
.github/workflows/ci-e2e-openshift.yaml:1
- The `WVA_IMAGE_PULL_POLICY` variable is defined but never used in the workflow or install script. Consider removing it or documenting where it should be used.
    name: CI - OpenShift E2E Tests
…d WVA Multi-model E2E Testing:
- Deploy 2 models in 2 namespaces with 1 shared WVA controller
- Model A in llm-d-inference-scheduler-pr-XXX
- Model B in llm-d-inference-scheduler-pr-XXX-b
- Shared WVA in llm-d-autoscaler-pr-XXX

Test Improvements:
- Move HPA retrieval before VA stabilization to know minReplicas
- Wait for VA to stabilize at exact minReplicas before load test
- Increase stabilization timeout to 5 minutes
- Route load through Istio gateway instead of direct vLLM service

CI Cleanup Behavior:
- Before tests: Clean up all PR namespaces for fresh start
- After successful tests: Clean up automatically
- After failed tests: Leave resources for debugging

Documentation:
- Add monitoring commands with PR number placeholder
- Document multi-model testing architecture
- Document CI cleanup behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed from 97bd05a to e25066c.
    # Pinned 2025-12: v1.31.0 tested compatible with OpenShift 4.16+
    # Update this version when upgrading target cluster or during regular dependency reviews
The comment states kubectl is 'Pinned 2025-12' but the PR description indicates the current date is December 22, 2025. Consider updating the pin date comment to reflect when this version was actually pinned, or clarify if this refers to a future update schedule.
Suggested change:

    # Pinned on 2025-12-22: kubectl v1.31.0 tested compatible with OpenShift 4.16+
    # Review and update this version during dependency reviews or when upgrading target clusters
    models := []modelTestConfig{
        {
            name: "Model A1",
The model is named 'Model A1' in the test but the PR description and other documentation refers to it as 'Model A'. For consistency, consider using 'Model A' instead of 'Model A1'.
    existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)
    stalePodCount := 0

    // Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
    for podName := range podSet {
        if !existingPods[podName] {
            stalePodCount++
            // TODO: remove debug log after verification
            logger.V(logging.DEBUG).Info("Filtering pod from stale vLLM metrics", "pod", podName, "namespace", namespace, "model", modelID)
            delete(podSet, podName)
    // If getExistingPods returns empty but we have candidate pods with metrics,
    // skip the filtering - this handles the case where kube_pod_info hasn't been
    // scraped yet for new pods. It's better to include all candidates than to
    // filter them all out and skip saturation analysis entirely.
    // Note: This workaround may include metrics from recently-terminated pods if
    // kube_pod_info is stale. The typical staleness window is ~30s based on scrape intervals.
    // TODO: Consider adding time-based filtering or retry logic for more accurate pod filtering.
    if len(existingPods) == 0 && len(podSet) > 0 {
        logger.Info("kube_pod_info returned no pods but we have metric candidates, skipping stale pod filtering",
            "candidatePods", len(podSet), "namespace", namespace, "model", modelID)
    } else {
        // Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
This workaround bypasses stale pod filtering when kube_pod_info is unavailable, which could include metrics from terminated pods for up to 30 seconds. Consider implementing the time-based filtering or retry logic mentioned in the TODO to improve accuracy, or document acceptable tolerance levels for including stale metrics in saturation analysis.
Suggested change:

    // Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
    //
    // To avoid races where kube_pod_info has not yet been scraped for new pods, we allow a bounded
    // retry window before falling back to including all candidate pods. This reduces the chance of
    // including metrics from recently-terminated pods while still avoiding dropping all metrics when
    // kube_pod_info is briefly empty.
    maxPodInfoWait := 5 * time.Second
    retryInterval := 500 * time.Millisecond
    podInfoStart := time.Now()
    existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)
    stalePodCount := 0
    // If getExistingPods returns empty but we have candidate pods with metrics, retry for a short,
    // bounded window before skipping filtering entirely. This handles the case where kube_pod_info
    // hasn't been scraped yet for new pods. If, after the retry window, we still have no pod info,
    // we proceed without filtering but log that stale metrics may be included for up to the typical
    // kube-state-metrics staleness window (~30s).
    if len(existingPods) == 0 && len(podSet) > 0 {
        logger.Info("kube_pod_info returned no pods but we have metric candidates, retrying before skipping stale pod filtering",
            "candidatePods", len(podSet), "namespace", namespace, "model", modelID,
            "maxWait", maxPodInfoWait, "retryInterval", retryInterval)
        for len(existingPods) == 0 && time.Since(podInfoStart) < maxPodInfoWait {
            select {
            case <-ctx.Done():
                // Context cancelled or deadline exceeded; stop retrying.
                break
            case <-time.After(retryInterval):
                existingPods = cmc.getExistingPods(ctx, namespace, deployments, podSet)
            }
            // If context is done, avoid extra work.
            if ctx.Err() != nil {
                break
            }
        }
        if len(existingPods) == 0 {
            // Fall back to previous behavior: include all candidate pods without stale filtering,
            // but document the tolerated window for potentially stale metrics.
            logger.Info("kube_pod_info remained empty after retry window; proceeding without stale pod filtering",
                "candidatePods", len(podSet), "namespace", namespace, "model", modelID,
                "retryDuration", time.Since(podInfoStart),
                "toleratedStalenessWindow", "up to ~30s due to Prometheus retention")
        }
    }
    // If we have pod information, filter out pods that don't exist according to the queried
    // Prometheus kube-state-metrics.
    if len(existingPods) > 0 {
- Add documentation for CONFIG_MAP_NAME env var in prometheus.go explaining how it's set by Helm chart for multi-instance support
- Add TODO for scaleTargetRef.name CRD requirement in variant.go noting the check should be enforced at schema level

Related issues:
- #454: Consider hardware type when inferring target deployment
- #455: Replace pod filtering with app-level readiness probes

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
asm582 left a comment
lgtm
Some issues will be addressed later.
Add ok-to-test gate, use cluster HF token for e2e tests, fix istio issues, run wva against 2 different stacks simultaneously
Executive Summary
This PR introduces an ok-to-test gate for GPU-intensive e2e tests and adds comprehensive multi-model testing capabilities for the workload-variant-autoscaler on OpenShift. The changes enable testing 2 models in 2 separate namespaces while using a single shared WVA controller in a 3rd namespace, demonstrating production-ready multi-tenant scaling scenarios.
Key Changes
Security & CI Gate
- `ci-e2e-openshift-gate.yaml` workflow requiring `/ok-to-test` comment from maintainers before running GPU tests
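A minimal sketch of how such a gate can be wired in GitHub Actions (the job name, permitted roles, and the called workflow path are assumptions for illustration, not the actual contents of `ci-e2e-openshift-gate.yaml`):

```yaml
name: ok-to-test gate
on:
  issue_comment:
    types: [created]
jobs:
  run-e2e:
    # Only react to /ok-to-test comments on PRs from maintainer-level accounts.
    if: >-
      github.event.issue.pull_request &&
      contains(github.event.comment.body, '/ok-to-test') &&
      contains(fromJSON('["OWNER","MEMBER","COLLABORATOR"]'), github.event.comment.author_association)
    uses: ./.github/workflows/ci-e2e-openshift.yaml
    secrets: inherit
```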
Helm Chart: Controller-Only and Variant-Only Installation
- `controller.enabled` flag to helm chart for modular deployment
- `controller.enabled=true` (default): Deploy full WVA stack (controller + VA + HPA + ServiceMonitor)
- `controller.enabled=false`: Deploy only variant resources (VA + HPA + ServiceMonitor)
Multi-Model E2E Testing Architecture
Deploy 2 models in 2 namespaces with 1 shared WVA controller:
- `llm-d-inference-scheduler-pr-XXX` (Model A)
- `llm-d-inference-scheduler-pr-XXX-b` (Model B)
- `llm-d-autoscaler-pr-XXX` (shared WVA managing both)
Critical Bug Fixes for Multi-Namespace Support
1. Saturation Engine Multi-Namespace Grouping
Problem: VAs with the same modelID in different namespaces were grouped together, causing incorrect scaling decisions.
Fix (`internal/utils/variant.go`): Group VAs by `modelID|namespace` composite key instead of just `modelID`:
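A minimal sketch of the grouping idea with simplified types (the struct and function names are illustrative, not the actual code in `internal/utils/variant.go`):

```go
package variantutil

import "fmt"

// variantAutoscaling is a simplified stand-in for the real VariantAutoscaling object.
type variantAutoscaling struct {
	Name      string
	Namespace string
	ModelID   string
}

// groupByModelAndNamespace keys each group by "modelID|namespace" so that two
// VariantAutoscalings with the same modelID in different namespaces are never
// mixed into a single scaling decision.
func groupByModelAndNamespace(vas []variantAutoscaling) map[string][]variantAutoscaling {
	groups := make(map[string][]variantAutoscaling, len(vas))
	for _, va := range vas {
		key := fmt.Sprintf("%s|%s", va.ModelID, va.Namespace)
		groups[key] = append(groups[key], va)
	}
	return groups
}
```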
2. HPA External Metrics Namespace Isolation
Problem: HPA metrics from different namespaces could collide when using the same metric name.
Fix (`charts/workload-variant-autoscaler/templates/hpa.yaml`): Add `exported_namespace` label selector:
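A rough sketch of what such a selector could look like in an HPA external-metric block (a fragment only; the metric name and target values are placeholders, not the actual chart template):

```yaml
# Fragment of the HPA spec: the selector scopes the external metric to this release's namespace.
metrics:
  - type: External
    external:
      metric:
        name: wva_desired_replicas            # placeholder metric name
        selector:
          matchLabels:
            exported_namespace: {{ .Release.Namespace }}
      target:
        type: AverageValue
        averageValue: "1"                     # illustrative target
```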
3. Istio 1.28+ InferencePool API Compatibility
Problem: Istio 1.28.1 requires GA API (`inference.networking.k8s.io`) but llm-d charts deploy with experimental API (`inference.networking.x-k8s.io`).
Fix (`.github/workflows/ci-e2e-openshift.yaml`): Add workflow step to patch resources:
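A sketch of the idea behind that step (the namespace variable and manifest path are illustrative, not the exact workflow contents):

```bash
# Replace the InferencePool created under the experimental API group with one
# defined under the GA group (inference.networking.k8s.io/v1) that Istio 1.28+ watches.
NS="llm-d-inference-scheduler-pr-${PR_NUMBER}"
kubectl delete inferencepools.inference.networking.x-k8s.io --all -n "$NS" --ignore-not-found
kubectl apply -n "$NS" -f hack/inferencepool-ga.yaml   # hypothetical manifest path
```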
4. Load Generation Through Gateway
Problem: Tests were sending load directly to `vllm-service:8200`, bypassing the InferencePool/EPP routing.
Fix (`test/e2e-openshift/sharegpt_scaleup_test.go`): Route all traffic through Istio gateway on port 80:
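A hypothetical helper illustrating the change (the function, parameter names, and request shape are assumptions, not the actual test code):

```go
package e2eutil

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildGatewayCompletionRequest sends load to the shared Istio gateway on port 80,
// with the target model carried in the request body so the InferencePool/EPP can
// route it, instead of calling vllm-service:8200 directly.
func buildGatewayCompletionRequest(ctx context.Context, gatewayHost, modelID, prompt string) (*http.Request, error) {
	body, err := json.Marshal(map[string]any{
		"model":      modelID,
		"prompt":     prompt,
		"max_tokens": 64,
	})
	if err != nil {
		return nil, err
	}
	// Port 80 on the gateway, not the per-model vLLM service port.
	url := fmt.Sprintf("http://%s/v1/completions", gatewayHost)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```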
Configuration Requirements
For Istio 1.28+ Clusters
If using Istio 1.28.1 or later, you must apply the InferencePool API compatibility fix shown above. The llm-d charts currently deploy with the experimental API group which Istio 1.28+ does not support.
For Multi-Namespace Deployments
- `controller.enabled=true` in controller namespace
- `controller.enabled=false` (variant-only) in model namespaces
- `exported_namespace` label for proper isolation
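A sketch of what the two installs could look like with this flag (release names and flags other than `controller.enabled` are illustrative):

```bash
# Controller namespace: full WVA stack (controller + VA + HPA + ServiceMonitor).
helm install wva charts/workload-variant-autoscaler \
  -n llm-d-autoscaler-pr-XXX --create-namespace \
  --set controller.enabled=true

# Each model namespace: variant-only resources (VA + HPA + ServiceMonitor).
helm install wva-variant charts/workload-variant-autoscaler \
  -n llm-d-inference-scheduler-pr-XXX --create-namespace \
  --set controller.enabled=false
```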
Monitoring CI Runs
Watch cluster resources during CI (replace `<PR_NUMBER>` with your PR number):
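For example (illustrative commands only; the CRD short name and controller deployment name are assumptions to check against your cluster):

```bash
kubectl get pods -n llm-d-inference-scheduler-pr-<PR_NUMBER> -w        # Model A stack
kubectl get pods -n llm-d-inference-scheduler-pr-<PR_NUMBER>-b -w      # Model B stack
kubectl get hpa,variantautoscalings -n llm-d-inference-scheduler-pr-<PR_NUMBER>
kubectl logs -n llm-d-autoscaler-pr-<PR_NUMBER> deploy/workload-variant-autoscaler -f   # shared WVA controller
```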
CI Cleanup Behavior
- Before tests: clean up all PR namespaces for a fresh start
- After successful tests: clean up automatically
- After failed tests: leave resources for debugging
Testing
🤖 Generated with Claude Code