
Add ok-to-test gate, use cluster HF token for e2e tests, fix istio issues, run wva against 2 different stacks simultaneously#451

Merged
clubanderson merged 7 commits into main from runner-ci-policy on Dec 22, 2025

Conversation

@clubanderson (Contributor) commented on Dec 19, 2025

Executive Summary

This PR introduces an ok-to-test gate for GPU-intensive e2e tests and adds comprehensive multi-model testing capabilities for the workload-variant-autoscaler on OpenShift. The changes enable testing 2 models in 2 separate namespaces while using a single shared WVA controller in a 3rd namespace, demonstrating production-ready multi-tenant scaling scenarios.

Key Changes

Security & CI Gate

  • Add ci-e2e-openshift-gate.yaml workflow requiring a /ok-to-test comment from maintainers before running GPU tests (a minimal sketch of such a gate follows this list)
  • Automatically post instructions on new PRs explaining the approval process
  • Read HF_TOKEN from cluster secret instead of GitHub secrets
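
For orientation, a gate of this shape can be built from an issue_comment trigger plus an author-association check. The sketch below is illustrative only; the job layout, the maintainer check, and the reusable-workflow hand-off are assumptions rather than the exact contents of ci-e2e-openshift-gate.yaml.

name: CI - OpenShift E2E Gate (illustrative sketch)
on:
  issue_comment:
    types: [created]
jobs:
  gate:
    # Only react to /ok-to-test comments on pull requests
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/ok-to-test')
    runs-on: ubuntu-latest
    steps:
      - name: Verify the commenter is a maintainer
        run: |
          # Treat OWNER/MEMBER/COLLABORATOR associations as maintainers (assumption)
          case "${{ github.event.comment.author_association }}" in
            OWNER|MEMBER|COLLABORATOR) echo "approved by maintainer" ;;
            *) echo "comment author is not a maintainer" >&2; exit 1 ;;
          esac
  e2e:
    needs: gate
    # Hand off to the GPU e2e workflow via its workflow_call trigger
    uses: ./.github/workflows/ci-e2e-openshift.yaml
    secrets: inherit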

Helm Chart: Controller-Only and Variant-Only Installation

  • Add controller.enabled flag to helm chart for modular deployment
  • When controller.enabled=true (default): Deploy full WVA stack (controller + VA + HPA + ServiceMonitor)
  • When controller.enabled=false: Deploy only variant resources (VA + HPA + ServiceMonitor)
  • Enables a single controller to manage multiple models across namespaces (see the template sketch below)
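
A common way to implement such a flag is a values toggle plus a template guard around the controller manifests. The following is a minimal sketch under that assumption; the actual template, helper, and value names beyond controller.enabled may differ in the chart:

# values.yaml (sketch)
controller:
  enabled: true   # set to false for a variant-only install (VA + HPA + ServiceMonitor)

# templates/controller-deployment.yaml (sketch)
{{- if .Values.controller.enabled }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-controller   # the real chart's naming helpers may differ
spec:
  replicas: 1
  # ... controller pod template elided ...
{{- end }}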

Multi-Model E2E Testing Architecture

Deploy 2 models in 2 namespaces with 1 shared WVA controller:

  • Model A: Full llm-d stack in llm-d-inference-scheduler-pr-XXX
  • Model B: Full llm-d stack in llm-d-inference-scheduler-pr-XXX-b
  • Shared WVA: Single controller in llm-d-autoscaler-pr-XXX managing both

Critical Bug Fixes for Multi-Namespace Support

1. Saturation Engine Multi-Namespace Grouping

Problem: VAs with the same modelID in different namespaces were grouped together, causing incorrect scaling decisions.

Fix (internal/utils/variant.go): Group VAs by modelID|namespace composite key instead of just modelID:

func GroupVariantAutoscalingByModel(vas []VariantAutoscaling) map[string][]VariantAutoscaling {
    groups := make(map[string][]VariantAutoscaling)
    for _, va := range vas {
        key := va.Spec.ModelID + "|" + va.Namespace  // Changed from just modelID
        groups[key] = append(groups[key], va)
    }
    return groups
}
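
To see the effect of the composite key, here is a small, self-contained sketch; the struct and the model/namespace values are illustrative stand-ins for the real API types, not code from this repository. Two VAs that share a modelID but live in different namespaces now land in separate groups, so each namespace gets its own scaling decision:

package main

import "fmt"

// Illustrative stand-in for the real VariantAutoscaling API type.
type VariantAutoscaling struct {
    Namespace string
    Spec      struct{ ModelID string }
}

func groupByModelAndNamespace(vas []VariantAutoscaling) map[string][]VariantAutoscaling {
    groups := make(map[string][]VariantAutoscaling)
    for _, va := range vas {
        key := va.Spec.ModelID + "|" + va.Namespace // composite key
        groups[key] = append(groups[key], va)
    }
    return groups
}

func main() {
    a := VariantAutoscaling{Namespace: "llm-d-inference-scheduler-pr-451"}
    a.Spec.ModelID = "example-model"
    b := VariantAutoscaling{Namespace: "llm-d-inference-scheduler-pr-451-b"}
    b.Spec.ModelID = "example-model"

    // Before the fix both VAs would fall into a single group keyed only by modelID.
    for key, group := range groupByModelAndNamespace([]VariantAutoscaling{a, b}) {
        fmt.Printf("%s -> %d variant(s)\n", key, len(group))
    }
}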

2. HPA External Metrics Namespace Isolation

Problem: HPA metrics from different namespaces could collide when using the same metric name.

Fix (charts/workload-variant-autoscaler/templates/hpa.yaml): Add exported_namespace label selector:

matchLabels:
  variant: {{ .Values.variantAutoscaling.scaleTargetRef.name }}
  exported_namespace: {{ .Values.llmd.namespace }}
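
For context, that selector lives inside the HPA's external metric entry. A minimal sketch of the surrounding autoscaling/v2 block, keeping the two labels above; the metric name and target values here are placeholders, not the chart's exact output:

metrics:
- type: External
  external:
    metric:
      name: wva_desired_replicas   # placeholder; use the metric name the controller actually exports
      selector:
        matchLabels:
          variant: {{ .Values.variantAutoscaling.scaleTargetRef.name }}
          exported_namespace: {{ .Values.llmd.namespace }}
    target:
      type: AverageValue
      averageValue: "1"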

3. Istio 1.28+ InferencePool API Compatibility

Problem: Istio 1.28.1 requires GA API (inference.networking.k8s.io) but llm-d charts deploy with experimental API (inference.networking.x-k8s.io).

Fix (.github/workflows/ci-e2e-openshift.yaml): Add workflow step to patch resources:

# 1. Create InferencePool with GA API
kubectl apply -f - <<EOF
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: gaie-inference-scheduling
  namespace: $NAMESPACE
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
EOF

# 2. Patch HTTPRoute to reference GA API group
kubectl patch httproute llm-d-inference-scheduling -n $NAMESPACE --type=json -p='[
  {"op": "replace", "path": "/spec/rules/0/backendRefs/0/group", "value": "inference.networking.k8s.io"}
]'

# 3. Patch EPP deployment to use GA pool-group
kubectl set env deployment/gaie-inference-scheduling-epp -n $NAMESPACE \
  POOL_GROUP=inference.networking.k8s.io

# 4. Create RBAC for EPP to access GA InferencePool
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: inference-pool-reader
  namespace: $NAMESPACE
rules:
- apiGroups: ["inference.networking.k8s.io"]
  resources: ["inferencepools"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: epp-inference-pool-reader
  namespace: $NAMESPACE
subjects:
- kind: ServiceAccount
  name: gaie-inference-scheduling-epp
  namespace: $NAMESPACE
roleRef:
  kind: Role
  name: inference-pool-reader
  apiGroup: rbac.authorization.k8s.io
EOF

4. Load Generation Through Gateway

Problem: Tests were sending load directly to vllm-service:8200, bypassing the InferencePool/EPP routing.

Fix (test/e2e-openshift/sharegpt_scaleup_test.go): Route all traffic through Istio gateway on port 80:

// Use gateway service for load routing (port 80)
gatewayServiceName := "infra-inference-scheduling-inference-gateway-istio"

// Health check via gateway
curl -sf http://$gatewayService:80/v1/models

// Load generation via gateway
curl -X POST http://$gatewayService:80/v1/completions ...

Configuration Requirements

For Istio 1.28+ Clusters

If using Istio 1.28.1 or later, you must apply the InferencePool API compatibility fix shown above. The llm-d charts currently deploy with the experimental API group, which Istio 1.28+ does not support.

For Multi-Namespace Deployments

  1. Deploy WVA controller with controller.enabled=true in controller namespace
  2. Deploy additional models with controller.enabled=false (variant-only); example install commands are shown after this list
  3. Ensure HPA metrics include exported_namespace label for proper isolation
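
Concretely, the two install modes might look like the following; release names and the PR-specific namespaces are illustrative, and only the chart path, controller.enabled, and llmd.namespace are values referenced elsewhere in this PR:

# Shared WVA controller (full stack) in its own namespace
helm install wva charts/workload-variant-autoscaler \
  --namespace llm-d-autoscaler-pr-451 --create-namespace \
  --set controller.enabled=true

# Variant-only install (VA + HPA + ServiceMonitor) for an additional model namespace
helm install wva-model-b charts/workload-variant-autoscaler \
  --namespace llm-d-inference-scheduler-pr-451-b --create-namespace \
  --set controller.enabled=false \
  --set llmd.namespace=llm-d-inference-scheduler-pr-451-b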

Monitoring CI Runs

Watch cluster resources during CI (replace <PR_NUMBER> with your PR number):

# WVA controller
oc get pods,deploy -n llm-d-autoscaler-pr-<PR_NUMBER>

# Model A (primary namespace)
oc get pods,deploy,va,hpa -n llm-d-inference-scheduler-pr-<PR_NUMBER>

# Model B (secondary namespace)
oc get pods,deploy,va,hpa -n llm-d-inference-scheduler-pr-<PR_NUMBER>-b

# All VAs with detailed status
oc get va -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OPTIMIZED:.status.desiredOptimizedAlloc.numReplicas,REPLICAS:.status.currentAlloc.numReplicas,RATE:.status.currentAlloc.load.arrivalRate'

CI Cleanup Behavior

  • Before tests: All PR namespaces are cleaned up for a fresh start
  • After successful tests: Resources are cleaned up automatically
  • After failed tests: Resources are left in place for debugging

Testing

  • CI runs multi-model deployment and verifies both VAs scale independently
  • ShareGPT scale-up test validates autoscaling behavior for both models
  • Both models demonstrate scale-up under load and scale-down when load stops

🤖 Generated with Claude Code

Copilot AI (Contributor) left a comment (Dec 19, 2025)

Pull request overview

This PR implements a security gate for OpenShift E2E tests and migrates HuggingFace token management from GitHub secrets to cluster secrets. The gate requires /ok-to-test approval from maintainers before running GPU-intensive tests on external PRs, while allowing automatic runs for trusted contributors.

Key Changes:

  • Added approval workflow that posts instructions on new PRs and validates /ok-to-test commands
  • Modified E2E workflow to support both workflow_call (from gate) and workflow_dispatch triggers
  • Replaced GitHub secret-based HF_TOKEN with cluster secret retrieval from llm-d-hf-token

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
.github/workflows/ci-e2e-openshift-gate.yaml | New gate workflow implementing the /ok-to-test approval mechanism with permission checks and PR instruction posting
.github/workflows/ci-e2e-openshift.yaml | Updated to use the workflow_call trigger, unified input handling, and cluster-based HF token retrieval

Copilot AI left a comment (Dec 20, 2025): reviewed 46 out of 46 changed files and generated 3 comments.

re := regexp.MustCompile(`[^a-z0-9-]`)
result = re.ReplaceAllString(result, "")
// Trim leading/trailing hyphens
result = strings.Trim(result, "-")
Copilot AI (Dec 20, 2025):

The function doesn't handle the case where the sanitized result is empty (e.g., if the input contains only special characters). Consider adding validation to return an error or default value when the result is empty to prevent silent failures.

Suggested change:

result = strings.Trim(result, "-")
// If everything was stripped out, fall back to a safe default name
if result == "" {
    result = "default"
}

Copilot AI left a comment (Dec 20, 2025): reviewed 46 out of 46 changed files and generated 2 comments.

rm -f kubectl.sha256
# Install oc (OpenShift CLI)
curl -LO "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz"
curl -fsSL --retry 3 --retry-delay 5 -O "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz"
Copilot AI (Dec 20, 2025):

The OpenShift CLI (oc) is downloaded without any checksum verification. This creates a supply chain security risk. Consider adding checksum verification similar to kubectl, or use a pinned version with known checksum.

rm -f openshift-client-linux.tar.gz kubectl README.md
# Install helm
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
curl -fsSL --retry 3 --retry-delay 5 https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Copilot AI (Dec 20, 2025):

Executing a script directly from the internet via pipe to bash is a security risk. The script content could change between runs or be compromised. Consider pinning to a specific commit hash or downloading and verifying the script before execution.

Copilot AI left a comment (Dec 20, 2025): reviewed 46 out of 46 changed files and generated 5 comments.

Comment on lines +172 to +174
KUBECTL_VERSION="v1.31.0"
echo "Installing kubectl version: $KUBECTL_VERSION"
curl -fsSL --retry 3 --retry-delay 5 -o kubectl "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
Copilot AI (Dec 20, 2025):

The comment states 'Pinned 2025-12' but it's currently December 2025. Consider updating the comment to reflect when this pinning decision was actually made, or use a more generic phrase like 'Last updated: 2025-12' to avoid confusion about whether this is a future date.

It("should create and run parallel load generation jobs", func() {
By("cleaning up any existing jobs")
deleteParallelLoadJobs(ctx, jobBaseName, model.namespace, numLoadWorkers)
time.Sleep(2 * time.Second)
Copilot AI (Dec 20, 2025):

Replace hardcoded sleep with a conditional wait or increase timeout on the subsequent Eventually block. Using fixed sleeps can make tests flaky and slower than necessary. The Eventually block starting at line 273 already provides proper waiting logic for endpoints to be ready.

Suggested change: remove the time.Sleep(2 * time.Second) call.
_ = k8sClient.BatchV1().Jobs(model.namespace).Delete(ctx, healthCheckJobName, metav1.DeleteOptions{
PropagationPolicy: &backgroundPropagation,
})
time.Sleep(2 * time.Second)
Copilot AI (Dec 20, 2025):

Replace hardcoded sleeps with eventual consistency checks. These arbitrary 2-second waits can cause test flakiness. Consider using Eventually blocks to wait for the actual state you need (e.g., job deletion completion) rather than fixed time periods.

metav1.ListOptions{
LabelSelector: fmt.Sprintf("experiment=%s", jobBaseName),
})
time.Sleep(2 * time.Second)
Copilot AI (Dec 20, 2025):

Replace hardcoded sleeps with eventual consistency checks. These arbitrary 2-second waits can cause test flakiness. Consider using Eventually blocks to wait for the actual state you need (e.g., job deletion completion) rather than fixed time periods.

Suggested change (replacing the time.Sleep):

Eventually(func(g Gomega) {
    jobList, err := k8sClient.BatchV1().Jobs(model.namespace).List(ctx, metav1.ListOptions{
        LabelSelector: fmt.Sprintf("experiment=%s", jobBaseName),
    })
    g.Expect(err).NotTo(HaveOccurred(), "Should be able to list load generation jobs for cleanup")
    g.Expect(len(jobList.Items)).To(BeZero(), "All previous load generation jobs should be deleted before starting new ones")
}, 2*time.Minute, 5*time.Second).Should(Succeed())
Copilot AI left a comment (Dec 20, 2025): reviewed 46 out of 46 changed files and generated 6 comments.

# Clean up orphaned PR-specific namespaces from previous runs
# Pattern: llm-d-inference-scheduler-pr-* and llm-d-autoscaler-pr-*
echo "Cleaning up orphaned PR-specific namespaces..."
for ns in $(kubectl get ns -o name | grep -E 'llm-d-(inference-scheduler|autoscaler|autoscaling)-pr-' | cut -d/ -f2); do
Copilot AI (Dec 20, 2025):

The namespace pattern includes 'autoscaling' but the actual namespace pattern used in the workflow is 'autoscaler' (line 141). This inconsistency could cause the cleanup to miss orphaned namespaces with the 'autoscaling' pattern.

Comment on lines +400 to +401
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
Copilot AI (Dec 20, 2025):

The InferencePool resource definition is duplicated in the fix_namespace function for both namespaces. Consider extracting this into a template or using a Kubernetes manifest file to avoid duplication and reduce maintenance burden.

Comment on lines 284 to 296
By("waiting for vllm-service endpoints to exist")
Eventually(func(g Gomega) {
endpoints, err := k8sClient.CoreV1().Endpoints(model.namespace).Get(ctx, vllmServiceName, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred(), "vllm-service endpoints should exist")
g.Expect(endpoints.Subsets).NotTo(BeEmpty(), "vllm-service should have endpoints")

readyCount := 0
for _, subset := range endpoints.Subsets {
readyCount += len(subset.Addresses)
}
_, _ = fmt.Fprintf(GinkgoWriter, "%s has %d ready endpoints\n", vllmServiceName, readyCount)
g.Expect(readyCount).To(BeNumerically(">", 0), "vllm-service should have at least one ready endpoint")
}, 5*time.Minute, 10*time.Second).Should(Succeed())
Copilot AI (Dec 20, 2025):

The code attempts to get endpoints for vllmServiceName which might be empty (lines 251-256 set it conditionally). If vllmServiceName is empty, this will fail with an unclear error. Add a check to skip this validation if vllmServiceName is empty, or make it a required field.

Suggested change (guard the endpoints check quoted above when vllmServiceName is empty):

if vllmServiceName == "" {
    By("skipping vllm-service endpoints check because vllmServiceName is empty")
} else {
    By("waiting for vllm-service endpoints to exist")
    Eventually(func(g Gomega) {
        endpoints, err := k8sClient.CoreV1().Endpoints(model.namespace).Get(ctx, vllmServiceName, metav1.GetOptions{})
        g.Expect(err).NotTo(HaveOccurred(), "vllm-service endpoints should exist")
        g.Expect(endpoints.Subsets).NotTo(BeEmpty(), "vllm-service should have endpoints")
        readyCount := 0
        for _, subset := range endpoints.Subsets {
            readyCount += len(subset.Addresses)
        }
        _, _ = fmt.Fprintf(GinkgoWriter, "%s has %d ready endpoints\n", vllmServiceName, readyCount)
        g.Expect(readyCount).To(BeNumerically(">", 0), "vllm-service should have at least one ready endpoint")
    }, 5*time.Minute, 10*time.Second).Should(Succeed())
}
for modelID, modelVAs := range modelGroups {
for groupKey, modelVAs := range modelGroups {
// The groupKey is "modelID|namespace" - extract actual modelID from VAs
// All VAs in the group have the same modelID and namespace
Copilot AI (Dec 20, 2025):

Accessing modelVAs[0] without checking if the slice is empty could cause a panic. Although the grouping logic should ensure non-empty groups, add a defensive check to verify len(modelVAs) > 0 before accessing the first element.

Suggested change (add a defensive check after the comment):

// All VAs in the group have the same modelID and namespace
if len(modelVAs) == 0 {
    logger.V(logging.DEBUG).Info("Skipping empty model group",
        "groupKey", groupKey)
    continue
}
// Attempts to resolve the target model variant using scaleTargetRef
scaleTargetName := va.Spec.ScaleTargetRef.Name
if scaleTargetName == "" {
// Fallback to VA name for backwards compatibility
Copilot AI (Dec 20, 2025):

The fallback to VA name for backwards compatibility is undocumented in the CRD or user-facing documentation. Add a comment explaining when this fallback is used (legacy VAs without scaleTargetRef) and consider logging a deprecation warning to encourage users to migrate to the explicit scaleTargetRef field.

Suggested change (expand the comment and log a deprecation warning):

// Fallback to VA name for backwards compatibility:
// - This is used for legacy VariantAutoscaling resources that do not specify spec.scaleTargetRef.name.
// - New configurations should set spec.scaleTargetRef.name explicitly and not rely on this implicit fallback.
logger.Info("Using VariantAutoscaling name as fallback for spec.scaleTargetRef.name; this behavior is deprecated and may be removed in a future release. Please set spec.scaleTargetRef.name explicitly.",
    "variantAutoscaling", va.Name,
    "namespace", va.Namespace,
)
Copilot AI left a comment (Dec 20, 2025): reviewed 46 out of 46 changed files and generated 2 comments.

Comment on lines 210 to +237
if len(allDecisions) > 0 {
logger.Info("Applying scaling decisions",
"totalDecisions", len(allDecisions))
if err := e.applySaturationDecisions(ctx, allDecisions, vaMap); err != nil {
logger.Error(err, "Failed to apply saturation decisions")
return err
}
} else {
logger.Info("No scaling decisions to apply")
logger.Info("No scaling decisions to apply, updating VA status with metrics")
}
Copilot AI (Dec 20, 2025):

The if-else block at lines 232-237 only logs messages but doesn't affect control flow; the applySaturationDecisions call happens regardless. Consider simplifying by removing the conditional and using a single log statement like logger.Info("Applying decisions and updating VA status", "totalDecisions", len(allDecisions)) to reduce code complexity.

Suggested change (the diff replaces the if/else block quoted above with a single log statement):

logger.Info("Applying decisions and updating VA status",
    "totalDecisions", len(allDecisions))
Copilot AI left a comment (Dec 20, 2025): reviewed 46 out of 46 changed files and generated 1 comment.

Comments suppressed due to low confidence (1)

.github/workflows/ci-e2e-openshift.yaml:1

  • The WVA_IMAGE_PULL_POLICY variable is defined but never used in the workflow or install script. Consider removing it or documenting where it should be used.
name: CI - OpenShift E2E Tests

clubanderson changed the title from "Add ok-to-test gate and use cluster HF token for e2e tests" to "Add ok-to-test gate, use cluster HF token for e2e tests, fix istio issues, run wva against 2 different stacks simultaneously" on Dec 22, 2025
…d WVA

Multi-model E2E Testing:
- Deploy 2 models in 2 namespaces with 1 shared WVA controller
- Model A in llm-d-inference-scheduler-pr-XXX
- Model B in llm-d-inference-scheduler-pr-XXX-b
- Shared WVA in llm-d-autoscaler-pr-XXX

Test Improvements:
- Move HPA retrieval before VA stabilization to know minReplicas
- Wait for VA to stabilize at exact minReplicas before load test
- Increase stabilization timeout to 5 minutes
- Route load through Istio gateway instead of direct vLLM service

CI Cleanup Behavior:
- Before tests: Clean up all PR namespaces for fresh start
- After successful tests: Clean up automatically
- After failed tests: Leave resources for debugging

Documentation:
- Add monitoring commands with PR number placeholder
- Document multi-model testing architecture
- Document CI cleanup behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI left a comment (Dec 22, 2025): reviewed 47 out of 47 changed files and generated 3 comments.

Comment on lines +170 to +171
# Pinned 2025-12: v1.31.0 tested compatible with OpenShift 4.16+
# Update this version when upgrading target cluster or during regular dependency reviews
Copilot AI (Dec 22, 2025):

The comment states kubectl is 'Pinned 2025-12' but the PR description indicates the current date is December 22, 2025. Consider updating the pin date comment to reflect when this version was actually pinned, or clarify if this refers to a future update schedule.

Suggested change:

# Pinned on 2025-12-22: kubectl v1.31.0 tested compatible with OpenShift 4.16+
# Review and update this version during dependency reviews or when upgrading target clusters

models := []modelTestConfig{
{
name: "Model A1",
Copilot AI (Dec 22, 2025):

The model is named 'Model A1' in the test but the PR description and other documentation refers to it as 'Model A'. For consistency, consider using 'Model A' instead of 'Model A1'.

Comment on lines 311 to +325
existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)
stalePodCount := 0

// Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
for podName := range podSet {
if !existingPods[podName] {
stalePodCount++
// TODO: remove debug log after verification
logger.V(logging.DEBUG).Info("Filtering pod from stale vLLM metrics", "pod", podName, "namespace", namespace, "model", modelID)
delete(podSet, podName)
// If getExistingPods returns empty but we have candidate pods with metrics,
// skip the filtering - this handles the case where kube_pod_info hasn't been
// scraped yet for new pods. It's better to include all candidates than to
// filter them all out and skip saturation analysis entirely.
// Note: This workaround may include metrics from recently-terminated pods if
// kube_pod_info is stale. The typical staleness window is ~30s based on scrape intervals.
// TODO: Consider adding time-based filtering or retry logic for more accurate pod filtering.
if len(existingPods) == 0 && len(podSet) > 0 {
logger.Info("kube_pod_info returned no pods but we have metric candidates, skipping stale pod filtering",
"candidatePods", len(podSet), "namespace", namespace, "model", modelID)
} else {
// Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
Copilot AI (Dec 22, 2025):

This workaround bypasses stale pod filtering when kube_pod_info is unavailable, which could include metrics from terminated pods for up to 30 seconds. Consider implementing the time-based filtering or retry logic mentioned in the TODO to improve accuracy, or document acceptable tolerance levels for including stale metrics in saturation analysis.

Suggested change
existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)
stalePodCount := 0
// Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
for podName := range podSet {
if !existingPods[podName] {
stalePodCount++
// TODO: remove debug log after verification
logger.V(logging.DEBUG).Info("Filtering pod from stale vLLM metrics", "pod", podName, "namespace", namespace, "model", modelID)
delete(podSet, podName)
// If getExistingPods returns empty but we have candidate pods with metrics,
// skip the filtering - this handles the case where kube_pod_info hasn't been
// scraped yet for new pods. It's better to include all candidates than to
// filter them all out and skip saturation analysis entirely.
// Note: This workaround may include metrics from recently-terminated pods if
// kube_pod_info is stale. The typical staleness window is ~30s based on scrape intervals.
// TODO: Consider adding time-based filtering or retry logic for more accurate pod filtering.
if len(existingPods) == 0 && len(podSet) > 0 {
logger.Info("kube_pod_info returned no pods but we have metric candidates, skipping stale pod filtering",
"candidatePods", len(podSet), "namespace", namespace, "model", modelID)
} else {
// Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
//
// To avoid races where kube_pod_info has not yet been scraped for new pods, we allow a bounded
// retry window before falling back to including all candidate pods. This reduces the chance of
// including metrics from recently-terminated pods while still avoiding dropping all metrics when
// kube_pod_info is briefly empty.
maxPodInfoWait := 5 * time.Second
retryInterval := 500 * time.Millisecond
podInfoStart := time.Now()
existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)
stalePodCount := 0
// If getExistingPods returns empty but we have candidate pods with metrics, retry for a short,
// bounded window before skipping filtering entirely. This handles the case where kube_pod_info
// hasn't been scraped yet for new pods. If, after the retry window, we still have no pod info,
// we proceed without filtering but log that stale metrics may be included for up to the typical
// kube-state-metrics staleness window (~30s).
if len(existingPods) == 0 && len(podSet) > 0 {
logger.Info("kube_pod_info returned no pods but we have metric candidates, retrying before skipping stale pod filtering",
"candidatePods", len(podSet), "namespace", namespace, "model", modelID,
"maxWait", maxPodInfoWait, "retryInterval", retryInterval)
for len(existingPods) == 0 && time.Since(podInfoStart) < maxPodInfoWait {
select {
case <-ctx.Done():
// Context cancelled or deadline exceeded; stop retrying.
break
case <-time.After(retryInterval):
existingPods = cmc.getExistingPods(ctx, namespace, deployments, podSet)
}
// If context is done, avoid extra work.
if ctx.Err() != nil {
break
}
}
if len(existingPods) == 0 {
// Fall back to previous behavior: include all candidate pods without stale filtering,
// but document the tolerated window for potentially stale metrics.
logger.Info("kube_pod_info remained empty after retry window; proceeding without stale pod filtering",
"candidatePods", len(podSet), "namespace", namespace, "model", modelID,
"retryDuration", time.Since(podInfoStart),
"toleratedStalenessWindow", "up to ~30s due to Prometheus retention")
}
}
// If we have pod information, filter out pods that don't exist according to the queried
// Prometheus kube-state-metrics.
if len(existingPods) > 0 {

@asm582 (Collaborator) left a comment

Address review

- Add documentation for CONFIG_MAP_NAME env var in prometheus.go
  explaining how it's set by Helm chart for multi-instance support
- Add TODO for scaleTargetRef.name CRD requirement in variant.go
  noting the check should be enforced at schema level

Related issues:
- #454: Consider hardware type when inferring target deployment
- #455: Replace pod filtering with app-level readiness probes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
clubanderson requested a review from asm582 on December 22, 2025
@asm582 (Collaborator) left a comment

lgtm

Some issues will be addressed later.

clubanderson merged commit 53b22bd into main on Dec 22, 2025
5 checks passed
ev-shindin pushed a commit to ev-shindin/workload-variant-autoscaler that referenced this pull request Jan 14, 2026
Add ok-to-test gate, use cluster HF token for e2e tests, fix istio issues, run wva against 2 different stacks simultaneously