Add ok-to-test gate, use cluster HF token for e2e tests, fix istio issues, run wva against 2 different stacks simultaneously #451
Conversation
Pull request overview
This PR implements a security gate for OpenShift E2E tests and migrates HuggingFace token management from GitHub secrets to cluster secrets. The gate requires /ok-to-test approval from maintainers before running GPU-intensive tests on external PRs, while allowing automatic runs for trusted contributors.
Key Changes:
- Added approval workflow that posts instructions on new PRs and validates `/ok-to-test` commands
- Modified E2E workflow to support both workflow_call (from gate) and workflow_dispatch triggers
- Replaced GitHub secret-based HF_TOKEN with cluster secret retrieval from llm-d-hf-token
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| .github/workflows/ci-e2e-openshift-gate.yaml | New gate workflow implementing /ok-to-test approval mechanism with permission checks and PR instruction posting |
| .github/workflows/ci-e2e-openshift.yaml | Updated to use workflow_call trigger, unified input handling, and cluster-based HF token retrieval |
Force-pushed from 783e71e to 5119cf1.
    re := regexp.MustCompile(`[^a-z0-9-]`)
    result = re.ReplaceAllString(result, "")
    // Trim leading/trailing hyphens
    result = strings.Trim(result, "-")
The function doesn't handle the case where the sanitized result is empty (e.g., if the input contains only special characters). Consider adding validation to return an error or default value when the result is empty to prevent silent failures.
Suggested change:

    result = strings.Trim(result, "-")
    // If everything was stripped out, fall back to a safe default name
    if result == "" {
        result = "default"
    }
    rm -f kubectl.sha256
    # Install oc (OpenShift CLI)
    curl -LO "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz"
    curl -fsSL --retry 3 --retry-delay 5 -O "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz"
The OpenShift CLI (oc) is downloaded without any checksum verification. This creates a supply chain security risk. Consider adding checksum verification similar to kubectl, or use a pinned version with known checksum.
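For illustration, one possible hardened form of the download step (this assumes the mirror publishes a `sha256sum.txt` listing next to the tarball; verify the exact filename and entry format before adopting):

```bash
# Download the client tarball and its published checksum list, then verify before extracting.
BASE="https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable"
curl -fsSL --retry 3 --retry-delay 5 -O "${BASE}/openshift-client-linux.tar.gz"
curl -fsSL --retry 3 --retry-delay 5 -O "${BASE}/sha256sum.txt"   # assumed filename
grep ' openshift-client-linux.tar.gz$' sha256sum.txt | sha256sum -c -
tar -xzf openshift-client-linux.tar.gz oc
```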
    rm -f openshift-client-linux.tar.gz kubectl README.md
    # Install helm
    curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    curl -fsSL --retry 3 --retry-delay 5 https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Executing a script directly from the internet via pipe to bash is a security risk. The script content could change between runs or be compromised. Consider pinning to a specific commit hash or downloading and verifying the script before execution.
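As a sketch of that approach (the commit hash and checksum below are placeholders to be filled in with reviewed values, not values taken from this PR):

```bash
# Pin the installer script to a reviewed commit and verify its checksum before running it.
HELM_SCRIPT_COMMIT="<reviewed-commit-sha>"      # placeholder
HELM_SCRIPT_SHA256="<expected-sha256>"          # placeholder
curl -fsSL --retry 3 --retry-delay 5 -o get-helm-3 \
  "https://raw.githubusercontent.com/helm/helm/${HELM_SCRIPT_COMMIT}/scripts/get-helm-3"
echo "${HELM_SCRIPT_SHA256}  get-helm-3" | sha256sum -c -
bash ./get-helm-3
```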
    KUBECTL_VERSION="v1.31.0"
    echo "Installing kubectl version: $KUBECTL_VERSION"
    curl -fsSL --retry 3 --retry-delay 5 -o kubectl "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
The comment states 'Pinned 2025-12' but it's currently December 2025. Consider updating the comment to reflect when this pinning decision was actually made, or use a more generic phrase like 'Last updated: 2025-12' to avoid confusion about whether this is a future date.
| It("should create and run parallel load generation jobs", func() { | ||
| By("cleaning up any existing jobs") | ||
| deleteParallelLoadJobs(ctx, jobBaseName, model.namespace, numLoadWorkers) | ||
| time.Sleep(2 * time.Second) |
Replace hardcoded sleep with a conditional wait or increase timeout on the subsequent Eventually block. Using fixed sleeps can make tests flaky and slower than necessary. The Eventually block starting at line 273 already provides proper waiting logic for endpoints to be ready.
Suggested change: remove the `time.Sleep(2 * time.Second)` call.
    _ = k8sClient.BatchV1().Jobs(model.namespace).Delete(ctx, healthCheckJobName, metav1.DeleteOptions{
        PropagationPolicy: &backgroundPropagation,
    })
    time.Sleep(2 * time.Second)
Replace hardcoded sleeps with eventual consistency checks. These arbitrary 2-second waits can cause test flakiness. Consider using Eventually blocks to wait for the actual state you need (e.g., job deletion completion) rather than fixed time periods.
    metav1.ListOptions{
        LabelSelector: fmt.Sprintf("experiment=%s", jobBaseName),
    })
    time.Sleep(2 * time.Second)
Replace hardcoded sleeps with eventual consistency checks. These arbitrary 2-second waits can cause test flakiness. Consider using Eventually blocks to wait for the actual state you need (e.g., job deletion completion) rather than fixed time periods.
Suggested change (replacing the `time.Sleep`):

    Eventually(func(g Gomega) {
        jobList, err := k8sClient.BatchV1().Jobs(model.namespace).List(ctx, metav1.ListOptions{
            LabelSelector: fmt.Sprintf("experiment=%s", jobBaseName),
        })
        g.Expect(err).NotTo(HaveOccurred(), "Should be able to list load generation jobs for cleanup")
        g.Expect(len(jobList.Items)).To(BeZero(), "All previous load generation jobs should be deleted before starting new ones")
    }, 2*time.Minute, 5*time.Second).Should(Succeed())
    # Clean up orphaned PR-specific namespaces from previous runs
    # Pattern: llm-d-inference-scheduler-pr-* and llm-d-autoscaler-pr-*
    echo "Cleaning up orphaned PR-specific namespaces..."
    for ns in $(kubectl get ns -o name | grep -E 'llm-d-(inference-scheduler|autoscaler|autoscaling)-pr-' | cut -d/ -f2); do
The namespace pattern includes 'autoscaling' but the actual namespace pattern used in the workflow is 'autoscaler' (line 141). This inconsistency could cause the cleanup to miss orphaned namespaces with the 'autoscaling' pattern.
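For example, if only the `-autoscaler-` form is ever created by the workflow, the two patterns could be kept in sync roughly like this (the loop body is illustrative, not the actual cleanup step; keep the `autoscaling` alternative only if older runs may have created such namespaces):

```bash
# Match only the namespace prefixes the workflow actually creates.
for ns in $(kubectl get ns -o name | grep -E 'llm-d-(inference-scheduler|autoscaler)-pr-' | cut -d/ -f2); do
  kubectl delete ns "$ns" --ignore-not-found --wait=false
done
```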
    apiVersion: inference.networking.k8s.io/v1
    kind: InferencePool
The InferencePool resource definition is duplicated in the fix_namespace function for both namespaces. Consider extracting this into a template or using a Kubernetes manifest file to avoid duplication and reduce maintenance burden.
| By("waiting for vllm-service endpoints to exist") | ||
| Eventually(func(g Gomega) { | ||
| endpoints, err := k8sClient.CoreV1().Endpoints(model.namespace).Get(ctx, vllmServiceName, metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred(), "vllm-service endpoints should exist") | ||
| g.Expect(endpoints.Subsets).NotTo(BeEmpty(), "vllm-service should have endpoints") | ||
|
|
||
| readyCount := 0 | ||
| for _, subset := range endpoints.Subsets { | ||
| readyCount += len(subset.Addresses) | ||
| } | ||
| _, _ = fmt.Fprintf(GinkgoWriter, "%s has %d ready endpoints\n", vllmServiceName, readyCount) | ||
| g.Expect(readyCount).To(BeNumerically(">", 0), "vllm-service should have at least one ready endpoint") | ||
| }, 5*time.Minute, 10*time.Second).Should(Succeed()) |
The code attempts to get endpoints for vllmServiceName which might be empty (lines 251-256 set it conditionally). If vllmServiceName is empty, this will fail with an unclear error. Add a check to skip this validation if vllmServiceName is empty, or make it a required field.
| By("waiting for vllm-service endpoints to exist") | |
| Eventually(func(g Gomega) { | |
| endpoints, err := k8sClient.CoreV1().Endpoints(model.namespace).Get(ctx, vllmServiceName, metav1.GetOptions{}) | |
| g.Expect(err).NotTo(HaveOccurred(), "vllm-service endpoints should exist") | |
| g.Expect(endpoints.Subsets).NotTo(BeEmpty(), "vllm-service should have endpoints") | |
| readyCount := 0 | |
| for _, subset := range endpoints.Subsets { | |
| readyCount += len(subset.Addresses) | |
| } | |
| _, _ = fmt.Fprintf(GinkgoWriter, "%s has %d ready endpoints\n", vllmServiceName, readyCount) | |
| g.Expect(readyCount).To(BeNumerically(">", 0), "vllm-service should have at least one ready endpoint") | |
| }, 5*time.Minute, 10*time.Second).Should(Succeed()) | |
| if vllmServiceName == "" { | |
| By("skipping vllm-service endpoints check because vllmServiceName is empty") | |
| } else { | |
| By("waiting for vllm-service endpoints to exist") | |
| Eventually(func(g Gomega) { | |
| endpoints, err := k8sClient.CoreV1().Endpoints(model.namespace).Get(ctx, vllmServiceName, metav1.GetOptions{}) | |
| g.Expect(err).NotTo(HaveOccurred(), "vllm-service endpoints should exist") | |
| g.Expect(endpoints.Subsets).NotTo(BeEmpty(), "vllm-service should have endpoints") | |
| readyCount := 0 | |
| for _, subset := range endpoints.Subsets { | |
| readyCount += len(subset.Addresses) | |
| } | |
| _, _ = fmt.Fprintf(GinkgoWriter, "%s has %d ready endpoints\n", vllmServiceName, readyCount) | |
| g.Expect(readyCount).To(BeNumerically(">", 0), "vllm-service should have at least one ready endpoint") | |
| }, 5*time.Minute, 10*time.Second).Should(Succeed()) | |
| } |
    for modelID, modelVAs := range modelGroups {
    for groupKey, modelVAs := range modelGroups {
        // The groupKey is "modelID|namespace" - extract actual modelID from VAs
        // All VAs in the group have the same modelID and namespace
Accessing modelVAs[0] without checking if the slice is empty could cause a panic. Although the grouping logic should ensure non-empty groups, add a defensive check to verify len(modelVAs) > 0 before accessing the first element.
Suggested change:

    // All VAs in the group have the same modelID and namespace
    if len(modelVAs) == 0 {
        logger.V(logging.DEBUG).Info("Skipping empty model group",
            "groupKey", groupKey)
        continue
    }
    // Attempts to resolve the target model variant using scaleTargetRef
    scaleTargetName := va.Spec.ScaleTargetRef.Name
    if scaleTargetName == "" {
        // Fallback to VA name for backwards compatibility
The fallback to VA name for backwards compatibility is undocumented in the CRD or user-facing documentation. Add a comment explaining when this fallback is used (legacy VAs without scaleTargetRef) and consider logging a deprecation warning to encourage users to migrate to the explicit scaleTargetRef field.
Suggested change:

    // Fallback to VA name for backwards compatibility:
    // - This is used for legacy VariantAutoscaling resources that do not specify spec.scaleTargetRef.name.
    // - New configurations should set spec.scaleTargetRef.name explicitly and not rely on this implicit fallback.
    logger.Info("Using VariantAutoscaling name as fallback for spec.scaleTargetRef.name; this behavior is deprecated and may be removed in a future release. Please set spec.scaleTargetRef.name explicitly.",
        "variantAutoscaling", va.Name,
        "namespace", va.Namespace,
    )
    if len(allDecisions) > 0 {
        logger.Info("Applying scaling decisions",
            "totalDecisions", len(allDecisions))
        if err := e.applySaturationDecisions(ctx, allDecisions, vaMap); err != nil {
            logger.Error(err, "Failed to apply saturation decisions")
            return err
        }
    } else {
        logger.Info("No scaling decisions to apply")
        logger.Info("No scaling decisions to apply, updating VA status with metrics")
    }
The if-else block at lines 232-237 only logs messages but doesn't affect control flow - the applySaturationDecisions call happens regardless. Consider simplifying by removing the conditional and using a single log statement like `logger.Info("Applying decisions and updating VA status", "totalDecisions", len(allDecisions))` to reduce code complexity.
Suggested change (a single log statement in place of the conditional logging):

    logger.Info("Applying decisions and updating VA status",
        "totalDecisions", len(allDecisions))
Force-pushed from c4a6a04 to aa0475f.
Pull request overview
Copilot reviewed 46 out of 46 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
.github/workflows/ci-e2e-openshift.yaml:1
- The `WVA_IMAGE_PULL_POLICY` variable is defined but never used in the workflow or install script. Consider removing it or documenting where it should be used.
    name: CI - OpenShift E2E Tests
…d WVA Multi-model E2E Testing:
- Deploy 2 models in 2 namespaces with 1 shared WVA controller
- Model A in llm-d-inference-scheduler-pr-XXX
- Model B in llm-d-inference-scheduler-pr-XXX-b
- Shared WVA in llm-d-autoscaler-pr-XXX

Test Improvements:
- Move HPA retrieval before VA stabilization to know minReplicas
- Wait for VA to stabilize at exact minReplicas before load test
- Increase stabilization timeout to 5 minutes
- Route load through Istio gateway instead of direct vLLM service

CI Cleanup Behavior:
- Before tests: Clean up all PR namespaces for fresh start
- After successful tests: Clean up automatically
- After failed tests: Leave resources for debugging

Documentation:
- Add monitoring commands with PR number placeholder
- Document multi-model testing architecture
- Document CI cleanup behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed from 97bd05a to e25066c.
    # Pinned 2025-12: v1.31.0 tested compatible with OpenShift 4.16+
    # Update this version when upgrading target cluster or during regular dependency reviews
The comment states kubectl is 'Pinned 2025-12' but the PR description indicates the current date is December 22, 2025. Consider updating the pin date comment to reflect when this version was actually pinned, or clarify if this refers to a future update schedule.
Suggested change:

    # Pinned on 2025-12-22: kubectl v1.31.0 tested compatible with OpenShift 4.16+
    # Review and update this version during dependency reviews or when upgrading target clusters
    models := []modelTestConfig{
        {
            name: "Model A1",
The model is named 'Model A1' in the test but the PR description and other documentation refers to it as 'Model A'. For consistency, consider using 'Model A' instead of 'Model A1'.
    existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)
    stalePodCount := 0

    // Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
    for podName := range podSet {
        if !existingPods[podName] {
            stalePodCount++
            // TODO: remove debug log after verification
            logger.V(logging.DEBUG).Info("Filtering pod from stale vLLM metrics", "pod", podName, "namespace", namespace, "model", modelID)
            delete(podSet, podName)
    // If getExistingPods returns empty but we have candidate pods with metrics,
    // skip the filtering - this handles the case where kube_pod_info hasn't been
    // scraped yet for new pods. It's better to include all candidates than to
    // filter them all out and skip saturation analysis entirely.
    // Note: This workaround may include metrics from recently-terminated pods if
    // kube_pod_info is stale. The typical staleness window is ~30s based on scrape intervals.
    // TODO: Consider adding time-based filtering or retry logic for more accurate pod filtering.
    if len(existingPods) == 0 && len(podSet) > 0 {
        logger.Info("kube_pod_info returned no pods but we have metric candidates, skipping stale pod filtering",
            "candidatePods", len(podSet), "namespace", namespace, "model", modelID)
    } else {
        // Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
This workaround bypasses stale pod filtering when kube_pod_info is unavailable, which could include metrics from terminated pods for up to 30 seconds. Consider implementing the time-based filtering or retry logic mentioned in the TODO to improve accuracy, or document acceptable tolerance levels for including stale metrics in saturation analysis.
Suggested change:

    // Filter out pods that don't exist according to the queried Prometheus kube-state-metrics
    //
    // To avoid races where kube_pod_info has not yet been scraped for new pods, we allow a bounded
    // retry window before falling back to including all candidate pods. This reduces the chance of
    // including metrics from recently-terminated pods while still avoiding dropping all metrics when
    // kube_pod_info is briefly empty.
    maxPodInfoWait := 5 * time.Second
    retryInterval := 500 * time.Millisecond
    podInfoStart := time.Now()
    existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)
    stalePodCount := 0
    // If getExistingPods returns empty but we have candidate pods with metrics, retry for a short,
    // bounded window before skipping filtering entirely. This handles the case where kube_pod_info
    // hasn't been scraped yet for new pods. If, after the retry window, we still have no pod info,
    // we proceed without filtering but log that stale metrics may be included for up to the typical
    // kube-state-metrics staleness window (~30s).
    if len(existingPods) == 0 && len(podSet) > 0 {
        logger.Info("kube_pod_info returned no pods but we have metric candidates, retrying before skipping stale pod filtering",
            "candidatePods", len(podSet), "namespace", namespace, "model", modelID,
            "maxWait", maxPodInfoWait, "retryInterval", retryInterval)
        for len(existingPods) == 0 && time.Since(podInfoStart) < maxPodInfoWait {
            select {
            case <-ctx.Done():
                // Context cancelled or deadline exceeded; stop retrying.
                break
            case <-time.After(retryInterval):
                existingPods = cmc.getExistingPods(ctx, namespace, deployments, podSet)
            }
            // If context is done, avoid extra work.
            if ctx.Err() != nil {
                break
            }
        }
        if len(existingPods) == 0 {
            // Fall back to previous behavior: include all candidate pods without stale filtering,
            // but document the tolerated window for potentially stale metrics.
            logger.Info("kube_pod_info remained empty after retry window; proceeding without stale pod filtering",
                "candidatePods", len(podSet), "namespace", namespace, "model", modelID,
                "retryDuration", time.Since(podInfoStart),
                "toleratedStalenessWindow", "up to ~30s due to Prometheus retention")
        }
    }
    // If we have pod information, filter out pods that don't exist according to the queried
    // Prometheus kube-state-metrics.
    if len(existingPods) > 0 {
- Add documentation for CONFIG_MAP_NAME env var in prometheus.go explaining how it's set by Helm chart for multi-instance support
- Add TODO for scaleTargetRef.name CRD requirement in variant.go noting the check should be enforced at schema level

Related issues:
- #454: Consider hardware type when inferring target deployment
- #455: Replace pod filtering with app-level readiness probes

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
asm582 left a comment
lgtm
Some issues will be addressed later.
Add ok-to-test gate, use cluster HF token for e2e tests, fix istio issues, run wva against 2 different stacks simultaneously
Executive Summary
This PR introduces an ok-to-test gate for GPU-intensive e2e tests and adds comprehensive multi-model testing capabilities for the workload-variant-autoscaler on OpenShift. The changes enable testing 2 models in 2 separate namespaces while using a single shared WVA controller in a 3rd namespace, demonstrating production-ready multi-tenant scaling scenarios.
Key Changes
Security & CI Gate
- `ci-e2e-openshift-gate.yaml` workflow requiring `/ok-to-test` comment from maintainers before running GPU tests
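A minimal sketch of how such a gate can be wired in GitHub Actions (the job name, permitted roles, and the called workflow path are assumptions for illustration, not the actual contents of `ci-e2e-openshift-gate.yaml`):

```yaml
name: ok-to-test gate
on:
  issue_comment:
    types: [created]
jobs:
  run-e2e:
    # Only react to /ok-to-test comments on PRs from maintainer-level accounts.
    if: >-
      github.event.issue.pull_request &&
      contains(github.event.comment.body, '/ok-to-test') &&
      contains(fromJSON('["OWNER","MEMBER","COLLABORATOR"]'), github.event.comment.author_association)
    uses: ./.github/workflows/ci-e2e-openshift.yaml
    secrets: inherit
```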
Helm Chart: Controller-Only and Variant-Only Installation
- `controller.enabled` flag to helm chart for modular deployment
- `controller.enabled=true` (default): Deploy full WVA stack (controller + VA + HPA + ServiceMonitor)
- `controller.enabled=false`: Deploy only variant resources (VA + HPA + ServiceMonitor)
Multi-Model E2E Testing Architecture
Deploy 2 models in 2 namespaces with 1 shared WVA controller:
- `llm-d-inference-scheduler-pr-XXX` (Model A)
- `llm-d-inference-scheduler-pr-XXX-b` (Model B)
- `llm-d-autoscaler-pr-XXX` (shared WVA managing both)
Critical Bug Fixes for Multi-Namespace Support
1. Saturation Engine Multi-Namespace Grouping
Problem: VAs with the same modelID in different namespaces were grouped together, causing incorrect scaling decisions.
Fix (`internal/utils/variant.go`): Group VAs by `modelID|namespace` composite key instead of just `modelID`:
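A minimal sketch of the grouping idea with simplified types (the struct and function names are illustrative, not the actual code in `internal/utils/variant.go`):

```go
package variantutil

import "fmt"

// variantAutoscaling is a simplified stand-in for the real VariantAutoscaling object.
type variantAutoscaling struct {
	Name      string
	Namespace string
	ModelID   string
}

// groupByModelAndNamespace keys each group by "modelID|namespace" so that two
// VariantAutoscalings with the same modelID in different namespaces are never
// mixed into a single scaling decision.
func groupByModelAndNamespace(vas []variantAutoscaling) map[string][]variantAutoscaling {
	groups := make(map[string][]variantAutoscaling, len(vas))
	for _, va := range vas {
		key := fmt.Sprintf("%s|%s", va.ModelID, va.Namespace)
		groups[key] = append(groups[key], va)
	}
	return groups
}
```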
2. HPA External Metrics Namespace Isolation
Problem: HPA metrics from different namespaces could collide when using the same metric name.
Fix (`charts/workload-variant-autoscaler/templates/hpa.yaml`): Add `exported_namespace` label selector:
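A rough sketch of what such a selector could look like in an HPA external-metric block (a fragment only; the metric name and target values are placeholders, not the actual chart template):

```yaml
# Fragment of the HPA spec: the selector scopes the external metric to this release's namespace.
metrics:
  - type: External
    external:
      metric:
        name: wva_desired_replicas            # placeholder metric name
        selector:
          matchLabels:
            exported_namespace: {{ .Release.Namespace }}
      target:
        type: AverageValue
        averageValue: "1"                     # illustrative target
```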
3. Istio 1.28+ InferencePool API Compatibility
Problem: Istio 1.28.1 requires GA API (`inference.networking.k8s.io`) but llm-d charts deploy with experimental API (`inference.networking.x-k8s.io`).
Fix (`.github/workflows/ci-e2e-openshift.yaml`): Add workflow step to patch resources:
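A sketch of the idea behind that step (the namespace variable and manifest path are illustrative, not the exact workflow contents):

```bash
# Replace the InferencePool created under the experimental API group with one
# defined under the GA group (inference.networking.k8s.io/v1) that Istio 1.28+ watches.
NS="llm-d-inference-scheduler-pr-${PR_NUMBER}"
kubectl delete inferencepools.inference.networking.x-k8s.io --all -n "$NS" --ignore-not-found
kubectl apply -n "$NS" -f hack/inferencepool-ga.yaml   # hypothetical manifest path
```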
4. Load Generation Through Gateway
Problem: Tests were sending load directly to `vllm-service:8200`, bypassing the InferencePool/EPP routing.
Fix (`test/e2e-openshift/sharegpt_scaleup_test.go`): Route all traffic through Istio gateway on port 80:
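A hypothetical helper illustrating the change (the function, parameter names, and request shape are assumptions, not the actual test code):

```go
package e2eutil

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildGatewayCompletionRequest sends load to the shared Istio gateway on port 80,
// with the target model carried in the request body so the InferencePool/EPP can
// route it, instead of calling vllm-service:8200 directly.
func buildGatewayCompletionRequest(ctx context.Context, gatewayHost, modelID, prompt string) (*http.Request, error) {
	body, err := json.Marshal(map[string]any{
		"model":      modelID,
		"prompt":     prompt,
		"max_tokens": 64,
	})
	if err != nil {
		return nil, err
	}
	// Port 80 on the gateway, not the per-model vLLM service port.
	url := fmt.Sprintf("http://%s/v1/completions", gatewayHost)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```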
Configuration Requirements
For Istio 1.28+ Clusters
If using Istio 1.28.1 or later, you must apply the InferencePool API compatibility fix shown above. The llm-d charts currently deploy with the experimental API group which Istio 1.28+ does not support.
For Multi-Namespace Deployments
- `controller.enabled=true` in controller namespace
- `controller.enabled=false` (variant-only) in model namespaces
- `exported_namespace` label for proper isolation
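A sketch of what the two installs could look like with this flag (release names and flags other than `controller.enabled` are illustrative):

```bash
# Controller namespace: full WVA stack (controller + VA + HPA + ServiceMonitor).
helm install wva charts/workload-variant-autoscaler \
  -n llm-d-autoscaler-pr-XXX --create-namespace \
  --set controller.enabled=true

# Each model namespace: variant-only resources (VA + HPA + ServiceMonitor).
helm install wva-variant charts/workload-variant-autoscaler \
  -n llm-d-inference-scheduler-pr-XXX --create-namespace \
  --set controller.enabled=false
```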
Monitoring CI Runs
Watch cluster resources during CI (replace `<PR_NUMBER>` with your PR number):
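For example (illustrative commands only; the CRD short name and controller deployment name are assumptions to check against your cluster):

```bash
kubectl get pods -n llm-d-inference-scheduler-pr-<PR_NUMBER> -w        # Model A stack
kubectl get pods -n llm-d-inference-scheduler-pr-<PR_NUMBER>-b -w      # Model B stack
kubectl get hpa,variantautoscalings -n llm-d-inference-scheduler-pr-<PR_NUMBER>
kubectl logs -n llm-d-autoscaler-pr-<PR_NUMBER> deploy/workload-variant-autoscaler -f   # shared WVA controller
```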
CI Cleanup Behavior
- Before tests: clean up all PR namespaces for a fresh start
- After successful tests: clean up automatically
- After failed tests: leave resources for debugging
Testing
🤖 Generated with Claude Code