Migrate to consolidated e2e test suite #760

Merged
mamy-CS merged 11 commits into llm-d:main from mamy-CS:e2e-reorganize
Feb 19, 2026

mamy-CS (Collaborator) commented Feb 18, 2026

Summary

Consolidated e2e test suite (test/e2e/) infrastructure

  • All tests use an environment-agnostic consolidated suite
  • Added test fixtures and builders
  • Added a test configuration system
  • Improved error handling and diagnostics throughout the test suite
  • test-e2e-smoke-with-setup now runs automatically on PRs with code changes
  • Added deprecation notes to the test-e2e and test-e2e-openshift Makefile targets
  • Old targets remain functional for backward compatibility during migration and will be removed after full verification
  • Some tests are labeled flaky or skipped (others are currently working on them)
  • More PRs will follow to update, add, or remove e2e tests as needed; this PR focuses on e2e infrastructure

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

mamy-CS (Collaborator, Author) commented Feb 18, 2026

/ok-to-test

github-actions (Contributor) commented:

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

github-actions (Contributor) commented:

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

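| Resource | Total | Allocated | Available |
|----------|-------|-----------|-----------|
| GPUs | 50 | 18 | 32 |

| Cluster | Value |
|---------|-------|
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |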

mamy-CS self-assigned this Feb 18, 2026
else
log_success "Successfully pulled image '$WVA_IMAGE_REPO:$WVA_IMAGE_TAG' from registry"
# Try to pull the image, or use local image if pull fails
if ! docker pull "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG"; then

Collaborator:

What's the use case for pulling the WVA image from a (I suppose) remote repository?

Collaborator, Author:

This is for testing a released version without building locally, if I understand your question correctly.

)

// Constants for MetricsAvailable condition
// Note: Reasons should match api/v1alpha1 constants for consistency

Collaborator:

What about moving the 4 constants below to llmdVariantAutoscalingV1alpha1? Less code that way.

)

// CreateHPA creates a HorizontalPodAutoscaler resource for WVA integration
func CreateHPA(

Collaborator:

People familiar with the k8s API verbs would expect Create to fail if the object already exists, so this may be a source of confusion. What about having (at least) 3 functions: CreateHPA, DeleteHPA and EnsureHPA? The first two functions match the semantics of the k8s API verbs, and the third is a convenience function that calls the lower-level functions. If you agree, make sure to also update the other builder functions.

lionelvillard (Collaborator) left a comment:

LGTM, thanks! I added minor comments.

// InferencePool compatibility via llm-d.ai/model-pool label.
// This function is idempotent: it will delete any existing deployment with the same name
// before creating a new one to handle leftover resources from previous test runs.
func CreateModelService(ctx context.Context, k8sClient *kubernetes.Clientset, namespace, name, poolName, modelID string, useSimulator bool, maxNumSeqs int) error {

Collaborator:

I would put this builder function in model_service_builder.go for consistency.

}

// LoadConfigFromEnv reads e2e test configuration from environment variables
func LoadConfigFromEnv() E2EConfig {

Collaborator:

Maybe consider returning a pointer? Not very important.

Collaborator, Author:

Yes, the struct isn't large enough to justify a pointer. Better to keep the current code simple and clear.


// Helper functions for environment variable parsing

func getEnv(key, defaultValue string) string {

Collaborator:

I'm not 100% sure, but all the getEnvXX functions could probably be replaced by one generic function.

Collaborator, Author:

Yes, that would avoid some verbosity there, but the intent is to keep it clear and explicit.

// Feature gate defaults
ScaleToZeroEnabled: getEnvBool("SCALE_TO_ZERO_ENABLED", false),

// EPP defaults

Collaborator:

It would be cleaner (IMO) to create "stacks" (InferencePool, VA, ModelService, etc.) when running the tests, so that this logic is not duplicated. Eventually the e2e tests may deploy helm releases, one per stack. This is beyond the scope of this PR, just something to keep in mind.

Collaborator, Author:

Yes, good suggestion.

asm582 (Collaborator) left a comment:

/lgtm

We need to evaluate the use of hermetic tests and remove redundant tests in the subsequent PRs.

mamy-CS (Collaborator, Author) commented Feb 19, 2026

Thanks for the review, merging this PR to unblock other work. I kept a note of the relevant comments and will address them in a subsequent PR.

mamy-CS merged commit f8d74f2 into llm-d:main on Feb 19, 2026
46 checks passed
3 participants