Auto-MNNVL: Add autoMNNVL e2e tests and cluster setup scripts #421

Open
shmuel-runai wants to merge 6 commits into ai-dynamo:main from shmuel-runai:RUN-36134/mnnvl-e2e-2

shmuel-runai (Contributor) commented Feb 10, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Auto-MNNVL: Add autoMNNVL e2e tests and cluster setup scripts

Introduce end-to-end tests for the autoMNNVL feature, covering all 4
configurations:
- supported + enabled
- supported + disabled
- unsupported + enabled
- unsupported + disabled

Each test suite validates operator behavior (annotation mutation, ComputeDomain lifecycle, resourceClaim injection,
annotation immutability) under its specific cluster configuration.

Add Python and shell scripts under hack/e2e-autoMNNVL/ to automate
k3d cluster creation, fake GPU operator installation, Grove operator
deployment, and test execution across all configurations.
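The orchestration described above can be pictured as one ordered plan per cell of the supported/unsupported x enabled/disabled matrix. The following is a minimal dry-run sketch, not the actual code under hack/e2e-autoMNNVL/: `MNNVLConfig`, `all_configs`, and `plan_steps` are hypothetical names, the step labels are illustrative strings rather than real commands, and it assumes that "supported" means the fake GPU operator (and with it the ComputeDomain CRD) is installed, while "enabled" is a Grove operator setting.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MNNVLConfig:
    """One cell of the supported/unsupported x enabled/disabled matrix (hypothetical)."""
    supported: bool  # ASSUMPTION: "supported" == fake GPU operator / ComputeDomain CRD present
    enabled: bool    # ASSUMPTION: "enabled" == autoMNNVL switched on in the Grove operator

    @property
    def name(self) -> str:
        support = "supported" if self.supported else "unsupported"
        state = "enabled" if self.enabled else "disabled"
        return f"{support}-{state}"


def all_configs() -> list[MNNVLConfig]:
    """All four configurations, in the same order as the list above."""
    return [MNNVLConfig(s, e) for s, e in product((True, False), repeat=2)]


def plan_steps(cfg: MNNVLConfig) -> list[str]:
    """Hypothetical dry-run plan; labels are illustrative, not real commands."""
    steps = ["create k3d cluster"]
    if cfg.supported:
        # ASSUMPTION: only the "supported" configurations get the fake GPU operator.
        steps.append("install fake GPU operator")
    steps.append(f"deploy Grove operator (autoMNNVL {'on' if cfg.enabled else 'off'})")
    steps.append(f"run e2e suite {cfg.name}")
    return steps


if __name__ == "__main__":
    for cfg in all_configs():
        print(cfg.name, "->", plan_steps(cfg))
```

Keeping the plan as plain data makes it easy to print or unit-test without a cluster; a real runner would map each label to the corresponding k3d, helm, or kubectl invocation.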

Auto-MNNVL: Add CI workflow and Makefile targets for autoMNNVL e2e tests

Add a dedicated e2e-mnnvl job to the GitHub Actions workflow that runs
all 4 autoMNNVL configurations (supported/unsupported x enabled/disabled)
in CI, parallel to the existing e2e matrix jobs. The job uses the same
trigger conditions and self-hosted runner as the other e2e tests.

Add Makefile targets (run-e2e-mnnvl-full, e2e-mnnvl-cluster-down) to
invoke the autoMNNVL test orchestration scripts from hack/e2e-autoMNNVL/.

Which issue(s) this PR fixes:

refs #270

Special notes for your reviewer:

Does this PR introduce an API change?

NONE

copy-pr-bot bot commented Feb 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shmuel-runai self-assigned this Feb 10, 2026
shmuel-runai force-pushed the RUN-36134/mnnvl-e2e-2 branch 3 times, most recently from 30b9e3b to f7bf382 on February 15, 2026 10:54
Build()
}

// deletePCS deletes a PCS by name

All of the functions from here till the end of the file are not specific to MNNVL and can be moved to a common testutil file, outside of the auto-mnnvl package.


// testNoMNNVLArtifactsWhenDisabled verifies that no ComputeDomain is created and no
// resourceClaims are injected, even for GPU PCS when the feature is disabled.
func testNoMNNVLArtifactsWhenDisabled(t *testing.T, tc testContext) {

What makes this a 'shared' func? I see it only used in 'unsupported and disabled'.

run_quiet(f"helm uninstall {FAKE_GPU_RELEASE} -n {FAKE_GPU_NAMESPACE}")
run_quiet("kubectl delete runtimeclass nvidia")
run_quiet(f"kubectl delete crd {COMPUTE_DOMAIN_CRD}")
log_success("Fake GPU operator removed")

Consider adding code that actually ensures the CRD was removed.
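A minimal sketch of what such a check could look like, assuming the scripts keep shelling out to kubectl: rather than logging success immediately after `kubectl delete crd`, poll until the CRD is no longer returned. `crd_exists` and `wait_until_gone` are hypothetical helpers, not part of the PR, and the existence check assumes `kubectl get crd <name>` exits non-zero once the CRD is gone. `kubectl wait --for=delete crd/<name> --timeout=60s` is a built-in alternative worth considering.

```python
import subprocess
import time
from typing import Callable


def crd_exists(name: str) -> bool:
    """Hypothetical helper: True if `kubectl get crd <name>` exits 0.

    Caveat: a non-zero exit can also mean kubectl could not reach the
    cluster; a stricter version would check stderr for "NotFound".
    """
    result = subprocess.run(
        ["kubectl", "get", "crd", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_gone(exists: Callable[[], bool], timeout: float = 60.0,
                    interval: float = 2.0) -> bool:
    """Poll `exists` until it returns False or `timeout` seconds elapse.

    Returns True if the resource disappeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the removal function shown above, the check would sit between the `kubectl delete crd` call and `log_success(...)`, raising an error when `wait_until_gone(lambda: crd_exists(COMPUTE_DOMAIN_CRD))` returns False.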
