test(e2e): add ARM64 GPU end-to-end test on merge to main#670
test(e2e): add ARM64 GPU end-to-end test on merge to main#670ArangoGutierrez merged 3 commits intoNVIDIA:mainfrom
Conversation
Add an E2E test that exercises the full GPU stack (driver, CTK, Docker, Kubernetes) on an ARM64 g5g.xlarge instance (Graviton2 + T4g GPU). The test intentionally omits image.architecture to validate that the architecture inference from instance type (added in NVIDIA#669) works end-to-end in production. The g5g instance type is arm64-only, so holodeck must infer arm64 and resolve the correct AMI automatically. This test only runs on merge to main (not on PRs) since g5g instances are more expensive than the standard x86_64 test fleet. The periodic cleanup workflow already covers us-east-1 where g5g is available. Changes: - tests/data/test_aws_arm64.yml: g5g.xlarge config, no explicit arch - tests/aws_test.go: new "arm64" labeled test entry - .github/workflows/e2e.yaml: e2e-test-arm64 job gated on main Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull Request Test Coverage Report for Build 22036329921Details
💛 - Coveralls |
There was a problem hiding this comment.
Pull request overview
This PR adds end-to-end test coverage for ARM64 GPU instances (g5g.xlarge with Graviton2 + T4g GPU), validating the architecture inference feature introduced in PR #669. The test is strategically gated to run only on merge to main branch to control costs associated with the more expensive ARM64 GPU instances. The test intentionally omits the image.architecture field to exercise the automatic inference from instance type.
Changes:
- Add ARM64 GPU E2E test configuration for g5g.xlarge instances in us-east-1
- Integrate test into Ginkgo test suite with "arm64" label for selective execution
- Add dedicated workflow job with conditional execution based on branch
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/data/test_aws_arm64.yml | New ARM64 test config with g5g.xlarge, architecture intentionally omitted to validate inference |
| tests/aws_test.go | New Ginkgo table entry for ARM64 test with "arm64" label |
| .github/workflows/e2e.yaml | New e2e-test-arm64 job gated to main branch with cost control conditional |
| privateKey: /home/runner/.cache/key | ||
| instance: | ||
| type: g5g.xlarge | ||
| region: us-east-1 |
There was a problem hiding this comment.
The region is set to 'us-east-1', but all other single-instance E2E test configurations in the tests/data directory use 'us-west-1' (test_aws.yml, test_aws_dra.yml, test_aws_kernel.yml, test_aws_legacy.yml, test_aws_ctk_git.yml, test_aws_k8s_git.yml, test_aws_k8s_kind_git.yml, test_aws_k8s_latest.yml). Using a different region creates inconsistency and could lead to regional quota issues or cleanup problems if the periodic cleanup workflow only targets specific regions. Consider changing to 'us-west-1' to maintain consistency with existing E2E tests.
| region: us-east-1 | |
| region: us-west-1 |
| make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64'" | ||
|
|
There was a problem hiding this comment.
The ARM64 E2E test job is missing an artifact upload step that exists in the main e2e-test job. The e2e-test job includes an "Archive Ginkgo logs" step (lines 68-73) that uploads ginkgo.json artifacts with 15-day retention. This step should be added after the "Run ARM64 GPU e2e test" step to maintain consistency and ensure test results are preserved for debugging. Note that the test run command in line 107 doesn't generate ginkgo.json (no --json-report flag), so you would either need to add the flag to generate the artifact or adjust the artifact upload to capture different logs.
| make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64'" | |
| make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64' --json-report=${LOG_ARTIFACT_DIR}/ginkgo.json" | |
| - name: Archive Ginkgo logs | |
| if: always() | |
| uses: actions/upload-artifact@v4 | |
| with: | |
| name: e2e-ginkgo-logs-arm64 | |
| path: e2e_logs/ginkgo.json | |
| retention-days: 15 | |
| if-no-files-found: ignore |
The Docker package-install template hardcoded CRI_DOCKERD_ARCH="amd64", causing an x86_64 binary to be downloaded on arm64 hosts. This results in "Exec format error" when systemd tries to start cri-docker.service. Replace the hardcoded value with runtime detection using uname -m, the same pattern already used by the git-source install path in the same template and by all other templates (containerd, kubernetes, CRI-O). Validated manually: full ARM64 stack (g5g.xlarge, NVIDIA T4G driver 575.57.08, Docker 29.2.1, cri-dockerd arm64 binary, CTK 1.18.2, Kubernetes v1.33.3) provisioned successfully. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
CRI-O migrated from pkgs.k8s.io/addons:/cri-o to the independent download.opensuse.org/repositories/isv:/cri-o repository. The old URL returns 403, breaking all CRI-O installations. Additionally, when no version is specified, the template produced a malformed URL with an empty version component. Now defaults to v1.33 and normalizes the version to vX.Y format. Reference: https://github.com/cri-o/packaging#readme Signed-off-by: Eduardo Arango <earango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
| # Default to latest stable CRI-O if no version specified | ||
| if [[ -z "$CRIO_VERSION" ]]; then | ||
| CRIO_VERSION="v1.33" | ||
| holodeck_log "INFO" "$COMPONENT" "No version specified, defaulting to ${CRIO_VERSION}" | ||
| fi | ||
|
|
||
| # Ensure version starts with 'v' and is in vX.Y format (strip patch if present) | ||
| CRIO_VERSION="${CRIO_VERSION#v}" | ||
| CRIO_VERSION="v$(echo "$CRIO_VERSION" | cut -d. -f1,2)" | ||
|
|
||
| # CRI-O migrated from pkgs.k8s.io to download.opensuse.org | ||
| # See: https://github.com/cri-o/packaging#readme | ||
| CRIO_REPO_URL="https://download.opensuse.org/repositories/isv:/cri-o:/stable:/${CRIO_VERSION}" | ||
|
|
||
| # Add CRI-O repo (idempotent) | ||
| if [[ ! -f /etc/apt/keyrings/cri-o-apt-keyring.gpg ]]; then | ||
| sudo mkdir -p /etc/apt/keyrings | ||
| holodeck_retry 3 "$COMPONENT" curl -fsSL \ | ||
| "https://pkgs.k8s.io/addons:/cri-o:/stable:/${CRIO_VERSION}/deb/Release.key" | \ | ||
| "${CRIO_REPO_URL}/deb/Release.key" | \ | ||
| sudo gpg --dearmor -o /etc/apt/keyrings/cri-o-apt-keyring.gpg | ||
| else | ||
| holodeck_log "INFO" "$COMPONENT" "CRI-O GPG key already present" | ||
| fi | ||
|
|
||
| if [[ ! -f /etc/apt/sources.list.d/cri-o.list ]]; then | ||
| echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/stable:/${CRIO_VERSION}/deb/ /" | \ | ||
| echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] ${CRIO_REPO_URL}/deb/ /" | \ | ||
| sudo tee /etc/apt/sources.list.d/cri-o.list > /dev/null | ||
| else | ||
| holodeck_log "INFO" "$COMPONENT" "CRI-O repository already configured" |
There was a problem hiding this comment.
The CRI-O repository migration changes (switching from pkgs.k8s.io to download.opensuse.org) and version defaulting logic are not mentioned in the PR description and appear unrelated to ARM64 GPU testing. The ARM64 test configuration uses Docker, not CRI-O.
While these changes appear to be fixing a legitimate issue with CRI-O repository availability, they should either:
- Be documented in the PR description explaining why they're included
- Be split into a separate PR focused on CRI-O repository migration
Including unrelated changes makes it harder to review, understand the scope of changes, and potentially revert specific functionality if issues arise.
Summary
g5g.xlargeGraviton2 + T4g GPU)Changes
tests/data/test_aws_arm64.ymltests/aws_test.goarm64-labeled Ginkgo table entry.github/workflows/e2e.yamle2e-test-arm64job withif: github.ref == 'refs/heads/main'gateMotivation
Prior to #669, holodeck had zero ARM64 E2E coverage. The arch-inference fix was validated by unit tests only. This test ensures the full provisioning path — AMI resolution, instance creation, driver install, CTK, Docker, and Kubernetes — works correctly on ARM64 hardware.
Cost Control
The
g5g.xlargetest runs only after merge, avoiding per-PR costs. The periodic cleanup workflow already coversus-east-1.Test plan
go vet ./tests/...— cleangithub.ref == 'refs/heads/main'skips PR runs