Skip to content

test(e2e): add ARM64 GPU end-to-end test on merge to main#670

Merged
ArangoGutierrez merged 3 commits intoNVIDIA:mainfrom
ArangoGutierrez:e2e-arm64
Feb 16, 2026
Merged

test(e2e): add ARM64 GPU end-to-end test on merge to main#670
ArangoGutierrez merged 3 commits intoNVIDIA:mainfrom
ArangoGutierrez:e2e-arm64

Conversation

@ArangoGutierrez
Copy link
Collaborator

Summary

Changes

File Description
tests/data/test_aws_arm64.yml ARM64 test config: g5g.xlarge in us-east-1, no explicit architecture
tests/aws_test.go New arm64-labeled Ginkgo table entry
.github/workflows/e2e.yaml e2e-test-arm64 job with if: github.ref == 'refs/heads/main' gate

Motivation

Prior to #669, holodeck had zero ARM64 E2E coverage. The arch-inference fix was validated by unit tests only. This test ensures the full provisioning path — AMI resolution, instance creation, driver install, CTK, Docker, and Kubernetes — works correctly on ARM64 hardware.

Cost Control

The g5g.xlarge test runs only after merge, avoiding per-PR costs. The periodic cleanup workflow already covers us-east-1.

Test plan

  • go vet ./tests/... — clean
  • Test config matches existing patterns (auth, runtime, driver, k8s)
  • Workflow condition github.ref == 'refs/heads/main' skips PR runs
  • ARM64 E2E passes on first merge to main

Add an E2E test that exercises the full GPU stack (driver, CTK, Docker,
Kubernetes) on an ARM64 g5g.xlarge instance (Graviton2 + T4g GPU).

The test intentionally omits image.architecture to validate that the
architecture inference from instance type (added in NVIDIA#669) works
end-to-end in production. The g5g instance type is arm64-only, so
holodeck must infer arm64 and resolve the correct AMI automatically.

This test only runs on merge to main (not on PRs) since g5g instances
are more expensive than the standard x86_64 test fleet. The periodic
cleanup workflow already covers us-east-1 where g5g is available.

Changes:
- tests/data/test_aws_arm64.yml: g5g.xlarge config, no explicit arch
- tests/aws_test.go: new "arm64" labeled test entry
- .github/workflows/e2e.yaml: e2e-test-arm64 job gated on main

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings February 15, 2026 10:36
@coveralls
Copy link

coveralls commented Feb 15, 2026

Pull Request Test Coverage Report for Build 22036329921

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 48.234%

Totals Coverage Status
Change from base Build 22025328511: 0.0%
Covered Lines: 2609
Relevant Lines: 5409

💛 - Coveralls

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end test coverage for ARM64 GPU instances (g5g.xlarge with Graviton2 + T4g GPU), validating the architecture inference feature introduced in PR #669. The test is strategically gated to run only on merge to main branch to control costs associated with the more expensive ARM64 GPU instances. The test intentionally omits the image.architecture field to exercise the automatic inference from instance type.

Changes:

  • Add ARM64 GPU E2E test configuration for g5g.xlarge instances in us-east-1
  • Integrate test into Ginkgo test suite with "arm64" label for selective execution
  • Add dedicated workflow job with conditional execution based on branch

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/data/test_aws_arm64.yml New ARM64 test config with g5g.xlarge, architecture intentionally omitted to validate inference
tests/aws_test.go New Ginkgo table entry for ARM64 test with "arm64" label
.github/workflows/e2e.yaml New e2e-test-arm64 job gated to main branch with cost control conditional

privateKey: /home/runner/.cache/key
instance:
type: g5g.xlarge
region: us-east-1
Copy link

Copilot AI Feb 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The region is set to 'us-east-1', but all other single-instance E2E test configurations in the tests/data directory use 'us-west-1' (test_aws.yml, test_aws_dra.yml, test_aws_kernel.yml, test_aws_legacy.yml, test_aws_ctk_git.yml, test_aws_k8s_git.yml, test_aws_k8s_kind_git.yml, test_aws_k8s_latest.yml). Using a different region creates inconsistency and could lead to regional quota issues or cleanup problems if the periodic cleanup workflow only targets specific regions. Consider changing to 'us-west-1' to maintain consistency with existing E2E tests.

Suggested change
region: us-east-1
region: us-west-1

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +108
make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64'"

Copy link

Copilot AI Feb 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ARM64 E2E test job is missing an artifact upload step that exists in the main e2e-test job. The e2e-test job includes an "Archive Ginkgo logs" step (lines 68-73) that uploads ginkgo.json artifacts with 15-day retention. This step should be added after the "Run ARM64 GPU e2e test" step to maintain consistency and ensure test results are preserved for debugging. Note that the test run command in line 107 doesn't generate ginkgo.json (no --json-report flag), so you would either need to add the flag to generate the artifact or adjust the artifact upload to capture different logs.

Suggested change
make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64'"
make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64' --json-report=${LOG_ARTIFACT_DIR}/ginkgo.json"
- name: Archive Ginkgo logs
if: always()
uses: actions/upload-artifact@v4
with:
name: e2e-ginkgo-logs-arm64
path: e2e_logs/ginkgo.json
retention-days: 15
if-no-files-found: ignore

Copilot uses AI. Check for mistakes.
The Docker package-install template hardcoded CRI_DOCKERD_ARCH="amd64",
causing an x86_64 binary to be downloaded on arm64 hosts. This results
in "Exec format error" when systemd tries to start cri-docker.service.

Replace the hardcoded value with runtime detection using uname -m, the
same pattern already used by the git-source install path in the same
template and by all other templates (containerd, kubernetes, CRI-O).

Validated manually: full ARM64 stack (g5g.xlarge, NVIDIA T4G driver
575.57.08, Docker 29.2.1, cri-dockerd arm64 binary, CTK 1.18.2,
Kubernetes v1.33.3) provisioned successfully.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
CRI-O migrated from pkgs.k8s.io/addons:/cri-o to the independent
download.opensuse.org/repositories/isv:/cri-o repository. The old URL
returns 403, breaking all CRI-O installations.

Additionally, when no version is specified, the template produced a
malformed URL with an empty version component. Now defaults to v1.33
and normalizes the version to vX.Y format.

Reference: https://github.com/cri-o/packaging#readme
Signed-off-by: Eduardo Arango <earango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment on lines +62 to 90
# Default to latest stable CRI-O if no version specified
if [[ -z "$CRIO_VERSION" ]]; then
CRIO_VERSION="v1.33"
holodeck_log "INFO" "$COMPONENT" "No version specified, defaulting to ${CRIO_VERSION}"
fi

# Ensure version starts with 'v' and is in vX.Y format (strip patch if present)
CRIO_VERSION="${CRIO_VERSION#v}"
CRIO_VERSION="v$(echo "$CRIO_VERSION" | cut -d. -f1,2)"

# CRI-O migrated from pkgs.k8s.io to download.opensuse.org
# See: https://github.com/cri-o/packaging#readme
CRIO_REPO_URL="https://download.opensuse.org/repositories/isv:/cri-o:/stable:/${CRIO_VERSION}"

# Add CRI-O repo (idempotent)
if [[ ! -f /etc/apt/keyrings/cri-o-apt-keyring.gpg ]]; then
sudo mkdir -p /etc/apt/keyrings
holodeck_retry 3 "$COMPONENT" curl -fsSL \
"https://pkgs.k8s.io/addons:/cri-o:/stable:/${CRIO_VERSION}/deb/Release.key" | \
"${CRIO_REPO_URL}/deb/Release.key" | \
sudo gpg --dearmor -o /etc/apt/keyrings/cri-o-apt-keyring.gpg
else
holodeck_log "INFO" "$COMPONENT" "CRI-O GPG key already present"
fi

if [[ ! -f /etc/apt/sources.list.d/cri-o.list ]]; then
echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/stable:/${CRIO_VERSION}/deb/ /" | \
echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] ${CRIO_REPO_URL}/deb/ /" | \
sudo tee /etc/apt/sources.list.d/cri-o.list > /dev/null
else
holodeck_log "INFO" "$COMPONENT" "CRI-O repository already configured"
Copy link

Copilot AI Feb 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CRI-O repository migration changes (switching from pkgs.k8s.io to download.opensuse.org) and version defaulting logic are not mentioned in the PR description and appear unrelated to ARM64 GPU testing. The ARM64 test configuration uses Docker, not CRI-O.

While these changes appear to be fixing a legitimate issue with CRI-O repository availability, they should either:

  1. Be documented in the PR description explaining why they're included
  2. Be split into a separate PR focused on CRI-O repository migration

Including unrelated changes makes it harder to review, understand the scope of changes, and potentially revert specific functionality if issues arise.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@shivakunv shivakunv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ArangoGutierrez ArangoGutierrez merged commit 8347359 into NVIDIA:main Feb 16, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants