test(e2e): add ARM64 GPU end-to-end test on merge to main by ArangoGutierrez · Pull Request #670 · NVIDIA/holodeck

ArangoGutierrez · 2026-02-15T10:36:05Z

Summary

- Add E2E test for full GPU stack on ARM64 (g5g.xlarge Graviton2 + T4g GPU)
- Architecture intentionally omitted from config to validate that instance-type inference (fix(aws): infer AMI architecture from instance type for arm64 support #669) works end-to-end
- ARM64 test gated to run only on merge to main (not on PRs) since g5g instances are more expensive

Changes

| File | Description |
|------|-------------|
| `tests/data/test_aws_arm64.yml` | ARM64 test config: g5g.xlarge in us-east-1, no explicit architecture |
| `tests/aws_test.go` | New `arm64`-labeled Ginkgo table entry |
| `.github/workflows/e2e.yaml` | `e2e-test-arm64` job with `if: github.ref == 'refs/heads/main'` gate |

Motivation

Prior to #669, holodeck had zero ARM64 E2E coverage. The arch-inference fix was validated by unit tests only. This test ensures the full provisioning path — AMI resolution, instance creation, driver install, CTK, Docker, and Kubernetes — works correctly on ARM64 hardware.

Cost Control

The g5g.xlarge test runs only after merge, avoiding per-PR costs. The periodic cleanup workflow already covers us-east-1.

Test plan

- `go vet ./tests/...` is clean
- Test config matches existing patterns (auth, runtime, driver, k8s)
- Workflow condition `github.ref == 'refs/heads/main'` skips PR runs
- ARM64 E2E passes on first merge to main

Add an E2E test that exercises the full GPU stack (driver, CTK, Docker, Kubernetes) on an ARM64 g5g.xlarge instance (Graviton2 + T4g GPU).

The test intentionally omits image.architecture to validate that the architecture inference from instance type (added in NVIDIA#669) works end-to-end in production. The g5g instance type is arm64-only, so holodeck must infer arm64 and resolve the correct AMI automatically.

This test only runs on merge to main (not on PRs) since g5g instances are more expensive than the standard x86_64 test fleet. The periodic cleanup workflow already covers us-east-1 where g5g is available.

Changes:
- tests/data/test_aws_arm64.yml: g5g.xlarge config, no explicit arch
- tests/aws_test.go: new "arm64" labeled test entry
- .github/workflows/e2e.yaml: e2e-test-arm64 job gated on main

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
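To illustrate the kind of inference the commit message describes, here is a minimal shell sketch that maps an instance type to an architecture by its family prefix. It is not the implementation from #669, which this page does not show. The function name `infer_arch_from_instance_type` and the short family list are assumptions made for illustration, and the list is deliberately incomplete.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: infer the CPU architecture from an EC2 instance type
# by looking at the family prefix (the part before the first dot).
# The family list below is a small illustrative sample of Graviton-based
# families, not an exhaustive table; anything else falls back to x86_64.
infer_arch_from_instance_type() {
    local family="${1%%.*}"
    case "$family" in
        g5g|t4g|m6g|c6g|r6g|m7g|c7g|r7g) echo "arm64" ;;
        *)                               echo "x86_64" ;;
    esac
}

echo "g5g.xlarge -> $(infer_arch_from_instance_type g5g.xlarge)"
```

A static table like this goes stale as new families appear. A more robust approach would likely query the EC2 DescribeInstanceTypes API, which reports the supported architectures for each instance type.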

coveralls · 2026-02-15T10:39:34Z

Pull Request Test Coverage Report for Build 22036329921

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 48.234%

Totals
Change from base Build 22025328511:	0.0%
Covered Lines:	2609
Relevant Lines:	5409

💛 - Coveralls

Copilot

Pull request overview

This PR adds end-to-end test coverage for ARM64 GPU instances (g5g.xlarge with Graviton2 + T4g GPU), validating the architecture inference feature introduced in PR #669. The test is strategically gated to run only on merge to main branch to control costs associated with the more expensive ARM64 GPU instances. The test intentionally omits the image.architecture field to exercise the automatic inference from instance type.

Changes:

- Add ARM64 GPU E2E test configuration for g5g.xlarge instances in us-east-1
- Integrate test into Ginkgo test suite with "arm64" label for selective execution
- Add dedicated workflow job with conditional execution based on branch

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| tests/data/test_aws_arm64.yml | New ARM64 test config with g5g.xlarge, architecture intentionally omitted to validate inference |
| tests/aws_test.go | New Ginkgo table entry for ARM64 test with "arm64" label |
| .github/workflows/e2e.yaml | New e2e-test-arm64 job gated to main branch with cost control conditional |

Copilot · 2026-02-15T10:39:37Z

tests/data/test_aws_arm64.yml

+    privateKey: /home/runner/.cache/key
+  instance:
+    type: g5g.xlarge
+    region: us-east-1


The region is set to 'us-east-1', but all other single-instance E2E test configurations in the tests/data directory use 'us-west-1' (test_aws.yml, test_aws_dra.yml, test_aws_kernel.yml, test_aws_legacy.yml, test_aws_ctk_git.yml, test_aws_k8s_git.yml, test_aws_k8s_kind_git.yml, test_aws_k8s_latest.yml). Using a different region creates inconsistency and could lead to regional quota issues or cleanup problems if the periodic cleanup workflow only targets specific regions. Consider changing to 'us-west-1' to maintain consistency with existing E2E tests.

Suggested change

-    region: us-east-1
+    region: us-west-1

Copilot · 2026-02-15T10:39:38Z

.github/workflows/e2e.yaml

+        make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64'"
+


The ARM64 E2E test job is missing an artifact upload step that exists in the main e2e-test job. The e2e-test job includes an "Archive Ginkgo logs" step (lines 68-73) that uploads ginkgo.json artifacts with 15-day retention. This step should be added after the "Run ARM64 GPU e2e test" step to maintain consistency and ensure test results are preserved for debugging. Note that the test run command in line 107 doesn't generate ginkgo.json (no --json-report flag), so you would either need to add the flag to generate the artifact or adjust the artifact upload to capture different logs.

Suggested change

-        make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64'"
+        make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64' --json-report=${LOG_ARTIFACT_DIR}/ginkgo.json"
+    - name: Archive Ginkgo logs
+      if: always()
+      uses: actions/upload-artifact@v4
+      with:
+        name: e2e-ginkgo-logs-arm64
+        path: e2e_logs/ginkgo.json
+        retention-days: 15
+        if-no-files-found: ignore

The Docker package-install template hardcoded CRI_DOCKERD_ARCH="amd64", causing an x86_64 binary to be downloaded on arm64 hosts. This results in "Exec format error" when systemd tries to start cri-docker.service.

Replace the hardcoded value with runtime detection using uname -m, the same pattern already used by the git-source install path in the same template and by all other templates (containerd, kubernetes, CRI-O).

Validated manually: full ARM64 stack (g5g.xlarge, NVIDIA T4G driver 575.57.08, Docker 29.2.1, cri-dockerd arm64 binary, CTK 1.18.2, Kubernetes v1.33.3) provisioned successfully.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
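The runtime detection described above can be sketched roughly as follows. This is an illustration of the `uname -m` pattern, not the template code from this PR. The helper name `map_machine_to_arch` and the exact set of accepted machine names are assumptions; only `CRI_DOCKERD_ARCH` comes from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: map a kernel machine name (as printed by `uname -m`)
# to the architecture suffix commonly used in release artifact names.
map_machine_to_arch() {
    case "$1" in
        x86_64)        echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *)
            # Fail loudly rather than silently downloading the wrong binary.
            echo "unsupported machine type: $1" >&2
            return 1
            ;;
    esac
}

# Detect the architecture at runtime instead of hardcoding "amd64".
CRI_DOCKERD_ARCH="$(map_machine_to_arch "$(uname -m)")"
echo "cri-dockerd architecture: ${CRI_DOCKERD_ARCH}"
```

Returning an error for unknown machine names makes a mismatch show up at install time, rather than later as an "Exec format error" from systemd.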

CRI-O migrated from pkgs.k8s.io/addons:/cri-o to the independent download.opensuse.org/repositories/isv:/cri-o repository. The old URL returns 403, breaking all CRI-O installations.

Additionally, when no version is specified, the template produced a malformed URL with an empty version component. Now defaults to v1.33 and normalizes the version to vX.Y format.

Reference: https://github.com/cri-o/packaging#readme

Signed-off-by: Eduardo Arango <earango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
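As a quick way to see what the defaulting and vX.Y normalization do to different inputs, the logic can be wrapped in a small function and run against sample values. The function name `normalize_crio_version` is hypothetical and used only for this sketch. The default of v1.33 and the strip-then-truncate steps follow the commit message and the template diff.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper around the normalization steps: fall back to v1.33
# when the input is empty, drop a leading "v" if present, keep only
# major.minor, then re-add the "v" prefix.
normalize_crio_version() {
    local version="${1:-v1.33}"
    version="${version#v}"
    echo "v$(echo "$version" | cut -d. -f1,2)"
}

# Show how a few representative inputs are normalized.
for input in "" "1.33" "v1.32.4" "v1.31"; do
    echo "'${input}' -> $(normalize_crio_version "$input")"
done
```

With these inputs, the empty string and `1.33` both become `v1.33`, `v1.32.4` becomes `v1.32`, and `v1.31` is unchanged, so the repository URL always gets a non-empty vX.Y component.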

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Copilot · 2026-02-15T13:22:24Z

pkg/provisioner/templates/crio.go

+# Default to latest stable CRI-O if no version specified
+if [[ -z "$CRIO_VERSION" ]]; then
+    CRIO_VERSION="v1.33"
+    holodeck_log "INFO" "$COMPONENT" "No version specified, defaulting to ${CRIO_VERSION}"
+fi
+
+# Ensure version starts with 'v' and is in vX.Y format (strip patch if present)
+CRIO_VERSION="${CRIO_VERSION#v}"
+CRIO_VERSION="v$(echo "$CRIO_VERSION" | cut -d. -f1,2)"
+
+# CRI-O migrated from pkgs.k8s.io to download.opensuse.org
+# See: https://github.com/cri-o/packaging#readme
+CRIO_REPO_URL="https://download.opensuse.org/repositories/isv:/cri-o:/stable:/${CRIO_VERSION}"
+
 # Add CRI-O repo (idempotent)
 if [[ ! -f /etc/apt/keyrings/cri-o-apt-keyring.gpg ]]; then
    sudo mkdir -p /etc/apt/keyrings
    holodeck_retry 3 "$COMPONENT" curl -fsSL \
-        "https://pkgs.k8s.io/addons:/cri-o:/stable:/${CRIO_VERSION}/deb/Release.key" | \
+        "${CRIO_REPO_URL}/deb/Release.key" | \
        sudo gpg --dearmor -o /etc/apt/keyrings/cri-o-apt-keyring.gpg
 else
    holodeck_log "INFO" "$COMPONENT" "CRI-O GPG key already present"
 fi

 if [[ ! -f /etc/apt/sources.list.d/cri-o.list ]]; then
-    echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/stable:/${CRIO_VERSION}/deb/ /" | \
+    echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] ${CRIO_REPO_URL}/deb/ /" | \
        sudo tee /etc/apt/sources.list.d/cri-o.list > /dev/null
 else
    holodeck_log "INFO" "$COMPONENT" "CRI-O repository already configured"


The CRI-O repository migration changes (switching from pkgs.k8s.io to download.opensuse.org) and version defaulting logic are not mentioned in the PR description and appear unrelated to ARM64 GPU testing. The ARM64 test configuration uses Docker, not CRI-O.

While these changes appear to be fixing a legitimate issue with CRI-O repository availability, they should either:

- Be documented in the PR description explaining why they're included
- Be split into a separate PR focused on CRI-O repository migration

Including unrelated changes makes it harder to review, understand the scope of changes, and potentially revert specific functionality if issues arise.

shivakunv

LGTM

Copilot AI review requested due to automatic review settings February 15, 2026 10:36

Copilot started reviewing on behalf of ArangoGutierrez February 15, 2026 10:36

Copilot AI reviewed Feb 15, 2026

ArangoGutierrez added 2 commits February 15, 2026 13:38

ArangoGutierrez requested a review from Copilot February 15, 2026 13:17

Copilot started reviewing on behalf of ArangoGutierrez February 15, 2026 13:18

Copilot AI reviewed Feb 15, 2026

shivakunv approved these changes Feb 16, 2026

ArangoGutierrez merged commit 8347359 into NVIDIA:main Feb 16, 2026
26 checks passed

ArangoGutierrez merged 3 commits into NVIDIA:main from ArangoGutierrez:e2e-arm64


4 participants

