fix(templates): dynamic CUDA repo arch for arm64 support #667
ArangoGutierrez wants to merge 1 commit into NVIDIA:main
Replace hardcoded x86_64 in the CUDA repository URL with runtime architecture detection via uname -m. Maps aarch64 to "sbsa", which is the NVIDIA convention for arm64 server CUDA repositories.

Previously, arm64 nodes would fail to download the CUDA keyring package because the x86_64 repo URL does not exist for arm64.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull request overview
This PR adds dynamic architecture detection for NVIDIA CUDA repository URLs to enable arm64 (aarch64) support. Previously, the CUDA repository URL was hardcoded to use x86_64, preventing the NVIDIA driver template from working on arm64 systems. The change implements runtime architecture detection using uname -m and maps aarch64 to sbsa (NVIDIA's naming convention for arm64 server CUDA repositories).
Changes:
- Replace hardcoded x86_64 in CUDA repository URL with runtime-detected architecture
- Add aarch64→sbsa mapping to match NVIDIA's arm64 server repository naming
- Add comprehensive unit test to verify architecture detection logic
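
The detection and mapping described above can be sketched in shell roughly as follows. This is a minimal illustration, not the template's actual code: the function names `cuda_repo_arch` and `cuda_repo_url`, the `ubuntu2204` distro string, and the exact URL layout are assumptions made for the example.

```bash
# Sketch only: function names, distro string, and URL layout are assumptions,
# not values taken from the PR.

# Map a kernel machine name (as printed by `uname -m`) to the CUDA repo arch.
# With no argument, the current machine is detected at runtime.
cuda_repo_arch() {
    machine="${1:-$(uname -m)}"
    if [ "$machine" = "aarch64" ]; then
        # NVIDIA's arm64 server repositories are published under "sbsa".
        echo "sbsa"
    else
        # Any other value (for example x86_64) is passed through unchanged.
        echo "$machine"
    fi
}

# Build the repository base URL for a given distro and machine name.
cuda_repo_url() {
    distro="$1"
    arch="$(cuda_repo_arch "$2")"
    echo "https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}"
}

# Usage: print the repo URL for the machine this runs on.
cuda_repo_url "ubuntu2204" "$(uname -m)"
```

Taking the machine name as an argument, rather than calling `uname -m` inside the mapping, makes the logic easy to exercise for both architectures on a single host.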
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/provisioner/templates/nv-driver.go | Implements dynamic CUDA repo architecture detection with aarch64→sbsa mapping |
| pkg/provisioner/templates/nv-driver_test.go | Adds test verifying runtime arch detection and absence of hardcoded x86_64 |
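
The test file in the PR is written in Go and its contents are not shown here. Purely as an illustration of the two properties the table describes (runtime detection is present, no hardcoded x86_64 path remains), a shell version of the same check might look like this. The helper name `check_rendered_script` and the `repos/<distro>/x86_64` pattern are assumptions for the sketch.

```bash
# Hypothetical helper (not from the PR): reads a rendered script on stdin and
# succeeds only if it detects the arch at runtime and hardcodes no x86_64 path.
check_rendered_script() {
    rendered="$(cat)"
    # The script must call uname -m somewhere.
    printf '%s\n' "$rendered" | grep -q 'uname -m' || return 1
    # The script must not contain a literal repos/<distro>/x86_64 path segment.
    if printf '%s\n' "$rendered" | grep -q 'repos/[^/]*/x86_64'; then
        return 1
    fi
    return 0
}

# Usage: feed it a tiny stand-in for the rendered template.
sample='CUDA_ARCH="$(uname -m)"
echo "repos/ubuntu2204/${CUDA_ARCH}/"'
if printf '%s\n' "$sample" | check_rendered_script; then
    echo "ok: arch is detected at runtime"
fi
```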

    if [[ "$CUDA_ARCH" == "aarch64" ]]; then
      CUDA_ARCH="sbsa"
    fi

The architecture detection logic only handles x86_64 (implicitly) and aarch64→sbsa mapping, but doesn't provide a fallback or error handling for unsupported architectures. Other templates in the codebase (e.g., container-toolkit.go:243-253) use case statements with explicit error handling for unsupported architectures. Consider adding an else clause to handle unexpected architecture values or at least add a comment explaining that x86_64 is used as-is.

Suggested change:

    - if [[ "$CUDA_ARCH" == "aarch64" ]]; then
    -   CUDA_ARCH="sbsa"
    - fi
    + case "$CUDA_ARCH" in
    +   aarch64)
    +     CUDA_ARCH="sbsa"
    +     ;;
    +   x86_64)
    +     # use x86_64 as-is
    +     ;;
    +   *)
    +     holodeck_log "ERROR" "$COMPONENT" "unsupported architecture for CUDA repository: $CUDA_ARCH"
    +     exit 1
    +     ;;
    + esac

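
To show how the error path in the suggestion behaves, here is a small sketch that wraps the same `case` statement in a function and returns a non-zero status instead of exiting. The name `resolve_cuda_arch` is hypothetical, and a plain `echo` to stderr stands in for the project's `holodeck_log` call, whose definition is not shown in this PR.

```bash
# Hypothetical helper: resolve_cuda_arch is not a name from the PR.
# Prints the CUDA repo arch for a machine name, or fails for unknown values.
resolve_cuda_arch() {
    case "$1" in
        aarch64)
            echo "sbsa"
            ;;
        x86_64)
            # x86_64 is used as-is.
            echo "x86_64"
            ;;
        *)
            # Stand-in for the project's holodeck_log call.
            echo "unsupported architecture for CUDA repository: $1" >&2
            return 1
            ;;
    esac
}

# Usage: the assignment's exit status is that of the command substitution,
# so an unsupported machine takes the else branch.
if CUDA_ARCH="$(resolve_cuda_arch "$(uname -m)")"; then
    echo "CUDA repo arch: $CUDA_ARCH"
else
    echo "aborting: no CUDA repository for this machine" >&2
fi
```

Returning a status rather than calling `exit 1` inside the function leaves the decision to abort with the caller, which also makes the unsupported branch testable.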
Closing as superseded. The equivalent fixes were already merged into main via PRs #661-664:
Additionally, these fixes address downstream provisioning issues but do not resolve the actual EC2 |
Summary
- Replace hardcoded x86_64 in CUDA repository URL with runtime uname -m detection
- Map aarch64→sbsa (NVIDIA's arm64 server CUDA repo convention)

Test plan
- go test ./pkg/provisioner/templates/... -v passes