Skip to content

fix(aws): propagate architecture in cluster mode#666

Closed
ArangoGutierrez wants to merge 1 commit intoNVIDIA:mainfrom
ArangoGutierrez:fix/arm64-cluster-arch-propagation
Closed

fix(aws): propagate architecture in cluster mode#666
ArangoGutierrez wants to merge 1 commit intoNVIDIA:mainfrom
ArangoGutierrez:fix/arm64-cluster-arch-propagation

Conversation

@ArangoGutierrez
Copy link
Collaborator

Summary

  • Pass Image.Architecture to resolveImageForNode in cluster mode (was always empty → x86_64)
  • Add documentation for arm64 instance type alternatives

Test plan

  • Existing cluster tests provide coverage
  • CI validation pending

Pass Image.Architecture to resolveImageForNode instead of empty string.
Previously, cluster mode always defaulted to x86_64 regardless of
the user's image architecture specification.

Also add documentation for arm64 instance type alternatives in the
API type comments.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings February 14, 2026 17:57
@coveralls
Copy link

Pull Request Test Coverage Report for Build 22021791240

Details

  • 0 of 6 (0.0%) changed or added relevant lines in 1 file are covered.
  • 35 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 47.856%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/provider/aws/cluster.go 0 6 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/provider/aws/cluster.go 35 0.6%
Totals Coverage Status
Change from base Build 21999063489: 0.0%
Covered Lines: 2567
Relevant Lines: 5364

💛 - Coveralls

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to ensure AWS cluster-mode AMI selection respects the configured image architecture (e.g., arm64) and updates API docs to point users to arm64 instance-type alternatives.

Changes:

  • Update cluster instance creation to pass an architecture value into resolveImageForNode.
  • Update CRD field comments for control-plane and worker instance types with arm64 alternatives.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
pkg/provider/aws/cluster.go Attempts to propagate image architecture into AMI resolution for cluster node pools.
api/holodeck/v1alpha1/types.go Adds documentation notes about arm64 (Graviton) instance-type alternatives for cluster nodes.

Comment on lines +399 to +402
// Determine architecture from image spec
var arch string
if image != nil && image.Architecture != "" {
arch = image.Architecture
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cluster mode this still won’t propagate a globally configured architecture (p.Spec.Image.Architecture) when the node-pool image is nil, because arch remains empty and resolveImageForNode defaults to x86_64. If the intent is to support arm64 clusters without requiring cluster.*.image.architecture on every pool, use p.Spec.Image.Architecture as the fallback (and let per-pool image.Architecture override), or move that fallback logic into resolveImageForNode itself. As written, the new arch variable is also redundant because resolveImageForNode already consults image.Architecture when arch is empty.

Suggested change
// Determine architecture from image spec
var arch string
if image != nil && image.Architecture != "" {
arch = image.Architecture
// Determine architecture, preferring per-pool image over global spec
var arch string
if image != nil && image.Architecture != "" {
arch = image.Architecture
} else if p.Spec.Image != nil && p.Spec.Image.Architecture != "" {
arch = p.Spec.Image.Architecture

Copilot uses AI. Check for mistakes.

// InstanceType specifies the EC2 instance type for control-plane nodes.
// Default is "m5.xlarge" (x86_64). For arm64, use Graviton types
// (e.g., "m7g.xlarge", "c7g.xlarge").
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new doc suggests switching to arm64 instance types (e.g., m7g/c7g, g5g) but doesn’t mention that users also need to set the AMI architecture (e.g., image.architecture: arm64) so AMI resolution matches the instance type. Without that, instance creation can fail due to an AMI/instance architecture mismatch.

Suggested change
// (e.g., "m7g.xlarge", "c7g.xlarge").
// (e.g., "m7g.xlarge", "c7g.xlarge"). When using arm64 instance types,
// ensure the AMI architecture also matches arm64 (for example, by
// setting `image.architecture: arm64` when specifying an explicit image)
// so that AMI resolution matches the instance type.

Copilot uses AI. Check for mistakes.
Comment on lines 237 to 241
// InstanceType specifies the EC2 instance type for worker nodes.
// For GPU workloads, use GPU instance types (g4dn, p4d, etc.).
// Default is "g4dn.xlarge" (x86_64). For arm64 GPU workloads,
// use "g5g.xlarge" or similar Graviton GPU instances.
// +kubebuilder:default="g4dn.xlarge"
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc adds arm64 GPU instance guidance (g5g), but it should also call out that the worker pool’s AMI architecture must be set to arm64 (via image.architecture) for OS-based AMI resolution to select an arm64 AMI; otherwise you can end up with an x86_64 AMI on an arm64 instance type.

Copilot uses AI. Check for mistakes.
@ArangoGutierrez
Copy link
Collaborator Author

Closing as superseded. The equivalent fixes were already merged into main via PRs #661-664:

Additionally, these fixes address downstream provisioning issues but do not resolve the actual EC2 RunInstances failure (Unsupported: The requested configuration is currently not supported) reported in https://github.com/NVIDIA/gpu-driver-container/actions/runs/22012665274/job/63611032634. The root cause is missing architecture inference from instance type — when image.architecture is unset, holodeck defaults to x86_64 regardless of the instance type. A new PR will address this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants