fix(aws): propagate architecture in cluster mode #666
ArangoGutierrez wants to merge 1 commit into NVIDIA:main from
Pass Image.Architecture to resolveImageForNode instead of an empty string. Previously, cluster mode always defaulted to x86_64 regardless of the user's image architecture specification.

Also add documentation for arm64 instance type alternatives in the API type comments.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull request overview
This PR aims to ensure AWS cluster-mode AMI selection respects the configured image architecture (e.g., arm64) and updates API docs to point users to arm64 instance-type alternatives.
Changes:
- Update cluster instance creation to pass an architecture value into resolveImageForNode.
- Update CRD field comments for control-plane and worker instance types with arm64 alternatives.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/provider/aws/cluster.go | Attempts to propagate image architecture into AMI resolution for cluster node pools. |
| api/holodeck/v1alpha1/types.go | Adds documentation notes about arm64 (Graviton) instance-type alternatives for cluster nodes. |
    // Determine architecture from image spec
    var arch string
    if image != nil && image.Architecture != "" {
        arch = image.Architecture
In cluster mode this still won’t propagate a globally configured architecture (p.Spec.Image.Architecture) when the node-pool image is nil, because arch remains empty and resolveImageForNode defaults to x86_64. If the intent is to support arm64 clusters without requiring cluster.*.image.architecture on every pool, use p.Spec.Image.Architecture as the fallback (and let per-pool image.Architecture override), or move that fallback logic into resolveImageForNode itself. As written, the new arch variable is also redundant because resolveImageForNode already consults image.Architecture when arch is empty.
Suggested change:

    - // Determine architecture from image spec
    - var arch string
    - if image != nil && image.Architecture != "" {
    -     arch = image.Architecture
    + // Determine architecture, preferring per-pool image over global spec
    + var arch string
    + if image != nil && image.Architecture != "" {
    +     arch = image.Architecture
    + } else if p.Spec.Image != nil && p.Spec.Image.Architecture != "" {
    +     arch = p.Spec.Image.Architecture
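
The precedence the reviewer describes (per-pool image first, then the global spec, then the x86_64 default) can be sketched as a small standalone function. This is only an illustration under stated assumptions: the `Image` struct below is a hypothetical stand-in that models just the `Architecture` field, `resolveArch` is a name invented for this sketch, and in the real code the final default is applied inside resolveImageForNode rather than in a helper like this one.

```go
package main

import "fmt"

// Image is a hypothetical stand-in for the image spec type. Only the
// Architecture field discussed in the review is modeled here.
type Image struct {
	Architecture string
}

// defaultArch mirrors the x86_64 default described in the review comment.
const defaultArch = "x86_64"

// resolveArch (hypothetical helper) returns the architecture for a node
// pool: the per-pool image wins, then the global image, then the default.
func resolveArch(pool, global *Image) string {
	if pool != nil && pool.Architecture != "" {
		return pool.Architecture
	}
	if global != nil && global.Architecture != "" {
		return global.Architecture
	}
	return defaultArch
}

func main() {
	global := &Image{Architecture: "arm64"}

	// Pool image is nil, so the global architecture is used.
	fmt.Println(resolveArch(nil, global)) // arm64

	// A per-pool architecture overrides the global one.
	fmt.Println(resolveArch(&Image{Architecture: "x86_64"}, global)) // x86_64

	// Nothing is configured, so the default applies.
	fmt.Println(resolveArch(nil, nil)) // x86_64
}
```

Keeping the fallback chain in one function would also address the reviewer's alternative of moving the logic into resolveImageForNode itself, since every caller would then get the same precedence.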
    // InstanceType specifies the EC2 instance type for control-plane nodes.
    // Default is "m5.xlarge" (x86_64). For arm64, use Graviton types
    // (e.g., "m7g.xlarge", "c7g.xlarge").
The new doc suggests switching to arm64 instance types (e.g., m7g/c7g, g5g) but doesn’t mention that users also need to set the AMI architecture (e.g., image.architecture: arm64) so AMI resolution matches the instance type. Without that, instance creation can fail due to an AMI/instance architecture mismatch.
Suggested change:

    - // (e.g., "m7g.xlarge", "c7g.xlarge").
    + // (e.g., "m7g.xlarge", "c7g.xlarge"). When using arm64 instance types,
    + // ensure the AMI architecture also matches arm64 (for example, by
    + // setting `image.architecture: arm64` when specifying an explicit image)
    + // so that AMI resolution matches the instance type.
    // InstanceType specifies the EC2 instance type for worker nodes.
    // For GPU workloads, use GPU instance types (g4dn, p4d, etc.).
    // Default is "g4dn.xlarge" (x86_64). For arm64 GPU workloads,
    // use "g5g.xlarge" or similar Graviton GPU instances.
    // +kubebuilder:default="g4dn.xlarge"
The doc adds arm64 GPU instance guidance (g5g), but it should also call out that the worker pool’s AMI architecture must be set to arm64 (via image.architecture) for OS-based AMI resolution to select an arm64 AMI; otherwise you can end up with an x86_64 AMI on an arm64 instance type.
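
One way to catch the mismatch described in the two comments above before any EC2 call is a small pre-flight check. The sketch below is not part of the PR: `familyArch` and `checkArch` are hypothetical names, the table covers only the instance families named in the doc comments, and an empty image architecture is treated as x86_64 to mirror the default the review describes.

```go
package main

import (
	"fmt"
	"strings"
)

// familyArch (hypothetical) maps the instance families named in the doc
// comments to their CPU architecture. It is deliberately not exhaustive.
var familyArch = map[string]string{
	"m5":   "x86_64",
	"g4dn": "x86_64",
	"p4d":  "x86_64",
	"m7g":  "arm64",
	"c7g":  "arm64",
	"g5g":  "arm64",
}

// checkArch (hypothetical) reports an error when the image architecture
// does not match the instance type's family. Unknown families are not
// checked, and an empty imageArch is treated as the x86_64 default.
func checkArch(instanceType, imageArch string) error {
	family, _, found := strings.Cut(instanceType, ".")
	if !found {
		return fmt.Errorf("malformed instance type %q", instanceType)
	}
	want, known := familyArch[family]
	if !known {
		return nil // family not in the table: nothing to verify
	}
	if imageArch == "" {
		imageArch = "x86_64"
	}
	if imageArch != want {
		return fmt.Errorf("instance type %q is %s but image architecture is %s",
			instanceType, want, imageArch)
	}
	return nil
}

func main() {
	// arm64 instance type with no image architecture set: mismatch.
	fmt.Println(checkArch("g5g.xlarge", ""))

	// arm64 instance type with image.architecture set to arm64: no error.
	fmt.Println(checkArch("g5g.xlarge", "arm64"))
}
```

A check like this only surfaces the problem earlier. It does not replace documenting that `image.architecture` must be set alongside an arm64 instance type.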
Closing as superseded. The equivalent fixes were already merged into main via PRs #661-664:
Additionally, these fixes address downstream provisioning issues but do not resolve the actual EC2
Summary
Test plan