-
Notifications
You must be signed in to change notification settings - Fork 576
Support AMD SEV-SNP on AWS #2424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,13 @@ type AWSMachineProviderConfig struct { | |
AMI AWSResourceReference `json:"ami"` | ||
// instanceType is the type of instance to create. Example: m4.xlarge | ||
InstanceType string `json:"instanceType"` | ||
// cpuOptions defines CPU-related settings for the instance, including the confidential computing policy. | ||
// When omitted, this means no opinion and the AWS platform is left to choose a reasonable default. | ||
// More info: | ||
// https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CpuOptionsRequest.html, | ||
// https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html | ||
// +optional | ||
CPUOptions *CPUOptions `json:"cpuOptions,omitempty,omitzero"` | ||
// tags is the set of tags to add to apply to an instance, in addition to the ones | ||
fangge1212 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// added by default by the actuator. These tags are additive. The actuator will ensure | ||
// these tags are present, but will not remove any other tags that may exist on the | ||
|
@@ -109,6 +116,37 @@ type AWSMachineProviderConfig struct { | |
MarketType MarketType `json:"marketType,omitempty"` | ||
} | ||
|
||
// AWSConfidentialComputePolicy represents the confidential compute configuration for the instance. | ||
// +kubebuilder:validation:Enum=Disabled;AMDEncryptedVirtualizationNestedPaging | ||
type AWSConfidentialComputePolicy string | ||
|
||
const ( | ||
// AWSConfidentialComputePolicyDisabled disables confidential computing for the instance. | ||
AWSConfidentialComputePolicyDisabled AWSConfidentialComputePolicy = "Disabled" | ||
// AWSConfidentialComputePolicySEVSNP enables AMD SEV-SNP as the confidential computing technology for the instance. | ||
AWSConfidentialComputePolicySEVSNP AWSConfidentialComputePolicy = "AMDEncryptedVirtualizationNestedPaging" | ||
) | ||
|
||
// CPUOptions defines CPU-related settings for the instance, including the confidential computing policy. | ||
// If provided, it must not be empty — at least one field must be set. | ||
// +kubebuilder:validation:MinProperties=1 | ||
type CPUOptions struct { | ||
fangge1212 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// confidentialCompute specifies whether confidential computing should be enabled for the instance, | ||
// and, if so, which confidential computing technology to use. | ||
// Valid values are: Disabled, AMDEncryptedVirtualizationNestedPaging and omitted. | ||
// When set to Disabled, confidential computing will be disabled for the instance. | ||
// When set to AMDEncryptedVirtualizationNestedPaging, AMD SEV-SNP will be used as the confidential computing technology for the instance. | ||
// In this case, ensure the following conditions are met: | ||
// 1) The selected instance type supports AMD SEV-SNP. | ||
// 2) The selected AWS region supports AMD SEV-SNP. | ||
// 3) The selected AMI supports AMD SEV-SNP. | ||
// More details can be checked at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sev-snp.html | ||
Comment on lines
+139
to
+143
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I forgot to ask this originally, but what happens if the user specifies AMD SEV-SNP on an instance type, region, or AMI that does not support this? Is there some feedback loop that enables a user to recover from this bad state? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. AWS will return error messages in such situations.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do those get returned to end users? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In cluster-api-provider-aws, errors from the EC2 RunInstances API (such as InvalidParameterCombination for unsupported instance types, AMIs, or regions) are caught and wrapped before being returned. These errors propagate up through the reconciliation loop until they surface in the AWSMachine (or higher-level resource) status conditions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not as familiar with the CAPI work, so I want to make sure I clarify here - this means we should see a status condition populated in the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm also not familiar with how it works, I did a little research. The end user can see the error message in the terminal or install log. Here is the flow of returning an error to the end user when the cluster creation fails is like this:
This sets the The corresponding
This function inspects the Machine status and decides whether it’s ready, still provisioning, or failed.
So if the Machine enters the Failed phase, this function returns an error — a Go error that directly includes the AWS failure reason (e.g., “machine failed to provision: InvalidParameterValue: Instance type m5.x99 not supported”).
The installer then exits with a non-zero status, and the user sees this message in their terminal or install.log. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Gotcha. So regardless the installer will always return the failure. @JoelSpeed this is relatively new territory for me as I haven't reviewed a lot of these machine API changes - is there a preference as to failing at configuration read time over during cluster bootstrap? Do you know if we clean up any resources we may or may not have created if we fail bootstrapping? What is the impact to an end-user if we fail during cluster bootstrapping versus validation of the configuration? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Where possible yes, it is preferred to validate at config time, but with instance type capabilities, that often means fetching information from the cloud provider which may or may not be publicly exposed via an API, so we have to be best effort here.
The installer can clean up all resources when an install fails yes
We spun up most of a cluster and wasted ~30 minutes to 2 hours of their time. Generally early validation is obvious, quick, and easy to read. Where when the cluster fails in this way, it's generally not very obvious why. The installer will report that the Machine API cluster operator isn't available, but it will also report other COs as unavailable, so it may not be immediately obvious to folks that it was this config that got them stuck here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the explanation, that helps me put this into perspective a bit better.
Looks like AWS has an API for instance type discovery based on features outlined in https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTypes.html - seems useful for determining what we would need to here. Is there any history of the installer using these types of APIs for earlier validation? I don't think it would be worth blocking this PR on unless we already have some precedent set here to use those provider APIs, but I would be curious if there is a trivial enough way for us to let users know that this config was what got them stuck. |
||
// When omitted, this means no opinion and the AWS platform is left to choose a reasonable default, | ||
// which is subject to change without notice. The current default is Disabled. | ||
// +optional | ||
ConfidentialCompute *AWSConfidentialComputePolicy `json:"confidentialCompute,omitempty"` | ||
} | ||
|
||
// BlockDeviceMappingSpec describes a block device mapping | ||
type BlockDeviceMappingSpec struct { | ||
// The device name exposed to the machine (for example, /dev/sdh or xvdh). | ||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if this field is not specified by a user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If unset, no CPU options are passed to the AWS platform and AWS default values are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know what the default values are currently on AWS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we don't have concrete defaults we can point to, it would be nice to include guidance on how an end-user could identify what the defaults for their configuration would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CpuOptions in AWS consists of three fields:
Refer to: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html
In this PR, only amdSevSnp is exposed to users. I'm not entirely sure how best to describe this in the API documentation — should we include a link to the AWS documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the link to AWS website