Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 38 additions & 0 deletions machine/v1beta1/types_awsprovider.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,13 @@ type AWSMachineProviderConfig struct {
AMI AWSResourceReference `json:"ami"`
// instanceType is the type of instance to create. Example: m4.xlarge
InstanceType string `json:"instanceType"`
// cpuOptions defines CPU-related settings for the instance, including the confidential computing policy.
// When omitted, this means no opinion and the AWS platform is left to choose a reasonable default.
// More info:
// https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CpuOptionsRequest.html,
// https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html
// +optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this field is not specified by a user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If unset, no CPU options are passed to the AWS platform and AWS default values are used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know what the default values are currently on AWS?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we don't have concrete defaults we can point to, it would be nice to include guidance on how an end-user could identify what the defaults for their configuration would be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CpuOptions in AWS consists of three fields:

In this PR, only amdSevSnp is exposed to users. I'm not entirely sure how best to describe this in the API documentation — should we include a link to the AWS documentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the link to AWS website

CPUOptions *CPUOptions `json:"cpuOptions,omitempty,omitzero"`
// tags is the set of tags to add to apply to an instance, in addition to the ones
// added by default by the actuator. These tags are additive. The actuator will ensure
// these tags are present, but will not remove any other tags that may exist on the
Expand Down Expand Up @@ -109,6 +116,37 @@ type AWSMachineProviderConfig struct {
MarketType MarketType `json:"marketType,omitempty"`
}

// AWSConfidentialComputePolicy represents the confidential compute configuration for the instance.
// +kubebuilder:validation:Enum=Disabled;AMDEncryptedVirtualizationNestedPaging
type AWSConfidentialComputePolicy string

const (
// AWSConfidentialComputePolicyDisabled disables confidential computing for the instance.
AWSConfidentialComputePolicyDisabled AWSConfidentialComputePolicy = "Disabled"
// AWSConfidentialComputePolicySEVSNP enables AMD SEV-SNP as the confidential computing technology for the instance.
AWSConfidentialComputePolicySEVSNP AWSConfidentialComputePolicy = "AMDEncryptedVirtualizationNestedPaging"
)

// CPUOptions defines CPU-related settings for the instance, including the confidential computing policy.
// If provided, it must not be empty — at least one field must be set.
// +kubebuilder:validation:MinProperties=1
type CPUOptions struct {
// confidentialCompute specifies whether confidential computing should be enabled for the instance,
// and, if so, which confidential computing technology to use.
// Valid values are: Disabled, AMDEncryptedVirtualizationNestedPaging and omitted.
// When set to Disabled, confidential computing will be disabled for the instance.
// When set to AMDEncryptedVirtualizationNestedPaging, AMD SEV-SNP will be used as the confidential computing technology for the instance.
// In this case, ensure the following conditions are met:
// 1) The selected instance type supports AMD SEV-SNP.
// 2) The selected AWS region supports AMD SEV-SNP.
// 3) The selected AMI supports AMD SEV-SNP.
// More details can be checked at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sev-snp.html
Comment on lines +139 to +143
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to ask this originally, but what happens if the user specifies AMD SEV-SNP on an instance type, region, or AMI that does not support this?

Is there some feedback loop that enables a user to recover from this bad state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AWS will return error messages in such situations.

  • For invalid instance type
    api error InvalidParameterCombination: The specified instance type does not support AMD SEV-SNP. Specify a supported instance type and try again.
  • For invalid AMI
    api error InvalidParameterCombination: The specified AMI does not support a boot mode that is compatible with AMD SEV-SNP. Specify a compatible AMI and try again.
  • For invalid region
    (I haven't tried this yet)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do those get returned to end users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cluster-api-provider-aws, errors from the EC2 RunInstances API (such as InvalidParameterCombination for unsupported instance types, AMIs, or regions) are caught and wrapped before being returned. These errors propagate up through the reconciliation loop until they surface in the AWSMachine (or higher-level resource) status conditions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not as familiar with the CAPI work, so I want to make sure I clarify here - this means we should see a status condition populated in the Machine resource that user can clearly see what the problem is and make the necessary updates to that Machine resource to recover?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not familiar with how it works, I did a little research. The end user can see the error message in the terminal or install log. Here is the flow of returning an error to the end user when the cluster creation fails is like this:

  1. Installer → applies Cluster API manifests (Cluster, Machine, AWSMachine, etc.)

  2. CAPA controller (specifically awsmachine_controller.go) runs createInstance() via the Reconcile loop.

  3. If AWS returns an error (e.g., InvalidParameterValue: instance type not supported in region), it bubbles up:

instance, err = r.createInstance(...)
if err != nil {
    machineScope.Error(err, "unable to create instance")
    conditions.MarkFalse(machineScope.AWSMachine, infrav1.InstanceReadyCondition, infrav1.InstanceProvisionFailedReason, ...)
    return ctrl.Result{}, err
}

This sets the AWSMachine status condition InstanceReady=False with reason InstanceProvisionFailed.

The corresponding Machine object (in the Cluster API layer) reflects this failure in its status — its Phase becomes "Failed", and Status.FailureMessage is populated with the error message (e.g., “Invalid instance type”).

  1. When the OpenShift Installer monitors the cluster creation (in pkg/infrastructure/clusterapi/clusterapi.go), it repeatedly calls:

checkMachineReady(machine, requirePublicIP)

This function inspects the Machine status and decides whether it’s ready, still provisioning, or failed.

if machine.Status.Phase != "Provisioned" && machine.Status.Phase != "Running" {
    // Not ready yet, keep waiting
    return false, nil
} else if machine.Status.Phase == "Failed" {
    msg := ptr.Deref(machine.Status.FailureMessage, "machine.Status.FailureMessage was not set")
    return false, fmt.Errorf("machine %s failed to provision: %s", machine.Name, msg)
}

So if the Machine enters the Failed phase, this function returns an error — a Go error that directly includes the AWS failure reason (e.g., “machine failed to provision: InvalidParameterValue: Instance type m5.x99 not supported”).

  1. The installer monitors Machines until all are “ready”. If any checkMachineReady() returns an error, the installer stops waiting and reports the failure in the terminal output — for example:

level=error msg="Bootstrap failed to complete: machine master-0 failed to provision: Invalid instance type m5.x99"

The installer then exits with a non-zero status, and the user sees this message in their terminal or install.log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. So regardless the installer will always return the failure.

@JoelSpeed this is relatively new territory for me as I haven't reviewed a lot of these machine API changes - is there a preference as to failing at configuration read time over during cluster bootstrap? Do you know if we clean up any resources we may or may not have created if we fail bootstrapping? What is the impact to an end-user if we fail during cluster bootstrapping versus validation of the configuration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a preference as to failing at configuration read time over during cluster bootstrap?

Where possible yes, it is preferred to validate at config time, but with instance type capabilities, that often means fetching information from the cloud provider which may or may not be publicly exposed via an API, so we have to be best effort here.

Do you know if we clean up any resources we may or may not have created if we fail bootstrapping?

The installer can clean up all resources when an install fails yes

What is the impact to an end-user if we fail during cluster bootstrapping versus validation of the configuration?

We spun up most of a cluster and wasted ~30 minutes to 2 hours of their time. Generally early validation is obvious, quick, and easy to read. Where when the cluster fails in this way, it's generally not very obvious why. The installer will report that the Machine API cluster operator isn't available, but it will also report other COs as unavailable, so it may not be immediately obvious to folks that it was this config that got them stuck here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, that helps me put this into perspective a bit better.

but with instance type capabilities, that often means fetching information from the cloud provider which may or may not be publicly exposed via an API, so we have to be best effort here.

Looks like AWS has an API for instance type discovery based on features outlined in https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTypes.html - seems useful for determining what we would need to here. Is there any history of the installer using these types of APIs for earlier validation?

I don't think it would be worth blocking this PR on unless we already have some precedent set here to use those provider APIs, but I would be curious if there is a trivial enough way for us to let users know that this config was what got them stuck.

// When omitted, this means no opinion and the AWS platform is left to choose a reasonable default,
// which is subject to change without notice. The current default is Disabled.
// +optional
ConfidentialCompute *AWSConfidentialComputePolicy `json:"confidentialCompute,omitempty"`
}

// BlockDeviceMappingSpec describes a block device mapping
type BlockDeviceMappingSpec struct {
// The device name exposed to the machine (for example, /dev/sdh or xvdh).
Expand Down
26 changes: 26 additions & 0 deletions machine/v1beta1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 10 additions & 0 deletions machine/v1beta1/zz_generated.swagger_doc_generated.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

29 changes: 28 additions & 1 deletion openapi/generated_openapi/zz_generated.openapi.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

14 changes: 14 additions & 0 deletions openapi/openapi.json
Original file line number Diff line number Diff line change
Expand Up @@ -21960,6 +21960,10 @@
"type": "string",
"default": ""
},
"cpuOptions": {
"description": "cpuOptions defines CPU-related settings for the instance, including the confidential computing policy. When omitted, this means no opinion and the AWS platform is left to choose a reasonable default. More info: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CpuOptionsRequest.html, https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html",
"$ref": "#/definitions/com.github.openshift.api.machine.v1beta1.CPUOptions"
},
"credentialsSecret": {
"description": "credentialsSecret is a reference to the secret with AWS credentials. Otherwise, defaults to permissions provided by attached IAM role where the actuator is running.",
"$ref": "#/definitions/io.k8s.api.core.v1.LocalObjectReference"
Expand Down Expand Up @@ -22435,6 +22439,16 @@
}
}
},
"com.github.openshift.api.machine.v1beta1.CPUOptions": {
"description": "CPUOptions defines CPU-related settings for the instance, including the confidential computing policy. If provided, it must not be empty — at least one field must be set.",
"type": "object",
"properties": {
"confidentialCompute": {
"description": "confidentialCompute specifies whether confidential computing should be enabled for the instance, and, if so, which confidential computing technology to use. Valid values are: Disabled, AMDEncryptedVirtualizationNestedPaging and omitted. When set to Disabled, confidential computing will be disabled for the instance. When set to AMDEncryptedVirtualizationNestedPaging, AMD SEV-SNP will be used as the confidential computing technology for the instance. In this case, ensure the following conditions are met: 1) The selected instance type supports AMD SEV-SNP. 2) The selected AWS region supports AMD SEV-SNP. 3) The selected AMI supports AMD SEV-SNP. More details can be checked at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sev-snp.html When omitted, this means no opinion and the AWS platform is left to choose a reasonable default, which is subject to change without notice. The current default is Disabled.",
"type": "string"
}
}
},
"com.github.openshift.api.machine.v1beta1.Condition": {
"description": "Condition defines an observation of a Machine API resource operational state.",
"type": "object",
Expand Down