From 923691ef3316054a29e98a20961c62df9d7b2e81 Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Mon, 20 Oct 2025 12:13:10 +0200 Subject: [PATCH 01/13] initial version of kms encryption provider enhancement --- ...-encryption-provider-at-datastore-layer.md | 491 ++++++++++++++++++ 1 file changed, 491 insertions(+) create mode 100644 enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md new file mode 100644 index 0000000000..fcdcfab37c --- /dev/null +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -0,0 +1,491 @@ +--- +title: kms-encryption-provider-at-datastore-layer +authors: + - "@ardaguclu" + - "@dgrisonnet" + - "@flavianmissi" +reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect" + - "@ibihim" + - "@sjenning" + - "@tkashem" +approvers: # A single approver is preferred, the role of the approver is to raise important questions, help ensure the enhancement receives reviews from all applicable areas/SMEs, and determine when consensus is achieved such that the EP can move forward to implementation. Having multiple approvers makes it difficult to determine who is responsible for the actual approval. + - "@sjenning" +api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). 
If there is no API change, use "None"
+  - "@JoelSpeed"
+creation-date: 2025-10-17
+last-updated: yyyy-mm-dd
+tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
+  - "https://issues.redhat.com/browse/OCPSTRAT-108"
+  - "https://issues.redhat.com/browse/OCPSTRAT-1638"
+see-also:
+  - "enhancements/kube-apiserver/encrypting-data-at-datastore-layer.md"
+  - "enhancements/etcd/storage-migration-for-etcd-encryption.md"
+replaces:
+  - ""
+superseded-by:
+  - ""
+---
+
+# KMS Encryption Provider at Datastore Layer
+
+This is the title of the enhancement. Keep it simple and descriptive. A good
+title can help communicate what the enhancement is and should be considered as
+part of any review.
+
+The YAML `title` should be lowercased and spaces/punctuation should be
+replaced with `-`.
+
+The `Metadata` section above is intended to support the creation of tooling
+around the enhancement process.
+
+
+## Summary
+
+Provide a user-configurable interface to support encryption of data stored in
+etcd using a supported Key Management Service (KMS).
+
+## Motivation
+
+OpenShift supports AES encryption at the datastore layer using local keys.
+It protects against etcd data leaks in the event of an etcd backup compromise.
+However, `aescbc` and `aesgcm`, the encryption technologies available in
+OpenShift today, do not protect against online host compromise: in such cases,
+attackers can decrypt encrypted data from etcd using the local keys.
+KMS-managed keys protect against such scenarios.
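+
+For reference, local-key encryption is enabled today through the cluster-scoped
+`APIServer` configuration resource:
+
+```yaml
+# Existing local-key encryption: keys are generated and stored in-cluster.
+apiVersion: config.openshift.io/v1
+kind: APIServer
+metadata:
+  name: cluster
+spec:
+  encryption:
+    type: aescbc  # or aesgcm
+```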
+
+### User Stories
+
+* As a cluster admin, I want the APIServer config to be the single source of
+  etcd encryption configuration for my cluster, so that I can easily manage all
+  encryption-related configuration in a single place
+* As a cluster admin, I want the kas-operator to manage KMS plugin lifecycle on
+  my behalf, so that I don’t need to do any manual work when configuring KMS
+  etcd encryption for my cluster
+* As a cluster admin, I want to easily understand the operations done by CKASO
+  when managing the KMS plugin lifecycle via Conditions in the APIServer CR’s
+  Status
+* As a cluster admin, I want to be able to switch to a different KMS plugin,
+  e.g. from AWS to a pre-installed Vault, by performing a single configuration
+  change without needing to perform any other manual intervention
+  * TODO: confirm this requirement
+* As a cluster admin, I want to configure my chosen KMS to automatically rotate
+  encryption keys and have OpenShift automatically become aware of these new
+  keys, without any manual intervention
+* As a cluster admin, I want to know when anything goes wrong during key
+  rotation, so that I can manually take the necessary actions to fix the state
+  of the cluster
+
+### Goals
+
+* Users have an easy-to-use interface to configure KMS encryption
+* Users will configure OpenShift clusters to use a specific KMS key, created by
+  them
+* Encryption keys are managed by the KMS, and are not stored in the cluster
+* Encryption keys are rotated by the KMS, and the rotation configuration is
+  managed by the user
+* OpenShift clusters automatically detect KMS key rotation and react
+  appropriately
+* Users can disable encryption after enabling it
+* Configuring KMS encryption should not meaningfully degrade the performance of
+  the cluster
+* OpenShift will manage KMS plugins' lifecycle on behalf of the users
+* Provide users with the tools to monitor the state of KMS plugins and KMS
+  itself
+
+### Non-Goals
+
+* Support for users to control what
resources they want to encrypt
+* Support for OpenShift managed encryption keys in KMS
+* Direct support for hardware security modules (these might still be supported
+  via KMS plugins, e.g. Hashicorp Vault or Thales)
+* Full data recovery in cases where the KMS key is lost
+
+## Proposal
+
+To support KMS encryption in OpenShift, we will leverage the work
+that was done in [upstream Kubernetes](https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3299-kms-v2-improvements).
+However, we will need to extend and adapt the encryption workflow in OpenShift
+to support new constraints introduced by the externalization of encryption keys
+in a KMS. Because OpenShift will not own the keys from the KMS, we will also
+need to provide tools to the users to detect KMS-related failures and take
+action toward recovering their clusters whenever possible.
+
+We focus on supporting KMS v2 only, as KMS v1 has considerable performance
+impact in the cluster.
+
+We will extend the APIServer API to add a new `kms` encryption type alongside
+the existing `aescbc` and `aesgcm` types. Unlike `aescbc` and `aesgcm`, KMS will
+require additional input from users to configure their KMS provider, such as
+connection details, authentication credentials, and key references.
+
+From a UX perspective, these are the only changes the KMS feature will
+introduce. It is intentionally minimal to reduce the burden on users and
+the potential for errors.
+
+This feature will re-use as much of the existing encryption logic as possible,
+leveraging the existing encryption and migration workflow introduced for
+AES-CBC and AES-GCM. However, because encryption keys for KMS are managed
+externally rather than by the apiserver operators, we will extend the existing
+encryption controllers to support external key rotation detection and introduce
+a new controller to manage KMS plugin pod lifecycle.
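+
+As an illustrative sketch only (the field names below are hypothetical and
+subject to API review), enabling KMS encryption could look like:
+
+```yaml
+# Hypothetical shape of the new `kms` encryption type; not a final API.
+apiVersion: config.openshift.io/v1
+kind: APIServer
+metadata:
+  name: cluster
+spec:
+  encryption:
+    type: KMS
+    kms:
+      type: AWS
+      aws:
+        keyARN: arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab
+        region: us-west-2
+```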
+ +The existing encryption controllers will be extended to support KMS with minimal +changes. KMS plugin health checks will be integrated into the controller +precondition system, and encryption key management will be adapted to work with +externally-managed keys while maintaining feature parity with existing +encryption providers. + +### Workflow Description + +Explain how the user will use the feature. Be detailed and explicit. +Describe all of the actors, their roles, and the APIs or interfaces +involved. Define a starting state and then list the steps that the +user would need to go through to trigger the feature described in the +enhancement. Optionally add a +[mermaid](https://github.com/mermaid-js/mermaid#readme) sequence +diagram. + +Use sub-sections to explain variations, such as for error handling, +failure recovery, or alternative outcomes. + +For example: + +**cluster creator** is a human user responsible for deploying a +cluster. + +**application administrator** is a human user responsible for +deploying an application in a cluster. + +1. The cluster creator sits down at their keyboard... +2. ... +3. The cluster creator sees that their cluster is ready to receive + applications, and gives the application administrator their + credentials. + +See +https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md#high-level-end-to-end-workflow +and https://github.com/openshift/enhancements/blob/master/enhancements/agent-installer/automated-workflow-for-agent-based-installer.md for more detailed examples. + +### API Extensions + +API Extensions are CRDs, admission and conversion webhooks, aggregated API servers, +and finalizers, i.e. those mechanisms that change the OCP API surface and behaviour. + +- Name the API extensions this enhancement adds or modifies. 
+- Does this enhancement modify the behaviour of existing resources, especially those owned + by other parties than the authoring team (including upstream resources), and, if yes, how? + Please add those other parties as reviewers to the enhancement. + + Examples: + - Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running. + - Restricts the label format for objects to X. + - Defaults field Y on object kind Z. + +Fill in the operational impact of these API Extensions in the "Operational Aspects +of API Extensions" section. + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +Are there any unique considerations for making this change work with +Hypershift? + +See https://github.com/openshift/enhancements/blob/e044f84e9b2bafa600e6c24e35d226463c2308a5/enhancements/multi-arch/heterogeneous-architecture-clusters.md?plain=1#L282 + +How does it affect any of the components running in the +management cluster? How does it affect any components running split +between the management cluster and guest cluster? + +#### Standalone Clusters + +Is the change relevant for standalone clusters? + +#### Single-node Deployments or MicroShift + +How does this proposal affect the resource consumption of a +single-node OpenShift deployment (SNO), CPU and memory? + +How does this proposal affect MicroShift? For example, if the proposal +adds configuration options through API resources, should any of those +behaviors also be exposed to MicroShift admins through the +configuration file for MicroShift? + +### Implementation Details/Notes/Constraints + +#### Controller Preconditions and KMS Plugin Health + +Encryption controllers should only run when the KMS provider plugin is up and +running. All the encryption controllers take in a preconditionsFulfilled +function as a parameter. The controllers use this to decide whether they should +sync or not. 
We can leverage this existing mechanism to check if the KMS plugin
+is healthy, in addition to the existing checks.
+
+#### Encryption Key Secret Management for KMS
+
+The `keyController` will continue managing encryption key secrets as it does
+today. The difference is that for the KMS encryption provider, the encryption
+key secret contents will be empty. This secret must be empty because when the
+KMS provider is used, the root encryption key (KEK) is stored and managed by
+the KMS itself. We still want the encryption key secret to exist, even if
+empty, so that we can leverage functionality in the existing encryption
+controllers, thus having full feature parity between existing encryption
+providers and the new KMS encryption provider.
+
+### Risks and Mitigations
+
+What are the risks of this proposal and how do we mitigate them? Think
+broadly. For example, consider both security and how this will impact the
+larger OKD ecosystem.
+
+How will security be reviewed and by whom?
+
+How will UX be reviewed and by whom?
+
+Consider including folks that also work outside your immediate sub-project.
+
+### Drawbacks
+
+The idea is to find the best form of an argument why this enhancement should
+_not_ be implemented.
+
+What trade-offs (technical/efficiency cost, user experience, flexibility,
+supportability, etc) must be made in order to implement this? What are the
+reasons we might not want to undertake this proposal, and how do we overcome
+them?
+
+Does this proposal implement a behavior that's new/unique/novel? Is it poorly
+aligned with existing user expectations? Will it be a significant maintenance
+burden? Is it likely to be superseded by something else in the near future?
+
+## Alternatives (Not Implemented)
+
+Similar to the `Drawbacks` section, the `Alternatives` section is used
+to highlight and record other possible approaches to delivering the
+value proposed by an enhancement, including especially information
+about why the alternative was not selected.
+ +## Open Questions [optional] + +This is where to call out areas of the design that require closure before deciding +to implement the design. For instance, + > 1. This requires exposing previously private resources which contain sensitive + information. Can we do this? + +## Test Plan + +**Note:** *Section not required until targeted at a release.* + +Consider the following in developing a test plan for this enhancement: +- Will there be e2e and integration tests, in addition to unit tests? +- How will it be tested in isolation vs with other components? +- What additional testing is necessary to support managed OpenShift service-based offerings? + +No need to outline all of the test cases, just the general strategy. Anything +that would count as tricky in the implementation and anything particularly +challenging to test should be called out. + +All code is expected to have adequate tests (eventually with coverage +expectations). + +## Graduation Criteria + +**Note:** *Section not required until targeted at a release.* + +Define graduation milestones. + +These may be defined in terms of API maturity, or as something else. Initial proposal +should keep this high-level with a focus on what signals will be looked at to +determine graduation. + +Consider the following in developing the graduation criteria for this +enhancement: + +- Maturity levels + - [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels] + - `Dev Preview`, `Tech Preview`, `GA` in OpenShift +- [Deprecation policy][deprecation-policy] + +Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning), +or by redefining what graduation means. + +In general, we try to use the same stages (alpha, beta, GA), regardless how the functionality is accessed. 
+ +[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions +[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/ + +**If this is a user facing change requiring new or updated documentation in [openshift-docs](https://github.com/openshift/openshift-docs/), +please be sure to include in the graduation criteria.** + +**Examples**: These are generalized examples to consider, in addition +to the aforementioned [maturity levels][maturity-levels]. + +### Dev Preview -> Tech Preview + +- Ability to utilize the enhancement end to end +- End user documentation, relative API stability +- Sufficient test coverage +- Gather feedback from users rather than just developers +- Enumerate service level indicators (SLIs), expose SLIs as metrics +- Write symptoms-based alerts for the component(s) + +### Tech Preview -> GA + +- More testing (upgrade, downgrade, scale) +- Sufficient time for feedback +- Available by default +- Backhaul SLI telemetry +- Document SLOs for the component +- Conduct load testing +- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +**For non-optional features moving to GA, the graduation criteria must include +end to end tests.** + +### Removing a deprecated feature + +- Announce deprecation and support policy of the existing feature +- Deprecate the feature + +## Upgrade / Downgrade Strategy + +If applicable, how will the component be upgraded and downgraded? Make sure this +is in the test plan. + +Consider the following in developing an upgrade/downgrade strategy for this +enhancement: +- What changes (in invocations, configurations, API use, etc.) is an existing + cluster required to make on upgrade in order to keep previous behavior? +- What changes (in invocations, configurations, API use, etc.) is an existing + cluster required to make on upgrade in order to make use of the enhancement? 
+ +Upgrade expectations: +- Each component should remain available for user requests and + workloads during upgrades. Ensure the components leverage best practices in handling [voluntary + disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to + this should be identified and discussed here. +- Micro version upgrades - users should be able to skip forward versions within a + minor release stream without being required to pass through intermediate + versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1` + as an intermediate step. +- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade + steps. So, for example, it is acceptable to require a user running 4.3 to + upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step. +- While an upgrade is in progress, new component versions should + continue to operate correctly in concert with older component + versions (aka "version skew"). For example, if a node is down, and + an operator is rolling out a daemonset, the old and new daemonset + pods must continue to work correctly even while the cluster remains + in this partially upgraded state for some time. + +Downgrade expectations: +- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is + misbehaving, it should be possible for the user to rollback to `N`. It is + acceptable to require some documented manual steps in order to fully restore + the downgraded cluster to its previous state. Examples of acceptable steps + include: + - Deleting any CVO-managed resources added by the new version. The + CVO does not currently delete resources that no longer exist in + the target version. + +## Version Skew Strategy + +How will the component handle version skew with other components? +What are the guarantees? Make sure this is in the test plan. 
+ +Consider the following in developing a version skew strategy for this +enhancement: +- During an upgrade, we will always have skew among components, how will this impact your work? +- Does this enhancement involve coordinating behavior in the control plane and + in the kubelet? How does an n-2 kubelet without this feature available behave + when this feature is used? +- Will any other components on the node change? For example, changes to CSI, CRI + or CNI may require updating that component before the kubelet. + +## Operational Aspects of API Extensions + +Describe the impact of API extensions (mentioned in the proposal section, i.e. CRDs, +admission and conversion webhooks, aggregated API servers, finalizers) here in detail, +especially how they impact the OCP system architecture and operational aspects. + +- For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level + Indicators) an administrator or support can use to determine the health of the API extensions + + Examples (metrics, alerts, operator conditions) + - authentication-operator condition `APIServerDegraded=False` + - authentication-operator condition `APIServerAvailable=True` + - openshift-authentication/oauth-apiserver deployment and pods health + +- What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput, + API availability) + + Examples: + - Adds 1s to every pod update in the system, slowing down pod scheduling by 5s on average. + - Fails creation of ConfigMap in the system when the webhook is not available. + - Adds a dependency on the SDN service network for all resources, risking API availability in case + of SDN issues. + - Expected use-cases require less than 1000 instances of the CRD, not impacting + general API throughput. + +- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or + automatically in CI) and by whom (e.g. 
perf team; name the responsible person and let them review + this enhancement) + +- Describe the possible failure modes of the API extensions. +- Describe how a failure or behaviour of the extension will impact the overall cluster health + (e.g. which kube-controller-manager functionality will stop working), especially regarding + stability, availability, performance and security. +- Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes + and add them as reviewers to this enhancement. + +## Support Procedures + +Describe how to +- detect the failure modes in a support situation, describe possible symptoms (events, metrics, + alerts, which log output in which component) + + Examples: + - If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz". + - Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed". + - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")` + will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire. + +- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`) + + - What consequences does it have on the cluster health? + + Examples: + - Garbage collection in kube-controller-manager will stop working. + - Quota will be wrongly computed. + - Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data. + Disabling the conversion webhook will break garbage collection. + + - What consequences does it have on existing, running workloads? + + Examples: + - New namespaces won't get the finalizer "xyz" and hence might leak resource X + when deleted. + - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod + communication after some minutes. + + - What consequences does it have for newly created workloads? 
+ + Examples: + - New pods in namespace with Istio support will not get sidecars injected, breaking + their networking. + +- Does functionality fail gracefully and will work resume when re-enabled without risking + consistency? + + Examples: + - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence + will not block the creation or updates on objects when it fails. When the + webhook comes back online, there is a controller reconciling all objects, applying + labels that were not applied during admission webhook downtime. + - Namespaces deletion will not delete all objects in etcd, leading to zombie + objects when another namespace with the same name is created. + +## Infrastructure Needed [optional] + +Use this section if you need things from the project. Examples include a new +subproject, repos requested, github details, and/or testing infrastructure. From cd1f13cb1adb3070cf1efac1177da81d62b61797 Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Mon, 20 Oct 2025 14:25:32 +0200 Subject: [PATCH 02/13] start working on "Workflow Description" also split the Proposal into sub-sections. --- ...-encryption-provider-at-datastore-layer.md | 104 ++++++++++-------- 1 file changed, 61 insertions(+), 43 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index fcdcfab37c..503c782866 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -103,69 +103,73 @@ KMS managed keys protects against such scenarios. ## Proposal -To support KMS encryption in OpenShift, we will leverage the work -that was done in [upstream Kubernetes](https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3299-kms-v2-improvements). 
+To support KMS encryption in OpenShift, we will leverage the work done in +[upstream Kubernetes](https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3299-kms-v2-improvements). However, we will need to extend and adapt the encryption workflow in OpenShift to support new constraints introduced by the externalization of encryption keys in a KMS. Because OpenShift will not own the keys from the KMS, we will also -need to provide tools to the users to detect KMS-related failures and take +need to provide tools to users to detect KMS-related failures and take action toward recovering their clusters whenever possible. We focus on supporting KMS v2 only, as KMS v1 has considerable performance impact in the cluster. -We will extend the APIServer API to add a new `kms` encryption type alongside -the existing `aescbc` and `aesgcm` types. Unlike `aescbc` and `aesgcm`, KMS will -require additional input from users to configure their KMS provider, such as -connection details, authentication credentials, and key references. +#### API Extensions -From a UX perspective, these are the only changes the KMS feature will -introduce. It is intentionally minimal to reduce the burden on users and -the potential for errors. +We will extend the APIServer config to add a new `kms` encryption type alongside +the existing `aescbc` and `aesgcm` types. Unlike `aescbc` and `aesgcm`, KMS +will require additional input from users to configure their KMS provider, such +as connection details, authentication credentials, and key references. From a +UX perspective, this is the only change the KMS feature introduces—it is +intentionally minimal to reduce user burden and potential for errors. -This feature will re-use as much of the existing encryption logic as possible, -leveraging the existing encryption and migration workflow introduced for -AES-CBC and AES-GCM. 
However, because encryption keys for KMS are managed -externally rather than by the apiserver operators, we will extend the existing -encryption controllers to support external key rotation detection and introduce -a new controller to manage KMS plugin pod lifecycle. +#### Encryption Controller Extensions -The existing encryption controllers will be extended to support KMS with minimal -changes. KMS plugin health checks will be integrated into the controller -precondition system, and encryption key management will be adapted to work with -externally-managed keys while maintaining feature parity with existing -encryption providers. +This feature will reuse existing encryption and migration workflows while +extending them to handle externally-managed keys. We will introduce a new +controller to manage KMS plugin pod lifecycle and integrate KMS plugin health +checks into the existing controller precondition system. + +#### KMS Plugin Lifecycle + +KMS encryption requires KMS plugin pods to bridge communication between the +kube-apiserver and the external KMS. In OpenShift, the kube-apiserver-operator +will manage these plugins on behalf of users, reducing operational complexity +and ensuring consistent behavior across the platform. The operator will handle +plugin deployment, health monitoring, and lifecycle management during key +rotation events. ### Workflow Description -Explain how the user will use the feature. Be detailed and explicit. -Describe all of the actors, their roles, and the APIs or interfaces -involved. Define a starting state and then list the steps that the -user would need to go through to trigger the feature described in the -enhancement. Optionally add a -[mermaid](https://github.com/mermaid-js/mermaid#readme) sequence -diagram. +#### Roles -Use sub-sections to explain variations, such as for error handling, -failure recovery, or alternative outcomes. 
+**cluster admin** is a human user responsible for the overall configuration and
+maintenance of a cluster.
+
-For example:
+**KMS** the Key Management Service responsible automatic rotation of the Key
+Encryption Key (KEK).
-**cluster creator** is a human user responsible for deploying a
-cluster.
+
+#### Initial Resource Encryption
-**application administrator** is a human user responsible for
-deploying an application in a cluster.
+
+1. The cluster admin creates an encryption key (KEK) in their KMS of choice
+1. The cluster admin gives the OpenShift apiservers access to the newly created
+   KMS KEK
+1. The cluster admin updates the APIServer configuration resource, providing
+   the necessary configuration options for the KMS of choice
+1. The cluster admin observes the `kube-apiserver` `clusteroperator` resource,
+   for progress on the configuration, as well as migration of resources
-1. The cluster creator sits down at their keyboard...
-2. ...
-3. The cluster creator sees that their cluster is ready to receive
-   applications, and gives the application administrator their
-   credentials.
+
+#### Key rotation
-See
-https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md#high-level-end-to-end-workflow
-and https://github.com/openshift/enhancements/blob/master/enhancements/agent-installer/automated-workflow-for-agent-based-installer.md for more detailed examples.
+
+1. The cluster admin configures automatic periodic rotation of the KEK in KMS
+1. KMS rotates the KEK
+1. OpenShift detects the KEK has been rotated, and starts migrating encrypted
+   data to use the new KEK
+1.
The cluster admin eventually checks the `kube-apiserver` `clusteroperator` + resource, and sees that the KEK was rotated, and the status of the data + migration + +#### Change of KMS Provider ### API Extensions @@ -233,6 +237,20 @@ that we can leverage functionality in the existing encryption controllers, thus having full feature parity between existing encryption providers and the new KMS encryption provider. +#### Key Rotation Handling + +Keys can be rotated in the following ways: +* Automatic periodic key rotation by the KMS, following user provided rotation + policy in the KMS itself +* The user creates a new KMS key, and updates the KMS section of the APIServer + config with the new key + +OpenShift must detect the change and trigger re-encryption of affected +resources. + +TODO: elaborate details about the two rotation scenarios, +how detection works, migration process, etc. + ### Risks and Mitigations What are the risks of this proposal and how do we mitigate. Think broadly. For From aa6e146897b479c95202aa13719bac3ca3a1fe3b Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Mon, 20 Oct 2025 14:30:33 +0200 Subject: [PATCH 03/13] remove title section description --- .../kms-encryption-provider-at-datastore-layer.md | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index 503c782866..da62f72ebf 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -28,17 +28,6 @@ superseded-by: # KMS Encryption Provider at Datastore Layer -This is the title of the enhancement. Keep it simple and descriptive. A good -title can help communicate what the enhancement is and should be considered as -part of any review. - -The YAML `title` should be lowercased and spaces/punctuation should be -replaced with `-`. 
- -The `Metadata` section above is intended to support the creation of tooling -around the enhancement process. - - ## Summary Provide a user-configurable interface to support encryption of data stored in From 5c98a6a214dbe32f8e4bddaf4960d385851005bd Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Mon, 20 Oct 2025 15:49:46 +0200 Subject: [PATCH 04/13] start detailing plugin management during key rotation --- ...-encryption-provider-at-datastore-layer.md | 69 +++++++++++++++++-- 1 file changed, 64 insertions(+), 5 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index da62f72ebf..086239a0a6 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -135,7 +135,7 @@ rotation events. **cluster admin** is a human user responsible for the overall configuration and maintainenance of a cluster. -**KMS** the Key Management Service responsible automatic rotation of the Key +**KMS** is the Key Management Service responsible automatic rotation of the Key Encryption Key (KEK). #### Initial Resource Encryption @@ -146,7 +146,7 @@ Encryption Key (KEK). 1. The cluster admiin updates the APIServer configuration resource, providing the necessary configuration options for the KMS of choice 1. The cluster admin observes the `kube-apiserver` `clusteroperator` resource, - for progress on the configuration, as well as migration of resources + for progress on the configuration change, as well as migration of resources #### Key rotation @@ -160,6 +160,15 @@ Encryption Key (KEK). #### Change of KMS Provider +1. The cluster admin creates a KEK in a KMS different than the one currently + configured in the cluster +1. The cluster admin configures the new KMS provider in the APIServer + configuration resource +1. 
The cluster detects the encryption configuration change, and starts + migrating the encrypted data to use the new KMS encryption key +1. The cluster admin observes the `kube-apiserver` `clusteroperator` resource, + for progress on the configuration change, as well as migration of resources + ### API Extensions API Extensions are CRDs, admission and conversion webhooks, aggregated API servers, @@ -226,7 +235,7 @@ that we can leverage functionality in the existing encryption controllers, thus having full feature parity between existing encryption providers and the new KMS encryption provider. -#### Key Rotation Handling +#### Key Rotation and Data Migration Keys can be rotated in the following ways: * Automatic periodic key rotation by the KMS, following user provided rotation @@ -237,8 +246,58 @@ Keys can be rotated in the following ways: OpenShift must detect the change and trigger re-encryption of affected resources. -TODO: elaborate details about the two rotation scenarios, -how detection works, migration process, etc. +Rotation detection is standard. KMS Plugins must return a `key_id` as part of +the response to a Status gRPC call. This `key_id` is authoritative, so when it +changes, we must consider the key rotated, and migrate the current encrypted +resources to use the new key. The `keyController` will be updated to perform +periodic checks of the `key_id` in the response to a Status call, and recreate +the encryption key secret resource when it detects a change in `key_id`. +TODO: Where is the currently in use `key_id` stored? The `keyController` must +have something to compare with the Status `key_id`. + +Once the encryption key secret resource is recreated as a reaction to a change +in `key_id`, the `migrationController` will detect that a migration is needed, +and will do its job without any modifications. 
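The detection step described above is small but central, so here is a minimal, self-contained sketch of the comparison the `keyController` would perform (type and function names are illustrative, not the actual library-go or KMS v2 API; the real Status message lives in `k8s.io/kms/apis/v2`):

```go
package main

import "fmt"

// statusResponse mirrors only the field of the KMS v2 Status gRPC response
// that matters here; the real message type is defined in k8s.io/kms/apis/v2.
type statusResponse struct {
	KeyID string
}

// needsNewKeySecret reports whether the encryption key secret should be
// recreated: the key_id recorded when the secret was last written is compared
// against the authoritative key_id from the plugin's Status response.
func needsNewKeySecret(storedKeyID string, status statusResponse) bool {
	return storedKeyID != status.KeyID
}

func main() {
	// Unchanged key_id: nothing to do.
	fmt.Println(needsNewKeySecret("key-v1", statusResponse{KeyID: "key-v1"}))
	// Changed key_id: recreate the secret, which in turn lets the
	// migrationController re-encrypt resources with the new key.
	fmt.Println(needsNewKeySecret("key-v1", statusResponse{KeyID: "key-v2"}))
}
```

The useful property of this scheme is that the `key_id` is treated as opaque: the controller never interprets it, it only reacts to changes.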
+
+Key rotation is unfortunately not standardized, so every KMS plugin can
+implement rotation in a different way, as long as the `key_id` returned by the
+Status call remains authoritative. The section below enumerates the steps needed
+to perform rotation for the supported KMS plugins. For example, the AWS KMS
+plugin supports two KMS keys to be configured at the same time, allowing the
+plugin to run with two keys with different ARNs without the need for another
+plugin pod to be configured. Azure KMS plugin on the other hand, can only be
+configured with a single key, so if a user creates a new KMS key, OpenShift
+must create a whole new plugin pod, and run it in parallel with the one
+configured with the previous key. The two pods must run in parallel until
+migration to the new key is complete.
+
+TODO: how do KMS plugins determine `key_id`? It mustn't be the same as i.e.
+KeyARN, because that won't change when the key is rotated. For OpenShift to be
+able to migrate content to a rotated key, we must be able to detect a `key_id`
+change, and thus the `key_id` must not be the same after the key is rotated.
+It must somehow be calculated taking into consideration the key materials...
+
+##### Key Rotation For the AWS KMS Plugin
+
+TODO
+
+##### Key Rotation For the Azure KMS Plugin
+
+The Azure KMS plugin will not be supported in Tech-Preview.
+
+TODO: explain the process of two plugin pods running until migration finishes.
+
+TODO: confirm rotation detection works as expected with the Azure plugin: it
+requires a key version as a parameter to run, and, as far as we know, while
+rotating a key doesn't cause its `key_id` to change (only the key materials),
+the fact that the Azure KMS plugin takes in a key version is concerning,
+because a new version of the key is created when the key is rotated. This might
+just mean that the Status `key_id` will change, and then we need a new pod with
+just a version bump. My concern is that the `key_id` will not change. 
This would make +the plugin incompatible with the KMS plugin interface, but still. I want to be +sure. + + ### Risks and Mitigations From 3f91ebc9e6b649368db29a68dec6b25d1640771e Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Tue, 21 Oct 2025 09:57:35 +0200 Subject: [PATCH 05/13] pull in api extensions from openshift/enhancements#1682 --- ...-encryption-provider-at-datastore-layer.md | 133 ++++++++++++++++-- 1 file changed, 118 insertions(+), 15 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index 086239a0a6..7d8bf5bd89 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -171,21 +171,124 @@ Encryption Key (KEK). ### API Extensions -API Extensions are CRDs, admission and conversion webhooks, aggregated API servers, -and finalizers, i.e. those mechanisms that change the OCP API surface and behaviour. - -- Name the API extensions this enhancement adds or modifies. -- Does this enhancement modify the behaviour of existing resources, especially those owned - by other parties than the authoring team (including upstream resources), and, if yes, how? - Please add those other parties as reviewers to the enhancement. - - Examples: - - Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running. - - Restricts the label format for objects to X. - - Defaults field Y on object kind Z. - -Fill in the operational impact of these API Extensions in the "Operational Aspects -of API Extensions" section. +While in tech-preview, the KMS feature will be placed behind the +`KMSEncryptionProvider` feature-gate. 
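For reviewers who want to exercise the feature-gated API: tech-preview gates like this one are normally enabled by opting the cluster into the `TechPreviewNoUpgrade` feature set (shown below; note that on real clusters this opt-in cannot be reverted):

```yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
```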
+
+Similar to the upstream `EncryptionConfig`'s [`ProviderConfiguration`](https://github.com/kubernetes/apiserver/blob/cccad306d649184bf2a0e319ba830c53f65c445c/pkg/apis/apiserver/types_encryption.go#L89-L101),
+we will add a new `EncryptionType` to the existing `APIServer` config:
+```diff
+diff --git a/config/v1/types_apiserver.go b/config/v1/types_apiserver.go
+index d815556d2..c9098024f 100644
+--- a/config/v1/types_apiserver.go
++++ b/config/v1/types_apiserver.go
+@@ -208,6 +225,11 @@ const (
+ // aesgcm refers to a type where AES-GCM with random nonce and a 32-byte key
+ // is used to perform encryption at the datastore layer.
+ EncryptionTypeAESGCM EncryptionType = "aesgcm"
++
++ // kms refers to a type of encryption where the encryption keys are managed
++ // outside the control plane in a Key Management Service instance,
++ // encryption is still performed at the datastore layer.
++ EncryptionTypeKMS EncryptionType = "KMS"
+ )
+```
+
+The default value today is an empty string, which implies identity, meaning no
+encryption is used in the cluster by default. Other possible local encryption
+schemes include `aescbc` and `aesgcm`, which will remain as-is. Similar to how
+local AES encryption works, the apiserver operators will observe this config
+and apply the KMS `EncryptionProvider` to the `EncryptionConfig`.
+
+```diff
+@@ -191,9 +194,23 @@ type APIServerEncryption struct {
+ // +unionDiscriminator
+ // +optional
+ Type EncryptionType `json:"type,omitempty"`
++
++ // kms defines the configuration for the external KMS instance that manages the encryption keys.
++ // When KMS encryption is enabled, sensitive resources will be encrypted using keys managed by an
++ // externally configured KMS instance.
++ //
++ // The Key Management Service (KMS) instance provides symmetric encryption and is responsible for
++ // managing the lifecycle of the encryption keys outside of the control plane.
++ // This allows integration with an external provider to manage the data encryption keys securely. ++ // ++ // +openshift:enable:FeatureGate=KMSEncryptionProvider ++ // +unionMember ++ // +optional ++ KMS *KMSConfig `json:"kms,omitempty"` +``` + +The KMS encryption type will have a dedicated configuration: + +```diff +diff --git a/config/v1/types_kmsencryption.go b/config/v1/types_kmsencryption.go +new file mode 100644 +index 000000000..8841cd749 +--- /dev/null ++++ b/config/v1/types_kmsencryption.go +@@ -0,0 +1,49 @@ ++package v1 ++ ++// KMSConfig defines the configuration for the KMS instance ++// that will be used with KMSEncryptionProvider encryption ++// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'AWS' ? has(self.aws) : !has(self.aws)",message="aws config is required when kms provider type is AWS, and forbidden otherwise" ++// +union ++type KMSConfig struct { ++ // type defines the kind of platform for the KMS provider ++ // ++ // +unionDiscriminator ++ // +kubebuilder:validation:Required ++ Type KMSProviderType `json:"type"` ++ ++ // aws defines the key config for using an AWS KMS instance ++ // for the encryption. The AWS KMS instance is managed ++ // by the user outside the purview of the control plane. ++ // ++ // +unionMember ++ // +optional ++ AWS *AWSKMSConfig `json:"aws,omitempty"` ++} + ++// KMSProviderType is a specific supported KMS provider ++// +kubebuilder:validation:Enum=AWS ++type KMSProviderType string ++ ++const ( ++ // AWSKMSProvider represents a supported KMS provider for use with AWS KMS ++ AWSKMSProvider KMSProviderType = "AWS" ++) +``` + +This configuration will also include an enum of the various KMS supported by +OCP. For Tech-Preview, it will only have the `AWS` type, but we will add more as we +progress on the feature. This enum is essential to signal users which KMS providers +are currently supported by the platform. 
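Putting the API pieces above together, the admin-facing configuration would look roughly like the following (ARN, account ID, and key ID are illustrative placeholders; field names are taken from the types shown above):

```yaml
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  encryption:
    type: KMS
    kms:
      type: AWS
      aws:
        keyARN: arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab
        region: us-east-1
```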
+
+Each KMS type will have a dedicated configuration that will be reflected on the
+plugin when installed. It will only contain fields that are relevant to end
+users.
+
+```diff
++// AWSKMSConfig defines the KMS config specific to AWS KMS provider
++type AWSKMSConfig struct {
++ // keyARN specifies the Amazon Resource Name (ARN) of the AWS KMS key used for encryption.
++ // The value must adhere to the format `arn:aws:kms:<region>:<account-id>:key/<key-id>`, where:
++ // - `<region>` is the AWS region consisting of lowercase letters and hyphens followed by a number.
++ // - `<account-id>` is a 12-digit numeric identifier for the AWS account.
++ // - `<key-id>` is a unique identifier for the KMS key, consisting of lowercase hexadecimal characters and hyphens.
++ //
++ // +kubebuilder:validation:Required
++ // +kubebuilder:validation:XValidation:rule="self.matches('^arn:aws:kms:[a-z0-9-]+:[0-9]{12}:key/[a-f0-9-]+$') && self.size() <= 128",message="keyARN must follow the format `arn:aws:kms:<region>:<account-id>:key/<key-id>`. The account ID must be a 12 digit number and the region and key ID should consist only of lowercase hexadecimal characters and hyphens (-)."
++ KeyARN string `json:"keyARN"`
++ // region specifies the AWS region where the KMS instance exists, and follows the format
++ // `--`, e.g.: `us-east-1`.
++ // Only lowercase letters and hyphens followed by numbers are allowed. 
++ // ++ // +kubebuilder:validation:XValidation:rule="self.matches('^[a-z]{2}-[a-z]+-[0-9]+$') && self.size() <= 64",message="region must be a valid AWS region" ++ Region string `json:"region"` ++} +``` ### Topology Considerations From d9ea20bbcb19a66803d8589aad85331ab6fbab58 Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Wed, 22 Oct 2025 11:12:39 +0200 Subject: [PATCH 06/13] wip: kms plugin management --- ...-encryption-provider-at-datastore-layer.md | 32 +++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index 7d8bf5bd89..3f28728b9d 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -319,6 +319,13 @@ configuration file for MicroShift? ### Implementation Details/Notes/Constraints +Enabling KMS encryption requires a KMS plugin running in the cluster so that +the apiservers can communicate through the plugin with the external KMS +provider. + +Each KMS provider has a different KMS plugin. OpenShift will manage the entire +lifecycle of KMS plugins. + #### Controller Preconditions and KMS Plugin Health Encryption controllers should only run when the KMS provider plugin is up and @@ -400,6 +407,31 @@ version bump. My concern is that the `key_id` will not change. This would make the plugin incompatible with the KMS plugin interface, but still. I want to be sure. +#### KMS Plugin Management + +1. apiservers share a single instance of kms plugin, this can be achieved in two variations: + a. kms plugin and kas-o share revisions + b. kms plugin revisions are independent of kas-o revisions +2. 
apiservers have dedicated kms plugin instances, managed by their respective operators
+
+**Option 1.a**
+Pros:
+* shared revision means kas-o and kms plugin encryption configuration will never drift
+* we don't have to think about alternative ways to deploy the kms plugin, there's only the static pod
+Cons:
+* other apiservers encryption configuration might drift
+* ?
+**Option 1.b**
+Pros:
+* we don't have to think about alternative ways to deploy the kms plugin, there's only the static pod
+* has the potential for avoiding downtime of kas during encryption config update, since we can ensure the encryption config update isn't rolled out until kms plugins are ready
+Cons:
+* apiservers encryption configuration might drift
+**Option 2**
+Pros:
+* shared revision between all apiservers and their respective kms plugin means config will never drift
+Cons:
+* the fact that kube-apiserver is a static pod, while the rest of the OpenShift apiservers are regular pods managed by deployments, means KMS plugin pods cannot be static pods in all cases

From 235b431b1ef2b23b357ac4c07920e594d73be025 Mon Sep 17 00:00:00 2001
From: Flavian Missi 
Date: Wed, 22 Oct 2025 13:00:57 +0200
Subject: [PATCH 07/13] add requirements for kms plugin management

we probably have more.
---
 .../kms-encryption-provider-at-datastore-layer.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md
index 3f28728b9d..1df75b7f17 100644
--- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md
+++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md
@@ -409,6 +409,12 @@ sure.
 
 #### KMS Plugin Management
 
+Requirements:
+* apiservers must have access to the unix socket for the kms plugin
+* running multiple instances of the kms plugin with different encryption
+  configuration
+
+Alternatives:
 1. 
apiservers share a single instance of kms plugin, this can be achieved in two variations: a. kms plugin and kas-o share revisions b. kms plugin revisions are independent of kas-o revisions From 9ad897661b41ee4a92f99207a68412109ce681a3 Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Wed, 29 Oct 2025 16:22:10 +0100 Subject: [PATCH 08/13] expand on kms plugin management and key rotation also improves some sentences and add some links --- ...-encryption-provider-at-datastore-layer.md | 199 +++++++++++++----- 1 file changed, 141 insertions(+), 58 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index 1df75b7f17..344ad7fb8e 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -15,13 +15,13 @@ api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggreg creation-date: 2025-10-17 last-updated: yyyy-mm-dd tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement - - "https://issues.redhat.com/browse/OCPSTRAT-108" - - "https://issues.redhat.com/browse/OCPSTRAT-1638" + - "https://issues.redhat.com/browse/OCPSTRAT-108" # TP feature + - "https://issues.redhat.com/browse/OCPSTRAT-1638" # GA feature see-also: - "enhancements/kube-apiserver/encrypting-data-at-datastore-layer.md" - "enhancements/etcd/storage-migration-for-etcd-encryption.md" replaces: - - "" + - "https://github.com/openshift/enhancements/pull/1682" superseded-by: - "" --- @@ -31,7 +31,7 @@ superseded-by: ## Summary Provide a user-configurable interface to support encryption of data stored in -etcd using a supported Key Management Service (KMS). +etcd using a supported [Key Management Service (KMS)](https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/). 
 ## Motivation
 
@@ -40,7 +40,8 @@ It protects against etcd data leaks in the event of an etcd backup compromise.
 However, aescbc and aesgcm, which are supported encryption technologies today
 available in OpenShift do not protect against online host compromise i.e. in
 such cases, attackers can decrypt encrypted data from etcd using local keys,
-KMS managed keys protects against such scenarios.
+KMS-managed keys protect against such scenarios since the keys are stored and
+managed externally.
 
 ### User Stories
 
@@ -55,7 +56,8 @@ KMS managed keys protects against such scenarios.
   Status
 * As a cluster admin, I want to be able to switch to a different KMS plugin,
   i.e. from AWS to a pre-installed Vault, by performing a single configuration
-  change without needing to perform any other manual intervention
+  change without needing to perform any other manual intervention or manually
+  migrating data
   * TODO: confirm this requirement
 * As a cluster admin, I want to configure my chosen KMS to automatically rotate
   encryption keys and have OpenShift to automatically become aware of these new
@@ -67,18 +69,18 @@ KMS managed keys protects against such scenarios.
 ### Goals
 
 * Users have an easy to use interface to configure KMS encryption
-* Users will configure OpenShift clusters to use a specific KMS key, created by
-  them
-* Encryption keys managed by the KMS, and are not stored in the cluster
+* Users will configure OpenShift clusters to use one of the supported KMS
+  providers
+* Encryption keys managed by the KMS (i.e. 
KEKs), and are not stored in the
+  cluster
 * Encryption keys are rotated by the KMS, and the configuration is managed by
   the user
 * OpenShift clusters automatically detect KMS key rotation and react
   appropriately
 * Users can disable encryption after enabling it
-* Configuring KMS encryption should not meaningfully degrade the performance of
-  the cluster
+* Overall cluster performance should be similar to other encryption mechanisms
 * OpenShift will manage KMS plugins' lifecycle on behalf of the users
-* Provide users with the tools to monitor the state of KMS plugins and KMS
+* Provide users with the means to monitor the state of KMS plugins and KMS
   itself
 
 ### Non-Goals
 
@@ -89,6 +91,13 @@ KMS managed keys protects against such scenarios.
   via KMS plugins, i.e. Hashicorp Vault or Thales)
 * Full data recovery in cases where the KMS key is lost
 * Support for users to specify which resources they want to encrypt
+* Immediate encryption: OpenShift's encryption works in an eventual model, i.e.
+  it takes OpenShift several minutes to encrypt all the configured resources in
+  the cluster after encryption is initially enabled. This means that even if
+  cluster admins enable encryption immediately after cluster creation, OpenShift
+  may still store unencrypted secrets in etcd. However, OpenShift will eventually
+  migrate all secrets (and other to-be-encrypted resources) to use encryption,
+  so they will all _eventually_ be encrypted in etcd
 
 ## Proposal
 
@@ -96,9 +105,9 @@ To support KMS encryption in OpenShift, we will leverage the work done in
 [upstream Kubernetes](https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3299-kms-v2-improvements).
 However, we will need to extend and adapt the encryption workflow in OpenShift
 to support new constraints introduced by the externalization of encryption keys
-in a KMS. 
Because OpenShift will not own the keys from the KMS, we will also
-need to provide tools to users to detect KMS-related failures and take
-action toward recovering their clusters whenever possible.
+in a KMS. Because OpenShift will not own the keys from the KMS, we will detect
+the KMS-related failures and surface them to the cluster admins for any
+necessary actions.
 
 We focus on supporting KMS v2 only, as KMS v1 has considerable performance
 impact in the cluster.
@@ -112,14 +121,14 @@ as connection details, authentication credentials, and key references.
 From a UX perspective, this is the only change the KMS feature introduces—it
 is intentionally minimal to reduce user burden and potential for errors.
 
-#### Encryption Controller Extensions
+##### Encryption Controller Extensions
 
 This feature will reuse existing encryption and migration workflows while
 extending them to handle externally-managed keys. We will introduce a new
 controller to manage KMS plugin pod lifecycle and integrate KMS plugin health
 checks into the existing controller precondition system.
 
-#### KMS Plugin Lifecycle
+##### KMS Plugin Lifecycle
 
 KMS encryption requires KMS plugin pods to bridge communication between the
 kube-apiserver and the external KMS. In OpenShift, the kube-apiserver-operator
@@ -135,18 +144,38 @@ rotation events.
 
 **cluster admin** is a human user responsible for the overall configuration and
 maintenance of a cluster.
 
-**KMS** is the Key Management Service responsible automatic rotation of the Key
-Encryption Key (KEK).
+**KMS** is the cloud Key Management Service responsible for managing the full
+lifecycle of the Key Encryption Key (KEK), including automatic rotation.
 
 #### Initial Resource Encryption
 
-1. The cluster admin creates an encryption key (KEK) in their KMS of choice
+1. The cluster admin creates an encryption key (KEK) in their cloud KMS of choice
 1. The cluster admin gives the OpenShift apiservers access to the newly created
-   KMS KEK
+   cloud KMS KEK
 1. 
The cluster admin updates the APIServer configuration resource, providing
-   the necessary configuration options for the KMS of choice
-1. The cluster admin observes the `kube-apiserver` `clusteroperator` resource,
-   for progress on the configuration change, as well as migration of resources
+   the necessary [encryption configuration options](encryption-cfg-opts) for
+   the cloud KMS of choice
+1. The cluster admin observes the `clusteroperator/kube-apiserver` resource
+   for progress on the configuration change and encryption of existing resources
+
+[encryption-cfg-opts]: https://github.com/openshift/api/blob/master/config/v1/types_kmsencryption.go#L7-L22
+
+#### KMS Plugin Management
+
+1. The cluster admin configures encryption in the cluster
+1. The KMS plugin controller generates a unix socket name, unique for this
+   encryption configuration
+1. The KMS plugin controller generates a pod manifest for the configured cloud
+   KMS, setting the `key_id`, the unix socket name generated in the previous
+   step, and any other configurations required by the KMS plugin in question
+1. The KMS plugin controller watches the `key_id` from the KMS plugin Status
+   gRPC endpoint, and when it detects it has changed, it configures the KMS
+   plugin to use the new key _in addition to the current key_
+   * TODO: should the KMS plugin controller also watch the encryption key secret
+     for changes?
+1. The KMS plugin controller watches the migration status of encrypted
+   resources, and once migration finishes, it configures the KMS plugin to only
+   use the new key, removing the previous one from the plugin configuration
 
 #### Key rotation
 
@@ -158,6 +187,16 @@ Encryption Key (KEK).
    resource, and sees that the KEK was rotated, and the status of the data
    migration
 
+1. TODO: how will `keyController` learn the unix socket name? It needs to be
+   able to call the kms plugin's Status gRPC endpoint, and it needs the unix
+   socket to do that
+1. 
`stateController` generates the `EncryptionConfig`, using the unix socket
+   path generated by the KMS plugin controller, and any other configurations
+   required
+1. `keyController` watches the `key_id` from the KMS plugin Status gRPC call,
+   as well as the APIServer config for changes, and when either of these
+   changes, it creates a new encryption key secret
+
 #### Change of KMS Provider
 
 1. The cluster admin creates a KEK in a KMS different than the one currently
@@ -319,6 +358,11 @@ configuration file for MicroShift?
 
 ### Implementation Details/Notes/Constraints
 
+This feature may bring slight performance degradation due to the reliance on
+an external system. However, overall performance should be similar to other
+encryption mechanisms. During migrations, performance depends on the number of
+resources that will be migrated.
+
 Enabling KMS encryption requires a KMS plugin running in the cluster so that
 the apiservers can communicate through the plugin with the external KMS
 provider.
@@ -348,48 +392,68 @@ KMS encryption provider.
 
 #### Key Rotation and Data Migration
 
 Keys can be rotated in the following ways:
-* Automatic periodic key rotation by the KMS, following user provided rotation
-  policy in the KMS itself
-* The user creates a new KMS key, and updates the KMS section of the APIServer
-  config with the new key
-
-OpenShift must detect the change and trigger re-encryption of affected
-resources.
-
-Rotation detection is standard. KMS Plugins must return a `key_id` as part of
-the response to a Status gRPC call. This `key_id` is authoritative, so when it
-changes, we must consider the key rotated, and migrate the current encrypted
-resources to use the new key. The `keyController` will be updated to perform
-periodic checks of the `key_id` in the response to a Status call, and recreate
-the encryption key secret resource when it detects a change in `key_id`.
-TODO: Where is the currently in use `key_id` stored? 
The `keyController` must
-have something to compare with the Status `key_id`.
+* Key materials rotation: automatic periodic key rotation by the KMS, following
+  a user-provided rotation policy in the KMS itself; or manual key rotation in
+  the cloud KMS; does not result in a new key resource in the cloud KMS
+* Key rotation: the user creates a new KMS key resource (while keeping the
+  previous one), and updates the encryption section of the APIServer config to
+  point to the new key
+
+Regardless of whether the key itself is rotated (causing a change in `key_id`)
+or the key materials are rotated (not causing a change in `key_id`), OpenShift
+must detect the change and trigger re-encryption of resources.
+
+**Key rotation**
+
+KMS Plugins must return a `key_id` as part of the response to a Status gRPC call.
+This `key_id` is authoritative, so when it changes, we must consider the key
+rotated, and migrate all encrypted resources to use the new key. The
+`keyController` will be updated to perform periodic checks of the `key_id` in
+the response to a Status call, and recreate the encryption key secret resource
+when it detects a change in `key_id`.
+The `key_id` will be stored in the encryption secret resource managed by the
+`keyController`. Currently, this resource is used to store key materials for
+AES-CBC and AES-GCM keys, so we'll simply reuse this logic, but without storing
+key materials when the KMS provider is selected.
 
 Once the encryption key secret resource is recreated as a reaction to a change
 in `key_id`, the `migrationController` will detect that a migration is needed,
 and will do its job without any modifications.
 
-Key rotation is unfortunately not standardized, so every KMS plugin can
-implement rotation in a different way, as long as the `key_id` returned by the
-Status call remains authoritative. The section below enumerates the steps needed
-to perform rotation for the supported KMS plugins. 
For example, the AWS KMS
-plugin supports two KMS keys to be configured at the same time, allowing the
-plugin to run with two keys with different ARNs without the need for another
-plugin pod to be configured. Azure KMS plugin on the other hand, can only be
-configured with a single key, so if a user creates a new KMS key, OpenShift
-must create a whole new plugin pod, and run it in parallel with the one
-configured with the previous key. The two pods must run in parallel until
-migration to the new key is complete.
-
-TODO: how do KMS plugins determine `key_id`? It mustn't be the same as i.e.
-KeyARN, because that won't change when the key is rotated. For OpenShift to be
-able to migrate content to a rotated key, we must be able to detect a `key_id`
-change, and thus the `key_id` must not be the same after the key is rotated.
-It must somehow be calculated taking into consideration the key materials...
+There is no standardized way to configure a KMS plugin to rotate a key.
+For example, the AWS KMS plugin supports two KMS keys to be configured for the
+same process, allowing the plugin to run with two keys with different ARNs
+without the need for another plugin pod to be configured. The Azure KMS plugin,
+on the other hand, can only be configured with a single key, so if a user
+creates a new KMS key, OpenShift must create a whole new plugin pod, and run it
+in parallel with the one configured with the previous key. These two pods must
+run in parallel until all resources are migrated to the new encryption key.
+
+**Key material rotation**
+
+Rotation of key materials is commonly implemented by cloud providers through
+automatic creation of versions of the same key.
+Unfortunately, at the time of writing there is no standard way for KMS plugins
+to communicate a change in key materials to clients.
+
+For example, the AWS KMS plugin does nothing to indicate a change in key version.
+In other words, the `key_id` remains unchanged. 
The Azure KMS plugin, on the other
+hand, defines the `key_id` as a hash of the key name and key version, so a change
+in key materials always results in a change in `key_id`.
+However, the Azure KMS plugin also requires a restart when a new version of a
+key is created, which in turn requires two parallel versions of the KMS plugin
+to run in tandem until all data is migrated from the old key to the new.
 
 ##### Key Rotation For the AWS KMS Plugin
 
-TODO
+During Tech-Preview, the AWS KMS plugin will be the only plugin supported.
+
+Rotating a KMS key is always a user-invoked operation. It requires users
+to edit the APIServer configuration, setting a new key.
+OpenShift already automates the necessary [step-by-step changes](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/#rotating-a-decryption-key)
+to Kubernetes' `EncryptionConfig`, including migrating resources encrypted by
+the old key to use the new key.
+
+When rotating the key, the AWS KMS plugin pods must be updated to run with
+both the old and the new key until migration of all encrypted resources to
+the new key completes.
 
 ##### Key Rotation For the Azure KMS Plugin
 
@@ -442,6 +506,25 @@ Cons:
 
 ### Risks and Mitigations
 
+#### Loss of encryption key
+
+TODO
+
+* Ensure the cloud KMS key is configured with a grace period after key deletion
+* In-memory caches of the unencrypted DEK seed
+* Monitoring and alerts in place to detect when the KMS key has been deleted
+  without a subsequent `key_id` change
+  * Deleted keys cannot be used for encryption, only decryption. We can use
+    this (along with an unchanged `key_id`) to detect when a key was deleted
+
+#### Temporary Cloud KMS Outages
+
+TODO
+
+* In-memory caches of the unencrypted DEK seed
+
+----
+
 What are the risks of this proposal and how do we mitigate. Think broadly. For
 example, consider both security and how this will impact the larger OKD
 ecosystem.
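To make the DEK-seed caching mitigation mentioned in the risk bullets above concrete, here is a hedged sketch of the envelope-encryption caching idea behind KMS v2 (names and structure are illustrative, not the actual kube-apiserver implementation): the KEK never leaves the cloud KMS, but decrypted DEK seeds stay cached in apiserver memory, so operations hitting the cache keep working through a temporary KMS outage.

```go
package main

import "fmt"

// dekCache maps an encrypted DEK (as produced by the KMS Encrypt call) to its
// decrypted seed. A cache hit avoids a round trip to the cloud KMS, which is
// what lets reads of already-seen resources survive a temporary KMS outage.
type dekCache struct {
	seeds map[string][]byte // encrypted DEK -> decrypted DEK seed
}

func (c *dekCache) lookup(encryptedDEK string) ([]byte, bool) {
	seed, ok := c.seeds[encryptedDEK]
	return seed, ok
}

func main() {
	cache := &dekCache{seeds: map[string][]byte{
		"ciphertext-of-dek-1": []byte("seed-1"),
	}}
	// Cache hit: no KMS call needed.
	if seed, ok := cache.lookup("ciphertext-of-dek-1"); ok {
		fmt.Println("hit:", string(seed))
	}
	// Cache miss: would require a Decrypt call to the (possibly unavailable) KMS.
	if _, ok := cache.lookup("ciphertext-of-dek-2"); !ok {
		fmt.Println("miss: must call KMS Decrypt")
	}
}
```

Only cache misses (new or evicted DEKs) are exposed to a KMS outage, which is why the mitigation is listed for both the key-loss and the outage scenarios.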
From 7a02c8368d894fcfd712e48d1517548c4286f03d Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Thu, 30 Oct 2025 14:32:29 +0100 Subject: [PATCH 09/13] improve kms plugin management section --- ...-encryption-provider-at-datastore-layer.md | 27 ++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index 344ad7fb8e..add36b15dd 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -477,9 +477,34 @@ Requirements: * apiservers must have access the unix socket for the kms plugin * running multiple instances of the kms plugin with different encryption configuration +* kms plugins must have access to cloud kms + +KMS plugins will be deployed as sidecar containers running along with each of +OpenShift's apiservers. + +Library-go will contain the shared sidecar container specification, which all +apiservers will base their plugin sidecar containers from. + +Due to differences in deployment type, KMS plugin sidecar container for +kube-apiserver will differ from those for the openshift and oauth apiservers. + +| API Server | Deployment Type | hostNetwork | IMDS Access | +|---------------------|-----------------|--------------------------------|-------------------------------------------| +| kube-apiserver | Static Pod | ✅ true | ✅ Direct access to EC2 instance IAM role | +| openshift-apiserver | Deployment | ❌ Not set (defaults to false) | ❌ No direct IMDS access | +| oauth-apiserver | Deployment | ❌ Not set (defaults to false) | ❌ No direct IMDS access | + +When `hostNetwork: true`, the control-plane IAM role must have permission to +encrypt/decrypt objects in the cloud KMS. TODO: explain how this will be done. 
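Whichever IAM role ends up carrying the permission, the policy it needs would look roughly like the following (key ARN is an illustrative placeholder; `kms:Encrypt`, `kms:Decrypt`, and `kms:DescribeKey` are the standard AWS KMS actions involved in envelope encryption and health checking):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowKMSEnvelopeEncryption",
      "Effect": "Allow",
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    }
  ]
}
```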
+ +When `hostNetwork: false`, each apiserver IAM role must have permission to +encrypt/decrypt objects in the cloud KMS. TODO: explain how this will be done. + + +TODO: move the below to alternatives section Alternatives: -1. apiservers share a single instance of kms plugin, this can be achieved in two variations: +1. apiservers share a single instance of kms plugin, achievable through two variations: a. kms plugin and kas-o share revisions b. kms plugin revisions are independent of kas-o revisions 2. apiservers have dedicated kms plugin instance, managed by their respective operators From 1dc67d0a94b543dc744c41f28c6d050ab0e42c09 Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Thu, 30 Oct 2025 17:31:53 +0100 Subject: [PATCH 10/13] improve rotation section --- ...-encryption-provider-at-datastore-layer.md | 49 +++++++++---------- 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index add36b15dd..6ddb754219 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -391,18 +391,31 @@ KMS encryption provider. 
#### Key Rotation and Data Migration
 
-Keys can be rotated in the following ways:
-* Key materials rotation: automatic periodic key rotation by the KMS, following
-  user provided rotation policy in the KMS itself; or manual key rotation in the
-  cloud KMS; does not result in a new key resource in the cloud KMS
-* Key rotation: the user creates a new KMS key resource (while keeping the
-  previous one), and updates the encryption section of the APIServer config to
-  point to the new key
-
-Regardless of whether the key itself is rotated (causing a change in `key_id`)
-or the key materials is rotated (not causing a change in `key_id`), OpenShift
-must detect the change and trigger re-encryption of resources.
+Data migration must happen in the following scenarios:
+* The cluster admin enables encryption in the cluster for the first time
+* The cluster admin updates the `KMSConfig` in APIServer config
+* The cloud KMS automatically rotates the KEK
+* The cluster admin manually rotates the KEK
+
+KMS Key rotation does not change the identity of the key in the cloud KMS, it
+only changes the key materials, and in most cloud KMS providers it results in a
+new version of the same key. Despite that, KMS plugins are required to return a
+different `key_id` when the KMS key (KEK) is rotated.
+
+Note that the AWS KMS plugin does not change the `key_id` when the KEK is
+rotated. This is an as-yet-unreported bug in the AWS KMS plugin. However,
+it should not cause any problems to end-users, since AWS does not expire old
+versions of a key. TODO: explain what we mean by "expire".
+
+The encryption controllers in library-go already handle migration of encrypted
+resources. The `keyController`, responsible for creating and rotating keys,
+needs to change so that the encryption key secret it manages becomes a
+reflection of the `key_id` returned by the KMS plugin Status gRPC call.
+TODO: elaborate on the above.
+ + +TODO: merge below and above blocks **Key rotation** KMS Plugins must return a `key_id` as part of the response to a Status gRPC call. This `key_id` is authoritative, so when it changes, we must consider the key @@ -428,20 +441,6 @@ a new KMS key, OpenShift must create a whole new plugin pod, and run it in parallel with the one configured with the previous key. These two pods must run in parallel until all resources are migrated to the new encryption key. -**Key material rotation** - -Rotation of key materials is commonly implemented by cloud providers through -automatic creation of versions of the same key. -Unfortunately, at the time of writing there is no starndard way for KMS plugins -to communicate a change in key materials to clients. - -For example, the AWS KMS plugin does nothing to indicate a change in key version. -In other words, the `key_id` remains unchanged. The Azure KMS plugin on the other -hand, defines the `key_id` as a hash of the key name and key version, so a change -in key materials always results in a change in `key_id`. -However, the Azure KMS plugin also requires a restart when a new version of a -key is created, which in turn requires two parallel versions of the kms plugin -to run in tandem until all data is migrated from the old key to the new. 
##### Key Rotation For the AWS KMS Plugin From 2a37ac5def2efebf7188bd398b4833b1dfc813c3 Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Thu, 27 Nov 2025 17:35:51 +0100 Subject: [PATCH 11/13] wip: plugin management --- ...-encryption-provider-at-datastore-layer.md | 213 +++++++++++++----- 1 file changed, 151 insertions(+), 62 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md index 6ddb754219..57e0c39798 100644 --- a/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md +++ b/enhancements/kube-apiserver/kms-encryption-provider-at-datastore-layer.md @@ -5,11 +5,12 @@ authors: - "@dgrisonnet" - "@flavianmissi" reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect" + - "@derekwaynecarr" - "@ibihim" - "@sjenning" - "@tkashem" approvers: # A single approver is preferred, the role of the approver is to raise important questions, help ensure the enhancement receives reviews from all applicable areas/SMEs, and determine when consensus is achieved such that the EP can move forward to implementation. Having multiple approvers makes it difficult to determine who is responsible for the actual approval. - - "@sjenning" + - "@benluddy" api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None" - "@JoelSpeed" creation-date: 2025-10-17 @@ -149,12 +150,12 @@ lifecycle of the Key Encryption Key (KEK), including automatic rotation. #### Initial Resource Encryption -1. The cluster admin creates an encryption key (KEK) in their cloud KMS of choice +1. The cluster admin creates an encryption key (KEK) in their KMS of choice 1. 
The cluster admin gives the OpenShift apiservers access to the newly created
-   cloud KMS KEK
+   KMS KEK
 1. The cluster admin updates the APIServer configuration resource, providing
    the necessary [encryption configuration options](encryption-cfg-opts) for
-   the cloud KMS of choice
+   the KMS of choice
 1. The cluster admin observes the `clusteroperator/kube-apiserver` resource
    for progress on the configuration change and encryption of existing
    resources
 
@@ -395,11 +396,11 @@
 Data migration must happen in the following scenarios:
 * The cluster admin enables encryption in the cluster for the first time
 * The cluster admin updates the `KMSConfig` in APIServer config
-* The cloud KMS automatically rotates the KEK
+* The KMS automatically rotates the KEK
 * The cluster admin manually rotates the KEK
 
-KMS Key rotation does not change the identity of the key in the cloud KMS, it
-only changes the key materials, and in most cloud KMS providers it results in a
+KMS Key rotation does not change the identity of the key in the KMS, it
+only changes the key materials, and in most KMS providers it results in a
 new version of the same key. Despite that, KMS plugins are required to return a
 different `key_id` when the KMS key (KEK) is rotated.
 
@@ -472,60 +473,148 @@ sure.
 
 #### KMS Plugin Management
 
-Requirements:
-* apiservers must have access the unix socket for the kms plugin
-* running multiple instances of the kms plugin with different encryption
-  configuration
-* kms plugins must have access to cloud kms
-
-KMS plugins will be deployed as sidecar containers running along with each of
-OpenShift's apiservers.
-
-Library-go will contain the shared sidecar container specification, which all
-apiservers will base their plugin sidecar containers from.
-
-Due to differences in deployment type, KMS plugin sidecar container for
-kube-apiserver will differ from those for the openshift and oauth apiservers.
- -| API Server | Deployment Type | hostNetwork | IMDS Access | -|---------------------|-----------------|--------------------------------|-------------------------------------------| -| kube-apiserver | Static Pod | ✅ true | ✅ Direct access to EC2 instance IAM role | -| openshift-apiserver | Deployment | ❌ Not set (defaults to false) | ❌ No direct IMDS access | -| oauth-apiserver | Deployment | ❌ Not set (defaults to false) | ❌ No direct IMDS access | - -When `hostNetwork: true`, the control-plane IAM role must have permission to -encrypt/decrypt objects in the cloud KMS. TODO: explain how this will be done. - -When `hostNetwork: false`, each apiserver IAM role must have permission to -encrypt/decrypt objects in the cloud KMS. TODO: explain how this will be done. - - -TODO: move the below to alternatives section - -Alternatives: -1. apiservers share a single instance of kms plugin, achievable through two variations: - a. kms plugin and kas-o share revisions - b. kms plugin revisions are independent of kas-o revisions -2. apiservers have dedicated kms plugin instance, managed by their respective operators - -**Option 1.a** -Pros: -* shared revision means kas-o and kms plugin encryption configuration will never drift -* we don't have to think about alternative ways to deploy the kms plugin, there's only the static pod -Cons: -* other apiservers encryption configuration might drift -* ? 
-**Option 1.b**
-Pros:
-* we don't have to think about alternative ways to deploy the kms plugin, there's only the static pod
-* has the potential for avoiding downtime of kas during encryption config update, since we can ensure the encryption config update isn't rolled out until kms plugins are ready
-Cons:
-* apiservers encryption configuration might drift
-
-**Option 2**
-Pros:
-* shared revision between all apiservers and their respective kms plugin means config will never drift
-Cons:
-* kube-apiserver is a static pod, and the rest of openshift apiservers are regular pods managed by deployments means kms plugins pods cannot be static pods in all cases
+##### Requirements
+
+* API servers must have access to the Unix domain socket for the KMS plugin
+  (this can be achieved by sharing a volume with the plugin container)
+* Support running multiple instances of the KMS plugin with different
+  encryption configurations. This is required for KEK rotation
+* KMS plugins must be authorized to communicate with the KMS
+* KMS plugin lifecycle must be fully managed by OpenShift, including
+  reconciliation based on APIServer configuration changes
+* OpenShift must fully report on plugin status and health. This is expanded
+  under the Recovery section (TODO: write recovery section)
+
+##### Implementation Approach
+
+KMS plugins are deployed as **sidecar containers** running alongside each of
+OpenShift's API servers. Each of the 3 apiserver operators manages its own KMS
+plugin sidecar instance.
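A rough sketch of what an injected sidecar could look like for a Deployment-based apiserver is shown below. The container name, image placeholder, and plugin CLI flags are assumptions for illustration, not the operator-rendered manifest; the socket path, volume name, and key ARN format follow the values used elsewhere in this document.

```yaml
# Illustrative sidecar sketch only — not the actual rendered pod spec.
containers:
- name: kms-plugin
  image: <KMS_PLUGIN_IMAGE>   # resolved from the operator's KMS_PLUGIN_IMAGE env var
  args:
  - --key=arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012
  - --region=us-east-1
  - --listen=/var/run/kmsplugin/socket.sock
  volumeMounts:
  - name: kms-plugin-socket   # same volume is mounted by the apiserver container
    mountPath: /var/run/kmsplugin
volumes:
- name: kms-plugin-socket
  emptyDir: {}                # hostPath for the kube-apiserver static pod
```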
+ +**Deployment Architecture:** + +| API Server | Deployment Type | hostNetwork | Volume Type | Credential Source | Managed By | +|---------------------|-----------------|-------------|-------------|-------------------|-----------------------------------------| +| kube-apiserver | Static Pod | true | hostPath | TODO | cluster-kube-apiserver-operator | +| openshift-apiserver | Deployment | false | emptyDir | TODO | cluster-openshift-apiserver-operator | +| oauth-apiserver | Deployment | false | emptyDir | TODO | cluster-authentication-operator | + +TODO: document vault and thales credential sources + +##### Shared library-go Components + +The implementation leverages `library-go/pkg/operator/encryption/kms/` which +provides: + +1. **Shared Container Specification**: `ContainerConfig` struct that + encapsulates KMS plugin container configuration +2. **Volume Management**: Functions to create socket and credential volumes + based on deployment type +3. **Pod Injection Logic**: `AddKMSPluginToPodSpec()` function that handles + sidecar injection +4. **Socket Path Generation**: Builds Unix socket paths based on APIServer KMS + configuration + +All three API server operators import and use these shared components, ensuring +consistency across the platform. + +##### Configuration Detection + +All operators watch the cluster-scoped `config.openshift.io/v1 APIServer` +resource. When `spec.encryption.type` is set to `"KMS"`, operators +automatically inject the KMS plugin sidecar into their respective API server +pods. + +The KMS plugin image is specified via the `KMS_PLUGIN_IMAGE` environment +variable on each operator deployment. +To fully automate the process of KMS plugin deployment, we will add supported +KMS plugin images to the OpenShift release payload in GA. + +##### Credential Management + +**For kube-apiserver (Static Pod with hostNetwork: true):** + +The KMS plugin sidecar accesses AWS credentials through the EC2 Instance Metadata Service (IMDS). 
The master node's IAM role must have the following KMS permissions: + +```json +{ + "Effect": "Allow", + "Action": [ + "kms:Encrypt", + "kms:Decrypt", + "kms:DescribeKey", + "kms:GenerateDataKey" + ], + "Resource": "" +} +``` + +Users are responsible for configuring the master node IAM role with these permissions. A helper script is provided in `library-go/pkg/operator/encryption/kms/master-node-iam-setup.sh`. + +**For openshift-apiserver and oauth-apiserver (Deployments with hostNetwork: false):** + +These API servers cannot access IMDS directly, so they use AWS credentials from Kubernetes Secrets created by the Cloud Credential Operator (CCO). + +CredentialsRequest resources are provided in `library-go/pkg/operator/encryption/kms/`: +- `openshift-apiserver-kms-credentials-request.yaml` +- `oauth-apiserver-kms-credentials-request.yaml` + +When CCO operates in **Mint mode**, it automatically creates IAM users and provisions the `kms-credentials` secret in each API server's namespace. The operators watch for these secrets and only inject the KMS plugin sidecar once the credentials are available. + +**Graceful Degradation:** +If KMS encryption is enabled but credentials aren't ready, operators: +1. Log a warning indicating credentials are pending +2. Skip sidecar injection (return nil, not error) +3. Allow the deployment to proceed without the KMS sidecar +4. Automatically inject the sidecar on the next reconciliation when credentials become available + +This prevents blocking API server rollouts while waiting for CCO to provision credentials. 
+ +##### Sidecar Injection Mechanism + +Each operator injects the KMS plugin sidecar at a specific point in its reconciliation loop: + +**kube-apiserver-operator:** +- Injection point: `targetconfigcontroller.managePods()` +- Modifies the static pod manifest before writing to the pod ConfigMap +- Uses `hostPath` volume pointing to `/var/run/kmsplugin` on the host + +**openshift-apiserver-operator:** +- Injection point: `workload.manageOpenShiftAPIServerDeployment_v311_00_to_latest()` +- Modifies the deployment spec after setting input hashes +- Uses `emptyDir` volume for socket isolation + +**authentication-operator (oauth-apiserver):** +- Injection point: `workload.syncDeployment()` +- Modifies the deployment spec after setting input hashes +- Uses `emptyDir` volume for socket isolation + +##### Socket Communication + +The API server and KMS plugin communicate via a Unix domain socket: +- **Socket path**: `/var/run/kmsplugin/socket.sock` +- **Volume name**: `kms-plugin-socket` +- **Protocol**: gRPC over Unix domain socket (KMS v2 API) + +The socket path can be customized based on the KMS configuration to support multiple concurrent KMS providers if needed. + +##### Reactivity and Updates + +All operators watch: +1. **APIServer resource**: Triggers reconciliation when encryption type or KMS config changes +2. **Secrets** (for Deployment-based API servers): Triggers reconciliation when credentials are created or updated +3. **Operator environment variables**: `KMS_PLUGIN_IMAGE` changes trigger operator pod restart and subsequent sidecar updates + +When the `keyARN` is updated in the APIServer configuration: +1. Operators detect the configuration change +2. New deployment/static pod revision is created with updated KMS configuration +3. Rolling update replaces old pods with new ones +4. Old pods continue serving requests until new pods are ready +5. 
No socket path conflicts occur due to pod-level volume isolation (emptyDir) or sequential rollout (static pods) + +##### Alternative Approaches Considered + +See the [Alternatives](#alternatives-not-implemented) section for details on shared KMS plugin deployment models that were not selected. ### Risks and Mitigations @@ -534,7 +623,7 @@ Cons: TODO -* Ensure cloud KMS key is configured with grace period after key deletion +* Ensure KMS key is configured with grace period after key deletion * In-memory caches of the unencrypted DEK seed * Monitoring and alerts in place to detect when the KMS key has been deleted and not followed by a `key_id` change From 13b3d48d025b7bc04f7155125f8e70d535d9fe44 Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Fri, 28 Nov 2025 14:56:38 +0100 Subject: [PATCH 12/13] split kms enhancement in 3 --- .../kms-encryption-foundations.md | 721 ++++++++++++++++++ .../kube-apiserver/kms-migration-recovery.md | 248 ++++++ .../kube-apiserver/kms-plugin-management.md | 504 ++++++++++++ 3 files changed, 1473 insertions(+) create mode 100644 enhancements/kube-apiserver/kms-encryption-foundations.md create mode 100644 enhancements/kube-apiserver/kms-migration-recovery.md create mode 100644 enhancements/kube-apiserver/kms-plugin-management.md diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md new file mode 100644 index 0000000000..1a8389e2ba --- /dev/null +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -0,0 +1,721 @@ +--- +title: kms-encryption-foundations +authors: + - "@ardaguclu" + - "@dgrisonnet" + - "@flavianmissi" +reviewers: + - "@ibihim" + - "@sjenning" + - "@tkashem" + - "@derekwaynecarr" +approvers: + - "@benluddy" +api-approvers: + - "@JoelSpeed" +creation-date: 2025-01-28 +last-updated: 2025-01-28 +tracking-link: + - "https://issues.redhat.com/browse/OCPSTRAT-108" +see-also: + - "enhancements/kube-apiserver/kms-plugin-management.md" + - 
"enhancements/kube-apiserver/kms-migration-recovery.md" + - "enhancements/kube-apiserver/encrypting-data-at-datastore-layer.md" + - "enhancements/etcd/storage-migration-for-etcd-encryption.md" +replaces: + - "" +superseded-by: + - "" +--- + +# KMS Encryption Foundations + +## Summary + +Provide the foundational support for Key Management Service (KMS) encryption in OpenShift by: +1. Extending the `config.openshift.io/v1/APIServer` resource to add KMS encryption configuration +1. Extending encryption controllers in `openshift/library-go` to support KMS encryption + +This enhancement enables the existing encryption infrastructure (`keyController`, `stateController`, `migrationController`) to work with externally-managed encryption keys from KMS providers, while maintaining feature parity with local encryption modes (aescbc, aesgcm). For Tech Preview, only AWS KMS is supported; additional providers (Vault, Thales) will be added in GA. + +## Motivation + +OpenShift's existing encryption controllers manage local AES keys for encrypting data at rest in etcd. Adding KMS support to these controllers enables integration with external Key Management Systems (KMS) where encryption keys are stored and rotated outside the cluster. KMS encryption protects against attackers who gain access to control plane nodes, since the encryption keys are stored externally rather than on the nodes themselves. + +The controller extensions are designed to minimize provider-specific logic. While some provider-specific code is necessary for configuration handling, the core controller logic for key rotation detection and migration remains provider-agnostic. 
+ +### User Stories + +* As a cluster admin, I want encryption controllers to automatically detect when my external KMS rotates an encryption key, so that OpenShift can re-encrypt my data with the new key without manual intervention +* As a cluster admin, I want the same migration and monitoring experience for KMS encryption as I have with local AES encryption, so that I don't need to learn new operational procedures +* As a cluster admin, I want encryption controllers to verify KMS plugin health before performing operations, so that encryption/decryption failures don't impact cluster availability + +### Goals + +* Extend encryption controllers to support KMS as a new encryption mode +* Implement automatic key rotation detection based on KMS plugin-provided `key_id` +* Maintain empty encryption key secrets for KMS (keys stored externally) +* Ensure existing migration workflows work seamlessly with KMS +* Provide provider-agnostic controller implementation (no provider-specific logic) +* Maintain feature parity with existing encryption modes + +### Non-Goals + +* KMS plugin deployment and lifecycle management (see [KMS Plugin Management](kms-plugin-management.md)) +* Provider-specific KMS configurations (AWS, Vault, Thales details are in [KMS Plugin Management](kms-plugin-management.md)) +* Migration between different KMS providers (deferred to [KMS Migration and Recovery](kms-migration-recovery.md) for GA) +* Recovery from KMS key loss scenarios (deferred to [KMS Migration and Recovery](kms-migration-recovery.md) for GA) + +## Proposal + +Extend the existing encryption controller framework in `openshift/library-go` to support KMS encryption by: + +1. Adding KMS as a new encryption mode in the `state` package +2. Implementing hash-based key rotation detection using KMS configuration and `key_id` +3. Managing empty encryption key secrets for KMS (actual keys stored in external KMS) +4. Extending controller preconditions to verify KMS plugin health +5. 
Ensuring migration controller works with KMS encryption transitions + +The implementation maintains the existing controller architecture while adding KMS-specific logic where necessary. + +### Workflow Description + +#### Roles + +**keyController** is the library-go controller responsible for creating and rotating encryption key secrets. + +**KMS Plugin** is a gRPC service implementing the Kubernetes KMS v2 API, running as a sidecar to API server pods. + +**External KMS** is the cloud or on-premises Key Management Service (e.g., AWS KMS, HashiCorp Vault) that stores and manages the Key Encryption Key (KEK). + +#### Key Rotation Detection Workflow + +1. The cluster admin configures a KMS provider in the APIServer config +2. The cluster admin configures automatic key rotation in their external KMS +3. The external KMS rotates the KEK (e.g., AWS KMS creates a new key version) +4. The KMS plugin detects the rotation and updates its Status response with a new `key_id` +5. The keyController polls the KMS plugin Status endpoint (via gRPC) +6. The keyController detects the `key_id` has changed +7. The keyController computes a new `kmsKeyIDHash` (combining config hash + new `key_id`) +8. The keyController creates a new encryption key secret with the updated hash +9. The migrationController detects the new secret and initiates data re-encryption +10. Resources are re-encrypted using the new KEK in the external KMS + +#### KMS Configuration Change Workflow + +1. The cluster admin updates the KMS configuration (e.g., changes Vault address or AWS key ARN) +2. The keyController detects the configuration change in APIServer config +3. The keyController computes a new `kmsConfigHash` +4. The keyController computes a new `kmsKeyIDHash` (new config hash + current `key_id`) +5. The keyController creates a new encryption key secret +6. 
The migrationController initiates re-encryption with the new configuration + +### API Extensions + +This enhancement extends the `config.openshift.io/v1/APIServer` resource to add KMS as a new encryption type. The API provides a foundation for KMS encryption that is extended in future releases to support additional KMS providers. + +#### Encryption Type Extension + +```diff +diff --git a/config/v1/types_apiserver.go b/config/v1/types_apiserver.go +index d815556d2..c9098024f 100644 +--- a/config/v1/types_apiserver.go ++++ b/config/v1/types_apiserver.go +@@ -191,9 +194,23 @@ type APIServerEncryption struct { + // +unionDiscriminator + // +optional + Type EncryptionType `json:"type,omitempty"` ++ ++ // kms defines the configuration for external KMS encryption. ++ // When KMS encryption is enabled, sensitive resources are encrypted using keys managed by an ++ // externally configured KMS instance. ++ // ++ // The Key Management Service (KMS) instance provides symmetric encryption and is responsible for ++ // managing the lifecycle of encryption keys outside of the control plane. ++ // ++ // +openshift:enable:FeatureGate=KMSEncryptionProvider ++ // +unionMember ++ // +optional ++ KMS *KMSConfig `json:"kms,omitempty"` +``` + +```diff +@@ -208,6 +225,11 @@ const ( + // aesgcm refers to a type where AES-GCM with random nonce and a 32-byte key + // is used to perform encryption at the datastore layer. + EncryptionTypeAESGCM EncryptionType = "aesgcm" ++ ++ // kms refers to a type of encryption where the encryption keys are managed ++ // outside the control plane in a Key Management Service instance. ++ // Encryption is still performed at the datastore layer. ++ EncryptionTypeKMS EncryptionType = "KMS" + ) +``` + +#### KMS Configuration Types + +New file: `config/v1/types_kmsencryption.go` + +```go +package v1 + +// KMSConfig defines the configuration for the KMS instance used with KMS encryption. 
+// The configuration is provider-specific and uses a union discriminator pattern to
+// ensure only the appropriate provider configuration is set.
+//
+// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'AWS' ? has(self.aws) : !has(self.aws)",message="aws config is required when kms provider type is AWS, and forbidden otherwise"
+// +union
+type KMSConfig struct {
+	// type defines the KMS provider type.
+	//
+	// For Tech Preview, only AWS is supported.
+	// Additional providers (Vault, Thales) will be added in GA.
+	//
+	// +unionDiscriminator
+	// +kubebuilder:validation:Required
+	Type KMSProviderType `json:"type"`
+
+	// aws defines the configuration for AWS KMS encryption.
+	// The AWS KMS instance is managed by the user outside the control plane.
+	//
+	// +unionMember
+	// +optional
+	AWS *AWSKMSConfig `json:"aws,omitempty"`
+}
+
+// KMSProviderType defines the supported KMS provider types.
+//
+// For Tech Preview, only AWS is supported.
+// +kubebuilder:validation:Enum=AWS
+type KMSProviderType string
+
+const (
+	// AWSKMSProvider represents AWS Key Management Service
+	AWSKMSProvider KMSProviderType = "AWS"
+)
+
+// AWSKMSConfig defines the configuration specific to AWS KMS provider.
+type AWSKMSConfig struct {
+	// keyARN specifies the Amazon Resource Name (ARN) of the AWS KMS key used for encryption.
+	// The value must adhere to the format `arn:aws:kms:<region>:<account_id>:key/<key_id>`, where:
+	// - `<region>` is the AWS region consisting of lowercase letters and hyphens followed by a number.
+	// - `<account_id>` is a 12-digit numeric identifier for the AWS account.
+	// - `<key_id>` is a unique identifier for the KMS key, consisting of lowercase hexadecimal characters and hyphens.
+	//
+	// Example: arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012
+	//
+	// +kubebuilder:validation:Required
+	// +kubebuilder:validation:XValidation:rule="self.matches('^arn:aws:kms:[a-z0-9-]+:[0-9]{12}:key/[a-f0-9-]+$') && self.size() <= 128",message="keyARN must follow the format `arn:aws:kms:<region>:<account_id>:key/<key_id>`"
+	KeyARN string `json:"keyARN"`
+
+	// region specifies the AWS region where the KMS instance exists.
+	// The format is `<country-code>-<area>-<number>`, e.g., `us-east-1`.
+	// Only lowercase letters, hyphens, and numbers are allowed.
+	//
+	// +kubebuilder:validation:Required
+	// +kubebuilder:validation:XValidation:rule="self.matches('^[a-z]{2}-[a-z]+-[0-9]+$') && self.size() <= 64",message="region must be a valid AWS region format"
+	Region string `json:"region"`
+}
+```
+
+#### Graduation Path for Additional Providers
+
+**Tech Preview:**
+- `KMSProviderType` enum contains only `AWS`
+- `KMSConfig` union has only `aws` field
+
+**GA (Future Enhancement):**
+The enum and union will be extended to support additional providers:
+
+```go
+// +kubebuilder:validation:Enum=AWS;Vault;Thales
+type KMSProviderType string
+
+const (
+	AWSKMSProvider    KMSProviderType = "AWS"
+	VaultKMSProvider  KMSProviderType = "Vault"  // Added in GA
+	ThalesKMSProvider KMSProviderType = "Thales" // Added in GA
+)
+
+type KMSConfig struct {
+	Type   KMSProviderType  `json:"type"`
+	AWS    *AWSKMSConfig    `json:"aws,omitempty"`
+	Vault  *VaultKMSConfig  `json:"vault,omitempty"`  // Added in GA
+	Thales *ThalesKMSConfig `json:"thales,omitempty"` // Added in GA
+}
+```
+
+The provider-specific management details for Vault and Thales are documented in [KMS Plugin Management](kms-plugin-management.md).
+
+### Topology Considerations
+
+#### Hypershift / Hosted Control Planes
+
+The library-go encryption controllers run in the management cluster as part of the hosted control plane operators.
KMS plugin health checks must account for the split architecture where plugins may run in different contexts than the controllers. + +#### Standalone Clusters + +This enhancement applies to standalone clusters. The controllers run in the cluster-kube-apiserver-operator, cluster-openshift-apiserver-operator, and cluster-authentication-operator. + +#### Single-node Deployments or MicroShift + +Resource consumption impact is minimal - the controllers already exist and are extended with KMS-specific logic. Single-node deployments will see slightly increased CPU usage during key rotation detection (gRPC Status calls), but this is negligible. + +MicroShift may adopt this enhancement if KMS encryption is desired, but the configuration mechanism may differ (file-based vs API resource). + +### Implementation Details/Notes/Constraints + +This section documents the implementation in `openshift/library-go` PR #2045. + +#### Hash-Based Key Rotation Detection + +The controllers track two separate hashes: + +1. **kmsConfigHash** - Hash of the KMS configuration (provider type, endpoint, credentials reference) + - Used to detect when admin changes KMS configuration + - Stored in secret annotation: `encryption.apiserver.operator.openshift.io/kms-config-hash` + +2. 
**kmsKeyIDHash** - Combined hash of config + `key_id` from KMS plugin + - Used to detect when external KMS rotates the key + - Stored in secret `Data` field (base64 encoded) + - Computed as: `kms.ComputeKMSKeyHash(configHash, keyId)` + +**Why two hashes?** +- Config changes may not change the key (e.g., updating Vault address but same key) +- Key rotation may not change config (e.g., AWS KMS rotates key materials, same ARN) +- Separating them allows proper handling of each scenario + +#### Key Controller Changes + +Modified functions in `pkg/operator/encryption/controllers/key_controller.go`: + +```go +// New function type for getting KMS hashes (allows testing/mocking) +var kmsHashesGetterFunc func(ctx context.Context, kmsConfig *configv1.KMSConfig) (configHash string, keyIDHash []byte, err error) + +// Extended to return KMS config +func (c *keyController) getCurrentModeAndExternalReason(ctx context.Context) (state.Mode, string, *configv1.KMSConfig, error) + +// Extended to accept KMS hashes +func (c *keyController) generateKeySecret(keyID uint64, currentMode state.Mode, internalReason, externalReason string, kmsConfigHash string, kmsKeyIDHash []byte) (*corev1.Secret, error) + +// Extended to check KMS key hash changes +func needsNewKey(grKeys state.GroupResourceState, currentMode state.Mode, externalReason string, encryptedGRs []schema.GroupResource, kmsKeyHash []byte) (uint64, string, bool) +``` + +Key rotation logic for KMS mode: +```go +if currentMode == state.KMS { + if latestKey.Key.Secret != base64.StdEncoding.EncodeToString(kmsKeyHash) { + // Either config changed or key_id rotated + return latestKeyID, "kms-key-changed", true + } + // For KMS mode, NO time-based rotation + // KMS keys are rotated externally by the KMS system + return 0, "", false +} +``` + +#### Empty Encryption Key Secrets + +For KMS mode, encryption key secrets do NOT contain actual key material (the KEK lives in external KMS). 
Instead: +- `Data["secrets"]` contains the base64-encoded `kmsKeyIDHash` +- Annotations contain metadata: + - `encryption.apiserver.operator.openshift.io/mode: "kms"` + - `encryption.apiserver.operator.openshift.io/kms-config-hash: ""` + - `encryption.apiserver.operator.openshift.io/internal-reason: "kms-key-changed"` + +This allows reusing existing secret management logic while clearly indicating KMS mode. + +#### Migration Controller Compatibility + +No changes required to `migration_controller.go` - it already works with KMS because: +- It triggers on new encryption key secrets (regardless of mode) +- Migration uses the `EncryptionConfiguration` generated by `stateController` +- The actual encryption/decryption happens in kube-apiserver via KMS plugin + +Test coverage added in `migration_controller_test.go` for KMS rotation scenarios. + +#### Static vs Dynamic key_id (Tech Preview vs GA) + +**Tech Preview Implementation:** +```go +func defaultGetKMSHashes(ctx context.Context, kmsConfig *configv1.KMSConfig) (string, []byte, error) { + _, configHash, err := kms.GenerateUnixSocketPath(kmsConfig) + if err != nil { + return "", nil, fmt.Errorf("failed to generate KMS unix socket path: %w", err) + } + + // TODO: Call KMS plugin Status gRPC endpoint to get actual key_id + // For TP, use static key_id (AWS KMS doesn't rotate key_id anyway) + keyId := "static-key-id" + return configHash, kms.ComputeKMSKeyHash(configHash, keyId), nil +} +``` + +**GA Implementation (Future Work):** +- Call KMS plugin Status endpoint: `grpc.Dial(socketPath)` → `kmsv2.Status()` +- Extract `key_id` from response +- Implement retry/timeout logic for Status calls +- Handle plugin unavailability gracefully + +**AWS KMS Special Case:** +The AWS KMS plugin does not change `key_id` when AWS rotates key materials. This is a known limitation. 
For AWS: +- Rotation triggered only by config changes (new key ARN) +- Automatic AWS key rotation does NOT trigger OpenShift re-encryption +- Users must update APIServer config with new key ARN to rotate + +#### Controller Preconditions + +The existing `preconditionsFulfilled` mechanism needs extension to check KMS plugin health: + +**Current (unchanged):** +```go +type PreconditionFunc func(ctx context.Context) (bool, error) +``` + +**Future (GA):** +- Add KMS plugin health check precondition +- Call KMS plugin Status endpoint +- Verify plugin returns healthy status +- Block controller sync if plugin unavailable +- The health check implementation is provided by operators (see the "KMS Plugin Health Monitoring" section in [KMS Plugin Management](kms-plugin-management.md)) + +### Risks and Mitigations + +#### Risk: KMS Plugin Unavailable During Controller Sync + +**Mitigation:** +- Preconditions check plugin health before operations +- Controllers gracefully skip sync if plugin down +- Existing encryption continues working (kube-apiserver caches DEKs) + +#### Risk: AWS KMS key_id Limitation + +**Mitigation:** +- Document this limitation clearly +- Provide guidance: users must update APIServer config to rotate +- Consider future enhancement to poll AWS KMS directly + +#### Risk: Performance Impact of Status Polling + +**Mitigation:** +- Status calls are cheap (gRPC local Unix socket) +- Controllers already have rate limiting +- Cache `key_id` between syncs (only call Status if cache expired) + +#### Risk: etcd Backup Restoration Without KMS Key Access + +**Risk:** +When restoring an etcd backup, the cluster cannot decrypt data if the KMS key used during encryption is unavailable (deleted, different KMS instance, expired credentials, or key rotated past retention period). 
+ +**Impact:** +- **Data loss:** Resources encrypted with unavailable keys become permanently unrecoverable +- **Cluster inoperable:** API server may fail to start if critical resources cannot be decrypted +- **Partial recovery:** Only resources encrypted with still-available keys can be restored + +**Mitigations:** + +1. **KMS Key Deletion Grace Periods:** + - Configure KMS to use deletion grace periods (e.g., AWS KMS 7-30 day pending deletion) + - Ensure KMS keys are not permanently deleted until backup retention expires + - Document minimum grace period = backup retention period + +2. **Backup Procedure Documentation:** + - Document KMS key dependencies in backup/restore runbooks + - Include KMS key ID and configuration in backup metadata + - Test restore procedures regularly to verify KMS key availability + +3. **KMS Key Backup/Recovery:** + - For on-premises KMS (Vault, Thales): Ensure KMS key material is backed up separately + - For cloud KMS (AWS): Understand key recovery limitations (AWS does not export key material) + - Consider key escrow strategies for critical environments (GA consideration) + +4. **Cross-Region/Cross-Account Scenarios:** + - Document KMS key access requirements for disaster recovery scenarios + - Ensure backup restoration accounts/regions have access to original KMS keys + - Consider multi-region key replication where supported by KMS provider + +5. **Monitoring and Alerts:** + - Alert on KMS key pending deletion (detect before permanent deletion) + - Alert on KMS key access failures during backup operations + - Track KMS key retention vs. backup retention alignment + +6. 
**User Documentation:** + - Clearly document in openshift-docs: "etcd backups depend on KMS key availability" + - Provide restore procedures that verify KMS key access before attempting restore + - Warn about consequences of KMS key deletion + +**Testing Requirements:** +- E2E tests must validate backup/restore with KMS encryption +- Include failure scenarios (KMS key deleted, credentials expired) +- Document expected behavior and recovery procedures + +### Drawbacks + +- Adds complexity to encryption controllers for KMS-specific logic +- AWS KMS requires config changes for rotation (not automatic) +- Dependency on KMS plugin health for controller operations + +## Alternatives (Not Implemented) + +### Alternative: Separate KMS-Specific Controllers + +Instead of extending existing controllers, create new KMS-only controllers. + +**Why not chosen:** +- Code duplication (migration logic, state management) +- User confusion (different controllers for different encryption types) +- More operational burden (additional monitoring, alerts) + +### Alternative: Time-Based Rotation for KMS + +Continue weekly rotation even with KMS, generate new secrets periodically. + +**Why not chosen:** +- KMS keys are rotated externally, not by OpenShift +- Unnecessary re-encryption burden +- Doesn't align with KMS operational model + +## Open Questions + +None - implementation is complete in PR #2045. 
+ +## Test Plan + +**Unit Tests:** (Already in PR #2045) +- `key_controller_test.go`: KMS key creation, rotation detection, hash changes +- `migration_controller_test.go`: KMS migration scenarios + +**Integration Tests:** (Future work) +- End-to-end KMS encryption workflow +- Key rotation with real KMS plugin +- Migration between encryption modes (aescbc → KMS, KMS → identity) + +**E2E Tests:** (Future work) +- Full cluster with KMS encryption enabled +- Trigger external KMS key rotation +- Verify data re-encryption completes +- Performance testing (time to migrate N secrets) + +## Graduation Criteria + +### Tech Preview → GA + +- **Dynamic key_id fetching:** Call KMS plugin Status endpoint (not static) +- **Health check preconditions:** Block operations when plugin unhealthy +- **AWS KMS workaround:** Document or implement solution for non-rotating key_id +- **Performance validation:** Ensure migration completes within SLOs +- **Comprehensive test coverage:** Integration and E2E tests passing +- **Production validation:** Run in multiple environments successfully + +## Upgrade / Downgrade Strategy + +**Upgrade:** +- PR #2045 code lands in library-go +- Operators import updated library-go version +- No user action required (controllers remain backward compatible) +- Existing aescbc/aesgcm encryption unaffected + +**Downgrade:** +- If KMS encryption enabled, downgrade requires switching back to aescbc +- KMS-specific code paths are new, no risk to existing encryption + +## Version Skew Strategy + +The encryption controllers run in operator pods, not on nodes. Version skew concerns: + +- **kube-apiserver version:** Must support KMS v2 API (Kubernetes 1.27+) +- **library-go version:** Operators must use same library-go version +- **KMS plugin version:** Controllers don't directly interact with plugins (operators do) + +No special version skew handling required. 
+ +## Operational Aspects of API Extensions + +This enhancement extends the `config.openshift.io/v1/APIServer` resource with new fields for KMS configuration. This is not a CRD, webhook, or aggregated API server - it's an extension to an existing core OpenShift API resource. + +### Service Level Indicators (SLIs) + +Administrators can monitor KMS encryption health through: + +**Operator Conditions:** +- `cluster-kube-apiserver-operator` conditions: + - `EncryptionControllerDegraded=False` - Controllers are functioning + - `EncryptionMigrationControllerProgressing` - Migration status (key rotation) + - `KMSPluginDegraded=False` - KMS plugin is healthy (see [KMS Plugin Management](kms-plugin-management.md)) + +**Metrics:** +- `apiserver_storage_transformation_operations_total` - Encryption/decryption operations +- `apiserver_storage_transformation_duration_seconds` - Latency of encryption operations +- KMS plugin health metrics (see [KMS Plugin Management](kms-plugin-management.md)) + +### Impact on Existing SLIs + +**API Availability:** +- KMS encryption adds latency to resource creation/updates (external KMS call required) +- Expected impact: +10-50ms per operation (depends on KMS latency) +- Mitigation: DEK caching in kube-apiserver reduces calls to KMS + +**API Throughput:** +- Minimal impact on read operations (decryption uses cached DEKs) +- Write operations may see slight throughput reduction due to KMS latency +- Expected: <5% throughput reduction under normal conditions + +**Scalability:** +- KMS configuration is cluster-scoped (single `APIServer` resource) +- Expected use case: 1 KMS configuration per cluster +- No impact on scalability limits + +### Failure Modes + +**KMS Plugin Unavailable:** +- **Impact:** New resource creation fails, existing resources remain readable (DEKs cached) +- **Detection:** `KMSPluginDegraded=True` condition, alerts fire +- **Recovery:** Automatic (plugin restarts), or manual intervention (see Support Procedures) +- **Affected 
Teams:** API Server team, etcd team + +**KMS Service Unavailable (External):** +- **Impact:** New DEK generation fails, encryption operations fail +- **Detection:** Increased encryption operation failures, KMS plugin health checks fail +- **Recovery:** Depends on external KMS (AWS, Vault, Thales) restoration +- **Affected Teams:** Customer infrastructure team + +**Invalid KMS Configuration:** +- **Impact:** KMS plugin fails to start, encryption unavailable +- **Detection:** `KMSPluginDegraded=True`, plugin container crash loops +- **Recovery:** Fix APIServer configuration (credentials, endpoint, key ID) +- **Affected Teams:** Customer infrastructure team, API Server team + +**Key Rotation Stuck:** +- **Impact:** Migration controller unable to re-encrypt resources +- **Detection:** `EncryptionMigrationControllerProgressing=True` for extended period +- **Recovery:** Check migration controller logs, verify KMS health +- **Affected Teams:** API Server team, etcd team + +**etcd Backup Restoration Without KMS Access:** +- **Impact:** Restored cluster cannot decrypt etcd data if KMS key is unavailable or deleted +- **Detection:** API server fails to start or resource reads return decryption errors after restore +- **Recovery:** + - **Best case:** Restore KMS key from backup/recovery (if KMS supports key recovery within grace period) + - **Worst case:** Data loss - resources encrypted with lost key are unrecoverable + - **Prevention:** Document KMS key dependencies in backup procedures, test restore procedures +- **Affected Teams:** etcd team, Customer infrastructure team, API Server team +- **Note:** This is why KMS key deletion grace periods are critical (see Risks and Mitigations) + +### Measurement and Monitoring + +**How to measure impact:** +- Prometheus queries for encryption operation latency percentiles (p50, p95, p99) +- Compare pre/post KMS enablement metrics +- Load testing before GA to establish SLOs + +**When to measure:** +- Every release by QE (automated 
tests)
- Performance team review during GA graduation
- Customer escalations (if performance issues reported)

**Who measures:**
- **Dev/QE:** Automated CI tests, pre-release validation
- **Performance Team:** Load testing, SLO validation for GA
- **Site Reliability:** Production monitoring, SLI tracking

## Support Procedures

### Detecting KMS Rotation Issues

**Symptoms:**
- `EncryptionMigrationControllerProgressing` condition stuck at `True`
- Events in the operator namespace: "migration in progress for KMS key rotation"
- No new encryption key secret created despite KMS key rotation

**Diagnosis:**
```bash
# Check if the key controller detected rotation
oc get secrets -n openshift-config-managed -l encryption.apiserver.operator.openshift.io/component=encryption-key

# Check controller logs
oc logs -n openshift-kube-apiserver-operator deployment/kube-apiserver-operator | grep -i kms

# Verify the KMS plugin Status response (if dynamic key_id is implemented)
# Requires exec into an API server pod and calling the plugin gRPC endpoint
```

**Resolution:**
- If key_id is not changing: Update the KMS configuration in the APIServer config
- If the plugin is unhealthy: Check plugin pod logs (see [KMS Plugin Management](kms-plugin-management.md))
- If migration is stuck: Check migration controller logs

### Disabling KMS Encryption

To switch from KMS back to local encryption:
1. Update the APIServer config: `spec.encryption.type: "aescbc"`
2. Wait for migration to complete
3. KMS plugin pods can be removed (handled by operators)

**Consequences:**
- Data is re-encrypted with local AES keys
- Migration takes time proportional to data size
- The cluster remains available during migration

### etcd Backup and Restore with KMS Encryption

**Before Taking Backup:**
1. Document the current KMS configuration:
   ```bash
   oc get apiserver cluster -o json | jq '.spec.encryption.kms'
   ```
2. Record the KMS key ID/ARN and provider details
3. 
Verify the KMS key will remain available for the backup retention period
4. Include the KMS configuration in backup metadata

**Before Restoring Backup:**

1. **Verify KMS key availability:**
   ```bash
   # For AWS KMS - check key status
   aws kms describe-key --key-id <key-id>
   # Key state should be "Enabled", not "PendingDeletion"

   # For Vault - verify the key exists and is accessible
   vault read transit/keys/<key-name>
   ```

2. **Verify KMS credentials:**
   - Ensure the restored cluster has access to the same KMS instance
   - Verify IAM roles/credentials are valid for KMS access
   - Test that the KMS plugin can authenticate and call the Status endpoint

3. **Restore the etcd backup:**
   - Follow standard etcd restore procedures
   - Ensure KMS plugin pods start successfully
   - Verify the API server can decrypt resources

**Troubleshooting Restore Failures:**

**Symptom:** API server fails to start after restore with KMS decryption errors

**Diagnosis:**
```bash
# Check API server logs for decryption errors
oc logs -n openshift-kube-apiserver kube-apiserver-<node-name> | grep -i "decrypt\|kms"

# Check KMS plugin health
oc get pods -n openshift-kube-apiserver -l app=kms-plugin

# Verify KMS key accessibility
# (provider-specific commands as shown above)
```

**Resolution:**
- **If the KMS key was deleted:** Check if it is within the grace period, and undelete it if possible
- **If the key expired/rotated:** Restore the KMS key backup (Vault/Thales) or contact the KMS admin
- **If credentials are invalid:** Update credentials, restart KMS plugin pods
- **If unrecoverable:** Data encrypted with the lost key is permanently lost (see Risks and Mitigations)

**Critical Warning:**
Deleting a KMS key used for encryption **will make etcd backups unrestorable**. Always ensure KMS key retention period ≥ backup retention period.

## Infrastructure Needed

None - this enhancement extends existing library-go code.
diff --git a/enhancements/kube-apiserver/kms-migration-recovery.md b/enhancements/kube-apiserver/kms-migration-recovery.md new file mode 100644 index 0000000000..a803c994a8 --- /dev/null +++ b/enhancements/kube-apiserver/kms-migration-recovery.md @@ -0,0 +1,248 @@ +--- +title: kms-migration-recovery +authors: + - "@ardaguclu" + - "@dgrisonnet" + - "@flavianmissi" +reviewers: + - "@ibihim" + - "@sjenning" + - "@tkashem" + - "@derekwaynecarr" +approvers: + - "@sjenning" +api-approvers: + - "None" +creation-date: 2025-01-28 +last-updated: 2025-01-28 +tracking-link: + - "https://issues.redhat.com/browse/OCPSTRAT-1638" # GA feature only +see-also: + - "enhancements/kube-apiserver/kms-encryption-foundations.md" + - "enhancements/kube-apiserver/kms-plugin-management.md" + - "enhancements/kube-apiserver/encrypting-data-at-datastore-layer.md" + - "enhancements/etcd/storage-migration-for-etcd-encryption.md" +replaces: + - "" +superseded-by: + - "" +--- + +# KMS Migration and Disaster Recovery + +## Summary + +**This enhancement is targeted for GA only and is not part of Tech Preview.** + +Provide comprehensive migration and disaster recovery capabilities for KMS encryption in OpenShift. This includes migrating between different KMS providers, recovering from KMS key loss scenarios, handling temporary KMS outages, and providing operational guidance for complex migration scenarios. + +## Motivation + +While basic KMS encryption and key rotation are covered in Tech Preview (see [KMS Encryption Foundations](kms-encryption-foundations.md) and [KMS Plugin Management](kms-plugin-management.md)), production deployments require robust migration and recovery capabilities. 
Cluster administrators need to: + +- Migrate encrypted data between different KMS providers (e.g., AWS KMS → Vault, Vault → Thales) +- Recover from partial KMS failures (temporary outages, key deletion, credential expiration) +- Transition from local encryption (aescbc/aesgcm) to KMS and vice versa +- Handle cross-region or cross-account KMS migrations +- Understand backup and restore implications with KMS encryption + +These scenarios are complex and require production validation before being supported. Tech Preview will gather operational experience with basic KMS functionality, and GA will build upon that foundation to provide advanced migration and recovery features. + +### User Stories + +* As a cluster admin, I want to migrate from AWS KMS to HashiCorp Vault without cluster downtime, so that I can change KMS providers based on my organization's policies +* As a cluster admin, I want automated recovery from temporary KMS outages, so that my cluster remains available during transient network or KMS service issues +* As a cluster admin, I want clear documentation on recovering from KMS key loss, so that I understand the risks and available options before they occur +* As a cluster admin, I want to migrate my cluster's encryption from local aescbc to KMS, so that I can improve security without disrupting workloads +* As a cluster admin, I want to migrate KMS encryption across AWS regions, so that I can handle disaster recovery scenarios + +### Goals + +* Support seamless migration between KMS providers +* Provide disaster recovery procedures for KMS key loss +* Handle temporary KMS outages gracefully (caching, degraded mode) +* Document and test all supported migration paths +* Provide monitoring and alerting for migration health +* Create runbooks for common recovery scenarios +* Define SLOs for migration completion time + +### Non-Goals + +* Automatic recovery from permanent KMS key deletion (data loss is expected) +* Migration between incompatible KMS versions 
(e.g., KMS v1 → KMS v2) +* Cross-cluster migration (backup/restore is separate feature) +* Zero-downtime guarantees for all migration scenarios (some may require brief unavailability) + +## Proposal + +**This section will be completed during GA planning, building on Tech Preview operational experience.** + +Areas to be addressed: + +1. **Migration Framework** + - Automated migration between KMS providers + - Pre-flight validation (target KMS reachable, credentials valid) + - Progress tracking and rollback capabilities + - Two-plugin parallel operation during migration + +2. **Disaster Recovery** + - KMS key loss detection mechanisms + - Partial recovery strategies (cached DEKs, grace periods) + - Complete data loss scenarios (when recovery impossible) + - Key escrow considerations (future enhancement) + +3. **Operational Procedures** + - Migration runbooks per scenario + - Health checks and validation scripts + - Monitoring and alerting recommendations + - Performance optimization for large-scale migrations + +4. 
**Testing Strategy** + - Migration matrix (all provider combinations) + - Failure injection testing + - Scale testing (time to migrate N resources) + - Disaster recovery drills + +## Scope of Work (GA Only) + +### Supported Migration Paths + +The following migration paths will be documented and tested: + +**Between KMS Providers:** +- AWS KMS → HashiCorp Vault +- HashiCorp Vault → AWS KMS +- AWS KMS → Thales HSM +- Thales HSM → HashiCorp Vault +- (All bidirectional combinations) + +**Between Encryption Types:** +- aescbc → AWS KMS +- aescbc → HashiCorp Vault +- aescbc → Thales HSM +- AWS KMS → aescbc (downgrade scenario) +- HashiCorp Vault → identity (disable encryption) + +**Cross-Region/Account (AWS Specific):** +- AWS KMS key in us-east-1 → us-west-2 +- AWS KMS key in account A → account B + +### Disaster Recovery Scenarios + +**Temporary KMS Outages:** +- Network partition (cluster cannot reach KMS) +- KMS service degradation (slow responses, timeouts) +- Credential expiration (temporary auth failure) +- **Mitigation**: In-memory DEK caching, degraded mode operation, automatic retry + +**Permanent Key Loss:** +- KMS key accidentally deleted +- KMS account/subscription terminated +- Encryption key permanently corrupted +- **Impact**: Data encrypted with lost key is unrecoverable +- **Mitigation**: Key deletion grace periods, backup strategies, monitoring + +**Partial Failures:** +- Some resources encrypted with lost key, others still accessible +- Mixed-version API servers during upgrade +- Plugin crashes during migration +- **Recovery**: Identify affected resources, manual intervention, partial restoration + +### Implementation Details + +**To be designed during GA phase. 
Key considerations:** + +- How to run two KMS plugins simultaneously during migration (different socket paths) +- How to track migration progress per KMS provider +- When to clean up old KMS plugin after migration completes +- How to handle rollback if migration fails mid-way +- Performance optimizations (parallel migration, batching) + +### Risks and Mitigations + +**To be assessed during GA planning based on Tech Preview learnings.** + +## Alternatives (Not Implemented) + +**Alternative: Include in Tech Preview** + +We could attempt to support migration in Tech Preview. + +**Why deferred to GA:** +- Migration complexity requires production validation first +- Need operational experience with single-provider deployments +- Edge cases and failure modes not yet fully understood +- Risk of committing to unsupportable migration paths +- GA allows iteration based on real-world Tech Preview feedback + +## Open Questions + +1. Should we support "live" migration (both KMS active) or "offline" migration (brief unavailability)? +2. How do we handle very large clusters (millions of secrets)? Multi-day migrations acceptable? +3. Key escrow (storing encrypted DEKs for recovery) - security vs recoverability tradeoff? +4. Should we provide automated migration tooling or just documented procedures? +5. What SLOs should we commit to for migration completion time? + +## Test Plan + +**To be defined during GA planning.** + +Areas requiring test coverage: +- All supported migration paths (automated testing) +- Failure injection at each migration phase +- Scale testing with realistic data volumes +- Disaster recovery drill procedures +- Performance benchmarking + +## Graduation Criteria + +This enhancement is **GA-only** and does not have a Tech Preview phase. 
+ +**Prerequisites for GA:** +- [KMS Encryption Foundations](kms-encryption-foundations.md) is GA +- [KMS Plugin Management](kms-plugin-management.md) is GA +- At least 2 KMS providers fully supported and production-validated +- Operational experience gathered from Tech Preview deployments + +**GA Acceptance Criteria:** +- All documented migration paths tested and validated +- Disaster recovery runbooks created and tested +- Monitoring and alerting defined for migration health +- SLOs defined for migration completion time +- User documentation in openshift-docs +- Support team trained on recovery procedures + +## Upgrade / Downgrade Strategy + +This enhancement provides the upgrade/downgrade strategy for KMS encryption itself, so this section will be critical in the full proposal. + +## Version Skew Strategy + +To be defined during GA planning. + +## Operational Aspects of API Extensions + +No new API extensions - uses existing APIServer config from [KMS Plugin Management](kms-plugin-management.md). + +## Support Procedures + +This enhancement IS the support procedures for KMS migration and recovery. This section will be extensive in the full proposal. + +## Infrastructure Needed + +For testing migration and disaster recovery scenarios: +- Multiple KMS provider instances (AWS KMS, Vault, Thales HSM) +- Large-scale test clusters (simulate production data volumes) +- Chaos engineering infrastructure (failure injection) +- Automated testing framework for migration paths + +--- + +## Note to Reviewers + +This is a placeholder enhancement to document the scope of GA work. The full proposal will be developed after Tech Preview has been released and operational experience has been gathered. + +**Do not block Tech Preview on this enhancement.** It exists to: +1. Clearly scope what is NOT in Tech Preview +2. Set expectations for GA requirements +3. 
Provide a tracking document for future work diff --git a/enhancements/kube-apiserver/kms-plugin-management.md b/enhancements/kube-apiserver/kms-plugin-management.md new file mode 100644 index 0000000000..85af93ffbc --- /dev/null +++ b/enhancements/kube-apiserver/kms-plugin-management.md @@ -0,0 +1,504 @@ +--- +title: kms-plugin-management +authors: + - "@ardaguclu" + - "@dgrisonnet" + - "@flavianmissi" +reviewers: + - "@ibihim" + - "@sjenning" + - "@tkashem" + - "@derekwaynecarr" +approvers: + - "@sjenning" +api-approvers: + - "@JoelSpeed" +creation-date: 2025-01-28 +last-updated: 2025-01-28 +tracking-link: + - "https://issues.redhat.com/browse/OCPSTRAT-108" # TP feature + - "https://issues.redhat.com/browse/OCPSTRAT-1638" # GA feature +see-also: + - "enhancements/kube-apiserver/kms-encryption-foundations.md" + - "enhancements/kube-apiserver/encrypting-data-at-datastore-layer.md" + - "enhancements/etcd/storage-migration-for-etcd-encryption.md" +replaces: + - "https://github.com/openshift/enhancements/pull/1682" +superseded-by: + - "" +--- + +# KMS Plugin Lifecycle Management + +## Summary + +Enable OpenShift to automatically manage the lifecycle of KMS (Key Management Service) plugins across multiple API servers. This enhancement provides a user-configurable interface to deploy, configure, and monitor KMS plugins as sidecar containers alongside kube-apiserver, openshift-apiserver, and oauth-apiserver pods. Support for multiple KMS providers (AWS KMS, HashiCorp Vault, Thales HSM) is included with provider-specific authentication and configuration. + +## Motivation + +KMS encryption requires KMS plugin pods to bridge communication between the kube-apiserver and external KMS providers. Managing these plugins manually is operationally complex and error-prone. OpenShift should handle plugin deployment, authentication, health monitoring, and lifecycle management automatically on behalf of users. 
+ +Different KMS providers have vastly different authentication models, deployment requirements, and operational characteristics. This enhancement provides a unified plugin management framework while accommodating provider-specific needs. + +### User Stories + +* As a cluster admin, I want to enable AWS KMS encryption by simply providing a key ARN in the APIServer config, so that OpenShift automatically deploys and manages the AWS KMS plugin for me +* As a cluster admin using HashiCorp Vault, I want OpenShift to handle Vault authentication (AppRole for TP, certificate-based for GA) and plugin deployment, so that I don't need to manually manage credentials or plugin containers +* As a cluster admin, I want to switch from one KMS provider to another (e.g., AWS KMS to Vault) by updating the APIServer configuration, so that OpenShift handles the plugin transition and data migration automatically +* As a cluster admin, I want to monitor KMS plugin health through standard OpenShift operators and alerts, so that I can detect and respond to KMS-related issues + +### Goals + +* Automatic KMS plugin deployment as sidecar containers in API server pods +* Support for multiple KMS providers with provider-specific configurations +* Credential management for KMS plugin authentication (IAM, AppRole, Cert, PKCS#11) +* Plugin health monitoring and integration with operator conditions +* Reactivity to configuration changes (automatic plugin updates) +* Support for Tech Preview (limited providers) and GA (full provider support) graduation + +### Non-Goals + +* Direct support for hardware security modules (HSMs) - supported via KMS plugins (Thales) +* KMS provider deployment or management (users manage their own AWS KMS, Vault, etc.) 
+* Encryption controller logic for key rotation (see [KMS Encryption Foundations](kms-encryption-foundations.md)) +* Migration and recovery procedures (deferred to [KMS Migration and Recovery](kms-migration-recovery.md) for GA) +* Custom or user-provided KMS plugins (only officially supported providers) + +## Proposal + +Extend OpenShift's API server operators (kube-apiserver-operator, openshift-apiserver-operator, authentication-operator) to automatically inject KMS plugin sidecar containers when KMS encryption is configured. The plugin management framework is provider-agnostic at the infrastructure level, with provider-specific implementations for authentication, configuration, and deployment. + +**Supported KMS Providers:** + +| Provider | Tech Preview | GA | Primary Use Case | +|----------|--------------|-----|------------------| +| **AWS KMS** | ✅ Full support | ✅ Production-ready | Cloud-native AWS deployments | +| **HashiCorp Vault** | ⚠️ Beta (if Vault plugin available) | ✅ Production-ready | On-premises, multi-cloud, centralized KMS | +| **Thales CipherTrust** | ❌ Not supported | ✅ Production-ready | HSM integration, regulatory compliance | + +### Workflow Description + +#### Roles + +**cluster admin** is a human user responsible for configuring and maintaining the cluster. + +**KMS** is the external Key Management Service (AWS KMS, HashiCorp Vault, Thales HSM) responsible for storing and rotating encryption keys. + +**KMS Plugin** is a gRPC service implementing the Kubernetes KMS v2 API, deployed as a sidecar container. + +**API Server Operator** is the OpenShift operator (kube-apiserver-operator, openshift-apiserver-operator, or authentication-operator) responsible for managing API server deployments. + +#### Initial KMS Configuration (AWS KMS Example) + +1. The cluster admin creates an encryption key (KEK) in AWS KMS +2. 
The cluster admin grants the OpenShift cluster access to the KMS key:
   - For kube-apiserver: Updates the master node IAM role with KMS permissions
   - For openshift/oauth-apiserver: Ensures the Cloud Credential Operator (CCO) can provision credentials
3. The cluster admin updates the APIServer configuration:
   ```yaml
   apiVersion: config.openshift.io/v1
   kind: APIServer
   metadata:
     name: cluster
   spec:
     encryption:
       type: KMS
       kms:
         type: AWS
         aws:
           keyARN: arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012
           region: us-east-1
   ```
4. The API server operators detect the configuration change
5. The operators inject AWS KMS plugin sidecar containers into API server pods
6. The KMS plugins start and communicate with AWS KMS
7. Encryption controllers detect the new KMS configuration and begin encryption (see [KMS Encryption Foundations](kms-encryption-foundations.md))
8. The cluster admin observes progress via `clusteroperator/kube-apiserver` conditions

#### Vault KMS Configuration (Tech Preview - AppRole)

1. The cluster admin deploys HashiCorp Vault (external to OpenShift)
2. The cluster admin creates an encryption key in Vault
3. The cluster admin configures AppRole authentication in Vault
4. The cluster admin creates Kubernetes secrets containing the AppRole credentials:
   ```bash
   oc create secret generic vault-kms-credentials -n openshift-kube-apiserver \
     --from-literal=role-id=<role-id> \
     --from-literal=secret-id=<secret-id>
   ```
5. The cluster admin updates the APIServer configuration:
   ```yaml
   spec:
     encryption:
       type: KMS
       kms:
         type: Vault
         vault:
           vaultAddress: https://vault.example.com:8200
           keyPath: transit/keys/openshift-encryption
           namespace: openshift # Vault Enterprise namespace
           authMethod: AppRole
           credentialsSecret:
             name: vault-kms-credentials
   ```
6. The operators inject Vault KMS plugin sidecars with the AppRole credentials
7. 
Plugins authenticate to Vault and enable encryption + +**Note:** AppRole is for Tech Preview only. GA will require certificate-based authentication (see Graduation Criteria). + +#### Vault KMS Configuration (GA - Certificate Auth) + +1. The cluster admin deploys Vault and configures PKI +2. The cluster admin configures initial AppRole credentials (bootstrap only) +3. OpenShift operators inject Vault KMS plugin sidecars +4. The KMS plugin uses AppRole to authenticate to Vault (first time only) +5. The plugin requests a client certificate from Vault PKI +6. The plugin stores the certificate and switches to certificate-based auth +7. The plugin automatically rotates certificates before expiration +8. AppRole credentials can be revoked after certificate issuance + +This provides stronger security than AppRole-only while solving the bootstrap problem. + +### API Extensions + +This enhancement uses the KMS API types defined in [KMS Encryption Foundations](kms-encryption-foundations.md), which provides the foundational API for KMS encryption. 
+ +**For Tech Preview:** +- [KMS Encryption Foundations](kms-encryption-foundations.md) defines `KMSConfig` with only AWS support (`KMSProviderType` enum contains only `AWS`) +- [KMS Encryption Foundations](kms-encryption-foundations.md) defines `AWSKMSConfig` for AWS-specific configuration +- This enhancement focuses on managing the AWS KMS plugin lifecycle using that API + +**For GA:** +- [KMS Encryption Foundations](kms-encryption-foundations.md) will extend the `KMSProviderType` enum to include `Vault` and `Thales` +- [KMS Encryption Foundations](kms-encryption-foundations.md) will add `VaultKMSConfig` and `ThalesKMSConfig` types +- This enhancement will document the provider-specific plugin management details + +**Example API usage (defined in [KMS Encryption Foundations](kms-encryption-foundations.md)):** + +```yaml +apiVersion: config.openshift.io/v1 +kind: APIServer +metadata: + name: cluster +spec: + encryption: + type: KMS + kms: + type: AWS + aws: + keyARN: arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012 + region: us-east-1 +``` + +For the complete API definitions, see the "API Extensions" section in [KMS Encryption Foundations](kms-encryption-foundations.md). + +#### Provider-Specific Configuration Details (For Reference) + +This section provides examples of provider-specific configurations that will be supported. The actual API types are defined in [KMS Encryption Foundations](kms-encryption-foundations.md). 
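To give a concrete feel for the two required AWS fields listed below (`keyARN`, `region`), here is a minimal validation sketch. The function name and exact rules are assumptions for illustration only; the authoritative validation is defined on the API types in [KMS Encryption Foundations](kms-encryption-foundations.md).

```go
package main

import (
	"fmt"
	"strings"
)

// validateAWSKMSConfig sanity-checks the two required AWS fields.
// The rules here are illustrative; real validation lives with the API types.
func validateAWSKMSConfig(keyARN, region string) error {
	if keyARN == "" || region == "" {
		return fmt.Errorf("keyARN and region are required")
	}
	// Expected shape: arn:aws:kms:<region>:<account-id>:key/<key-id>
	parts := strings.SplitN(keyARN, ":", 6)
	if len(parts) != 6 || parts[0] != "arn" || parts[2] != "kms" {
		return fmt.Errorf("%q is not a KMS key ARN", keyARN)
	}
	if parts[3] != region {
		return fmt.Errorf("keyARN region %q does not match configured region %q", parts[3], region)
	}
	if !strings.HasPrefix(parts[5], "key/") {
		return fmt.Errorf("%q does not reference a KMS key", keyARN)
	}
	return nil
}

func main() {
	err := validateAWSKMSConfig(
		"arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012",
		"us-east-1",
	)
	fmt.Println("valid:", err == nil) // valid: true
}
```

The cross-check between the ARN's region segment and the configured `region` field catches a common misconfiguration where the key lives in a different region than the one the plugin is pointed at.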
+ +**AWS KMS Configuration (Tech Preview - Supported):** +- `keyARN`: AWS KMS key ARN (required) +- `region`: AWS region (required) + +**HashiCorp Vault Configuration (GA - Not in Tech Preview):** +- `vaultAddress`: Vault server URL +- `keyPath`: Path to encryption key in Vault +- `namespace`: Vault namespace (Enterprise only) +- `authMethod`: Authentication method (AppRole or Cert) +- `credentialsSecret`: Reference to secret containing auth credentials + +**Thales CipherTrust Configuration (GA - Not in Tech Preview):** +- `p11LibraryPath`: Path to PKCS#11 library +- `keyLabel`: HSM key label +- `kekID`: Key Encryption Key ID +- `algorithm`: Encryption algorithm (rsa-oaep, aes-gcm) +- `credentialsSecret`: Reference to secret containing PKCS#11 PIN + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +In Hypershift, the control plane runs in a management cluster while workloads run in a guest cluster. KMS plugin management differs: + +- **kube-apiserver**: Runs in management cluster, plugins deployed there +- **Credential access**: Management cluster must have access to KMS (network, IAM) +- **Plugin images**: Pulled in management cluster registry + +No fundamental blockers, but credential provisioning may require manual setup in management cluster. + +#### Standalone Clusters + +This is the primary target for Tech Preview. All API servers and plugins run in the same cluster. + +#### Single-node Deployments or MicroShift + +Single-node deployments are supported. Resource consumption: +- Each API server pod adds one KMS plugin sidecar (~50MB memory, minimal CPU) +- Total: 3 sidecars for 3 API servers (kube-apiserver, openshift-apiserver, oauth-apiserver) + +MicroShift may adopt this enhancement but will likely use file-based configuration instead of APIServer CR. + +### Implementation Details/Notes/Constraints + +#### KMS Plugin Deployment Architecture + +KMS plugins are deployed as **sidecar containers** in API server pods. 
Each operator manages its own sidecar injection: + +| API Server | Deployment Type | hostNetwork | Socket Volume | Credential Source | +|------------|-----------------|-------------|---------------|-------------------| +| kube-apiserver | Static Pod | ✅ true | hostPath | IAM (IMDS) or Secret | +| openshift-apiserver | Deployment | ❌ false | emptyDir | Secret (CCO) | +| oauth-apiserver | Deployment | ❌ false | emptyDir | Secret (CCO) | + +#### Sidecar Injection Mechanism + +TODO: Document sidecar injection implementation per operator + +#### Provider-Specific Authentication + +TODO: Document authentication mechanisms for each provider + +#### Static Pod Limitations and Vault Auth + +**Critical constraint**: Static pods (kube-apiserver) **cannot reference ServiceAccount objects** (Kubernetes limitation). + +This has major implications for Vault authentication: + +**Vault Kubernetes Auth Method - NOT VIABLE:** +- Requires ServiceAccount tokens mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token` +- Static pods cannot have ServiceAccount tokens +- **Cannot be used for kube-apiserver** +- **CAN be used for openshift-apiserver and oauth-apiserver** (Deployments) + +**Vault JWT Auth Method - NOT VIABLE:** +- Also requires ServiceAccount tokens +- Same static pod limitation applies + +**Vault AppRole Auth Method - VIABLE (Tech Preview):** +- Uses static credentials (RoleID + SecretID) +- Can be mounted as Secret volumes in static pods +- **Security concerns**: Shared secret, manual rotation, bootstrap problem +- Acceptable for Tech Preview with documented limitations + +**Vault Cert Auth Method - VIABLE (GA):** +- Uses TLS client certificates for authentication +- Certificates can be stored as files and mounted in static pods +- **Bootstrap flow**: AppRole → get cert from Vault PKI → use cert auth → rotate cert automatically +- Solves security concerns of AppRole-only +- **Recommended for GA** + +#### AWS KMS Plugin Configuration + +TODO: Migrate from current 
enhancement + +#### Vault KMS Plugin Configuration + +TODO: Document Vault plugin deployment and auth + +#### Thales KMS Plugin Configuration + +TODO: Document Thales/HSM plugin requirements + +#### KMS Plugin Health Monitoring + +API server operators are responsible for monitoring KMS plugin health and surfacing plugin status to cluster administrators and to encryption controllers. + +**Health Check Implementation:** + +Each operator implements health checks for its KMS plugin sidecar: + +1. **gRPC Status Endpoint Calls** + - Periodically call the KMS plugin's Status gRPC endpoint (defined by KMS v2 API) + - Parse the response to extract: + - Plugin health status (healthy/unhealthy) + - Current `key_id` (used by encryption controllers for rotation detection) + - Plugin version and other metadata + +2. **Health Check Frequency** + - Default: Poll Status endpoint every 30 seconds + - Configurable via operator environment variable (for tuning) + - Exponential backoff on repeated failures + +3. 
**Failure Detection**
   - Plugin process not running (container crashed)
   - Status endpoint unreachable (socket connection failed)
   - Status endpoint returns an error response
   - Status call times out (after 10 seconds)

**Operator Condition Integration:**

Plugin health is reflected in operator conditions visible to administrators:

```yaml
status:
  conditions:
  - type: KMSPluginDegraded
    status: "False"
    reason: KMSPluginHealthy
    message: "KMS plugin is healthy and responding to Status calls"

  # When plugin is unhealthy:
  - type: KMSPluginDegraded
    status: "True"
    reason: KMSPluginUnhealthy
    message: "KMS plugin Status endpoint unreachable: connection refused"
```

**Metrics and Alerts:**

Operators expose metrics for monitoring:
- `kms_plugin_status_call_duration_seconds` - Histogram of Status call latency
- `kms_plugin_status_call_errors_total` - Counter of failed Status calls
- `kms_plugin_healthy` - Gauge (1 = healthy, 0 = unhealthy)

Alerts fire when plugin health degrades:
- `KMSPluginUnhealthy` - Plugin has been unhealthy for >5 minutes
- `KMSPluginStatusCallLatencyHigh` - Status calls taking >5 seconds

**Controller Precondition Integration:**

Operators provide a health check function to encryption controllers:

```go
// Provided by the operator to library-go controllers
func kmsPluginHealthCheck(ctx context.Context) (bool, error) {
	// Check cached health status (updated by periodic Status polling)
	if !cachedPluginHealth.IsHealthy() {
		return false, fmt.Errorf("KMS plugin unhealthy: %s", cachedPluginHealth.Reason)
	}
	return true, nil
}

// Controllers take the health check as one of their preconditions
// (shown positionally; Go has no named arguments):
controllers.NewKeyController(
	// ... other params ...
	[]PreconditionFunc{
		kmsPluginHealthCheck, // block controller sync if plugin unhealthy
		// ... other preconditions ...
	},
)
```

**Restart and Recovery Logic:**

When plugin health checks fail:

1. **Short-term failures (< 1 minute):**
   - Log warnings
   - Set operator condition to Degraded
   - Controllers block, but the cluster remains operational (kube-apiserver caches DEKs)

2. **Medium-term failures (1-5 minutes):**
   - Attempt container restart (if the process crashed)
   - Check for configuration issues (invalid credentials, network problems)
   - Fire alerts

3. **Long-term failures (> 5 minutes):**
   - Operator condition remains Degraded
   - Alerts continue firing
   - Manual intervention required (see Support Procedures)

**Provider-Specific Health Considerations:**

Different KMS plugins may have provider-specific health indicators:

- **AWS KMS Plugin:** Check AWS credential validity, KMS API reachability
- **Vault KMS Plugin:** Check Vault token/cert expiration, Vault service health
- **Thales KMS Plugin:** Check HSM device connectivity, PKCS#11 library availability

**Tech Preview vs GA:**

- **Tech Preview:** Basic health checking (Status endpoint polling, operator conditions)
- **GA:** Full monitoring suite (metrics, alerts, automatic recovery, dashboard integration)

### Risks and Mitigations

TODO: Migrate from current enhancement and add provider-specific risks

### Drawbacks

TODO: Update from current enhancement

## Alternatives (Not Implemented)

TODO: Migrate alternatives section

## Open Questions

1. Should we support mixed authentication methods (e.g., Kubernetes auth for openshift-apiserver, AppRole for kube-apiserver)?
2. How do we handle Vault plugin beta availability? Make Vault support conditional on plugin release?
3. Thales HSM device access - how do control plane nodes access HSMs (network HSM vs USB vs embedded TPM)?
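To make the first open question concrete, a mixed-auth configuration could hypothetically be expressed as below. The `deploymentAuthMethod`/`staticPodAuthMethod` split is invented purely for illustration and is **not** a proposed API; any real shape would go through the API review in [KMS Encryption Foundations](kms-encryption-foundations.md).

```yaml
# Hypothetical sketch for open question 1 -- NOT a proposed API.
spec:
  encryption:
    type: KMS
    kms:
      type: Vault
      vault:
        vaultAddress: https://vault.example.com:8200
        keyPath: transit/keys/openshift-encryption
        # Deployments (openshift-apiserver, oauth-apiserver) can use
        # ServiceAccount-token based Kubernetes auth...
        deploymentAuthMethod: Kubernetes
        # ...while the kube-apiserver static pod falls back to AppRole.
        staticPodAuthMethod: AppRole
        credentialsSecret:
          name: vault-kms-credentials
```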
+ +## Test Plan + +TODO: Define test strategy for multi-provider support + +## Graduation Criteria + +### Tech Preview Acceptance Criteria + +**AWS KMS Provider:** +- ✅ Full support with IAM authentication +- ✅ Sidecar deployment across all 3 API servers +- ✅ Automatic credential provisioning via CCO (openshift/oauth) and IMDS (kube-apiserver) +- ✅ Configuration via APIServer CR +- ✅ Basic monitoring and health checks + +**Vault KMS Provider:** +- ⚠️ Best-effort support (depends on Vault plugin beta release) +- ⚠️ AppRole authentication only (security limitations documented) +- ⚠️ Manual credential setup required +- ⚠️ Marked as experimental, subject to change + +**Thales KMS Provider:** +- ❌ Not supported in Tech Preview +- 📅 Deferred to GA + +**Feature Gate:** +- Behind `KMSEncryptionProvider` feature gate +- Disabled by default + +### Tech Preview → GA + +**AWS KMS Provider:** +- ✅ Production-ready, full SLO coverage +- ✅ Load testing completed +- ✅ Monitoring, alerts, runbooks defined +- ✅ Documentation in openshift-docs + +**Vault KMS Provider:** +- ✅ Certificate-based authentication required +- ✅ Automatic cert rotation by plugin +- ✅ AppRole used only for bootstrap +- ✅ Static pod auth limitations documented +- ✅ Vault plugin reaches GA release +- ✅ Production validation complete + +**Thales KMS Provider:** +- ✅ HSM integration validated (network/USB/TPM) +- ✅ Secure PIN management strategy +- ✅ PKCS#11 library compatibility tested +- ✅ Device access requirements documented + +**Feature Gate:** +- Removed (enabled by default) + +## Upgrade / Downgrade Strategy + +TODO: Define upgrade/downgrade procedures + +## Version Skew Strategy + +TODO: Define version skew handling + +## Operational Aspects of API Extensions + +TODO: Document operational impact + +## Support Procedures + +TODO: Define support runbooks per provider + +## Infrastructure Needed + +TODO: List infrastructure requirements (test KMS instances, HSMs, etc.) 
From 362bedfbe1a5c5a8a15812b59fd990c91d8f5687 Mon Sep 17 00:00:00 2001 From: Flavian Missi Date: Fri, 28 Nov 2025 16:03:34 +0100 Subject: [PATCH 13/13] fix dates --- enhancements/kube-apiserver/kms-encryption-foundations.md | 4 ++-- enhancements/kube-apiserver/kms-migration-recovery.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md index 1a8389e2ba..a08fc5dd76 100644 --- a/enhancements/kube-apiserver/kms-encryption-foundations.md +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -13,8 +13,8 @@ approvers: - "@benluddy" api-approvers: - "@JoelSpeed" -creation-date: 2025-01-28 -last-updated: 2025-01-28 +creation-date: 2025-11-28 +last-updated: 2025-11-28 tracking-link: - "https://issues.redhat.com/browse/OCPSTRAT-108" see-also: diff --git a/enhancements/kube-apiserver/kms-migration-recovery.md b/enhancements/kube-apiserver/kms-migration-recovery.md index a803c994a8..7f2de8775a 100644 --- a/enhancements/kube-apiserver/kms-migration-recovery.md +++ b/enhancements/kube-apiserver/kms-migration-recovery.md @@ -13,8 +13,8 @@ approvers: - "@sjenning" api-approvers: - "None" -creation-date: 2025-01-28 -last-updated: 2025-01-28 +creation-date: 2025-11-28 +last-updated: 2025-11-28 tracking-link: - "https://issues.redhat.com/browse/OCPSTRAT-1638" # GA feature only see-also: