
Conversation

mdbooth
Contributor

@mdbooth mdbooth commented Sep 16, 2025

No description provided.

@openshift-ci openshift-ci bot requested review from Miciah and sadasu September 16, 2025 14:22
Contributor

openshift-ci bot commented Sep 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

## Open Questions [optional]

1. What is the specific algorithm for determining CRD compatibility?
2. How should the system handle CRD conversion webhooks during compatibility checks?
Contributor

The system has no insight into conversion webhooks. It will be able to check compatibility between versions when a webhook is not defined; otherwise it is assumed that the conversion webhook is converting in a compatible manner.
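Concretely, the checker can only reason about a CRD whose conversion strategy is `None`; with a webhook strategy the conversion logic lives behind a service and is opaque to it (the sketch below is abbreviated, and the service details are illustrative):

```yaml
# Sketch only: an abbreviated CRD declaring a conversion webhook. When
# spec.conversion.strategy is "None" the checker can compare served schemas
# directly; with "Webhook", compatibility between versions must be assumed.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clusters.cluster.x-k8s.io
spec:
  # group, names, versions etc. omitted for brevity
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: capi-webhook-service        # illustrative name
          namespace: openshift-cluster-api
          path: /convert
```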

Contributor Author

This needs to be rewritten, and I think it's more of an operational issue.

  • We should document that it is the responsibility of the CRD owner to run conversion webhooks
  • We should document the potential operational impacts if we're ever using a non-storage version in openshift-cluster-api.

Contributor

Still to be rewritten?

Contributor Author

It's my intention that the section on the explicit responsibilities of the Adopting Manager covers this.


1. What is the specific algorithm for determining CRD compatibility?
2. How should the system handle CRD conversion webhooks during compatibility checks?
3. What is the performance impact of the admission webhooks on high-traffic clusters?
Contributor

Have you looked at all at match conditions on the webhook configurations to make it very explicit about which resources do and do not need to be sent to the webhook? Could be worth explaining some of the options there in this context?

Contributor Author

I thought I'd covered that somewhere? Here I'm really just highlighting that we don't yet know if further performance optimisation will be required.

Contributor

I think there was a very brief mention of selection being available, but not an explanation of how granular we expect that to be

Contributor Author

We're essentially copying the relevant section of the ValidatingWebhookConfiguration into our API, because that's how this is implemented. It's as granular as a ValidatingWebhookConfiguration can be.
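For reference, these are the scoping fields a ValidatingWebhookConfiguration exposes, which is the granularity being mirrored; the names, selectors and match condition below are illustrative, not values from this proposal:

```yaml
# Sketch only: the scoping knobs of a ValidatingWebhookConfiguration.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: crd-compatibility-checker      # illustrative
webhooks:
- name: objects.compatibility.example.io
  admissionReviewVersions: ["v1"]
  sideEffects: None
  clientConfig:
    service:
      name: compatibility-checker      # illustrative
      namespace: openshift-cluster-api
      path: /validate
  rules:
  - apiGroups: ["cluster.x-k8s.io"]
    apiVersions: ["v1beta1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["clusters"]
  namespaceSelector:
    matchLabels:
      compatibility-check: enabled     # illustrative label
  matchConditions:
  - name: exclude-kubelet-requests     # example condition from upstream docs
    expression: '!("system:nodes" in request.userInfo.groups)'
```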

**Note:** *Section not required until targeted at a release.*

The test plan should include:
- Unit tests for compatibility checking logic
Contributor

Just checking, this doesn't mean to test the external library right?
If we are relying on an external library, shouldn't we assume that the external library is well tested?

Contributor Author

We're not going to write unit tests for CRDify.

However, we will have our own definition of compatible which applies in this specific context. We need to ensure that we have configured CRDify correctly to implement that. This means that we're going to need our own unit testing, but it doesn't need to be as detailed as what CRDify hopefully has internally.

* Create a `CRDCompatibilityRequirement` for every CRD in the 'active' transport config map if that CRD is also marked 'unmanaged'.
This requirement will have `crdAdmitAction=Enforce`.

If `CRDCompatibilityCheckerEnforce` is not enabled, cluster-capi-operator will look for all 'active' `CRDCompatibilityRequirement`s and delete them.
Contributor

We don't generally bother with disable logic like this; once a gate is enabled in OCP, it should never be disabled.

Contributor Author

I've specifically documented this behaviour because it neatly covers the downgrade case, and I'm guessing it's probably not too hard to implement. Happy to take it out, but I think we'd spend as much effort writing the documentation on how to do it manually.

Contributor

We don't support downgrades, so while this is a nice-to-have, I would not block the feature shipping if it wasn't implemented. Feel free to leave it in, but I wouldn't expect us to prioritise implementing this.

Contributor Author

I've noted that we don't intend to support downgrades.

@damdo
Member

damdo commented Sep 16, 2025

/assign @damdo

action: Enforce
```

CCAPIO marks itself Degraded if any CRDCompatibilityRequirement it creates during this process becomes not Progressing, and:


What does marking itself as Degraded mean? Is it in the operator CRD?

Contributor Author

It means setting the Degraded condition on CCAPIO's ClusterOperator, which is an OpenShift thing.
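Concretely, it looks something like this (the ClusterOperator name, condition reason and message are all illustrative, not the real values):

```yaml
# Illustrative only: "marking itself Degraded" means publishing this
# condition on CCAPIO's ClusterOperator resource.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: cluster-capi-operator          # illustrative name
status:
  conditions:
  - type: Degraded
    status: "True"
    reason: CRDCompatibilityRequirementNotProgressing   # illustrative
    message: >-
      CRDCompatibilityRequirement for clusters.cluster.x-k8s.io is not
      Progressing
    lastTransitionTime: "2025-09-16T14:22:00Z"
```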

@JoelSpeed as a stylistic point, the above is an entirely reasonable question. Can I assume that a reader of openshift/enhancements would know this, or should I attempt to describe it somewhere?

Contributor

I tend to aim these documents so that someone who isn't an engineer in OpenShift could also read them (customers, PM, docs team). I think the concept of Degraded as a way of reporting operator problems in OpenShift is probably fairly well understood among those groups, so in this case I probably wouldn't expand on what it actually means. It's a hard one to define tbh, and obviously any definition doesn't cover everyone.

Comment on lines 410 to 597
* CRDCompatibilityCheckerUpgradeCheck
* CRDCompatibilityCheckerEnforce


Not sure if we could have more readable names here, or whether this follows some well-known pattern. Example:

Suggested change
* CRDCompatibilityCheckerUpgradeCheck
* CRDCompatibilityCheckerEnforce
* CRDCompatibilityUpgradeCheck
* CRDCompatibilityEnforce

Contributor

We will likely also need to make these matrixed by platform

Contributor Author

I didn't bother to look up the naming conventions for feature gates at all. I was kinda hoping @JoelSpeed would just give me 2 acceptable values 😅

Contributor

We are aiming for a feature gate ClusterAPIMachineManagement<platform> for each of the platforms we will support as we migrate to CAPI. The second stage of this (enforcement) is tied to the lifecycle of that feature gate, so I doubt we need an additional gate

For the earlier stage, perhaps we key off of that, and go for ClusterAPIMachineManagement<platform>Upgradeability?

`CRDCompatibilityCheckerUpgradeCheck` enables the 'future version' check.
If `CRDCompatibilityCheckerUpgradeCheck` is enabled, cluster-capi-operator will:

* Create a `CRDCompatibilityRequirement` for every CRD discovered in the 'future' transport config map if that CRD is also marked 'unmanaged'.


What does "marked 'unmanaged'" mean? How is that expressed?

Contributor

We have a separate document explaining how the flow will work for this, trying to focus the two documents on separate elements (implementation vs usage in an existing system), but there's bleed through

The high level of that though is that we will have a configuration CRD for the CAPI operator that takes a list of CRD names to be considered unmanaged
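As a sketch only — the kind, group and field names below are assumptions for illustration; the real configuration API is defined in the companion document:

```yaml
# Assumed shape: a CAPI operator configuration naming the CRDs to treat
# as unmanaged. Every identifier here is hypothetical.
apiVersion: operator.cluster.x-k8s.io/v1alpha1   # assumed group/version
kind: ClusterAPIOperatorConfig                   # assumed kind
metadata:
  name: cluster
spec:
  unmanagedCRDs:                                 # assumed field name
  - clusters.cluster.x-k8s.io
  - machines.cluster.x-k8s.io
```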

- Enforcing single ownership of CRDs (must be ensured out-of-band)
- Providing a user interface for managing compatibility requirements
- Resolving compatibility conflicts between different actors


Should we add as non-goal that this proposal won't handle how validating/mutating webhooks should be deployed? (e.g. they should have proper selectors to not refer other namespaces).

Example:

  • Validating/Mutating webhooks for CAPI deployed in openshift-cluster-api namespace
  • Validating/Mutating webhooks for CAPI deployed in foo namespace

Contributor

Yeah so we expect the person who takes over management of the CRDs (adopting manager) to be responsible for this going forward, unless this is clarified later in this document, we should clarify what we expect of the adopting managers once they have taken over.

Comment on lines 112 to 118
crdRef: clusters.cluster.x-k8s.io
creatorDescription: "OpenShift Cluster CAPI Operator"
compatibilityCRD: |
...
<complete YAML document of Cluster CRD from transport config map>
...
crdAdmitAction: Enforce


I wonder if this should be (I'm not convinced of myself here):

Suggested change
crdRef: clusters.cluster.x-k8s.io
creatorDescription: "OpenShift Cluster CAPI Operator"
compatibilityCRD: |
...
<complete YAML document of Cluster CRD from transport config map>
...
crdAdmitAction: Enforce
creatorDescription: "OpenShift Cluster CAPI Operator"
crd:
ref: clusters.cluster.x-k8s.io
action: Enforce|Warn
compatibility: |
...

Feels otherwise a bit inconsistent to have:

  • .spec.crdAdmitAction
  • .spec.objectSchemaValidation.action

Note: above also feels a bit inconsistent, because ref and compatibility are also used for the objectSchemaValidation later...

Contributor

What if it were something like

target:
  name: my.crd.io
compatibilitySchema: ...
crdSchema:
  action: Enforce|Warn
objectSchema:
  action: Enforce|Warn


That reads way better :-)

Contributor Author

Incidentally @JoelSpeed I'm hesitant to use compatibilitySchema. A CRD contains a schema, but we need the whole CRD, including all its k8s metadata, as yaml. I wonder if we could make that more obvious somehow.

The following fields are 'top level', meaning that they're always required and they apply equally to CRD validation and object validation:

  • crdRef
  • creatorDescription
  • compatibilityCRD

CRD validation only:

  • crdAdmitAction

Object validation only:

  • action
  • namespaceSelector
  • objectSelector
  • matchConditions

How about:

crd:
  name: my.crd.io
compatibilityCRD:
  yaml: |
    ...
creatorDescription: "Cloud Team"
crdSchemaValidation:
  action: Enforce|Warn
objectSchemaValidation:
  action: Enforce|Warn
  namespaceSelector:
    ...
  matchConditions:
    ...

Contributor

If compatibilityCRD is an object, what if we got rid of crd and moved name inside it?

compatibilityCRD:
  name: my.crd.io
  yaml: |
    ...
creatorDescription: "ABC"
crdSchemaValidation:
  action: Enforce|Warn
objectSchemaValidation:
  action: Enforce|Warn
  namespaceSelector:
    ...
  matchConditions:
    ...

Contributor Author

...or parsed the name from the yaml, and put it in status?

Contributor

I'm open to that, with a validation that parsing the name from yaml matches the status entry on all future writes?

@chrischdi chrischdi Sep 24, 2025

Only makes sense if we can use a status field as an index field. I never tried that, but I guess it should work (starting to check). Note: tested — using a status field as an index works, but it means we first need a reconciliation before the webhook really makes use of the requirement (which maybe is also a good thing?).

**Mitigation**:
This would be a bug in the tool.
We expect the initial and primary non-payload customer to be HCP.
We will coordinate with HCP to ensure that potential issues show up early in their CI pipeline.


We have already experienced a case like this, between MCE->CAPA and HyperShift->CAPA. Testing will be mandatory.


The benefits of enabling safe multi-actor CRD management outweigh these drawbacks, and the system is designed to fail gracefully with clear error messages.

## Alternatives (Not Implemented)


Since this proposal lets the CRD be owned and changed by an external workload (e.g. HCP), why is using CRD versions with conversion->strategy->Webhook not good enough to handle this case?


To add more thoughts: using CRD conversion->strategy->Webhook may require effort on the API version conversions, e.g. v1alpha1 -> v1beta2 and v1beta2 -> v1alpha1, but with that in place we should be able to handle this case?

Contributor Author

API versions are a use case covered by this tool, but they aren't the primary focus. We're more concerned with 2 different versions of api version v1, where the later one may have more fields and allow more values in its enums.

We'd want to prevent an upgrade to a CRD which removed an old API version.
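For illustration, the version-removal case looks like this (both CRD specs abbreviated):

```yaml
# Installed CRD (abbreviated): serves both v1beta1 and v1beta2.
spec:
  versions:
  - name: v1beta1
    served: true
    storage: false
  - name: v1beta2
    served: true
    storage: true
---
# Proposed update (abbreviated): would be rejected, because v1beta1
# has been removed and existing clients may still depend on it.
spec:
  versions:
  - name: v1beta2
    served: true
    storage: true
```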

* not Compatible - current CRD is incompatible with CCAPIO requirements

HCP uses an as-yet-undefined mechanism to inform CCAPIO of the set of CRDs which are no longer managed by CCAPIO.
For CRDs in this list, CCAPIO creates a CRDCompatibilityRequirement but does not load or update the CRD.


Would you expand on that using a real example from the CAPI APIs, e.g. Cluster/Machine? Is there a case between CAPI API versions (e.g. v1alpha1 and v1beta2) that leads to breaking changes?

@mdbooth mdbooth force-pushed the crd-compatibility branch from ff9313e to 3e98f90 Compare October 6, 2025 12:50
@JoelSpeed JoelSpeed left a comment

When we have previously discussed upgrades, we have also talked about how the CAPI operator will not roll out new operand versions until it is sure that those are compatible; I don't see that covered here. I also missed the part we previously discussed about current, desired and future. There seems to be no mention of the desired phase (which I guess is part of the new operand roll-out). Did we end up deciding that this wasn't needed?



#### Cluster CAPI Operator

* Stops applying updates to the CRD on behalf of the core payload.
Contributor

Configures CRD compatibility to ensure its operands are compatible with the current and future versions of the CRD?

Contributor Author

Cluster CAPI Operator is also a CRD User. This is defined there. I noted that in the CRD user section, but I might switch the backreference into 2 forward references to make it clearer.


Note that the adopting manager and Cluster CAPI Operator are also expected to be CRD users.
There may be additional CRD users.
For example when the adopting manager is ACM and deploys HyperShift, HyperShift will also be a CRD user.
Contributor

When ACM deploys hypershift, does hypershift operator install the CAPI CRDs or does ACM? Have we checked this?

Contributor Author

@serngawy Can you answer this?


The CRD Compatibility Checker can also perform schema validation on objects against a custom schema.
The Cluster CAPI Operator will configure CRD Compatibility Checker to perform schema validation on objects of unmanaged CRDs used by its operands.
This will ensure objects presented to Cluster CAPI Operator operands conform to the expected schema rather than a later version of the schema which may permit, for example, addition fields or values.
Contributor

Suggested change
This will ensure objects presented to Cluster CAPI Operator operands conform to the expected schema rather than a later version of the schema which may permit, for example, addition fields or values.
This will ensure objects presented to Cluster CAPI Operator operands conform to the expected schema rather than a later version of the schema which may permit, for example, additional fields or values.

# the corresponding ValidatingWebhookConfiguration. Their definitions and
# semantics are therefore identical to those in
# ValidatingWebhookConfiguration.
namespaceSelector:
Contributor

Do we allow this to be omitted? And if we do, does that mean all namespaces?
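For context: in a ValidatingWebhookConfiguration, an absent namespaceSelector matches every namespace, so restricting the check requires an explicit selector, e.g.:

```yaml
# Matches only the openshift-cluster-api namespace, using the automatic
# kubernetes.io/metadata.name label set on every namespace.
namespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: In
    values: ["openshift-cluster-api"]
```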

Comment on lines +389 to +402
# An update to spec.compatibilitySchema.crdYAML which would cause this value
# to change once it has been set will be rejected.
Contributor

How are you implementing this validation?

Contributor Author

Webhook on CRDCompatibilityRequirement itself (it's already implemented, btw). I could write that here, but it felt like an irrelevant detail.

Contributor

Sigh, webhooks

Didn't find any way to implement in CEL?
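(For context: whole-field immutability is expressible as a CEL transition rule, but CEL has no YAML parser, so deriving the CRD name from the embedded YAML and pinning only that still needs a webhook. Placement within the schema below is illustrative.)

```yaml
# CEL transition rule making the whole field immutable once set; this is
# stricter than the name-only check the webhook implements.
properties:
  crdYAML:
    type: string
    x-kubernetes-validations:
    - rule: "self == oldSelf"
      message: "crdYAML is immutable"
```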

**Mitigation**:
CRD updates are infrequent enough that this is unlikely to be a concern.
For object schema validation there is potential for impact during certain phases of cluster activity.
We will optimize performance beyond the initial implementation only if it proves necessary.
Contributor

Have we considered implementing any metrics to track the impact the webhooks are going to have?

Contributor Author

No, but we could. However, is there some existing metric, e.g. length of time to serve an api request, that would indicate a problem here if one emerged? Adding metrics isn't hard, but they are expensive in themselves so I'd rather not.

Contributor

I expect that metric does exist, yes, but I don't know whether it would correctly attribute where the time is being spent.

## Open Questions [optional]

1. What is the specific algorithm for determining CRD compatibility?
2. How should the system handle CRD conversion webhooks during compatibility checks?
Contributor

Still to be rewritten?

@mdbooth mdbooth force-pushed the crd-compatibility branch from 3e98f90 to 90e38d5 Compare October 7, 2025 17:08
Contributor

openshift-ci bot commented Oct 7, 2025

@mdbooth: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| ci/prow/markdownlint | 90e38d5 | true | /test markdownlint |

