
Conversation

mdbooth
Contributor

@mdbooth mdbooth commented Sep 16, 2025

No description provided.

@openshift-ci openshift-ci bot requested review from Miciah and sadasu September 16, 2025 14:22
Contributor

openshift-ci bot commented Sep 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

## Open Questions [optional]

1. What is the specific algorithm for determining CRD compatibility?
2. How should the system handle CRD conversion webhooks during compatibility checks?
Contributor

The system has no insight into conversion webhooks. It will be able to check compatibility between versions when a webhook is not defined; otherwise it is assumed that the conversion webhook is converting in a compatible manner.
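Concretely, the checker can only reason about a CRD whose conversion strategy is `None`; with a webhook strategy the conversion logic lives behind a service and is opaque to it (the sketch below is abbreviated, and the service details are illustrative):

```yaml
# Sketch only: an abbreviated CRD declaring a conversion webhook. When
# spec.conversion.strategy is "None" the checker can compare served schemas
# directly; with "Webhook", compatibility between versions must be assumed.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clusters.cluster.x-k8s.io
spec:
  # group, names, versions etc. omitted for brevity
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: capi-webhook-service        # illustrative name
          namespace: openshift-cluster-api
          path: /convert
```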

Contributor Author

This needs to be rewritten, and I think it's more of an operational issue.

  • We should document that it is the responsibility of the CRD owner to run conversion webhooks
  • We should document the potential operational impacts if we're ever using a non-storage version in openshift-cluster-api.

Contributor

Still to be rewritten?

Contributor Author

It's my intention that the section on the explicit responsibilities of the Adopting Manager covers this.


1. What is the specific algorithm for determining CRD compatibility?
2. How should the system handle CRD conversion webhooks during compatibility checks?
3. What is the performance impact of the admission webhooks on high-traffic clusters?
Contributor

Have you looked at all at match conditions on the webhook configurations to make it very explicit about which resources do and do not need to be sent to the webhook? Could be worth explaining some of the options there in this context?

Contributor Author

I thought I'd covered that somewhere? Here I'm really just highlighting that we don't yet know if further performance optimisation will be required.

Contributor

I think there was a very brief mention of selection being available, but not an explanation of how granular we expect that to be

Contributor Author

We're essentially copying the relevant section of the ValidatingWebhookConfiguration into our API, because that's how this is implemented. It's as granular as a ValidatingWebhookConfiguration can be.
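For reference, these are the scoping fields a ValidatingWebhookConfiguration exposes, which is the granularity being mirrored; the names, selectors and match condition below are illustrative, not values from this proposal:

```yaml
# Sketch only: the scoping knobs of a ValidatingWebhookConfiguration.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: crd-compatibility-checker      # illustrative
webhooks:
- name: objects.compatibility.example.io
  admissionReviewVersions: ["v1"]
  sideEffects: None
  clientConfig:
    service:
      name: compatibility-checker      # illustrative
      namespace: openshift-cluster-api
      path: /validate
  rules:
  - apiGroups: ["cluster.x-k8s.io"]
    apiVersions: ["v1beta1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["clusters"]
  namespaceSelector:
    matchLabels:
      compatibility-check: enabled     # illustrative label
  matchConditions:
  - name: exclude-kubelet-requests     # example condition from upstream docs
    expression: '!("system:nodes" in request.userInfo.groups)'
```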

**Note:** *Section not required until targeted at a release.*

The test plan should include:
- Unit tests for compatibility checking logic
Contributor

Just checking, this doesn't mean to test the external library right?
If we are relying on an external library, shouldn't we assume that the external library is well tested?

Contributor Author

We're not going to write unit tests for CRDify.

However, we will have our own definition of compatible which applies in this specific context. We need to ensure that we have configured CRDify correctly to implement that. This means that we're going to need our own unit testing, but it doesn't need to be as detailed as what CRDify hopefully has internally.

* Create a `CRDCompatibilityRequirement` for every CRD in the 'active' transport config map if that CRD is also marked 'unmanaged'.
This requirement will have `crdAdmitAction=Enforce`.

If `CRDCompatibilityCheckerEnforce` is not enabled, cluster-capi-operator will look for all 'active' `CRDCompatibilityRequirement`s and delete them.
Contributor

We don't generally bother with disable logic like this; once a gate is enabled in OCP, it should never be disabled.

Contributor Author

I've specifically documented this behaviour because it neatly covers the downgrade case, and I'm guessing it's probably not too hard to implement. Happy to take it out, but I think we'd spend as much effort writing the documentation on how to do it manually.

Contributor

We don't support downgrades, so while this is a nice-to-have, I would not block the feature shipping if it wasn't implemented. Feel free to leave it in, but I wouldn't expect us to prioritise implementing this.

Contributor Author

I've noted that we don't intend to support downgrades.

@damdo
Member

damdo commented Sep 16, 2025

/assign @damdo

action: Enforce
```

CCAPIO marks itself Degraded if any CRDCompatibilityRequirement it creates during this process becomes not Progressing, and:


What does marking itself as Degraded mean? Is it in the operator CRD?

Contributor Author

It means setting the Degraded condition on CCAPIO's ClusterOperator, which is an OpenShift thing.
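Concretely, it looks something like this (the ClusterOperator name, condition reason and message are all illustrative, not the real values):

```yaml
# Illustrative only: "marking itself Degraded" means publishing this
# condition on CCAPIO's ClusterOperator resource.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: cluster-capi-operator          # illustrative name
status:
  conditions:
  - type: Degraded
    status: "True"
    reason: CRDCompatibilityRequirementNotProgressing   # illustrative
    message: >-
      CRDCompatibilityRequirement for clusters.cluster.x-k8s.io is not
      Progressing
    lastTransitionTime: "2025-09-16T14:22:00Z"
```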

@JoelSpeed as a stylistic point, the above is an entirely reasonable question. Can I assume that a reader of openshift/enhancements would know this, or should I attempt to describe it somewhere?

Contributor

I tend to aim these documents so that someone who isn't an engineer in OpenShift could also read them (customers, PM, docs team). I think the concept of Degraded as a way of reporting operator problems in OpenShift is probably fairly well understood among those groups, so in this case I probably wouldn't expand on what it actually means. It's a hard one to define tbh, and obviously any definition doesn't cover everyone.

Comment on lines 410 to 597
* CRDCompatibilityCheckerUpgradeCheck
* CRDCompatibilityCheckerEnforce


Not sure if we could have more readable names here, or whether this follows some well-known pattern. Example:

Suggested change
* CRDCompatibilityCheckerUpgradeCheck
* CRDCompatibilityCheckerEnforce
* CRDCompatibilityUpgradeCheck
* CRDCompatibilityEnforce

Contributor

We will likely also need to make these matrixed by platform

Contributor Author

I didn't bother to look up the naming conventions for feature gates at all. I was kinda hoping @JoelSpeed would just give me 2 acceptable values 😅

Contributor

We are aiming for a feature gate ClusterAPIMachineManagement<platform> for each of the platforms we will support as we migrate to CAPI. The second stage of this (enforcement) is tied to the lifecycle of that feature gate, so I doubt we need an additional gate

For the earlier stage, perhaps we key off of that, and go for ClusterAPIMachineManagement<platform>Upgradeability?

`CRDCompatibilityCheckerUpgradeCheck` enables the 'future version' check.
If `CRDCompatibilityCheckerUpgradeCheck` is enabled, cluster-capi-operator will:

* Create a `CRDCompatibilityRequirement` for every CRD discovered in the 'future' transport config map if that CRD is also marked 'unmanaged'.


What does "marked 'unmanaged'" mean? How is that expressed?

Contributor

We have a separate document explaining how the flow will work for this, trying to focus the two documents on separate elements (implementation vs usage in an existing system), but there's bleed through

The high level of that though is that we will have a configuration CRD for the CAPI operator that takes a list of CRD names to be considered unmanaged
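As a sketch only — the kind, group and field names below are assumptions for illustration; the real configuration API is defined in the companion document:

```yaml
# Assumed shape: a CAPI operator configuration naming the CRDs to treat
# as unmanaged. Every identifier here is hypothetical.
apiVersion: operator.cluster.x-k8s.io/v1alpha1   # assumed group/version
kind: ClusterAPIOperatorConfig                   # assumed kind
metadata:
  name: cluster
spec:
  unmanagedCRDs:                                 # assumed field name
  - clusters.cluster.x-k8s.io
  - machines.cluster.x-k8s.io
```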

- Enforcing single ownership of CRDs (must be ensured out-of-band)
- Providing a user interface for managing compatibility requirements
- Resolving compatibility conflicts between different actors


Should we add as non-goal that this proposal won't handle how validating/mutating webhooks should be deployed? (e.g. they should have proper selectors to not refer other namespaces).

Example:

  • Validating/Mutating webhooks for CAPI deployed in openshift-cluster-api namespace
  • Validating/Mutating webhooks for CAPI deployed in foo namespace

Contributor

Yeah so we expect the person who takes over management of the CRDs (adopting manager) to be responsible for this going forward, unless this is clarified later in this document, we should clarify what we expect of the adopting managers once they have taken over.

Comment on lines 112 to 118
crdRef: clusters.cluster.x-k8s.io
creatorDescription: "OpenShift Cluster CAPI Operator"
compatibilityCRD: |
...
<complete YAML document of Cluster CRD from transport config map>
...
crdAdmitAction: Enforce


I wonder if this should be (I'm not convinced of myself here):

Suggested change
crdRef: clusters.cluster.x-k8s.io
creatorDescription: "OpenShift Cluster CAPI Operator"
compatibilityCRD: |
...
<complete YAML document of Cluster CRD from transport config map>
...
crdAdmitAction: Enforce
creatorDescription: "OpenShift Cluster CAPI Operator"
crd:
ref: clusters.cluster.x-k8s.io
action: Enforce|Warn
compatibility: |
...

Feels otherwise a bit inconsistent to have:

  • .spec.crdAdmitAction
  • .spec.objectSchemaValidation.action

Note: above also feels a bit inconsistent, because ref and compatibility are also used for the objectSchemaValidation later...

Contributor

What if it were something like

target:
  name: my.crd.io
compatibilitySchema: ...
crdSchema:
  action: Enforce|Warn
objectSchema:
  action: Enforce|Warn


That reads way better :-)

Contributor Author

Incidentally @JoelSpeed I'm hesitant to use compatibilitySchema. A CRD contains a schema, but we need the whole CRD, including all its k8s metadata, as yaml. I wonder if we could make that more obvious somehow.

The following fields are 'top level', meaning that they're always required and they apply equally to CRD validation and object validation:

  • crdRef
  • creatorDescription
  • compatibilityCRD

CRD validation only:

  • crdAdmitAction

Object validation only:

  • action
  • namespaceSelector
  • objectSelector
  • matchConditions

How about:

crd:
  name: my.crd.io
compatibilityCRD:
  yaml: |
    ...
creatorDescription: "Cloud Team"
crdSchemaValidation:
  action: Enforce|Warn
objectSchemaValidation:
  action: Enforce|Warn
  namespaceSelector:
    ...
  matchConditions:
    ...

Contributor

If compatibilityCRD is an object, what if we got rid of crd and moved name inside it?

compatibilityCRD:
  name: my.crd.io
  yaml: |
    ...
creatorDescription: "ABC"
crdSchemaValidation:
  action: Enforce|Warn
objectSchemaValidation:
  action: Enforce|Warn
  namespaceSelector:
    ...
  matchConditions:
    ...

Contributor Author

...or parsed the name from the yaml, and put it in status?

Contributor

I'm open to that, with a validation that parsing the name from yaml matches the status entry on all future writes?

@chrischdi chrischdi Sep 24, 2025

Only makes sense if we can use a status field as an index field. I never tried that, but I guess it should work (starting to check). Note: tested — using a status field as an index works, but it means we first need a reconciliation before the webhook really makes use of the requirement (which maybe is also a good thing?).

**Mitigation**:
This would be a bug in the tool.
We expect the initial and primary non-payload customer to be HCP.
We will coordinate with HCP to ensure that potential issues show up early in their CI pipeline.


We have already experienced a case like this, between MCE->CAPA and HyperShift->CAPA. Testing will be mandatory.


The benefits of enabling safe multi-actor CRD management outweigh these drawbacks, and the system is designed to fail gracefully with clear error messages.

## Alternatives (Not Implemented)


Since this proposal lets the CRD be owned and changed by an external workload (e.g. HCP), why is using CRD versions with conversion->strategy->Webhook not good enough to handle this case?


To add more thoughts: using CRD conversion->strategy->Webhook may require effort on the API version conversions, e.g. v1alpha1 -> v1beta2 and v1beta2 -> v1alpha1, but with that in place we should be able to handle this case?

Contributor Author

API versions are a use case covered by this tool, but they aren't the primary focus. We're more concerned with 2 different versions of api version v1, where the later one may have more fields and allow more values in its enums.

We'd want to prevent an upgrade to a CRD which removed an old API version.
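For illustration, the version-removal case looks like this (both CRD specs abbreviated):

```yaml
# Installed CRD (abbreviated): serves both v1beta1 and v1beta2.
spec:
  versions:
  - name: v1beta1
    served: true
    storage: false
  - name: v1beta2
    served: true
    storage: true
---
# Proposed update (abbreviated): would be rejected, because v1beta1
# has been removed and existing clients may still depend on it.
spec:
  versions:
  - name: v1beta2
    served: true
    storage: true
```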

* not Compatible - current CRD is incompatible with CCAPIO requirements

HCP uses an as-yet-undefined mechanism to inform CCAPIO of the set of CRDs which are no longer managed by CCAPIO.
For CRDs in this list, CCAPIO creates a CRDCompatibilityRequirement but does not load or update the CRD.


Would you expand on that using a real example from the CAPI APIs, e.g. Cluster/Machine? Is there a case between CAPI API versions (e.g. v1alpha1 and v1beta2) that leads to breaking changes?

@mdbooth mdbooth force-pushed the crd-compatibility branch from ff9313e to 3e98f90 Compare October 6, 2025 12:50
@JoelSpeed JoelSpeed left a comment

When we have previously discussed upgrades, we have also talked about how the CAPI operator will not roll out new operand versions until it is sure that those are compatible; I don't see that covered here. I also missed the part we previously discussed about current, desired and future. There seems to be no mention of the desired phase (which I guess is part of the new operand roll-out). Did we end up deciding that this wasn't needed?



#### Cluster CAPI Operator

* Stops applying updates to the CRD on behalf of the core payload.
Contributor

Configures CRD compatibility to ensure its operands are compatible with the current and future versions of the CRD?

Contributor Author

Cluster CAPI Operator is also a CRD User. This is defined there. I noted that in the CRD user section, but I might switch the backreference into 2 forward references to make it clearer.


Note that the adopting manager and Cluster CAPI Operator are also expected to be CRD users.
There may be additional CRD users.
For example when the adopting manager is ACM and deploys HyperShift, HyperShift will also be a CRD user.
Contributor

When ACM deploys hypershift, does hypershift operator install the CAPI CRDs or does ACM? Have we checked this?

Contributor Author

@serngawy Can you answer this?


The CRD Compatibility Checker can also perform schema validation on objects against a custom schema.
The Cluster CAPI Operator will configure CRD Compatibility Checker to perform schema validation on objects of unmanaged CRDs used by its operands.
This will ensure objects presented to Cluster CAPI Operator operands conform to the expected schema rather than a later version of the schema which may permit, for example, addition fields or values.
Contributor

Suggested change
This will ensure objects presented to Cluster CAPI Operator operands conform to the expected schema rather than a later version of the schema which may permit, for example, addition fields or values.
This will ensure objects presented to Cluster CAPI Operator operands conform to the expected schema rather than a later version of the schema which may permit, for example, additional fields or values.

# the corresponding ValidatingWebhookConfiguration. Their definitions and
# semantics are therefore identical to those in
# ValidatingWebhookConfiguration.
namespaceSelector:
Contributor

Do we allow this to be omitted? And if we do, does that mean all namespaces?
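For context: in a ValidatingWebhookConfiguration, an absent namespaceSelector matches every namespace, so restricting the check requires an explicit selector, e.g.:

```yaml
# Matches only the openshift-cluster-api namespace, using the automatic
# kubernetes.io/metadata.name label set on every namespace.
namespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: In
    values: ["openshift-cluster-api"]
```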

Comment on lines +389 to +402
# An update to spec.compatibilitySchema.crdYAML which would cause this value
# to change once it has been set will be rejected.
Contributor

How are you implementing this validation?

Contributor Author

Webhook on CRDCompatibilityRequirement itself (it's already implemented, btw). I could write that here, but it felt like an irrelevant detail.

Contributor

Sigh, webhooks

Didn't find any way to implement in CEL?
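(For context: whole-field immutability is expressible as a CEL transition rule, but CEL has no YAML parser, so deriving the CRD name from the embedded YAML and pinning only that still needs a webhook. Placement within the schema below is illustrative.)

```yaml
# CEL transition rule making the whole field immutable once set; this is
# stricter than the name-only check the webhook implements.
properties:
  crdYAML:
    type: string
    x-kubernetes-validations:
    - rule: "self == oldSelf"
      message: "crdYAML is immutable"
```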

**Mitigation**:
CRD updates are infrequent enough that this is unlikely to be a concern.
For object schema validation there is potential for impact during certain phases of cluster activity.
We will optimize performance beyond the initial implementation only if it proves necessary.
Contributor

Have we considered implementing any metrics to track the impact the webhooks are going to have?

Contributor Author

No, but we could. However, is there some existing metric, e.g. length of time to serve an api request, that would indicate a problem here if one emerged? Adding metrics isn't hard, but they are expensive in themselves so I'd rather not.

Contributor

I expect that metric does exist, yes, but I don't know whether it would correctly attribute where the time is being spent.

## Open Questions [optional]

1. What is the specific algorithm for determining CRD compatibility?
2. How should the system handle CRD conversion webhooks during compatibility checks?
Contributor

Still to be rewritten?

@mdbooth mdbooth force-pushed the crd-compatibility branch from 3e98f90 to 90e38d5 Compare October 7, 2025 17:08
Contributor

openshift-ci bot commented Oct 7, 2025

@mdbooth: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| ci/prow/markdownlint | 90e38d5 | true | /test markdownlint |

