# Add cluster-api/crd-compatibility-checker #1845
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
> ## Open Questions [optional]
>
> 1. What is the specific algorithm for determining CRD compatibility?
> 2. How should the system handle CRD conversion webhooks during compatibility checks?
The system has no insight into conversion webhooks. It will be able to check compatibility between versions when a webhook is not defined, but otherwise it is assumed that the conversion webhook is converting in a compatible manner
This needs to be rewritten, and I think it's more of an operational issue.

- We should document that it is the responsibility of the CRD owner to run conversion webhooks
- We should document the potential operational impacts if we're ever using a non-storage version in `openshift-cluster-api`.
Still to be rewritten?
It's my intention that the section on the explicit responsibilities of the Adopting Manager covers this.
> 1. What is the specific algorithm for determining CRD compatibility?
> 2. How should the system handle CRD conversion webhooks during compatibility checks?
> 3. What is the performance impact of the admission webhooks on high-traffic clusters?
Have you looked at all at match conditions on the webhook configurations to make it very explicit about which resources do and do not need to be sent to the webhook? Could be worth explaining some of the options there in this context?
I thought I'd covered that somewhere? Here I'm really just highlighting that we don't yet know if further performance optimisation will be required.
I think there was a very brief mention of selection being available, but not an explanation of how granular we expect that to be
We're essentially copying the relevant section of the ValidatingWebhookConfiguration into our API, because that's how this is implemented. It's as granular as a ValidatingWebhookConfiguration can be.
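For readers unfamiliar with the knobs involved, the selection fields a `ValidatingWebhookConfiguration` exposes (and which the proposed API would therefore mirror) look roughly like the sketch below; the webhook name, label keys, and CEL expression are illustrative only, not part of the proposal:

```yaml
# Illustrative only: the selection fields available on a
# ValidatingWebhookConfiguration, which bound the granularity of this API.
webhooks:
- name: compat.example.openshift.io   # hypothetical webhook name
  namespaceSelector:                  # limit to namespaces matching these labels
    matchLabels:
      capi.openshift.io/managed: "true"
  objectSelector:                     # further limit by labels on the object itself
    matchExpressions:
    - key: cluster.x-k8s.io/cluster-name
      operator: Exists
  matchConditions:                    # CEL expressions evaluated before the webhook is called
  - name: exclude-deletes
    expression: 'request.operation != "DELETE"'
```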
> **Note:** *Section not required until targeted at a release.*
>
> The test plan should include:
> - Unit tests for compatibility checking logic
Just checking, this doesn't mean to test the external library right?
If we are relying on an external library, shouldn't we assume that the external library is well tested?
We're not going to write unit tests for CRDify.
However, we will have our own definition of compatible which applies in this specific context. We need to ensure that we have configured CRDify correctly to implement that. This means that we're going to need our own unit testing, but it doesn't need to be as detailed as what CRDify hopefully has internally.
> * Create a `CRDCompatibilityRequirement` for every CRD in the 'active' transport config map if that CRD is also marked 'unmanaged'.
>   This requirement will have `crdAdmitAction=Enforce`.
>
> If `CRDCompatibilityCheckerEnforce` is not enabled, cluster-capi-operator will look for all 'active' `CRDCompatibilityRequirement`s and delete them.
We don't generally bother with disabled logic like this; once a gate is enabled in OCP, it should never be disabled.
I've specifically documented this behaviour because it neatly covers the downgrade case, and I'm guessing it's probably not too hard to implement. Happy to take it out, but I think we'd spend as much effort writing the documentation on how to do it manually.
We don't support downgrades, so while this is a nice to have, I would not block the feature shipping if this wasn't implemented. Feel free to leave it in, but I wouldn't expect us to prioritise implementing this.
I've noted that we don't intend to support downgrades.
/assign @damdo
> ```yaml
> action: Enforce
> ```
>
> CCAPIO marks itself Degraded if any CRDCompatibilityRequirement it creates during this process becomes not Progressing, and:
What does marking itself as Degraded mean? Is it in the operator CRD?
It means setting the Degraded condition on CCAPIO's ClusterOperator, which is an OpenShift thing.

@JoelSpeed as a stylistic point, the above is an entirely reasonable question. Can I assume that a reader of openshift/enhancements would know this, or should I attempt to describe it somewhere?
I tend to aim these documents so that someone who isn't an engineer in OpenShift could also read them (customers, PM, docs team). I think the concept of Degraded as a way of reporting operator problems in openshift is probably fairly well understood among those groups so in this case, I probably wouldn't expand on what that actually means, hard one to define there tbh, and obviously, doesn't cover everyone
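Concretely, "marking itself Degraded" means setting a condition like the one sketched below on CCAPIO's ClusterOperator resource; the `reason` and `message` shown are illustrative, not a committed API:

```yaml
# Illustrative sketch of a Degraded condition on CCAPIO's ClusterOperator.
status:
  conditions:
  - type: Degraded
    status: "True"
    reason: CRDCompatibilityRequirementFailed   # hypothetical reason
    message: "CRDCompatibilityRequirement clusters.cluster.x-k8s.io is not Progressing"
    lastTransitionTime: "2025-01-01T00:00:00Z"
```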
> * CRDCompatibilityCheckerUpgradeCheck
> * CRDCompatibilityCheckerEnforce
Not sure if we could have more readable ones here, or whether this follows some well-known pattern. Example:

```suggestion
* CRDCompatibilityUpgradeCheck
* CRDCompatibilityEnforce
```
We will likely also need to make these matrixed by platform
I didn't bother to look up the naming conventions for feature gates at all. I was kinda hoping @JoelSpeed would just give me 2 acceptable values 😅
We are aiming for a feature gate `ClusterAPIMachineManagement<platform>` for each of the platforms we will support as we migrate to CAPI. The second stage of this (enforcement) is tied to the lifecycle of that feature gate, so I doubt we need an additional gate.

For the earlier stage, perhaps we key off of that, and go for `ClusterAPIMachineManagement<platform>Upgradeability`?
> `CRDCompatibilityCheckerUpgradeCheck` enables the 'future version' check.
> If `CRDCompatibilityCheckerUpgradeCheck` is enabled, cluster-capi-operator will:
>
> * Create a `CRDCompatibilityRequirement` for every CRD discovered in the 'future' transport config map if that CRD is also marked 'unmanaged'.
What does "marked 'unmanaged'" mean? How is that expressed?
We have a separate document explaining how the flow will work for this, trying to focus the two documents on separate elements (implementation vs usage in an existing system), but there's bleed through
The high level of that though is that we will have a configuration CRD for the CAPI operator that takes a list of CRD names to be considered unmanaged
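As a rough illustration of the mechanism just described (the API group, kind, and field names here are hypothetical, pending the separate document):

```yaml
# Hypothetical sketch: a CAPI operator configuration CRD listing CRDs
# that should be treated as unmanaged (i.e. no longer applied by CCAPIO).
apiVersion: operator.cluster.x-k8s.io/v1alpha1   # hypothetical group/version
kind: ClusterAPIOperatorConfig                   # hypothetical kind
metadata:
  name: cluster
spec:
  unmanagedCRDs:
  - clusters.cluster.x-k8s.io
  - machines.cluster.x-k8s.io
```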
> - Enforcing single ownership of CRDs (must be ensured out-of-band)
> - Providing a user interface for managing compatibility requirements
> - Resolving compatibility conflicts between different actors
Should we add as a non-goal that this proposal won't handle how validating/mutating webhooks should be deployed? (e.g. they should have proper selectors so they don't affect other namespaces). Example:

- Validating/Mutating webhooks for CAPI deployed in the `openshift-cluster-api` namespace
- Validating/Mutating webhooks for CAPI deployed in the `foo` namespace
Yeah so we expect the person who takes over management of the CRDs (adopting manager) to be responsible for this going forward, unless this is clarified later in this document, we should clarify what we expect of the adopting managers once they have taken over.
> ```yaml
> crdRef: clusters.cluster.x-k8s.io
> creatorDescription: "OpenShift Cluster CAPI Operator"
> compatibilityCRD: |
>   ...
>   <complete YAML document of Cluster CRD from transport config map>
>   ...
> crdAdmitAction: Enforce
> ```
I wonder if this should be (I'm not convinced myself here):

```suggestion
creatorDescription: "OpenShift Cluster CAPI Operator"
crd:
  ref: clusters.cluster.x-k8s.io
  action: Enforce|Warn
  compatibility: |
    ...
```

Feels otherwise a bit inconsistent to have:
- `.spec.crdAdmitAction`
- `.spec.objectSchemaValidation.action`

Note: the above also feels a bit inconsistent, because ref and compatibility are also used for the objectSchemaValidation later...
What if it were something like

```yaml
target:
  name: my.crd.io
  compatibilitySchema: ...
crdSchema:
  action: Enforce|Warn
objectSchema:
  action: Enforce|Warn
```
That reads way better :-)
Incidentally @JoelSpeed I'm hesitant to use `compatibilitySchema`. A CRD contains a schema, but we need the whole CRD, including all its k8s metadata, as YAML. I wonder if we could make that more obvious somehow.
The following fields are 'top level', meaning that they're always required and they apply equally to CRD validation and object validation:

- `crdRef`
- `creatorDescription`
- `compatibilityCRD`

CRD validation only:

- `crdAdmitAction`

Object validation only:

- `action`
- `namespaceSelector`
- `objectSelector`
- `matchConditions`

How about:

```yaml
crd:
  name: my.crd.io
compatibilityCRD:
  yaml: |
    ...
creatorDescription: "Cloud Team"
crdSchemaValidation:
  action: Enforce|Warn
objectSchemaValidation:
  action: Enforce|Warn
  namespaceSelector:
    ...
  matchConditions:
    ...
```
If `compatibilityCRD` is an object, what if we got rid of `crd` and moved name inside it?

```yaml
compatibilityCRD:
  name: my.crd.io
  yaml: |
    ...
creatorDescription: "ABC"
crdSchemaValidation:
  action: Enforce|Warn
objectSchemaValidation:
  action: Enforce|Warn
  namespaceSelector:
    ...
  matchConditions:
    ...
```
...or parsed the name from the yaml, and put it in status?
I'm open to that, with a validation that parsing the name from yaml matches the status entry on all future writes?
Only makes sense if we can use a status field as index field. I never tried that, but I guess it should work?! (starting to check). Note: tested, using a status field as index works, but comes with the effect that we first need a reconciliation before the webhook really makes use of the requirement (which maybe also is a good thing?).
> **Mitigation**:
> This would be a bug in the tool.
> We expect the initial and primary non-payload customer to be HCP.
> We will coordinate with HCP to ensure that potential issues show up early in their CI pipeline.
There is already a case we experienced before between MCE->CAPA & HyperShift->CAPA. Testing will be mandatory.
> The benefits of enabling safe multi-actor CRD management outweigh these drawbacks, and the system is designed to fail gracefully with clear error messages.
>
> ## Alternatives (Not Implemented)
Since this proposal lets the CRD be owned and changed by an external workload (e.g. HCP), why is using CRD versions with `conversion.strategy: Webhook` not good enough to handle this case?
To add more thoughts: using CRD `conversion.strategy: Webhook` may require effort in the API version conversions, e.g. v1alpha1 -> v1beta2 & v1beta2 -> v1alpha1, but with that in place we should be able to handle this case?
API versions are a use case covered by this tool, but they aren't the primary focus. We're more concerned with 2 different versions of api version v1, where the later one may have more fields and allow more values in its enums.
We'd want to prevent an upgrade to a CRD which removed an old API version.
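To illustrate the point about two revisions of the same API version: the change sketched below is invisible to conversion webhooks, because both CRD revisions serve only v1beta1, yet an operand built against revision A could be presented with the new enum value from revision B. The field and enum values are illustrative only:

```yaml
# Two revisions of the same CRD, both serving only v1beta1.
# Revision A's schema for an illustrative field:
strategy:
  type: string
  enum: [RollingUpdate]
# Revision B widens the enum in place; no API version changes, so no
# conversion webhook is ever involved:
strategy:
  type: string
  enum: [RollingUpdate, OnDelete]   # a value older operands may not handle
```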
> * not Compatible - current CRD is incompatible with CCAPIO requirements
>
> HCP uses an as-yet-undefined mechanism to inform CCAPIO of the set of CRDs which are no longer managed by CCAPIO.
> For CRDs in this list, CCAPIO creates a CRDCompatibilityRequirement but does not load or update the CRD.
Would you expand on that using a real example from the CAPI APIs, e.g. cluster/machine? Is there a case between CAPI API versions (e.g. v1alpha1 & v1beta2) that leads to breaking changes?
Force-pushed from ff9313e to 3e98f90 (Compare).
When we have previously discussed upgrades, we have also talked about how the CAPI operator will not roll out new operand versions until it is sure that those are compatible, I don't see that covered here. I also missed the part we previously discussed about current, desired and future. There seems to be no mention of the desired phase (which I guess is part of the new operand roll out), did we end up deciding that this wasn't needed?
> #### Cluster CAPI Operator
>
> * Stops applying updates to the CRD on behalf of the core payload.
Configures CRD compatibility to ensure its operands are compatible with the current and future versions of the CRD?
Cluster CAPI Operator is also a CRD User. This is defined there. I noted that in the CRD user section, but I might switch the backreference into 2 forward references to make it clearer.
> Note that the adopting manager and Cluster CAPI Operator are also expected to be CRD users.
> There may be additional CRD users.
> For example when the adopting manager is ACM and deploys HyperShift, HyperShift will also be a CRD user.
When ACM deploys HyperShift, does the HyperShift operator install the CAPI CRDs, or does ACM? Have we checked this?
@serngawy Can you answer this?
> The CRD Compatibility Checker can also perform schema validation on objects against a custom schema.
> The Cluster CAPI Operator will configure CRD Compatibility Checker to perform schema validation on objects of unmanaged CRDs used by its operands.
> This will ensure objects presented to Cluster CAPI Operator operands conform to the expected schema rather than a later version of the schema which may permit, for example, addition fields or values.
```suggestion
This will ensure objects presented to Cluster CAPI Operator operands conform to the expected schema rather than a later version of the schema which may permit, for example, additional fields or values.
```
> ```yaml
> # the corresponding ValidatingWebhookConfiguration. Their definitions and
> # semantics are therefore identical to those in
> # ValidatingWebhookConfiguration.
> namespaceSelector:
> ```
Do we allow this to be omitted? And if we do, does that mean all namespaces?
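For comparison, in `ValidatingWebhookConfiguration` an omitted `namespaceSelector` is treated as the empty selector, which matches all namespaces; scoping the webhook down requires writing an explicit selector. A minimal sketch restricting to a single namespace (by its well-known `kubernetes.io/metadata.name` label):

```yaml
# Omitted entirely => all namespaces are matched (empty selector).
# To restrict, an explicit selector is required, for example:
namespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: In
    values:
    - openshift-cluster-api
```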
> ```yaml
> # An update to spec.compatibilitySchema.crdYAML which would cause this value
> # to change once it has been set will be rejected.
> ```
How are you implementing this validation?
Webhook on CRDCompatibilityRequirement itself (it's already implemented, btw). I could write that here, but it felt like an irrelevant detail.
Sigh, webhooks
Didn't find any way to implement in CEL?
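For context: a CEL transition rule can pin a plain field once set, but CEL has no way to parse the embedded CRD YAML to derive the name in the first place, which is why a webhook is needed here. A hedged sketch, assuming a hypothetical plain `crdName` field:

```yaml
# Hypothetical: if the derived name were a plain spec field, CEL could
# enforce immutability without a webhook via a transition rule:
properties:
  crdName:
    type: string
    x-kubernetes-validations:
    - rule: self == oldSelf
      message: crdName is immutable once set
```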
> **Mitigation**:
> CRD updates are infrequent enough that this is unlikely to be a concern.
> For object schema validation there is potential for impact during certain phases of cluster activity.
> We will optimize performance beyond the initial implementation only if it proves necessary.
Have we considered implementing any metrics to track the impact the webhooks are going to have?
No, but we could. However, is there some existing metric, e.g. length of time to serve an api request, that would indicate a problem here if one emerged? Adding metrics isn't hard, but they are expensive in themselves so I'd rather not.
I expect that metric does exist yes, but I don't know if that will proportion correctly where the time is being spent
Force-pushed from 3e98f90 to 90e38d5 (Compare).
@mdbooth: The following test failed.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.