NE-1953: Add Validating Admission Policy for Gateway API CRDs #1192
@alebedev87: This pull request references NE-1953 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
reason: Forbidden
# Verify that the request was sent from a pod. The presence of both "node" and "pod" claims implies that the token is bound to a pod object.
# Ref: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#schema-for-service-account-private-claims.
- expression: "has(request.userInfo.extra) && ('authentication.kubernetes.io/node-name' in request.userInfo.extra) && ('authentication.kubernetes.io/pod-name' in request.userInfo.extra)"
Would the node claim alone be enough?
I suspect so, but I think this is extra safe and I don't see a need to remove the pod name check
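For readers less familiar with CEL, the check discussed in this thread can be sketched in plain Go. This is only an illustration of the logic, not code from the PR: the function name `isPodBound` and the sample node and pod names are hypothetical, and the real policy evaluates the CEL expression inside the API server.

```go
package main

import "fmt"

// Keys of the service account private claims referenced by the policy expression.
const (
	nodeNameKey = "authentication.kubernetes.io/node-name"
	podNameKey  = "authentication.kubernetes.io/pod-name"
)

// isPodBound (hypothetical helper) mirrors the CEL expression: it reports
// whether both the node and the pod claims are present in userInfo.extra.
// A nil map stands in for has(request.userInfo.extra) being false.
func isPodBound(extra map[string][]string) bool {
	if extra == nil {
		return false
	}
	_, hasNode := extra[nodeNameKey]
	_, hasPod := extra[podNameKey]
	return hasNode && hasPod
}

func main() {
	inCluster := map[string][]string{
		nodeNameKey: {"worker-0"},         // hypothetical node name
		podNameKey:  {"ingress-operator"}, // hypothetical pod name
	}
	fmt.Println(isPodBound(inCluster)) // true
	fmt.Println(isPodBound(nil))       // false
}
```

Dropping the pod name check, as the question suggests, would amount to returning `hasNode` alone.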
/assign @JoelSpeed
To get quick feedback on the shape of the VAP. I'm particularly interested in this question.
/retitle [WIP] NE-1953: Add Validating Admission Policy for Gateway API CRDs
Testing is TBD.
JoelSpeed
left a comment
At some point, you're going to want to debug why the CIO is broken. Running the CIO locally and connecting it to the cluster won't pass these checks, and will therefore fail.
What is the local development/debugging option once this is merged? Has that been considered?
Otherwise this LGTM
Right, a CIO running outside the cluster won't be able to manage GWAPI CRDs after this PR is merged, unless we scale down the CVO and remove the VAP or its binding. That may be an option for debugging, as we are supposed to be privileged enough for this.
ac403a5 to c1ae832
test/e2e/crd_admission_test.go
Outdated
} else if !gwapiEnabled {
t.Skipf("Skipping Test_GatewayAPICRDAdmissionOutsideCluster when GatewayAPI featuregate is not enabled")
Probably my own misunderstanding: We are deploying the VAP outside of the CIO, so we should be testing that the CRDs are protected regardless of the state of the gate?
The life cycle of a feature gate is bound to the feature set it belongs to. There are 3 main sets (in order of maturity): DevPreview, TechPreview, Default. A feature gate may belong to multiple sets, but it gets installed by default on an OCP cluster only when it reaches the Default set. At that point it becomes part of the OCP payload, but that doesn't mean the feature gate disappears: the feature gate is still there and enabled. Since Gateway API (and OSSM) management is currently a TechPreview feature, we would like the VAP to be seen only on TechPreview clusters (not impacting the current OCP payload). When the GatewayAPI feature gate is promoted to GA (end of 4.10 dev cycle), the management of Gateway API CRDs will move into the Default set, and this test will keep working because the GatewayAPI feature gate will remain.
OpenShift feature gates are different from Kubernetes ones: they don't serve to toggle a piece of code on or off but rather aim to protect the default OCP payload from immature code.
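The relationship between feature gates and feature sets described above can be sketched as a toy model in Go. All names here (`FeatureSet`, `gateEnabled`, `promote`) are hypothetical and are not the OpenShift API; the model only assumes what the comment states, namely that a gate can belong to several sets and stays present after promotion to Default.

```go
package main

import "fmt"

// FeatureSet is a hypothetical type naming the three sets from the comment.
type FeatureSet string

const (
	DevPreview  FeatureSet = "DevPreview"
	TechPreview FeatureSet = "TechPreview"
	Default     FeatureSet = "Default"
)

// gateEnabled reports whether the gate belongs to the cluster's feature set.
func gateEnabled(gates map[string][]FeatureSet, gate string, cluster FeatureSet) bool {
	for _, set := range gates[gate] {
		if set == cluster {
			return true
		}
	}
	return false
}

// promote adds the gate to the Default set; the gate itself is not removed.
func promote(gates map[string][]FeatureSet, gate string) {
	gates[gate] = append(gates[gate], Default)
}

func main() {
	// Illustrative membership only, not taken from the real payload.
	gates := map[string][]FeatureSet{"GatewayAPI": {DevPreview, TechPreview}}

	fmt.Println(gateEnabled(gates, "GatewayAPI", TechPreview)) // true
	fmt.Println(gateEnabled(gates, "GatewayAPI", Default))     // false

	promote(gates, "GatewayAPI")
	fmt.Println(gateEnabled(gates, "GatewayAPI", Default)) // true
}
```

Under this model, a test that skips when the gate is disabled keeps running after promotion, because the gate lookup still succeeds.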
test/e2e/crd_admission_test.go
Outdated
// unmanaged CRDs
//makeGWAPICRD("tcproute", "TCPRoute"),
Technically we "manage" all CRDs in the group, at least to the effect that we don't allow anyone else to manage them.
Was the intention here to eventually test these other APIs? Or was it intentional to leave it as a comment to be illustrative for future code readers?
Was the intention here to eventually test these other APIs?
Yes, that was the idea. However, this case is a little more complicated because the test would have to disable (scale down or override) the cluster version operator to be able to create some "unmanaged" CRDs. Also, this would not test any new "code path": the .spec.group check is the same as for the "managed" CRDs. So I'm not fully convinced it's worth the effort.
eff1846 to ae3fffa
/retest
dc45342 to bd1658b
Factorized the VAP management (removal and recovery) into a dedicated
/retest
bd1658b to b6394a0
Cosmetic changes: some comments +
b6394a0 to 415446a
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: candita The full list of commands accepted by this bot can be found here. The pull request process is described here
a314182 to b92609a
After a search in Slack, it seems to be a recurring issue related to a race condition in one of the controllers responsible for the UID annotations (example). Most people just (suspense...) rerun their jobs 😮.
This commit introduces a validating admission policy that restricts modifications to CRDs from the "gateway.networking.k8s.io" group, allowing only the cluster ingress operator's service account to make changes. This ensures that the core OpenShift retains exclusive ownership of the Gateway API implementation, preventing modifications from any add-ons (whether third-party or Red Hat-owned). The policy and its corresponding binding will be managed by CVO.
TestGatewayAPI creates a test ServiceMeshControlPlane (SMCP) resource, which triggers the creation of a mutating webhook for all pods in the cluster. This introduces a race condition where any pod creation request between the webhook's creation and the SMCP pod becoming ready can fail with:
failed calling webhook "sidecar-injector.istio.io": failed to call webhook: no endpoints available for service "istiod-openshift-gateway"
Serializing the test ensures it runs in isolation from other tests, preventing the mutating webhook from affecting pod creation in the cluster.
Additionally, the timeout for the e2e tests has been increased to 1.5 hours since TestGatewayAPI adds ~10 minutes to the total execution time. Previously, tests ran for ~50 minutes, which was close to the 1-hour limit.
b92609a to 8e73871
Addressed #1192 (comment).
Did you file a bug report in Jira? grin.
Some problems with the CI image registry:
/retest
/retest
No. Will do if I see it more than once. This was the only time so far.
Seems like a recurrence of https://issues.redhat.com/browse/OCPBUGS-17671.
/test e2e-aws-operator-techpreview
/lgtm
Two failures in the e2e-aws-operator-techpreview. First one you already noted: Was this one there before? It's got to be some other new techpreview feature if e2e-aws-gatewayapi is fine, don't you think?
/retest
@alebedev87: all tests passed!
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-ingress-operator |
This PR introduces a validating admission policy that restricts modifications to CRDs from the "gateway.networking.k8s.io" group, allowing only the cluster ingress operator's service account to make changes.
This ensures that the core OpenShift retains exclusive ownership of the Gateway API implementation, preventing modifications from any add-ons (whether third-party or Red Hat-owned).
The policy and its corresponding binding will be managed by CVO.
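The ownership rule described here can be sketched as a small decision function in Go. This is an assumption-laden illustration rather than the policy itself: the real rule is expressed in CEL and enforced by the API server, the `allowed` helper is hypothetical, and the service account username shown is an assumed value for the cluster ingress operator.

```go
package main

import "fmt"

const (
	// API group named in the PR description.
	gatewayAPIGroup = "gateway.networking.k8s.io"
	// Assumed username of the cluster ingress operator's service account.
	operatorUser = "system:serviceaccount:openshift-ingress-operator:ingress-operator"
)

// allowed (hypothetical helper) sketches the ownership rule: CRDs outside
// the Gateway API group are not restricted, while CRDs inside the group
// may only be modified by the operator's service account.
func allowed(crdGroup, username string) bool {
	if crdGroup != gatewayAPIGroup {
		return true
	}
	return username == operatorUser
}

func main() {
	fmt.Println(allowed(gatewayAPIGroup, operatorUser))   // true
	fmt.Println(allowed(gatewayAPIGroup, "system:admin")) // false
	fmt.Println(allowed("example.com", "system:admin"))   // true
}
```

Because the decision depends only on the group, the same path applies to every CRD in it, whether or not the operator actually installs that CRD.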