NE-1953: Add Validating Admission Policy for Gateway API CRDs #1192

alebedev87 · 2025-02-20T09:45:37Z

This PR introduces a validating admission policy that restricts modifications to CRDs from the "gateway.networking.k8s.io" group, allowing only the cluster ingress operator's service account to make changes.

This ensures that the core OpenShift retains exclusive ownership of the Gateway API implementation, preventing modifications from any add-ons (whether third-party or Red Hat-owned).

The policy and its corresponding binding will be managed by CVO.

openshift-ci-robot · 2025-02-20T09:45:43Z

@alebedev87: This pull request references NE-1953 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR introduces a validating admission policy that restricts modifications to CRDs from the "gateway.networking.k8s.io" group, allowing only the cluster ingress operator's service account to make changes.

This ensures that the core OpenShift retains exclusive ownership of the Gateway API implementation, preventing modifications from any add-ons (whether third-party or Red Hat-owned).

The policy and its corresponding binding will be managed by CVO.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

alebedev87 · 2025-02-20T09:47:42Z

manifests/01-validating-admission-policy.yaml

+      reason: Forbidden
+    # Verify that the request was sent from a pod. The presence of both "node" and "pod" claims implies that the token is bound to a pod object.
+    # Ref: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#schema-for-service-account-private-claims.
+    - expression: "has(request.userInfo.extra) && ('authentication.kubernetes.io/node-name' in request.userInfo.extra) && ('authentication.kubernetes.io/pod-name' in request.userInfo.extra)"


Would the node claim alone be enough?

I suspect so, but I think this is extra safe and I don't see a need to remove the pod name check

alebedev87 · 2025-02-20T09:49:19Z

/assign @JoelSpeed

To get a quick feedback on the shape of the VAP. I'm particularly interested in this question.

alebedev87 · 2025-02-20T09:50:10Z

/retitle [WIP] NE-1953: Add Validating Admission Policy for Gateway API CRDs

Testing is TBD.

JoelSpeed

At some point, you're going to want to debug why the CIO is broken. Running the CIO locally and connecting it to the cluster won't pass these checks, and will therefore fail.

What is the local development/debugging option once this is merged? Has that been considered?

Otherwise this LGTM

JoelSpeed · 2025-02-20T09:59:14Z

manifests/01-validating-admission-policy.yaml

+      reason: Forbidden
+    # Verify that the request was sent from a pod. The presence of both "node" and "pod" claims implies that the token is bound to a pod object.
+    # Ref: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#schema-for-service-account-private-claims.
+    - expression: "has(request.userInfo.extra) && ('authentication.kubernetes.io/node-name' in request.userInfo.extra) && ('authentication.kubernetes.io/pod-name' in request.userInfo.extra)"


I suspect so, but I think this is extra safe and I don't see a need to remove the pod name check

alebedev87 · 2025-02-20T10:55:55Z

At some point, you're going to want to debug why the CIO is broken. Running the CIO locally and connecting it to the cluster won't pass these checks, and will therefore fail.

What is the local development/debugging option once this is merged? Has that been considered?

Right, CIO running outside the cluster won't be able to manage GWAPI CRDs after this PR is merged. Unless we scale down the CVO and remove the VAP or its binding which may be an option for debugging as we are supposed to be privileged enough for this.
I cannot talk for everybody for sure but from what I know the CIO debugging is always done inside a running cluster: we scale down the CVO and replace the CIO image with the one we test (sub-optiomal, I agree).

shaneutt · 2025-02-25T14:14:48Z

test/e2e/crd_admission_test.go

+	} else if !gwapiEnabled {
+		t.Skipf("Skipping Test_GatewayAPICRDAdmissionOutsideCluster when GatewayAPI featuregate is not enabled")


Probably my own misunderstanding: We are deploying the VAP outside of the CIO, so we should be testing that the CRDs are protected regardless of the state of the gate?

The life cycle of a feature gate is bound to the feature set it belongs to. There are 3 main sets (in the order of maturity): DevPreview, TechPreview, Default. A feature gate may belong to multiple sets but it gets installed by default on OCP cluster only when it gets into Default set. In that case, it becomes a part of the OCP payload but it doesn't mean that the feature gate disappears, the feature gate is still there and enabled. Since currently the gateway API (and OSSM) management is a TechPreview feature we would like VAP to be only seen on TechPreview clusters (not impacting the current OCP payload). However when GatewayAPI featuregate will be promoted to GA (end of 4.10 dev cycle) the management of GatewAPI CRDs will get into Default set however this test will still be working because the GatewayAPI feature gate will remain.

OpenShift featuregates are different from Kubernetes ones, they don't serve to toggle a piece of code on or off but rather aim at protecting the default OCP payload from immature code.

shaneutt · 2025-02-26T15:13:04Z

test/e2e/crd_admission_test.go

+		// unmanaged CRDs
+		//makeGWAPICRD("tcproute", "TCPRoute"),


Technically we "manage" all CRDs in the group, at least to the effect that we don't allow anyone else to manage them.

Was the intention here to eventually test these other APIs? Or was it intentional to leave it as a comment to be illustrative for future code readers?

Was the intention here to eventually test these other APIs?

Yes, that was the idea. However this case is a little more complicated because the test will have to disable (scale down or override) the cluster version operator to be able to create some "unmanaged" CRDs. Also, this will not test any new "code path": the .spec.group check is the same as for the "managed" CRDs. So I'm not fully convinced it's worth the effort.

alebedev87 · 2025-02-27T22:36:15Z

/retest

alebedev87 · 2025-02-28T10:49:47Z

Factorized the VAP management (removal and recovery) into a dedicated struct.

alebedev87 · 2025-02-28T13:39:13Z

/retest

alebedev87 · 2025-03-03T09:53:15Z

Cosmetic changes: some comments + vapManager was made private.

openshift-ci · 2025-03-10T22:43:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [candita]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-03-10T23:31:07Z

/retest-required

Remaining retests: 0 against base HEAD 69cadab and 2 for PR HEAD a314182 in total

openshift-ci-robot · 2025-03-11T03:31:06Z

/retest-required

Remaining retests: 0 against base HEAD e80fa46 and 1 for PR HEAD a314182 in total

alebedev87 · 2025-03-11T10:17:06Z

e2e-aws-operator-techpreview failed to create a pod for HTTPRoute due to unable to find annotation openshift.io/sa.scc.uid-range error:

=== RUN   TestAll/serial/TestGatewayAPI/testGatewayAPIObjects
    gateway_api_test.go:88: Creating namespace "test-e2e-gwapi-ws7d6"...
    gateway_api_test.go:88: Waiting for ServiceAccount test-e2e-gwapi-ws7d6/default to be provisioned...
    gateway_api_test.go:88: Waiting for RoleBinding test-e2e-gwapi-ws7d6/system:image-pullers to be created...
    gateway_api_test.go:88: failed to create one or more gateway object/s: feature gate was enabled, but http route object could not be created: failed to create pod test-e2e-gwapi-ws7d6/test-gateway-openshift-default: pods "test-gateway-openshift-default" is forbidden: error fetching namespace "test-e2e-gwapi-ws7d6": unable to find annotation openshift.io/sa.scc.uid-range

After a search in Slack, it seems like it's a recurring issue related to some race condition in one of the controllers responsible for the UID annotations (example). Most of the people just (suspense...) rerun their jobs 😮.

This commit introduces a validating admission policy that restricts modifications to CRDs from the "gateway.networking.k8s.io" group, allowing only the cluster ingress operator's service account to make changes. This ensures that the core OpenShift retains exclusive ownership of the Gateway API implementation, preventing modifications from any add-ons (whether third-party or Red Hat-owned). The policy and its corresponding binding will be managed by CVO.

TestGatewayAPI creates a test ServiceMeshControlPlane (SMCP) resource, which triggers the creation of a mutating webhook for all pods in the cluster. This introduces a race condition where any pod creation request between the webhook's creation and the SMCP pod becoming ready can fail with: failed calling webhook "sidecar-injector.istio.io": failed to call webhook: no endpoints available for service "istiod-openshift-gateway" Serializing the test ensures it runs in isolation with other tests, preventing any impact of the mutating webhook on pod creation in the cluster. Additionally, the timeout for the e2e tests has been increased to 1.5 hours since TestGatewayAPI adds ~10 minutes to the total execution time. Previously, tests ran for ~50 minutes, which was close to the 1-hour limit.

alebedev87 · 2025-03-11T10:22:41Z

Addressed #1192 (comment).

Miciah · 2025-03-11T13:43:43Z

After a search in Slack, it seems like it's a recurring issue related to some race condition in one of the controllers responsible for the UID annotations (example). Most of the people just (suspense...) rerun their jobs 😮.

Did you file a bug report in Jira? grin.

alebedev87 · 2025-03-11T14:32:41Z

Some problems with the CI image registry:

 * 2025-03-11T12:32:44Z 242x kubelet: Back-off pulling image "registry.ci.openshift.org/ci/entrypoint-wrapper:latest"
* 2025-03-11T11:38:44Z 4x kubelet: Error: ImagePullBackOff

/retest

alebedev87 · 2025-03-11T14:42:30Z

/retest

alebedev87 · 2025-03-11T16:38:41Z

Did you file a bug report in Jira? grin.

No. Will do if I will see it more than once. This time was the only one so far.

alebedev87 · 2025-03-11T18:11:48Z

=== NAME  TestAll/parallel/TestManagedDNSToUnmanagedDNSIngressController
    util_test.go:714: verified connectivity with workload with req http://a6bc4691b16974830b41cb0cc8598e2f-661557309.us-west-2.elb.amazonaws.com and response 200
    unmanaged_dns_test.go:147: Updating ingresscontroller managed-migrated to dnsManagementPolicy=Unmanaged
    unmanaged_dns_test.go:160: Waiting for stable conditions on ingresscontroller managed-migrated after dnsManagementPolicy=Unmanaged
    unmanaged_dns_test.go:176: verifying conditions on DNSRecord zone {ID: Tags:map[Name:ci-op-4fxxjl7i-9e7c5-s6hzp-int kubernetes.io/cluster/ci-op-4fxxjl7i-9e7c5-s6hzp:owned]}
    unmanaged_dns_test.go:176: DNSRecord zone expected to have status=Unknown but got status=True

Seems like a reoccurred of https://issues.redhat.com/browse/OCPBUGS-17671.

/test e2e-aws-operator-techpreview

candita · 2025-03-12T00:27:23Z

/lgtm

candita · 2025-03-12T00:35:02Z

Two failures in the e2e-aws-operator-techpreview. First one you already noted:
RUN TestAll/serial/TestGatewayAPI/testGatewayAPIObjects
.. gateway_api_test.go:88: failed to create one or more gateway object/s: feature gate was enabled, but http route object could not be created: failed to create pod test-e2e-gwapi-t8nzn/test-gateway-openshift-default: pods "test-gateway-openshift-default" is forbidden: error fetching namespace "test-e2e-gwapi-t8nzn": unable to find annotation openshift.io/sa.scc.uid-range

Was this one there before?
=== RUN TestAll/serial/TestGatewayAPI/testGatewayAPIIstioInstallation
util_gatewayapi_test.go:400: found subscription servicemeshoperator at installed version
util_gatewayapi_test.go:748: found catalogSource redhat-operators with last observed state READY
util_gatewayapi_test.go:461: failed to get deployment openshift-operators/istio-operator, retrying...
[repeated, trying]
util_gatewayapi_test.go:461: failed to get deployment openshift-operators/istio-operator, retrying...
gateway_api_test.go:89: failed to find expected Istio operator: OSSM operator failure: pod istio-operator-75bb585bb-bktzx is not running, it is Pending

It's got to be some other new techpreview feature if e2e-aws-gatewayapi is fine, don't you think?

candita · 2025-03-12T00:35:12Z

/retest

openshift-ci-robot · 2025-03-12T02:15:53Z

/retest-required

Remaining retests: 0 against base HEAD e80fa46 and 2 for PR HEAD 8e73871 in total

openshift-ci · 2025-03-12T04:02:12Z

@alebedev87: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-bot · 2025-03-12T05:11:35Z

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.19.0-202503120440.p0.g257abf3.assembly.stream.el9.
All builds following this will include this PR.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 20, 2025

openshift-ci bot requested review from candita and frobware February 20, 2025 09:46

alebedev87 commented Feb 20, 2025

View reviewed changes

openshift-ci bot assigned JoelSpeed Feb 20, 2025

openshift-ci bot changed the title ~~NE-1953: Add Validating Admission Policy for Gateway API CRDs~~ [WIP] NE-1953: Add Validating Admission Policy for Gateway API CRDs Feb 20, 2025

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2025

JoelSpeed reviewed Feb 20, 2025

View reviewed changes

alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch 2 times, most recently from ac403a5 to c1ae832 Compare February 25, 2025 11:18

shaneutt reviewed Feb 25, 2025

View reviewed changes

shaneutt approved these changes Feb 26, 2025

View reviewed changes

shaneutt reviewed Feb 26, 2025

View reviewed changes

alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch 3 times, most recently from eff1846 to ae3fffa Compare February 27, 2025 14:12

alebedev87 changed the title ~~[WIP] NE-1953: Add Validating Admission Policy for Gateway API CRDs~~ NE-1953: Add Validating Admission Policy for Gateway API CRDs Feb 27, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 27, 2025

alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch 2 times, most recently from dc45342 to bd1658b Compare February 28, 2025 10:48

alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch from bd1658b to b6394a0 Compare March 3, 2025 09:48

alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch from b6394a0 to 415446a Compare March 3, 2025 09:55

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2025

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2025

alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch from a314182 to b92609a Compare March 11, 2025 06:47

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 11, 2025

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2025

lihongan mentioned this pull request Mar 11, 2025

NE-1968: add e2e tests for FeatureGate GatewayAPI openshift/origin#29597

Merged

alebedev87 mentioned this pull request Mar 11, 2025

NE-1951: Pre-upgrade Admin Gate for Gateway API CRD Management Succession #1196

Merged

alebedev87 added 2 commits March 11, 2025 11:19

alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch from b92609a to 8e73871 Compare March 11, 2025 10:22

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 12, 2025

openshift-merge-bot bot merged commit 257abf3 into openshift:master Mar 12, 2025
19 checks passed

Miciah mentioned this pull request Mar 12, 2025

NE-1934: Bump to OSSM 3.0 for Gateway API support #1152

Merged

		} else if !gwapiEnabled {
		t.Skipf("Skipping Test_GatewayAPICRDAdmissionOutsideCluster when GatewayAPI featuregate is not enabled")

NE-1953: Add Validating Admission Policy for Gateway API CRDs #1192

NE-1953: Add Validating Admission Policy for Gateway API CRDs #1192

Uh oh!

Conversation

alebedev87 commented Feb 20, 2025

Uh oh!

openshift-ci-robot commented Feb 20, 2025 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alebedev87 Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

JoelSpeed Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

alebedev87 commented Feb 20, 2025

Uh oh!

alebedev87 commented Feb 20, 2025 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoelSpeed left a comment

Choose a reason for hiding this comment

Uh oh!

JoelSpeed Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

alebedev87 commented Feb 20, 2025

Uh oh!

shaneutt Feb 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alebedev87 Feb 26, 2025

Choose a reason for hiding this comment

Uh oh!

shaneutt Feb 26, 2025

Choose a reason for hiding this comment

Uh oh!

alebedev87 Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alebedev87 commented Feb 27, 2025

Uh oh!

alebedev87 commented Feb 28, 2025

Uh oh!

alebedev87 commented Feb 28, 2025

Uh oh!

alebedev87 commented Mar 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Mar 10, 2025

Uh oh!

openshift-ci-robot commented Mar 10, 2025

Uh oh!

openshift-ci-robot commented Mar 11, 2025

Uh oh!

alebedev87 commented Mar 11, 2025

Uh oh!

alebedev87 commented Mar 11, 2025

Uh oh!

Miciah commented Mar 11, 2025

Uh oh!

alebedev87 commented Mar 11, 2025

Uh oh!

alebedev87 commented Mar 11, 2025

Uh oh!

alebedev87 commented Mar 11, 2025

Uh oh!

alebedev87 commented Mar 11, 2025

Uh oh!

candita commented Mar 12, 2025

Uh oh!

candita commented Mar 12, 2025

Uh oh!

candita commented Mar 12, 2025

Uh oh!

openshift-ci-robot commented Mar 12, 2025

Uh oh!

openshift-ci bot commented Mar 12, 2025

openshift-ci-robot commented Feb 20, 2025 •

edited by openshift-ci bot

Loading

alebedev87 commented Feb 20, 2025 •

edited by openshift-ci bot

Loading

shaneutt Feb 25, 2025 •

edited

Loading

alebedev87 Feb 27, 2025 •

edited

Loading

alebedev87 commented Mar 3, 2025 •

edited

Loading