Skip to content

Conversation

@alebedev87
Copy link
Contributor

This PR introduces a validating admission policy that restricts modifications to CRDs from the "gateway.networking.k8s.io" group, allowing only the cluster ingress operator's service account to make changes.

This ensures that the core OpenShift retains exclusive ownership of the Gateway API implementation, preventing modifications from any add-ons (whether third-party or Red Hat-owned).

The policy and its corresponding binding will be managed by CVO.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 20, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Feb 20, 2025

@alebedev87: This pull request references NE-1953 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR introduces a validating admission policy that restricts modifications to CRDs from the "gateway.networking.k8s.io" group, allowing only the cluster ingress operator's service account to make changes.

This ensures that the core OpenShift retains exclusive ownership of the Gateway API implementation, preventing modifications from any add-ons (whether third-party or Red Hat-owned).

The policy and its corresponding binding will be managed by CVO.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from candita and frobware February 20, 2025 09:46
reason: Forbidden
# Verify that the request was sent from a pod. The presence of both "node" and "pod" claims implies that the token is bound to a pod object.
# Ref: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#schema-for-service-account-private-claims.
- expression: "has(request.userInfo.extra) && ('authentication.kubernetes.io/node-name' in request.userInfo.extra) && ('authentication.kubernetes.io/pod-name' in request.userInfo.extra)"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the node claim alone be enough?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect so, but I think this is extra safe and I don't see a need to remove the pod name check

@alebedev87
Copy link
Contributor Author

/assign @JoelSpeed

To get a quick feedback on the shape of the VAP. I'm particularly interested in this question.

@alebedev87
Copy link
Contributor Author

alebedev87 commented Feb 20, 2025

/retitle [WIP] NE-1953: Add Validating Admission Policy for Gateway API CRDs

Testing is TBD.

@openshift-ci openshift-ci bot changed the title NE-1953: Add Validating Admission Policy for Gateway API CRDs [WIP] NE-1953: Add Validating Admission Policy for Gateway API CRDs Feb 20, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2025
Copy link

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, you're going to want to debug why the CIO is broken. Running the CIO locally and connecting it to the cluster won't pass these checks, and will therefore fail.

What is the local development/debugging option once this is merged? Has that been considered?

Otherwise this LGTM

reason: Forbidden
# Verify that the request was sent from a pod. The presence of both "node" and "pod" claims implies that the token is bound to a pod object.
# Ref: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#schema-for-service-account-private-claims.
- expression: "has(request.userInfo.extra) && ('authentication.kubernetes.io/node-name' in request.userInfo.extra) && ('authentication.kubernetes.io/pod-name' in request.userInfo.extra)"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect so, but I think this is extra safe and I don't see a need to remove the pod name check

@alebedev87
Copy link
Contributor Author

At some point, you're going to want to debug why the CIO is broken. Running the CIO locally and connecting it to the cluster won't pass these checks, and will therefore fail.

What is the local development/debugging option once this is merged? Has that been considered?

Right, CIO running outside the cluster won't be able to manage GWAPI CRDs after this PR is merged. Unless we scale down the CVO and remove the VAP or its binding which may be an option for debugging as we are supposed to be privileged enough for this.
I cannot talk for everybody for sure but from what I know the CIO debugging is always done inside a running cluster: we scale down the CVO and replace the CIO image with the one we test (sub-optiomal, I agree).

@alebedev87 alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch 2 times, most recently from ac403a5 to c1ae832 Compare February 25, 2025 11:18
Comment on lines 23 to 24
} else if !gwapiEnabled {
t.Skipf("Skipping Test_GatewayAPICRDAdmissionOutsideCluster when GatewayAPI featuregate is not enabled")
Copy link
Member

@shaneutt shaneutt Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably my own misunderstanding: We are deploying the VAP outside of the CIO, so we should be testing that the CRDs are protected regardless of the state of the gate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The life cycle of a feature gate is bound to the feature set it belongs to. There are 3 main sets (in the order of maturity): DevPreview, TechPreview, Default. A feature gate may belong to multiple sets but it gets installed by default on OCP cluster only when it gets into Default set. In that case, it becomes a part of the OCP payload but it doesn't mean that the feature gate disappears, the feature gate is still there and enabled. Since currently the gateway API (and OSSM) management is a TechPreview feature we would like VAP to be only seen on TechPreview clusters (not impacting the current OCP payload). However when GatewayAPI featuregate will be promoted to GA (end of 4.10 dev cycle) the management of GatewAPI CRDs will get into Default set however this test will still be working because the GatewayAPI feature gate will remain.

OpenShift featuregates are different from Kubernetes ones, they don't serve to toggle a piece of code on or off but rather aim at protecting the default OCP payload from immature code.

Comment on lines 33 to 34
// unmanaged CRDs
//makeGWAPICRD("tcproute", "TCPRoute"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we "manage" all CRDs in the group, at least to the effect that we don't allow anyone else to manage them.

Was the intention here to eventually test these other APIs? Or was it intentional to leave it as a comment to be illustrative for future code readers?

Copy link
Contributor Author

@alebedev87 alebedev87 Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the intention here to eventually test these other APIs?

Yes, that was the idea. However this case is a little more complicated because the test will have to disable (scale down or override) the cluster version operator to be able to create some "unmanaged" CRDs. Also, this will not test any new "code path": the .spec.group check is the same as for the "managed" CRDs. So I'm not fully convinced it's worth the effort.

@alebedev87 alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch 3 times, most recently from eff1846 to ae3fffa Compare February 27, 2025 14:12
@alebedev87 alebedev87 changed the title [WIP] NE-1953: Add Validating Admission Policy for Gateway API CRDs NE-1953: Add Validating Admission Policy for Gateway API CRDs Feb 27, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 27, 2025
@alebedev87
Copy link
Contributor Author

/retest

@alebedev87 alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch 2 times, most recently from dc45342 to bd1658b Compare February 28, 2025 10:48
@alebedev87
Copy link
Contributor Author

Factorized the VAP management (removal and recovery) into a dedicated struct.

@alebedev87
Copy link
Contributor Author

/retest

@alebedev87 alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch from bd1658b to b6394a0 Compare March 3, 2025 09:48
@alebedev87
Copy link
Contributor Author

alebedev87 commented Mar 3, 2025

Cosmetic changes: some comments + vapManager was made private.

@alebedev87 alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch from b6394a0 to 415446a Compare March 3, 2025 09:55
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2025
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 69cadab and 2 for PR HEAD a314182 in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD e80fa46 and 1 for PR HEAD a314182 in total

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2025
@alebedev87 alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch from a314182 to b92609a Compare March 11, 2025 06:47
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 11, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2025
@alebedev87
Copy link
Contributor Author

e2e-aws-operator-techpreview failed to create a pod for HTTPRoute due to unable to find annotation openshift.io/sa.scc.uid-range error:

=== RUN   TestAll/serial/TestGatewayAPI/testGatewayAPIObjects
    gateway_api_test.go:88: Creating namespace "test-e2e-gwapi-ws7d6"...
    gateway_api_test.go:88: Waiting for ServiceAccount test-e2e-gwapi-ws7d6/default to be provisioned...
    gateway_api_test.go:88: Waiting for RoleBinding test-e2e-gwapi-ws7d6/system:image-pullers to be created...
    gateway_api_test.go:88: failed to create one or more gateway object/s: feature gate was enabled, but http route object could not be created: failed to create pod test-e2e-gwapi-ws7d6/test-gateway-openshift-default: pods "test-gateway-openshift-default" is forbidden: error fetching namespace "test-e2e-gwapi-ws7d6": unable to find annotation openshift.io/sa.scc.uid-range

After a search in Slack, it seems like it's a recurring issue related to some race condition in one of the controllers responsible for the UID annotations (example). Most of the people just (suspense...) rerun their jobs 😮.

This commit introduces a validating admission policy that restricts modifications
to CRDs from the "gateway.networking.k8s.io" group, allowing only the cluster
ingress operator's service account to make changes.

This ensures that the core OpenShift retains exclusive ownership of the Gateway API
implementation, preventing modifications from any add-ons (whether third-party or
Red Hat-owned).

The policy and its corresponding binding will be managed by CVO.
TestGatewayAPI creates a test ServiceMeshControlPlane (SMCP) resource,
which triggers the creation of a mutating webhook for all pods in the cluster.
This introduces a race condition where any pod creation request between
the webhook's creation and the SMCP pod becoming ready can fail with:

failed calling webhook "sidecar-injector.istio.io": failed to call webhook:
no endpoints available for service "istiod-openshift-gateway"

Serializing the test ensures it runs in isolation with other tests,
preventing any impact of the mutating webhook on pod creation in the cluster.

Additionally, the timeout for the e2e tests has been increased to 1.5 hours
since TestGatewayAPI adds ~10 minutes to the total execution time.
Previously, tests ran for ~50 minutes, which was close to the 1-hour limit.
@alebedev87 alebedev87 force-pushed the ne-1953-vap-deny-gwapi-crds branch from b92609a to 8e73871 Compare March 11, 2025 10:22
@alebedev87
Copy link
Contributor Author

Addressed #1192 (comment).

@Miciah
Copy link
Contributor

Miciah commented Mar 11, 2025

After a search in Slack, it seems like it's a recurring issue related to some race condition in one of the controllers responsible for the UID annotations (example). Most of the people just (suspense...) rerun their jobs 😮.

Did you file a bug report in Jira? grin.

@alebedev87
Copy link
Contributor Author

Some problems with the CI image registry:

 * 2025-03-11T12:32:44Z 242x kubelet: Back-off pulling image "registry.ci.openshift.org/ci/entrypoint-wrapper:latest"
* 2025-03-11T11:38:44Z 4x kubelet: Error: ImagePullBackOff 

/retest

@alebedev87
Copy link
Contributor Author

/retest

@alebedev87
Copy link
Contributor Author

Did you file a bug report in Jira? grin.

No. Will do if I will see it more than once. This time was the only one so far.

@alebedev87
Copy link
Contributor Author

=== NAME  TestAll/parallel/TestManagedDNSToUnmanagedDNSIngressController
    util_test.go:714: verified connectivity with workload with req http://a6bc4691b16974830b41cb0cc8598e2f-661557309.us-west-2.elb.amazonaws.com and response 200
    unmanaged_dns_test.go:147: Updating ingresscontroller managed-migrated to dnsManagementPolicy=Unmanaged
    unmanaged_dns_test.go:160: Waiting for stable conditions on ingresscontroller managed-migrated after dnsManagementPolicy=Unmanaged
    unmanaged_dns_test.go:176: verifying conditions on DNSRecord zone {ID: Tags:map[Name:ci-op-4fxxjl7i-9e7c5-s6hzp-int kubernetes.io/cluster/ci-op-4fxxjl7i-9e7c5-s6hzp:owned]}
    unmanaged_dns_test.go:176: DNSRecord zone expected to have status=Unknown but got status=True

Seems like a reoccurred of https://issues.redhat.com/browse/OCPBUGS-17671.

/test e2e-aws-operator-techpreview

@candita
Copy link
Contributor

candita commented Mar 12, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 12, 2025
@candita
Copy link
Contributor

candita commented Mar 12, 2025

Two failures in the e2e-aws-operator-techpreview. First one you already noted:
RUN TestAll/serial/TestGatewayAPI/testGatewayAPIObjects
.. gateway_api_test.go:88: failed to create one or more gateway object/s: feature gate was enabled, but http route object could not be created: failed to create pod test-e2e-gwapi-t8nzn/test-gateway-openshift-default: pods "test-gateway-openshift-default" is forbidden: error fetching namespace "test-e2e-gwapi-t8nzn": unable to find annotation openshift.io/sa.scc.uid-range

Was this one there before?
=== RUN TestAll/serial/TestGatewayAPI/testGatewayAPIIstioInstallation
util_gatewayapi_test.go:400: found subscription servicemeshoperator at installed version
util_gatewayapi_test.go:748: found catalogSource redhat-operators with last observed state READY
util_gatewayapi_test.go:461: failed to get deployment openshift-operators/istio-operator, retrying...
[repeated, trying]
util_gatewayapi_test.go:461: failed to get deployment openshift-operators/istio-operator, retrying...
gateway_api_test.go:89: failed to find expected Istio operator: OSSM operator failure: pod istio-operator-75bb585bb-bktzx is not running, it is Pending

It's got to be some other new techpreview feature if e2e-aws-gatewayapi is fine, don't you think?

@candita
Copy link
Contributor

candita commented Mar 12, 2025

/retest

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD e80fa46 and 2 for PR HEAD 8e73871 in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 12, 2025

@alebedev87: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 257abf3 into openshift:master Mar 12, 2025
19 checks passed
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.19.0-202503120440.p0.g257abf3.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.