
Conversation

@alebedev87
Contributor

@alebedev87 alebedev87 commented Mar 26, 2025

Set the ingress cluster operator’s Degraded status to true if unmanaged Gateway API CRDs exist on the cluster.

An unmanaged Gateway API CRD is one with "gateway.networking.k8s.io" or "gateway.networking.x-k8s.io" in its spec.group field, and not managed by the gatewayapi controller.

This PR uses the cluster operator’s status.extension field to store the names of unmanaged CRDs. The gatewayapi controller writes to this field, while the status controller reads it and updates the ingress cluster operator’s Degraded status accordingly.
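For orientation, here is a minimal sketch of the extension payload this describes. The type name IngressOperatorStatusExtension and the field UnmanagedGatewayAPICRDNames appear in the code discussed below; the JSON tag (and its omitempty) is an assumption, not the PR's exact code.

// Sketch only: the payload stored in the cluster operator's status.extension.
package status

type IngressOperatorStatusExtension struct {
	// UnmanagedGatewayAPICRDNames lists Gateway API CRDs present on the
	// cluster but not managed by the gatewayapi controller. With omitempty,
	// the field disappears from the marshaled JSON once the list is empty.
	UnmanagedGatewayAPICRDNames []string `json:"unmanagedGatewayAPICRDNames,omitempty"`
}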

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 26, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 26, 2025

@alebedev87: This pull request references NE-1969 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Set the ingress cluster operator’s Degraded status to true if any unmanaged Gateway API CRDs exist on the cluster.

An unmanaged Gateway API CRD is one with "gateway.networking.k8s.io" or "gateway.networking.x-k8s.io" in its spec.group field, and not managed by the gatewayapi controller.

This PR uses the cluster operator’s status.extension field to store the names of unmanaged CRDs. The gatewayapi controller writes to this field, while the status controller reads it and updates the ingress cluster operator’s Degraded status accordingly.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@alebedev87
Contributor Author

Rebased to catch up with the controllerName change.

@alebedev87 alebedev87 changed the title NE-1969: Set Degraded=True if any unmanaged Gateway API CRDs exist NE-1969: Set Degraded=True if unmanaged Gateway API CRDs exist Mar 27, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 27, 2025

@alebedev87: This pull request references NE-1969 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Set the ingress cluster operator’s Degraded status to true if unmanaged Gateway API CRDs exist on the cluster.

An unmanaged Gateway API CRD is one with "gateway.networking.k8s.io" or "gateway.networking.x-k8s.io" in its spec.group field, and not managed by the gatewayapi controller.

This PR uses the cluster operator’s status.extension field to store the names of unmanaged CRDs. The gatewayapi controller writes to this field, while the status controller reads it and updates the ingress cluster operator’s Degraded status accordingly.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@alebedev87
Contributor Author

/retest

@candita
Contributor

candita commented Mar 27, 2025

/assign

@candita
Contributor

candita commented Mar 28, 2025

/test e2e-aws-operator-techpreview

controllerName = "gatewayapi_controller"
experimantalGatewayAPIGroupName = "gateway.networking.x-k8s.io"
crdAPIGroupIndexFieldName = "crdAPIGroup"
gatewayCRDAPIGroupIndexFieldValue = "gateway"
Contributor

A one-word indexer feels too expansive. Why not at least "gateway.networking"?

Contributor

The controller-runtime documentation actually says to use the field name (in our case, that would be "spec.group"), and the index value should be the field value (for example, "gateway.networking.x-k8s.io"). But we have written a lot of code that doesn't follow the prescribed idiom, and it seems to work fine, so I suspect that the index key can be any string that is unique among indexes within the manager.

https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/client#FieldIndexer
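For reference, a minimal sketch of the idiom that documentation describes, using the field name "spec.group" and the example group value mentioned in this thread; registerCRDGroupIndex and listExperimentalGatewayAPICRDs are hypothetical helpers used only for illustration.

// Sketch of the controller-runtime idiom described above: the index key is the
// literal field path "spec.group" and queries use the real field value.
package gatewayapi

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func registerCRDGroupIndex(ctx context.Context, mgr manager.Manager) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &apiextensionsv1.CustomResourceDefinition{}, "spec.group",
		func(o client.Object) []string {
			crd := o.(*apiextensionsv1.CustomResourceDefinition)
			return []string{crd.Spec.Group}
		})
}

func listExperimentalGatewayAPICRDs(ctx context.Context, cl client.Client) (*apiextensionsv1.CustomResourceDefinitionList, error) {
	crds := &apiextensionsv1.CustomResourceDefinitionList{}
	// The index key and value mirror the object's actual spec.group field.
	if err := cl.List(ctx, crds, client.MatchingFields{"spec.group": "gateway.networking.x-k8s.io"}); err != nil {
		return nil, err
	}
	return crds, nil
}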

Contributor

Makes sense to only look at spec.group.

controllerName = "gatewayapi_controller"
controllerName = "gatewayapi_controller"
experimantalGatewayAPIGroupName = "gateway.networking.x-k8s.io"
crdAPIGroupIndexFieldName = "crdAPIGroup"
Contributor

Aligning these two variable names would be helpful.

Suggested change
crdAPIGroupIndexFieldName = "crdAPIGroup"
gatewayCRDAPIGroupIndexFieldName = "crdAPIGroup"

Contributor Author

The indexer indexes by the CRD group, which may or may not be a Gateway API group. So crdAPIGroupIndexFieldName reflects this purpose.

Contributor Author

Changed the name to gatewayAPICRDIndexFieldName, which will be used in the index with the unmanaged value.

expectStatusUpdate: []client.Object{},
expectStartCtrl: true,
},
}
Contributor

Maybe add another test for experimental gateway CRDs.

Contributor Author

I have a test case called unmanaged gateway API CRDs just above this one; it has a ListenerSet CRD and verifies that the status is updated.

Contributor

That ListenerSet CRD already exists in existingObjects. I'm asking for a test case that adds an experimental CRD and shows that it makes the operator Degraded=true. This could be in pkg/operator/controller/status/controller_test.go.

Contributor Author

I'm asking for a test case that adds an experimental CRD and shows that it makes the operator Degraded=true.

This is tested in the e2e test I added, because it involves 2 different controllers (gatewayapi and status).

// Index Gateway API CRDs by the "spec.group" field
// to enable efficient filtering during list operations.
// Currently, only Gateway API groups have a dedicated index field.
if err := mgr.GetFieldIndexer().IndexField(
Contributor

@candita candita Mar 28, 2025

Could you create two indexers, one for managed and one for unmanaged CRDs? See https://github.com/openshift/cluster-ingress-operator/pull/1205/files?diff=split&w=1#r2019309675

Contributor

What would be the purpose of having two indexers?

Contributor

Ah, I see your comment on listUnmanagedGatewayAPICRDs now.

Wouldn't it make more sense to do this?

		client.IndexerFunc(func(o client.Object) []string {
			if isGatewayAPICRD(o) {
				return []string{"managed"}
			}
			return []string{"unmanaged"}
		})

and then

r.cache.List(ctx, gatewayAPICRDs, client.MatchingFields{crdAPIGroupIndexFieldName: "managed"});

or

r.cache.List(ctx, gatewayAPICRDs, client.MatchingFields{crdAPIGroupIndexFieldName: "unmanaged"});

Contributor

I don't quite know why we need this indexer, which picks up managed and unmanaged CRDs together, but there is code around https://github.com/openshift/cluster-ingress-operator/pull/1205/files?diff=split&w=1#r2019309675 that gets all the CRDs from this index and then iterates through to get the unmanaged ones. One indexer for unmanaged CRDs should be enough.

Contributor

No, I guess my last comment doesn't make sense. The index is used in order to get all Gateway API CRDs, be they managed or be they unmanaged, and then the loop in listUnmanagedGatewayAPICRDs checks each of those Gateway API CRDs to determine whether any unmanaged one was indexed. This makes sense to me: The cluster could have many CRDs, and we only index the ~half dozen (at most) that belong to Gateway API. It's probably cheaper to iterate over a half dozen CRDs than it would be to maintain another index to try to optimize the loop away.

Contributor Author

I think that two indexes don't make sense, as the "managed" index would be of no use in this case.

One indexer for unmanaged CRDs should be enough.

This may indeed be another approach to the problem. I was thinking more of a field name that matches the CRD field spec.group (similar to what Miciah described here). I didn't respect the naming convention, though.

Contributor

One indexer for unmanaged CRDs should be enough.

My understanding from our call today is that we agreed on this approach: using one index, which has only CRDs that are both in any of the API groups that belong to Gateway API and not managed by cluster-ingress-operator.

Contributor Author

Changed the index to have a field for all unmanaged gateway API CRDs.
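For illustration, a rough sketch of that single-index shape. The "unmanaged" index value and the isManagedByOperator helper are assumed names for this sketch, not necessarily what the PR uses.

// Sketch of the agreed approach: the index contains only Gateway API CRDs that
// the operator does not manage, so listing them needs no post-filtering loop.
package gatewayapi

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

const unmanagedGatewayAPICRDIndexValue = "unmanaged"

// isManagedByOperator is a placeholder for the controller's own check of
// whether it installed (and therefore manages) the given CRD.
func isManagedByOperator(crd *apiextensionsv1.CustomResourceDefinition) bool {
	// ... elided: compare against the CRDs the gatewayapi controller creates.
	return false
}

func registerUnmanagedGatewayAPICRDIndex(ctx context.Context, mgr manager.Manager, indexField string) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &apiextensionsv1.CustomResourceDefinition{}, indexField,
		func(o client.Object) []string {
			crd := o.(*apiextensionsv1.CustomResourceDefinition)
			gatewayAPIGroup := crd.Spec.Group == "gateway.networking.k8s.io" || crd.Spec.Group == "gateway.networking.x-k8s.io"
			if gatewayAPIGroup && !isManagedByOperator(crd) {
				return []string{unmanagedGatewayAPICRDIndexValue}
			}
			return nil // not indexed: either not a Gateway API CRD or managed by us
		})
}

Listing the unmanaged CRDs then becomes a single indexed lookup with client.MatchingFields, with no follow-up loop.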

@candita
Contributor

candita commented Mar 28, 2025

Before I go any further, I have some questions about the departure from the design pattern we've used in the past for setting the ingress operator Degraded condition. It is the ingress operator that sets its own conditions from its own probes, as in https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/status.go#L54, merging conditions if needed.

The design in this PR gets a status from the gateway api controller, which marks the ingress controller extension. Then the ingress controller consults its extension. Is there a requirement to use the extension that I'm not really aware of?

Comment on lines 60 to 79
if len(current.Status.Extension.Raw) > 0 /*to avoid "unexpected end of JSON input" error*/ {
if err := json.Unmarshal(current.Status.Extension.Raw, currentExtension); err != nil {
return false, fmt.Errorf("failed to unmarshal status extension of cluster operator %q: %w", current.Name, err)
}
}
if equality.Semantic.DeepEqual(*currentExtension, *desiredExtension) {
return false, nil
}

updated := current.DeepCopy()
rawDesiredExtension, err := json.Marshal(desiredExtension)
if err != nil {
return false, fmt.Errorf("failed to marshal desired status extension of cluster operator %q: %w", updated.Name, err)
}
updated.Status.Extension.Raw = rawDesiredExtension
if err := r.client.Status().Update(ctx, updated); err != nil {
Contributor

This stomps any unrecognized data in status.extension. If another controller needed to store some state in status.extension, this logic would be problematic. Could you add a comment to that effect (possibly a "to do" comment to preserve unrecognized data)?

Contributor

@candita candita Apr 3, 2025

If you plan to make any other changes, please do check whether there's data in status.extension before overwriting it. Do you know what happens if there are several unmanaged CRDs and then one is deleted? Will the operator flip between Degraded true/false in that case?

Otherwise, please add a "todo" comment as @Miciah suggested.

Contributor Author

@alebedev87 alebedev87 Apr 4, 2025

I updated the logic to update only the UnmanagedGatewayAPICRDNames field.
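For the record, one way such a field-scoped update could be shaped: a sketch that decodes the extension into a generic map so any keys owned by other controllers survive. The JSON key name and the helper are assumptions for illustration, not a claim about the PR's implementation.

// Sketch only: rewrite a single key inside status.extension while preserving
// anything else stored there.
package status

import (
	"encoding/json"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

func setUnmanagedCRDNamesInExtension(co *configv1.ClusterOperator, names []string) error {
	ext := map[string]interface{}{}
	if len(co.Status.Extension.Raw) > 0 {
		if err := json.Unmarshal(co.Status.Extension.Raw, &ext); err != nil {
			return fmt.Errorf("failed to unmarshal status extension of cluster operator %q: %w", co.Name, err)
		}
	}
	if len(names) > 0 {
		ext["unmanagedGatewayAPICRDNames"] = names
	} else {
		// No unmanaged CRDs left: drop only our key, leaving other data intact.
		delete(ext, "unmanagedGatewayAPICRDNames")
	}
	raw, err := json.Marshal(ext)
	if err != nil {
		return fmt.Errorf("failed to marshal status extension of cluster operator %q: %w", co.Name, err)
	}
	co.Status.Extension.Raw = raw
	return nil
}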


// updateClusterOperatorStatusExtension updates a cluster operator's status extension.
// Returns a boolean indicating whether the cluster operator was updated, and an error value.
func (r *reconciler) updateClusterOperatorStatusExtension(ctx context.Context, current *configv1.ClusterOperator, desiredExtension *status.IngressOperatorStatusExtension) (bool, error) {
Contributor

How does the extension get returned to normal?

Contributor

We discussed this on our call today. There's some subtlety around the use of marshaling, which omits empty fields. This subtlety warrants a comment to explain why this logic is correct.
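As a tiny, self-contained illustration of that subtlety (the struct and its omitempty tag are assumptions mirroring the sketch in the PR description above):

// When the unmanaged-CRD list becomes empty, omitempty drops the field from
// the marshaled JSON entirely, so the stored extension "returns to normal".
package main

import (
	"encoding/json"
	"fmt"
)

type ingressOperatorStatusExtension struct {
	UnmanagedGatewayAPICRDNames []string `json:"unmanagedGatewayAPICRDNames,omitempty"`
}

func main() {
	raw, _ := json.Marshal(ingressOperatorStatusExtension{}) // no unmanaged CRDs left
	fmt.Println(string(raw))                                 // prints: {}
}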

Contributor Author

I added a dedicated test to showcase what happens when the unmanaged GWAPI CRDs are removed.

Contributor

I agree that a comment would be great.

Contributor Author

Comment added.

@Miciah
Contributor

Miciah commented Apr 8, 2025

Most of my recent comments are about minor things. #1205 (comment) (sorting the list of CRDs in the status message) and #1205 (comment) (logic for joining Degraded=False and Degraded=Unknown) are the only two real possible issues I noticed.

@alebedev87 alebedev87 force-pushed the ne-1969-crd-degraded-status branch 2 times, most recently from 8e923b6 to 594f332 Compare April 9, 2025 20:00
@alebedev87
Contributor Author

/retest

2 similar comments
@alebedev87
Contributor Author

/retest

@alebedev87
Contributor Author

/retest

@alebedev87 alebedev87 force-pushed the ne-1969-crd-degraded-status branch from 594f332 to 654b681 Compare April 14, 2025 15:29
@alebedev87
Contributor Author

Rebased to fix the conflict after the merge of #1213.

// - Messages are concatenated.
// If the new condition has a higher-priority status, it overrides the current one.
// Status priority (highest to lowest): True, Unknown, False, empty.
func joinConditions(currCond, newCond configv1.ClusterOperatorStatusCondition, condType string) configv1.ClusterOperatorStatusCondition {
Contributor

nit: condType is unused.

Contributor Author

Done.

Comment on lines +743 to +744
currCond.Reason += "And" + newCond.Reason
currCond.Message += newCond.Message
Contributor

@candita candita Apr 14, 2025

Here and elsewhere:

Suggested change
currCond.Reason += "And" + newCond.Reason
currCond.Message += newCond.Message
currCond.Reason += " And " + newCond.Reason
currCond.Message += " " + newCond.Message

Contributor Author

I think that "And" is better to be not separated by white spaces in the Reason. Another example of reason the same concatenation (w/o whitespace) is from Miciah's cluster dns operator example. The messages are separated but dots so there will be no confusion about where a new one starts.

Contributor

I fixed this a little differently in my PR https://github.com/openshift/cluster-dns-operator/pull/373/files#diff-81f08fdfab85d1173ed67b51c2d429a60efbfb17e3dc6ef01ed0afe2011023f2R488

I fixed it because it looked like gibberish in the CI events graph.

Contributor Author

@alebedev87 alebedev87 Apr 15, 2025

Ack. I still prefer to keep the Reason without whitespace, because I believe CamelCase concatenation is the Kubernetes-native convention for long reasons (e.g., PodReadyToStartContainers, CrashLoopBackOff, etc.).

Contributor

I prefer the way we did it for computing the Progressing condition: https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/status/controller.go#L574-L576 because it is easier to read and review.

currCond.Reason += "And" + newCond.Reason
currCond.Message += newCond.Message
} else if currCond.Status == "" {
// Degraded=False overrides empty status.
Contributor

nit: I'm curious about why Degraded=False doesn't override Degraded=Unknown in this case.

Contributor Author

Contributor

I think you misunderstood? Or I did? A unit test for this function would help.

Status == "" is not the same as Status == "Unknown". I think you need to account for Status == "Unknown" in this switch case.

Also, if it is currently Degraded=True for some other reason, then in this case it is Degraded=False for CRD issues, does this update anything?

Contributor Author

@alebedev87 alebedev87 Apr 14, 2025

Status == "" is not the same as Status == "Unknown". I think you need to account for Status == "Unknown" in this switch case.

If currCond.Status==Unknown, then newCond.Status==False should not change the status. We should stay Unknown, as Miciah proposed in the link I posted above. This case does exactly this: it leaves currCond.Status untouched if newCond.Status is False (the only exception is when currCond.Status is empty).

Also, if it is currently Degraded=True for some other reason, then in this case it is Degraded=False for CRD issues, does this update anything?

Same thing. If currCond.Status==True this case won't change anything because True + False == True.
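To make that truth table concrete, here is a sketch of the join logic as described in this thread (True beats Unknown beats False beats empty; two True conditions concatenate Reason with "And" and append Messages). It illustrates the described behavior rather than reproducing the PR's exact code.

// Sketch of the joining rules discussed in this thread for the Degraded
// condition. Status priority (highest to lowest): True, Unknown, False, empty.
package status

import (
	configv1 "github.com/openshift/api/config/v1"
)

func joinConditions(currCond, newCond configv1.ClusterOperatorStatusCondition) configv1.ClusterOperatorStatusCondition {
	switch newCond.Status {
	case configv1.ConditionTrue:
		if currCond.Status == configv1.ConditionTrue {
			// Both degraded: keep the current condition and accumulate details.
			currCond.Reason += "And" + newCond.Reason
			currCond.Message += newCond.Message
		} else {
			// True overrides Unknown, False, and empty.
			currCond = newCond
		}
	case configv1.ConditionUnknown:
		if currCond.Status != configv1.ConditionTrue {
			// Unknown overrides False and empty, but never True.
			currCond = newCond
		}
	case configv1.ConditionFalse:
		if currCond.Status == "" {
			// False only fills in an empty status; True and Unknown are kept.
			currCond = newCond
		}
	}
	return currCond
}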

Contributor

Overall, I still think this is just too much of a departure from the way we normally do this. At some point, can you please go to https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/status.go#L550-L552 and change it so it says:

// computeIngressDegradedCondition computes the ingresscontroller's "Degraded"
// status condition, which aggregates other status conditions that can indicate
// a degraded state EXCEPT for degraded conditions dealing with CRDs, in which case
// we followed a completely different pattern.

Contributor

Also, can you please add a unit test to
Test_computeIngressDegradedCondition

Contributor

Also, the different reasons for a degraded status are listed in https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/controller.go#L59-L62
so you should try to define a reason this way.

Contributor Author

Also, can you please add a unit test to
Test_computeIngressDegradedCondition

Yes, we can follow up on this one. There is a lot of code not covered by unit tests including the status controller's degraded condition computation.

Also, the different reasons for a degraded status are listed in https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/controller.go#L59-L62
so you should try to define a reason this way.

I'm not sure exactly what we should take as inspiration from the ingresscontroller's Degraded condition for the cluster operator's Degraded condition. Can you please describe it in more detail?

Contributor Author

@alebedev87 alebedev87 Apr 16, 2025

Overall, I still think this is just too much of a departure from the way we normally do this. At some point, can you please go to https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/status.go#L550-L552 and change it so it says:

// computeIngressDegradedCondition computes the ingresscontroller's "Degraded"
// status condition, which aggregates other status conditions that can indicate
// a degraded state EXCEPT for degraded conditions dealing with CRDs, in which case
// we followed a completely different pattern.

Are you sure you are not mixing up the Degraded status for the ingresscontrollers and the Degraded status for the ingress cluster operator? The comment you are suggesting is for the former, which doesn't have anything to do with the Gateway API CRDs. It's the ingress cluster operator's Degraded status that uses both the default ingresscontroller's Degraded status and the Gateway API CRDs.

switch newCond.Status {
case configv1.ConditionTrue:
if currCond.Status == configv1.ConditionTrue {
currCond.Reason += "And" + newCond.Reason
Contributor

@candita candita Apr 14, 2025

If this continues to be Degraded=True for Gateway API CRDs, this doesn't prevent a string of duplicated statuses, does it? E.g. Reason="GatewayAPICRDsDegraded And GatewayAPICRDsDegraded And GatewayAPICRDsDegraded"

Contributor Author

computeOperatorDegradedCondition computes the status from scratch; the current condition from the cluster is not appended with new reasons/messages.

Contributor

Can you make a small unit test for this function that demonstrates that it wouldn't keep appending Reasons every time a reconcile happens while the Degraded=True condition continues?

Contributor Author

Well, a unit test for Reconcile won't be small; I would probably need to mock the client and some other things.

Can you please explain to me how exactly the Degraded condition can be appended to the existing one, given that computeOperatorDegradedCondition does not take co.Status.Conditions as a parameter and that mergeConditions overrides the condition with the new value?

Contributor

@candita candita Apr 15, 2025

It's not that the conditions are being appended; it's that the reasons and messages are. Reviewing this again, I see that the condition is blank before the two conditions are joined, so it seems this couldn't happen after all.

I wouldn't want to see more and more degraded reasons and messages being concatenated in the future; two is bad enough in my opinion. There is no way to enforce this, though. I'll just go on record as not being very impressed with this framework, but I can't override @Miciah's decision, so I'll let it go.

expectCondition: configv1.ClusterOperatorStatusCondition{
Type: configv1.OperatorDegraded,
Status: configv1.ConditionTrue,
Reason: "IngressDegradedAndGatewayAPICRDsDegraded",
Contributor

See https://github.com/openshift/cluster-ingress-operator/pull/1205/files#r2042909467. I think the reasons should be separated so they can be examined as tokens later, to ensure there are no duplicates added.

Contributor Author

No duplicates will be added because the existing Degraded condition is not taken into account. The Degraded condition is recomputed from scratch every time.

Set the ingress cluster operator’s `Degraded` status to `True` if unmanaged
Gateway API CRDs exist on the cluster.

An unmanaged Gateway API CRD is one with "gateway.networking.k8s.io" or
"gateway.networking.x-k8s.io" in its `spec.group` field, and not managed
by the gatewayapi controller.

This commit uses the cluster operator’s `status.extension` field
to store the names of unmanaged CRDs. The gatewayapi controller writes
to this field, while the status controller reads it and updates the
ingress cluster operator’s `Degraded` status accordingly.
@alebedev87 alebedev87 force-pushed the ne-1969-crd-degraded-status branch from 654b681 to 4eb3a46 Compare April 14, 2025 21:35
Type: configv1.OperatorDegraded,
Status: configv1.ConditionTrue,
Reason: "IngressDegraded",
Message: `The "default" ingress controller reports Degraded=True: dummy: dummy.`,
Contributor

Is this the revised format?

Suggested change
Message: `The "default" ingress controller reports Degraded=True: dummy: dummy.`,
Message: `The "default" ingress controller reports Degraded=True. dummy: dummy.`,

Contributor Author

I didn't change the condition message for this part, so : is what we use right now. I don't want to change this behavior because it's unrelated to this PR.

@alebedev87
Contributor Author

/retest

@alebedev87
Contributor Author

idle-close-on-response test failed while testing the deferred policy:

NAME  TestAll/parallel/Test_IdleConnectionTerminationPolicyDeferred
    idle_connection_test.go:621: [10.131.119.64:35050 -> 35.184.77.0:80] Req: URL=http://35.184.77.0, Host=test-idle-connection-close-deferred-6fg8p.apps.ci-op-lxvhp8hr-76b3b.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    idle_connection_test.go:621: [10.131.119.64:35050 <- 35.184.77.0:80] Res: Status=200, Headers=map[Content-Length:[8] Content-Type:[text/plain; charset=utf-8] Date:[Tue, 15 Apr 2025 08:59:53 GMT] Set-Cookie:[8e7d269621fcec463491d935c21ae61a=b72b88863173c2b1431c5afd38df7e72; path=/; HttpOnly] X-Pod-Name:[web-service-2] X-Pod-Namespace:[unknown-namespace]]
    idle_connection_test.go:607: step 2: unexpected response: got "web-service-2", want "web-service-1"

This test is not often seen failing, but it's still something to keep an eye on. However, this is unlikely to be related to this PR.

/retest

@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2025

@alebedev87: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +735 to +736
// - Reason becomes: "MultipleComponents[Not]" + condType.
// - Messages are concatenated.
Contributor

@candita candita Apr 15, 2025

nit: Did you drop the MultipleComponents prefix idea?

Type: configv1.OperatorDegraded,
Status: configv1.ConditionTrue,
Reason: "IngressDegradedAndGatewayAPICRDsDegraded",
Message: `The "default" ingress controller reports Degraded=True: dummy: dummy.Unmanaged Gateway API CRDs found: listenersets.gateway.networking.x-k8s.io.`,
Contributor

Ok, this is the revised format. The concatenated message is

dummy.Unmanaged Gateway API CRDs found: listenersets.gateway.networking.x-k8s.io.

Contributor

And a new degraded message will look something like this if the CRDs also put it in a degraded state:

"One or more other status conditions indicate a degraded state: DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1).Unmanaged Gateway API CRDs found: listenersets.gateway.networking.x-k8s.io."

Contributor Author

I agree; the readability of the degraded message can be improved.

@candita
Contributor

candita commented Apr 15, 2025

I have a few misgivings about this, which can be addressed in a follow-up.
/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2025
@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 5f89b8a into openshift:master Apr 15, 2025
19 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.19.0-202504160315.p0.g5f89b8a.assembly.stream.el9.
All builds following this will include this PR.
