OCPBUGS-54966: Improve detection of missing DNSRecord for Gateway #1212
Conversation
|
Skipping CI for Draft Pull Request. |
|
/test e2e-aws-gatewayapi |
Force-pushed 063382e to 13f7808.
|
/test e2e-aws-gatewayapi |
Force-pushed ae92836 to 45a4c1c.
|
/test all |
Force-pushed b923592 to da2ffeb.
Force-pushed 2a880e8 to 0424f4e.
|
The gateway service didn't get its load balancer target in time (last transition: …). Load balancer provisioning happened at …, istiod created the service at …, and the istiod pod itself was created at:

creationTimestamp: "2025-04-04T17:29:18Z"
generateName: istiod-openshift-gateway-57c5c5ccb9-

Who accepted the gatewayclass then?! The check for its acceptance goes before the gateway check. |
Force-pushed 0424f4e to 005d884.
|
Running |
|
/test e2e-aws-gatewayapi |
pkg/resources/dnsrecord/dns.go
Outdated
// Quick sanity check since we don't know how to handle both being set (is
// that even a valid state?)
if len(ingress.Hostname) > 0 && len(ingress.IP) > 0 {
    log.Info("no load balancer hostname or IP for dnsrecord", "dnsrecord", name, "service", service.Name)
In this case it has both hostname and ip, which is unexpected.
test/e2e/util_gatewayapi_test.go
Outdated
recordedConditionMsg := "not found"
recordedAcceptedConditionMsg, recordedProgrammedConditionMsg := "not found", "not found"
With your change below to use %q instead of %s, it would be clearer to initialize these variables to empty strings.
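For illustration, a standalone snippet (not the test code itself) showing why `%q` pairs better with empty-string initialization than `%s` does with a placeholder value:

```go
package main

import "fmt"

func main() {
	// If the variable is never overwritten during polling, an empty string
	// printed with %s simply vanishes from the failure message.
	msg := ""
	fmt.Printf("last recorded status message: %s\n", msg)

	// With %q the same empty string is rendered as "", which is unambiguous,
	// and a placeholder such as "not found" can no longer be mistaken for a
	// real condition message.
	fmt.Printf("last recorded status message: %q\n", msg)
}
```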
return gatewayListenersHostnamesChanged(old, new)
changed := gatewayListenersHostnamesChanged(old, new)
if changed {
    log.Info("listener hostname changed", "gateway", e.ObjectNew.(*gatewayapiv1.Gateway).Name)
Here and in the other log messages that you add, please capitalize the first letter. See https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#message-style-guidelines.
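For example, a minimal sketch of the requested message style (using klog here purely for illustration; the operator's own logger is assumed to behave the same way):

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	gateway := "example-gateway" // hypothetical gateway name

	// Message style per the sig-instrumentation guidelines: start with a
	// capital letter, no trailing punctuation, context in key/value pairs.
	klog.InfoS("Listener hostname changed", "gateway", gateway)
}
```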
|
In some test failures that I have been investigating, there are no events to indicate that the service controller tried to provision an LB, which leads me to wonder whether Istio sometimes doesn't create the service or doesn't configure it correctly. The istiod logs aren't very helpful for diagnosing these failures. For this reason, I believe it could be helpful for … |
|
/assign @rfredette |
|
/jire refresh |
|
/jira refresh |
|
@alebedev87: This pull request references Jira Issue OCPBUGS-54966, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
pkg/resources/dnsrecord/dns.go
Outdated
// Quick sanity check since we don't know how to handle both being set (is
// that even a valid state?)
if len(ingress.Hostname) > 0 && len(ingress.IP) > 0 {
    log.Info("Both load balancer hostname and IP are set", "dnsrecord", name, "service", service.Name)
nit: how about logging an error here, in case it does happen more than we expect?
Good call, changing log.Info(...) to log.Err(nil, ...) would make it more obvious that this log message represents an error condition.
Changed to log.Error().
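A minimal sketch of the difference (using go-logr's stdr backend purely for illustration; the operator's `log` is assumed to expose the same logr-style `Info`/`Error` methods):

```go
package main

import (
	"log"
	"os"

	"github.com/go-logr/stdr"
)

func main() {
	logger := stdr.New(log.New(os.Stderr, "", log.LstdFlags))

	// As an Info message, the unexpected state is easy to overlook when
	// scanning logs for failures.
	logger.Info("Both load balancer hostname and IP are set", "service", "example-gateway")

	// Logged via Error with a nil error, the same message is clearly surfaced
	// as an error condition, which is what the review suggestion is after.
	logger.Error(nil, "Both load balancer hostname and IP are set", "service", "example-gateway")
}
```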
|
|
err := wait.PollUntilContextTimeout(context.Background(), 1*time.Second, 1*time.Minute, false, func(context context.Context) (bool, error) {
    // Wait for the gateway to be accepted and programmed.
    // Load balancer provisioning can take several minutes on some platforms.
nit: it would be great in the future to set timeouts that vary based on platforms.
Is there an advantage to using a lower timeout on platforms that don't need the higher timeout?
I would prefer to take the longest amount of time we expect provisioning to take, and double it. We don't pay the cost unless provisioning actually takes that amount of time, but we do pay a cost if the test times out when waiting longer would have allowed it to pass.
We pay the cost of higher timeouts when everything is failing. I'm suggesting that we be more choosy with expected timeouts to take advantage of tests failing faster on platforms that don't take as long to provision. It's just a nit though, so no big deal.
I'm suggesting that we be more choosy with expected timeouts to take advantage of tests failing faster on platforms that don't take as long to provision. It's just a nit though, so no big deal.
My concern with "tests failing faster on platforms that don't take as long to provision" is that it could lead to premature failures. For instance, if LB provisioning on GCP typically takes 1 minute, but occasionally takes 2 minutes, should we fail the test after 1 minute of polling (and then follow up with the CCM team)? Or should we allow it to poll for an additional minute to potentially avoid a retest of the entire suite?
I'd say that the cost of retesting flaky tests outweighs the cost of a slightly longer poll in a failed scenario, especially since the LB provisioning is a process outside of our direct control.
We pay the cost of higher timeouts when everything is failing.
Can we test for "everything is failing" in the polling loop, and exit early if the condition is detected? That would fail faster, flake less, and provide useful diagnostic information right in the test output.
Yes, but let's prioritize the merge so we can get better CI signal.
Yes, but let's prioritize the merge so we can get better CI signal.
Agreed. The PR has already merged, but in general when we have tests that are flaky because polling loops time out, I would argue for (1) increasing timeouts as a first pass in order to reduce flakiness and (2) refining the failure conditions as a second pass in order to fail faster and improve diagnostics. Optimizing to reduce flakiness virtually always takes precedence over optimizing to fail fast.
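A sketch of the two-pass idea under discussion (not the PR's actual code): keep a generous overall timeout, but let the condition function abort the poll as soon as a problem that more waiting cannot fix is detected. The helper name and the interval/timeout values here are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkGateway stands in for the real status check. It reports whether the
// gateway is ready, and returns a non-nil error only for problems that more
// waiting cannot resolve (here it always fails, to demonstrate the early exit).
func checkGateway() (ready bool, fatal error) {
	return false, errors.New("gateway service was never created")
}

func main() {
	// The overall timeout stays generous so slow load-balancer provisioning
	// doesn't flake the test.
	err := wait.PollUntilContextTimeout(context.Background(), 10*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			ready, fatal := checkGateway()
			if fatal != nil {
				// Returning a non-nil error stops the polling immediately,
				// so the failure surfaces in seconds instead of minutes and
				// carries a useful diagnostic message.
				return false, fmt.Errorf("giving up early: %w", fatal)
			}
			return ready, nil
		})
	if err != nil {
		fmt.Println("poll failed:", err)
	}
}
```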
// The creation of the gateway service may be delayed.
// Check the current status of the service to see where we are.
svc := &corev1.Service{}
svcNsName := types.NamespacedName{Namespace: namespace, Name: gw.Name + "-" + string(gw.Spec.GatewayClassName)}
nit: the generated name could be stored for other uses, like https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/names.go#L319
I prefer to avoid reusing the code which is being tested in tests. Composing a name like this adds an additional check for an implicit assumption about the naming of the publishing service.
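A sketch of the point being made (hypothetical names; the real test composes the name inline): by spelling out the `<gateway name>-<gatewayclass name>` convention in the test instead of calling the operator's name helper, the test asserts the naming contract rather than inheriting it from the code under test.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

// expectedGatewayServiceName composes the name of the service that is expected
// to be created for a Gateway, independently of the operator's own helper.
func expectedGatewayServiceName(namespace, gatewayName, gatewayClassName string) types.NamespacedName {
	return types.NamespacedName{
		Namespace: namespace,
		Name:      gatewayName + "-" + gatewayClassName,
	}
}

func main() {
	// Hypothetical values standing in for the Gateway under test.
	fmt.Println(expectedGatewayServiceName("openshift-ingress", "example-gateway", "openshift-default"))
}
```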
test/e2e/util_gatewayapi_test.go
Outdated
t.Logf("Last observed gateway:\n%s", util.ToYaml(gw))
return nil, fmt.Errorf("gateway %v not %v, last recorded status message: %s", nsName, gatewayapiv1.GatewayConditionAccepted, recordedConditionMsg)
t.Logf("[%s] Last observed gateway:\n%s", time.Now().Format(time.DateTime), util.ToYaml(gw))
return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages: %q, %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)
return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages: %q, %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)
return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages for Accepted: %q, Programmed: %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)
The previous log line, with util.ToYaml(gw), should already log the Accepted and Programmed status conditions; is it useful to include them in the error message as well?
This requested change is just to clarify which condition matches the listed message. The message could end up just showing "last recorded status messages: Pending, Invalid". I'm suggesting we change it to "last recorded status messages for Accepted: Pending, Programmed: Invalid".
I agree this error message is redundant. But I'm fine with changing it to the suggested one so we can more easily see the extracted messages in the logs.
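For context, a sketch (not the PR's actual helper) of how the last Accepted and Programmed messages could be pulled out of the Gateway status so the final error can label each one:

```go
package main

import (
	"fmt"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// lastConditionMessages extracts the latest Accepted and Programmed condition
// messages from a Gateway status.
func lastConditionMessages(gw *gatewayapiv1.Gateway) (accepted, programmed string) {
	if c := apimeta.FindStatusCondition(gw.Status.Conditions, string(gatewayapiv1.GatewayConditionAccepted)); c != nil {
		accepted = c.Message
	}
	if c := apimeta.FindStatusCondition(gw.Status.Conditions, string(gatewayapiv1.GatewayConditionProgrammed)); c != nil {
		programmed = c.Message
	}
	return accepted, programmed
}

func main() {
	// Hypothetical status values for demonstration only.
	gw := &gatewayapiv1.Gateway{}
	gw.Status.Conditions = []metav1.Condition{
		{Type: string(gatewayapiv1.GatewayConditionAccepted), Status: metav1.ConditionTrue, Message: "Gateway accepted"},
		{Type: string(gatewayapiv1.GatewayConditionProgrammed), Status: metav1.ConditionFalse, Message: "Waiting for load balancer"},
	}
	accepted, programmed := lastConditionMessages(gw)
	fmt.Printf("last recorded status messages for Accepted: %q, Programmed: %q\n", accepted, programmed)
}
```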
- Add more logs to the DNSRecord ensure logic to show when the DNSRecord is updated or left untouched.
- Add more logs to the generation of the desired DNSRecord to show why the Gateway service is not picked up.
- Add more logs to the Gateway service enqueue logic to show when reconciliation is or is not triggered.
- Ensure the `Programmed` condition is checked on the Gateway resource in the e2e test.
- Log the gateway service definition while polling for gateway acceptance in the e2e test.
- Log the reason why the DNSRecord is not ready during the assertion.
Force-pushed 443fe50 to 312078a.
|
/retest-required |
|
@alebedev87: The following tests failed, say
|
/lgtm |
|
We know the root cause for failures in CI for e2e-gcp-operator and they are being handled by the cloud controller team. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: candita, rfredette. |
|
@candita: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. In response to this:
|
We know the root cause for failures in CI for e2e-gcp-operator and they are being handled by the cloud controller team. |
|
@candita: Overrode contexts on behalf of candita: ci/prow/e2e-gcp-operator In response to this:
|
/cancel hold |
|
/tide refresh |
|
/unhold |
Merged commit 52a0048 into openshift:master.
|
@alebedev87: Jira Issue OCPBUGS-54966: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-54966 has been moved to the MODIFIED state. In response to this:
|
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-ingress-operator |
This PR aims at helping with gateway API e2e test failures when the `DNSRecord` for a gateway cannot be found in time (example):
…
- Ensure the `Programmed` condition is checked on the Gateway resource in the e2e test.