
Conversation

@alebedev87
Contributor

@alebedev87 alebedev87 commented Apr 3, 2025

This PR aims to help with gateway API e2e test failures when the DNSRecord for a gateway cannot be found in time (example):

  • Adds more logs to the DNSRecord ensuring logic to show when a DNSRecord is updated or left untouched.
  • Adds more logs to the desired-DNSRecord generation logic to show why the Gateway service is not picked up.
  • Adds more logs to the Gateway service enqueue logic to show when reconciliation is or is not triggered.
  • Ensures the Programmed condition is checked on the Gateway resource in the e2e test.
  • Logs the gateway service definition while polling for gateway acceptance in the e2e test.
  • Logs the reason why the DNSRecord is not ready during the assertion.
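Taken together, the condition checks described above amount to requiring both the Accepted and Programmed conditions to be True before the e2e poll succeeds. A simplified, self-contained sketch of such a check (the real code operates on gatewayapiv1 and metav1.Condition types; the names here are hypothetical):

```go
package main

import "fmt"

// Condition mirrors the minimal shape of a Gateway status condition
// (metav1.Condition in the real operator code).
type Condition struct {
	Type   string
	Status string
}

// conditionsTrue reports whether every expected condition type is present
// with status "True". This is a stdlib-only sketch of the e2e assertion.
func conditionsTrue(conds []Condition, expected ...string) bool {
	for _, want := range expected {
		found := false
		for _, c := range conds {
			if c.Type == want && c.Status == "True" {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	conds := []Condition{
		{Type: "Accepted", Status: "True"},
		{Type: "Programmed", Status: "False"},
	}
	// Programmed is not yet True, so the poll would keep waiting.
	fmt.Println(conditionsTrue(conds, "Accepted", "Programmed"))
}
```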

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2025
@openshift-ci
Contributor

openshift-ci bot commented Apr 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@alebedev87
Contributor Author

/test e2e-aws-gatewayapi
/test e2e-aws-operator-techpreview

@alebedev87 alebedev87 force-pushed the dns-record-check-lib branch from 063382e to 13f7808 on April 3, 2025 12:52
@alebedev87 alebedev87 changed the title gwapi e2e: Add more logs to detect DNSRecord absence problem e2e: Improve detection of missing DNSRecord for Gateway Apr 3, 2025
@alebedev87
Contributor Author

/test e2e-aws-gatewayapi
/test e2e-aws-operator-techpreview

@alebedev87 alebedev87 force-pushed the dns-record-check-lib branch 2 times, most recently from ae92836 to 45a4c1c on April 3, 2025 14:39
@alebedev87
Contributor Author

/test all

@alebedev87 alebedev87 force-pushed the dns-record-check-lib branch 2 times, most recently from b923592 to da2ffeb on April 4, 2025 15:03
@alebedev87 alebedev87 marked this pull request as ready for review April 4, 2025 15:08
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2025
@openshift-ci openshift-ci bot requested review from frobware and grzpiotrowski April 4, 2025 15:09
@alebedev87 alebedev87 force-pushed the dns-record-check-lib branch 2 times, most recently from 2a880e8 to 0424f4e on April 4, 2025 15:10
@alebedev87
Contributor Author

The gateway didn't get its load balancer address in time; the Programmed condition's last transition was at 17:27:57:

          - type: Programmed
            status: "False"
            observedgeneration: 1
            lasttransitiontime: "2025-04-04T17:27:57Z"
            reason: AddressNotAssigned
            message: 'Assigned to service(s) test-gateway-openshift-default.openshift-ingress.svc.cluster.local:80,
              but failed to assign to all requested addresses: address pending for hostname
              "test-gateway-openshift-default.openshift-ingress.svc.cluster.local"'

Load balancer provisioning happened at 17:29:28:

- apiVersion: v1                                                                                                                                     
  count: 1
  eventTime: null
  firstTimestamp: "2025-04-04T17:29:28Z"
  involvedObject:
    apiVersion: v1
    kind: Service
    name: test-gateway-openshift-default
    namespace: openshift-ingress
    resourceVersion: "95293"
    uid: 33f249bf-a0c3-4350-b3cc-70e4467c44a6
  kind: Event
  lastTimestamp: "2025-04-04T17:29:28Z"
  message: Ensured load balancer

Istiod created the service at 17:29:27:

2025-04-04T17:29:27.435940790Z 2025-04-04T17:29:27.435926Z	info	model	Full push, new service openshift-ingress/test-gateway-openshift-default.openshift-ingress.svc.cluster.local

The istiod pod itself was created at 17:29:18:

  creationTimestamp: "2025-04-04T17:29:18Z"
  generateName: istiod-openshift-gateway-57c5c5ccb9-

Who accepted the GatewayClass then?! The check for its acceptance runs before the gateway check.

@alebedev87 alebedev87 force-pushed the dns-record-check-lib branch from 0424f4e to 005d884 on April 4, 2025 21:39
@alebedev87
Contributor Author

Running testGatewayAPIIstioInstallation before testGatewayAPIObjects may help to avoid problems with LB provisioning.

@alebedev87
Contributor Author

/test e2e-aws-gatewayapi
/test e2e-aws-operator-techpreview

// Quick sanity check since we don't know how to handle both being set (is
// that even a valid state?)
if len(ingress.Hostname) > 0 && len(ingress.IP) > 0 {
	log.Info("no load balancer hostname or IP for dnsrecord", "dnsrecord", name, "service", service.Name)
Contributor

In this case it has both hostname and ip, which is unexpected.

Comment on lines 565 to 566
-	recordedConditionMsg := "not found"
+	recordedAcceptedConditionMsg, recordedProgrammedConditionMsg := "not found", "not found"
Contributor

With your change below to use %q instead of %s, it would be clearer to initialize these variables to empty strings.

-	return gatewayListenersHostnamesChanged(old, new)
+	changed := gatewayListenersHostnamesChanged(old, new)
+	if changed {
+		log.Info("listener hostname changed", "gateway", e.ObjectNew.(*gatewayapiv1.Gateway).Name)
Contributor

Here and in the other log messages that you add, please capitalize the first letter. See https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#message-style-guidelines.

@Miciah
Contributor

Miciah commented Apr 7, 2025

In some test failures that I have been investigating, there are no events to indicate that the service controller tried to provision an LB, which leads me to wonder whether Istio sometimes doesn't create the service or doesn't configure it correctly. The istiod logs aren't very helpful for diagnosing these failures. For this reason, I believe it could be helpful for ensureGatewayObjectSuccess to get and log the service definition. Would you be interested in adding that to this PR?

@alebedev87 alebedev87 changed the title e2e: Improve detection of missing DNSRecord for Gateway [WIP] e2e: Improve detection of missing DNSRecord for Gateway Apr 9, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 9, 2025
@candita
Contributor

candita commented Apr 9, 2025

/assign @rfredette
/assign

@alebedev87
Contributor Author

/jire refresh

@alebedev87
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Aug 4, 2025
@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-54966, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Aug 4, 2025
// Quick sanity check since we don't know how to handle both being set (is
// that even a valid state?)
if len(ingress.Hostname) > 0 && len(ingress.IP) > 0 {
	log.Info("Both load balancer hostname and IP are set", "dnsrecord", name, "service", service.Name)
Contributor

nit: how about logging an error here, in case it does happen more than we expect?

Contributor

Good call, changing log.Info(...) to log.Error(nil, ...) would make it more obvious that this log message represents an error condition.

Contributor Author

Changed to log.Error().


err := wait.PollUntilContextTimeout(context.Background(), 1*time.Second, 1*time.Minute, false, func(context context.Context) (bool, error) {
// Wait for the gateway to be accepted and programmed.
// Load balancer provisioning can take several minutes on some platforms.
Contributor

nit: it would be great in the future to set timeouts that vary based on platforms.

Contributor

Is there an advantage to using a lower timeout on platforms that don't need the higher timeout?

I would prefer to take the longest amount of time we expect provisioning to take, and double it. We don't pay the cost unless provisioning actually takes that amount of time, but we do pay a cost if the test times out when waiting longer would have allowed it to pass.

Contributor

We pay the cost of higher timeouts when everything is failing. I'm suggesting that we be more choosy with expected timeouts to take advantage of tests failing faster on platforms that don't take as long to provision. It's just a nit though, so no big deal.

Contributor Author

I'm suggesting that we be more choosy with expected timeouts to take advantage of tests failing faster on platforms that don't take as long to provision. It's just a nit though, so no big deal.

My concern with "tests failing faster on platforms that don't take as long to provision" is that it could lead to premature failures. For instance, if LB provisioning on GCP typically takes 1 minute, but occasionally takes 2 minutes, should we fail the test after 1 minute of polling (and then follow up with the CCM team)? Or should we allow it to poll for an additional minute to potentially avoid a retest of the entire suite?

I'd say that the cost of retesting flaky tests outweighs the cost of a slightly longer poll in a failed scenario, especially since the LB provisioning is a process outside of our direct control.

Contributor

We pay the cost of higher timeouts when everything is failing.

Can we test for "everything is failing" in the polling loop, and exit early if the condition is detected? That would fail faster, flake less, and provide useful diagnostic information right in the test output.

Contributor

Yes, but let's prioritize the merge so we can get better CI signal.

Contributor

Yes, but let's prioritize the merge so we can get better CI signal.

Agreed. The PR has already merged, but in general when we have tests that are flaky because polling loops time out, I would argue for (1) increasing timeouts as a first pass in order to reduce flakiness and (2) refining the failure conditions as a second pass in order to fail faster and improve diagnostics. Optimizing to reduce flakiness virtually always takes precedence over optimizing to fail fast.

// The creation of the gateway service may be delayed.
// Check the current status of the service to see where we are.
svc := &corev1.Service{}
svcNsName := types.NamespacedName{Namespace: namespace, Name: gw.Name + "-" + string(gw.Spec.GatewayClassName)}
Contributor

Contributor Author

I prefer to avoid reusing the code which is being tested in tests. Composing a name like this adds an additional check for an implicit assumption about the naming of the publishing service.
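The naming assumption being verified independently is small enough to state inline: in the snippet above, the test expects the publishing service to be named "<gateway name>-<gateway class name>". A trivial sketch of what the test recomputes (the helper name is hypothetical):

```go
package main

import "fmt"

// gatewayServiceName composes the expected name of the service created for
// a gateway: "<gateway name>-<gateway class name>". The e2e test builds
// this independently rather than reusing operator code, so the test also
// exercises the naming convention itself.
func gatewayServiceName(gatewayName, gatewayClassName string) string {
	return gatewayName + "-" + gatewayClassName
}

func main() {
	// Matches the service name seen in the CI logs above.
	fmt.Println(gatewayServiceName("test-gateway", "openshift-default"))
}
```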

-	t.Logf("Last observed gateway:\n%s", util.ToYaml(gw))
-	return nil, fmt.Errorf("gateway %v not %v, last recorded status message: %s", nsName, gatewayapiv1.GatewayConditionAccepted, recordedConditionMsg)
+	t.Logf("[%s] Last observed gateway:\n%s", time.Now().Format(time.DateTime), util.ToYaml(gw))
+	return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages: %q, %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)
Contributor

Suggested change
-	return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages: %q, %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)
+	return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages for Accepted: %q, Programmed: %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)

Contributor

The previous log line, with util.ToYaml(gw), should already log the Accepted and Programmed status conditions; is it useful to include them in the error message as well?

Contributor

This requested change is just to clarify which condition matches the listed message. The message could end up just showing "last recorded status messages: Pending, Invalid". I'm suggesting we change it to "last recorded status messages for Accepted: Pending, Programmed: Invalid".

Contributor Author

I agree this error message is redundant. But I'm fine with changing it to the suggested one so we can more easily see the extracted messages in the logs.

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Aug 14, 2025
- Add more logs to the logic of DNSRecord ensuring to see when DNSRecord is
  updated or left untouched.
- Add more logs to the logic of generation of desired DNSRecord to see
  why Gateway service is not taken.
- Add more logs to Gateway service enqueue logic to see when reconciling
  hits or does not hit.
- Ensure the `Programmed` condition is checked on the Gateway resource
  in the e2e test.
- Log gateway service definition while polling for gateway acceptance in the e2e test.
- Log the reason why DNSRecord is not ready during the assertion.
@alebedev87 alebedev87 force-pushed the dns-record-check-lib branch from 443fe50 to 312078a on August 18, 2025 09:51
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 18, 2025
@alebedev87
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Aug 18, 2025

@alebedev87: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-aws-operator-techpreview | 312078a | link | false | /test e2e-aws-operator-techpreview
ci/prow/okd-scos-e2e-aws-ovn | 312078a | link | false | /test okd-scos-e2e-aws-ovn
ci/prow/e2e-azure-ovn | 312078a | link | false | /test e2e-azure-ovn
ci/prow/e2e-aws-ovn-techpreview | 312078a | link | false | /test e2e-aws-ovn-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@candita
Contributor

candita commented Aug 19, 2025

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 19, 2025
@candita
Contributor

candita commented Aug 19, 2025

We know the root cause for failures in CI for e2e-gcp-operator and they are being handled by the cloud controller team.
/override e2e-gcp-operator

@openshift-ci
Contributor

openshift-ci bot commented Aug 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita, rfredette

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Aug 19, 2025

@candita: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • e2e-gcp-operator

Only the following failed contexts/checkruns were expected:

  • ci/prow/e2e-aws-operator
  • ci/prow/e2e-aws-operator-techpreview
  • ci/prow/e2e-aws-ovn
  • ci/prow/e2e-aws-ovn-hypershift-conformance
  • ci/prow/e2e-aws-ovn-serial
  • ci/prow/e2e-aws-ovn-single-node
  • ci/prow/e2e-aws-ovn-techpreview
  • ci/prow/e2e-aws-ovn-upgrade
  • ci/prow/e2e-azure-operator
  • ci/prow/e2e-azure-ovn
  • ci/prow/e2e-gcp-operator
  • ci/prow/e2e-gcp-ovn
  • ci/prow/e2e-hypershift
  • ci/prow/hypershift-e2e-aks
  • ci/prow/images
  • ci/prow/okd-scos-e2e-aws-ovn
  • ci/prow/okd-scos-images
  • ci/prow/unit
  • ci/prow/verify
  • ci/prow/verify-deps
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator-techpreview
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-hypershift-conformance
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-techpreview
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-ovn
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-ovn
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-hypershift
  • pull-ci-openshift-cluster-ingress-operator-master-hypershift-e2e-aks
  • pull-ci-openshift-cluster-ingress-operator-master-images
  • pull-ci-openshift-cluster-ingress-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-cluster-ingress-operator-master-okd-scos-images
  • pull-ci-openshift-cluster-ingress-operator-master-unit
  • pull-ci-openshift-cluster-ingress-operator-master-verify
  • pull-ci-openshift-cluster-ingress-operator-master-verify-deps
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

In response to this:

We know the root cause for failures in CI for e2e-gcp-operator and they are being handled by the cloud controller team.
/override e2e-gcp-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@candita
Contributor

candita commented Aug 19, 2025

We know the root cause for failures in CI for e2e-gcp-operator and they are being handled by the cloud controller team.
/override ci/prow/e2e-gcp-operator

@openshift-ci
Contributor

openshift-ci bot commented Aug 19, 2025

@candita: Overrode contexts on behalf of candita: ci/prow/e2e-gcp-operator

In response to this:

/override ci/prow/e2e-gcp-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@candita
Contributor

candita commented Aug 19, 2025

/cancel hold

@candita
Contributor

candita commented Aug 19, 2025

/tide refresh

@candita
Contributor

candita commented Aug 19, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 19, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 52a0048 into openshift:master Aug 19, 2025
17 of 21 checks passed
@openshift-ci-robot
Contributor

@alebedev87: Jira Issue OCPBUGS-54966: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-54966 has been moved to the MODIFIED state.

In response to this:

This PR aims to help with gateway API e2e test failures when the DNSRecord for a gateway cannot be found in time (example):

  • Adds more logs to the DNSRecord ensuring logic to show when a DNSRecord is updated or left untouched.
  • Adds more logs to the desired-DNSRecord generation logic to show why the Gateway service is not picked up.
  • Adds more logs to the Gateway service enqueue logic to show when reconciliation is or is not triggered.
  • Ensures the Programmed condition is checked on the Gateway resource in the e2e test.
  • Logs the gateway service definition while polling for gateway acceptance in the e2e test.
  • Logs the reason why the DNSRecord is not ready during the assertion.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.20.0-202508192316.p0.g52a0048.assembly.stream.el9.
All builds following this will include this PR.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.


8 participants