OCPBUGS-54966: Improve detection of missing DNSRecord for Gateway #1212
Conversation
|
Skipping CI for Draft Pull Request. |
|
/test e2e-aws-gatewayapi |
Force-pushed 063382e to 13f7808.
|
/test e2e-aws-gatewayapi |
Force-pushed ae92836 to 45a4c1c.
|
/test all |
Force-pushed b923592 to da2ffeb.
Force-pushed 2a880e8 to 0424f4e.
|
The gateway service didn't get its load balancer target in time (last transition: …). Load balancer provisioning happened at …, istiod created the service at …, and the istiod pod itself was created at:

creationTimestamp: "2025-04-04T17:29:18Z"
generateName: istiod-openshift-gateway-57c5c5ccb9-

Who accepted the gatewayclass then?! The check for its acceptance goes before the gateway check. |
Force-pushed 0424f4e to 005d884.
|
Running |
|
/test e2e-aws-gatewayapi |
pkg/resources/dnsrecord/dns.go
Outdated
// Quick sanity check since we don't know how to handle both being set (is
// that even a valid state?)
if len(ingress.Hostname) > 0 && len(ingress.IP) > 0 {
    log.Info("no load balancer hostname or IP for dnsrecord", "dnsrecord", name, "service", service.Name)
In this case it has both hostname and ip, which is unexpected.
test/e2e/util_gatewayapi_test.go
Outdated
recordedConditionMsg := "not found"
recordedAcceptedConditionMsg, recordedProgrammedConditionMsg := "not found", "not found"
With your change below to use %q instead of %s, it would be clearer to initialize these variables to empty strings.
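For illustration, a standalone snippet (not the test code itself) showing why `%q` pairs better with empty-string initialization than `%s` does with a placeholder value:

```go
package main

import "fmt"

func main() {
	// If the variable is never overwritten during polling, an empty string
	// printed with %s simply vanishes from the failure message.
	msg := ""
	fmt.Printf("last recorded status message: %s\n", msg)

	// With %q the same empty string is rendered as "", which is unambiguous,
	// and a placeholder such as "not found" can no longer be mistaken for a
	// real condition message.
	fmt.Printf("last recorded status message: %q\n", msg)
}
```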
return gatewayListenersHostnamesChanged(old, new)
changed := gatewayListenersHostnamesChanged(old, new)
if changed {
    log.Info("listener hostname changed", "gateway", e.ObjectNew.(*gatewayapiv1.Gateway).Name)
Here and in the other log messages that you add, please capitalize the first letter. See https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#message-style-guidelines.
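For example, a minimal sketch of the requested message style (using klog here purely for illustration; the operator's own logger is assumed to behave the same way):

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	gateway := "example-gateway" // hypothetical gateway name

	// Message style per the sig-instrumentation guidelines: start with a
	// capital letter, no trailing punctuation, context in key/value pairs.
	klog.InfoS("Listener hostname changed", "gateway", gateway)
}
```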
|
In some test failures that I have been investigating, there are no events to indicate that the service controller tried to provision an LB, which leads me to wonder whether Istio sometimes doesn't create the service or doesn't configure it correctly. The istiod logs aren't very helpful for diagnosing these failures. For this reason, I believe it could be helpful for … |
|
/assign @rfredette |
|
/jire refresh |
|
/jira refresh |
|
@alebedev87: This pull request references Jira Issue OCPBUGS-54966, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
pkg/resources/dnsrecord/dns.go
Outdated
// Quick sanity check since we don't know how to handle both being set (is
// that even a valid state?)
if len(ingress.Hostname) > 0 && len(ingress.IP) > 0 {
    log.Info("Both load balancer hostname and IP are set", "dnsrecord", name, "service", service.Name)
nit: how about logging an error here, in case it does happen more than we expect?
Good call, changing log.Info(...) to log.Err(nil, ...) would make it more obvious that this log message represents an error condition.
Changed to log.Error().
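A minimal sketch of the difference (using go-logr's stdr backend purely for illustration; the operator's `log` is assumed to expose the same logr-style `Info`/`Error` methods):

```go
package main

import (
	"log"
	"os"

	"github.com/go-logr/stdr"
)

func main() {
	logger := stdr.New(log.New(os.Stderr, "", log.LstdFlags))

	// As an Info message, the unexpected state is easy to overlook when
	// scanning logs for failures.
	logger.Info("Both load balancer hostname and IP are set", "service", "example-gateway")

	// Logged via Error with a nil error, the same message is clearly surfaced
	// as an error condition, which is what the review suggestion is after.
	logger.Error(nil, "Both load balancer hostname and IP are set", "service", "example-gateway")
}
```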
|
|
err := wait.PollUntilContextTimeout(context.Background(), 1*time.Second, 1*time.Minute, false, func(context context.Context) (bool, error) {
    // Wait for the gateway to be accepted and programmed.
    // Load balancer provisioning can take several minutes on some platforms.
nit: it would be great in the future to set timeouts that vary based on platforms.
Is there an advantage to using a lower timeout on platforms that don't need the higher timeout?
I would prefer to take the longest amount of time we expect provisioning to take, and double it. We don't pay the cost unless provisioning actually takes that amount of time, but we do pay a cost if the test times out when waiting longer would have allowed it to pass.
We pay the cost of higher timeouts when everything is failing. I'm suggesting that we be more choosy with expected timeouts to take advantage of tests failing faster on platforms that don't take as long to provision. It's just a nit though, so no big deal.
I'm suggesting that we be more choosy with expected timeouts to take advantage of tests failing faster on platforms that don't take as long to provision. It's just a nit though, so no big deal.
My concern with "tests failing faster on platforms that don't take as long to provision" is that it could lead to premature failures. For instance, if LB provisioning on GCP typically takes 1 minute, but occasionally takes 2 minutes, should we fail the test after 1 minute of polling (and then follow up with the CCM team)? Or should we allow it to poll for an additional minute to potentially avoid a retest of the entire suite?
I'd say that the cost of retesting flaky tests outweighs the cost of a slightly longer poll in a failed scenario, especially since the LB provisioning is a process outside of our direct control.
We pay the cost of higher timeouts when everything is failing.
Can we test for "everything is failing" in the polling loop, and exit early if the condition is detected? That would fail faster, flake less, and provide useful diagnostic information right in the test output.
Yes, but let's prioritize the merge so we can get better CI signal.
Yes, but let's prioritize the merge so we can get better CI signal.
Agreed. The PR has already merged, but in general when we have tests that are flaky because polling loops time out, I would argue for (1) increasing timeouts as a first pass in order to reduce flakiness and (2) refining the failure conditions as a second pass in order to fail faster and improve diagnostics. Optimizing to reduce flakiness virtually always takes precedence over optimizing to fail fast.
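A sketch of the two-pass idea under discussion (not the PR's actual code): keep a generous overall timeout, but let the condition function abort the poll as soon as a problem that more waiting cannot fix is detected. The helper name and the interval/timeout values here are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkGateway stands in for the real status check. It reports whether the
// gateway is ready, and returns a non-nil error only for problems that more
// waiting cannot resolve (here it always fails, to demonstrate the early exit).
func checkGateway() (ready bool, fatal error) {
	return false, errors.New("gateway service was never created")
}

func main() {
	// The overall timeout stays generous so slow load-balancer provisioning
	// doesn't flake the test.
	err := wait.PollUntilContextTimeout(context.Background(), 10*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			ready, fatal := checkGateway()
			if fatal != nil {
				// Returning a non-nil error stops the polling immediately,
				// so the failure surfaces in seconds instead of minutes and
				// carries a useful diagnostic message.
				return false, fmt.Errorf("giving up early: %w", fatal)
			}
			return ready, nil
		})
	if err != nil {
		fmt.Println("poll failed:", err)
	}
}
```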
// The creation of the gateway service may be delayed.
// Check the current status of the service to see where we are.
svc := &corev1.Service{}
svcNsName := types.NamespacedName{Namespace: namespace, Name: gw.Name + "-" + string(gw.Spec.GatewayClassName)}
nit: the generated name could be stored for other uses, like https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/names.go#L319
I prefer to avoid reusing the code which is being tested in tests. Composing a name like this adds an additional check for an implicit assumption about the naming of the publishing service.
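A sketch of the point being made (hypothetical names; the real test composes the name inline): by spelling out the `<gateway name>-<gatewayclass name>` convention in the test instead of calling the operator's name helper, the test asserts the naming contract rather than inheriting it from the code under test.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

// expectedGatewayServiceName composes the name of the service that is expected
// to be created for a Gateway, independently of the operator's own helper.
func expectedGatewayServiceName(namespace, gatewayName, gatewayClassName string) types.NamespacedName {
	return types.NamespacedName{
		Namespace: namespace,
		Name:      gatewayName + "-" + gatewayClassName,
	}
}

func main() {
	// Hypothetical values standing in for the Gateway under test.
	fmt.Println(expectedGatewayServiceName("openshift-ingress", "example-gateway", "openshift-default"))
}
```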
test/e2e/util_gatewayapi_test.go
Outdated
t.Logf("Last observed gateway:\n%s", util.ToYaml(gw))
return nil, fmt.Errorf("gateway %v not %v, last recorded status message: %s", nsName, gatewayapiv1.GatewayConditionAccepted, recordedConditionMsg)
t.Logf("[%s] Last observed gateway:\n%s", time.Now().Format(time.DateTime), util.ToYaml(gw))
return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages: %q, %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)
return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages: %q, %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)
return nil, fmt.Errorf("gateway %v does not have all expected conditions, last recorded status messages for Accepted: %q, Programmed: %q", nsName, recordedAcceptedConditionMsg, recordedProgrammedConditionMsg)
The previous log line, with util.ToYaml(gw), should already log the Accepted and Programmed status conditions; is it useful to include them in the error message as well?
This requested change is just to clarify which condition matches the listed message. The message could end up just showing "last recorded status messages: Pending, Invalid". I'm suggesting we change it to "last recorded status messages for Accepted: Pending, Programmed: Invalid".
I agree this error message is redundant. But I'm fine with changing it to the suggested one so we can more easily see the extracted messages in the logs.
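For context, a sketch (not the PR's actual helper) of how the last Accepted and Programmed messages could be pulled out of the Gateway status so the final error can label each one:

```go
package main

import (
	"fmt"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// lastConditionMessages extracts the latest Accepted and Programmed condition
// messages from a Gateway status.
func lastConditionMessages(gw *gatewayapiv1.Gateway) (accepted, programmed string) {
	if c := apimeta.FindStatusCondition(gw.Status.Conditions, string(gatewayapiv1.GatewayConditionAccepted)); c != nil {
		accepted = c.Message
	}
	if c := apimeta.FindStatusCondition(gw.Status.Conditions, string(gatewayapiv1.GatewayConditionProgrammed)); c != nil {
		programmed = c.Message
	}
	return accepted, programmed
}

func main() {
	// Hypothetical status values for demonstration only.
	gw := &gatewayapiv1.Gateway{}
	gw.Status.Conditions = []metav1.Condition{
		{Type: string(gatewayapiv1.GatewayConditionAccepted), Status: metav1.ConditionTrue, Message: "Gateway accepted"},
		{Type: string(gatewayapiv1.GatewayConditionProgrammed), Status: metav1.ConditionFalse, Message: "Waiting for load balancer"},
	}
	accepted, programmed := lastConditionMessages(gw)
	fmt.Printf("last recorded status messages for Accepted: %q, Programmed: %q\n", accepted, programmed)
}
```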
- Add more logs to the DNSRecord ensure logic to show when the DNSRecord is updated or left untouched.
- Add more logs to the generation of the desired DNSRecord to show why the Gateway service is not picked up.
- Add more logs to the Gateway service enqueue logic to show when reconciliation is or is not triggered.
- Ensure the `Programmed` condition is checked on the Gateway resource in the e2e test.
- Log the gateway service definition while polling for gateway acceptance in the e2e test.
- Log the reason why the DNSRecord is not ready during the assertion.
Force-pushed 443fe50 to 312078a.
|
/retest-required |
|
@alebedev87: The following tests failed, say
|
/lgtm |
|
We know the root cause for failures in CI for e2e-gcp-operator and they are being handled by the cloud controller team. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: candita, rfredette. |
|
@candita: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. In response to this:
|
We know the root cause for failures in CI for e2e-gcp-operator and they are being handled by the cloud controller team. |
|
@candita: Overrode contexts on behalf of candita: ci/prow/e2e-gcp-operator In response to this:
|
/cancel hold |
|
/tide refresh |
|
/unhold |
Merged commit 52a0048 into openshift:master.
|
@alebedev87: Jira Issue OCPBUGS-54966: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-54966 has been moved to the MODIFIED state. In response to this:
|
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-ingress-operator |
This PR aims at helping with gateway API e2e test failures when the `DNSRecord` for a gateway cannot be found in time (example):
…
- Ensure the `Programmed` condition is checked on the Gateway resource in the e2e test.