OTA-1531: Rework error handling in FeatureGate processing #1206
Conversation
@petr-muller: This pull request references OTA-1531 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from e6ff14c to 8357a33
featureGates := configInformerFactory.Config().V1().FeatureGates().Lister()
configInformerFactory.Start(ctx.Done())
configInformerFactory.WaitForCacheSync(ctx.Done())

ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
Just want to understand your reasoning on the number: is the 30s here WaitForCacheSync (maybe 5s?) + poll timeout (25s)?
It is the same 30s timeout that we had before, just applied elsewhere.
Previously, before we started using the informer-based (= cached) client, the GET to obtain the FeatureGate resource was the operation that could fail or time out. We had a 30s timeout there internally in client-go for some networking errors, and on top of that we had a 25s timeout (I hope the 542520d commit description helps a bit with understanding why the 25s layer was added). The goal of the timeout is to force a restart when CVO fails its initial setup rather than letting it get stuck on startup. If we persistently fail over restarts, we go into a crashloop, which is a good thing.
After we changed the client to be informer-based, it is cached and the Get on L266 should no longer fail, because it is a local operation over the cache. But the same failure mode as before can still happen, just earlier, when we sync the cache. Before this PR that wait was unbounded; the CVO could hypothetically wait forever. So this PR imposes the same 30s timeout on the cache sync that we previously had on the GET.
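To make that shape concrete, here is a minimal Go sketch of the bounded-sync pattern described above. It reuses the variable names visible in the diff (configInformerFactory, featureGates); the function name, error handling, and the exact informer-factory wiring are illustrative assumptions, not the actual CVO code.

```go
package featuregates

import (
	"context"
	"fmt"
	"time"

	configinformers "github.com/openshift/client-go/config/informers/externalversions"
)

// setupFeatureGates is a sketch: bound the informer cache sync with the same
// 30s budget that previously applied to the direct GET, then read the
// FeatureGate from the local cache.
func setupFeatureGates(ctx context.Context, configInformerFactory configinformers.SharedInformerFactory) error {
	featureGates := configInformerFactory.Config().V1().FeatureGates().Lister()

	syncCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	configInformerFactory.Start(syncCtx.Done())
	// Returns when the caches are synced or the 30s deadline passes, instead
	// of potentially blocking forever on an unreachable apiserver.
	for informerType, synced := range configInformerFactory.WaitForCacheSync(syncCtx.Done()) {
		if !synced {
			// Failing startup lets the process exit and be restarted;
			// persistent failures surface as a crashloop rather than a hang.
			return fmt.Errorf("informer cache %v did not sync within 30s", informerType)
		}
	}

	// After a successful sync this Get is a local cache lookup, so transient
	// network errors are no longer expected here.
	if _, err := featureGates.Get("cluster"); err != nil {
		return fmt.Errorf("getting FeatureGate from cache: %w", err)
	}
	return nil
}
```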
/test e2e-agnostic-ovn-upgrade-into-change-techpreview
- The `FeatureGate` GET is now made over an informer client which is backed by a cache, so polling to smooth over non-IsNotFound errors no longer makes sense. The GET runs after a cache sync has already succeeded, so no errors are expected (and if an error happens, it is unlikely to be resolved by a retry).
- Waiting for cache sync was not bounded; impose a 30s timeout on startup (this is similar to the deadline imposed by the poll before we started to use the informer client).
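For contrast, here is a rough reconstruction of the kind of poll-based retry the first bullet removes. It assumes the old code retried non-NotFound errors from a live client GET within the 25s budget mentioned above; the function name, one-second interval, and exact error handling are illustrative, not the removed CVO code.

```go
package featuregates

import (
	"context"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getFeatureGateWithPoll sketches the removed pattern: a live GET retried
// under a poll so transient (non-NotFound) apiserver errors could be smoothed
// over. With a cached lister, a retry would just re-read the same local cache,
// so it no longer buys anything.
func getFeatureGateWithPoll(ctx context.Context, client configclient.Interface) (*configv1.FeatureGate, error) {
	var fg *configv1.FeatureGate
	err := wait.PollUntilContextTimeout(ctx, time.Second, 25*time.Second, true,
		func(ctx context.Context) (bool, error) {
			var getErr error
			fg, getErr = client.ConfigV1().FeatureGates().Get(ctx, "cluster", metav1.GetOptions{})
			if getErr == nil {
				return true, nil
			}
			if apierrors.IsNotFound(getErr) {
				return false, getErr // NotFound will not resolve itself; stop polling
			}
			return false, nil // other errors: keep retrying until the 25s deadline
		})
	return fg, err
}
```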
Force-pushed from 8357a33 to 93f3861
/label no-qe
I do not think this needs to be explicitly tested; I'm not even sure how we would approach it, as we'd need a broken apiserver on CVO startup.
/test e2e-aws-ovn-techpreview
@petr-muller: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hongkailiu, petr-muller
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing