OTA-1531: Rework error handling in FeatureGate processing #1206
Conversation
@petr-muller: This pull request references OTA-1531 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from e6ff14c to 8357a33
featureGates := configInformerFactory.Config().V1().FeatureGates().Lister()
configInformerFactory.Start(ctx.Done())
configInformerFactory.WaitForCacheSync(ctx.Done())

ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
Just want to understand your reasoning on the number: is the 30s here WaitForCacheSync (maybe 5s?) + poll timeout (25s)?
It is the same 30s timeout that we had before, just applied elsewhere.
Previously, before we started using the informer-based (= cached) client, the GET to obtain the FeatureGate resource was the operation that could fail or time out. We had a 30s timeout there internally in client-go for some networking errors, and on top of that we had a 25s timeout (I hope the 542520d commit description helps a bit with understanding why the 25s layer was added). The goal of the timeout is to force a restart when CVO fails its initial setup rather than letting it get stuck on startup. If we persistently fail over restarts, we go into a crashloop, which is a good thing.
After we changed the client to be informer-based, it is cached and the Get on L266 should no longer fail, because it is a local operation over the cache. But the same failure mode as before can still happen, just earlier, when we sync the cache. Before this PR that wait was unbounded; the CVO could hypothetically wait forever. So this PR imposes the same 30s timeout on the cache sync that we previously had on the GET.
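To make that shape concrete, here is a minimal Go sketch of the bounded-sync pattern described above. It reuses the variable names visible in the diff (configInformerFactory, featureGates); the function name, error handling, and the exact informer-factory wiring are illustrative assumptions, not the actual CVO code.

```go
package featuregates

import (
	"context"
	"fmt"
	"time"

	configinformers "github.com/openshift/client-go/config/informers/externalversions"
)

// setupFeatureGates is a sketch: bound the informer cache sync with the same
// 30s budget that previously applied to the direct GET, then read the
// FeatureGate from the local cache.
func setupFeatureGates(ctx context.Context, configInformerFactory configinformers.SharedInformerFactory) error {
	featureGates := configInformerFactory.Config().V1().FeatureGates().Lister()

	syncCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	configInformerFactory.Start(syncCtx.Done())
	// Returns when the caches are synced or the 30s deadline passes, instead
	// of potentially blocking forever on an unreachable apiserver.
	for informerType, synced := range configInformerFactory.WaitForCacheSync(syncCtx.Done()) {
		if !synced {
			// Failing startup lets the process exit and be restarted;
			// persistent failures surface as a crashloop rather than a hang.
			return fmt.Errorf("informer cache %v did not sync within 30s", informerType)
		}
	}

	// After a successful sync this Get is a local cache lookup, so transient
	// network errors are no longer expected here.
	if _, err := featureGates.Get("cluster"); err != nil {
		return fmt.Errorf("getting FeatureGate from cache: %w", err)
	}
	return nil
}
```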
/test e2e-agnostic-ovn-upgrade-into-change-techpreview
- The `FeatureGate` GET is now made over an informer client which is backed by a cache, so polling to smooth over non-IsNotFound errors no longer makes sense. The GET runs after a cache sync has already succeeded, so no errors are expected (and if an error happens, it is unlikely to be resolved by a retry).
- Waiting for cache sync was not bounded; impose a 30s timeout on startup (this is similar to the deadline imposed by the poll before we started to use the informer client).
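For contrast, here is a rough reconstruction of the kind of poll-based retry the first bullet removes. It assumes the old code retried non-NotFound errors from a live client GET within the 25s budget mentioned above; the function name, one-second interval, and exact error handling are illustrative, not the removed CVO code.

```go
package featuregates

import (
	"context"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getFeatureGateWithPoll sketches the removed pattern: a live GET retried
// under a poll so transient (non-NotFound) apiserver errors could be smoothed
// over. With a cached lister, a retry would just re-read the same local cache,
// so it no longer buys anything.
func getFeatureGateWithPoll(ctx context.Context, client configclient.Interface) (*configv1.FeatureGate, error) {
	var fg *configv1.FeatureGate
	err := wait.PollUntilContextTimeout(ctx, time.Second, 25*time.Second, true,
		func(ctx context.Context) (bool, error) {
			var getErr error
			fg, getErr = client.ConfigV1().FeatureGates().Get(ctx, "cluster", metav1.GetOptions{})
			if getErr == nil {
				return true, nil
			}
			if apierrors.IsNotFound(getErr) {
				return false, getErr // NotFound will not resolve itself; stop polling
			}
			return false, nil // other errors: keep retrying until the 25s deadline
		})
	return fg, err
}
```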
Force-pushed from 8357a33 to 93f3861
/label no-qe
I do not think this needs to be explicitly tested; I'm not even sure how we would approach it, as we'd need a broken apiserver on CVO startup.
/test e2e-aws-ovn-techpreview
@petr-muller: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hongkailiu, petr-muller
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing