Skip to content

Conversation

petr-muller
Copy link
Member

@petr-muller petr-muller commented Jun 20, 2025

  • The FeatureGate GET is now made over an informer client which is backed by a cache, so polling to smooth over non-isNotFound errors no longer makes sense. The GET runs after a cache sync succeeded earlier so no errors are expected (and if an error happens, it is unlikely to be resolved by a retry)
  • Waiting for cache sync was not bounded; impose a 30s timeout on startup (this is similar to the deadline imposed by the poll before we started to use the informer client)

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 20, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jun 20, 2025

@petr-muller: This pull request references OTA-1531 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

  • The FeatureGate GET is now made over an informer client which is
    backed by a cache, so polling to smooth over non-isNotFound errors no
    longer makes sense. The GET runs after a cache sync succeeded earlier
    so no errors are expected (and if an error happens, it is unlikely to
    be resolved by a retry)
  • Waiting for cache sync was not bounded; impose a 30s timeout on startup
    (this is similar to the deadline imposed by the poll before we started
    to use the informer client)
  • Return pointers from processInitialFeatureGate to make return errors
    easier

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jun 20, 2025

@petr-muller: This pull request references OTA-1531 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

  • The FeatureGate GET is now made over an informer client which is backed by a cache, so polling to smooth over non-isNotFound errors no longer makes sense. The GET runs after a cache sync succeeded earlier so no errors are expected (and if an error happens, it is unlikely to be resolved by a retry)
  • Waiting for cache sync was not bounded; impose a 30s timeout on startup (this is similar to the deadline imposed by the poll before we started to use the informer client)
  • Return pointers from processInitialFeatureGate to make return errors easier

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2025
@petr-muller petr-muller force-pushed the ota-1531-08-simplify-featuregate-get branch from e6ff14c to 8357a33 Compare July 3, 2025 17:52
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jul 3, 2025

@petr-muller: This pull request references OTA-1531 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

  • The FeatureGate GET is now made over an informer client which is backed by a cache, so polling to smooth over non-isNotFound errors no longer makes sense. The GET runs after a cache sync succeeded earlier so no errors are expected (and if an error happens, it is unlikely to be resolved by a retry)
  • Waiting for cache sync was not bounded; impose a 30s timeout on startup (this is similar to the deadline imposed by the poll before we started to use the informer client)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

featureGates := configInformerFactory.Config().V1().FeatureGates().Lister()
configInformerFactory.Start(ctx.Done())
configInformerFactory.WaitForCacheSync(ctx.Done())

ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to understand your reasoning on the number:
30s here are WaitForCacheSync (maybe 5s?) + poll timeout (25s)?

Copy link
Member Author

@petr-muller petr-muller Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same 30s timeout that we had before, just applied elsewhere.

Previously, before we started using the informer-based (=cached) client, the GET to obtain the FeatureGet resource was the operation that could fail or timeout. We had a 30s timeout there in client-go internally for some networking errors, and on top of that we had a 25s timeout (i hope the 542520d commit description helps a bit with understanding why the 25s layer was added). The goal of the timeout is to force a restart when CVO fails its initial setup rather than for it to get stuck on startup. If we persistently fail over restarts, we'd go into crashloop error which is a good thing.

After we changed the client to be informer-based, it is cached and the Get on L266 should no longer fail because it is a local operation over a cache. But the same failure case like previously can still happen, just earlier, when we sync the cache. Before this PR the wait was unbounded; the CVO could hypothetically wait forever. So this PR imposes the same 30s timeout on this like it did on the GET previously.

@petr-muller
Copy link
Member Author

/test e2e-agnostic-ovn-upgrade-into-change-techpreview

- The `FeatureGate` GET is now made over an informer client which is
  backed by a cache, so polling to smooth over non-isNotFound errors no
  longer makes sense. The GET runs after a cache sync succeeded earlier
  so no errors are expected (and if an error happens, it is unlikely to
  be resolved by a retry)
- Waiting for cache sync was not bounded; impose a 30s timeout on startup
  (this is similar to the deadline imposed by the poll before we started
  to use the informer client)
@petr-muller petr-muller force-pushed the ota-1531-08-simplify-featuregate-get branch from 8357a33 to 93f3861 Compare August 28, 2025 12:26
@petr-muller
Copy link
Member Author

/label no-qe

I do not think this needs to be explicitly tested, I'm not even sure how we would approach it; we'd need a broken apiserver on CVO startup.

@openshift-ci openshift-ci bot added the no-qe Allows PRs to merge without qe-approved label label Aug 28, 2025
@petr-muller
Copy link
Member Author

/test e2e-aws-ovn-techpreview

Copy link
Contributor

openshift-ci bot commented Aug 28, 2025

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 93f3861 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-techpreview 93f3861 link true /test e2e-aws-ovn-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hongkailiu
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 28, 2025
Copy link
Contributor

openshift-ci bot commented Aug 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [hongkailiu,petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. no-qe Allows PRs to merge without qe-approved label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants