-
Notifications
You must be signed in to change notification settings - Fork 37
CM-735: Fix IstioCSR cache sync race condition by using unified manager cache #324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CM-735: Fix IstioCSR cache sync race condition by using unified manager cache #324
Conversation
|
@lunarwhite: This pull request references CM-735 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughReplaces the prior custom client construction with a manager-backed client and introduces a cache builder. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro Cache: Disabled due to data retention organization setting Knowledge base: Disabled due to 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
🔇 Additional comments (2)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
Makefile(1 hunks)pkg/controller/istiocsr/client.go(1 hunks)pkg/controller/istiocsr/controller.go(2 hunks)pkg/operator/setup_manager.go(1 hunks)test/e2e/istio_csr_test.go(1 hunks)
🔇 Additional comments (6)
Makefile (1)
107-107: LGTM! Timeout increase aligns with repeated test runs.The timeout increase from 1h to 2h is appropriate given the test changes that introduce
MustPassRepeatedly(5)in the IstioCSR E2E tests, which can extend execution time.test/e2e/istio_csr_test.go (1)
43-43: Consider whetherMustPassRepeatedly(5)should remain in production tests.While
MustPassRepeatedly(5)is valuable for validating the race condition fix during development, keeping it in production tests will increase E2E test execution time by approximately 5x. Consider whether this overhead is acceptable long-term, or if it should be removed after the fix is validated in CI.If the intent is to keep this decorator permanently for regression detection, document this decision in a comment explaining why this particular test requires repeated execution.
pkg/controller/istiocsr/client.go (1)
33-41: LGTM! Unified cache approach eliminates the race condition.The change to use
m.GetClient()directly is the core of the race condition fix. By using the manager's client, the reconciler now reads from the same cache that the controller's watches use, eliminating the dual-cache race where the custom cache could be unsynced during reconciliation.The added comments clearly document the rationale for this change.
pkg/controller/istiocsr/controller.go (2)
18-18: LGTM! Import addition supports new cache builder signature.The
k8s.io/client-go/restimport is required for theNewCacheBuilderfunction signature, which takes*rest.Configas its first parameter.
60-104: LGTM! Cache builder correctly configures label selectors and resource access.The
NewCacheBuilderfunction properly configures the cache with appropriate label selectors:
- IstioCSR: No filter (needs to see all instances for multi-instance validation)
- Managed resources (Certificate, Deployment, RBAC, Service, ServiceAccount): Filtered by
app=cert-manager-istio-csrlabel, matching the watch predicates inSetupWithManager- Watched/read resources (Secret, ConfigMap, Issuer, ClusterIssuer): No filter (allows reading any instance)
This configuration ensures the cache is properly scoped while maintaining access to all necessary resources.
pkg/operator/setup_manager.go (1)
55-57: LGTM! Manager configuration completes the unified cache setup.Setting
NewCache: istiocsr.NewCacheBuilderensures the manager uses the custom cache builder with proper label selectors. This is the final piece that connects the new cache builder to the manager, ensuring both the controller's watches and the reconciler's reads use the same cache instance.The added comment clearly documents the purpose of the custom cache builder.
00b55ae to
680596b
Compare
|
@lunarwhite: This pull request references CM-735 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-operator-tech-preview |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, except for a suggestion.
| &corev1.Secret{}: {}, | ||
| &corev1.ConfigMap{}: {}, | ||
| &certmanagerv1.Issuer{}: {}, | ||
| &certmanagerv1.ClusterIssuer{}: {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding these might cause cache to bloat. Should we instead retain informers like before to cache only the watched objects, which I think will be filtered based on predicate defined. This needs a little bit of testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bharath-b-rh Thanks for pointing this out! Just for cross-check, I consulted with Cursor and it points that the old approach "GetInformer() on types not in ByObject" would still have the same cache bloat issue.
The current solution advised by it is
- Keep only the types with label selectors in
ByObject - Remove the unlabeled types
- Let controller-runtime handle the rest:
.WatchesMetadata()creates a metadata-only informer for Secret automatically.Get()calls for uncached types go directly to the API server
Does that something make sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuedong Yeah, that's correct for cache there's no direct way of doing it, I tried working on it yesterday.
The required approach I think is, where ever these objects are referenced we need to use cache client first and on NotFound error use uncache client and update the required labels, so that next time it will be from cache.
For the objects in question here, we would need to filter using existing watch label and add the same where missing, I think Issuer/ClusterIssuer objects are not labelled with it.
And also watch on ConfigMap is not existing, I think Chirag PR adds it, but I think we would need to add for Issuer/ClusterIssuer, if not wrong.
WDYT?
|
/lgtm |
|
@bharath-b-rh: once the present PR merges, I will cherry-pick it on top of In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
680596b to
4f1fb15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (3)
pkg/controller/istiocsr/client.go(1 hunks)pkg/controller/istiocsr/controller.go(2 hunks)pkg/operator/setup_manager.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/controller/istiocsr/client.go
- pkg/operator/setup_manager.go
🔇 Additional comments (1)
pkg/controller/istiocsr/controller.go (1)
18-18: Import addition looks good.Needed for the NewCacheBuilder signature used by controller-runtime.
4f1fb15 to
0ef8c87
Compare
|
/testwith openshift/cert-manager-operator/master/e2e-operator #312 |
|
/override ci/prow/e2e-operator-tech-preview |
|
@bharath-b-rh: Overrode contexts on behalf of bharath-b-rh: ci/prow/e2e-operator-tech-preview In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@lunarwhite: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bharath-b-rh, lunarwhite The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/label qe-approved |
|
@lunarwhite: This pull request references CM-735 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@bharath-b-rh: new pull request created: #330 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Background
Fixes: https://issues.redhat.com/browse/CM-735
The istio-csr controller exhibits an intermittent race condition where newly created IstioCSR custom resources are sometimes not reconciled, resulting in:
.statusfield in the IstioCSR objectcert-manager-istio-csrdeploymentRoot Cause
The custom cache used by the istio-csr controller is not properly waited for synchronization before the controller starts processing reconcile requests.
NewCacheBuilderChanges
BuildCustomClient()- No more separate custom cacheNewCacheBuilder()- Configures manager's cache with label selectorsNewClient()- Now uses manager'sm.GetClient()directlyValidation
IstioCSR e2e test suite passed repeatedly without suffering the described flakiness.
Proofs (exited after a 1h global timeout, till then all runs are passing):
[NB] Major dialogues with AI: https://gist.github.com/lunarwhite/8928d1dc8e35d0d23e6cc7a364985215