Conversation

@lunarwhite
Member

@lunarwhite lunarwhite commented Oct 13, 2025

Background

Fixes: https://issues.redhat.com/browse/CM-735

The istio-csr controller exhibits an intermittent race condition where newly created IstioCSR custom resources are sometimes not reconciled, resulting in:

  • Empty .status field in the IstioCSR object
  • Missing cert-manager-istio-csr deployment
  • "object not found" errors in operator logs despite the object existing
  • Operator restart required to resolve

Root Cause

The istio-csr controller starts processing reconcile requests without waiting for its custom cache to finish synchronizing.

  • Before: Two separate caches → race condition
    • Manager's default cache (watches)
    • Custom cache (reads)
OLD (Race):
Manager cache syncs → triggers reconcile → reads from different custom cache → might not be synced
  • After: Single unified cache → no race possible
    • Manager's cache configured with NewCacheBuilder
    • Both watches AND reads use same cache instance
    • Controller-runtime guarantees cache sync before reconciliation
NEW (No Race):
Manager cache syncs → triggers reconcile → reads from SAME manager cache → guaranteed synced
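
The before/after flows above can be illustrated with a toy model. This is a minimal sketch using only the Go standard library, not controller-runtime: `toyCache`, `reconcile`, and the object key `default/istiocsr` are hypothetical names chosen for illustration. It shows why an event delivered from one synced cache can lead to an "object not found" read when the read goes to a second cache that has not synced yet.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound mimics the "object not found" error seen in the operator logs.
var ErrNotFound = errors.New("object not found")

// toyCache is a hypothetical stand-in for an informer-backed cache.
// It is NOT controller-runtime's cache type.
type toyCache struct {
	objects map[string]string // object key -> spec
}

func newToyCache() *toyCache {
	return &toyCache{objects: map[string]string{}}
}

// Sync copies the API server's current state into the cache.
func (c *toyCache) Sync(server map[string]string) {
	for key, spec := range server {
		c.objects[key] = spec
	}
}

// Get returns the cached spec, or ErrNotFound if the cache has no entry.
func (c *toyCache) Get(key string) (string, error) {
	spec, ok := c.objects[key]
	if !ok {
		return "", ErrNotFound
	}
	return spec, nil
}

// reconcile models a reconcile request: the event has already fired,
// and the reconciler now reads the object from readCache.
func reconcile(key string, readCache *toyCache) error {
	_, err := readCache.Get(key)
	return err
}

func main() {
	server := map[string]string{"default/istiocsr": "spec"}

	// Before: watches use the manager cache, reads use a separate custom cache.
	managerCache, customCache := newToyCache(), newToyCache()
	managerCache.Sync(server) // only the manager cache has synced so far
	fmt.Println("two caches:", reconcile("default/istiocsr", customCache))

	// After: watches and reads share the same (already synced) cache.
	fmt.Println("one cache: ", reconcile("default/istiocsr", managerCache))
}
```

In the two-cache case the read fails even though the object exists on the server; in the single-cache case the read cannot run ahead of the cache that produced the event.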

Changes

  • Removed BuildCustomClient() - No more separate custom cache
  • Added NewCacheBuilder() - Configures manager's cache with label selectors
  • Updated NewClient() - Now uses manager's m.GetClient() directly
  • Updated setup_manager.go - Configure manager with custom cache builder

Validation

IstioCSR e2e test suite passed repeatedly without suffering the described flakiness.

Proofs (exited after a 1h global timeout; all runs until then passed):

[NB] Major dialogues with AI: https://gist.github.com/lunarwhite/8928d1dc8e35d0d23e6cc7a364985215

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 13, 2025
@openshift-ci-robot

openshift-ci-robot commented Oct 13, 2025

@lunarwhite: This pull request references CM-735 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Oct 13, 2025

Walkthrough

Replaces the prior custom client construction with a manager-backed client and introduces a cache builder. NewClient now uses the manager's GetClient() directly; NewCacheBuilder configures a filtered cache for IstioCSR-related resources; manager setup switched to use the new cache builder.

Changes

Cohort / File(s) | Summary
Client simplification (pkg/controller/istiocsr/client.go) | NewClient now returns a ctrlClientImpl built from mgr.GetClient(); removed earlier BuildCustomClient usage and its error path; added comments about cache consistency.
Cache builder & controller changes (pkg/controller/istiocsr/controller.go) | Added NewCacheBuilder(config *rest.Config, opts cache.Options) (cache.Cache, error) that sets up a cache with label selectors for IstioCSR and multiple Kubernetes resources. Removed BuildCustomClient. Added import k8s.io/client-go/rest.
Manager configuration (pkg/operator/setup_manager.go) | Replaced prior custom client setup with ctrl.Options{ ..., NewCache: istiocsr.NewCacheBuilder }; removed use/imports of the previous rest-based client wrapper and adjusted comments/logging accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title Check | ✅ Passed | The pull request title "CM-735: Fix IstioCSR cache sync race condition by using unified manager cache" is directly related to the main change in the changeset. The title clearly and specifically identifies the problem being fixed (a cache sync race condition in IstioCSR) and the solution approach (using a unified manager cache). The changes throughout the pull request—removal of BuildCustomClient(), addition of NewCacheBuilder(), modification of NewClient(), and updates to setup_manager.go—all work together to implement this unified cache approach. The title is concise, avoids vague terminology, and a teammate scanning the repository history would immediately understand the primary change.
Description Check | ✅ Passed | The pull request description is thoroughly related to the changeset and provides substantive context for the changes. It includes a clear background explaining the race condition bug, a detailed root cause analysis comparing the dual-cache vs. unified-cache approaches, and a section documenting the specific code changes. The description also provides validation evidence via CI test results. All aspects of the description directly support understanding the purpose and implementation of the modifications to the three files in the changeset, making it informative and on-topic.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4f1fb15 and 0ef8c87.

📒 Files selected for processing (3)
  • pkg/controller/istiocsr/client.go (1 hunks)
  • pkg/controller/istiocsr/controller.go (2 hunks)
  • pkg/operator/setup_manager.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/controller/istiocsr/client.go
  • pkg/operator/setup_manager.go
🔇 Additional comments (2)
pkg/controller/istiocsr/controller.go (2)

18-18: LGTM: Import addition is appropriate.

The rest package import is required for the NewCacheBuilder function signature.


60-102: Excellent implementation: unified cache with label selectors addresses the race condition and mitigates cache bloat.

The NewCacheBuilder function correctly configures a unified cache that eliminates the dual-cache race condition described in the PR objectives. Key design decisions:

  1. Label selectors mitigate cache bloat: The ByObject configuration with Label: managedResourceLabelReqSelector limits cached objects to those with app=cert-manager-istio-csr, addressing the cache bloat concern raised in previous comments.

  2. IstioCSR without label selector (Line 73): Correct—the controller must reconcile all IstioCSR instances, not just those with the managed-resource label.

  3. Secret exclusion from ByObject: Appropriate—Secrets are watched via WatchesMetadata (Line 193) with a different label filter (istiocsrResourceWatchLabelName) and are not managed resources created by this controller.

  4. Complete alignment: All resource types watched in SetupWithManager (Lines 185-193) are properly included in the cache configuration.

This implementation ensures that both watches and reads use the same synchronized cache, eliminating the race condition while maintaining efficient resource filtering.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between b099c2f and 00b55ae.

📒 Files selected for processing (5)
  • Makefile (1 hunks)
  • pkg/controller/istiocsr/client.go (1 hunks)
  • pkg/controller/istiocsr/controller.go (2 hunks)
  • pkg/operator/setup_manager.go (1 hunks)
  • test/e2e/istio_csr_test.go (1 hunks)
🔇 Additional comments (6)
Makefile (1)

107-107: LGTM! Timeout increase aligns with repeated test runs.

The timeout increase from 1h to 2h is appropriate given the test changes that introduce MustPassRepeatedly(5) in the IstioCSR E2E tests, which can extend execution time.

test/e2e/istio_csr_test.go (1)

43-43: Consider whether MustPassRepeatedly(5) should remain in production tests.

While MustPassRepeatedly(5) is valuable for validating the race condition fix during development, keeping it in production tests will increase E2E test execution time by approximately 5x. Consider whether this overhead is acceptable long-term, or if it should be removed after the fix is validated in CI.

If the intent is to keep this decorator permanently for regression detection, document this decision in a comment explaining why this particular test requires repeated execution.
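
The cost and benefit of repeated execution can be sketched with a small standard-library analogue. This is not Ginkgo's implementation of MustPassRepeatedly; `mustPassRepeatedly`, `errFlaky`, and the check functions are hypothetical names for illustration. The sketch runs a check up to n times and stops at the first failure, which is how an intermittent race surfaces in a suite that would pass on most single runs.

```go
package main

import (
	"errors"
	"fmt"
)

// errFlaky stands in for an intermittent failure such as the cache race.
var errFlaky = errors.New("object not found")

// mustPassRepeatedly runs check up to n times and stops at the first failure.
// It returns the number of attempts made and the wrapped error, if any.
func mustPassRepeatedly(n int, check func(attempt int) error) (int, error) {
	for attempt := 1; attempt <= n; attempt++ {
		if err := check(attempt); err != nil {
			return attempt, fmt.Errorf("attempt %d of %d failed: %w", attempt, n, err)
		}
	}
	return n, nil
}

func main() {
	// A stable check passes every time, at roughly n times the cost of one run.
	stable := func(int) error { return nil }

	// A flaky check passes on most attempts and fails on the third one.
	flaky := func(attempt int) error {
		if attempt == 3 {
			return errFlaky
		}
		return nil
	}

	runs, err := mustPassRepeatedly(5, stable)
	fmt.Println("stable:", runs, err)

	runs, err = mustPassRepeatedly(5, flaky)
	fmt.Println("flaky: ", runs, err)
}
```

A single run of the flaky check would have passed here; only the repetition exposes the failure, at the price of the extra execution time the review comment mentions.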

pkg/controller/istiocsr/client.go (1)

33-41: LGTM! Unified cache approach eliminates the race condition.

The change to use m.GetClient() directly is the core of the race condition fix. By using the manager's client, the reconciler now reads from the same cache that the controller's watches use, eliminating the dual-cache race where the custom cache could be unsynced during reconciliation.

The added comments clearly document the rationale for this change.

pkg/controller/istiocsr/controller.go (2)

18-18: LGTM! Import addition supports new cache builder signature.

The k8s.io/client-go/rest import is required for the NewCacheBuilder function signature, which takes *rest.Config as its first parameter.


60-104: LGTM! Cache builder correctly configures label selectors and resource access.

The NewCacheBuilder function properly configures the cache with appropriate label selectors:

  • IstioCSR: No filter (needs to see all instances for multi-instance validation)
  • Managed resources (Certificate, Deployment, RBAC, Service, ServiceAccount): Filtered by app=cert-manager-istio-csr label, matching the watch predicates in SetupWithManager
  • Watched/read resources (Secret, ConfigMap, Issuer, ClusterIssuer): No filter (allows reading any instance)

This configuration ensures the cache is properly scoped while maintaining access to all necessary resources.
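
The per-type scoping described above can be modelled with a toy filter. This is a sketch using only the standard library and does not use controller-runtime's cache.Options or ByObject types; `filteredCache`, `selector`, and the object names are hypothetical. Only the label `app=cert-manager-istio-csr` comes from the review. As a simplification of this toy, kinds that are not configured are skipped entirely, which is an assumption and not a claim about controller-runtime's behavior.

```go
package main

import "fmt"

// object is a hypothetical, minimal Kubernetes-like object.
type object struct {
	Kind   string
	Name   string
	Labels map[string]string
}

// selector maps label keys to required values. A nil selector matches everything.
type selector map[string]string

func (s selector) matches(labels map[string]string) bool {
	for key, want := range s {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// filteredCache stores an object only if its kind is configured
// and its labels satisfy that kind's selector.
type filteredCache struct {
	byKind map[string]selector
	items  []object
}

// Add reports whether the object was admitted into the cache.
func (c *filteredCache) Add(o object) bool {
	sel, configured := c.byKind[o.Kind]
	if !configured || !sel.matches(o.Labels) {
		return false
	}
	c.items = append(c.items, o)
	return true
}

func main() {
	managed := selector{"app": "cert-manager-istio-csr"}
	c := &filteredCache{byKind: map[string]selector{
		"IstioCSR":   nil,     // no filter: every instance is cached
		"Deployment": managed, // only managed Deployments are cached
	}}

	fmt.Println(c.Add(object{Kind: "IstioCSR", Name: "default"}))
	fmt.Println(c.Add(object{Kind: "Deployment", Name: "istio-csr", Labels: map[string]string{"app": "cert-manager-istio-csr"}}))
	fmt.Println(c.Add(object{Kind: "Deployment", Name: "unrelated", Labels: map[string]string{"app": "other"}}))
	fmt.Println("cached objects:", len(c.items))
}
```

The unfiltered entry shows the trade-off raised later in the review: a kind listed without a selector admits every instance, which is where cache growth comes from.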

pkg/operator/setup_manager.go (1)

55-57: LGTM! Manager configuration completes the unified cache setup.

Setting NewCache: istiocsr.NewCacheBuilder ensures the manager uses the custom cache builder with proper label selectors. This is the final piece that connects the new cache builder to the manager, ensuring both the controller's watches and the reconciler's reads use the same cache instance.

The added comment clearly documents the purpose of the custom cache builder.

@openshift-ci-robot

openshift-ci-robot commented Oct 13, 2025

@lunarwhite: This pull request references CM-735 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set.

@lunarwhite
Member Author

/test e2e-operator-tech-preview

Contributor

@bharath-b-rh bharath-b-rh left a comment

LGTM, except for a suggestion.

Comment on lines 97 to 100
&corev1.Secret{}: {},
&corev1.ConfigMap{}: {},
&certmanagerv1.Issuer{}: {},
&certmanagerv1.ClusterIssuer{}: {},
Contributor

Adding these might cause the cache to bloat. Should we instead retain the informers as before, so that only the watched objects are cached? I think those would be filtered by the predicates defined. This needs a little testing.

Member Author

@bharath-b-rh Thanks for pointing this out! As a cross-check, I consulted Cursor, and it points out that the old approach, "GetInformer() on types not in ByObject", would still have the same cache bloat issue.

The solution it currently advises is:

  • Keep only the types with label selectors in ByObject
  • Remove the unlabeled types
  • Let controller-runtime handle the rest:
    • .WatchesMetadata() creates a metadata-only informer for Secret automatically
    • .Get() calls for uncached types go directly to the API server

Does that make sense to you?

Contributor

@yuedong Yeah, that's correct. For the cache there's no direct way of doing it; I tried working on it yesterday.

I think the required approach is this: wherever these objects are referenced, we use the cached client first and, on a NotFound error, fall back to the uncached client and add the required labels, so that next time the object is served from the cache.

For the objects in question here, we would need to filter using the existing watch label and add it where missing. I think Issuer/ClusterIssuer objects are not labelled with it.

Also, there is no watch on ConfigMap yet. I think Chirag's PR adds it, but we would also need to add one for Issuer/ClusterIssuer, if I'm not wrong.

WDYT?
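
The cache-first read with an uncached fallback proposed in this comment could look roughly like the following. This is a standard-library sketch under stated assumptions, not the operator's code and not controller-runtime's client API: `liveStore`, `labelCache`, `getWithFallback`, and the label key `example.com/watched` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// watchLabel is a hypothetical label key; the real watch label is not shown in this thread.
const watchLabel = "example.com/watched"

type object struct {
	Name   string
	Labels map[string]string
}

// liveStore models uncached reads that go straight to the API server.
type liveStore map[string]*object

func (s liveStore) Get(name string) (*object, error) {
	o, ok := s[name]
	if !ok {
		return nil, errNotFound
	}
	return o, nil
}

// labelCache models a cache that only holds objects carrying watchLabel.
type labelCache struct {
	items map[string]*object
}

// Resync rebuilds the cache from the live store, keeping labelled objects only.
func (c *labelCache) Resync(live liveStore) {
	c.items = map[string]*object{}
	for name, o := range live {
		if _, ok := o.Labels[watchLabel]; ok {
			c.items[name] = o
		}
	}
}

func (c *labelCache) Get(name string) (*object, error) {
	o, ok := c.items[name]
	if !ok {
		return nil, errNotFound
	}
	return o, nil
}

// getWithFallback reads from the cache first. On a not-found error it reads
// the live store and labels the object so a later resync will cache it.
// The boolean result reports whether the object came from the cache.
func getWithFallback(name string, c *labelCache, live liveStore) (*object, bool, error) {
	cached, err := c.Get(name)
	if err == nil {
		return cached, true, nil
	}
	if !errors.Is(err, errNotFound) {
		return nil, false, err
	}
	o, err := live.Get(name)
	if err != nil {
		return nil, false, err
	}
	if o.Labels == nil {
		o.Labels = map[string]string{}
	}
	o.Labels[watchLabel] = "true"
	return o, false, nil
}

func main() {
	live := liveStore{"issuer": &object{Name: "issuer"}}
	cache := &labelCache{}
	cache.Resync(live)

	_, fromCache, err := getWithFallback("issuer", cache, live)
	fmt.Println("first read from cache:", fromCache, err)

	cache.Resync(live) // the label added above now admits the object
	_, fromCache, err = getWithFallback("issuer", cache, live)
	fmt.Println("second read from cache:", fromCache, err)
}
```

The first read misses the cache and pays for a live lookup; after the label is applied and the cache resyncs, later reads are served from the cache, which is the behavior the comment aims for.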

@bharath-b-rh
Contributor

/lgtm
/approve
/label docs-approved
/label px-approved
/cherrypick cert-manager-1.18

@openshift-cherrypick-robot

@bharath-b-rh: once the present PR merges, I will cherry-pick it on top of cert-manager-1.18 in a new PR and assign it to you.

@openshift-ci openshift-ci bot added docs-approved Signifies that Docs has signed off on this PR px-approved Signifies that Product Support has signed off on this PR labels Oct 20, 2025
@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 20, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 00b55ae and 4f1fb15.

📒 Files selected for processing (3)
  • pkg/controller/istiocsr/client.go (1 hunks)
  • pkg/controller/istiocsr/controller.go (2 hunks)
  • pkg/operator/setup_manager.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/controller/istiocsr/client.go
  • pkg/operator/setup_manager.go
🔇 Additional comments (1)
pkg/controller/istiocsr/controller.go (1)

18-18: Import addition looks good.

Needed for the NewCacheBuilder signature used by controller-runtime.

@lunarwhite
Member Author

/testwith openshift/cert-manager-operator/master/e2e-operator #312

@bharath-b-rh
Contributor

/override ci/prow/e2e-operator-tech-preview

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@bharath-b-rh: Overrode contexts on behalf of bharath-b-rh: ci/prow/e2e-operator-tech-preview

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@lunarwhite: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-operator-tech-preview | 0ef8c87 | link | false | /test e2e-operator-tech-preview

@bharath-b-rh
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bharath-b-rh, lunarwhite

@lunarwhite
Member Author

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Oct 21, 2025
@openshift-ci-robot

openshift-ci-robot commented Oct 21, 2025

@lunarwhite: This pull request references CM-735 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set.

@openshift-merge-bot openshift-merge-bot bot merged commit d346a70 into openshift:master Oct 21, 2025
13 checks passed
@lunarwhite lunarwhite deleted the istiocsr-flake branch October 21, 2025 16:49
@openshift-cherrypick-robot

@bharath-b-rh: new pull request created: #330
