@jianzhangbjz
Member

@jianzhangbjz jianzhangbjz commented Oct 23, 2025

Description of the change:
Fixes a race condition that causes OLM to create duplicate InstallPlans for the same subscription when multiple worker threads reconcile a namespace concurrently.

Problem

When two worker threads process the namespace queue simultaneously:

  1. Worker 1 calls listInstallPlans() and finds no existing InstallPlan
  2. Worker 2 calls listInstallPlans() and finds no existing InstallPlan
  3. Worker 1 acquires the mutex lock and creates an InstallPlan
  4. Worker 1 releases the lock
  5. Worker 2 acquires the mutex lock, checks its stale list (from step 2), and creates a duplicate InstallPlan

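This is a classic time-of-check to time-of-use (TOCTOU) race: the existence check and the create are not performed atomically, so the second worker acts on a list that is already out of date.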

Motivation for the change:

Architectural changes:
Move the listInstallPlans() call inside the mutex-protected critical section to ensure atomic check-and-create behavior.
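A minimal sketch of the check-and-create pattern described here, assuming a simplified in-memory model rather than the real OLM types. The names `reconciler`, `installPlan`, and `ensurePlan` are hypothetical; the only point is that the lookup and the create happen under the same lock.

```go
package main

import (
	"fmt"
	"sync"
)

// installPlan is a hypothetical, stripped-down stand-in for an InstallPlan.
type installPlan struct{ Name string }

// reconciler is a hypothetical holder of plans keyed by subscription name.
type reconciler struct {
	mu    sync.Mutex
	plans map[string]*installPlan
}

// ensurePlan performs the lookup and the create inside one critical section,
// so no other worker can create a plan between the check and the use.
// The boolean reports whether this call created the plan.
func (r *reconciler) ensurePlan(subscription string) (*installPlan, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()

	// The "list" step now runs while holding the lock.
	if existing, ok := r.plans[subscription]; ok {
		return existing, false
	}
	p := &installPlan{Name: "install-" + subscription}
	r.plans[subscription] = p
	return p, true
}

func main() {
	r := &reconciler{plans: map[string]*installPlan{}}
	first, created1 := r.ensurePlan("my-sub")
	second, created2 := r.ensurePlan("my-sub")
	fmt.Println(created1, created2, first == second)
}
```

The trade-off in this sketch is that the lookup is serialized along with the create, which lengthens the critical section slightly in exchange for an atomic decision.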

Testing remarks:
Added TestEnsureInstallPlanConcurrency, which:

  • Launches 10 concurrent goroutines calling ensureInstallPlan()
  • Verifies only one InstallPlan is created
  • Verifies all goroutines receive the same InstallPlan reference

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

Assisted-by: Claude Code

@openshift-ci openshift-ci bot requested review from anik120 and joelanford October 23, 2025 08:43
@jianzhangbjz
Member Author

Hi @joelanford, could you help take a look when you get a chance? Thanks!

@perdasilva
Collaborator

/approve

@openshift-ci

openshift-ci bot commented Oct 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: perdasilva

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2025
@camilamacedo86
Contributor

@joelanford your PR #3684 does the same as this one. However, this one also includes tests, so it is in a better state.

Is there any reason not to move forward with this one instead of #3684?

@joelanford
Member

Let's use this one. I'll close mine. Looks great!

@joelanford
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 23, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit ce26d16 into operator-framework:master Oct 23, 2025
14 checks passed