Fix ServerClaim race condition and BMC-driven state oscillation#773
Fix ServerClaim race condition and BMC-driven state oscillation#773
Conversation
6ea823c to
f174603
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds APIReader wiring to server reconcilers and tests; reconcilers re-fetch live Server objects via APIReader.Get before ownership/state decisions and when claiming; server-claim logic now validates against freshly read API state to detect concurrent claims; BMC discover logging distinguishes create/update/unchanged. Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
cmd/main.go (1)
393-395: PushAPIReaderdefaults into the reconcilers.This wiring is correct, but it makes
APIReadera caller-side requirement everywhere these reconcilers are instantiated. Default it in eachSetupWithManageror fail fast there so a missed constructor update becomes a clear setup error instead of a nil-interface panic.Also applies to: 430-432
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/main.go` around lines 393 - 395, The reconcilers (e.g., controller.ServerReconciler) currently require callers to set APIReader which can lead to nil-interface panics; update each reconciler's SetupWithManager method (for ServerReconciler and the other reconcilers referenced) to default r.APIReader = mgr.GetAPIReader() when r.APIReader is nil, or return a clear error immediately if mgr.GetAPIReader() is nil, so the missing constructor wiring becomes a deterministic setup-time failure instead of a runtime panic.internal/controller/suite_test.go (1)
185-213: Please add a regression that actually exercises the uncached-read path.This diff updates the harness, but I don't see a spec here that races two claims against the same available server and proves the loser no longer leaves the server unbound. That would make this fix much harder to regress.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/suite_test.go` around lines 185 - 213, Add a regression test that exercises the uncached-read path by racing two ServerClaim creations against the same available Server and asserting the loser does not leave the Server unbound: create a Server with status Available, concurrently create two ServerClaim objects targeting that Server using k8sManager.GetClient() (or create them in quick succession to simulate the race), then wait (Eventually) for one ServerClaim to reach Bound and the other to be Rejected/NotBound, and assert the Server object (via k8sManager.GetAPIReader() or fresh client read) remains bound to the winning claim; add the test near the ServerClaimReconciler/Suite setup so it uses the configured ServerClaimReconciler and exercises the uncached read path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cmd/main.go`:
- Around line 393-395: The reconcilers (e.g., controller.ServerReconciler)
currently require callers to set APIReader which can lead to nil-interface
panics; update each reconciler's SetupWithManager method (for ServerReconciler
and the other reconcilers referenced) to default r.APIReader =
mgr.GetAPIReader() when r.APIReader is nil, or return a clear error immediately
if mgr.GetAPIReader() is nil, so the missing constructor wiring becomes a
deterministic setup-time failure instead of a runtime panic.
In `@internal/controller/suite_test.go`:
- Around line 185-213: Add a regression test that exercises the uncached-read
path by racing two ServerClaim creations against the same available Server and
asserting the loser does not leave the Server unbound: create a Server with
status Available, concurrently create two ServerClaim objects targeting that
Server using k8sManager.GetClient() (or create them in quick succession to
simulate the race), then wait (Eventually) for one ServerClaim to reach Bound
and the other to be Rejected/NotBound, and assert the Server object (via
k8sManager.GetAPIReader() or fresh client read) remains bound to the winning
claim; add the test near the ServerClaimReconciler/Suite setup so it uses the
configured ServerClaimReconciler and exercises the uncached read path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 98f97e40-148d-43e6-9556-ea843f161553
📒 Files selected for processing (6)
cmd/main.godocs/development/dev_setup.mdinternal/controller/bmc_controller.gointernal/controller/server_controller.gointernal/controller/serverclaim_controller.gointernal/controller/suite_test.go
f174603 to
19758cb
Compare
19758cb to
aa47ac1
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/serverclaim_controller.go`:
- Around line 375-388: The fast-path that returns a cached selectedServer when
it appears already claimed must be revalidated against the API server before
returning; update the branch that currently short-circuits (the cached "already
claimed" path in reconcile) to re-fetch the Server via the controller's API
reader (call r.Get with the Server's namespaced name into selectedServer) and
then perform the same ownership checks used in ensureObjectRefForServer (verify
selectedServer.Spec.ServerClaimRef matches claim.Name/claim.Namespace); if the
live read shows the claim no longer owns the Server, treat it as not found and
return nil so reconcile requeues, otherwise continue as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a858d6c3-13b6-4c49-a3c1-dbdad790309d
📒 Files selected for processing (5)
cmd/main.gointernal/controller/bmc_controller.gointernal/controller/server_controller.gointernal/controller/serverclaim_controller.gointernal/controller/suite_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- cmd/main.go
- internal/controller/suite_test.go
- internal/controller/server_controller.go
- internal/controller/bmc_controller.go
aa47ac1 to
78079ff
Compare
Re-fetch the Server directly from the API server in ensureObjectRefForServer before writing ServerClaimRef, so concurrent claim reconcilers working off a stale informer cache cannot overwrite each other's claim.
Replace the generic "Created or patched Server" log with distinct messages per outcome: "Created Server", "Updated Server", and "Server already up to date".
Re-fetch the Server from the API server before checking ServerClaimRef in handleAvailableState, so a BMC-triggered reconcile with a stale informer snapshot does not miss a concurrently-written claim ref and skip the Reserved transition.
78079ff to
73cfe30
Compare
Fix ServerClaim race condition and BMC-driven state oscillation
Summary
Under load, multiple
ServerClaimreconcilers could simultaneously find the same serverunclaimed due to a stale informer cache, overwrite each other's
ServerClaimRef, andleave no claim successfully bound. A periodic BMC reconcile compounded this by patching
the
Serverobject even when nothing changed, re-triggering the server state machine andrepeatedly resetting the server to
Available— reopening the race window on every cycle.See issue #772
Changes
Fix ServerClaim race: bypass cache before claiming a server
ensureObjectRefForServernow re-fetches theServerdirectly from the API server(via
APIReader) before writingServerClaimRef. Concurrent claim reconcilers workingoff a stale informer cache can no longer both see
ServerClaimRef == niland overwriteeach other's claim — the loser sees the already-set ref on the fresh read and bails out
cleanly.
Skip BMC server patch when discovered fields are unchanged
discoverServersnow only mutates theServerobject whenSystemUUID,SystemURI,BMCRef, or labels have actually changed. Previously,CreateOrPatchwas calledunconditionally on every BMC reconcile, issuing a write even when nothing differed
(
Operation: unchanged). This bumpedResourceVersionand re-triggered the servercontroller every ~7–10 seconds, keeping the oscillation loop alive.
Fix Available→Reserved transition on stale cache in server controller
handleAvailableStatenow re-fetches theServerfrom the API server before checkingServerClaimRef. Without this, a BMC-triggered reconcile arriving with a stale cachesnapshot would find
ServerClaimRef == nil, skip theReservedtransition, and run theAvailable boilerplate — resetting the server state even though a claim had already been
written.
Test plan
Reservedwithout oscillating back to
AvailableOperation: unchangedpatchesfollowed by server controller re-entries into the Available state machine
Summary by CodeRabbit
Bug Fixes
Improvements
Tests