Fix ServerClaim race condition and BMC-driven state oscillation#773

Open
xkonni wants to merge 3 commits into main from bug/serverclaim_race_condition

@xkonni xkonni commented Apr 1, 2026

Fix ServerClaim race condition and BMC-driven state oscillation

Summary

Under load, multiple ServerClaim reconcilers could simultaneously find the same server
unclaimed due to a stale informer cache, overwrite each other's ServerClaimRef, and
leave no claim successfully bound. A periodic BMC reconcile compounded this by patching
the Server object even when nothing changed, re-triggering the server state machine and
repeatedly resetting the server to Available — reopening the race window on every cycle.

See issue #772

Changes

Fix ServerClaim race: bypass cache before claiming a server

ensureObjectRefForServer now re-fetches the Server directly from the API server
(via APIReader) before writing ServerClaimRef. Concurrent claim reconcilers working
off a stale informer cache can no longer both see ServerClaimRef == nil and overwrite
each other's claim — the loser sees the already-set ref on the fresh read and bails out
cleanly.
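
The check-then-write order described above can be illustrated with a small, dependency-free model. This is a sketch under stated assumptions, not the PR's code: `store`, `server`, `claimRef`, and `claimServer` are hypothetical stand-ins, with `cached` playing the stale informer cache and `live` playing the state an uncached API read would return.

```go
package main

import (
	"errors"
	"fmt"
)

// claimRef and server are hypothetical, trimmed-down stand-ins for the
// ServerClaimRef and Server objects discussed in the PR.
type claimRef struct{ Namespace, Name string }

type server struct {
	Name     string
	ClaimRef *claimRef
}

// store models two views of the same object: a possibly stale cache and the
// authoritative live state.
type store struct {
	cached map[string]server
	live   map[string]server
}

var errAlreadyClaimed = errors.New("server already claimed by another claim")

// claimServer ignores the cached copy and re-reads the live state before
// writing the claim reference.
func claimServer(s *store, name string, want claimRef) error {
	fresh, ok := s.live[name]
	if !ok {
		return fmt.Errorf("server %q not found", name)
	}
	if fresh.ClaimRef != nil && *fresh.ClaimRef != want {
		return errAlreadyClaimed // the loser bails out instead of overwriting
	}
	fresh.ClaimRef = &want
	s.live[name] = fresh
	return nil
}

func main() {
	s := &store{
		cached: map[string]server{"srv-1": {Name: "srv-1"}}, // stale: looks unclaimed
		live:   map[string]server{"srv-1": {Name: "srv-1"}},
	}
	fmt.Println(claimServer(s, "srv-1", claimRef{"default", "claim-a"}))
	fmt.Println(claimServer(s, "srv-1", claimRef{"default", "claim-b"}))
}
```

In this model the second caller is rejected even though the cached copy still shows the server as unclaimed. A fresh read alone only narrows the window between read and write; closing it fully would also need a version-checked write, which this sketch does not model.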

Skip BMC server patch when discovered fields are unchanged

discoverServers now only mutates the Server object when SystemUUID, SystemURI,
BMCRef, or labels have actually changed. Previously, CreateOrPatch was called
unconditionally on every BMC reconcile, issuing a write even when nothing differed
(Operation: unchanged). This bumped ResourceVersion and re-triggered the server
controller every ~7–10 seconds, keeping the oscillation loop alive.
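
A minimal sketch of the "only mutate when something differs" check, assuming a simplified `discovered` struct that carries just the four fields named above. The struct, `labelsEqual`, and `applyDiscovered` are hypothetical names for illustration; the real mutate callback works on the Server object and may merge labels rather than replace them.

```go
package main

import "fmt"

// discovered holds the fields the description lists as compared. It is a
// hypothetical simplification of the Server object.
type discovered struct {
	SystemUUID string
	SystemURI  string
	BMCRef     string
	Labels     map[string]string
}

// labelsEqual reports whether both maps hold exactly the same key/value pairs.
func labelsEqual(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// applyDiscovered overwrites current with desired only when something
// differs, and reports whether it changed anything so the caller can skip
// the write. Note that the Labels map is shared with desired after the copy.
func applyDiscovered(current *discovered, desired discovered) bool {
	if current.SystemUUID == desired.SystemUUID &&
		current.SystemURI == desired.SystemURI &&
		current.BMCRef == desired.BMCRef &&
		labelsEqual(current.Labels, desired.Labels) {
		return false
	}
	*current = desired
	return true
}

func main() {
	cur := discovered{SystemUUID: "uuid-1", SystemURI: "/redfish/v1/Systems/1", BMCRef: "bmc-a"}
	want := cur
	fmt.Println("first pass changed:", applyDiscovered(&cur, want))

	want.Labels = map[string]string{"rack": "r1"} // hypothetical label
	fmt.Println("second pass changed:", applyDiscovered(&cur, want))
}
```

Returning a boolean keeps the comparison separate from the write, so a reconcile that finds nothing to do leaves the object and its ResourceVersion alone.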

Fix Available→Reserved transition on stale cache in server controller

handleAvailableState now re-fetches the Server from the API server before checking
ServerClaimRef. Without this, a BMC-triggered reconcile arriving with a stale cache
snapshot would find ServerClaimRef == nil, skip the Reserved transition, and run the
Available boilerplate — resetting the server state even though a claim had already been
written.

Test plan

  • Verify under concurrent claim load that a server transitions cleanly to Reserved
    without oscillating back to Available
  • Confirm BMC reconcile no longer logs repeated Operation: unchanged patches
    followed by server controller re-entries into the Available state machine

Summary by CodeRabbit

  • Bug Fixes

    • Reconciliation now re-fetches fresh resource state to avoid acting on stale cached data and to detect concurrent claims, deferring when conflicts occur.
  • Improvements

    • Skip unnecessary updates when resources already match desired configuration.
    • More granular logging for created/updated/no-op outcomes.
  • Tests

    • Test setup aligned with runtime reconciler behavior.

@xkonni xkonni requested a review from a team as a code owner April 1, 2026 11:08
@github-actions github-actions bot added the size/M, bug (Something isn't working), and documentation (Improvements or additions to documentation) labels Apr 1, 2026
@xkonni xkonni force-pushed the bug/serverclaim_race_condition branch from 6ea823c to f174603 Compare April 1, 2026 11:12

coderabbitai bot commented Apr 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds APIReader wiring to server reconcilers and tests; reconcilers re-fetch live Server objects via APIReader.Get before ownership/state decisions and when claiming; server-claim logic now validates against freshly read API state to detect concurrent claims; BMC discover logging distinguishes create/update/unchanged.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency injection & tests (cmd/main.go, internal/controller/suite_test.go) | Wire mgr.GetAPIReader() into ServerReconciler and ServerClaimReconciler (production and test setup). |
| ServerReconciler (internal/controller/server_controller.go) | Add APIReader client.Reader field; handleAvailableState re-fetches the Server via APIReader.Get into a fresh object before evaluating Spec.ServerClaimRef. |
| ServerClaimReconciler (internal/controller/serverclaim_controller.go) | Add APIReader client.Reader; when selecting/claiming servers, re-fetch candidate Server(s) via APIReader.Get to validate claimability and detect concurrent claims; patch then verify ServerClaimRef post-patch. |
| BMC controller & logging (internal/controller/bmc_controller.go) | In discoverServers, mutate callback early-returns when no changes; post-CreateOrPatch logging now switches on opResult to log Created/Updated/Unchanged messages. |
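
The per-outcome logging summarized for the BMC controller could be sketched as follows. To stay dependency-free, the sketch redeclares a local `operationResult` type that mirrors the string values of controller-runtime's `controllerutil.OperationResult`; `logMessage` is a hypothetical helper, and the message strings are the ones quoted in this PR's commit description.

```go
package main

import "fmt"

// operationResult mirrors the string values of controller-runtime's
// controllerutil.OperationResult that matter here. It is redeclared locally
// only to keep this sketch free of external dependencies.
type operationResult string

const (
	resultNone    operationResult = "unchanged"
	resultCreated operationResult = "created"
	resultUpdated operationResult = "updated"
)

// logMessage picks one message per outcome instead of a single generic
// "Created or patched Server" line.
func logMessage(op operationResult) string {
	switch op {
	case resultCreated:
		return "Created Server"
	case resultUpdated:
		return "Updated Server"
	case resultNone:
		return "Server already up to date"
	default:
		return fmt.Sprintf("Unexpected operation result %q", op)
	}
}

func main() {
	for _, op := range []operationResult{resultCreated, resultUpdated, resultNone} {
		fmt.Println(logMessage(op))
	}
}
```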

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Recon as Reconciler
participant Cache as Informer/Cache
participant API as API Server
participant Obj as Server (API object)

Cache->>Recon: deliver cached Server
Recon->>API: APIReader.Get(Server)
API-->>Recon: fresh Server object
Recon->>Recon: evaluate fresh.Spec.ServerClaimRef & status
alt Already claimed or not claimable
    Recon-->>Cache: update local pointer with fresh; return
else Select & attempt claim
    Recon->>API: Patch Server.spec.serverClaimRef (MergeFrom/Optimistic)
    API-->>Recon: patched Server
    Recon->>API: Get patched Server (verify ownership)
    API-->>Recon: verified Server
    Recon-->>Cache: update local pointer with patched Server
end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

api-change, size/L, area/metal-automation

Suggested reviewers

  • Nuckal777
  • nagadeesh-nagaraja

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Fix ServerClaim race condition and BMC-driven state oscillation' clearly and specifically summarizes the two main problems addressed in the changeset. |
| Description check | ✅ Passed | The description provides a comprehensive summary section, detailed changes explaining the fixes, and a test plan, though it deviates from the minimal template structure. |


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
cmd/main.go (1)

393-395: Push APIReader defaults into the reconcilers.

This wiring is correct, but it makes APIReader a caller-side requirement everywhere these reconcilers are instantiated. Default it in each SetupWithManager or fail fast there so a missed constructor update becomes a clear setup error instead of a nil-interface panic.

Also applies to: 430-432

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 393 - 395, The reconcilers (e.g.,
controller.ServerReconciler) currently require callers to set APIReader which
can lead to nil-interface panics; update each reconciler's SetupWithManager
method (for ServerReconciler and the other reconcilers referenced) to default
r.APIReader = mgr.GetAPIReader() when r.APIReader is nil, or return a clear
error immediately if mgr.GetAPIReader() is nil, so the missing constructor
wiring becomes a deterministic setup-time failure instead of a runtime panic.
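
A minimal sketch of the default-or-fail-fast pattern suggested above, assuming tiny local interfaces in place of controller-runtime's `client.Reader` and `manager.Manager`. All types here (`reader`, `manager`, `mapReader`, `staticManager`, `serverReconciler`) are hypothetical stand-ins so the example compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// reader and manager are hypothetical, minimal stand-ins for
// controller-runtime's client.Reader and manager.Manager.
type reader interface {
	Get(name string) (string, error)
}

type manager interface {
	GetAPIReader() reader
}

// mapReader is a toy reader backed by a map.
type mapReader map[string]string

func (m mapReader) Get(name string) (string, error) {
	v, ok := m[name]
	if !ok {
		return "", fmt.Errorf("%q not found", name)
	}
	return v, nil
}

// staticManager hands out whatever reader it was built with (possibly nil).
type staticManager struct{ apiReader reader }

func (m staticManager) GetAPIReader() reader { return m.apiReader }

type serverReconciler struct {
	APIReader reader
}

// setupWithManager defaults APIReader from the manager and returns an error
// at setup time if none is available, instead of panicking on first use.
func (r *serverReconciler) setupWithManager(mgr manager) error {
	if r.APIReader == nil {
		r.APIReader = mgr.GetAPIReader()
	}
	if r.APIReader == nil {
		return errors.New("serverReconciler: APIReader is not set and the manager provides none")
	}
	return nil
}

func main() {
	wired := &serverReconciler{}
	fmt.Println(wired.setupWithManager(staticManager{apiReader: mapReader{"srv-1": "Available"}}))

	unwired := &serverReconciler{}
	fmt.Println(unwired.setupWithManager(staticManager{}))
}
```

One caveat of this pattern in Go: the nil check only catches a truly nil interface value. A typed nil pointer stored in the interface would pass the check and still fail later.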
internal/controller/suite_test.go (1)

185-213: Please add a regression that actually exercises the uncached-read path.

This diff updates the harness, but I don't see a spec here that races two claims against the same available server and proves the loser no longer leaves the server unbound. That would make this fix much harder to regress.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/suite_test.go` around lines 185 - 213, Add a regression
test that exercises the uncached-read path by racing two ServerClaim creations
against the same available Server and asserting the loser does not leave the
Server unbound: create a Server with status Available, concurrently create two
ServerClaim objects targeting that Server using k8sManager.GetClient() (or
create them in quick succession to simulate the race), then wait (Eventually)
for one ServerClaim to reach Bound and the other to be Rejected/NotBound, and
assert the Server object (via k8sManager.GetAPIReader() or fresh client read)
remains bound to the winning claim; add the test near the
ServerClaimReconciler/Suite setup so it uses the configured
ServerClaimReconciler and exercises the uncached read path.
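
The property such a regression test would assert (exactly one winner, and the server stays bound to it) can be shown with a small in-memory model. This is not the envtest/Ginkgo spec the comment asks for: `apiServer`, `tryClaim`, and `raceClaims` are hypothetical, and the version-checked `update` only imitates optimistic concurrency.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// server is a hypothetical, flattened stand-in for the Server object.
type server struct {
	ClaimRef        string
	ResourceVersion int
}

// apiServer is an in-memory model of optimistic concurrency: an update is
// accepted only if the caller's ResourceVersion is still current.
type apiServer struct {
	mu  sync.Mutex
	obj server
}

var errConflict = errors.New("conflict: resource version changed")

func (a *apiServer) get() server {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.obj
}

func (a *apiServer) update(next server) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if next.ResourceVersion != a.obj.ResourceVersion {
		return errConflict
	}
	next.ResourceVersion++
	a.obj = next
	return nil
}

// tryClaim reads the live object, backs off if it is already claimed, and
// otherwise attempts a version-checked write.
func tryClaim(a *apiServer, claim string) bool {
	fresh := a.get()
	if fresh.ClaimRef != "" {
		return false
	}
	fresh.ClaimRef = claim
	return a.update(fresh) == nil
}

// raceClaims starts one goroutine per claim and returns the claims that won.
func raceClaims(a *apiServer, claims []string) []string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		winners []string
	)
	for _, c := range claims {
		wg.Add(1)
		go func(claim string) {
			defer wg.Done()
			if tryClaim(a, claim) {
				mu.Lock()
				winners = append(winners, claim)
				mu.Unlock()
			}
		}(c)
	}
	wg.Wait()
	return winners
}

func main() {
	a := &apiServer{}
	winners := raceClaims(a, []string{"claim-a", "claim-b"})
	fmt.Println("winners:", len(winners), "bound to winner:", a.get().ClaimRef == winners[0])
}
```

Which claim wins is not deterministic, so the assertions check the count and the binding rather than a specific name.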

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 98f97e40-148d-43e6-9556-ea843f161553

📥 Commits

Reviewing files that changed from the base of the PR and between f5ed6d5 and 6ea823c.

📒 Files selected for processing (6)
  • cmd/main.go
  • docs/development/dev_setup.md
  • internal/controller/bmc_controller.go
  • internal/controller/server_controller.go
  • internal/controller/serverclaim_controller.go
  • internal/controller/suite_test.go

@xkonni xkonni force-pushed the bug/serverclaim_race_condition branch from f174603 to 19758cb Compare April 1, 2026 13:05
@github-actions github-actions bot added size/L and removed size/M labels Apr 1, 2026
@xkonni xkonni force-pushed the bug/serverclaim_race_condition branch from 19758cb to aa47ac1 Compare April 1, 2026 14:23
@github-actions github-actions bot added size/M and removed size/L labels Apr 1, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/serverclaim_controller.go`:
- Around line 375-388: The fast-path that returns a cached selectedServer when
it appears already claimed must be revalidated against the API server before
returning; update the branch that currently short-circuits (the cached "already
claimed" path in reconcile) to re-fetch the Server via the controller's uncached
API reader (call r.APIReader.Get with the Server's namespaced name into
selectedServer) and then perform the same ownership checks used in
ensureObjectRefForServer (verify selectedServer.Spec.ServerClaimRef matches
claim.Name/claim.Namespace); if the live read shows the claim no longer owns the
Server, treat it as not found and return nil so reconcile requeues, otherwise
continue as before.
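
The revalidation asked for here might be sketched as below, under the assumption of simplified local types. `ownedBy` and `revalidate` are hypothetical helpers, and `read` stands in for an uncached read against the API server.

```go
package main

import "fmt"

// claimRef and server are hypothetical, trimmed-down stand-ins for the
// ServerClaimRef and Server objects.
type claimRef struct{ Namespace, Name string }

type server struct {
	Name     string
	ClaimRef *claimRef
}

// ownedBy reports whether the server's claim reference points at the claim.
func ownedBy(s server, claimNamespace, claimName string) bool {
	return s.ClaimRef != nil &&
		s.ClaimRef.Namespace == claimNamespace &&
		s.ClaimRef.Name == claimName
}

// revalidate re-reads the cached server through read and returns the live
// copy only if the claim still owns it. A nil server with a nil error means
// "treat as not found", so the caller can requeue.
func revalidate(read func(name string) (server, error), cached server, claimNamespace, claimName string) (*server, error) {
	fresh, err := read(cached.Name)
	if err != nil {
		return nil, err
	}
	if !ownedBy(fresh, claimNamespace, claimName) {
		return nil, nil
	}
	return &fresh, nil
}

func main() {
	// The live copy has since been bound to a different claim.
	live := server{Name: "srv-1", ClaimRef: &claimRef{"default", "claim-b"}}
	read := func(string) (server, error) { return live, nil }

	cached := server{Name: "srv-1", ClaimRef: &claimRef{"default", "claim-a"}}
	got, err := revalidate(read, cached, "default", "claim-a")
	fmt.Println(got == nil, err)
}
```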

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a858d6c3-13b6-4c49-a3c1-dbdad790309d

📥 Commits

Reviewing files that changed from the base of the PR and between 19758cb and aa47ac1.

📒 Files selected for processing (5)
  • cmd/main.go
  • internal/controller/bmc_controller.go
  • internal/controller/server_controller.go
  • internal/controller/serverclaim_controller.go
  • internal/controller/suite_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • cmd/main.go
  • internal/controller/suite_test.go
  • internal/controller/server_controller.go
  • internal/controller/bmc_controller.go

@xkonni xkonni force-pushed the bug/serverclaim_race_condition branch from aa47ac1 to 78079ff Compare April 1, 2026 14:34
xkonni added 3 commits April 1, 2026 16:35

  • Re-fetch the Server directly from the API server in
    ensureObjectRefForServer before writing ServerClaimRef, so concurrent
    claim reconcilers working off a stale informer cache cannot overwrite
    each other's claim.
  • Replace the generic "Created or patched Server" log with distinct
    messages per outcome: "Created Server", "Updated Server", and
    "Server already up to date".
  • Re-fetch the Server from the API server before checking ServerClaimRef
    in handleAvailableState, so a BMC-triggered reconcile with a stale
    informer snapshot does not miss a concurrently-written claim ref and
    skip the Reserved transition.

@xkonni xkonni force-pushed the bug/serverclaim_race_condition branch from 78079ff to 73cfe30 Compare April 1, 2026 14:35
Labels

area/metal-automation, bug (Something isn't working), documentation (Improvements or additions to documentation), size/M

4 participants