Skip to content

Add ability to auto retry failed resources#695

Open
nagadeesh-nagaraja wants to merge 14 commits intomainfrom
autoRetry
Open

Add ability to auto retry failed resources#695
nagadeesh-nagaraja wants to merge 14 commits intomainfrom
autoRetry

Conversation

@nagadeesh-nagaraja
Copy link
Copy Markdown
Contributor

@nagadeesh-nagaraja nagadeesh-nagaraja commented Feb 20, 2026

Proposed Changes

provides option to auto retry the failed resources X number of times.
Also fixes the infinite auto retry issue through SET CRD

Fixes #693
fix: #623
fix: #580

Summary by CodeRabbit

  • New Features

    • Automatic retry for failed BIOS/BMC upgrades and settings via spec.failedAutoRetryCount, status.autoRetryCountRemaining and observedGeneration; global default via new CLI flag.
    • Retry annotation propagation and retry-condition tracking for coordinated retries.
    • Deterministic simulated upgrade task sequences (success/failure) for testing.
  • Bug Fixes

    • Structured error type for invalid BMC attributes and corrected attribute-check call.
  • Tests

    • Expanded end-to-end tests for retry flows, status transitions, annotation propagation, and cleanup.
  • Documentation

    • API docs and CRD schemas updated for new retry/status fields.

@nagadeesh-nagaraja nagadeesh-nagaraja self-assigned this Feb 20, 2026
@nagadeesh-nagaraja nagadeesh-nagaraja requested a review from a team as a code owner February 20, 2026 11:59
@github-actions github-actions bot added size/XL api-change documentation Improvements or additions to documentation labels Feb 20, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Feb 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds annotation-driven and automatic retry support across BIOS/BMC controllers: new Spec.FailedAutoRetryCount and Status.AutoRetryCountRemaining/ObservedGeneration fields, controller retry/propagation logic and conditions, deepcopy/CRD/schema updates, mock failed tasks, CLI flag, and tests.

Changes

Cohort / File(s) Summary
API Types
api/v1alpha1/biossettings_types.go, api/v1alpha1/biosversion_types.go, api/v1alpha1/bmcsettings_types.go, api/v1alpha1/bmcversion_types.go
Added FailedAutoRetryCount *int32 to templates/specs and AutoRetryCountRemaining *int32 + ObservedGeneration int64 to status structs (kubebuilder Minimum=0 where applicable).
Generated Code
api/v1alpha1/zz_generated.deepcopy.go
Added deepcopy handling for new pointer fields.
CRD Manifests
config/crd/bases/..._biossettings.yaml, ..._biosversions.yaml, ..._bmcsettings.yaml, ..._bmcversions.yaml, ..._biossettingssets.yaml, ..._biosversionsets.yaml, ..._bmcsettingssets.yaml, ..._bmcversionsets.yaml
Expanded OpenAPI schemas: spec.failedAutoRetryCount, status.autoRetryCountRemaining, and status.observedGeneration added across CRDs and set CRDs.
Controllers — single resources
internal/controller/biossettings_controller.go, internal/controller/biosversion_controller.go, internal/controller/bmcsettings_controller.go, internal/controller/bmcversion_controller.go
Added DefaultFailedAutoRetryCount fields; centralized/ refactored failed-state handling (process/handleFailedState), annotation short-circuit paths, auto-retry init/decrement of AutoRetryCountRemaining, retry condition updates and status patches.
Controllers — set resources & propagation
internal/controller/biossettingsset_controller.go, internal/controller/biosversionset_controller.go, internal/controller/bmcsettingsset_controller.go, internal/controller/bmcversionset_controller.go
After templated create/patch, initialize Status.AutoRetryCountRemaining from spec; added handleRetryAnnotationPropagation and invoked propagation at delete/patch/handle points to forward retry annotation to children.
Helper & Constants
internal/controller/helper.go
Added RetryOfFailedResourceConditionIssued/RetryOfFailedResourceReasonIssued; fetch latest child before propagation and guard re-propagation by checking child status.conditions via reflection.
BMC attribute validation
bmc/common/helpers.go, bmc/oem/dell.go, bmc/redfish_kube.go
Renamed CheckAttribuesCheckAttributes; added InvalidBMCSettingsError type and return structured errors; updated call sites and error aggregation.
Redfish mocks & local behavior
bmc/mockup.go, bmc/redfish_local.go
Added failed task sequences (BIOSUpgradeTaskFailedStatus, BMCUpgradeTaskFailedStatus) and runtime selection of failed vs normal task sequence when image URI contains "fail".
CLI / wiring
cmd/main.go
Added flag --default-failed-auto-retry-count and wired DefaultFailedAutoRetryCount into reconcilers' SetupWithManager calls.
Tests
internal/controller/*_controller_test.go, internal/controller/*set_controller_test.go
Added/extended E2E tests for annotation-driven and automatic retry flows, assertions on AutoRetryCountRemaining/ObservedGeneration, condition behavior, propagation and cleanup (some test blocks duplicated).
Docs
docs/api-reference/api.md
Documented new retry fields in specs/templates and status entries.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as Reconciler (Controller)
    participant K8s as Kubernetes API
    participant Child as Child CR
    participant Redfish as Redfish/BMC mock
    Note over Controller,K8s: Failure detected or retry annotation applied
    Controller->>K8s: Read Resource / Patch Status (init/decrement AutoRetryCountRemaining / set Retry condition / set Pending)
    K8s-->>Controller: ACK
    Controller->>Child: Propagate retry annotation or create/patch child
    Child->>K8s: Update annotations/status
    K8s-->>Child: ACK
    Controller->>Redfish: Perform upgrade/check (image URI may cause failure)
    Redfish-->>Controller: Task status (failed or success)
    alt failed and retries remaining
      Controller->>K8s: Patch Status (AutoRetryCountRemaining--, set RetryOfFailedResourceConditionIssued)
      K8s-->>Controller: ACK
      Controller->>Controller: Requeue (transition to Pending)
    else no retries left
      Controller->>K8s: Mark Failed (no requeue)
      K8s-->>Controller: ACK
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement, bug, size/XXL

Suggested reviewers

  • afritzler
  • davidgrun
  • stefanhipfel
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding auto-retry capability for failed resources.
Description check ✅ Passed The description provides proposed changes and references related issues, but lacks detailed explanation of implementation approach.
Linked Issues check ✅ Passed All code changes comprehensively address the three linked issues: auto-retry logic [#623], Set CRD retry propagation [#693], and infinite retry prevention [#580].
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing auto-retry functionality; no unrelated modifications detected beyond issue requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch autoRetry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/bmcsettings_controller_test.go (1)

466-574: ⚠️ Potential issue | 🔴 Critical

Remove duplicate test: two It blocks with identical name and body at lines 466–574 and 576–684.

Both define It("Should replace missing BMCSettings ref in server", ...) with identical implementations. Ginkgo requires unique descriptions for sibling It nodes — this duplicate will cause the test suite to panic. Remove one of these tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller_test.go` around lines 466 - 574,
There are two identical Ginkgo tests both declared as It("Should replace missing
BMCSettings ref in server", ...) which causes a panic due to duplicate sibling
descriptions; remove one of the duplicate It blocks (the entire second copy of
the test) or rename one to a unique description if you intend to keep both,
ensuring you remove the duplicate test body that references
bmcSettings/bmcSettings2, serverMaintenanceList, and the Expect/Delete flows
(the duplicated It with the same name and body should be deleted or renamed to a
unique string).
🧹 Nitpick comments (11)
bmc/mockup.go (1)

146-175: BMCUpgradeTaskFailedStatus is byte-for-byte identical to BIOSUpgradeTaskFailedStatus — consider extracting a shared helper.

Both failure sequences are currently duplicated (7 identical entries across 28 lines each). This is harmless in test-only code but could silently diverge if one is updated without the other.

♻️ Optional: Extract a shared default failed-task sequence
+// defaultUpgradeFailedTaskStatus returns a canonical failed upgrade task
+// sequence ending in ExceptionTaskState at 99 % completion.
+func defaultUpgradeFailedTaskStatus() []schemas.Task {
+	return []schemas.Task{
+		{TaskState: schemas.NewTaskState, PercentComplete: gofish.ToRef(uint(0))},
+		{TaskState: schemas.PendingTaskState, PercentComplete: gofish.ToRef(uint(0))},
+		{TaskState: schemas.StartingTaskState, PercentComplete: gofish.ToRef(uint(0))},
+		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(10))},
+		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(20))},
+		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(98))},
+		{TaskState: schemas.ExceptionTaskState, PercentComplete: gofish.ToRef(uint(99))},
+	}
+}

Then in InitializeDefaults:

-	r.BIOSUpgradeTaskFailedStatus = []schemas.Task{
-		{TaskState: schemas.NewTaskState, PercentComplete: gofish.ToRef(uint(0))},
-		// ... 6 more identical entries ...
-	}
+	r.BIOSUpgradeTaskFailedStatus = defaultUpgradeFailedTaskStatus()

-	r.BMCUpgradeTaskFailedStatus = []schemas.Task{
-		{TaskState: schemas.NewTaskState, PercentComplete: gofish.ToRef(uint(0))},
-		// ... 6 more identical entries ...
-	}
+	r.BMCUpgradeTaskFailedStatus = defaultUpgradeFailedTaskStatus()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bmc/mockup.go` around lines 146 - 175, Extract the duplicated failed-task
sequence into a shared helper (e.g., func defaultFailedTaskSequence()
[]schemas.Task) and replace both BMCUpgradeTaskFailedStatus and
BIOSUpgradeTaskFailedStatus initializations to call that helper; update
InitializeDefaults to use the new helper for both variables so the identical
7-entry slice lives in one place and cannot drift.
internal/controller/bmcsettings_controller_test.go (1)

720-728: 10µs polling interval may cause flakiness or excessive API calls.

WithPolling(10 * time.Microsecond) is extremely aggressive for a controller test that involves API server round-trips. Consider a more reasonable interval like 100*time.Millisecond to reduce test flakiness and unnecessary load on the test API server.

Proposed fix
 	Eventually(func(g Gomega) bool {
 		g.Expect(Get(bmcSettings)()).To(Succeed())
 		return bmcSettings.Status.State == metalv1alpha1.BMCSettingsStateFailed && bmcSettings.Status.AutoRetryCountRemaining == nil
-	}).WithPolling((10 * time.Microsecond)).Should(BeTrue())
+	}).WithPolling(100 * time.Millisecond).Should(BeTrue())

 	Eventually(func(g Gomega) bool {
 		g.Expect(Get(bmcSettings)()).To(Succeed())
 		return bmcSettings.Status.State == metalv1alpha1.BMCSettingsStateFailed && bmcSettings.Status.AutoRetryCountRemaining != nil && *bmcSettings.Status.AutoRetryCountRemaining == int32(1)
-	}).WithPolling((10 * time.Microsecond)).Should(BeTrue())
+	}).WithPolling(100 * time.Millisecond).Should(BeTrue())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller_test.go` around lines 720 - 728,
The two Eventually checks in bmcsettings_controller_test.go use an overly
aggressive WithPolling((10 * time.Microsecond)) which can cause flakiness and
excess API traffic; update both WithPolling calls to a more reasonable interval
such as 100*time.Millisecond (or a test-constant like test.PollInterval) so the
checks on bmcSettings.Status (metalv1alpha1.BMCSettingsStateFailed and
AutoRetryCountRemaining) poll at a stable, less-frequent cadence.
bmc/common/helpers.go (1)

33-38: Fix typo in function name CheckAttribuesCheckAttributes.

The function is named CheckAttribues (missing an "t") and is defined in bmc/common/helpers.go (line 23) with 3 call sites: bmc/redfish_kube.go (line 329), bmc/redfish_local.go (line 241), and bmc/oem/dell.go (line 467). This is a good time to fix it during refactoring. Remember to run make lint-fix and make test after making the changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bmc/common/helpers.go` around lines 33 - 38, Rename the misspelled function
CheckAttribues to CheckAttributes (update its declaration in
bmc/common/helpers.go) and update every call site that references CheckAttribues
to use CheckAttributes instead; ensure the signature and any associated types
like InvalidBMCSettingsError remain unchanged, run make lint-fix and make test
to verify formatting and correctness, and commit the rename as a single refactor
so references are consistent across the codebase.
api/v1alpha1/biossettings_types.go (1)

20-22: Add minimum validation to prevent negative retry counts across all types.

FailedAutoRetryCount is used in four type definitions (biossettings_types.go, biosversion_types.go, bmcsettings_types.go, and bmcversion_types.go) without validation markers. Adding +kubebuilder:validation:Minimum=0 to each would provide API-level validation to prevent nonsensical negative configurations, consistent with the project's existing pattern of using validation markers (e.g., Priority fields).

Proposed fix for biossettings_types.go
 	// FailedAutoRetryCount is the number of times the controller should automatically retry the BIOSSettings upgrade in case of failure before giving up.
 	// +optional
+	// +kubebuilder:validation:Minimum=0
 	FailedAutoRetryCount *int32 `json:"failedAutoRetryCount,omitempty"`

Apply the same fix to biosversion_types.go, bmcsettings_types.go, and bmcversion_types.go.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/biossettings_types.go` around lines 20 - 22, Add an API
validation marker to prevent negative retry counts by annotating the
FailedAutoRetryCount field with +kubebuilder:validation:Minimum=0 in all four
type definitions; locate the FailedAutoRetryCount *int32
`json:"failedAutoRetryCount,omitempty"` field in the BIOSSettings, BIOSVersion,
BMCSettings, and BMCVersion type declarations and add the kubebuilder validation
comment immediately above each field so the generated CRD enforces a minimum of
0.
api/v1alpha1/biosversion_types.go (1)

47-50: Same optional: consider +kubebuilder:validation:Minimum=0 on FailedAutoRetryCount

Same suggestion as the other API types in this PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/biosversion_types.go` around lines 47 - 50, Add a kubebuilder
validation tag to the FailedAutoRetryCount field on the BIOSVersion type so the
value cannot be negative: update the struct field annotation for
FailedAutoRetryCount (the pointer *int32 json:"failedAutoRetryCount,omitempty")
to include `+kubebuilder:validation:Minimum=0` alongside the existing
`+optional` tag to enforce non-negative values at CR validation time.
api/v1alpha1/bmcsettings_types.go (1)

22-24: Same optional: consider +kubebuilder:validation:Minimum=0 on FailedAutoRetryCount

Consistent with the note in bmcversion_types.go, BIOSVersionTemplate, and BIOSSettingsTemplate — negative values are semantically invalid.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/bmcsettings_types.go` around lines 22 - 24, The
FailedAutoRetryCount field currently allows negative values; add a kubebuilder
validation tag to enforce non-negativity by annotating FailedAutoRetryCount
(*int32) with +kubebuilder:validation:Minimum=0 (keep the existing
json:"failedAutoRetryCount,omitempty" tag) so the BMCSettings CRD schema
prevents negative numbers at validation time.
api/v1alpha1/bmcversion_types.go (1)

37-39: Consider adding +kubebuilder:validation:Minimum=0 to FailedAutoRetryCount

A negative retry count is semantically invalid. The same applies to the analogous fields in BIOSVersionTemplate, BMCSettingsTemplate, and BIOSSettingsTemplate.

♻️ Proposed addition
+	// +kubebuilder:validation:Minimum=0
 	// FailedAutoRetryCount is the number of times the controller should automatically retry the BMCVersion upgrade in case of failure before giving up.
 	// +optional
 	FailedAutoRetryCount *int32 `json:"failedAutoRetryCount,omitempty"`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/bmcversion_types.go` around lines 37 - 39, Add a non-negative
validation to the retry count fields by annotating the pointer int32 fields with
+kubebuilder:validation:Minimum=0; specifically add this comment annotation
immediately above FailedAutoRetryCount in BMCVersion (the FailedAutoRetryCount
*int32 `json:"failedAutoRetryCount,omitempty"` field) and do the same for the
corresponding fields in BIOSVersionTemplate, BMCSettingsTemplate, and
BIOSSettingsTemplate so the kubebuilder-generated CRD will reject negative
values while keeping the existing pointer type and json tag.
internal/controller/bmcversion_controller_test.go (1)

385-439: Use millisecond polling to reduce test flakiness.
10µs polling can busy-spin and be scheduling-sensitive in envtest; a small ms interval is usually more stable.

Suggested change
-}).WithPolling((10 * time.Microsecond)).Should(BeTrue())
+}).WithPolling((5 * time.Millisecond)).Should(BeTrue())
...
-}).WithPolling((10 * time.Microsecond)).Should(BeTrue())
+}).WithPolling((5 * time.Millisecond)).Should(BeTrue())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcversion_controller_test.go` around lines 385 - 439,
The test uses extremely small microsecond polling intervals (WithPolling((10 *
time.Microsecond))) which can cause busy-spinning and flakiness; change those
WithPolling calls in this test (references: retryCount, bmcVersion, the
Eventually(...) blocks that call Get(bmcVersion)()) to use a millisecond poll
interval (e.g., 10*time.Millisecond or 100*time.Millisecond) instead of 10µs so
the Eventually assertions are more stable in envtest.
internal/controller/biosversion_controller_test.go (1)

385-441: Use millisecond polling to reduce test flakiness.
5µs polling is extremely tight; a small ms interval is typically more stable in envtest.

Suggested change
-}).WithPolling((5 * time.Microsecond)).Should(BeTrue())
+}).WithPolling((5 * time.Millisecond)).Should(BeTrue())
...
-}).WithPolling((5 * time.Microsecond)).Should(BeTrue())
+}).WithPolling((5 * time.Millisecond)).Should(BeTrue())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biosversion_controller_test.go` around lines 385 - 441,
The test uses extremely tight polling intervals (WithPolling((5 *
time.Microsecond))) in the "Should allow retry using annotation" Eventually
checks which causes flakiness; change both occurrences of WithPolling((5 *
time.Microsecond)) to a millisecond-scale interval (for example WithPolling(5 *
time.Millisecond)) so the Eventually(func(...) ...) waits are more stable — the
change should be applied to the Eventually(func(g Gomega) bool { ... }) calls
referenced in this spec.
internal/controller/bmcsettingsset_controller_test.go (1)

513-638: Use millisecond polling to reduce test flakiness.
1µs polling can be CPU-heavy and timing-sensitive in envtest.

Suggested change
-Eventually(Object(bmcSettings01)).WithPolling(1 * time.Microsecond).MustPassRepeatedly(1).Should(SatisfyAll(
+Eventually(Object(bmcSettings01)).WithPolling(5 * time.Millisecond).MustPassRepeatedly(1).Should(SatisfyAll(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettingsset_controller_test.go` around lines 513 -
638, The test uses an ultra-tight poll interval causing flakiness: replace the
WithPolling(1 * time.Microsecond) call in the "Should successfully retry failed
state child resources once" spec (the
Eventually(Object(bmcSettings01)).WithPolling(...) invocation) with a
millisecond-based interval (e.g., 1 * time.Millisecond or a slightly larger
value like 10 * time.Millisecond) to reduce CPU pressure and timing sensitivity
while keeping the rest of the assertion logic unchanged.
internal/controller/bmcsettings_controller.go (1)

601-617: Failing BMCSettings when BMC is powered off could consume retry attempts for transient conditions.

When handleBMCPowerState encounters a powered-off BMC, it immediately transitions to Failed. Combined with FailedAutoRetryCount, temporary power-offs during settings application will burn through retry attempts. Consider whether a RequeueAfter with a condition update (waiting for power-on) would be more resilient than immediately failing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller.go` around lines 601 - 617, The
handler currently marks the BMCSettings as Failed when the BMC is Powered Off
(in handleBMCPowerState), which consumes retry budget; change the logic so that
when BMC.Status.PowerState is not PoweredOn you update the
BMCPoweredOffCondition to a non-failed interim state (e.g., Pending/Unknown) via
r.Conditions.Update and call updateBMCSettingsStatus with a non-failed state
instead of metalv1alpha1.BMCSettingsStateFailed, and return a reconcile result
that requeues after a short delay (use controllerutil or return
ctrl.Result{RequeueAfter: <duration>}) so the controller retries later without
consuming a failure retry count. Ensure you still record the BMCPoweredOffReason
and an informative message when updating BMCPoweredOffCondition in
handleBMCPowerState.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/crd/bases/metal.ironcore.dev_biossettingssets.yaml`:
- Around line 73-78: This CRD change should not be made directly in the
generated YAML; instead update the Go API type/marker that defines
failedAutoRetryCount (e.g., the BIOSSettingsSet or related API struct and its
kubebuilder markers where FailedAutoRetryCount is declared), then re-run
controller-gen via the project make target (run `make manifests`) to regenerate
config/crd/bases/*.yaml; do not manually edit the generated
metal.ironcore.dev_biossettingssets.yaml file.

In `@config/crd/bases/metal.ironcore.dev_bmcsettings.yaml`:
- Around line 70-75: The CRD field `failedAutoRetryCount` was edited directly in
the generated YAML; instead update the Go API type and markers and regenerate
manifests: edit the BMCSettings API (type BMCSettings or its Spec struct) to add
or adjust the field (FailedAutoRetryCount) and its kubebuilder validation
markers (e.g., +kubebuilder:validation:Minimum/Maximum or
+kubebuilder:validation:Format/int32 as needed), then run `make manifests` to
regenerate the CRD; do not modify the generated YAML under config/crd/bases
directly.

In `@internal/controller/biossettings_controller_test.go`:
- Around line 864-872: The test currently asserts a transient nil
AutoRetryCountRemaining immediately when State == BIOSSettingsStateFailed which
can be flaky; instead split into two waits: first wait for
biosSettings.Status.State == metalv1alpha1.BIOSSettingsStateFailed using
Eventually(Get(biosSettings)()) and then in a second Eventually wait for
biosSettings.Status.AutoRetryCountRemaining != nil &&
*biosSettings.Status.AutoRetryCountRemaining == int32(1); update the two
existing Eventually blocks around Get, biosSettings, BIOSSettingsStateFailed and
AutoRetryCountRemaining to implement this ordering so the nil phase is not
asserted.

In `@internal/controller/biossettings_controller.go`:
- Around line 1059-1088: When scheduling an auto-retry for BIOSSettings (inside
the block guarded by settings.Spec.FailedAutoRetryCount != nil), add a guard to
skip retry when *settings.Spec.FailedAutoRetryCount == 0 and ensure you never
set AutoRetryCountRemaining to a negative value; when you do schedule a retry,
reset settings.Status.FlowState to the zero value (clear stale FlowState) before
patching so the flow restarts (e.g., set settings.Status.FlowState = "" or the
type's zero) and include that change in the MergeFrom patch (biosSettingsBase).
Apply the same pattern to the corresponding handlers in
biosversion_controller.go, bmcsettings_controller.go and
bmcversion_controller.go, and keep the existing handling of
RetryOfFailedResourceConditionIssued and AutoRetryCountRemaining update but
ensure remaining is computed safely (do not decrement below 0).

In `@internal/controller/biossettingsset_controller_test.go`:
- Around line 487-491: The Consistently assertion intended to verify BIOSSetting
03 is wrong because it still references biosSettings02; update the target object
to biosSettings03 in that Consistently(Object(...)) call (and adjust the
descriptive By(...) text if desired) so the assertion checks
HaveField("Status.State", metalv1alpha1.BIOSSettingsStateFailed) and
HaveField("Status.AutoRetryCountRemaining", Equal(GetPtr(int32(0)))) against
biosSettings03 instead of biosSettings02.

In `@internal/controller/biossettingsset_controller.go`:
- Around line 294-300: The CreateOrPatch call is attempting to modify the status
subresource (settings.Status.AutoRetryCountRemaining) which is ignored for
resources with +kubebuilder:subresource:status; replace this CreateOrPatch usage
with a status patch/update against the status subresource (use r.Status().Patch
or r.Status().Update like the existing patchStatus() helper) so you fetch the
current settings, set settings.Status.AutoRetryCountRemaining =
settings.Spec.FailedAutoRetryCount and then call r.Status().Patch(ctx,
&settings, client.MergeFrom(orig)) (or follow the exact patchStatus() pattern)
to persist the status change.

In `@internal/controller/biosversion_controller.go`:
- Around line 435-493: The auto-retry logic in processFailedState allows a retry
when Spec.FailedAutoRetryCount is set to 0 (remaining==nil path) and can
decrement into negative; fix by explicitly checking FailedAutoRetryCount before
permitting a retry: in processFailedState (and the equivalent functions in
bmcversion_controller.go, bmcsettings_controller.go, biossettings_controller.go)
add a guard that if biosVersion.Spec.FailedAutoRetryCount != nil &&
*biosVersion.Spec.FailedAutoRetryCount == 0 then do not enter the auto-retry
block (return false,nil or skip retry), and when initializing
AutoRetryCountRemaining compute the initial value only when
Spec.FailedAutoRetryCount > 0 (set to Spec-1), otherwise leave
AutoRetryCountRemaining at 0 or nil to prevent negative values. Ensure updates
use the same GetCondition/Status patch flow already present.

In `@internal/controller/biosversionset_controller_test.go`:
- Around line 524-528: The second Consistently assertion is inspecting the wrong
test object: it should target biosVersion03 instead of biosVersion02; update the
Consistently(Object(...)) call in the test to pass the biosVersion03 variable so
the block labeled "BIOSVersion 03 has not been changed" validates the third
BIOSVersion child (keep the same SatisfyAll checks on Status.State and
Status.AutoRetryCountRemaining).

In `@internal/controller/bmcsettings_controller.go`:
- Around line 450-467: The error handling when GetCondition(r.Conditions,
bmcSetting.Status.Conditions, BMCSettingsConditionWrongSettings) fails is
wrapping the wrong variable (it uses the outer err from CheckBMCAttributes);
change the fmt.Errorf call to wrap and return the actual GetCondition error
variable (errCond) so the true failure is reported. Locate the block around
GetCondition and replace the fmt.Errorf("failed to get Condition for skip reboot
post setting update %v", err) to reference errCond instead, ensuring subsequent
logic (r.Conditions.Update and r.updateBMCSettingsStatus) remains unchanged.
- Around line 575-619: The error messages in handleBMCPowerState incorrectly
reference "BMCVersion" and the local variable BMCPoweredOffCondition shadows the
package-level constant; fix by updating the two fmt.Errorf messages passed on
Conditions.Update failure to say "failed to update Pending BMCSettings update
condition" (the calls inside handleBMCPowerState that wrap
conditionutils.UpdateStatus/Reason/Message), and rename the local
BMCPoweredOffCondition variable to bmcPoweredOffCond (or similar) in
handleBMCPowerState to avoid shadowing the package-level BMCPoweredOffCondition
constant while keeping all uses (GetCondition, Conditions.Update, and
updateBMCSettingsStatus) consistent with the new local name.
- Around line 709-764: The code in handleFailedState has three issues: the
GetCondition error message mentions "BIOSSettings" instead of "BMCSettings", a
copy/paste var name uses biosSettingsBase instead of bmcSettingsBase, and the
FailedAutoRetryCount logic allows Spec==0 to still perform one retry. Fix by:
update the fmt.Errorf call in the GetCondition error path to mention BMCSettings
(reference: GetCondition(...) error handling), rename the local variable
biosSettingsBase to bmcSettingsBase where the DeepCopy is taken before
status.Patch, and change the retry-count logic around
bmcSetting.Spec.FailedAutoRetryCount / bmcSetting.Status.AutoRetryCountRemaining
so that when Spec is non-nil and *Spec == 0 you do not enqueue an automatic
retry (initialize remaining only if *Spec > 0 and decrement safely otherwise
skip).

In `@internal/controller/bmcsettingsset_controller.go`:
- Around line 114-120: Update the incorrect "BIOSSettings*" log messages to
reference the correct resource types "BMCSettingsSet" or "BMCSettings" and
follow the project's logging style (start with a capital letter, use active
voice, past tense for error messages, and include object type). Specifically,
replace the log message in the log.V(1).Info call that currently reads "Updated
BIOSSettingsSet state" to something like "Updated BMCSettingsSet state" and
adjust any similar messages around the other occurrence (lines noted 361-375) to
reference BMCSettings/BMCSettingsSet; ensure related error logs follow the same
tense/format and locate these by the log.V(1).Info calls and the
handleRetryAnnotationPropagation context to make consistent updates.
- Around line 348-355: The current code uses controllerutil.CreateOrPatch to
modify bmcSettings.Status.AutoRetryCountRemaining (via
controllerutil.CreateOrPatch and bmcSettings), but status is a subresource so
updates must go through the status client; replace the CreateOrPatch status
modification with a status patch via r.Client.Status().Patch (or
Status().Update) targeting bmcSettings.Status.AutoRetryCountRemaining.
Concretely: take a deep copy of the original bmcSettings (e.g., orig :=
bmcSettings.DeepCopy()), modify bmcSettings.Status.AutoRetryCountRemaining =
bmcSettings.Spec.FailedAutoRetryCount, then call r.Client.Status().Patch(ctx,
&bmcSettings, client.MergeFrom(orig)) and handle errors (append to errs) instead
of using controllerutil.CreateOrPatch for the status change.

In `@internal/controller/bmcversion_controller.go`:
- Around line 203-213: The variable name biosVersionBase in the retry block is a
copy-paste mistake and should be renamed to bmcVersionBase to match the
surrounding BMCVersion context; in the block guarded by
shouldRetryReconciliation(bmcVersion) replace the identifier biosVersionBase
with bmcVersionBase where you call bmcVersion.DeepCopy() and pass the MergeFrom
option to r.Patch so the snapshot variable name consistently reflects the
resource being patched (bmcVersion).
- Around line 438-467: The retry guard allows FailedAutoRetryCount==0 to enter
the retry block, causing one unintended retry; update the logic that checks
FailedAutoRetryCount/AutoRetryCountRemaining (the guard using "remaining == nil
|| *remaining > 0" and the subsequent computation of AutoRetryCountRemaining) so
zero is treated as "no retries" — e.g., require Spec.FailedAutoRetryCount != nil
&& *Spec.FailedAutoRetryCount > 0 before proceeding or explicitly skip when
*Spec.FailedAutoRetryCount == 0; apply the same change to the identical blocks
handling FailedAutoRetryCount and AutoRetryCountRemaining in the BMCVersion,
BIOSVersion, BMCSettings and BIOSSettings controllers (symbols:
FailedAutoRetryCount, AutoRetryCountRemaining,
GetCondition/RetryOfFailedResourceConditionIssued, Status().Patch).

In `@internal/controller/bmcversionset_controller_test.go`:
- Around line 513-516: The test name says "once" but the variable retryCount :=
2 and the code path setting FailedAutoRetryCount uses that value, so rename the
local variable retryCount to a clearer name like autoRetryCount (or
failedAutoRetryCount) and update the It(...) description to accurately reflect
behavior (for example "Should successfully retry failed state child resources
once via annotation after auto-retries are exhausted" or adjust to match 2
automatic retries). Update any references to retryCount in this test (e.g. where
FailedAutoRetryCount is set and where OperationAnnotation is asserted) so names
and the test description consistently reflect whether the attempt is an
auto-retry count or a single annotation-driven retry.
- Around line 609-613: The assertion is checking the wrong variable and
comparing against the wrong named type: switch the observed object from
bmcVersion02 to bmcVersion03 (so the test actually verifies BMCVersion03), and
compare Status.State to the correct BMCVersionState constant (e.g.,
metalv1alpha1.BMCVersionStateFailed or an equivalent BMCVersionState("Failed"))
instead of metalv1alpha1.BIOSSettingsStateFailed so the equality check uses the
same named type; keep the AutoRetryCountRemaining assertion as-is but ensure the
object reference is bmcVersion03 and the state comparison uses BMCVersionState.

In `@internal/controller/bmcversionset_controller.go`:
- Around line 289-295: The current CreateOrPatch call on BMCVersion silently
drops Status.AutoRetryCountRemaining because status is a subresource; replace
this CreateOrPatch usage with the same pattern as updateStatus(): make a deep
copy of the original BMCVersion (orig := bmcVersion.DeepCopy()), mutate
bmcVersion.Status.AutoRetryCountRemaining =
bmcVersion.Spec.FailedAutoRetryCount, then call r.Status().Patch(ctx,
&bmcVersion, client.MergeFrom(orig)) and append any returned error to errs;
remove the CreateOrPatch invocation to ensure the status subresource is
correctly persisted.

---

Outside diff comments:
In `@internal/controller/bmcsettings_controller_test.go`:
- Around line 466-574: There are two identical Ginkgo tests both declared as
It("Should replace missing BMCSettings ref in server", ...) which causes a panic
due to duplicate sibling descriptions; remove one of the duplicate It blocks
(the entire second copy of the test) or rename one to a unique description if
you intend to keep both, ensuring you remove the duplicate test body that
references bmcSettings/bmcSettings2, serverMaintenanceList, and the
Expect/Delete flows (the duplicated It with the same name and body should be
deleted or renamed to a unique string).

---

Nitpick comments:
In `@api/v1alpha1/biossettings_types.go`:
- Around line 20-22: Add an API validation marker to prevent negative retry
counts by annotating the FailedAutoRetryCount field with
+kubebuilder:validation:Minimum=0 in all four type definitions; locate the
FailedAutoRetryCount *int32 `json:"failedAutoRetryCount,omitempty"` field in the
BIOSSettings, BIOSVersion, BMCSettings, and BMCVersion type declarations and add
the kubebuilder validation comment immediately above each field so the generated
CRD enforces a minimum of 0.

In `@api/v1alpha1/biosversion_types.go`:
- Around line 47-50: Add a kubebuilder validation tag to the
FailedAutoRetryCount field on the BIOSVersion type so the value cannot be
negative: update the struct field annotation for FailedAutoRetryCount (the
pointer *int32 json:"failedAutoRetryCount,omitempty") to include
`+kubebuilder:validation:Minimum=0` alongside the existing `+optional` tag to
enforce non-negative values at CR validation time.

In `@api/v1alpha1/bmcsettings_types.go`:
- Around line 22-24: The FailedAutoRetryCount field currently allows negative
values; add a kubebuilder validation tag to enforce non-negativity by annotating
FailedAutoRetryCount (*int32) with +kubebuilder:validation:Minimum=0 (keep the
existing json:"failedAutoRetryCount,omitempty" tag) so the BMCSettings CRD
schema prevents negative numbers at validation time.

In `@api/v1alpha1/bmcversion_types.go`:
- Around line 37-39: Add a non-negative validation to the retry count fields by
annotating the pointer int32 fields with +kubebuilder:validation:Minimum=0;
specifically add this comment annotation immediately above FailedAutoRetryCount
in BMCVersion (the FailedAutoRetryCount *int32
`json:"failedAutoRetryCount,omitempty"` field) and do the same for the
corresponding fields in BIOSVersionTemplate, BMCSettingsTemplate, and
BIOSSettingsTemplate so the kubebuilder-generated CRD will reject negative
values while keeping the existing pointer type and json tag.

In `@bmc/common/helpers.go`:
- Around line 33-38: Rename the misspelled function CheckAttribues to
CheckAttributes (update its declaration in bmc/common/helpers.go) and update
every call site that references CheckAttribues to use CheckAttributes instead;
ensure the signature and any associated types like InvalidBMCSettingsError
remain unchanged, run make lint-fix and make test to verify formatting and
correctness, and commit the rename as a single refactor so references are
consistent across the codebase.

In `@bmc/mockup.go`:
- Around line 146-175: Extract the duplicated failed-task sequence into a shared
helper (e.g., func defaultFailedTaskSequence() []schemas.Task) and replace both
BMCUpgradeTaskFailedStatus and BIOSUpgradeTaskFailedStatus initializations to
call that helper; update InitializeDefaults to use the new helper for both
variables so the identical 7-entry slice lives in one place and cannot drift.

In `@internal/controller/biosversion_controller_test.go`:
- Around line 385-441: The test uses extremely tight polling intervals
(WithPolling((5 * time.Microsecond))) in the "Should allow retry using
annotation" Eventually checks which causes flakiness; change both occurrences of
WithPolling((5 * time.Microsecond)) to a millisecond-scale interval (for example
WithPolling(5 * time.Millisecond)) so the Eventually(func(...) ...) waits are
more stable — the change should be applied to the Eventually(func(g Gomega) bool
{ ... }) calls referenced in this spec.

In `@internal/controller/bmcsettings_controller_test.go`:
- Around line 720-728: The two Eventually checks in
bmcsettings_controller_test.go use an overly aggressive WithPolling((10 *
time.Microsecond)) which can cause flakiness and excess API traffic; update both
WithPolling calls to a more reasonable interval such as 100*time.Millisecond (or
a test-constant like test.PollInterval) so the checks on bmcSettings.Status
(metalv1alpha1.BMCSettingsStateFailed and AutoRetryCountRemaining) poll at a
stable, less-frequent cadence.

In `@internal/controller/bmcsettings_controller.go`:
- Around line 601-617: The handler currently marks the BMCSettings as Failed
when the BMC is Powered Off (in handleBMCPowerState), which consumes retry
budget; change the logic so that when BMC.Status.PowerState is not PoweredOn you
update the BMCPoweredOffCondition to a non-failed interim state (e.g.,
Pending/Unknown) via r.Conditions.Update and call updateBMCSettingsStatus with a
non-failed state instead of metalv1alpha1.BMCSettingsStateFailed, and return a
reconcile result that requeues after a short delay (use controllerutil or return
ctrl.Result{RequeueAfter: <duration>}) so the controller retries later without
consuming a failure retry count. Ensure you still record the BMCPoweredOffReason
and an informative message when updating BMCPoweredOffCondition in
handleBMCPowerState.

In `@internal/controller/bmcsettingsset_controller_test.go`:
- Around line 513-638: The test uses an ultra-tight poll interval causing
flakiness: replace the WithPolling(1 * time.Microsecond) call in the "Should
successfully retry failed state child resources once" spec (the
Eventually(Object(bmcSettings01)).WithPolling(...) invocation) with a
millisecond-based interval (e.g., 1 * time.Millisecond or a slightly larger
value like 10 * time.Millisecond) to reduce CPU pressure and timing sensitivity
while keeping the rest of the assertion logic unchanged.

In `@internal/controller/bmcversion_controller_test.go`:
- Around line 385-439: The test uses extremely small microsecond polling
intervals (WithPolling((10 * time.Microsecond))) which can cause busy-spinning
and flakiness; change those WithPolling calls in this test (references:
retryCount, bmcVersion, the Eventually(...) blocks that call Get(bmcVersion)())
to use a millisecond poll interval (e.g., 10*time.Millisecond or
100*time.Millisecond) instead of 10µs so the Eventually assertions are more
stable in envtest.

@afritzler
Copy link
Copy Markdown
Member

@nagadeesh-nagaraja do you think it would be better to have the retry count globally configurable via flags first instead of addding the tot he spec?

@coderabbitai what do you think about that?

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Feb 23, 2026

Tip

For best results, initiate chat on the files or code changes.

@afritzler: That's an interesting design question! Both approaches have merit, and I'd like to share some thoughts on each:

Current approach (spec-level):

  • Declarative & self-contained: Follows Kubernetes patterns where resources are self-describing
  • Flexibility: Different operations can have different retry policies (e.g., BMC firmware might need more retries than BIOS settings)
  • GitOps-friendly: Retry configuration is versioned with the resource
  • More configuration burden: Users must configure it for each resource
  • Larger API surface: Adds fields to multiple CRDs

Global flag approach:

  • Simplicity: One place to configure, easier initial setup
  • Consistency: All resources get the same behavior by default
  • Lower cognitive load: Less for users to understand initially
  • Less flexible: Can't handle cases where different operations need different retry counts
  • Requires controller restart: Changes need redeployment

My recommendation: Hybrid approach 🎯

Consider implementing a global default via controller flag + optional per-resource override in the spec:

  1. Add a controller flag like --default-failed-auto-retry-count=3
  2. Keep the spec field as optional
  3. If spec field is nil, use the global default
  4. If spec field is set, it overrides the global default

This gives you:

  • Simple default behavior out of the box
  • Fine-grained control when needed
  • Easy operational management (change flag for cluster-wide policy)
  • Resource-specific overrides for special cases

This pattern is common in Kubernetes (e.g., how Pod SecurityContext works with PodSecurityPolicies/Standards) and provides the best of both worlds. Given this is a new feature, it would also allow you to start with a simple global default and let real-world usage inform whether per-resource configuration is frequently needed.

What do you think? Would this hybrid approach work for your use case?

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (5)
config/crd/bases/metal.ironcore.dev_biossettingssets.yaml (1)

73-78: Same missing minimum: 0 constraint as other CRDs — will auto-fix after the types.go marker correction.

This is the downstream effect of the missing + prefix in the kubebuilder marker in the corresponding *_types.go file. No action needed on this generated file directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/crd/bases/metal.ironcore.dev_biossettingssets.yaml` around lines 73 -
78, The CRD is missing the `minimum: 0` constraint because the kubebuilder
marker in the corresponding types file lacks the required '+' prefix; locate the
field FailedAutoRetryCount in the relevant *_types.go (e.g., the
BIOSSettingsSetSpec struct) and change the comment marker to use the plus form
(// +kubebuilder:validation:Minimum=0) for that field, then re-run the CRD
generation (controller-gen / your project generate task) so the generated
metal.ironcore.dev_biossettingssets.yaml includes minimum: 0.
config/crd/bases/metal.ironcore.dev_bmcsettings.yaml (1)

70-75: Same missing minimum: 0 constraint — downstream of the types.go marker bug.

Will auto-fix after correcting the + prefix in api/v1alpha1/bmcsettings_types.go and re-running make manifests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/crd/bases/metal.ironcore.dev_bmcsettings.yaml` around lines 70 - 75,
The CRD is missing the explicit `minimum: 0` for the failedAutoRetryCount
because the kubebuilder validation marker in the type definition is malformed;
open api/v1alpha1/bmcsettings_types.go and fix the validation comment for the
FailedAutoRetryCount field to use the correct kubebuilder marker (e.g. ensure
the line reads `// +kubebuilder:validation:Minimum=0` with the leading plus),
then re-run `make manifests` to regenerate
config/crd/.../metal.ironcore.dev_bmcsettings.yaml so the failedAutoRetryCount
entry contains `minimum: 0`.
internal/controller/biossettings_controller.go (1)

1073-1101: ⚠️ Potential issue | 🟠 Major

Reset FlowState on auto-retry to avoid stale progress.

Auto-retry sets the state back to Pending but leaves previous FlowState entries, which can skip re-application steps or retain stale conditions. Clear FlowState when scheduling auto-retry (mirroring the manual retry path).

🧹 Proposed fix
 			biosSettingsBase := settings.DeepCopy()
 			settings.Status.State = metalv1alpha1.BIOSSettingsStatePending
+			settings.Status.FlowState = []metalv1alpha1.BIOSSettingsFlowStatus{}
 			retryCondition, err := GetCondition(r.Conditions, settings.Status.Conditions, RetryOfFailedResourceConditionIssued)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biossettings_controller.go` around lines 1073 - 1101, The
auto-retry branch resets State to metalv1alpha1.BIOSSettingsStatePending but
doesn't clear prior FlowState entries, risking stale progress; when preparing
for auto-retry in the block that checks retryCount and manipulates settings
(e.g., where biosSettingsBase := settings.DeepCopy(), settings.Status.State =
metalv1alpha1.BIOSSettingsStatePending, and AutoRetryCountRemaining is
decremented), also clear the in-memory FlowState (e.g., set
settings.Status.FlowState = nil or empty slice) before calling r.Status().Patch
so the retry mirrors the manual retry path and removes any previous flow
progress/conditions.
internal/controller/bmcsettings_controller.go (2)

451-468: Misleading error message at Line 456.

The error message still reads "failed to get Condition for skip reboot post setting update" but this code path handles InvalidBMCSettingsError from CheckBMCAttributes. The message should reflect the actual context (invalid settings validation), not "skip reboot."

The original wrong-variable issue is partially addressed by using errors.Join(err, errCond), which now includes both errors.

Proposed fix
 				if errCond != nil {
-					return ctrl.Result{}, fmt.Errorf("failed to get Condition for skip reboot post setting update %v", errors.Join(err, errCond))
+					return ctrl.Result{}, fmt.Errorf("failed to get Condition for invalid BMC settings: %w", errors.Join(err, errCond))
 				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller.go` around lines 451 - 468, The
error message when errors.As detects an *common.InvalidBMCSettingsError is
misleading; update the fmt.Errorf call that wraps errCond in the GetCondition
failure branch so the message reflects invalid settings validation (e.g.,
"failed to get Condition for invalid BMC settings validation") instead of "skip
reboot post setting update"; locate the block handling InvalidBMCSettingsError
(uses variable invalidSettingsErr, calls GetCondition(r.Conditions,
bmcSetting.Status.Conditions, BMCSettingsConditionWrongSettings), then
r.Conditions.Update and r.updateBMCSettingsStatus) and change only the
descriptive string passed to fmt.Errorf while preserving the errors.Join(err,
errCond) behavior.

586-604: Local variable BMCPoweredOffCondition shadows the package-level constant.

At both Lines 586 and 604, the local variable BMCPoweredOffCondition has the same name as the package-level constant BMCPoweredOffCondition (Line 47). While Go resolves this correctly, it can confuse readers and tooling. Rename to e.g. bmcPoweredOffCond.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller.go` around lines 586 - 604, The
local variable named BMCPoweredOffCondition shadows the package-level constant
BMCPoweredOffCondition; rename the local variable (e.g., to bmcPoweredOffCond)
at both places where you call GetCondition and receive the result, and update
all subsequent uses in that scope (the call to r.Conditions.Update and the call
to r.updateBMCSettingsStatus, plus the later GetCondition usage in the default
branch) to the new local name so the constant is no longer shadowed.
🧹 Nitpick comments (11)
bmc/mockup.go (1)

34-59: Optional: extract the shared task prefix to eliminate duplication.

Both helpers repeat the same 5-step preamble (New→Pending→Starting→Running(10)→Running(20)). Extracting it would make the terminal conditions of each sequence more obvious at a glance.

♻️ Suggested refactor
+// upgradeTaskPrefix returns the common leading task states shared by all upgrade sequences.
+func upgradeTaskPrefix() []schemas.Task {
+	return []schemas.Task{
+		{TaskState: schemas.NewTaskState, PercentComplete: gofish.ToRef(uint(0))},
+		{TaskState: schemas.PendingTaskState, PercentComplete: gofish.ToRef(uint(0))},
+		{TaskState: schemas.StartingTaskState, PercentComplete: gofish.ToRef(uint(0))},
+		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(10))},
+		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(20))},
+	}
+}
+
 // defaultUpgradeFailedTaskStatus returns a canonical failed upgrade task
 // sequence ending in ExceptionTaskState at 99 % completion.
 func defaultUpgradeFailedTaskStatus() []schemas.Task {
-	return []schemas.Task{
-		{TaskState: schemas.NewTaskState, PercentComplete: gofish.ToRef(uint(0))},
-		{TaskState: schemas.PendingTaskState, PercentComplete: gofish.ToRef(uint(0))},
-		{TaskState: schemas.StartingTaskState, PercentComplete: gofish.ToRef(uint(0))},
-		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(10))},
-		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(20))},
-		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(98))},
-		{TaskState: schemas.ExceptionTaskState, PercentComplete: gofish.ToRef(uint(99))},
-	}
+	return append(upgradeTaskPrefix(),
+		schemas.Task{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(98))},
+		schemas.Task{TaskState: schemas.ExceptionTaskState, PercentComplete: gofish.ToRef(uint(99))},
+	)
 }

 // defaultUpgradeTaskStatus returns a canonical upgrade task sequence ending in CompletedTaskState at 100 % completion.
 func defaultUpgradeTaskStatus() []schemas.Task {
-	return []schemas.Task{
-		{TaskState: schemas.NewTaskState, PercentComplete: gofish.ToRef(uint(0))},
-		{TaskState: schemas.PendingTaskState, PercentComplete: gofish.ToRef(uint(0))},
-		{TaskState: schemas.StartingTaskState, PercentComplete: gofish.ToRef(uint(0))},
-		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(10))},
-		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(20))},
-		{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(100))},
-		{TaskState: schemas.CompletedTaskState, PercentComplete: gofish.ToRef(uint(100))},
-	}
+	return append(upgradeTaskPrefix(),
+		schemas.Task{TaskState: schemas.RunningTaskState, PercentComplete: gofish.ToRef(uint(100))},
+		schemas.Task{TaskState: schemas.CompletedTaskState, PercentComplete: gofish.ToRef(uint(100))},
+	)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bmc/mockup.go` around lines 34 - 59, Extract the common 5-step preamble used
by defaultUpgradeFailedTaskStatus and defaultUpgradeTaskStatus into a helper
(e.g., upgradeTaskPreamble or makeUpgradePreamble) that returns []schemas.Task
for New→Pending→Starting→Running(10)→Running(20), then have
defaultUpgradeFailedTaskStatus and defaultUpgradeTaskStatus call that helper and
append their differing terminal steps (Exception/99 and Completed/100
respectively); update both functions to build their return slices by
concatenating the preamble slice with the terminal entries to remove duplication
and make the terminal conditions explicit.
bmc/common/helpers.go (1)

29-29: Stale TODO — EnumerationAttributeType is already handled.

The comment "add more types like maps and Enumerations" mentions Enumerations, but the EnumerationAttributeType case is fully implemented in this same function. Consider narrowing the TODO to only the missing type(s) (e.g., maps/objects).

🧹 Suggested update
-	// TODO: add more types like maps and Enumerations
+	// TODO: add support for Map/Object attribute types
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bmc/common/helpers.go` at line 29, The TODO comment in bmc/common/helpers.go
incorrectly mentions Enumerations even though EnumerationAttributeType is
already implemented in the function (look for the EnumerationAttributeType
case), so update the comment to reflect only the truly missing types (e.g., "add
support for maps/objects") or remove the TODO entirely; ensure you keep or
adjust any nearby comment to reference the actual unhandled types rather than
Enumerations.
internal/controller/bmcversionset_controller_test.go (1)

604-613: Assertions on retried BMCVersions use inconsistent matchers for AutoRetryCountRemaining.

Line 607 asserts AutoRetryCountRemaining is BeNil() for bmcVersion02, while line 612 asserts Not(Equal(GetPtr(int32(0)))) for bmcVersion03. These check different properties (nil vs non-zero) for what should be the same logical state (resource was retried). If the intent is to verify the resource left the exhausted-failed state, consider using the same matcher for both to avoid brittleness and improve readability.

Also, .MustPassRepeatedly(1) on line 605 is the default and has no effect — it can be removed for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcversionset_controller_test.go` around lines 604 - 613,
The two assertions for AutoRetryCountRemaining are inconsistent: change the
bmcVersion02 check to match bmcVersion03 (verify AutoRetryCountRemaining is
non-nil) — e.g., use HaveField("Status.AutoRetryCountRemaining", Not(BeNil()))
for both bmcVersion02 and bmcVersion03 (or alternatively use
Not(Equal(GetPtr(int32(0)))) for both) so they assert the same retried state;
also remove the redundant .MustPassRepeatedly(1) on the
Eventually(Object(bmcVersion02)) call to simplify the test.
internal/controller/biossettingsset_controller_test.go (1)

500-510: Asymmetric intermediate-state assertions may be timing-sensitive.

Lines 501-504 assert AutoRetryCountRemaining is BeNil() for biosSettings02, while lines 507-510 assert Not(Equal(GetPtr(int32(0)))) for biosSettings03. These capture different intermediate reconciliation snapshots. If the controller reconciles both children in a single pass (or very quickly), biosSettings02 might never be observed with a nil counter, making this flaky.

Consider using a consistent assertion for both (e.g., just checking state is no longer Failed) if the nil intermediate state isn't guaranteed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biossettingsset_controller_test.go` around lines 500 -
510, The test contains asymmetric, timing-sensitive assertions for
biosSettings02 and biosSettings03: biosSettings02 checks
Status.AutoRetryCountRemaining is BeNil() while biosSettings03 asserts
Not(Equal(GetPtr(int32(0)))), which can be flaky if reconciliations are fast;
make the assertions consistent by either removing the nil check and asserting
only that Status.State is not metalv1alpha1.BIOSSettingsStateFailed for both
biosSettings02 and biosSettings03, or assert the same AutoRetryCountRemaining
expectation for both (e.g., Not(Equal(GetPtr(int32(0))))), referencing the
Status.State and Status.AutoRetryCountRemaining fields on biosSettings02 and
biosSettings03 in the Eventually blocks.
internal/controller/biosversionset_controller_test.go (1)

494-504: Same asymmetric intermediate-state assertion pattern as BIOSSettingsSet test.

Same concern as noted in biossettingsset_controller_test.go: biosVersion02 checks BeNil() while biosVersion03 checks Not(Equal(GetPtr(int32(0)))) for AutoRetryCountRemaining. This relies on observing different reconciliation timing for the two children.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biosversionset_controller_test.go` around lines 494 -
504, The test uses asymmetric assertions for AutoRetryCountRemaining on
biosVersion02 (BeNil()) and biosVersion03 (Not(Equal(GetPtr(int32(0))))), which
relies on timing; make the checks consistent so they don't depend on
reconciliation ordering by changing the biosVersion02 assertion to mirror
biosVersion03: assert HaveField("Status.AutoRetryCountRemaining",
Not(Equal(GetPtr(int32(0))))) while keeping the State check
(Not(Equal(metalv1alpha1.BIOSVersionStateFailed))) for both, targeting the
biosVersion02 and biosVersion03 objects and the AutoRetryCountRemaining field.
internal/controller/bmcsettingsset_controller_test.go (1)

581-585: Avoid microsecond polling to reduce test flakiness.

1µs polling can busy-spin and be flaky under CI load; consider a millisecond-scale poll or default polling.

♻️ Suggested adjustment
- Eventually(Object(bmcSettings01)).WithPolling(1 * time.Microsecond).MustPassRepeatedly(1).Should(SatisfyAll(
+ Eventually(Object(bmcSettings01)).WithPolling(10 * time.Millisecond).MustPassRepeatedly(1).Should(SatisfyAll(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettingsset_controller_test.go` around lines 581 -
585, The test is using a 1µs polling interval which is too aggressive; update
the Eventually call that wraps Object(bmcSettings01) to use a millisecond-scale
poll or default polling instead of 1 * time.Microsecond — e.g., replace
WithPolling(1 * time.Microsecond) with WithPolling(1 * time.Millisecond) or
remove WithPolling(...) entirely so Eventually uses its default interval; keep
the rest of the assertion (HaveField checks on Status.State and
Status.AutoRetryCountRemaining) unchanged.
cmd/main.go (1)

56-61: Confirm lint/test run for Go changes.

Please confirm make lint-fix and make test were executed.

As per coding guidelines: **/*.go: Run make lint-fix and make test after editing Go source files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 56 - 61, Ensure you ran the Go lint and tests after
editing cmd/main.go's main function: execute `make lint-fix` to apply linter
fixes and then `make test` to run the test suite, verify both complete without
errors, and commit any auto-fixed changes before pushing.
internal/controller/bmcsettings_controller.go (1)

704-777: Retry logic is duplicated nearly verbatim across four controllers.

The handleFailedState body here is essentially identical to handleFailedState in bmcversion_controller.go, processFailedState in biosversion_controller.go, and presumably biossettings_controller.go. The annotation-based retry block (~25 lines) and the auto-retry block (~30 lines) are copy-pasted with only type names changed.

Consider extracting a generic retry helper, e.g.:

func handleRetry[T client.Object](
    ctx context.Context,
    c client.Client,
    conditions *conditionutils.Accessor,
    obj T,
    defaultRetryCount int32,
    getSpec func(T) *int32,
    getStatus func(T) (*int32, []metav1.Condition),
    setStatus func(T, string, *int32, []metav1.Condition),
) (retried bool, err error)

This would reduce ~55 lines of duplicated code per controller to a single function call, making the retry policy easier to reason about and maintain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller.go` around lines 704 - 777, The
handleFailedState implementations duplicate annotation-based retry and
auto-retry logic across controllers; extract that logic into a single generic
helper (e.g., handleRetry[T client.Object]) that takes ctx, client.Client (or
the reconciler), *conditionutils.Accessor (r.Conditions), the resource object
(bmcSetting), the default retry count (r.DefaultFailedAutoRetryCount), and small
accessors for spec/status mutation (getSpec/getStatus/setStatus or closure
equivalents) so the helper can: check the OperationAnnotation retry path, update
the RetryOfFailedResourceConditionIssued condition, compute/adjust
AutoRetryCountRemaining, set State to BMCSettingsStatePending (or a passed-in
pending state string), and perform the Status().Patch via MergeFrom; then
replace the duplicated blocks in BMCSettingsReconciler.handleFailedState (and
the counterparts in bmcversion_controller.go, biosversion_controller.go,
biossettings_controller.go) with a single call to handleRetry and handle
returned (retried,bool) and error values.
internal/controller/biosversion_controller.go (2)

507-507: Verbose log value: logging the entire biosVersion object.

Line 507 logs "BIOSVersion", biosVersion which serializes the entire resource (spec, status, metadata) into the log line. Consider logging just the name:

-	log.V(1).Info("Failed to upgrade BIOSVersion", "BIOSVersion", biosVersion, "Server", server.Name)
+	log.V(1).Info("Failed to upgrade BIOSVersion", "BIOSVersion", biosVersion.Name, "Server", server.Name)

The same pattern exists in bmcsettings_controller.go Line 779 ("bmcSetting", bmcSetting). As per coding guidelines, follow Kubernetes logging conventions with key-value pairs that are concise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biosversion_controller.go` at line 507, The log is
serializing the entire resource by passing the full biosVersion object; change
the log call in biosversion_controller (the line using log.V(1).Info("Failed to
upgrade BIOSVersion", "BIOSVersion", biosVersion, "Server", server.Name)) to log
a concise identifier instead (e.g., use biosVersion.Name or
biosVersion.GetName()) so only the resource name is logged; apply the same
change pattern for the bmcSetting log in bmcsettings_controller (replace
"bmcSetting", bmcSetting with the bmcSetting name/identifier).

436-509: Same retry logic duplication as other controllers.

processFailedState is the same retry pattern with only type names changed. See the refactoring suggestion on bmcsettings_controller.go.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biosversion_controller.go` around lines 436 - 509,
processFailedState duplicates the common "retry failed reconciliation" logic;
extract that flow into a shared helper and call it from
BIOSVersionReconciler.processFailedState. Create a reusable function (e.g.,
HandleFailedResourceRetry or handleFailedResourceRetry) that accepts context,
the reconciler's Conditions helper, the resource object (BIOSVersion), its
Status/Spec accessors or an interface to read/write State,
AutoRetryCountRemaining and FailedAutoRetryCount, the
DefaultFailedAutoRetryCount, and the RetryOfFailedResourceConditionIssued
constant and operation annotation key; move the annotation-based retry block,
the default/explicit retryCount logic, updating of Retry condition via
r.Conditions.Update, mutation of Status.Conditions, adjustment of
AutoRetryCountRemaining, and the Status().Patch behavior into that helper and
return (bool, error) so processFailedState can simply forward the call (and only
keep any BIOSVersion-specific logging). Update processFailedState to call this
helper and remove the duplicated logic (keep references to GetCondition,
RetryOfFailedResourceConditionIssued, BIOSVersionStatePending, and
r.Status().Patch as needed).
internal/controller/bmcversion_controller.go (1)

406-483: Same retry logic duplication as noted in bmcsettings_controller.go.

This handleFailedState is nearly line-for-line identical to the one in the BMCSettings controller (and the BIOSVersion/BIOSSettings equivalents). See the refactoring suggestion on bmcsettings_controller.go.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcversion_controller.go` around lines 406 - 483, The
handleFailedState method in BMCVersionReconciler duplicates retry logic found in
other controllers; extract the shared retry flow into a reusable helper (e.g.,
RetryFailedResource or handleFailedStateShared) that accepts the reconciler's
Conditions, Status client, the resource object (with fields/methods used:
Status.AutoRetryCountRemaining, Spec.FailedAutoRetryCount, Name, GetAnnotations,
DeepCopy, Status.Conditions), DefaultFailedAutoRetryCount and the
RetryOfFailedResourceCondition constant, then have
BMCVersionReconciler.handleFailedState call that helper; the helper should
perform the GetCondition/Conditions.Update logic, adjust
AutoRetryCountRemaining, set State to Pending, and call r.Status().Patch (using
the passed status client/mergeFrom base) so you can remove the duplicated code
from BMCVersionReconciler and reuse it from BMCSettings/BIOSVersion/BIOSSettings
controllers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v1alpha1/biossettings_types.go`:
- Around line 20-23: The kubebuilder validation marker for the minimum
constraint is missing the required '+' prefix causing the Minimum=0 marker on
FailedAutoRetryCount in BIOSSettings (symbol: FailedAutoRetryCount) to be
ignored and likewise in biosversion_types.go; update the comments to add the '+'
prefix (e.g. change "// kubebuilder:validation:Minimum=0" to "//
+kubebuilder:validation:Minimum=0") for the affected fields
(FailedAutoRetryCount and the analogous field in biosversion_types.go), then run
"make manifests" and "make generate" to regenerate the CRD and code.

In `@api/v1alpha1/biosversion_types.go`:
- Around line 48-51: The kubebuilder marker for the FailedAutoRetryCount field
is missing the required '+' prefix so controller-gen won't pick up the
validation; update the comment above the FailedAutoRetryCount field (the struct
tag/field named FailedAutoRetryCount in biosversion_types.go) to use "//
+kubebuilder:validation:Minimum=0" instead of "//
kubebuilder:validation:Minimum=0" so the Minimum=0 validation is recognized.

In `@api/v1alpha1/bmcsettings_types.go`:
- Around line 22-25: The kubebuilder validation markers are missing the required
'+' prefix in several type files, so change the comment lines from "//
kubebuilder:validation:Minimum=0" to "// +kubebuilder:validation:Minimum=0" in
the four affected type definitions (update bmcsettings_types.go,
bmcversion_types.go, biosversion_types.go, biossettings_types.go) so the
Minimum=0 constraint is picked up for the corresponding retry/count fields
(e.g., FailedAutoRetryCount and the equivalent fields in the other types); after
editing those markers run make manifests and make generate to regenerate CRDs
and deepcopy files.

In `@bmc/redfish_local.go`:
- Around line 131-139: Rename the misspelled local variable taskSatus to
taskStatus in this function: change the declaration/assignment from "taskSatus
:= UnitTestMockUps.BIOSUpgradeTaskStatus", update the conditional that assigns
BIOSUpgradeTaskFailedStatus when BIOSUpgradingVersion contains "fail", and
update the subsequent uses in the bounds check (len(taskStatus)) and the task
pointer creation (&taskStatus[index]); ensure all occurrences in this snippet
match the correct spelling and compile.

In `@cmd/main.go`:
- Around line 95-96: Add validation after flag.Parse() to reject negative values
for defaultFailedAutoRetryCount: check the parsed defaultFailedAutoRetryCount
variable and if it's less than 0 return an error or log a fatal message and
exit, explaining that negative values are invalid; update any startup path that
uses defaultFailedAutoRetryCount to rely on this validation so controllers
BIOSSettings, BIOSVersion, BMCSettings and BMCVersion no longer silently disable
retries when given negative input.

In `@config/crd/bases/metal.ironcore.dev_bmcversions.yaml`:
- Around line 87-92: The kubebuilder markers for the FailedAutoRetryCount fields
are missing the required '+' prefix and are being rendered as plain text in
generated CRDs; update the comment on each FailedAutoRetryCount field in the
corresponding type definitions (BMCVersion, BMCSettings, BIOSSettings,
BIOSVersion) to change "// kubebuilder:validation:Minimum=0" to "//
+kubebuilder:validation:Minimum=0" so the marker is recognized and converted
into a proper schema constraint during CRD generation.

---

Duplicate comments:
In `@config/crd/bases/metal.ironcore.dev_biossettingssets.yaml`:
- Around line 73-78: The CRD is missing the `minimum: 0` constraint because the
kubebuilder marker in the corresponding types file lacks the required '+'
prefix; locate the field FailedAutoRetryCount in the relevant *_types.go (e.g.,
the BIOSSettingsSetSpec struct) and change the comment marker to use the plus
form (// +kubebuilder:validation:Minimum=0) for that field, then re-run the CRD
generation (controller-gen / your project generate task) so the generated
metal.ironcore.dev_biossettingssets.yaml includes minimum: 0.

In `@config/crd/bases/metal.ironcore.dev_bmcsettings.yaml`:
- Around line 70-75: The CRD is missing the explicit `minimum: 0` for the
failedAutoRetryCount because the kubebuilder validation marker in the type
definition is malformed; open api/v1alpha1/bmcsettings_types.go and fix the
validation comment for the FailedAutoRetryCount field to use the correct
kubebuilder marker (e.g. ensure the line reads `//
+kubebuilder:validation:Minimum=0` with the leading plus), then re-run `make
manifests` to regenerate config/crd/.../metal.ironcore.dev_bmcsettings.yaml so
the failedAutoRetryCount entry contains `minimum: 0`.

In `@internal/controller/biossettings_controller.go`:
- Around line 1073-1101: The auto-retry branch resets State to
metalv1alpha1.BIOSSettingsStatePending but doesn't clear prior FlowState
entries, risking stale progress; when preparing for auto-retry in the block that
checks retryCount and manipulates settings (e.g., where biosSettingsBase :=
settings.DeepCopy(), settings.Status.State =
metalv1alpha1.BIOSSettingsStatePending, and AutoRetryCountRemaining is
decremented), also clear the in-memory FlowState (e.g., set
settings.Status.FlowState = nil or empty slice) before calling r.Status().Patch
so the retry mirrors the manual retry path and removes any previous flow
progress/conditions.

In `@internal/controller/bmcsettings_controller.go`:
- Around line 451-468: The error message when errors.As detects an
*common.InvalidBMCSettingsError is misleading; update the fmt.Errorf call that
wraps errCond in the GetCondition failure branch so the message reflects invalid
settings validation (e.g., "failed to get Condition for invalid BMC settings
validation") instead of "skip reboot post setting update"; locate the block
handling InvalidBMCSettingsError (uses variable invalidSettingsErr, calls
GetCondition(r.Conditions, bmcSetting.Status.Conditions,
BMCSettingsConditionWrongSettings), then r.Conditions.Update and
r.updateBMCSettingsStatus) and change only the descriptive string passed to
fmt.Errorf while preserving the errors.Join(err, errCond) behavior.
- Around line 586-604: The local variable named BMCPoweredOffCondition shadows
the package-level constant BMCPoweredOffCondition; rename the local variable
(e.g., to bmcPoweredOffCond) at both places where you call GetCondition and
receive the result, and update all subsequent uses in that scope (the call to
r.Conditions.Update and the call to r.updateBMCSettingsStatus, plus the later
GetCondition usage in the default branch) to the new local name so the constant
is no longer shadowed.

---

Nitpick comments:
In `@bmc/common/helpers.go`:
- Line 29: The TODO comment in bmc/common/helpers.go incorrectly mentions
Enumerations even though EnumerationAttributeType is already implemented in the
function (look for the EnumerationAttributeType case), so update the comment to
reflect only the truly missing types (e.g., "add support for maps/objects") or
remove the TODO entirely; ensure you keep or adjust any nearby comment to
reference the actual unhandled types rather than Enumerations.

In `@bmc/mockup.go`:
- Around line 34-59: Extract the common 5-step preamble used by
defaultUpgradeFailedTaskStatus and defaultUpgradeTaskStatus into a helper (e.g.,
upgradeTaskPreamble or makeUpgradePreamble) that returns []schemas.Task for
New→Pending→Starting→Running(10)→Running(20), then have
defaultUpgradeFailedTaskStatus and defaultUpgradeTaskStatus call that helper and
append their differing terminal steps (Exception/99 and Completed/100
respectively); update both functions to build their return slices by
concatenating the preamble slice with the terminal entries to remove duplication
and make the terminal conditions explicit.

In `@cmd/main.go`:
- Around line 56-61: Ensure you ran the Go lint and tests after editing
cmd/main.go's main function: execute `make lint-fix` to apply linter fixes and
then `make test` to run the test suite, verify both complete without errors, and
commit any auto-fixed changes before pushing.

In `@internal/controller/biossettingsset_controller_test.go`:
- Around line 500-510: The test contains asymmetric, timing-sensitive assertions
for biosSettings02 and biosSettings03: biosSettings02 checks
Status.AutoRetryCountRemaining is BeNil() while biosSettings03 asserts
Not(Equal(GetPtr(int32(0)))), which can be flaky if reconciliations are fast;
make the assertions consistent by either removing the nil check and asserting
only that Status.State is not metalv1alpha1.BIOSSettingsStateFailed for both
biosSettings02 and biosSettings03, or assert the same AutoRetryCountRemaining
expectation for both (e.g., Not(Equal(GetPtr(int32(0))))), referencing the
Status.State and Status.AutoRetryCountRemaining fields on biosSettings02 and
biosSettings03 in the Eventually blocks.

In `@internal/controller/biosversion_controller.go`:
- Line 507: The log is serializing the entire resource by passing the full
biosVersion object; change the log call in biosversion_controller (the line
using log.V(1).Info("Failed to upgrade BIOSVersion", "BIOSVersion", biosVersion,
"Server", server.Name)) to log a concise identifier instead (e.g., use
biosVersion.Name or biosVersion.GetName()) so only the resource name is logged;
apply the same change pattern for the bmcSetting log in bmcsettings_controller
(replace "bmcSetting", bmcSetting with the bmcSetting name/identifier).
- Around line 436-509: processFailedState duplicates the common "retry failed
reconciliation" logic; extract that flow into a shared helper and call it from
BIOSVersionReconciler.processFailedState. Create a reusable function (e.g.,
HandleFailedResourceRetry or handleFailedResourceRetry) that accepts context,
the reconciler's Conditions helper, the resource object (BIOSVersion), its
Status/Spec accessors or an interface to read/write State,
AutoRetryCountRemaining and FailedAutoRetryCount, the
DefaultFailedAutoRetryCount, and the RetryOfFailedResourceConditionIssued
constant and operation annotation key; move the annotation-based retry block,
the default/explicit retryCount logic, updating of Retry condition via
r.Conditions.Update, mutation of Status.Conditions, adjustment of
AutoRetryCountRemaining, and the Status().Patch behavior into that helper and
return (bool, error) so processFailedState can simply forward the call (and only
keep any BIOSVersion-specific logging). Update processFailedState to call this
helper and remove the duplicated logic (keep references to GetCondition,
RetryOfFailedResourceConditionIssued, BIOSVersionStatePending, and
r.Status().Patch as needed).

In `@internal/controller/biosversionset_controller_test.go`:
- Around line 494-504: The test uses asymmetric assertions for
AutoRetryCountRemaining on biosVersion02 (BeNil()) and biosVersion03
(Not(Equal(GetPtr(int32(0))))), which relies on timing; make the checks
consistent so they don't depend on reconciliation ordering by changing the
biosVersion02 assertion to mirror biosVersion03: assert
HaveField("Status.AutoRetryCountRemaining", Not(Equal(GetPtr(int32(0))))) while
keeping the State check (Not(Equal(metalv1alpha1.BIOSVersionStateFailed))) for
both, targeting the biosVersion02 and biosVersion03 objects and the
AutoRetryCountRemaining field.

In `@internal/controller/bmcsettings_controller.go`:
- Around line 704-777: The handleFailedState implementations duplicate
annotation-based retry and auto-retry logic across controllers; extract that
logic into a single generic helper (e.g., handleRetry[T client.Object]) that
takes ctx, client.Client (or the reconciler), *conditionutils.Accessor
(r.Conditions), the resource object (bmcSetting), the default retry count
(r.DefaultFailedAutoRetryCount), and small accessors for spec/status mutation
(getSpec/getStatus/setStatus or closure equivalents) so the helper can: check
the OperationAnnotation retry path, update the
RetryOfFailedResourceConditionIssued condition, compute/adjust
AutoRetryCountRemaining, set State to BMCSettingsStatePending (or a passed-in
pending state string), and perform the Status().Patch via MergeFrom; then
replace the duplicated blocks in BMCSettingsReconciler.handleFailedState (and
the counterparts in bmcversion_controller.go, biosversion_controller.go,
biossettings_controller.go) with a single call to handleRetry and handle
returned (retried,bool) and error values.

In `@internal/controller/bmcsettingsset_controller_test.go`:
- Around line 581-585: The test is using a 1µs polling interval which is too
aggressive; update the Eventually call that wraps Object(bmcSettings01) to use a
millisecond-scale poll or default polling instead of 1 * time.Microsecond —
e.g., replace WithPolling(1 * time.Microsecond) with WithPolling(1 *
time.Millisecond) or remove WithPolling(...) entirely so Eventually uses its
default interval; keep the rest of the assertion (HaveField checks on
Status.State and Status.AutoRetryCountRemaining) unchanged.

In `@internal/controller/bmcversion_controller.go`:
- Around line 406-483: The handleFailedState method in BMCVersionReconciler
duplicates retry logic found in other controllers; extract the shared retry flow
into a reusable helper (e.g., RetryFailedResource or handleFailedStateShared)
that accepts the reconciler's Conditions, Status client, the resource object
(with fields/methods used: Status.AutoRetryCountRemaining,
Spec.FailedAutoRetryCount, Name, GetAnnotations, DeepCopy, Status.Conditions),
DefaultFailedAutoRetryCount and the RetryOfFailedResourceCondition constant,
then have BMCVersionReconciler.handleFailedState call that helper; the helper
should perform the GetCondition/Conditions.Update logic, adjust
AutoRetryCountRemaining, set State to Pending, and call r.Status().Patch (using
the passed status client/mergeFrom base) so you can remove the duplicated code
from BMCVersionReconciler and reuse it from BMCSettings/BIOSVersion/BIOSSettings
controllers.

In `@internal/controller/bmcversionset_controller_test.go`:
- Around line 604-613: The two assertions for AutoRetryCountRemaining are
inconsistent: change the bmcVersion02 check to match bmcVersion03 (verify
AutoRetryCountRemaining is non-nil) — e.g., use
HaveField("Status.AutoRetryCountRemaining", Not(BeNil())) for both bmcVersion02
and bmcVersion03 (or alternatively use Not(Equal(GetPtr(int32(0)))) for both) so
they assert the same retried state; also remove the redundant
.MustPassRepeatedly(1) on the Eventually(Object(bmcVersion02)) call to simplify
the test.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2b3044 and ac2cd8b.

⛔ Files ignored due to path filters (8)
  • dist/chart/templates/crd/metal.ironcore.dev_biossettings.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biossettingssets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biosversions.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biosversionsets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcsettings.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcsettingssets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcversions.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcversionsets.yaml is excluded by !**/dist/**
📒 Files selected for processing (35)
  • api/v1alpha1/biossettings_types.go
  • api/v1alpha1/biosversion_types.go
  • api/v1alpha1/bmcsettings_types.go
  • api/v1alpha1/bmcversion_types.go
  • bmc/common/helpers.go
  • bmc/mockup.go
  • bmc/oem/dell.go
  • bmc/redfish_kube.go
  • bmc/redfish_local.go
  • cmd/main.go
  • config/crd/bases/metal.ironcore.dev_biossettings.yaml
  • config/crd/bases/metal.ironcore.dev_biossettingssets.yaml
  • config/crd/bases/metal.ironcore.dev_biosversions.yaml
  • config/crd/bases/metal.ironcore.dev_biosversionsets.yaml
  • config/crd/bases/metal.ironcore.dev_bmcsettings.yaml
  • config/crd/bases/metal.ironcore.dev_bmcsettingssets.yaml
  • config/crd/bases/metal.ironcore.dev_bmcversions.yaml
  • config/crd/bases/metal.ironcore.dev_bmcversionsets.yaml
  • docs/api-reference/api.md
  • internal/controller/biossettings_controller.go
  • internal/controller/biossettings_controller_test.go
  • internal/controller/biossettingsset_controller.go
  • internal/controller/biossettingsset_controller_test.go
  • internal/controller/biosversion_controller.go
  • internal/controller/biosversion_controller_test.go
  • internal/controller/biosversionset_controller.go
  • internal/controller/biosversionset_controller_test.go
  • internal/controller/bmcsettings_controller.go
  • internal/controller/bmcsettings_controller_test.go
  • internal/controller/bmcsettingsset_controller.go
  • internal/controller/bmcsettingsset_controller_test.go
  • internal/controller/bmcversion_controller.go
  • internal/controller/bmcversion_controller_test.go
  • internal/controller/bmcversionset_controller.go
  • internal/controller/bmcversionset_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • api/v1alpha1/bmcversion_types.go
  • config/crd/bases/metal.ironcore.dev_bmcversionsets.yaml
  • config/crd/bases/metal.ironcore.dev_bmcsettingssets.yaml
  • config/crd/bases/metal.ironcore.dev_biosversions.yaml
  • internal/controller/biosversionset_controller.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/bmcsettings_controller.go (1)

704-777: ⚠️ Potential issue | 🟠 Major

Reset AutoRetryCountRemaining on spec changes; ConditionTrue issue is reversed in suggested fix.

Two concerns raised, with one requiring correction:

  1. AutoRetryCountRemaining reset (valid concern): When Spec.FailedAutoRetryCount changes, Status.AutoRetryCountRemaining is never reset. The *set_controller patterns correctly handle this (e.g., bmcsettingsset_controller.go:350): bmcSettings.Status.AutoRetryCountRemaining = bmcSettings.Spec.FailedAutoRetryCount. Add spec-change detection to reset the retry counter in bmcsettings_controller.go to comply with idempotent reconciliation.

  2. ConditionTrue type (review comment's suggestion is incorrect): Line 722 uses metav1.ConditionTrue while the rest of the controller uses corev1.ConditionTrue. However, the fix direction in the review is backwards. Since bmcSetting.Status.Conditions is of type []metav1.Condition, which requires metav1.ConditionStatus, line 722 is correct and the rest of the codebase should be updated to use metav1.ConditionTrue, not the other way around. The 18+ occurrences of corev1.ConditionTrue in condition updates are the inconsistency that should be corrected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller.go` around lines 704 - 777, handle
the missing reset of AutoRetryCountRemaining and normalize condition status
types: detect spec changes to FailedAutoRetryCount in handleFailedState (or
earlier in reconciliation) and when bmcSetting.Spec.FailedAutoRetryCount differs
from the current Status.AutoRetryCountRemaining reset
Status.AutoRetryCountRemaining to the spec value (follow the pattern used in
bmcsettingsset_controller.go where bmcSettings.Status.AutoRetryCountRemaining =
bmcSettings.Spec.FailedAutoRetryCount); also keep metav1.ConditionStatus types
consistent by using metav1.ConditionTrue (not corev1.ConditionTrue) when
updating or comparing the RetryOfFailedResourceConditionIssued via GetCondition
and r.Conditions.Update so all condition updates (e.g., the block around
GetCondition/conditionutils.UpdateStatus) use metav1.ConditionTrue across the
controller.
♻️ Duplicate comments (3)
internal/controller/bmcversionset_controller_test.go (1)

609-613: Previously flagged bugs (wrong object + wrong type) are now resolved.

bmcVersion03 is now the observed object and metalv1alpha1.BMCVersionStateFailed is the correct named type — both past issues are fixed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcversionset_controller_test.go` around lines 609 - 613,
The test already uses the correct observed object (bmcVersion03) and the proper
enum type (metalv1alpha1.BMCVersionStateFailed), so no code changes are
required; ensure the assertions in
Eventually(Object(bmcVersion03)).WithPolling(...) still reference
HaveField("Status.State", Not(Equal(metalv1alpha1.BMCVersionStateFailed))) and
HaveField("Status.AutoRetryCountRemaining", Not(Equal(GetPtr(int32(0))))) and
then mark the change as approved/mergeable.
bmc/redfish_local.go (1)

131-139: Previously flagged taskSatus typo is now corrected to taskStatus.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bmc/redfish_local.go` around lines 131 - 139, The typo has been corrected:
ensure the code consistently uses the variable and struct fields
UnitTestMockUps.BIOSUpgradeTaskStatus, UnitTestMockUps.BIOSUpgradingVersion and
UnitTestMockUps.BIOSUpgradeTaskFailedStatus and that the logic computing index
and taking task := &taskStatus[index] remains correct (guarding index >=
len(taskStatus) as shown); no further changes are needed beyond keeping these
exact identifiers consistent.
internal/controller/bmcsettings_controller.go (1)

576-619: Avoid shadowing the BMCPoweredOffCondition constant with a local variable.

The BMCPoweredOffCondition, err := GetCondition(..., BMCPoweredOffCondition) pattern is easy to misread and has already been a source of confusion.

Possible cleanup
-		BMCPoweredOffCondition, err := GetCondition(r.Conditions, bmcSetting.Status.Conditions, BMCPoweredOffCondition)
+		bmcPoweredOffCond, err := GetCondition(r.Conditions, bmcSetting.Status.Conditions, BMCPoweredOffCondition)
 		if err != nil {
 			return fmt.Errorf("failed to get Condition for powered off BMC state %v", err)
 		}
-		if BMCPoweredOffCondition.Status == metav1.ConditionTrue {
+		if bmcPoweredOffCond.Status == metav1.ConditionTrue {
 			if err := r.Conditions.Update(
-				BMCPoweredOffCondition,
+				bmcPoweredOffCond,
 				conditionutils.UpdateStatus(corev1.ConditionFalse),
 				conditionutils.UpdateReason("BMCPoweredBackOn"),
 				conditionutils.UpdateMessage(fmt.Sprintf("BMC in Powered On, Power State: %v", BMC.Status.PowerState)),
 			); err != nil {
 				return fmt.Errorf("failed to update Pending BMCSetting update condition: %w", err)
 			}
-			err = r.updateBMCSettingsStatus(ctx, bmcSetting, bmcSetting.Status.State, BMCPoweredOffCondition)
+			err = r.updateBMCSettingsStatus(ctx, bmcSetting, bmcSetting.Status.State, bmcPoweredOffCond)
 			return err
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller.go` around lines 576 - 619, The
function handleBMCPowerState shadows the BMCPoweredOffCondition constant by
declaring a local variable with the same name; change the local variable to a
non-conflicting identifier (e.g., poweredOffCond or bmcpoweredOffCond) in both
places where you call GetCondition and use it in subsequent r.Conditions.Update
and r.updateBMCSettingsStatus calls so the constant remains unshadowed and all
references in handleBMCPowerState use the new local variable name.
🧹 Nitpick comments (2)
internal/controller/bmcsettings_controller_test.go (1)

614-617: Use a less aggressive polling interval in this assertion.

Line 617 polls every 1ms, which is unnecessarily tight for envtest and can cause avoidable flakiness. A coarser interval (e.g., 25–50ms) is usually more stable.

Proposed tweak
-		}).WithPolling((1 * time.Millisecond)).Should(BeTrue())
+		}).WithPolling(50 * time.Millisecond).Should(BeTrue())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller_test.go` around lines 614 - 617,
The test's Eventually call uses an overly aggressive polling interval; update
the WithPolling argument in the Eventually containing Get(bmcSettings)() and the
check on bmcSettings.Status.AutoRetryCountRemaining so it uses a coarser
interval (for example 25*time.Millisecond or 50*time.Millisecond) instead of
1*time.Millisecond to reduce flakiness in envtest while keeping the same
condition and timeout.
internal/controller/bmcsettings_controller.go (1)

57-58: Nice: InvalidBMCSettingsError becomes an actionable condition; consider skipping auto-retry for this failure type.

Surfacing “wrong settings” as BMCSettingsConditionWrongSettings is great for users. One follow-up: if auto-retry is enabled, deterministic spec errors will still burn through retries (and potentially flap) even though the only fix is changing the spec.

Also applies to: 451-468

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller.go` around lines 57 - 58, The
controller currently surfaces deterministic spec errors via
BMCSettingsConditionWrongSettings; update the reconcile error handling so that
when InvalidBMCSettingsError is returned (or detected) you set the
BMCSettingsConditionWrongSettings / BMCSettingsReasonWrongSettings condition and
mark the result as non-requeue/non-retry (i.e., do not trigger exponential
requeue) instead of treating it as a transient error; specifically modify the
reconcile/error-handling path that processes errors from functions returning
InvalidBMCSettingsError (and any helpers that call setCondition) to return a
non-retryable result (requeue=false) while still recording the condition so the
user can act.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/main.go`:
- Around line 211-214: Validate that the flag variable
defaultFailedAutoRetryCount is <= math.MaxInt32 before any int32 cast: add an
upper-bound check (<= math.MaxInt32) alongside the existing non-negative check
and return a clear Kubernetes-style error log message starting with a capital
letter (e.g., "Default failed auto retry count cannot exceed MaxInt32") using
setupLog.Error, then os.Exit(1); replace unchecked casts
int32(defaultFailedAutoRetryCount) wherever they occur (references: the int32
casts of defaultFailedAutoRetryCount used later in main.go) with safe conversion
after the new validation so overflow cannot occur.

In `@internal/controller/biossettings_controller.go`:
- Around line 1061-1104: The code fails to reinitialize AutoRetryCountRemaining
after it reaches 0 on later spec updates; modify the reconcile logic in the
BIOSSettings controller (around the block that reads FailedAutoRetryCount,
DefaultFailedAutoRetryCount and manipulates
settings.Status.AutoRetryCountRemaining) to detect a spec update and reset the
retry budget: specifically, when retryCount > 0, treat the case where
settings.Status.AutoRetryCountRemaining != nil &&
*settings.Status.AutoRetryCountRemaining == 0 as eligible for reinitialization
if the resource spec has changed (e.g., compare settings.Generation to
settings.Status.ObservedGeneration or compare spec hash stored on status); if a
spec change is detected set settings.Status.AutoRetryCountRemaining =
retryCount-1 (or retryCount if you prefer consistent decrement logic), update
ObservedGeneration/spec-hash on status, then Patch status via r.Status().Patch
(same MergeFrom flow used currently) so the resource can be retried after a spec
update.

In `@internal/controller/biosversion_controller_test.go`:
- Around line 430-437: The test currently deletes biosVersion before attempting
to list/delete ServerMaintenance, which races with the controller's cascade
deletion; modify the cleanup in biosversion_controller_test.go so you first list
ServerMaintenance (use the serverMaintenanceList variable and
ObjectList(&serverMaintenanceList)), delete each maintenance via
k8sClient.Delete(&maintenance), wait until the list is empty with
Eventually(ObjectList(&serverMaintenanceList)).Should(HaveField("Items",
BeEmpty())), and only then delete biosVersion with Expect(k8sClient.Delete(ctx,
biosVersion)).To(Succeed()); ensure you reuse the existing serverMaintenanceList
and k8sClient symbols and preserve the Eventually checks for emptiness.

In `@internal/controller/bmcsettings_controller_test.go`:
- Around line 636-641: The cleanup currently forces a non-empty
serverMaintenanceList via
Eventually(ObjectList(&serverMaintenanceList)).Should(HaveField("Items",
Not(BeEmpty()))) which makes the test fail when there are no maintenance
objects; change this to wait for the ObjectList to succeed (e.g.,
Eventually(ObjectList(&serverMaintenanceList)).Should(Succeed())) and then
conditionally delete any items only if serverMaintenanceList.Items is non-empty
(use if len(serverMaintenanceList.Items) > 0 { for _, maintenance := range
serverMaintenanceList.Items { Expect(k8sClient.Delete(ctx,
&maintenance)).To(Succeed()) } }). This keeps the cleanup robust when there are
zero maintenance objects.

In `@internal/controller/bmcsettings_controller.go`:
- Around line 438-442: The power-off handling in handleBMCPowerState currently
sets BMCSettingsStateFailed in the default branch but returns nil, so callers
(e.g., the reconcile block that calls r.handleBMCPowerState and then
updateSettingsAndVerify) continue processing and may apply settings despite a
failure; change handleBMCPowerState to return an explicit proceed signal (e.g.,
bool or a custom result type) or an error to indicate whether reconciliation
should continue, update the caller sites (the reconcile path around
handleBMCPowerState and the code that calls updateSettingsAndVerify) to check
that signal and short-circuit (requeue or return) when proceed is false, and
ensure BMCSettingsStateFailed is only set when you stop further processing to
keep reconcile idempotent and avoid applying settings after a failed power-state
transition.

In `@internal/controller/bmcversion_controller_test.go`:
- Around line 430-437: The test currently deletes bmcVersion before asserting
and deleting ServerMaintenance objects which can race with the controller's
cascade deletion; update the test to fetch and delete ServerMaintenance items
prior to deleting bmcVersion (or perform a conditional fetch-and-delete that
tolerates an empty list) by using the existing serverMaintenanceList variable
and ObjectList(&serverMaintenanceList) to collect items and
Expect(k8sClient.Delete(...)) for each maintenance before calling the code that
deletes bmcVersion, or wrap the deletion loop with a conditional that proceeds
only if serverMaintenanceList.Items is non-empty; ensure UpdateStatus(server,
...) and other status resets remain after maintenance cleanup.

In `@internal/controller/bmcversionset_controller_test.go`:
- Around line 679-684: The cleanup assertions are verifying the wrong servers:
the test selector matches bmc02/bmc03 (server02/server03) but the code checks
server01 and server02, leaving server03 unverified; update the assertions to
check server02 and server03 instead (replace the reference to server01 with
server03) so that both servers matched by the label
("metal.ironcore.dev/Manufacturer": "bar") are asserted to have exited
Maintenance (use the existing variables server02 and server03 in the
Eventually(Object(...)).Should(...) checks).

---

Outside diff comments:
In `@internal/controller/bmcsettings_controller.go`:
- Around line 704-777: handle the missing reset of AutoRetryCountRemaining and
normalize condition status types: detect spec changes to FailedAutoRetryCount in
handleFailedState (or earlier in reconciliation) and when
bmcSetting.Spec.FailedAutoRetryCount differs from the current
Status.AutoRetryCountRemaining reset Status.AutoRetryCountRemaining to the spec
value (follow the pattern used in bmcsettingsset_controller.go where
bmcSettings.Status.AutoRetryCountRemaining =
bmcSettings.Spec.FailedAutoRetryCount); also keep metav1.ConditionStatus types
consistent by using metav1.ConditionTrue (not corev1.ConditionTrue) when
updating or comparing the RetryOfFailedResourceConditionIssued via GetCondition
and r.Conditions.Update so all condition updates (e.g., the block around
GetCondition/conditionutils.UpdateStatus) use metav1.ConditionTrue across the
controller.

---

Duplicate comments:
In `@bmc/redfish_local.go`:
- Around line 131-139: The typo has been corrected: ensure the code consistently
uses the variable and struct fields UnitTestMockUps.BIOSUpgradeTaskStatus,
UnitTestMockUps.BIOSUpgradingVersion and
UnitTestMockUps.BIOSUpgradeTaskFailedStatus and that the logic computing index
and taking task := &taskStatus[index] remains correct (guarding index >=
len(taskStatus) as shown); no further changes are needed beyond keeping these
exact identifiers consistent.

In `@internal/controller/bmcsettings_controller.go`:
- Around line 576-619: The function handleBMCPowerState shadows the
BMCPoweredOffCondition constant by declaring a local variable with the same
name; change the local variable to a non-conflicting identifier (e.g.,
poweredOffCond or bmcpoweredOffCond) in both places where you call GetCondition
and use it in subsequent r.Conditions.Update and r.updateBMCSettingsStatus calls
so the constant remains unshadowed and all references in handleBMCPowerState use
the new local variable name.

In `@internal/controller/bmcversionset_controller_test.go`:
- Around line 609-613: The test already uses the correct observed object
(bmcVersion03) and the proper enum type (metalv1alpha1.BMCVersionStateFailed),
so no code changes are required; ensure the assertions in
Eventually(Object(bmcVersion03)).WithPolling(...) still reference
HaveField("Status.State", Not(Equal(metalv1alpha1.BMCVersionStateFailed))) and
HaveField("Status.AutoRetryCountRemaining", Not(Equal(GetPtr(int32(0))))) and
then mark the change as approved/mergeable.

---

Nitpick comments:
In `@internal/controller/bmcsettings_controller_test.go`:
- Around line 614-617: The test's Eventually call uses an overly aggressive
polling interval; update the WithPolling argument in the Eventually containing
Get(bmcSettings)() and the check on bmcSettings.Status.AutoRetryCountRemaining
so it uses a coarser interval (for example 25*time.Millisecond or
50*time.Millisecond) instead of 1*time.Millisecond to reduce flakiness in
envtest while keeping the same condition and timeout.

In `@internal/controller/bmcsettings_controller.go`:
- Around line 57-58: The controller currently surfaces deterministic spec errors
via BMCSettingsConditionWrongSettings; update the reconcile error handling so
that when InvalidBMCSettingsError is returned (or detected) you set the
BMCSettingsConditionWrongSettings / BMCSettingsReasonWrongSettings condition and
mark the result as non-requeue/non-retry (i.e., do not trigger exponential
requeue) instead of treating it as a transient error; specifically modify the
reconcile/error-handling path that processes errors from functions returning
InvalidBMCSettingsError (and any helpers that call setCondition) to return a
non-retryable result (requeue=false) while still recording the condition so the
user can act.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac2cd8b and 89d4080.

⛔ Files ignored due to path filters (8)
  • dist/chart/templates/crd/metal.ironcore.dev_biossettings.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biossettingssets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biosversions.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biosversionsets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcsettings.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcsettingssets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcversions.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcversionsets.yaml is excluded by !**/dist/**
📒 Files selected for processing (25)
  • api/v1alpha1/biossettings_types.go
  • api/v1alpha1/biosversion_types.go
  • api/v1alpha1/bmcsettings_types.go
  • api/v1alpha1/bmcversion_types.go
  • bmc/common/helpers.go
  • bmc/redfish.go
  • bmc/redfish_local.go
  • cmd/main.go
  • config/crd/bases/metal.ironcore.dev_biossettings.yaml
  • config/crd/bases/metal.ironcore.dev_biossettingssets.yaml
  • config/crd/bases/metal.ironcore.dev_biosversions.yaml
  • config/crd/bases/metal.ironcore.dev_biosversionsets.yaml
  • config/crd/bases/metal.ironcore.dev_bmcsettings.yaml
  • config/crd/bases/metal.ironcore.dev_bmcsettingssets.yaml
  • config/crd/bases/metal.ironcore.dev_bmcversions.yaml
  • config/crd/bases/metal.ironcore.dev_bmcversionsets.yaml
  • docs/api-reference/api.md
  • internal/controller/biossettings_controller.go
  • internal/controller/biosversion_controller.go
  • internal/controller/biosversion_controller_test.go
  • internal/controller/bmcsettings_controller.go
  • internal/controller/bmcsettings_controller_test.go
  • internal/controller/bmcsettingsset_controller_test.go
  • internal/controller/bmcversion_controller_test.go
  • internal/controller/bmcversionset_controller_test.go
✅ Files skipped from review due to trivial changes (1)
  • bmc/redfish.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • internal/controller/bmcsettingsset_controller_test.go
  • config/crd/bases/metal.ironcore.dev_bmcversions.yaml
  • api/v1alpha1/biossettings_types.go
  • config/crd/bases/metal.ironcore.dev_bmcsettings.yaml
  • config/crd/bases/metal.ironcore.dev_biossettingssets.yaml
  • api/v1alpha1/bmcversion_types.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

♻️ Duplicate comments (1)
internal/controller/bmcversionset_controller_test.go (1)

515-515: Nit: rename retryCountfailedAutoRetryCount for clarity

retryCount is ambiguous — it could mean the annotation-driven retry count or the auto-retry budget. failedAutoRetryCount aligns with the field name FailedAutoRetryCount and makes the test self-documenting.

♻️ Proposed rename
-		retryCount := 2
+		failedAutoRetryCount := 2

Update all subsequent references accordingly (e.g., GetPtr(int32(retryCount))GetPtr(int32(failedAutoRetryCount))).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcversionset_controller_test.go` at line 515, Rename the
local test variable retryCount to failedAutoRetryCount and update all its uses
accordingly (e.g., change GetPtr(int32(retryCount)) →
GetPtr(int32(failedAutoRetryCount))) so the name aligns with the struct field
FailedAutoRetryCount and clarifies this is the auto-retry budget in the test;
ensure all occurrences in internal/controller/bmcversionset_controller_test.go
that reference retryCount are updated to the new identifier.
🧹 Nitpick comments (1)
internal/controller/bmcversion_controller_test.go (1)

408-411: Relax the polling/assertion windows to reduce flakiness.

Line 411 uses 1 * time.Millisecond polling and Line 423 checks consistency for only 25ms; these are too tight for reconciliation timing and can create flaky/weak assertions.

Suggested adjustment
-		}).WithPolling((1 * time.Millisecond)).Should(BeTrue())
+		}).WithPolling(50 * time.Millisecond).Should(BeTrue())

-		Consistently(Object(bmcVersion), "25ms").Should(SatisfyAll(
+		Consistently(Object(bmcVersion), "1s", "50ms").Should(SatisfyAll(
 			HaveField("Status.State", metalv1alpha1.BMCVersionStateFailed),
 			HaveField("Status.AutoRetryCountRemaining", Equal(GetPtr(int32(0)))),
 		))

Also applies to: 423-423

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcversion_controller_test.go` around lines 408 - 411,
The test uses too-tight timing in the Eventually polling for
bmcVersion.Status.AutoRetryCountRemaining; update the polling and consistency
windows to be more forgiving: in the Eventually call that wraps
Get(bmcVersion)() change WithPolling(1 * time.Millisecond) to something like
WithPolling(50 * time.Millisecond) (or 100ms) and where a 25ms consistency check
is used change it to ~200-500ms; keep referencing the same
Eventually(Get(bmcVersion)()) check and the
bmcVersion.Status.AutoRetryCountRemaining condition so the test waits longer and
reduces flakiness.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/api-reference/api.md`:
- Line 163: Update the wording for failedAutoRetryCount and similar entries to
use "settings update/apply" instead of "upgrade": locate the
`failedAutoRetryCount` table row and any other BIOSSettings/BMCSettings
descriptions (references noted around the same file, e.g., the entries that
currently say "upgrade") and replace "upgrade" with "settings update/apply" (or
"settings update" where appropriate) so the description reads that the
controller will retry the settings update/apply operation before giving up;
ensure this change is applied consistently for BIOSSettings and BMCSettings
entries.

In `@internal/controller/biossettings_controller_test.go`:
- Around line 888-893: The test currently deletes all ServerMaintenance items;
narrow deletion to only those created by this BIOSSettings by listing
ServerMaintenance into serverMaintenanceList with a selector/filter tied to the
BIOSSettings under test (e.g., use k8sClient.List(ctx, &serverMaintenanceList,
client.MatchingLabels{"biossettings": biosSettings.Name}) or filter the returned
serverMaintenanceList.Items for OwnerReferences matching the BIOSSettings
UID/Name) and then delete only those filtered items; keep the existing variables
k8sClient, serverMaintenanceList and ObjectList usage but replace the
unconditional List/Delete loop with a targeted list or post-list filter so
neighboring specs are not affected.

In `@internal/controller/biossettings_controller.go`:
- Around line 1039-1054: The retry path leaves old failure conditions when
RetryOfFailedResourceConditionIssued is already True; update the logic in the
manual retry handling (use GetCondition, RetryOfFailedResourceConditionIssued,
r.Conditions.Update and settings.Status.Conditions) to always normalize/reset
conditions for idempotency: when manual retry is triggered, regardless of
retryCondition.Status, replace settings.Status.Conditions with a fresh slice
containing only the normalized retryCondition (with ConditionTrue,
RetryOfFailedResourceReasonIssued and the operation message) and persist the
Update call to ensure prior failure conditions are cleared.

In `@internal/controller/biosversion_controller.go`:
- Around line 445-460: GetCondition is used to fetch the retry condition but
current logic only collapses conditions when Status != True, leaving old failure
conditions when it's already True; always collapse biosVersion.Status.Conditions
to only the retry condition to avoid carrying stale state. After calling
GetCondition(...) and after the r.Conditions.Update(...) branch (whether you
update or skip updating because Status is already True), set
biosVersion.Status.Conditions = []metav1.Condition{*retryCondition}
unconditionally so the retry state replaces any previous conditions; reference
GetCondition, RetryOfFailedResourceConditionIssued, r.Conditions.Update and
biosVersion.Status.Conditions when making this change.

In `@internal/controller/bmcsettings_controller_test.go`:
- Around line 637-641: The test currently deletes every ServerMaintenance
returned by k8sClient.List; change the cleanup to only delete ServerMaintenance
items owned by this test resource by filtering the ServerMaintenanceList.Items
(or using a List with a matching label selector) for OwnerReferences containing
the test resource's UID (or the test-specific label) before calling
k8sClient.Delete(ctx, &maintenance); update the code around
ServerMaintenanceList, maintenance, k8sClient.List and k8sClient.Delete to
perform that ownership check so only objects owned by this test are removed.

In `@internal/controller/bmcversion_controller_test.go`:
- Around line 430-435: The cleanup currently lists all ServerMaintenance
resources with k8sClient.List(ctx, &serverMaintenanceList) and may delete
resources from other tests; restrict the List to the current test namespace by
passing the namespace option (use client.InNamespace(ns.Name)) when calling
k8sClient.List, e.g., call k8sClient.List(ctx, &serverMaintenanceList,
client.InNamespace(ns.Name)) so the subsequent loop that deletes items only
affects ServerMaintenance objects in this test's namespace.

In `@internal/controller/bmcversion_controller.go`:
- Around line 420-434: The retry flow must deterministically reset the
BMCVersion condition set so stale failure conditions cannot persist; in
GetCondition/Retry handling (symbols: GetCondition,
RetryOfFailedResourceConditionIssued, r.Conditions.Update,
conditionutils.UpdateStatus/UpdateReason/UpdateMessage and
bmcVersion.Status.Conditions) ensure that whenever a manual retry is detected or
the retry condition ends up true you overwrite bmcVersion.Status.Conditions with
only the retry condition (not just when it was previously false) and still call
r.Conditions.Update to refresh reason/message as needed; in short, always
replace the Status.Conditions slice with []metav1.Condition{*retryCondition}
whenever retry is issued/true so reconciliation is idempotent.

In `@internal/controller/bmcversionset_controller_test.go`:
- Around line 605-613: The test uses WithPolling(10 * time.Microsecond) which is
too short; update the two Eventually assertions that reference bmcVersion02 and
bmcVersion03 to use WithPolling(10 * time.Millisecond) instead (or remove the
WithPolling call to rely on the suite default). Specifically change the
WithPolling argument on the Eventually(Object(bmcVersion02)) and
Eventually(Object(bmcVersion03)) assertions so they poll at a millisecond
granularity to avoid excessive API-server load and flakiness.
- Around line 604-613: The two assertions for AutoRetryCountRemaining are
asymmetric: bmcVersion02 uses BeNil() while bmcVersion03 uses
Not(Equal(GetPtr(int32(0)))) after applying
OperationAnnotationRetryChildAndSelf; make them consistent by updating the
matcher for either bmcVersion02 or bmcVersion03 so both use the same expectation
(e.g., both BeNil() or both Not(Equal(GetPtr(int32(0))))) or, if the difference
is intentional, add a clarifying By(...) message explaining why bmcVersion02 and
bmcVersion03 are expected to differ; locate the checks referencing bmcVersion02
and bmcVersion03 in the test and change the matcher or add the descriptive By
text accordingly.

---

Duplicate comments:
In `@internal/controller/bmcversionset_controller_test.go`:
- Line 515: Rename the local test variable retryCount to failedAutoRetryCount
and update all its uses accordingly (e.g., change GetPtr(int32(retryCount)) →
GetPtr(int32(failedAutoRetryCount))) so the name aligns with the struct field
FailedAutoRetryCount and clarifies this is the auto-retry budget in the test;
ensure all occurrences in internal/controller/bmcversionset_controller_test.go
that reference retryCount are updated to the new identifier.

---

Nitpick comments:
In `@internal/controller/bmcversion_controller_test.go`:
- Around line 408-411: The test uses too-tight timing in the Eventually polling
for bmcVersion.Status.AutoRetryCountRemaining; update the polling and
consistency windows to be more forgiving: in the Eventually call that wraps
Get(bmcVersion)() change WithPolling(1 * time.Millisecond) to something like
WithPolling(50 * time.Millisecond) (or 100ms) and where a 25ms consistency check
is used change it to ~200-500ms; keep referencing the same
Eventually(Get(bmcVersion)()) check and the
bmcVersion.Status.AutoRetryCountRemaining condition so the test waits longer and
reduces flakiness.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89d4080 and c76e18e.

⛔ Files ignored due to path filters (4)
  • dist/chart/templates/crd/metal.ironcore.dev_biossettings.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biosversions.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcsettings.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcversions.yaml is excluded by !**/dist/**
📒 Files selected for processing (19)
  • api/v1alpha1/biossettings_types.go
  • api/v1alpha1/biosversion_types.go
  • api/v1alpha1/bmcsettings_types.go
  • api/v1alpha1/bmcversion_types.go
  • cmd/main.go
  • config/crd/bases/metal.ironcore.dev_biossettings.yaml
  • config/crd/bases/metal.ironcore.dev_biosversions.yaml
  • config/crd/bases/metal.ironcore.dev_bmcsettings.yaml
  • config/crd/bases/metal.ironcore.dev_bmcversions.yaml
  • docs/api-reference/api.md
  • internal/controller/biossettings_controller.go
  • internal/controller/biossettings_controller_test.go
  • internal/controller/biosversion_controller.go
  • internal/controller/biosversion_controller_test.go
  • internal/controller/bmcsettings_controller.go
  • internal/controller/bmcsettings_controller_test.go
  • internal/controller/bmcversion_controller.go
  • internal/controller/bmcversion_controller_test.go
  • internal/controller/bmcversionset_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • api/v1alpha1/biosversion_types.go
  • internal/controller/biosversion_controller_test.go
  • api/v1alpha1/bmcversion_types.go
  • config/crd/bases/metal.ironcore.dev_biossettings.yaml
  • config/crd/bases/metal.ironcore.dev_bmcversions.yaml
  • api/v1alpha1/bmcsettings_types.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (1)
internal/controller/biossettings_controller_test.go (1)

887-895: ⚠️ Potential issue | 🟠 Major

Cleanup check is still globally scoped and can be flaky across specs.

Line 895 asserts the full ServerMaintenance list is empty, even though this test only deletes objects controlled by biosSettings. Keep the list/query scoped (namespace + owner) and assert owned resources are gone.

🔧 Proposed scope-safe cleanup assertion
+	"sigs.k8s.io/controller-runtime/pkg/client"
@@
-		Expect(k8sClient.List(ctx, &serverMaintenanceList)).To(Succeed())
-		for _, maintenance := range serverMaintenanceList.Items {
-			if metav1.IsControlledBy(&maintenance, biosSettings) {
-				Expect(k8sClient.Delete(ctx, &maintenance)).To(Succeed())
-			}
-		}
-		Eventually(ObjectList(&serverMaintenanceList)).Should(HaveField("Items", BeEmpty()))
+		Expect(k8sClient.List(ctx, &serverMaintenanceList, client.InNamespace(ns.Name))).To(Succeed())
+		for i := range serverMaintenanceList.Items {
+			maintenance := &serverMaintenanceList.Items[i]
+			if metav1.IsControlledBy(maintenance, biosSettings) {
+				Expect(k8sClient.Delete(ctx, maintenance)).To(Succeed())
+			}
+		}
+		Eventually(func(g Gomega) int {
+			g.Expect(k8sClient.List(ctx, &serverMaintenanceList, client.InNamespace(ns.Name))).To(Succeed())
+			owned := 0
+			for i := range serverMaintenanceList.Items {
+				if metav1.IsControlledBy(&serverMaintenanceList.Items[i], biosSettings) {
+					owned++
+				}
+			}
+			return owned
+		}).Should(Equal(0))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biossettings_controller_test.go` around lines 887 - 895,
The cleanup currently asserts the global ServerMaintenance list is empty which
is flaky; change the List and assertion to be scoped to the test's namespace and
only the resources owned by biosSettings: use k8sClient.List with a
namespace-scoped ListOptions (or client.InNamespace(biosSettings.Namespace)) to
fetch metalv1alpha1.ServerMaintenanceList, filter by
metav1.IsControlledBy(&maintenance, biosSettings) (or list by an owner/label
selector if available) and then assert that the filtered/owned slice is empty
(replace
Eventually(ObjectList(&serverMaintenanceList)).Should(HaveField("Items",
BeEmpty())) with an assertion that only the owned items are gone).
🧹 Nitpick comments (2)
internal/controller/biossettingsset_controller_test.go (1)

501-508: Avoid ultra-tight polling in Eventually to reduce flakiness.

At Line 501 and Line 507, 1ms/10ms polling is overly aggressive for envtest and can make CI timing-sensitive.

🔧 Suggested adjustment
-		Eventually(Object(biosSettings02)).WithPolling(1 * time.Millisecond).Should(SatisfyAll(
+		Eventually(Object(biosSettings02)).WithPolling(25 * time.Millisecond).Should(SatisfyAll(
 			HaveField("Status.State", Not(Equal(metalv1alpha1.BIOSSettingsStateFailed))),
 			HaveField("Status.AutoRetryCountRemaining", Not(Equal(GetPtr(int32(0))))),
 		))
@@
-		Eventually(Object(biosSettings03)).WithPolling(10 * time.Millisecond).Should(SatisfyAll(
+		Eventually(Object(biosSettings03)).WithPolling(25 * time.Millisecond).Should(SatisfyAll(
 			HaveField("Status.State", Not(Equal(metalv1alpha1.BIOSSettingsStateFailed))),
 			HaveField("Status.AutoRetryCountRemaining", Not(Equal(GetPtr(int32(0))))),
 		))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biossettingsset_controller_test.go` around lines 501 -
508, The test uses ultra-tight polling intervals in the Eventually checks for
biosSettings02 and biosSettings03 (calls to
Eventually(Object(biosSettings02)).WithPolling(1 * time.Millisecond) and
Eventually(Object(biosSettings03)).WithPolling(10 * time.Millisecond)), which
can cause flakiness; change those WithPolling durations to a more realistic
interval (e.g., 50–200ms) or remove WithPolling to use the default, so update
the two Eventually(...) invocations referencing biosSettings02 and
biosSettings03 accordingly.
internal/controller/bmcsettings_controller_test.go (1)

614-617: Increase polling interval in retry-start assertion to reduce flakiness.

At Line 617, 1 * time.Millisecond is very tight for envtest timing and can make this test unstable.

🔧 Suggested adjustment
-		}).WithPolling((1 * time.Millisecond)).Should(BeTrue())
+		}).WithPolling(25 * time.Millisecond).Should(BeTrue())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettings_controller_test.go` around lines 614 - 617,
The polling interval in the Eventually assertion that checks Get(bmcSettings)()
and bmcSettings.Status.AutoRetryCountRemaining is currently set to 1 *
time.Millisecond which is too tight; update the WithPolling call (the
Eventually(...).WithPolling((1 * time.Millisecond)).Should(BeTrue()) expression)
to a larger interval such as 50 * time.Millisecond or 100 * time.Millisecond to
reduce flakiness in envtest, leaving the rest of the Eventually block (Get,
bmcSettings, and the boolean check) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/api-reference/api.md`:
- Line 322: The documentation uses "retry applying BIOSVersion/BMCVersion" which
is imprecise; update all descriptions referencing failedAutoRetryCount for
BIOSVersion and BMCVersion (and the same instances at the listed locations) to
use upgrade terminology such as "retry firmware upgrade for
BIOSVersion/BMCVersion" or "retry firmware upgrade operation" and ensure the
field name `failedAutoRetryCount` is described as "number of automatic firmware
upgrade retries before giving up" for both BIOSVersion and BMCVersion entries
(refer to `failedAutoRetryCount`, `BIOSVersion`, and `BMCVersion` to locate and
update the text).

In `@internal/controller/biossettings_controller.go`:
- Around line 1044-1054: The code currently skips calling r.Conditions.Update
when retryCondition.Status is already metav1.ConditionTrue, which leaves
retryCondition.Message stale and can cause repeated retries; modify the logic
around retryCondition and r.Conditions.Update in the BIOSSettings reconciliation
so the Update call is performed (or at least the metadata fields are refreshed)
regardless of the existing Status value — e.g., remove the conditional branch
that prevents calling r.Conditions.Update when retryCondition.Status ==
metav1.ConditionTrue and always invoke r.Conditions.Update(retryCondition,
conditionutils.UpdateStatus(...),
conditionutils.UpdateReason(RetryOfFailedResourceReasonIssued),
conditionutils.UpdateMessage(annotations[metalv1alpha1.OperationAnnotation]))
before assigning settings.Status.Conditions =
[]metav1.Condition{*retryCondition} so the Message/Reason are kept up-to-date
idempotently.

In `@internal/controller/biosversion_controller_test.go`:
- Around line 431-439: The test currently deletes only ServerMaintenance items
controlled by biosVersion but then globally asserts that
serverMaintenanceList.Items is empty, causing flakes if other ServerMaintenance
objects exist; change the final assertion to verify that no remaining items are
controlled by biosVersion instead of asserting the entire list is empty — locate
the block using serverMaintenanceList, metav1.IsControlledBy and biosVersion and
replace the
Eventually(ObjectList(&serverMaintenanceList)).Should(HaveField("Items",
BeEmpty())) check with an assertion that filters serverMaintenanceList.Items for
metav1.IsControlledBy(&item, biosVersion) and expects that filtered slice to be
empty (or Eventually the filtered result to be empty) using the existing
ObjectList/Eventually pattern.

In `@internal/controller/biosversion_controller.go`:
- Around line 466-477: The code determines effective retryCount from
biosVersion.Spec.FailedAutoRetryCount and r.DefaultFailedAutoRetryCount but
skips status mutation when retryCount == 0; add normalization so that when
effective retryCount resolves to 0 you explicitly set
biosVersion.Status.AutoRetryCountRemaining = 0 and persist that change (status
patch/update) so stale positive remaining counts are cleared; update the logic
near the retryCount handling in the reconciler (use the same method that patches
status elsewhere in this file) and apply the same fix at the second occurrence
referenced around the 513-515 area.

In `@internal/controller/biosversionset_controller_test.go`:
- Around line 563-568: After deleting the parent biosVersionSet, the subsequent
direct Delete calls for biosVersion02 and biosVersion03 can race with GC and
return NotFound; update the cleanup to tolerate missing children by checking for
apierrors.IsNotFound and treating that as success (or skip Delete if Get reports
NotFound) instead of failing the test — modify the code around
k8sClient.Delete(ctx, biosVersion02)/biosVersion03 to either call Delete and
ignore NotFound errors (using apierrors.IsNotFound) or perform a Get first and
only Delete if present, and keep the Existing
Eventually(Get(...)).Should(Satisfy(apierrors.IsNotFound)) assertions to
validate removal.

In `@internal/controller/bmcsettings_controller.go`:
- Around line 720-730: The code only updates retryCondition when Status !=
metav1.ConditionTrue, which leaves stale reason/message/annotation data; always
normalize the retry condition metadata by invoking r.Conditions.Update for
retryCondition when the existing Reason or Message differs (or unconditionally)
so the condition's Reason (RetryOfFailedResourceReasonIssued) and Message
(annotations[metalv1alpha1.OperationAnnotation]) are kept current; adjust the
logic around retryCondition, r.Conditions.Update, and
conditionutils.UpdateMessage/UpdateReason so the update occurs even if Status is
already True, and then set bmcSetting.Status.Conditions =
[]metav1.Condition{*retryCondition} as before.

---

Duplicate comments:
In `@internal/controller/biossettings_controller_test.go`:
- Around line 887-895: The cleanup currently asserts the global
ServerMaintenance list is empty which is flaky; change the List and assertion to
be scoped to the test's namespace and only the resources owned by biosSettings:
use k8sClient.List with a namespace-scoped ListOptions (or
client.InNamespace(biosSettings.Namespace)) to fetch
metalv1alpha1.ServerMaintenanceList, filter by
metav1.IsControlledBy(&maintenance, biosSettings) (or list by an owner/label
selector if available) and then assert that the filtered/owned slice is empty
(replace
Eventually(ObjectList(&serverMaintenanceList)).Should(HaveField("Items",
BeEmpty())) with an assertion that only the owned items are gone).

---

Nitpick comments:
In `@internal/controller/biossettingsset_controller_test.go`:
- Around line 501-508: The test uses ultra-tight polling intervals in the
Eventually checks for biosSettings02 and biosSettings03 (calls to
Eventually(Object(biosSettings02)).WithPolling(1 * time.Millisecond) and
Eventually(Object(biosSettings03)).WithPolling(10 * time.Millisecond)), which
can cause flakiness; change those WithPolling durations to a more realistic
interval (e.g., 50–200ms) or remove WithPolling to use the default, so update
the two Eventually(...) invocations referencing biosSettings02 and
biosSettings03 accordingly.

In `@internal/controller/bmcsettings_controller_test.go`:
- Around line 614-617: The polling interval in the Eventually assertion that
checks Get(bmcSettings)() and bmcSettings.Status.AutoRetryCountRemaining is
currently set to 1 * time.Millisecond which is too tight; update the WithPolling
call (the Eventually(...).WithPolling((1 * time.Millisecond)).Should(BeTrue())
expression) to a larger interval such as 50 * time.Millisecond or 100 *
time.Millisecond to reduce flakiness in envtest, leaving the rest of the
Eventually block (Get, bmcSettings, and the boolean check) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6bfdc7b5-96da-4812-b3af-3f29d2ddc9f9

📥 Commits

Reviewing files that changed from the base of the PR and between c76e18e and 80c3fe3.

⛔ Files ignored due to path filters (8)
  • dist/chart/templates/crd/metal.ironcore.dev_biossettings.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biossettingssets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biosversions.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_biosversionsets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcsettings.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcsettingssets.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcversions.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_bmcversionsets.yaml is excluded by !**/dist/**
📒 Files selected for processing (24)
  • api/v1alpha1/biossettings_types.go
  • api/v1alpha1/biosversion_types.go
  • api/v1alpha1/bmcsettings_types.go
  • api/v1alpha1/bmcversion_types.go
  • config/crd/bases/metal.ironcore.dev_biossettings.yaml
  • config/crd/bases/metal.ironcore.dev_biossettingssets.yaml
  • config/crd/bases/metal.ironcore.dev_biosversions.yaml
  • config/crd/bases/metal.ironcore.dev_biosversionsets.yaml
  • config/crd/bases/metal.ironcore.dev_bmcsettings.yaml
  • config/crd/bases/metal.ironcore.dev_bmcsettingssets.yaml
  • config/crd/bases/metal.ironcore.dev_bmcversions.yaml
  • config/crd/bases/metal.ironcore.dev_bmcversionsets.yaml
  • docs/api-reference/api.md
  • internal/controller/biossettings_controller.go
  • internal/controller/biossettings_controller_test.go
  • internal/controller/biossettingsset_controller_test.go
  • internal/controller/biosversion_controller.go
  • internal/controller/biosversion_controller_test.go
  • internal/controller/biosversionset_controller_test.go
  • internal/controller/bmcsettings_controller.go
  • internal/controller/bmcsettings_controller_test.go
  • internal/controller/bmcsettingsset_controller_test.go
  • internal/controller/bmcversion_controller_test.go
  • internal/controller/bmcversionset_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • internal/controller/bmcversionset_controller_test.go
  • api/v1alpha1/biosversion_types.go
  • config/crd/bases/metal.ironcore.dev_biosversionsets.yaml
  • internal/controller/bmcsettingsset_controller_test.go
  • api/v1alpha1/bmcversion_types.go

| `updatePolicy` _[UpdatePolicy](#updatepolicy)_ | UpdatePolicy indicates whether the server's upgrade service should bypass vendor update policies. | | |
| `image` _[ImageSpec](#imagespec)_ | Image specifies the image to use to upgrade to the given BIOS version. | | |
| `serverMaintenancePolicy` _[ServerMaintenancePolicy](#servermaintenancepolicy)_ | ServerMaintenancePolicy is a maintenance policy to be enforced on the server. | | |
| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BIOSVersion in case of failure before giving up. | | Minimum: 0 <br /> |
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use upgrade terminology for BIOSVersion/BMCVersion retry descriptions.

These version resources perform firmware upgrades; “retry applying BIOSVersion/BMCVersion” is awkward and less precise for API consumers.

🔧 Suggested wording
-| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BIOSVersion in case of failure before giving up. |  | Minimum: 0 <br /> |
+| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry the BIOS upgrade in case of failure before giving up. |  | Minimum: 0 <br /> |

-| `autoRetryCountRemaining` _integer_ | AutoRetryCountRemaining is the number of remaining times the controller will automatically retry applying BIOSVersion in case of failure before giving up. |  |  |
+| `autoRetryCountRemaining` _integer_ | AutoRetryCountRemaining is the number of remaining times the controller will automatically retry the BIOS upgrade in case of failure before giving up. |  |  |

-| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BMCVersion in case of failure before giving up. |  | Minimum: 0 <br /> |
+| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry the BMC upgrade in case of failure before giving up. |  | Minimum: 0 <br /> |

-| `autoRetryCountRemaining` _integer_ | AutoRetryCountRemaining is the number of remaining times the controller will automatically retry applying BMCVersion in case of failure before giving up. |  |  |
+| `autoRetryCountRemaining` _integer_ | AutoRetryCountRemaining is the number of remaining times the controller will automatically retry the BMC upgrade in case of failure before giving up. |  |  |

Also applies to: 361-361, 384-384, 839-839, 879-879, 901-901

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/api-reference/api.md` at line 322, The documentation uses "retry
applying BIOSVersion/BMCVersion" which is imprecise; update all descriptions
referencing failedAutoRetryCount for BIOSVersion and BMCVersion (and the same
instances at the listed locations) to use upgrade terminology such as "retry
firmware upgrade for BIOSVersion/BMCVersion" or "retry firmware upgrade
operation" and ensure the field name `failedAutoRetryCount` is described as
"number of automatic firmware upgrade retries before giving up" for both
BIOSVersion and BMCVersion entries (refer to `failedAutoRetryCount`,
`BIOSVersion`, and `BMCVersion` to locate and update the text).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always include the crd names and keep the overall flow but could add the upgrade keyword for those operations, e.g. like this:

Current Improved
retry applying BIOSVersion retry applying BIOSVersion upgrade
retry applying BMCVersion retry applying BMCVersion upgrade
retry applying BIOSSettings
retry applying BMCSettings

what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xkonni, that's a great compromise! Keeping the CRD names provides clear API consumer context while adding upgrade removes the ambiguity. Here's the updated diff applying that approach to all affected entries:

-| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BIOSVersion in case of failure before giving up. |  | Minimum: 0 <br /> |
+| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BIOSVersion upgrade in case of failure before giving up. |  | Minimum: 0 <br /> |

-| `autoRetryCountRemaining` _integer_ | AutoRetryCountRemaining is the number of remaining times the controller will automatically retry applying BIOSVersion in case of failure before giving up. |  |  |
+| `autoRetryCountRemaining` _integer_ | AutoRetryCountRemaining is the number of remaining times the controller will automatically retry applying BIOSVersion upgrade in case of failure before giving up. |  |  |

-| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BIOSVersion in case of failure before giving up. |  | Minimum: 0 <br /> |
+| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BIOSVersion upgrade in case of failure before giving up. |  | Minimum: 0 <br /> |

-| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BMCVersion in case of failure before giving up. |  | Minimum: 0 <br /> |
+| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BMCVersion upgrade in case of failure before giving up. |  | Minimum: 0 <br /> |

-| `autoRetryCountRemaining` _integer_ | AutoRetryCountRemaining is the number of remaining times the controller will automatically retry applying BMCVersion in case of failure before giving up. |  |  |
+| `autoRetryCountRemaining` _integer_ | AutoRetryCountRemaining is the number of remaining times the controller will automatically retry applying BMCVersion upgrade in case of failure before giving up. |  |  |

-| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BMCVersion in case of failure before giving up. |  | Minimum: 0 <br /> |
+| `failedAutoRetryCount` _integer_ | FailedAutoRetryCount is the number of times the controller should automatically retry applying BMCVersion upgrade in case of failure before giving up. |  | Minimum: 0 <br /> |

Note that since docs/api-reference/api.md is typically auto-generated (e.g. via make generate), you may also want to update the source Go struct field comments in api/v1alpha1/biosversion_types.go and api/v1alpha1/bmcversion_types.go so the change persists across regenerations.


🧠 Learnings used
Learnt from: nagadeesh-nagaraja
Repo: ironcore-dev/metal-operator PR: 695
File: internal/controller/bmcsettings_controller.go:720-730
Timestamp: 2026-03-04T10:32:24.468Z
Learning: In `internal/controller/bmcsettings_controller.go` (and sibling controllers), the `RetryOfFailedResourceConditionIssued` condition is intentionally left unchanged when its Status is already `metav1.ConditionTrue`. The guard `if retryCondition.Status != metav1.ConditionTrue` is by design — existing condition metadata (reason/message) is preserved on re-entrant reconciliations to avoid redundant updates.

@github-actions github-actions bot added size/XXL and removed size/XL labels Mar 4, 2026
Copy link
Copy Markdown
Contributor

@xkonni xkonni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we need that. I like implemented the hybrid approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change area/metal-automation documentation Improvements or additions to documentation size/XL

Projects

Status: No status

4 participants