Enhanced node reconciliation by automating maintenance CR lifecycle#157
Conversation
Note: Reviews paused. It looks like this branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the review settings.
ℹ️ Recent review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
📒 Files selected for processing (15)
Walkthrough
Implements ServerMaintenance lifecycle and approval sync in the Node reconciler; refactors providerID building/parsing to URL form; extends PodCIDR reconciliation; updates tests for maintenance and providerID behavior; upgrades module dependencies; and adds local dev infrastructure (Kind manifests, Tilt workflow, Makefile targets, and .gitignore entries).
Sequence Diagram

```mermaid
sequenceDiagram
    participant Reconciler as Node Reconciler
    participant Node as Node Resource
    participant SC as ServerClaim
    participant SM as ServerMaintenance
    rect rgba(100,150,200,0.5)
        Note over Reconciler: Reconciliation triggered
        Reconciler->>Node: Fetch Node, check DeletionTimestamp
        alt Node Deleted
            Reconciler->>SM: ensureServerMaintenanceNotExists (delete CR)
            Reconciler->>Node: remove finalizer
        else Node Active
            Node->>Reconciler: providerID, maintenance label
            alt Maintenance Requested
                Reconciler->>SC: lookup ServerClaim by providerID
                Reconciler->>SM: ensureServerMaintenanceExists (CreateOrPatch)
                Reconciler->>Node: ensure finalizer present
                Reconciler->>SC: syncServerClaimApproval (label/approval)
            else No Maintenance
                Reconciler->>SM: ensureServerMaintenanceNotExists
                Reconciler->>Node: remove finalizer if present
            end
            Reconciler->>Node: reconcilePodCIDR (zeroHostBits -> patch PodCIDR/PodCIDRs)
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Around line 12-17: The go.mod currently mixes Kubernetes modules at v0.35.0
and v0.34.1 which can cause API/ABI skew; update the mismatched entries
(k8s.io/cloud-provider and k8s.io/controller-manager) to the same minor version
as the rest (e.g., v0.35.0) so all k8s modules (k8s.io/api, k8s.io/apimachinery,
k8s.io/client-go, k8s.io/cloud-provider, k8s.io/component-base,
k8s.io/controller-manager) share v0.35.x, then refresh dependencies (via go
get/go mod tidy) to ensure module graph consistency.
In `@pkg/cloudprovider/metal/node_maintenance_controller.go`:
- Around line 181-196: The parseProviderID function currently allows empty
namespace or name when providerID like "metal://ns/" or "metal:///name" is
passed; update parseProviderID to validate that after splitting rest into parts
the two segments parts[0] (namespace) and parts[1] (name) are non-empty and
return a clear error (e.g. "empty namespace or name") if either is empty; keep
returning types.NamespacedName{Namespace: parts[0], Name: parts[1]} only when
both segments are non-empty so callers of parseProviderID get a deterministic,
well-validated NamespacedName.
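The validation described above can be sketched as a small, self-contained parser. This is an illustrative reconstruction, not the repository's actual implementation: the function name and the `metal` scheme follow the review comments, while the exact error wording and the inlined `NamespacedName` type are assumptions for the sake of a runnable example.

```go
package main

import (
	"fmt"
	"strings"
)

// NamespacedName mirrors k8s.io/apimachinery's types.NamespacedName,
// inlined here so the sketch is self-contained.
type NamespacedName struct {
	Namespace string
	Name      string
}

// parseProviderID parses "metal://<namespace>/<name>", rejecting
// unsupported schemes as well as empty namespace or name segments,
// so callers always get a well-validated NamespacedName.
func parseProviderID(providerID string) (NamespacedName, error) {
	scheme, rest, ok := strings.Cut(providerID, "://")
	if !ok {
		return NamespacedName{}, fmt.Errorf("invalid providerID %q: missing scheme", providerID)
	}
	if scheme != "metal" {
		return NamespacedName{}, fmt.Errorf("unsupported provider scheme %q", scheme)
	}
	parts := strings.Split(rest, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return NamespacedName{}, fmt.Errorf("invalid providerID %q: empty namespace or name", providerID)
	}
	return NamespacedName{Namespace: parts[0], Name: parts[1]}, nil
}

func main() {
	for _, id := range []string{
		"metal://default/node-1", // valid
		"metal://ns/",            // empty name -> error
		"metal:///name",          // empty namespace -> error
		"aws://default/node-1",   // unsupported scheme -> error
	} {
		nn, err := parseProviderID(id)
		fmt.Printf("%-26s -> %+v err=%v\n", id, nn, err)
	}
}
```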
- Around line 116-123: Reconcile currently returns on Node NotFound before
cleaning up a ServerMaintenance CR, which can orphan the CR; fix by ensuring
cleanup runs on Node deletion — either add a Kubernetes finalizer on Node in
NodeMaintenanceReconciler.Reconcile (add finalizer when maintenance is
requested, detect deletionTimestamp, perform ServerMaintenance deletion/cleanup
using the providerID and targetClient, then remove finalizer) or implement a
DeleteFunc event handler that reads the tombstone to extract providerID and
proactively delete the corresponding ServerMaintenance CR via r.targetClient;
update NodeMaintenanceReconciler methods that reference providerID,
targetClient, and DeleteFunc to perform the delete logic and remove the
finalizer only after successful cleanup.
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
- go.mod
- pkg/cloudprovider/metal/cloud.go
- pkg/cloudprovider/metal/node_controller.go
- pkg/cloudprovider/metal/node_controller_test.go
- pkg/cloudprovider/metal/node_maintenance_controller.go
- pkg/cloudprovider/metal/node_maintenance_controller_test.go
💤 Files with no reviewable changes (2)
- pkg/cloudprovider/metal/node_controller_test.go
- pkg/cloudprovider/metal/node_controller.go
Force-pushed: dc30c2c to ceac43a
```go
		r.queue.Add(key)
	},
})
defer r.queue.ShutDown()
```
Would it make sense to move the defer after the error check?
And I see that other controllers follow the same pattern - is that an issue?
Actually, we are creating the queue in the reconciler constructor NewNodeMaintenanceReconciler, which already allocates some resources:
https://github.com/kubernetes/client-go/blob/master/util/workqueue/delaying_queue.go#L157
so we want to shut it down anyway. It would be even better to move this defer to the top of the Start func.
In general I would try to avoid adding finalizers on Node objects, as it might block machinery on top that manages nodes (Gardener, Cluster-API). We need to think about what happens if a rolling update is triggered and how we should react to it.
Update after offline discussion: finalizers on Node objects should not be an issue here.
@coderabbitai do you think it is a good idea to split the reconciliation logic into two
Recommended Approach: merge into a single reconciler.

```go
func (r *NodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) error {
	node := &corev1.Node{}
	if err := r.targetClient.Get(ctx, req.NamespacedName, node); err != nil {
		return client.IgnoreNotFound(err)
	}
	// Handle deletion
	if !node.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, node)
	}
	// Reconcile PodCIDR
	if err := r.reconcilePodCIDR(ctx, node); err != nil {
		return fmt.Errorf("failed to reconcile PodCIDR: %w", err)
	}
	// Reconcile maintenance
	if err := r.reconcileMaintenance(ctx, node); err != nil {
		return fmt.Errorf("failed to reconcile maintenance: %w", err)
	}
	return nil
}
```

This provides:
Note: The finalizer concern raised by
Actionable comments posted: 7
🧹 Nitpick comments (2)
config/kind/manager-patch.yaml (1)
10-29: Harden manager container security context.

Line 10 onward defines the container without explicit hardening controls (allowPrivilegeEscalation, dropped capabilities, seccomp). This is a security posture gap even for dev clusters.

Proposed patch:

```diff
 spec:
   strategy:
     type: Recreate
   template:
     spec:
+      securityContext:
+        seccompProfile:
+          type: RuntimeDefault
       containers:
         - name: manager
           image: controller
           imagePullPolicy: IfNotPresent
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - ALL
           args:
             - --cloud-provider=metal
             - --cloud-config=/etc/kubernetes/cloud-config/cloud-config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/kind/manager-patch.yaml` around lines 10 - 29, Add a Pod/Container securityContext to harden the "manager" container: in the container spec for name "manager" add securityContext with allowPrivilegeEscalation: false, runAsNonRoot: true (and set runAsUser to a non-root uid like 1000), and capabilities.drop: ["ALL"]; also add a pod or container seccompProfile with type: RuntimeDefault; ensure these fields are applied alongside the existing volumeMount for "cloud-config" and keep secretName "metal-cloud-config" unchanged.

Makefile (1)
70-73: Make kind-setup idempotent for repeatable local workflows.

Lines 72-73 fail on reruns if clusters already exist. Guarding creation improves day-to-day developer UX.
Proposed patch
```diff
 .PHONY: kind-setup
 kind-setup:
-	kind create cluster --name mgmt
-	kind create cluster --name worker
+	kind get clusters | grep -qx mgmt || kind create cluster --name mgmt
+	kind get clusters | grep -qx worker || kind create cluster --name worker
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 70 - 73, The kind-setup target fails on reruns because it unconditionally runs "kind create cluster --name mgmt" and "--name worker"; make it idempotent by checking for each cluster's existence before creating it (e.g., use "kind get clusters" and only call kind create when the named cluster is absent). Update the kind-setup recipe to guard the two creations (for "mgmt" and "worker") so repeated runs skip creation if the cluster already exists.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Makefile`:
- Around line 81-84: The tilt-up recipe currently sets KUBECONFIG on one shell
line then runs `tilt up` in a new shell and also references a non-existent
`mgmt-kubeconfig`; fix by setting the env var inline with the command and point
it at the files produced by the export-kubeconfig target (e.g.
`mgmt-kubeconfig-external` and the worker kubeconfig). Update the `tilt-up`
recipe so the single command uses:
KUBECONFIG=./config/kind/mgmt-kubeconfig-external:./config/kind/worker-kubeconfig
tilt up (ensure `export-kubeconfig` still runs as a prerequisite).
In `@pkg/cloudprovider/metal/node_controller_test.go`:
- Around line 251-253: The test currently uses
Consistently(Object(node)).Should(HaveField("Finalizers",
ContainElement(nodeMaintenanceFinalizer))) which races with asynchronous
reconciliation; change the assertion to first wait for the finalizer to appear
by calling Eventually(Object(node)).Should(HaveField("Finalizers",
ContainElement(nodeMaintenanceFinalizer))) and only then use
Consistently(Object(node)).Should(...) to assert it remains present. Locate the
occurrence referencing node and nodeMaintenanceFinalizer and wrap the initial
existence check with Eventually before the Consistently stability check.
- Around line 223-235: The test fixture creates a ServerMaintenance object
(existingCR) with Finalizers: []string{nodeMaintenanceFinalizer} which is
incorrect because nodeMaintenanceFinalizer is Node-specific; remove the
Finalizers field (or ensure Finalizers is empty) from the ServerMaintenance
object creation in the idempotency test so existingCR mirrors production
behavior (ServerMaintenance objects created without finalizers) and the test
remains focused on verifying the controller leaves existing CR fields untouched.
In `@pkg/cloudprovider/metal/node_controller.go`:
- Around line 202-214: The deletion in ensureServerMaintenanceNotExists
currently removes ServerMaintenance by name/namespace unconditionally; change it
to first GET the ServerMaintenance (using r.metalClient.Get into the same
maintenance object), check its labels for "app.kubernetes.io/managed-by" and
only call r.metalClient.Delete if that label matches the controller's manager
identifier (the value you set when creating maintenance objects in the
reconciler), otherwise return nil (or client.IgnoreNotFound(nil)). Keep existing
error handling using client.IgnoreNotFound(err) for both Get and Delete to avoid
spurious failures; reference function ensureServerMaintenanceNotExists and the
"app.kubernetes.io/managed-by" label in your changes.
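The ownership check before deletion boils down to a label comparison. A self-contained sketch under the review's assumptions: the label key `app.kubernetes.io/managed-by` comes from the comment, while the manager identifier value and the helper name are hypothetical.

```go
package main

import "fmt"

const (
	labelKeyManagedBy      = "app.kubernetes.io/managed-by"
	cloudProviderMetalName = "cloud-provider-metal" // assumed manager identifier
)

// shouldDeleteMaintenance reports whether a ServerMaintenance object,
// identified here only by its labels, was created by this controller
// and may therefore be deleted. Objects without the label, or labelled
// with a different manager, are left alone.
func shouldDeleteMaintenance(labels map[string]string) bool {
	return labels[labelKeyManagedBy] == cloudProviderMetalName
}

func main() {
	fmt.Println(shouldDeleteMaintenance(map[string]string{labelKeyManagedBy: cloudProviderMetalName})) // ours: delete
	fmt.Println(shouldDeleteMaintenance(map[string]string{labelKeyManagedBy: "someone-else"}))          // foreign: keep
	fmt.Println(shouldDeleteMaintenance(nil))                                                           // unlabeled: keep
}
```

In the real controller this check would sit between the Get and the Delete call, with NotFound errors from either step ignored, as the comment describes.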
- Around line 133-147: Move the deletion-check above the strict providerID
validation so deleting Nodes never skip delete reconciliation: check
node.DeletionTimestamp.IsZero() before inspecting node.Spec.ProviderID or
calling parseProviderID, and if the node is deleting call and return
r.reconcileDelete(ctx, node, /*serverClaimKey may be nil/zero*/ ) (or construct
a minimal serverClaimKey if reconcileDelete requires it) so finalizer cleanup
runs even when providerID is empty or malformed; then proceed with the existing
providerID empty check, parseProviderID call, and subsequent logging
(references: node.Spec.ProviderID, node.DeletionTimestamp, parseProviderID,
r.reconcileDelete, l.Info).
In `@Tiltfile`:
- Around line 15-17: The Tiltfile currently calls mgmt_kubectl with CRD URLs
pointing at the moving "main" branch; update each mgmt_kubectl invocation (the
three calls shown) to use the immutable release tag v0.3.0 in the URLs (replace
"main" with "v0.3.0") so the CRDs are pinned to a stable, reproducible version.
- Around line 15-20: The sequence applies CRD manifests then creates CRs too
quickly; update the Tiltfile to wait for each CRD to reach the Established
condition before applying the Server and ServerClaim CRs by invoking
mgmt_kubectl wait for each CRD (e.g., metal.ironcore.dev_serverclaims,
metal.ironcore.dev_servermaintenances, metal.ironcore.dev_servers) with
--for=condition=Established --timeout=<reasonable-duration> after each
mgmt_kubectl('apply -f ...yaml') call, then proceed to apply
config/kind/crs/server.yaml and config/kind/crs/serverclaim.yaml only after
those waits complete.
---
Nitpick comments:
In `@config/kind/manager-patch.yaml`:
- Around line 10-29: Add a Pod/Container securityContext to harden the "manager"
container: in the container spec for name "manager" add securityContext with
allowPrivilegeEscalation: false, runAsNonRoot: true (and set runAsUser to a
non-root uid like 1000), and capabilities.drop: ["ALL"]; also add a pod or
container seccompProfile with type: RuntimeDefault; ensure these fields are
applied alongside the existing volumeMount for "cloud-config" and keep
secretName "metal-cloud-config" unchanged.
In `@Makefile`:
- Around line 70-73: The kind-setup target fails on reruns because it
unconditionally runs "kind create cluster --name mgmt" and "--name worker"; make
it idempotent by checking for each cluster's existence before creating it (e.g.,
use "kind get clusters" and only call kind create when the named cluster is
absent). Update the kind-setup recipe to guard the two creations (for "mgmt" and
"worker") so repeated runs skip creation if the cluster already exists.
📒 Files selected for processing (13)
- .gitignore
- Makefile
- Tiltfile
- Tiltfile.license
- config/kind/cloud-config.yaml
- config/kind/crs/node.yaml
- config/kind/crs/server.yaml
- config/kind/crs/serverclaim.yaml
- config/kind/kustomization.yaml
- config/kind/manager-patch.yaml
- go.mod
- pkg/cloudprovider/metal/node_controller.go
- pkg/cloudprovider/metal/node_controller_test.go
✅ Files skipped from review due to trivial changes (4)
- config/kind/kustomization.yaml
- Tiltfile.license
- config/kind/crs/node.yaml
- config/kind/crs/serverclaim.yaml
Force-pushed: f422a1a to 6df53da
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/cloudprovider/metal/node_controller_test.go (1)
352-387: Consider moving this test inside the "ServerMaintenance CR Lifecycle" context.

This test is logically related to the ServerMaintenance CR lifecycle tests but is currently placed at the top level of the Describe block. Moving it inside the Context("ServerMaintenance CR Lifecycle", ...) block would improve test organization and make the relationship clearer.

Suggested reorganization: move lines 352-387 inside the Context("ServerMaintenance CR Lifecycle", func() { ... }) block that ends at line 350.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cloudprovider/metal/node_controller_test.go` around lines 352 - 387, Move the It test titled "should NOT delete the ServerMaintenance CR if it is not managed by the controller" into the existing Context("ServerMaintenance CR Lifecycle", func() { ... }) block so it is scoped with the related ServerMaintenance lifecycle tests; locate the It block (search for the exact spec string) and cut/paste its entire body into the ServerMaintenance CR Lifecycle context (near other It cases), ensuring imports/variables used by the test (node, serverClaim, k8sClient, labelKeyManagedBy, serverMaintenancePriority) remain in scope and update any DeferCleanup/patch context if necessary.

Tiltfile (1)
28-32: Consider making architecture configurable for cross-platform development.

The hardcoded GOARCH=amd64 will prevent developers on ARM machines (Apple Silicon Macs) from using Tilt for local development without manual modification.

♻️ Suggested improvement:

```diff
+ARCH = str(local('go env GOARCH', quiet=True)).strip()
+
 local_resource(
     "manager-binary",
-    cmd = 'mkdir -p .tiltbuild; CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o .tiltbuild/manager ./cmd/metal-cloud-controller-manager/main.go',
+    cmd = 'mkdir -p .tiltbuild; CGO_ENABLED=0 GOOS=linux GOARCH=' + ARCH + ' go build -o .tiltbuild/manager ./cmd/metal-cloud-controller-manager/main.go',
     deps = ["pkg", "cmd", "go.mod", "go.sum"]
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tiltfile` around lines 28 - 32, The build command in the local_resource "manager-binary" hardcodes GOARCH=amd64; change it to read GOARCH from the environment with a default (e.g., use getenv("GOARCH") or similar in the Tiltfile) so Apple Silicon/ARM devs can override it; update the cmd string for manager-binary to inject the resolved GOARCH value (keep CGO_ENABLED=0 and GOOS=linux) and document that developers can set GOARCH in their shell or Tilt environment to arm64 when needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Line 7: The Tiltfile and go.mod are out of sync: go.mod depends on
github.com/ironcore-dev/metal-operator at pseudo-version
v0.3.1-0.20260220142141-5dabaca4f7d7 while the Tiltfile uses METAL_OPERATOR_REF
set to v0.3.0; pick one source of truth and align them—either update
METAL_OPERATOR_REF in the Tiltfile to the exact pseudo-version/commit string
used in go.mod (the commit suffix 5dabaca4f7d7) or change the go.mod dependency
to the stable tag v0.3.0 so both references match.
In `@pkg/cloudprovider/metal/node_controller.go`:
- Around line 270-273: The reconcile should not return an error when
node.Spec.ProviderID is empty or unparseable because that triggers a requeue
storm; change the code around the parseProviderID call so you first check if
node.Spec.ProviderID is empty and, if so, log/debug and return nil (no error),
and if parseProviderID(node.Spec.ProviderID) returns an error, treat it as
non-fatal: log the parse failure with context (including node.Name and
ProviderID) and return nil instead of fmt.Errorf, ensuring any later use of
serverClaimKey only happens when parse succeeds.
- Around line 334-357: The create/patch path in ensureServerMaintenanceExists
uses controllerutil.CreateOrPatch which will mutate any existing
ServerMaintenance with the same key; update the mutate closure (or pre-check) to
detect an existing object that is not managed by us and bail instead of taking
it over: inside the CreateOrPatch mutate func in ensureServerMaintenanceExists
(or by doing a Get before calling CreateOrPatch) examine
maintenance.Labels[labelKeyManagedBy] and if it exists and !=
cloudProviderMetalName return a non-nil error to stop patching so we don’t adopt
unmanaged ServerMaintenance objects. Ensure you still set labels/Spec when the
object is absent or already managed by cloudProviderMetalName.
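The adopt-guard inside the mutate closure reduces to: refuse to touch an object whose managed-by label names a different controller, and only otherwise set our labels and spec fields. A self-contained sketch — the `maintenance` struct, the manager identifier value, and the `Reason` field are stand-ins for the metalv1alpha1 types named in the comment, not the real API.

```go
package main

import "fmt"

const (
	labelKeyManagedBy      = "app.kubernetes.io/managed-by"
	cloudProviderMetalName = "cloud-provider-metal" // assumed manager identifier
)

// maintenance is a stand-in for metalv1alpha1.ServerMaintenance.
type maintenance struct {
	Labels map[string]string
	Reason string
}

// mutate mirrors the closure passed to controllerutil.CreateOrPatch:
// it bails out with an error when the existing object is managed by a
// different controller, and only then sets labels and spec fields, so
// unmanaged ServerMaintenance objects are never adopted.
func mutate(m *maintenance) error {
	if owner, ok := m.Labels[labelKeyManagedBy]; ok && owner != cloudProviderMetalName {
		return fmt.Errorf("refusing to adopt ServerMaintenance managed by %q", owner)
	}
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	m.Labels[labelKeyManagedBy] = cloudProviderMetalName
	m.Reason = "node maintenance requested"
	return nil
}

func main() {
	fresh := &maintenance{}
	fmt.Println("fresh:", mutate(fresh), fresh.Labels)

	foreign := &maintenance{Labels: map[string]string{labelKeyManagedBy: "other-operator"}}
	fmt.Println("foreign:", mutate(foreign))
}
```

Returning an error from the mutate function causes CreateOrPatch to abort without writing, which is exactly the "bail instead of taking it over" behavior the comment asks for.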
---
Nitpick comments:
In `@pkg/cloudprovider/metal/node_controller_test.go`:
- Around line 352-387: Move the It test titled "should NOT delete the
ServerMaintenance CR if it is not managed by the controller" into the existing
Context("ServerMaintenance CR Lifecycle", func() { ... }) block so it is scoped
with the related ServerMaintenance lifecycle tests; locate the It block (search
for the exact spec string) and cut/paste its entire body into the
ServerMaintenance CR Lifecycle context (near other It cases), ensuring
imports/variables used by the test (node, serverClaim, k8sClient,
labelKeyManagedBy, serverMaintenancePriority) remain in scope and update any
DeferCleanup/patch context if necessary.
In `@Tiltfile`:
- Around line 28-32: The build command in the local_resource "manager-binary"
hardcodes GOARCH=amd64; change it to read GOARCH from the environment with a
default (e.g., use getenv("GOARCH") or similar in the Tiltfile) so Apple
Silicon/ARM devs can override it; update the cmd string for manager-binary to
inject the resolved GOARCH value (keep CGO_ENABLED=0 and GOOS=linux) and
document that developers can set GOARCH in their shell or Tilt environment to
arm64 when needed.
📒 Files selected for processing (5)
- Makefile
- Tiltfile
- go.mod
- pkg/cloudprovider/metal/node_controller.go
- pkg/cloudprovider/metal/node_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- Makefile
Actionable comments posted: 3
♻️ Duplicate comments (1)
Tiltfile (1)
15-18: ⚠️ Potential issue | 🟠 Major

Pin the CRD source ref to an immutable release/commit.

Line 15 still points to main, so local bootstrap can change unexpectedly when upstream moves. Please pin to a tested tag (or commit SHA).

Suggested change:

```diff
-METAL_OPERATOR_REF = "main"  # TODO: update to the tag after new metal-operator release
+METAL_OPERATOR_REF = "v<tested-release-tag>"
```

Script to list recent stable releases:

```bash
#!/bin/bash
set -euo pipefail
curl -fsSL "https://api.github.com/repos/ironcore-dev/metal-operator/releases?per_page=10" | python - <<'PY'
import json, sys
for r in json.load(sys.stdin):
    if not r.get("draft") and not r.get("prerelease"):
        print(f"{r['tag_name']} {r.get('published_at','')}")
PY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tiltfile` around lines 15 - 18, The METAL_OPERATOR_REF is pinned to "main" causing non-deterministic bootstrap; update the METAL_OPERATOR_REF constant to a specific immutable release tag or commit SHA (e.g., "vX.Y.Z" or a commit hash) and use that value in the three mgmt_kubectl(...) lines so they reference a known-good release; adjust the comment TODO to indicate the chosen tag/sha and, if desired, make METAL_OPERATOR_REF overridable via an environment variable so developers can opt into different versions without changing the file.
🧹 Nitpick comments (2)
pkg/cloudprovider/metal/node_controller_test.go (2)
505-513: Add unsupported-scheme coverage in parseProviderID table tests.

Please add a case like aws://default/node-1 (or similar) to lock in expected rejection of non-metal schemes.

Suggested test entry:

```diff
 Entry("valid provider ID", "metal://default/node-1", types.NamespacedName{Namespace: "default", Name: "node-1"}, false),
+Entry("unsupported scheme", "aws://default/node-1", types.NamespacedName{}, true),
 Entry("empty string", "", types.NamespacedName{}, true),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cloudprovider/metal/node_controller_test.go` around lines 505 - 513, Add a table test entry to assert parseProviderID rejects non-metal schemes: in the table inside node_controller_test.go where parseProviderID cases are declared (the Entry(...) lines), add an Entry for an unsupported scheme such as "aws://default/node-1" with expected result types.NamespacedName{} and expectError true; this ensures parseProviderID (the function under test) explicitly fails on schemes other than "metal".
352-390: Add regression test for cleanup when ServerClaim disappears first.

A targeted case here would be valuable: create maintenance + finalizer, delete the ServerClaim, remove the maintenance-requested label, then assert the finalizer is removed and the managed ServerMaintenance is cleaned up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cloudprovider/metal/node_controller_test.go` around lines 352 - 390, Add a new test in node_controller_test.go that covers the regression where the ServerClaim is deleted before the Node: reuse the existing setup that creates a ServerMaintenance CR and adds the nodeMaintenanceFinalizer by setting node.Labels[metalv1alpha1.ServerMaintenanceRequestedLabelKey] = TrueStr and patching via k8sClient; then delete the serverClaim object (serverClaim) from the API, remove the maintenance-requested label from node and patch, and assert that the managed ServerMaintenance (metalv1alpha1.ServerMaintenance) is deleted and the Node finalizer (nodeMaintenanceFinalizer) is removed (use k8sClient.Get with client.ObjectKey for maintenance and client.ObjectKeyFromObject(node) for node and apierrors.IsNotFound checks inside Eventually blocks).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cloudprovider/metal/node_controller.go`:
- Around line 275-280: The early return on apierrors.IsNotFound when fetching
ServerClaim (serverClaim := &metalv1alpha1.ServerClaim{}; err =
r.metalClient.Get(ctx, serverClaimKey, serverClaim)) skips the cleanup path;
change the logic so that when IsNotFound you still execute the cleanup branch
that removes the Node maintenance label, deletes any managed ServerMaintenance,
and removes the finalizer from the Node (the same cleanup you run when
ServerClaim exists but is not in maintenance), instead of returning immediately.
Update the handling in the ServerClaim fetch block and repeat the same fix for
the analogous code at the 306–317 region so both code paths (serverClaim missing
or present) will call the cleanup routines that operate on ServerMaintenance,
the Node object, and finalizer removal.
- Around line 148-151: parseProviderID currently accepts any scheme (provider)
and only errors on missing scheme; change it to explicitly validate allowed
schemes (e.g., "metal") by checking the extracted provider variable against a
whitelist inside parseProviderID and return a clear error like "unsupported
provider scheme" when it doesn't match; keep existing behavior of splitting
providerID with strings.Cut and parsing rest (namespace/name) but add the
whitelist check (reference symbols: parseProviderID, provider, rest, providerID)
so non-metal schemes are rejected before returning the NamespacedName.
In `@Tiltfile`:
- Around line 28-45: The docker image build ("docker_build" with ref
"controller") can race the binary creation from the local_resource named
"manager-binary"; add an explicit dependency by updating the k8s_resource for
the controller to include resource_deps=['manager-binary'] so Tilt waits for
local_resource("manager-binary") to finish before invoking docker_build,
ensuring .tiltbuild/manager exists before the image build runs.
---
Duplicate comments:
In `@Tiltfile`:
- Around line 15-18: The METAL_OPERATOR_REF is pinned to "main" causing
non-deterministic bootstrap; update the METAL_OPERATOR_REF constant to a
specific immutable release tag or commit SHA (e.g., "vX.Y.Z" or a commit hash)
and use that value in the three mgmt_kubectl(...) lines so they reference a
known-good release; adjust the comment TODO to indicate the chosen tag/sha and,
if desired, make METAL_OPERATOR_REF overridable via an environment variable so
developers can opt into different versions without changing the file.
---
Nitpick comments:
In `@pkg/cloudprovider/metal/node_controller_test.go`:
- Around line 505-513: Add a table test entry to assert parseProviderID rejects
non-metal schemes: in the table inside node_controller_test.go where
parseProviderID cases are declared (the Entry(...) lines), add an Entry for an
unsupported scheme such as "aws://default/node-1" with expected result
types.NamespacedName{} and expectError true; this ensures parseProviderID (the
function under test) explicitly fails on schemes other than "metal".
- Around line 352-390: Add a new test in node_controller_test.go that covers the
regression where the ServerClaim is deleted before the Node: reuse the existing
setup that creates a ServerMaintenance CR and adds the nodeMaintenanceFinalizer
by setting node.Labels[metalv1alpha1.ServerMaintenanceRequestedLabelKey] =
TrueStr and patching via k8sClient; then delete the serverClaim object
(serverClaim) from the API, remove the maintenance-requested label from node and
patch, and assert that the managed ServerMaintenance
(metalv1alpha1.ServerMaintenance) is deleted and the Node finalizer
(nodeMaintenanceFinalizer) is removed (use k8sClient.Get with client.ObjectKey
for maintenance and client.ObjectKeyFromObject(node) for node and
apierrors.IsNotFound checks inside Eventually blocks).
📒 Files selected for processing (3)
- Tiltfile
- pkg/cloudprovider/metal/node_controller.go
- pkg/cloudprovider/metal/node_controller_test.go
@coderabbitai summary

✅ Actions performed: Summary regeneration triggered.
Summary by CodeRabbit
New Features
Tests
Chores
Fixes #155