Enhanced node reconciliation by automating maintenance CR lifecycle#157
Conversation
Note: Reviews paused. It looks like this branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the review settings.
ℹ️ Recent review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
📒 Files selected for processing (15)
Walkthrough
Implements ServerMaintenance lifecycle and approval sync in the Node reconciler; refactors providerID building/parsing to URL form; extends PodCIDR reconciliation; updates tests for maintenance and providerID behavior; upgrades module dependencies; and adds local dev infrastructure (Kind manifests, Tilt workflow, Makefile targets, and .gitignore entries).
Sequence Diagram

```mermaid
sequenceDiagram
    participant Reconciler as Node Reconciler
    participant Node as Node Resource
    participant SC as ServerClaim
    participant SM as ServerMaintenance
    rect rgba(100,150,200,0.5)
        Note over Reconciler: Reconciliation triggered
        Reconciler->>Node: Fetch Node, check DeletionTimestamp
        alt Node Deleted
            Reconciler->>SM: ensureServerMaintenanceNotExists (delete CR)
            Reconciler->>Node: remove finalizer
        else Node Active
            Node->>Reconciler: providerID, maintenance label
            alt Maintenance Requested
                Reconciler->>SC: lookup ServerClaim by providerID
                Reconciler->>SM: ensureServerMaintenanceExists (CreateOrPatch)
                Reconciler->>Node: ensure finalizer present
                Reconciler->>SC: syncServerClaimApproval (label/approval)
            else No Maintenance
                Reconciler->>SM: ensureServerMaintenanceNotExists
                Reconciler->>Node: remove finalizer if present
            end
            Reconciler->>Node: reconcilePodCIDR (zeroHostBits -> patch PodCIDR/PodCIDRs)
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Around line 12-17: The go.mod currently mixes Kubernetes modules at v0.35.0
and v0.34.1 which can cause API/ABI skew; update the mismatched entries
(k8s.io/cloud-provider and k8s.io/controller-manager) to the same minor version
as the rest (e.g., v0.35.0) so all k8s modules (k8s.io/api, k8s.io/apimachinery,
k8s.io/client-go, k8s.io/cloud-provider, k8s.io/component-base,
k8s.io/controller-manager) share v0.35.x, then refresh dependencies (via go
get/go mod tidy) to ensure module graph consistency.
In `@pkg/cloudprovider/metal/node_maintenance_controller.go`:
- Around line 181-196: The parseProviderID function currently allows empty
namespace or name when providerID like "metal://ns/" or "metal:///name" is
passed; update parseProviderID to validate that after splitting rest into parts
the two segments parts[0] (namespace) and parts[1] (name) are non-empty and
return a clear error (e.g. "empty namespace or name") if either is empty; keep
returning types.NamespacedName{Namespace: parts[0], Name: parts[1]} only when
both segments are non-empty so callers of parseProviderID get a deterministic,
well-validated NamespacedName.
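The validation described above can be sketched as a small, self-contained parser. This is an illustrative reconstruction, not the repository's actual implementation: the function name and the `metal` scheme follow the review comments, while the exact error wording and the inlined `NamespacedName` type are assumptions for the sake of a runnable example.

```go
package main

import (
	"fmt"
	"strings"
)

// NamespacedName mirrors k8s.io/apimachinery's types.NamespacedName,
// inlined here so the sketch is self-contained.
type NamespacedName struct {
	Namespace string
	Name      string
}

// parseProviderID parses "metal://<namespace>/<name>", rejecting
// unsupported schemes as well as empty namespace or name segments,
// so callers always get a well-validated NamespacedName.
func parseProviderID(providerID string) (NamespacedName, error) {
	scheme, rest, ok := strings.Cut(providerID, "://")
	if !ok {
		return NamespacedName{}, fmt.Errorf("invalid providerID %q: missing scheme", providerID)
	}
	if scheme != "metal" {
		return NamespacedName{}, fmt.Errorf("unsupported provider scheme %q", scheme)
	}
	parts := strings.Split(rest, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return NamespacedName{}, fmt.Errorf("invalid providerID %q: empty namespace or name", providerID)
	}
	return NamespacedName{Namespace: parts[0], Name: parts[1]}, nil
}

func main() {
	for _, id := range []string{
		"metal://default/node-1", // valid
		"metal://ns/",            // empty name -> error
		"metal:///name",          // empty namespace -> error
		"aws://default/node-1",   // unsupported scheme -> error
	} {
		nn, err := parseProviderID(id)
		fmt.Printf("%-26s -> %+v err=%v\n", id, nn, err)
	}
}
```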
- Around line 116-123: Reconcile currently returns on Node NotFound before
cleaning up a ServerMaintenance CR, which can orphan the CR; fix by ensuring
cleanup runs on Node deletion — either add a Kubernetes finalizer on Node in
NodeMaintenanceReconciler.Reconcile (add finalizer when maintenance is
requested, detect deletionTimestamp, perform ServerMaintenance deletion/cleanup
using the providerID and targetClient, then remove finalizer) or implement a
DeleteFunc event handler that reads the tombstone to extract providerID and
proactively delete the corresponding ServerMaintenance CR via r.targetClient;
update NodeMaintenanceReconciler methods that reference providerID,
targetClient, and DeleteFunc to perform the delete logic and remove the
finalizer only after successful cleanup.
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
- go.mod
- pkg/cloudprovider/metal/cloud.go
- pkg/cloudprovider/metal/node_controller.go
- pkg/cloudprovider/metal/node_controller_test.go
- pkg/cloudprovider/metal/node_maintenance_controller.go
- pkg/cloudprovider/metal/node_maintenance_controller_test.go
💤 Files with no reviewable changes (2)
- pkg/cloudprovider/metal/node_controller_test.go
- pkg/cloudprovider/metal/node_controller.go
Force-pushed: dc30c2c to ceac43a
```go
		r.queue.Add(key)
	},
})
defer r.queue.ShutDown()
```
Would it make sense to move the defer after the error check?
And I see that other controllers follow the same pattern - is that an issue?
Actually, we are creating the queue in the reconciler constructor NewNodeMaintenanceReconciler, which already allocates some resources:
https://github.com/kubernetes/client-go/blob/master/util/workqueue/delaying_queue.go#L157
so we want to shut it down anyway. It would be even better to move this defer to the top of the Start func.
In general I would try to avoid adding finalizers on Node objects, as it might block machinery on top that manages nodes (Gardener, Cluster-API). We need to think about what happens if a rolling update is triggered and how we should react to it.
Update after offline discussion: finalizers on Node objects should not be an issue here.
@coderabbitai do you think it is a good idea to split the reconciliation logic into two
Recommended Approach: merge into a single reconciler.

```go
func (r *NodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) error {
	node := &corev1.Node{}
	if err := r.targetClient.Get(ctx, req.NamespacedName, node); err != nil {
		return client.IgnoreNotFound(err)
	}
	// Handle deletion
	if !node.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, node)
	}
	// Reconcile PodCIDR
	if err := r.reconcilePodCIDR(ctx, node); err != nil {
		return fmt.Errorf("failed to reconcile PodCIDR: %w", err)
	}
	// Reconcile maintenance
	if err := r.reconcileMaintenance(ctx, node); err != nil {
		return fmt.Errorf("failed to reconcile maintenance: %w", err)
	}
	return nil
}
```

This provides:
Note: The finalizer concern raised by
Actionable comments posted: 7
🧹 Nitpick comments (2)
config/kind/manager-patch.yaml (1)
10-29: Harden manager container security context.

Line 10 onward defines the container without explicit hardening controls (allowPrivilegeEscalation, dropped capabilities, seccomp). This is a security posture gap even for dev clusters.

Proposed patch:

```diff
 spec:
   strategy:
     type: Recreate
   template:
     spec:
+      securityContext:
+        seccompProfile:
+          type: RuntimeDefault
       containers:
         - name: manager
           image: controller
           imagePullPolicy: IfNotPresent
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - ALL
           args:
             - --cloud-provider=metal
             - --cloud-config=/etc/kubernetes/cloud-config/cloud-config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/kind/manager-patch.yaml` around lines 10 - 29, Add a Pod/Container securityContext to harden the "manager" container: in the container spec for name "manager" add securityContext with allowPrivilegeEscalation: false, runAsNonRoot: true (and set runAsUser to a non-root uid like 1000), and capabilities.drop: ["ALL"]; also add a pod or container seccompProfile with type: RuntimeDefault; ensure these fields are applied alongside the existing volumeMount for "cloud-config" and keep secretName "metal-cloud-config" unchanged.

Makefile (1)
70-73: Make kind-setup idempotent for repeatable local workflows.

Lines 72-73 fail on reruns if clusters already exist. Guarding creation improves day-to-day developer UX.
Proposed patch
```diff
 .PHONY: kind-setup
 kind-setup:
-	kind create cluster --name mgmt
-	kind create cluster --name worker
+	kind get clusters | grep -qx mgmt || kind create cluster --name mgmt
+	kind get clusters | grep -qx worker || kind create cluster --name worker
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 70 - 73, The kind-setup target fails on reruns because it unconditionally runs "kind create cluster --name mgmt" and "--name worker"; make it idempotent by checking for each cluster's existence before creating it (e.g., use "kind get clusters" and only call kind create when the named cluster is absent). Update the kind-setup recipe to guard the two creations (for "mgmt" and "worker") so repeated runs skip creation if the cluster already exists.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Makefile`:
- Around line 81-84: The tilt-up recipe currently sets KUBECONFIG on one shell
line then runs `tilt up` in a new shell and also references a non-existent
`mgmt-kubeconfig`; fix by setting the env var inline with the command and point
it at the files produced by the export-kubeconfig target (e.g.
`mgmt-kubeconfig-external` and the worker kubeconfig). Update the `tilt-up`
recipe so the single command uses:
KUBECONFIG=./config/kind/mgmt-kubeconfig-external:./config/kind/worker-kubeconfig
tilt up (ensure `export-kubeconfig` still runs as a prerequisite).
In `@pkg/cloudprovider/metal/node_controller_test.go`:
- Around line 251-253: The test currently uses
Consistently(Object(node)).Should(HaveField("Finalizers",
ContainElement(nodeMaintenanceFinalizer))) which races with asynchronous
reconciliation; change the assertion to first wait for the finalizer to appear
by calling Eventually(Object(node)).Should(HaveField("Finalizers",
ContainElement(nodeMaintenanceFinalizer))) and only then use
Consistently(Object(node)).Should(...) to assert it remains present. Locate the
occurrence referencing node and nodeMaintenanceFinalizer and wrap the initial
existence check with Eventually before the Consistently stability check.
- Around line 223-235: The test fixture creates a ServerMaintenance object
(existingCR) with Finalizers: []string{nodeMaintenanceFinalizer} which is
incorrect because nodeMaintenanceFinalizer is Node-specific; remove the
Finalizers field (or ensure Finalizers is empty) from the ServerMaintenance
object creation in the idempotency test so existingCR mirrors production
behavior (ServerMaintenance objects created without finalizers) and the test
remains focused on verifying the controller leaves existing CR fields untouched.
In `@pkg/cloudprovider/metal/node_controller.go`:
- Around line 202-214: The deletion in ensureServerMaintenanceNotExists
currently removes ServerMaintenance by name/namespace unconditionally; change it
to first GET the ServerMaintenance (using r.metalClient.Get into the same
maintenance object), check its labels for "app.kubernetes.io/managed-by" and
only call r.metalClient.Delete if that label matches the controller's manager
identifier (the value you set when creating maintenance objects in the
reconciler), otherwise return nil (or client.IgnoreNotFound(nil)). Keep existing
error handling using client.IgnoreNotFound(err) for both Get and Delete to avoid
spurious failures; reference function ensureServerMaintenanceNotExists and the
"app.kubernetes.io/managed-by" label in your changes.
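The ownership check before deletion boils down to a label comparison. A self-contained sketch under the review's assumptions: the label key `app.kubernetes.io/managed-by` comes from the comment, while the manager identifier value and the helper name are hypothetical.

```go
package main

import "fmt"

const (
	labelKeyManagedBy      = "app.kubernetes.io/managed-by"
	cloudProviderMetalName = "cloud-provider-metal" // assumed manager identifier
)

// shouldDeleteMaintenance reports whether a ServerMaintenance object,
// identified here only by its labels, was created by this controller
// and may therefore be deleted. Objects without the label, or labelled
// with a different manager, are left alone.
func shouldDeleteMaintenance(labels map[string]string) bool {
	return labels[labelKeyManagedBy] == cloudProviderMetalName
}

func main() {
	fmt.Println(shouldDeleteMaintenance(map[string]string{labelKeyManagedBy: cloudProviderMetalName})) // ours: delete
	fmt.Println(shouldDeleteMaintenance(map[string]string{labelKeyManagedBy: "someone-else"}))          // foreign: keep
	fmt.Println(shouldDeleteMaintenance(nil))                                                           // unlabeled: keep
}
```

In the real controller this check would sit between the Get and the Delete call, with NotFound errors from either step ignored, as the comment describes.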
- Around line 133-147: Move the deletion-check above the strict providerID
validation so deleting Nodes never skip delete reconciliation: check
node.DeletionTimestamp.IsZero() before inspecting node.Spec.ProviderID or
calling parseProviderID, and if the node is deleting call and return
r.reconcileDelete(ctx, node, /*serverClaimKey may be nil/zero*/ ) (or construct
a minimal serverClaimKey if reconcileDelete requires it) so finalizer cleanup
runs even when providerID is empty or malformed; then proceed with the existing
providerID empty check, parseProviderID call, and subsequent logging
(references: node.Spec.ProviderID, node.DeletionTimestamp, parseProviderID,
r.reconcileDelete, l.Info).
In `@Tiltfile`:
- Around line 15-17: The Tiltfile currently calls mgmt_kubectl with CRD URLs
pointing at the moving "main" branch; update each mgmt_kubectl invocation (the
three calls shown) to use the immutable release tag v0.3.0 in the URLs (replace
"main" with "v0.3.0") so the CRDs are pinned to a stable, reproducible version.
- Around line 15-20: The sequence applies CRD manifests then creates CRs too
quickly; update the Tiltfile to wait for each CRD to reach the Established
condition before applying the Server and ServerClaim CRs by invoking
mgmt_kubectl wait for each CRD (e.g., metal.ironcore.dev_serverclaims,
metal.ironcore.dev_servermaintenances, metal.ironcore.dev_servers) with
--for=condition=Established --timeout=<reasonable-duration> after each
mgmt_kubectl('apply -f ...yaml') call, then proceed to apply
config/kind/crs/server.yaml and config/kind/crs/serverclaim.yaml only after
those waits complete.
---
Nitpick comments:
In `@config/kind/manager-patch.yaml`:
- Around line 10-29: Add a Pod/Container securityContext to harden the "manager"
container: in the container spec for name "manager" add securityContext with
allowPrivilegeEscalation: false, runAsNonRoot: true (and set runAsUser to a
non-root uid like 1000), and capabilities.drop: ["ALL"]; also add a pod or
container seccompProfile with type: RuntimeDefault; ensure these fields are
applied alongside the existing volumeMount for "cloud-config" and keep
secretName "metal-cloud-config" unchanged.
In `@Makefile`:
- Around line 70-73: The kind-setup target fails on reruns because it
unconditionally runs "kind create cluster --name mgmt" and "--name worker"; make
it idempotent by checking for each cluster's existence before creating it (e.g.,
use "kind get clusters" and only call kind create when the named cluster is
absent). Update the kind-setup recipe to guard the two creations (for "mgmt" and
"worker") so repeated runs skip creation if the cluster already exists.
📒 Files selected for processing (13)
- .gitignore
- Makefile
- Tiltfile
- Tiltfile.license
- config/kind/cloud-config.yaml
- config/kind/crs/node.yaml
- config/kind/crs/server.yaml
- config/kind/crs/serverclaim.yaml
- config/kind/kustomization.yaml
- config/kind/manager-patch.yaml
- go.mod
- pkg/cloudprovider/metal/node_controller.go
- pkg/cloudprovider/metal/node_controller_test.go
✅ Files skipped from review due to trivial changes (4)
- config/kind/kustomization.yaml
- Tiltfile.license
- config/kind/crs/node.yaml
- config/kind/crs/serverclaim.yaml
Force-pushed: f422a1a to 6df53da
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/cloudprovider/metal/node_controller_test.go (1)
352-387: Consider moving this test inside the "ServerMaintenance CR Lifecycle" context.

This test is logically related to the ServerMaintenance CR lifecycle tests but is currently placed at the top level of the Describe block. Moving it inside the Context("ServerMaintenance CR Lifecycle", ...) block would improve test organization and make the relationship clearer.

Suggested reorganization: move lines 352-387 inside the Context("ServerMaintenance CR Lifecycle", func() { ... }) block that ends at line 350.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cloudprovider/metal/node_controller_test.go` around lines 352 - 387, Move the It test titled "should NOT delete the ServerMaintenance CR if it is not managed by the controller" into the existing Context("ServerMaintenance CR Lifecycle", func() { ... }) block so it is scoped with the related ServerMaintenance lifecycle tests; locate the It block (search for the exact spec string) and cut/paste its entire body into the ServerMaintenance CR Lifecycle context (near other It cases), ensuring imports/variables used by the test (node, serverClaim, k8sClient, labelKeyManagedBy, serverMaintenancePriority) remain in scope and update any DeferCleanup/patch context if necessary.

Tiltfile (1)
28-32: Consider making architecture configurable for cross-platform development.

The hardcoded GOARCH=amd64 will prevent developers on ARM machines (Apple Silicon Macs) from using Tilt for local development without manual modification.

♻️ Suggested improvement:

```diff
+ARCH = str(local('go env GOARCH', quiet=True)).strip()
+
 local_resource(
     "manager-binary",
-    cmd = 'mkdir -p .tiltbuild; CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o .tiltbuild/manager ./cmd/metal-cloud-controller-manager/main.go',
+    cmd = 'mkdir -p .tiltbuild; CGO_ENABLED=0 GOOS=linux GOARCH=' + ARCH + ' go build -o .tiltbuild/manager ./cmd/metal-cloud-controller-manager/main.go',
     deps = ["pkg", "cmd", "go.mod", "go.sum"]
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tiltfile` around lines 28 - 32, The build command in the local_resource "manager-binary" hardcodes GOARCH=amd64; change it to read GOARCH from the environment with a default (e.g., use getenv("GOARCH") or similar in the Tiltfile) so Apple Silicon/ARM devs can override it; update the cmd string for manager-binary to inject the resolved GOARCH value (keep CGO_ENABLED=0 and GOOS=linux) and document that developers can set GOARCH in their shell or Tilt environment to arm64 when needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Line 7: The Tiltfile and go.mod are out of sync: go.mod depends on
github.com/ironcore-dev/metal-operator at pseudo-version
v0.3.1-0.20260220142141-5dabaca4f7d7 while the Tiltfile uses METAL_OPERATOR_REF
set to v0.3.0; pick one source of truth and align them—either update
METAL_OPERATOR_REF in the Tiltfile to the exact pseudo-version/commit string
used in go.mod (the commit suffix 5dabaca4f7d7) or change the go.mod dependency
to the stable tag v0.3.0 so both references match.
In `@pkg/cloudprovider/metal/node_controller.go`:
- Around line 270-273: The reconcile should not return an error when
node.Spec.ProviderID is empty or unparseable because that triggers a requeue
storm; change the code around the parseProviderID call so you first check if
node.Spec.ProviderID is empty and, if so, log/debug and return nil (no error),
and if parseProviderID(node.Spec.ProviderID) returns an error, treat it as
non-fatal: log the parse failure with context (including node.Name and
ProviderID) and return nil instead of fmt.Errorf, ensuring any later use of
serverClaimKey only happens when parse succeeds.
- Around line 334-357: The create/patch path in ensureServerMaintenanceExists
uses controllerutil.CreateOrPatch which will mutate any existing
ServerMaintenance with the same key; update the mutate closure (or pre-check) to
detect an existing object that is not managed by us and bail instead of taking
it over: inside the CreateOrPatch mutate func in ensureServerMaintenanceExists
(or by doing a Get before calling CreateOrPatch) examine
maintenance.Labels[labelKeyManagedBy] and if it exists and !=
cloudProviderMetalName return a non-nil error to stop patching so we don’t adopt
unmanaged ServerMaintenance objects. Ensure you still set labels/Spec when the
object is absent or already managed by cloudProviderMetalName.
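The adopt-guard inside the mutate closure reduces to: refuse to touch an object whose managed-by label names a different controller, and only otherwise set our labels and spec fields. A self-contained sketch — the `maintenance` struct, the manager identifier value, and the `Reason` field are stand-ins for the metalv1alpha1 types named in the comment, not the real API.

```go
package main

import "fmt"

const (
	labelKeyManagedBy      = "app.kubernetes.io/managed-by"
	cloudProviderMetalName = "cloud-provider-metal" // assumed manager identifier
)

// maintenance is a stand-in for metalv1alpha1.ServerMaintenance.
type maintenance struct {
	Labels map[string]string
	Reason string
}

// mutate mirrors the closure passed to controllerutil.CreateOrPatch:
// it bails out with an error when the existing object is managed by a
// different controller, and only then sets labels and spec fields, so
// unmanaged ServerMaintenance objects are never adopted.
func mutate(m *maintenance) error {
	if owner, ok := m.Labels[labelKeyManagedBy]; ok && owner != cloudProviderMetalName {
		return fmt.Errorf("refusing to adopt ServerMaintenance managed by %q", owner)
	}
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	m.Labels[labelKeyManagedBy] = cloudProviderMetalName
	m.Reason = "node maintenance requested"
	return nil
}

func main() {
	fresh := &maintenance{}
	fmt.Println("fresh:", mutate(fresh), fresh.Labels)

	foreign := &maintenance{Labels: map[string]string{labelKeyManagedBy: "other-operator"}}
	fmt.Println("foreign:", mutate(foreign))
}
```

Returning an error from the mutate function causes CreateOrPatch to abort without writing, which is exactly the "bail instead of taking it over" behavior the comment asks for.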
---
Nitpick comments:
In `@pkg/cloudprovider/metal/node_controller_test.go`:
- Around line 352-387: Move the It test titled "should NOT delete the
ServerMaintenance CR if it is not managed by the controller" into the existing
Context("ServerMaintenance CR Lifecycle", func() { ... }) block so it is scoped
with the related ServerMaintenance lifecycle tests; locate the It block (search
for the exact spec string) and cut/paste its entire body into the
ServerMaintenance CR Lifecycle context (near other It cases), ensuring
imports/variables used by the test (node, serverClaim, k8sClient,
labelKeyManagedBy, serverMaintenancePriority) remain in scope and update any
DeferCleanup/patch context if necessary.
In `@Tiltfile`:
- Around line 28-32: The build command in the local_resource "manager-binary"
hardcodes GOARCH=amd64; change it to read GOARCH from the environment with a
default (e.g., use getenv("GOARCH") or similar in the Tiltfile) so Apple
Silicon/ARM devs can override it; update the cmd string for manager-binary to
inject the resolved GOARCH value (keep CGO_ENABLED=0 and GOOS=linux) and
document that developers can set GOARCH in their shell or Tilt environment to
arm64 when needed.
📒 Files selected for processing (5)
- Makefile
- Tiltfile
- go.mod
- pkg/cloudprovider/metal/node_controller.go
- pkg/cloudprovider/metal/node_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- Makefile
Actionable comments posted: 3
♻️ Duplicate comments (1)
Tiltfile (1)
15-18: ⚠️ Potential issue | 🟠 Major

Pin the CRD source ref to an immutable release/commit.

Line 15 still points to main, so local bootstrap can change unexpectedly when upstream moves. Please pin to a tested tag (or commit SHA).

Suggested change:

```diff
-METAL_OPERATOR_REF = "main"  # TODO: update to the tag after new metal-operator release
+METAL_OPERATOR_REF = "v<tested-release-tag>"
```

Script to list recent stable releases:

```bash
#!/bin/bash
set -euo pipefail
curl -fsSL "https://api.github.com/repos/ironcore-dev/metal-operator/releases?per_page=10" | python - <<'PY'
import json, sys
for r in json.load(sys.stdin):
    if not r.get("draft") and not r.get("prerelease"):
        print(f"{r['tag_name']} {r.get('published_at','')}")
PY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tiltfile` around lines 15 - 18, The METAL_OPERATOR_REF is pinned to "main" causing non-deterministic bootstrap; update the METAL_OPERATOR_REF constant to a specific immutable release tag or commit SHA (e.g., "vX.Y.Z" or a commit hash) and use that value in the three mgmt_kubectl(...) lines so they reference a known-good release; adjust the comment TODO to indicate the chosen tag/sha and, if desired, make METAL_OPERATOR_REF overridable via an environment variable so developers can opt into different versions without changing the file.
🧹 Nitpick comments (2)
pkg/cloudprovider/metal/node_controller_test.go (2)
505-513: Add unsupported-scheme coverage in parseProviderID table tests.

Please add a case like aws://default/node-1 (or similar) to lock in expected rejection of non-metal schemes.

Suggested test entry:

```diff
 Entry("valid provider ID", "metal://default/node-1", types.NamespacedName{Namespace: "default", Name: "node-1"}, false),
+Entry("unsupported scheme", "aws://default/node-1", types.NamespacedName{}, true),
 Entry("empty string", "", types.NamespacedName{}, true),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cloudprovider/metal/node_controller_test.go` around lines 505 - 513, Add a table test entry to assert parseProviderID rejects non-metal schemes: in the table inside node_controller_test.go where parseProviderID cases are declared (the Entry(...) lines), add an Entry for an unsupported scheme such as "aws://default/node-1" with expected result types.NamespacedName{} and expectError true; this ensures parseProviderID (the function under test) explicitly fails on schemes other than "metal".
352-390: Add regression test for cleanup when ServerClaim disappears first.

A targeted case here would be valuable: create maintenance + finalizer, delete the ServerClaim, remove the maintenance-requested label, then assert the finalizer is removed and the managed ServerMaintenance is cleaned up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cloudprovider/metal/node_controller_test.go` around lines 352 - 390, Add a new test in node_controller_test.go that covers the regression where the ServerClaim is deleted before the Node: reuse the existing setup that creates a ServerMaintenance CR and adds the nodeMaintenanceFinalizer by setting node.Labels[metalv1alpha1.ServerMaintenanceRequestedLabelKey] = TrueStr and patching via k8sClient; then delete the serverClaim object (serverClaim) from the API, remove the maintenance-requested label from node and patch, and assert that the managed ServerMaintenance (metalv1alpha1.ServerMaintenance) is deleted and the Node finalizer (nodeMaintenanceFinalizer) is removed (use k8sClient.Get with client.ObjectKey for maintenance and client.ObjectKeyFromObject(node) for node and apierrors.IsNotFound checks inside Eventually blocks).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cloudprovider/metal/node_controller.go`:
- Around line 275-280: The early return on apierrors.IsNotFound when fetching
ServerClaim (serverClaim := &metalv1alpha1.ServerClaim{}; err =
r.metalClient.Get(ctx, serverClaimKey, serverClaim)) skips the cleanup path;
change the logic so that when IsNotFound you still execute the cleanup branch
that removes the Node maintenance label, deletes any managed ServerMaintenance,
and removes the finalizer from the Node (the same cleanup you run when
ServerClaim exists but is not in maintenance), instead of returning immediately.
Update the handling in the ServerClaim fetch block and repeat the same fix for
the analogous code at the 306–317 region so both code paths (serverClaim missing
or present) will call the cleanup routines that operate on ServerMaintenance,
the Node object, and finalizer removal.
- Around line 148-151: parseProviderID currently accepts any scheme (provider)
and only errors on missing scheme; change it to explicitly validate allowed
schemes (e.g., "metal") by checking the extracted provider variable against a
whitelist inside parseProviderID and return a clear error like "unsupported
provider scheme" when it doesn't match; keep existing behavior of splitting
providerID with strings.Cut and parsing rest (namespace/name) but add the
whitelist check (reference symbols: parseProviderID, provider, rest, providerID)
so non-metal schemes are rejected before returning the NamespacedName.
In `@Tiltfile`:
- Around line 28-45: The docker image build ("docker_build" with ref
"controller") can race the binary creation from the local_resource named
"manager-binary"; add an explicit dependency by updating the k8s_resource for
the controller to include resource_deps=['manager-binary'] so Tilt waits for
local_resource("manager-binary") to finish before invoking docker_build,
ensuring .tiltbuild/manager exists before the image build runs.
---
Duplicate comments:
In `@Tiltfile`:
- Around line 15-18: The METAL_OPERATOR_REF is pinned to "main" causing
non-deterministic bootstrap; update the METAL_OPERATOR_REF constant to a
specific immutable release tag or commit SHA (e.g., "vX.Y.Z" or a commit hash)
and use that value in the three mgmt_kubectl(...) lines so they reference a
known-good release; adjust the comment TODO to indicate the chosen tag/sha and,
if desired, make METAL_OPERATOR_REF overridable via an environment variable so
developers can opt into different versions without changing the file.
---
Nitpick comments:
In `@pkg/cloudprovider/metal/node_controller_test.go`:
- Around line 505-513: Add a table test entry to assert parseProviderID rejects
non-metal schemes: in the table inside node_controller_test.go where
parseProviderID cases are declared (the Entry(...) lines), add an Entry for an
unsupported scheme such as "aws://default/node-1" with expected result
types.NamespacedName{} and expectError true; this ensures parseProviderID (the
function under test) explicitly fails on schemes other than "metal".
- Around line 352-390: Add a new test in node_controller_test.go that covers the
regression where the ServerClaim is deleted before the Node: reuse the existing
setup that creates a ServerMaintenance CR and adds the nodeMaintenanceFinalizer
by setting node.Labels[metalv1alpha1.ServerMaintenanceRequestedLabelKey] =
TrueStr and patching via k8sClient; then delete the serverClaim object
(serverClaim) from the API, remove the maintenance-requested label from node and
patch, and assert that the managed ServerMaintenance
(metalv1alpha1.ServerMaintenance) is deleted and the Node finalizer
(nodeMaintenanceFinalizer) is removed (use k8sClient.Get with client.ObjectKey
for maintenance and client.ObjectKeyFromObject(node) for node and
apierrors.IsNotFound checks inside Eventually blocks).
📒 Files selected for processing (3)
- Tiltfile
- pkg/cloudprovider/metal/node_controller.go
- pkg/cloudprovider/metal/node_controller_test.go
@coderabbitai summary

✅ Actions performed: Summary regeneration triggered.
Summary by CodeRabbit
New Features
Tests
Chores
Fixes #155