NMO Doesn't Increment ErrorOnLeaseCount to Stop Maintenance #158
razo7 wants to merge 5 commits into medik8s:main
Conversation
…LeaseCount can be incremented
ErrorOnLeaseCount of the NM CR status was never incremented when obtainLease never returned a 'true' value.
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: razo7
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
obtainLease's signature changed to return only an error; lease.AlreadyHeldError is detected and treated as contention (it increments ErrorOnLeaseCount and can drive MaintenanceFailed), a successful lease advances to MaintenanceRunning and resets counters, and InvalidateLease skips invalidation on AlreadyHeldError. Tests updated to cover contention and recovery.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Reconciler
    participant LeaseManager
    participant NodeAPI
    participant StatusEvent
    Reconciler->>LeaseManager: RequestLease(node)
    alt Lease obtained
        LeaseManager-->>Reconciler: nil
        Reconciler->>NodeAPI: Cordon/Taint node
        Reconciler->>StatusEvent: Set MaintenanceRunning, reset ErrorOnLeaseCount
    else Lease already held (AlreadyHeldError)
        LeaseManager-->>Reconciler: AlreadyHeldError
        Reconciler->>StatusEvent: Increment ErrorOnLeaseCount, possibly set MaintenanceFailed, emit event
    else Other error
        LeaseManager-->>Reconciler: error
        Reconciler->>StatusEvent: Record error, emit event
    end
    Note over Reconciler,LeaseManager: On stop/cleanup
    Reconciler->>LeaseManager: InvalidateLease(node)
    alt Invalidate succeeded
        LeaseManager-->>Reconciler: nil
    else Invalidate AlreadyHeldError
        LeaseManager-->>Reconciler: AlreadyHeldError (logged, skipped)
    else Invalidate other error
        LeaseManager-->>Reconciler: error (logged)
    end
```
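As a rough illustration of the contention handling described in the walkthrough, here is a minimal, self-contained Go sketch. The `AlreadyHeldError` type and the `maxAllowedErrorToUpdateOwnedLease` threshold are stand-ins modeled on the names used in this review, not the controller's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// AlreadyHeldError is a stand-in for lease.AlreadyHeldError, used only for this sketch.
type AlreadyHeldError struct{ Holder string }

func (e AlreadyHeldError) Error() string {
	return fmt.Sprintf("lease is held by a different owner: %s", e.Holder)
}

// maxAllowedErrorToUpdateOwnedLease mirrors the threshold name used in this review; the value is illustrative.
const maxAllowedErrorToUpdateOwnedLease = 3

// classifyLeaseError applies the logic the walkthrough describes: success resets the
// counter, contention increments it (and fails maintenance past the threshold), and
// any other error restarts the consecutive-contention tally, matching the reset
// discussed in the review comments below.
func classifyLeaseError(err error, errorOnLeaseCount int) (newCount int, failed bool) {
	if err == nil {
		return 0, false // lease obtained: keep maintenance running, reset the counter
	}
	var held AlreadyHeldError
	if errors.As(err, &held) {
		newCount = errorOnLeaseCount + 1
		return newCount, newCount > maxAllowedErrorToUpdateOwnedLease
	}
	return 0, false // other error: restart the tally, report, and retry
}

func main() {
	count := 0
	for attempt := 1; attempt <= 5; attempt++ {
		var failed bool
		count, failed = classifyLeaseError(AlreadyHeldError{Holder: "nhc"}, count)
		fmt.Printf("attempt %d: ErrorOnLeaseCount=%d maintenanceFailed=%t\n", attempt, count, failed)
	}
}
```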
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
🧹 Nitpick comments (2)
controllers/nodemaintenance_controller_test.go (2)
265-268: Unchecked type assertion on `originalLeaseManager`. Line 267 uses `originalLeaseManager.(*mockLeaseManager).Manager` without a comma-ok guard. In test code this is generally acceptable since a panic would surface as a clear test failure, but it's worth noting that this assumes the suite always initializes `r.LeaseManager` as `*mockLeaseManager`.
263-292: Test scenario is well-structured and follows existing patterns. The mock setup with `DeferCleanup` to restore the original `LeaseManager` is clean. The `Eventually` block properly waits for the async reconciliation loop to increment `ErrorOnLeaseCount` past the threshold.
The hardcoded substring `"failed to extend lease owned by us"` on line 291 couples this test to the controller's error message (line 206 of `nodemaintenance_controller.go`). Consider extracting this string as a shared constant to reduce fragility if the message needs to change in the future.
/test 4.22-openshift-e2e
Force-pushed: a19ca55 to 7747b91
/test 4.22-openshift-e2e
🧹 Nitpick comments (1)
controllers/nodemaintenance_controller_test.go (1)
265-268: Unchecked type assertion could panic if test setup changes. Line 267 uses a bare type assertion `originalLeaseManager.(*mockLeaseManager)`, which will panic if `r.LeaseManager` isn't a `*mockLeaseManager`. Consider the comma-ok form for a clearer failure message, though this is minor given the controlled test context.
♻️ Optional: safer type assertion
```diff
-			r.LeaseManager = &mockLeaseManager{
-				Manager:         originalLeaseManager.(*mockLeaseManager).Manager,
-				requestLeaseErr: lease.AlreadyHeldError{},
-			}
+			mock, ok := originalLeaseManager.(*mockLeaseManager)
+			Expect(ok).To(BeTrue(), "expected LeaseManager to be *mockLeaseManager")
+			r.LeaseManager = &mockLeaseManager{
+				Manager:         mock.Manager,
+				requestLeaseErr: lease.AlreadyHeldError{},
+			}
```
Can you please add another test case which tests if NMO starts maintenance when a lease, which was taken by someone else, is released? |
…ctor obtainLease
Fixing a false positive test result because InvalidateLease silently succeeded in the test environment. We fix that by not erroring on lease invalidation for a lease NMO doesn't possess, so we can stop the maintenance CR and record it as failed.
Force-pushed: 7747b91 to 35aafb2
Test the NMO-NHC lease coordination scenarios introduced by the recent obtainLease and stopNodeMaintenanceImp fixes.
Why: The previous commits (d8b7eed, 3e7fc3f) fixed lease contention handling but had no test coverage. Without tests, regressions in ErrorOnLeaseCount incrementing, Phase=Failed transition, or lease-release recovery could go undetected.
What:
- Add "lease is already held" test: verifies that when RequestLease returns AlreadyHeldError indefinitely, ErrorOnLeaseCount exceeds the threshold and Phase transitions to MaintenanceFailed
- Add "lease is held then released" test: verifies that after temporary lease contention (simulated via a counting mock that returns AlreadyHeldError N times then nil), NMO recovers: it resets ErrorOnLeaseCount, cordons the node, drains pods, and reaches MaintenanceSucceeded
- Extend mockLeaseManager with requestLeaseErr, invalidateLeaseErr, and maxRequestFailures fields to support both scenarios
- Use named fields in suite_test.go for mockLeaseManager init
Force-pushed: 35aafb2 to af10e22
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
controllers/nodemaintenance_controller.go (1)
190-221: ⚠️ Potential issue | 🟡 Minor: `ErrorOnLeaseCount` reset on non-`AlreadyHeldError` needs clarification
The counter resets to 0 both when a non-`AlreadyHeldError` occurs (line 211) and when the lease is obtained successfully (line 216). While the success case includes a comment explaining the intent ("Another chance to evict pods"), the reset on non-`AlreadyHeldError` lacks explanation.
This means that if the controller is counting consecutive `AlreadyHeldError` failures and then encounters a different transient error (e.g., a network timeout), the count resets and the threshold check restarts from zero. Based on the code structure and test expectations, this appears to be intentional: `ErrorOnLeaseCount` is designed to track only consecutive `AlreadyHeldError` failures, not all lease-related errors. However, adding a clarifying comment at line 211 would make the logic explicit and easier to maintain.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/nodemaintenance_controller.go` around lines 190 - 221, Add a clarifying comment explaining why nm.Status.ErrorOnLeaseCount is reset to 0 when err != nil (non-AlreadyHeldError) in the obtainLease handling: state that ErrorOnLeaseCount intentionally counts only consecutive lease.AlreadyHeldError occurrences (used by obtainLease and the maxAllowedErrorToUpdateOwnedLease logic) and should be cleared on any other transient or different error to restart the consecutive-held-error tally; place this comment immediately before the branch that logs "failed to request lease" (around the err != nil handling) and reference ErrorOnLeaseCount, obtainLease, and maxAllowedErrorToUpdateOwnedLease so future readers understand the intended behavior.
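The requested comment could look roughly like this; the reset statement below is paraphrased from the review, not copied from the controller file:

```go
// ErrorOnLeaseCount intentionally counts only consecutive lease.AlreadyHeldError
// occurrences (see obtainLease and maxAllowedErrorToUpdateOwnedLease). Any other
// lease error restarts that tally, so clear the counter here as well.
nm.Status.ErrorOnLeaseCount = 0
```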
🧹 Nitpick comments (3)
controllers/nodemaintenance_controller_test.go (3)
295-337: Good coverage of the recovery path after lease release; this addresses the reviewer's request. It directly covers the scenario slintes requested: NMO starts maintenance when a previously-contended lease is released.
One minor observation: the `Eventually` timeout at line 326 is 5s. Given that the mock allows `maxAllowedErrorToUpdateOwnedLease + 2` failures before success, the controller's exponential backoff could occasionally push reconciliation beyond this window, leading to flakes in slower CI environments. Consider bumping the timeout to ~10s for resilience.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/nodemaintenance_controller_test.go` around lines 295 - 337, The test "lease is held then released by another entity" can flake due to a short Eventually timeout; increase the Eventually call timeout from "5s" to "10s" (or similar) where you assert maintenance.Status.Phase/DrainProgress/ErrorOnLeaseCount to accommodate the configured mockLeaseManager failures (maxAllowedErrorToUpdateOwnedLease + 2) and controller backoff behavior so the test is more resilient in slower CI.
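A sketch of the bumped timeout, assuming a Gomega `Eventually` polling the CR status; `k8sClient`, `nmKey`, and the import alias are assumed names, not verified against the actual suite:

```go
// Sketch only: widen the polling window from 5s to 10s so controller backoff
// cannot outlast the assertion. k8sClient, nmKey, and the import alias are
// assumed names, not verified against the suite.
Eventually(func() (int, error) {
	nm := &nodemaintenanceapi.NodeMaintenance{}
	err := k8sClient.Get(context.Background(), nmKey, nm)
	return nm.Status.ErrorOnLeaseCount, err
}, "10s", "250ms").Should(BeNumerically(">", maxAllowedErrorToUpdateOwnedLease))
```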
489-508: `requestFailCount` is not thread-safe, but likely okay for single-CR reconciliation. The mock increments `requestFailCount` without a mutex. This is fine as long as only one reconcile goroutine accesses it at a time (which is the case for a single CR with the default controller concurrency of 1). Just noting for awareness if tests ever run concurrent reconcilers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/nodemaintenance_controller_test.go` around lines 489 - 508, The mockLeaseManager's RequestLease mutates requestFailCount without synchronization, which is unsafe if RequestLease can be called concurrently; make requestFailCount access thread-safe by adding a sync.Mutex or using atomic operations on requestFailCount inside the mockLeaseManager and guard increments/reads in RequestLease (and any other helpers) accordingly so concurrent reconciles won't race on requestFailCount.
266-270: Set holder identity in test error initialization if a public constructor becomes available. The `AlreadyHeldError.Error()` method includes the holder identity in the error message: `"can't update or invalidate the lease because it is held by different owner: %s"`. Currently, the test initializes this error with an empty `holderIdentity`, resulting in less diagnostic output during test failures. Since `holderIdentity` is an unexported field and the test runs in a different package, it cannot be set directly. This could be addressed if the lease package provides a public constructor (e.g., `NewAlreadyHeldError(holderIdentity string)`) in the future.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/nodemaintenance_controller_test.go` around lines 266 - 270, Update the test to initialize AlreadyHeldError with a non-empty holder identity once the lease package exposes a public constructor (e.g., NewAlreadyHeldError); replace the current direct struct literals used for requestLeaseErr and invalidateLeaseErr on the mockLeaseManager (see mockLeaseManager, r.LeaseManager, requestLeaseErr, invalidateLeaseErr) by calling the new constructor so the error messages include the holder identity (e.g., NewAlreadyHeldError("test-holder")).
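If such a constructor is ever added, it might look roughly like this; the snippet mirrors the error message quoted above, but the constructor itself is hypothetical and the package layout illustrative:

```go
// Hypothetical addition to the lease package; it mirrors the existing error message
// quoted above, but the constructor does not exist today.
package lease

import "fmt"

type AlreadyHeldError struct {
	holderIdentity string
}

func (e AlreadyHeldError) Error() string {
	return fmt.Sprintf("can't update or invalidate the lease because it is held by different owner: %s", e.holderIdentity)
}

// NewAlreadyHeldError lets callers outside the package (e.g. tests) build the error
// with a meaningful holder identity.
func NewAlreadyHeldError(holderIdentity string) AlreadyHeldError {
	return AlreadyHeldError{holderIdentity: holderIdentity}
}
```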
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controllers/nodemaintenance_controller.go`:
- Around line 388-397: stopNodeMaintenanceOnDeletion currently calls
r.LeaseManager.InvalidateLease for the node-not-found path but does not handle
lease.AlreadyHeldError, so a lease held by another entity can cause cleanup to
fail; update stopNodeMaintenanceOnDeletion to mirror stopNodeMaintenanceImp by
catching the error from r.LeaseManager.InvalidateLease, use errors.As to check
for lease.AlreadyHeldError, log an informational message (e.g., via
r.logger.Info) and continue cleanup when that error is encountered, otherwise
return the error as before.
---
Outside diff comments:
In `@controllers/nodemaintenance_controller.go`:
- Around line 190-221: Add a clarifying comment explaining why
nm.Status.ErrorOnLeaseCount is reset to 0 when err != nil (non-AlreadyHeldError)
in the obtainLease handling: state that ErrorOnLeaseCount intentionally counts
only consecutive lease.AlreadyHeldError occurrences (used by obtainLease and the
maxAllowedErrorToUpdateOwnedLease logic) and should be cleared on any other
transient or different error to restart the consecutive-held-error tally; place
this comment immediately before the branch that logs "failed to request lease"
(around the err != nil handling) and reference ErrorOnLeaseCount, obtainLease,
and maxAllowedErrorToUpdateOwnedLease so future readers understand the intended
behavior.
---
Nitpick comments:
In `@controllers/nodemaintenance_controller_test.go`:
- Around line 295-337: The test "lease is held then released by another entity"
can flake due to a short Eventually timeout; increase the Eventually call
timeout from "5s" to "10s" (or similar) where you assert
maintenance.Status.Phase/DrainProgress/ErrorOnLeaseCount to accommodate the
configured mockLeaseManager failures (maxAllowedErrorToUpdateOwnedLease + 2) and
controller backoff behavior so the test is more resilient in slower CI.
- Around line 489-508: The mockLeaseManager's RequestLease mutates
requestFailCount without synchronization, which is unsafe if RequestLease can be
called concurrently; make requestFailCount access thread-safe by adding a
sync.Mutex or using atomic operations on requestFailCount inside the
mockLeaseManager and guard increments/reads in RequestLease (and any other
helpers) accordingly so concurrent reconciles won't race on requestFailCount.
- Around line 266-270: Update the test to initialize AlreadyHeldError with a
non-empty holder identity once the lease package exposes a public constructor
(e.g., NewAlreadyHeldError); replace the current direct struct literals used for
requestLeaseErr and invalidateLeaseErr on the mockLeaseManager (see
mockLeaseManager, r.LeaseManager, requestLeaseErr, invalidateLeaseErr) by
calling the new constructor so the error messages include the holder identity
(e.g., NewAlreadyHeldError("test-holder")).
Actionable comments posted: 1
🧹 Nitpick comments (1)
controllers/nodemaintenance_controller_test.go (1)
489-509: Optional: Use atomic operations to guard `requestFailCount` against potential data races in concurrent testing. The test runs reconciliation in a background manager goroutine while the test goroutine may swap and restore `LeaseManager` via `DeferCleanup`. Although the current test patterns use `Eventually()` to avoid direct concurrent access to `requestFailCount`, the Go race detector tracks goroutine-level memory access. Using `atomic.Int32` for `requestFailCount` and comparing against `maxRequestFailures` as an int would eliminate any theoretical race condition and keep tests clean under the `-race` flag if it's enabled in the future.
♻️ Proposed fix using sync/atomic
+import "sync/atomic" + type mockLeaseManager struct { lease.Manager requestLeaseErr error // maxRequestFailures limits how many times RequestLease returns requestLeaseErr. // 0 means unlimited (fail forever). When requestFailCount reaches maxRequestFailures, // RequestLease returns nil — simulating the lease being released. maxRequestFailures int - requestFailCount int + requestFailCount atomic.Int32 invalidateLeaseErr error } func (mock *mockLeaseManager) RequestLease(_ context.Context, _ client.Object, _ time.Duration) error { if mock.requestLeaseErr != nil { - if mock.maxRequestFailures <= 0 || mock.requestFailCount < mock.maxRequestFailures { - mock.requestFailCount++ + if mock.maxRequestFailures <= 0 || int(mock.requestFailCount.Load()) < mock.maxRequestFailures { + mock.requestFailCount.Add(1) return mock.requestLeaseErr } } return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/nodemaintenance_controller_test.go` around lines 489 - 509, The mockLeaseManager’s requestFailCount is incremented without synchronization which can cause race detector failures when RequestLease runs concurrently; replace the int requestFailCount with an atomic counter (e.g., atomic.Int32 or use atomic package functions) and update RequestLease to load/compare and increment the counter atomically when checking against maxRequestFailures, keeping maxRequestFailures as an int for comparison (convert types as needed) and ensuring all accesses to requestFailCount use the atomic API.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controllers/nodemaintenance_controller_test.go`:
- Around line 316-336: The verifyEvent call that checks for the
FailedMaintenance event (using fakeRecorder.Events / isEventOccurred) is a
point-in-time assertion and can be flaky; modify the test so the verifyEvent
invocation is wrapped in an Eventually with a short timeout and poll interval
(e.g., Eventually(func() { verifyEvent(corev1.EventTypeWarning,
utils.EventReasonFailedMaintenance, utils.EventMessageFailedMaintenance) },
"5s", "200ms").Should(Succeed()) ) so the test waits for the reconciler to emit
the event before proceeding; update the corresponding test case around the
verifyEvent call in the It block that asserts lease contention recovery.
---
Nitpick comments:
In `@controllers/nodemaintenance_controller_test.go`:
- Around line 489-509: The mockLeaseManager’s requestFailCount is incremented
without synchronization which can cause race detector failures when RequestLease
runs concurrently; replace the int requestFailCount with an atomic counter
(e.g., atomic.Int32 or use atomic package functions) and update RequestLease to
load/compare and increment the counter atomically when checking against
maxRequestFailures, keeping maxRequestFailures as an int for comparison (convert
types as needed) and ensuring all accesses to requestFailCount use the atomic
API.
…tion
stopNodeMaintenanceOnDeletion did not handle AlreadyHeldError from InvalidateLease, so when another entity (e.g. NHC) holds the lease during node deletion cleanup, the error caused the entire cleanup to fail. Additionally, the verifyEvent call in the lease-recovery test was a point-in-time check that could flake if the reconciler hadn't emitted the event yet.
What:
- Mirror the AlreadyHeldError handling from stopNodeMaintenanceImp in stopNodeMaintenanceOnDeletion: log and continue cleanup instead of returning the error
- Wrap the FailedMaintenance verifyEvent in Eventually so the test retries until the reconciler emits the event
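The Eventually wrapper mentioned in the commit could look roughly like this; `eventEmitted` is a hypothetical predicate standing in for the suite's event-checking helper (fakeRecorder.Events / isEventOccurred), whose actual signature may differ:

```go
// Sketch: retry until the reconciler has emitted the warning event instead of
// checking the recorder once. eventEmitted is a hypothetical predicate over
// fakeRecorder.Events; the real suite helper may differ.
Eventually(func(g Gomega) {
	g.Expect(eventEmitted(corev1.EventTypeWarning, utils.EventReasonFailedMaintenance)).To(BeTrue())
}, "5s", "200ms").Should(Succeed())
```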
/test 4.22-openshift-e2e
clobrano left a comment
I left a suggestion, not mandatory, but I think it should improve readability
/test 4.22-openshift-e2e
Bootstrap failed, so we will retry.
```go
if err != nil {
	return r.onReconcileError(ctx, nm, drainer, fmt.Errorf("failed to uncordon upon failure to obtain owned lease : %v ", err))
var alreadyHeldErr lease.AlreadyHeldError
err = r.obtainLease(ctx, node)
```
At this point obtainLease is mostly an empty function. I suggest removing it and calling `r.LeaseManager.RequestLease(ctx, node, LeaseDuration)` directly. I won't insist, of course, if you prefer otherwise.
+1, that method is useless now (and I would block on this ;) )
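In other words, something along these lines (a fragment; the surrounding error handling is as discussed elsewhere in this thread):

```go
// Sketch of the suggested simplification: drop the thin obtainLease wrapper and
// call the lease manager directly, handling the error where it is returned.
err = r.LeaseManager.RequestLease(ctx, node, LeaseDuration)
if err != nil {
	var alreadyHeldErr lease.AlreadyHeldError
	if errors.As(err, &alreadyHeldErr) {
		// contention handling as discussed in this thread
	}
	// ... other error handling
}
```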
```go
err = r.stopNodeMaintenanceImp(ctx, drainer, node)
if err != nil {
	return r.onReconcileError(ctx, nm, drainer, fmt.Errorf("failed to uncordon upon failure to obtain owned lease : %v ", err))
var alreadyHeldErr lease.AlreadyHeldError
```
can be moved into the `err != nil` block
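i.e., roughly (a sketch of the suggested scoping):

```go
// Sketch: declare the variable only inside the branch that needs it.
if err != nil {
	var alreadyHeldErr lease.AlreadyHeldError
	if errors.As(err, &alreadyHeldErr) {
		// lease is held by another owner
	}
	// ... other error handling
}
```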
```go
r.logger.Info("can't extend owned lease. uncordon for now")

// Uncordon the node
err = r.stopNodeMaintenanceImp(ctx, drainer, node)
```
We need to take a step back... this does not make sense here, IMHO.
There are 2 ways the lease request can fail:
- AlreadyHeldError: this means the initial request to get a lease failed because it's already taken. In that case there is no reason to stop maintenance, because it never started. We just need to retry after some time. I'm not even sure we should increase the error count in this case...
- any other error: something else happened when getting or renewing the lease. Increase the error count and retry with backoff. When the error count exceeds the limit, stop maintenance if it was started and give up.
WDYT?
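In code, that distinction might look roughly like this; the requeue interval and the threshold name are illustrative, not taken from the controller:

```go
// Sketch of the two branches described above.
var alreadyHeldErr lease.AlreadyHeldError
if errors.As(err, &alreadyHeldErr) {
	// Initial lease is already taken: maintenance never started, so there is
	// nothing to stop. Just requeue and try again later.
	return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
}
// Any other error while getting or renewing the lease: count it and retry with
// backoff; once the count exceeds the limit, stop maintenance if it was started.
nm.Status.ErrorOnLeaseCount++
if nm.Status.ErrorOnLeaseCount > maxAllowedErrorToUpdateOwnedLease {
	// stop maintenance and give up
}
```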
…nit-tests
Increment lease errors when the lease was taken after it had been obtained and maintenance gave up, or due to any other error while obtaining the lease. There is no need to increment it when the lease was taken the first time, allowing NMO to continuously try to begin maintenance. Include testing for when the lease is transiently lost during maintenance and when it is permanently lost during maintenance. Moreover, it includes some small fixes in testing.
Force-pushed: c11eff5 to 9238928
/test 4.22-openshift-e2e
Why we need this PR
ErrorOnLeaseCount of the NM CR status was never incremented when obtainLease never returned a 'true' value. As a result, NMO kept trying to get a lease and did not stop until a change in the node happened.
Changes made
- AlreadyHeldError handling in the InvalidateLease function on NM CR cleanup

Which issue(s) this PR fixes
RHWA-744
Test plan
Summary by CodeRabbit