
NMO Doesn't Increment ErrorOnLeaseCount to Stop Maintenance#158

Draft
razo7 wants to merge 5 commits into medik8s:main from razo7:fix-lease-obtaining-function

Conversation

@razo7
Member

@razo7 razo7 commented Feb 15, 2026

Why we need this PR

The ErrorOnLeaseCount field of the NM CR status was never incremented when obtainLease never returned a 'true' value. As a result, NMO kept trying to get a lease and did not stop until something changed on the node.

Changes made

  • Simplify the obtainLease function
  • Add logic to distinguish an AlreadyHeldError
    • so that ErrorOnLeaseCount can be incremented and NMO uncordons the node and reaches the Failed maintenance phase
    • so that we skip the lease invalidation logic after the InvalidateLease call on NM CR cleanup

Which issue(s) this PR fixes

RHWA-744

Test plan

  • New unit test when the lease is already held by another entity
  • New unit test when the lease is already held by another entity, released, and then NMO takes the lease

Summary by CodeRabbit

  • Bug Fixes
    • Improved lease contention handling: already-held leases are treated as expected contention, error counters and cleanup now behave more reliably, preventing spurious failures and ensuring correct maintenance phase transitions.
  • Tests
    • Added scenarios for lease contention and subsequent recovery, validating error counting, maintenance failure/success transitions, emitted events, and node cordon/taint and drain progress.

…LeaseCount can be incremented

ErrorOnLeaseCount of NM CR status was never incremented when obtainLease never returned a 'true' value
@openshift-ci
Contributor

openshift-ci bot commented Feb 15, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Feb 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Feb 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

obtainLease signature changed to return only an error; lease.AlreadyHeldError is detected and treated as a contention (increments ErrorOnLeaseCount and can drive MaintenanceFailed), successful lease advances to MaintenanceRunning and resets counters, and InvalidateLease skips invalidation on AlreadyHeldError. Tests updated to cover contention and recovery.

Changes

  • Controller: lease semantics — controllers/nodemaintenance_controller.go
    Changed the obtainLease signature to return only an error; detect lease.AlreadyHeldError via errors.As and treat it as lease contention (increment ErrorOnLeaseCount, adjust phase transitions); on stop/cleanup, skip invalidation when AlreadyHeldError occurs; added errors import.
  • Tests: suite init — controllers/controllers_suite_test.go
    Use a keyed composite literal for mockLeaseManager initialization (Manager: mockManager) to avoid positional dependency.
  • Tests: lease contention & recovery — controllers/nodemaintenance_controller_test.go
    Add tests for "lease already held" and "transient contention then release"; extend mockLeaseManager with requestLeaseErr, maxRequestFailures, requestFailCount, and invalidateLeaseErr to simulate failures and release; update assertions for ErrorOnLeaseCount, maintenance phase transitions, events, and node cordon/taint behavior.
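The counting mock described for the recovery test can be sketched as a stand-alone type. This is a minimal sketch assuming the field names from the summary above; the real mock embeds lease.Manager and RequestLease takes a context, object, and duration.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyHeld = errors.New("lease already held by another entity")

// countingLeaseMock fails RequestLease a bounded number of times, then
// succeeds, simulating another entity releasing the lease mid-test.
type countingLeaseMock struct {
	requestLeaseErr error
	// maxRequestFailures limits how many times RequestLease returns
	// requestLeaseErr; 0 means unlimited (fail forever).
	maxRequestFailures int
	requestFailCount   int
}

func (m *countingLeaseMock) RequestLease() error {
	if m.requestLeaseErr != nil {
		if m.maxRequestFailures <= 0 || m.requestFailCount < m.maxRequestFailures {
			m.requestFailCount++
			return m.requestLeaseErr
		}
	}
	return nil
}

func main() {
	mock := &countingLeaseMock{requestLeaseErr: errAlreadyHeld, maxRequestFailures: 2}
	for i := 0; i < 3; i++ {
		fmt.Println(mock.RequestLease()) // fails twice, then returns <nil>
	}
}
```

With maxRequestFailures left at 0, the same mock drives the "lease already held indefinitely" test, where ErrorOnLeaseCount should climb past the threshold.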

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler
    participant LeaseManager
    participant NodeAPI
    participant StatusEvent

    Reconciler->>LeaseManager: RequestLease(node)
    alt Lease obtained
        LeaseManager-->>Reconciler: nil
        Reconciler->>NodeAPI: Cordon/Taint node
        Reconciler->>StatusEvent: Set MaintenanceRunning, reset ErrorOnLeaseCount
    else Lease already held (AlreadyHeldError)
        LeaseManager-->>Reconciler: AlreadyHeldError
        Reconciler->>StatusEvent: Increment ErrorOnLeaseCount, possibly set MaintenanceFailed, emit event
    else Other error
        LeaseManager-->>Reconciler: error
        Reconciler->>StatusEvent: Record error, emit event
    end

    Note over Reconciler,LeaseManager: On stop/cleanup
    Reconciler->>LeaseManager: InvalidateLease(node)
    alt Invalidate succeeded
        LeaseManager-->>Reconciler: nil
    else Invalidate AlreadyHeldError
        LeaseManager-->>Reconciler: AlreadyHeldError (logged, skipped)
    else Invalidate other error
        LeaseManager-->>Reconciler: error (logged)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • slintes
  • beekhof
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: The title directly and specifically describes the main change: fixing the logic so that ErrorOnLeaseCount is properly incremented when lease contention occurs, which was the core problem addressed in this PR.
  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@razo7 razo7 changed the title from "Fix obtainLease Logic to Detect AlreadyHeldError" to "NMO Doesn't Increment ErrorOnLeaseCount to Stop Maintenance" on Feb 16, 2026

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
controllers/nodemaintenance_controller_test.go (2)

265-268: Unchecked type assertion on originalLeaseManager.

Line 267 uses originalLeaseManager.(*mockLeaseManager).Manager without a comma-ok guard. In test code this is generally acceptable since a panic would surface as a clear test failure, but it's worth noting that this assumes the suite always initializes r.LeaseManager as *mockLeaseManager.


263-292: Test scenario is well-structured and follows existing patterns.

The mock setup with DeferCleanup to restore the original LeaseManager is clean. The Eventually block properly waits for the async reconciliation loop to increment ErrorOnLeaseCount past the threshold.

The hardcoded substring "failed to extend lease owned by us" on line 291 couples this test to the controller's error message (line 206 of nodemaintenance_controller.go). Consider extracting this string as a shared constant to reduce fragility if the message needs to change in the future.

@razo7
Member Author

razo7 commented Feb 16, 2026

/test 4.22-openshift-e2e

@razo7 razo7 force-pushed the fix-lease-obtaining-function branch from a19ca55 to 7747b91 on February 16, 2026 13:49
@razo7
Member Author

razo7 commented Feb 16, 2026

/test 4.22-openshift-e2e


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
controllers/nodemaintenance_controller_test.go (1)

265-268: Unchecked type assertion could panic if test setup changes.

Line 267 uses a bare type assertion originalLeaseManager.(*mockLeaseManager) which will panic if r.LeaseManager isn't a *mockLeaseManager. Consider the comma-ok form for a clearer failure message, though this is minor given the controlled test context.

♻️ Optional: safer type assertion
-				r.LeaseManager = &mockLeaseManager{
-					Manager:         originalLeaseManager.(*mockLeaseManager).Manager,
-					requestLeaseErr: lease.AlreadyHeldError{},
-				}
+				mock, ok := originalLeaseManager.(*mockLeaseManager)
+				Expect(ok).To(BeTrue(), "expected LeaseManager to be *mockLeaseManager")
+				r.LeaseManager = &mockLeaseManager{
+					Manager:         mock.Manager,
+					requestLeaseErr: lease.AlreadyHeldError{},
+				}

@slintes
Member

slintes commented Feb 17, 2026

Can you please add another test case which tests if NMO starts maintenance when a lease, which was taken by someone else, is released?

…ctor obtainLease

Fixing a false positive test result caused by InvalidateLease silently succeeding in the test environment. We fix that by not erroring on invalidation of a lease that NMO doesn't possess, so we can stop the maintenance CR and record it as failed.
@razo7 razo7 force-pushed the fix-lease-obtaining-function branch from 7747b91 to 35aafb2 on February 17, 2026 10:28
Test the NMO-NHC lease coordination scenarios introduced by the recent
obtainLease and stopNodeMaintenanceImp fixes.

Why: The previous commits (d8b7eed, 3e7fc3f) fixed lease contention
handling but had no test coverage. Without tests, regressions in
ErrorOnLeaseCount incrementing, Phase=Failed transition, or
lease-release recovery could go undetected.

What:
- Add "lease is already held" test: verifies that when RequestLease
  returns AlreadyHeldError indefinitely, ErrorOnLeaseCount exceeds
  the threshold and Phase transitions to MaintenanceFailed
- Add "lease is held then released" test: verifies that after
  temporary lease contention (simulated via a counting mock that
  returns AlreadyHeldError N times then nil), NMO recovers —
  resets ErrorOnLeaseCount, cordons the node, drains pods, and
  reaches MaintenanceSucceeded
- Extend mockLeaseManager with requestLeaseErr, invalidateLeaseErr,
  and maxRequestFailures fields to support both scenarios
- Use named fields in suite_test.go for mockLeaseManager init
@razo7 razo7 force-pushed the fix-lease-obtaining-function branch from 35aafb2 to af10e22 on February 17, 2026 10:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
controllers/nodemaintenance_controller.go (1)

190-221: ⚠️ Potential issue | 🟡 Minor

ErrorOnLeaseCount reset on non-AlreadyHeldError needs clarification

The counter resets to 0 both when a non-AlreadyHeldError occurs (line 211) and when the lease is obtained successfully (line 216). While the success case includes a comment explaining the intent ("Another chance to evict pods"), the reset on non-AlreadyHeldError lacks explanation.

This means if the controller is counting consecutive AlreadyHeldError failures and then encounters a different transient error (e.g., network timeout), the count resets and the threshold check restarts from zero. Based on the code structure and test expectations, this appears to be intentional — ErrorOnLeaseCount is designed to track only consecutive AlreadyHeldError failures, not all lease-related errors. However, adding a clarifying comment at line 211 would make the logic explicit and easier to maintain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/nodemaintenance_controller.go` around lines 190 - 221, Add a
clarifying comment explaining why nm.Status.ErrorOnLeaseCount is reset to 0 when
err != nil (non-AlreadyHeldError) in the obtainLease handling: state that
ErrorOnLeaseCount intentionally counts only consecutive lease.AlreadyHeldError
occurrences (used by obtainLease and the maxAllowedErrorToUpdateOwnedLease
logic) and should be cleared on any other transient or different error to
restart the consecutive-held-error tally; place this comment immediately before
the branch that logs "failed to request lease" (around the err != nil handling)
and reference ErrorOnLeaseCount, obtainLease, and
maxAllowedErrorToUpdateOwnedLease so future readers understand the intended
behavior.
🧹 Nitpick comments (3)
controllers/nodemaintenance_controller_test.go (3)

295-337: Good coverage of the recovery path after lease release — addresses the reviewer's request.

This directly covers the scenario slintes requested: NMO starts maintenance when a previously-contended lease is released.

One minor observation: the Eventually timeout at line 326 is 5s. Given that the mock allows maxAllowedErrorToUpdateOwnedLease + 2 failures before success, the controller's exponential backoff could occasionally push reconciliation beyond this window, leading to flakes in slower CI environments. Consider bumping the timeout to ~10s for resilience.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/nodemaintenance_controller_test.go` around lines 295 - 337, The
test "lease is held then released by another entity" can flake due to a short
Eventually timeout; increase the Eventually call timeout from "5s" to "10s" (or
similar) where you assert
maintenance.Status.Phase/DrainProgress/ErrorOnLeaseCount to accommodate the
configured mockLeaseManager failures (maxAllowedErrorToUpdateOwnedLease + 2) and
controller backoff behavior so the test is more resilient in slower CI.

489-508: requestFailCount is not thread-safe, but likely okay for single-CR reconciliation.

The mock increments requestFailCount without a mutex. This is fine as long as only one reconcile goroutine accesses it at a time (which is the case for a single CR with the default controller concurrency of 1). Just noting for awareness if tests ever run concurrent reconcilers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/nodemaintenance_controller_test.go` around lines 489 - 508, The
mockLeaseManager's RequestLease mutates requestFailCount without
synchronization, which is unsafe if RequestLease can be called concurrently;
make requestFailCount access thread-safe by adding a sync.Mutex or using atomic
operations on requestFailCount inside the mockLeaseManager and guard
increments/reads in RequestLease (and any other helpers) accordingly so
concurrent reconciles won't race on requestFailCount.

266-270: Set holder identity in test error initialization if a public constructor becomes available.

The AlreadyHeldError.Error() method includes the holder identity in the error message: "can't update or invalidate the lease because it is held by different owner: %s". Currently, the test initializes this error with an empty holderIdentity, resulting in less diagnostic output during test failures. Since holderIdentity is an unexported field and the test runs in a different package, it cannot be set directly. This could be addressed if the lease package provides a public constructor (e.g., NewAlreadyHeldError(holderIdentity string)) in the future.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/nodemaintenance_controller_test.go` around lines 266 - 270,
Update the test to initialize AlreadyHeldError with a non-empty holder identity
once the lease package exposes a public constructor (e.g., NewAlreadyHeldError);
replace the current direct struct literals used for requestLeaseErr and
invalidateLeaseErr on the mockLeaseManager (see mockLeaseManager,
r.LeaseManager, requestLeaseErr, invalidateLeaseErr) by calling the new
constructor so the error messages include the holder identity (e.g.,
NewAlreadyHeldError("test-holder")).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@controllers/nodemaintenance_controller.go`:
- Around line 388-397: stopNodeMaintenanceOnDeletion currently calls
r.LeaseManager.InvalidateLease for the node-not-found path but does not handle
lease.AlreadyHeldError, so a lease held by another entity can cause cleanup to
fail; update stopNodeMaintenanceOnDeletion to mirror stopNodeMaintenanceImp by
catching the error from r.LeaseManager.InvalidateLease, use errors.As to check
for lease.AlreadyHeldError, log an informational message (e.g., via
r.logger.Info) and continue cleanup when that error is encountered, otherwise
return the error as before.

---

Outside diff comments:
In `@controllers/nodemaintenance_controller.go`:
- Around line 190-221: Add a clarifying comment explaining why
nm.Status.ErrorOnLeaseCount is reset to 0 when err != nil (non-AlreadyHeldError)
in the obtainLease handling: state that ErrorOnLeaseCount intentionally counts
only consecutive lease.AlreadyHeldError occurrences (used by obtainLease and the
maxAllowedErrorToUpdateOwnedLease logic) and should be cleared on any other
transient or different error to restart the consecutive-held-error tally; place
this comment immediately before the branch that logs "failed to request lease"
(around the err != nil handling) and reference ErrorOnLeaseCount, obtainLease,
and maxAllowedErrorToUpdateOwnedLease so future readers understand the intended
behavior.

---

Nitpick comments:
In `@controllers/nodemaintenance_controller_test.go`:
- Around line 295-337: The test "lease is held then released by another entity"
can flake due to a short Eventually timeout; increase the Eventually call
timeout from "5s" to "10s" (or similar) where you assert
maintenance.Status.Phase/DrainProgress/ErrorOnLeaseCount to accommodate the
configured mockLeaseManager failures (maxAllowedErrorToUpdateOwnedLease + 2) and
controller backoff behavior so the test is more resilient in slower CI.
- Around line 489-508: The mockLeaseManager's RequestLease mutates
requestFailCount without synchronization, which is unsafe if RequestLease can be
called concurrently; make requestFailCount access thread-safe by adding a
sync.Mutex or using atomic operations on requestFailCount inside the
mockLeaseManager and guard increments/reads in RequestLease (and any other
helpers) accordingly so concurrent reconciles won't race on requestFailCount.
- Around line 266-270: Update the test to initialize AlreadyHeldError with a
non-empty holder identity once the lease package exposes a public constructor
(e.g., NewAlreadyHeldError); replace the current direct struct literals used for
requestLeaseErr and invalidateLeaseErr on the mockLeaseManager (see
mockLeaseManager, r.LeaseManager, requestLeaseErr, invalidateLeaseErr) by
calling the new constructor so the error messages include the holder identity
(e.g., NewAlreadyHeldError("test-holder")).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
controllers/nodemaintenance_controller_test.go (1)

489-509: Optional: Use atomic operations to guard requestFailCount against potential data races in concurrent testing.

The test runs reconciliation in a background manager goroutine while the test goroutine may swap and restore LeaseManager via DeferCleanup. Although the current test patterns use Eventually() to avoid direct concurrent access to requestFailCount, the Go race detector tracks goroutine-level memory access. Using atomic.Int32 for both requestFailCount and comparing against maxRequestFailures as an int would eliminate any theoretical race condition and keep tests clean under -race flag if it's enabled in the future.

♻️ Proposed fix using sync/atomic
+import "sync/atomic"
+
 type mockLeaseManager struct {
 	lease.Manager
 	requestLeaseErr error
 	// maxRequestFailures limits how many times RequestLease returns requestLeaseErr.
 	// 0 means unlimited (fail forever). When requestFailCount reaches maxRequestFailures,
 	// RequestLease returns nil — simulating the lease being released.
 	maxRequestFailures int
-	requestFailCount   int
+	requestFailCount   atomic.Int32
 
 	invalidateLeaseErr error
 }
 
 func (mock *mockLeaseManager) RequestLease(_ context.Context, _ client.Object, _ time.Duration) error {
 	if mock.requestLeaseErr != nil {
-		if mock.maxRequestFailures <= 0 || mock.requestFailCount < mock.maxRequestFailures {
-			mock.requestFailCount++
+		if mock.maxRequestFailures <= 0 || int(mock.requestFailCount.Load()) < mock.maxRequestFailures {
+			mock.requestFailCount.Add(1)
 			return mock.requestLeaseErr
 		}
 	}
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/nodemaintenance_controller_test.go` around lines 489 - 509, The
mockLeaseManager’s requestFailCount is incremented without synchronization which
can cause race detector failures when RequestLease runs concurrently; replace
the int requestFailCount with an atomic counter (e.g., atomic.Int32 or use
atomic package functions) and update RequestLease to load/compare and increment
the counter atomically when checking against maxRequestFailures, keeping
maxRequestFailures as an int for comparison (convert types as needed) and
ensuring all accesses to requestFailCount use the atomic API.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@controllers/nodemaintenance_controller_test.go`:
- Around line 316-336: The verifyEvent call that checks for the
FailedMaintenance event (using fakeRecorder.Events / isEventOccurred) is a
point-in-time assertion and can be flaky; modify the test so the verifyEvent
invocation is wrapped in an Eventually with a short timeout and poll interval
(e.g., Eventually(func() { verifyEvent(corev1.EventTypeWarning,
utils.EventReasonFailedMaintenance, utils.EventMessageFailedMaintenance) },
"5s", "200ms").Should(Succeed()) ) so the test waits for the reconciler to emit
the event before proceeding; update the corresponding test case around the
verifyEvent call in the It block that asserts lease contention recovery.

---

Nitpick comments:
In `@controllers/nodemaintenance_controller_test.go`:
- Around line 489-509: The mockLeaseManager’s requestFailCount is incremented
without synchronization which can cause race detector failures when RequestLease
runs concurrently; replace the int requestFailCount with an atomic counter
(e.g., atomic.Int32 or use atomic package functions) and update RequestLease to
load/compare and increment the counter atomically when checking against
maxRequestFailures, keeping maxRequestFailures as an int for comparison (convert
types as needed) and ensuring all accesses to requestFailCount use the atomic
API.

…tion

stopNodeMaintenanceOnDeletion did not handle AlreadyHeldError from
InvalidateLease, so when another entity (e.g. NHC) holds the lease
during node deletion cleanup, the error caused the entire cleanup to
fail. Additionally, the verifyEvent call in the lease-recovery test was
a point-in-time check that could flake if the reconciler hadn't emitted
the event yet.
What:
Mirror the AlreadyHeldError handling from stopNodeMaintenanceImp in
stopNodeMaintenanceOnDeletion: log and continue cleanup instead of
returning the error
Wrap the FailedMaintenance verifyEvent in Eventually so the test
retries until the reconciler emits the event
@razo7
Member Author

razo7 commented Feb 17, 2026

/test 4.22-openshift-e2e

Contributor

@clobrano clobrano left a comment


I left a suggestion, not mandatory, but I think it should improve readability

@razo7
Member Author

razo7 commented Feb 18, 2026

/test 4.22-openshift-e2e

@razo7
Member Author

razo7 commented Feb 18, 2026

error destroying bootstrap resources failed during the destroy bootstrap hook: bootstrap ssh rule was not removed within 15m0s: failed to remove bootstrap SSH rule: failed to get AWSCluster: client rate limiter Wait returned an error: context deadline exceeded

Bootstrap failed, so we will retry
/test 4.22-openshift-e2e

if err != nil {
return r.onReconcileError(ctx, nm, drainer, fmt.Errorf("failed to uncordon upon failure to obtain owned lease : %v ", err))
var alreadyHeldErr lease.AlreadyHeldError
err = r.obtainLease(ctx, node)
Contributor

@clobrano clobrano Feb 18, 2026


At this point obtainLease is mostly an empty function. I suggest to remove it and call r.LeaseManager.RequestLease(ctx, node, LeaseDuration)directly. I won't stop of course if you prefer otherwise.

Member


+1, that method is useless now (and I would block on this ;) )

Member Author


SGTM

err = r.stopNodeMaintenanceImp(ctx, drainer, node)
if err != nil {
return r.onReconcileError(ctx, nm, drainer, fmt.Errorf("failed to uncordon upon failure to obtain owned lease : %v ", err))
var alreadyHeldErr lease.AlreadyHeldError
Member


can be moved into the err != nil block

r.logger.Info("can't extend owned lease. uncordon for now")

// Uncordon the node
err = r.stopNodeMaintenanceImp(ctx, drainer, node)
Member


We need to take a step back.... this does not make sense here IMHO.

There are 2 ways the lease request can fail:

  • AlreadyHeldError: this means that the initial request to get a lease failed because it's taken already. In the case there is no reason to stop maintenance, because it never started. We just need to retry after some time. I'm not even sure if we should increase the error count in this case...
  • any other error: something else happened when getting or renewing the lease. Increase error count, retry with backoff. When error count exceeded the limit, stop maintenance if it was started and give up.

WDYT?

Member Author

@razo7 razo7 Feb 18, 2026


I agree

…nit-tests

Increment lease errors when the lease was taken after it was obtained and maintenance gave up, or on any other error while obtaining the lease. We don't need to increment it when the lease was already taken on the first attempt, allowing NMO to continuously try to begin maintenance. Include tests for when the lease is transiently lost during maintenance and when it is permanently lost during maintenance. Moreover, it includes some small fixes in testing.
@razo7 razo7 force-pushed the fix-lease-obtaining-function branch from c11eff5 to 9238928 on February 18, 2026 15:51
@razo7
Member Author

razo7 commented Feb 18, 2026

/test 4.22-openshift-e2e
