Adding CancelDrainTask to ASG termination to close orphaned generated heartbeat from nodes failing to cordon and drain #1173
Issue
Fixes #1172
Problem Description
The Node Termination Handler has a critical bug in ASG termination event handling that creates orphaned heartbeat goroutines when node drain operations fail.
Current Behavior (Buggy)
When an ASG termination event fails to drain a node:
1. PreDrainTask starts a heartbeat goroutine
2. cordonAndDrainNode fails to evict pods
3. CancelInterruptionEvent removes the event but never stops the heartbeat
Impact
Solution
Implemented a CancelDrainTask mechanism that mirrors the existing PreDrainTask/PostDrainTask pattern to properly terminate heartbeats on drain failures.
Key Changes
pkg/monitor/sqsevent/asg-lifecycle-event.go
- cancelHeartbeatCh channel for heartbeat cancellation
- CancelDrainTask function to close the cancel channel
- SendHeartbeats to listen for cancellation signals
pkg/interruptionevent/draincordon/handler.go
- RunCancelDrainTask when drain operations fail and CancelDrainTask exists
pkg/monitor/sqsevent/sqs-monitor_test.go
- CancelDrainTask creation and execution
Testing
Automated Tests (All Passing)
- make unit-test
- make e2e-test
- make compatibility-test
- make license-test
- make go-linter
- make helm-lint
- make spellcheck
Tested on: macOS (ARM64) (also ran make unit-test on Linux x86_64)
Kubernetes Version: 1.30
Manual Validation
Scenario: Deployed NTH in EKS cluster and blocked Kubernetes API calls to simulate drain failures
Before Fix:
After Fix:
Backward Compatibility
- CancelDrainTask is optional (nil-safe)
Code Implementation
Possible Reproduction Steps (for verification):
- deleteSqsMsgIfNodeNotFound=false
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.