Describe the bug
We use an ASG instance launch lifecycle hook and noticed that NTH checks node readiness only 3 times during the first ~40 seconds after the initial ASG_LAUNCH_LIFECYCLE message is received (i.e. before the node becomes Ready), and then never checks again.
Even though the node reaches Ready status soon after the last NTH check, the lifecycle hook heartbeat timeout eventually expires and the ASG terminates the instance (the hook's default action is ABANDON).
This seems similar to the report in #1007.
Steps to reproduce
- Configure an ASG with a launch lifecycle hook (600s timeout, ABANDON default action); an example put-lifecycle-hook command is shown after this list
- Launch instances that take >40 seconds to become Ready in Kubernetes
- Observe NTH logs reporting the node readiness check failing 3 times, with no further retries
Note
The logs below were retrieved from a local build with the following small patch applied, for additional insight:
patch git diff (click to expand)
--- a/pkg/interruptionevent/asg/launch/handler.go
+++ b/pkg/interruptionevent/asg/launch/handler.go
@@ -70,6 +70,10 @@ func (h *Handler) HandleEvent(drainEvent *monitor.InterruptionEvent) error {
}
if !isNodeReady {
h.commonHandler.InterruptionEventStore.CancelInterruptionEvent(drainEvent.EventID)
+ log.Info().
+ Str("instanceID", drainEvent.InstanceID).
+ Str("eventID", drainEvent.EventID).
+ Msg("Node not ready - canceling launch lifecycle event (ASG will retry or timeout)")
return nil
}
@@ -91,7 +95,9 @@ func (h *Handler) isNodeReady(instanceID string) (bool, error) {
}
if len(nodes) == 0 {
- log.Info().Str("instanceID", instanceID).Msg("EC2 instance not found")
+ log.Info().
+ Str("instanceID", instanceID).
+ Msg("EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node")
return false, nil
}
@@ -99,12 +105,21 @@ func (h *Handler) isNodeReady(instanceID string) (bool, error) {
conditions := node.Status.Conditions
for _, condition := range conditions {
if condition.Type == "Ready" && condition.Status != "True" {
- log.Info().Str("instanceID", instanceID).Msg("EC2 instance found, but not ready")
+ log.Info().
+ Str("instanceID", instanceID).
+ Str("nodeName", node.Name).
+ Str("readyStatus", string(condition.Status)).
+ Str("readyReason", condition.Reason).
+ Str("readyMessage", condition.Message).
+ Msg("EC2 instance found in cluster but node not ready - waiting for ready status")
return false, nil
}
}
}
- log.Info().Str("instanceID", instanceID).Msg("EC2 instance is found and ready")
+ log.Info().
+ Str("instanceID", instanceID).
+ Str("nodeName", nodes[0].Name).
+ Msg("EC2 instance is found and ready - will complete ASG launch lifecycle hook")
return true, nil
}
- Note that after ~3 retries, no further processing occurs even if the node becomes Ready
- ASG eventually times out and terminates the instance
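For reference, a launch lifecycle hook matching the setup above can be created with the AWS CLI; the hook name and ASG name below are placeholders, not the actual values from this environment:
$ aws autoscaling put-lifecycle-hook \
    --lifecycle-hook-name nth-launch-request \
    --auto-scaling-group-name my-asg \
    --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
    --heartbeat-timeout 600 \
    --default-result ABANDON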
Expected outcome
NTH should retry checking node readiness without cancelling the event, allowing SQS retries to continue until the node becomes Ready, then send CONTINUE to complete the lifecycle hook successfully.
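To make the expected behavior concrete, here is a minimal, illustrative Go sketch (not NTH's actual code); isNodeReady and completeHook are hypothetical stand-ins for the real readiness check and the CompleteLifecycleAction call:
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForReadyThenContinue keeps polling node readiness until the node is Ready
// or ctx expires (e.g. the hook's 600s heartbeat timeout), then completes the
// launch lifecycle hook with CONTINUE instead of abandoning the event.
func waitForReadyThenContinue(ctx context.Context, instanceID string,
	isNodeReady func(string) (bool, error),
	completeHook func(instanceID, result string) error) error {

	ticker := time.NewTicker(20 * time.Second) // roughly the check interval seen in the logs
	defer ticker.Stop()

	for {
		ready, err := isNodeReady(instanceID)
		if err != nil {
			return fmt.Errorf("readiness check for %s failed: %w", instanceID, err)
		}
		if ready {
			// Node is Ready: complete the hook instead of letting it time out.
			return completeHook(instanceID, "CONTINUE")
		}
		select {
		case <-ctx.Done(): // heartbeat timeout elapsed without the node becoming Ready
			return fmt.Errorf("node for %s never became Ready: %w", instanceID, ctx.Err())
		case <-ticker.C:
			// keep retrying
		}
	}
}

func main() {
	// Toy usage with stubbed dependencies; a real caller would query the
	// Kubernetes API and call the ASG CompleteLifecycleAction API instead.
	ctx, cancel := context.WithTimeout(context.Background(), 600*time.Second)
	defer cancel()

	start := time.Now()
	err := waitForReadyThenContinue(ctx, "i-004eafb13e0bbb439",
		func(string) (bool, error) { return time.Since(start) > 45*time.Second, nil },
		func(id, result string) error {
			fmt.Printf("completing launch hook for %s with %s\n", id, result)
			return nil
		},
	)
	if err != nil && !errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("error:", err)
	}
}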
Application Logs
k logs aws-nth-6668f87c7b-brdvp | rg i-004eafb13e0bbb43 (click to expand)
{
"level": "info",
"event": {
"EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"Kind": "ASG_LAUNCH_LIFECYCLE",
"Monitor": "SQS_MONITOR",
"Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
"State": "",
"AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
"NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"NodeLabels": null,
"Pods": null,
"InstanceID": "i-004eafb13e0bbb439",
"ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"InstanceType": "",
"IsManaged": true,
"StartTime": "2025-10-28T15:43:14.87Z",
"EndTime": "0001-01-01T00:00:00Z",
"NodeProcessed": false,
"InProgress": false
},
"time": "2025-10-28T15:43:15Z",
"message": "Adding new event to the event store"
}
{
"level": "info",
"event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"kind": "ASG_LAUNCH_LIFECYCLE",
"node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
"instance-id": "i-004eafb13e0bbb439",
"provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:15Z",
"message": "Requesting instance drain"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:15Z",
"message": "EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"time": "2025-10-28T15:43:15Z",
"message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
"level": "info",
"event": {
"EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"Kind": "ASG_LAUNCH_LIFECYCLE",
"Monitor": "SQS_MONITOR",
"Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
"State": "",
"AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
"NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"NodeLabels": null,
"Pods": null,
"InstanceID": "i-004eafb13e0bbb439",
"ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"InstanceType": "",
"IsManaged": true,
"StartTime": "2025-10-28T15:43:14.87Z",
"EndTime": "0001-01-01T00:00:00Z",
"NodeProcessed": false,
"InProgress": false
},
"time": "2025-10-28T15:43:35Z",
"message": "Adding new event to the event store"
}
{
"level": "info",
"event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"kind": "ASG_LAUNCH_LIFECYCLE",
"node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
"instance-id": "i-004eafb13e0bbb439",
"provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:35Z",
"message": "Requesting instance drain"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:36Z",
"message": "EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"time": "2025-10-28T15:43:36Z",
"message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
"level": "info",
"event": {
"EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"Kind": "ASG_LAUNCH_LIFECYCLE",
"Monitor": "SQS_MONITOR",
"Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
"State": "",
"AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
"NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"NodeLabels": null,
"Pods": null,
"InstanceID": "i-004eafb13e0bbb439",
"ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"InstanceType": "",
"IsManaged": true,
"StartTime": "2025-10-28T15:43:14.87Z",
"EndTime": "0001-01-01T00:00:00Z",
"NodeProcessed": false,
"InProgress": false
},
"time": "2025-10-28T15:43:55Z",
"message": "Adding new event to the event store"
}
{
"level": "info",
"event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"kind": "ASG_LAUNCH_LIFECYCLE",
"node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
"instance-id": "i-004eafb13e0bbb439",
"provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:55Z",
"message": "Requesting instance drain"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"nodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"readyStatus": "False",
"readyReason": "KubeletNotReady",
"readyMessage": "container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized",
"time": "2025-10-28T15:43:56Z",
"message": "EC2 instance found in cluster but node not ready - waiting for ready status"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"time": "2025-10-28T15:43:56Z",
"message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
"level": "info",
"event": {
"EventID": "asg-lifecycle-term-36306636363861652d313831622d656261632d653565342d303132353936316261396332",
"Kind": "ASG_LIFECYCLE",
"Monitor": "SQS_MONITOR",
"Description": "ASG Lifecycle Termination event received. Instance will be interrupted at 2025-10-28 15:53:15.796 +0000 UTC \n",
"State": "",
"AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
"NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"NodeLabels": null,
"Pods": null,
"InstanceID": "i-004eafb13e0bbb439",
"ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"InstanceType": "r6a.xlarge",
"IsManaged": true,
"StartTime": "2025-10-28T15:53:15.796Z",
"EndTime": "0001-01-01T00:00:00Z",
"NodeProcessed": false,
"InProgress": false
},
"time": "2025-10-28T15:53:15Z",
"message": "Adding new event to the event store"
}
{
"level": "info",
"event-id": "asg-lifecycle-term-36306636363861652d313831622d656261632d653565342d303132353936316261396332",
"kind": "ASG_LIFECYCLE",
"node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
"instance-id": "i-004eafb13e0bbb439",
"provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"time": "2025-10-28T15:53:16Z",
"message": "Requesting instance drain"
}
{
"level": "info",
"lifecycleHookName": "nth-termination-request",
"instanceID": "i-004eafb13e0bbb439",
"time": "2025-10-28T15:53:20Z",
"message": "Completed ASG Lifecycle Hook"
}
k8s node ip-10-31-16-42.eu-west-1.compute.internal events (click to expand)
$ kubectl get events --all-namespaces --sort-by='.lastTimestamp' \
-o custom-columns=\
NAMESPACE:.metadata.namespace,\
LAST_SEEN:.lastTimestamp,\
TYPE:.type,\
REASON:.reason,\
OBJECT:.involvedObject.kind,\
NAME:.involvedObject.name,\
MESSAGE:.message \
| grep ip-10-31-16-42
default <nil> Normal Starting Node ip-10-31-16-42.eu-west-1.compute.internal <none>
default 2025-10-28T15:43:38Z Normal NodeHasSufficientPID Node ip-10-31-16-4.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasSufficientPID
default 2025-10-28T15:43:38Z Normal Starting Node ip-10-31-16-4.eu-west-1.compute.internal Starting kubelet.
default 2025-10-28T15:43:38Z Warning InvalidDiskCapacity Node ip-10-31-16-4.eu-west-1.compute.internal invalid capacity 0 on image filesystem
default 2025-10-28T15:43:38Z Normal NodeHasSufficientMemory Node ip-10-31-16-4.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasSufficientMemory
default 2025-10-28T15:43:38Z Normal NodeHasNoDiskPressure Node ip-10-31-16-4.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasNoDiskPressure
default 2025-10-28T15:43:38Z Normal NodeAllocatableEnforced Node ip-10-31-16-4.eu-west-1.compute.internal Updated Node Allocatable limit across pods
kube-system 2025-10-28T15:43:44Z Normal Scheduled Pod ebs-csi-node-swwz9 Successfully assigned kube-system/ebs-csi-node-swwz9 to ip-10-31-16-42.eu-west-1.compute.internal
kube-system 2025-10-28T15:43:44Z Normal Scheduled Pod kube-proxy-6gg8t Successfully assigned kube-system/kube-proxy-6gg8t to ip-10-31-16-42.eu-west-1.compute.internal
fluent-bit 2025-10-28T15:43:44Z Normal Scheduled Pod fluent-bit-q4s8d Successfully assigned fluent-bit/fluent-bit-q4s8d to ip-10-31-16-42.eu-west-1.compute.internal
groundcover 2025-10-28T15:43:44Z Normal Scheduled Pod groundcover-sensor-w4dtg Successfully assigned groundcover/groundcover-sensor-w4dtg to ip-10-31-16-42.eu-west-1.compute.internal
default 2025-10-28T15:43:44Z Normal Synced Node ip-10-31-16-4.eu-west-1.compute.internal Node synced successfully
victoria-metrics-k8s-stack 2025-10-28T15:43:44Z Normal Scheduled Pod victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf Successfully assigned victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf to ip-10-31-16-42.eu-west-1.compute.internal
kube-system 2025-10-28T15:43:44Z Normal Scheduled Pod aws-node-llxtd Successfully assigned kube-system/aws-node-llxtd to ip-10-31-16-42.eu-west-1.compute.internal
kube-system 2025-10-28T15:43:44Z Normal Scheduled Pod efs-csi-node-5z5rw Successfully assigned kube-system/efs-csi-node-5z5rw to ip-10-31-16-42.eu-west-1.compute.internal
default 2025-10-28T15:43:45Z Normal RegisteredNode Node ip-10-31-16-4.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal event: Registered Node ip-10-31-16-42.eu-west-1.compute.internal in Controller
default 2025-10-28T15:43:55Z Normal UnknownInterruption Node ip-10-31-16-4.eu-west-1.compute.internal ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC
default 2025-10-28T15:43:59Z Normal NodeReady Node ip-10-31-16-42.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeReady
actions-runners-system 2025-10-28T15:44:00Z Normal Scheduled Pod fm-huge-runners-gm2kf-l7tkd Successfully assigned actions-runners-system/fm-huge-runners-gm2kf-l7tkd to ip-10-31-16-42.eu-west-1.compute.internal
default 2025-10-28T15:53:16Z Normal ASGLifecycle Node ip-10-31-16-42.eu-west-1.compute.internal ASG Lifecycle Termination event received. Instance will be interrupted at 2025-10-28 15:53:15.796 +0000 UTC
default 2025-10-28T15:53:16Z Normal PreDrain Node ip-10-31-16-42.eu-west-1.compute.internal Pre-drain task successfully executed
actions-runners-system 2025-10-28T15:53:17Z Normal PodEviction Pod fm-huge-runners-gm2kf-l7tkd Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system 2025-10-28T15:53:17Z Normal PodEviction Pod kube-proxy-6gg8t Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system 2025-10-28T15:53:17Z Normal PodEviction Pod ebs-csi-node-swwz9 Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system 2025-10-28T15:53:17Z Normal PodEviction Pod efs-csi-node-5z5rw Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
fluent-bit 2025-10-28T15:53:17Z Normal PodEviction Pod fluent-bit-q4s8d Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
groundcover 2025-10-28T15:53:17Z Normal PodEviction Pod groundcover-sensor-w4dtg Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
victoria-metrics-k8s-stack 2025-10-28T15:53:17Z Normal PodEviction Pod victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system 2025-10-28T15:53:17Z Normal PodEviction Pod aws-node-llxtd Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
default 2025-10-28T15:53:20Z Normal CordonAndDrain Node ip-10-31-16-42.eu-west-1.compute.internal Node successfully cordoned and drained
default 2025-10-28T15:53:21Z Normal PostDrain Node ip-10-31-16-42.eu-west-1.compute.internal Post-drain task successfully executed
default 2025-10-28T15:54:10Z Normal NodeNotReady Node ip-10-31-16-42.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeNotReady
default 2025-10-28T15:54:58Z Normal DeletingNode Node ip-10-31-16-42.eu-west-1.compute.internal Deleting node ip-10-31-16-42.eu-west-1.compute.internal because it does not exist in the cloud provider
default 2025-10-28T15:55:00Z Normal RemovingNode Node ip-10-31-16-42.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal event: Removing Node ip-10-31-16-42.eu-west-1.compute.internal from Controller
victoria-metrics-k8s-stack 2025-10-28T15:55:39Z Warning FailedToUpdateEndpointSlices Service victoria-metrics-k8s-stack-prometheus-node-exporter Error updating Endpoint Slices for Service victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter: skipping Pod victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf for Service victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter: Node ip-10-31-16-42.eu-west-1.compute.internal Not Found
From the K8s events, the timeline is:
15:43:38Z - Node starts joining and gets various status updates
15:43:44Z - Node synced successfully
15:43:59Z - Node became ready
15:44:00Z - Pods start getting scheduled to the node
But looking at the NTH logs:
15:43:15Z - First NTH check shows no node in K8s for the EC2 instance
15:43:36Z - Second NTH check shows same
15:43:56Z - Last NTH check shows EC2 as node "not ready" with CNI issues
15:53:15Z - Next log is the termination event (10 minutes later - lifecycle hook heartbeat timeout)
Somehow, the node was Ready at 15:43:59Z, but NTH never checked again after 15:43:56Z, and the ASG lifecycle hook eventually timed out.
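While the node still exists, the Ready transition time reported in the events can also be read directly from the node object, e.g.:
$ kubectl get node ip-10-31-16-42.eu-west-1.compute.internal \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'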
Environment
- NTH App Version: v1.25.2
- NTH Mode (IMDS/Queue processor): Queue Processor
- OS/Arch: Amazon Linux 2023 (AMI: amazon-eks-node-al2023-x86_64-standard-1.31-v20251016)
- Kubernetes version: v1.31.13
- Installation method: Helm chart 0.27.2