Describe the bug
We use an ASG instance launch lifecycle hook and noticed that NTH checks node readiness only 3 times during the first ~40 seconds after the initial ASG_LAUNCH_LIFECYCLE message is received (i.e. before the node becomes Ready), and then never checks again.
Even though the node reaches Ready status soon after the last NTH check, the lifecycle hook heartbeat timeout eventually expires and the ASG terminates the instance (the hook's default action is ABANDON).
This seems similar to the report in #1007.
Steps to reproduce
- Configure an ASG with a launch lifecycle hook (600s timeout, ABANDON default action); an example put-lifecycle-hook command is shown after this list
- Launch instances that take >40 seconds to become Ready in Kubernetes
- Observe NTH logs reporting the node readiness check failing 3 times, with no further retries
Note
The logs below were retrieved from a local build with the following small patch applied, for additional insight:
patch git diff (click to expand)
--- a/pkg/interruptionevent/asg/launch/handler.go
+++ b/pkg/interruptionevent/asg/launch/handler.go
@@ -70,6 +70,10 @@ func (h *Handler) HandleEvent(drainEvent *monitor.InterruptionEvent) error {
}
if !isNodeReady {
h.commonHandler.InterruptionEventStore.CancelInterruptionEvent(drainEvent.EventID)
+ log.Info().
+ Str("instanceID", drainEvent.InstanceID).
+ Str("eventID", drainEvent.EventID).
+ Msg("Node not ready - canceling launch lifecycle event (ASG will retry or timeout)")
return nil
}
@@ -91,7 +95,9 @@ func (h *Handler) isNodeReady(instanceID string) (bool, error) {
}
if len(nodes) == 0 {
- log.Info().Str("instanceID", instanceID).Msg("EC2 instance not found")
+ log.Info().
+ Str("instanceID", instanceID).
+ Msg("EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node")
return false, nil
}
@@ -99,12 +105,21 @@ func (h *Handler) isNodeReady(instanceID string) (bool, error) {
conditions := node.Status.Conditions
for _, condition := range conditions {
if condition.Type == "Ready" && condition.Status != "True" {
- log.Info().Str("instanceID", instanceID).Msg("EC2 instance found, but not ready")
+ log.Info().
+ Str("instanceID", instanceID).
+ Str("nodeName", node.Name).
+ Str("readyStatus", string(condition.Status)).
+ Str("readyReason", condition.Reason).
+ Str("readyMessage", condition.Message).
+ Msg("EC2 instance found in cluster but node not ready - waiting for ready status")
return false, nil
}
}
}
- log.Info().Str("instanceID", instanceID).Msg("EC2 instance is found and ready")
+ log.Info().
+ Str("instanceID", instanceID).
+ Str("nodeName", nodes[0].Name).
+ Msg("EC2 instance is found and ready - will complete ASG launch lifecycle hook")
return true, nil
}
- Note that after ~3 retries, no further processing occurs even if the node becomes Ready
- ASG eventually times out and terminates the instance
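For reference, a launch lifecycle hook matching the setup above can be created with the AWS CLI; the hook name and ASG name below are placeholders, not the actual values from this environment:
$ aws autoscaling put-lifecycle-hook \
    --lifecycle-hook-name nth-launch-request \
    --auto-scaling-group-name my-asg \
    --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
    --heartbeat-timeout 600 \
    --default-result ABANDON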
Expected outcome
NTH should retry checking node readiness without cancelling the event, allowing SQS retries to continue until the node becomes Ready, then send CONTINUE to complete the lifecycle hook successfully.
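To make the expected behavior concrete, here is a minimal, illustrative Go sketch (not NTH's actual code); isNodeReady and completeHook are hypothetical stand-ins for the real readiness check and the CompleteLifecycleAction call:
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForReadyThenContinue keeps polling node readiness until the node is Ready
// or ctx expires (e.g. the hook's 600s heartbeat timeout), then completes the
// launch lifecycle hook with CONTINUE instead of abandoning the event.
func waitForReadyThenContinue(ctx context.Context, instanceID string,
	isNodeReady func(string) (bool, error),
	completeHook func(instanceID, result string) error) error {

	ticker := time.NewTicker(20 * time.Second) // roughly the check interval seen in the logs
	defer ticker.Stop()

	for {
		ready, err := isNodeReady(instanceID)
		if err != nil {
			return fmt.Errorf("readiness check for %s failed: %w", instanceID, err)
		}
		if ready {
			// Node is Ready: complete the hook instead of letting it time out.
			return completeHook(instanceID, "CONTINUE")
		}
		select {
		case <-ctx.Done(): // heartbeat timeout elapsed without the node becoming Ready
			return fmt.Errorf("node for %s never became Ready: %w", instanceID, ctx.Err())
		case <-ticker.C:
			// keep retrying
		}
	}
}

func main() {
	// Toy usage with stubbed dependencies; a real caller would query the
	// Kubernetes API and call the ASG CompleteLifecycleAction API instead.
	ctx, cancel := context.WithTimeout(context.Background(), 600*time.Second)
	defer cancel()

	start := time.Now()
	err := waitForReadyThenContinue(ctx, "i-004eafb13e0bbb439",
		func(string) (bool, error) { return time.Since(start) > 45*time.Second, nil },
		func(id, result string) error {
			fmt.Printf("completing launch hook for %s with %s\n", id, result)
			return nil
		},
	)
	if err != nil && !errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("error:", err)
	}
}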
Application Logs
k logs aws-nth-6668f87c7b-brdvp | rg i-004eafb13e0bbb43 (click to expand)
{
"level": "info",
"event": {
"EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"Kind": "ASG_LAUNCH_LIFECYCLE",
"Monitor": "SQS_MONITOR",
"Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
"State": "",
"AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
"NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"NodeLabels": null,
"Pods": null,
"InstanceID": "i-004eafb13e0bbb439",
"ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"InstanceType": "",
"IsManaged": true,
"StartTime": "2025-10-28T15:43:14.87Z",
"EndTime": "0001-01-01T00:00:00Z",
"NodeProcessed": false,
"InProgress": false
},
"time": "2025-10-28T15:43:15Z",
"message": "Adding new event to the event store"
}
{
"level": "info",
"event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"kind": "ASG_LAUNCH_LIFECYCLE",
"node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
"instance-id": "i-004eafb13e0bbb439",
"provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:15Z",
"message": "Requesting instance drain"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:15Z",
"message": "EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"time": "2025-10-28T15:43:15Z",
"message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
"level": "info",
"event": {
"EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"Kind": "ASG_LAUNCH_LIFECYCLE",
"Monitor": "SQS_MONITOR",
"Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
"State": "",
"AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
"NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"NodeLabels": null,
"Pods": null,
"InstanceID": "i-004eafb13e0bbb439",
"ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"InstanceType": "",
"IsManaged": true,
"StartTime": "2025-10-28T15:43:14.87Z",
"EndTime": "0001-01-01T00:00:00Z",
"NodeProcessed": false,
"InProgress": false
},
"time": "2025-10-28T15:43:35Z",
"message": "Adding new event to the event store"
}
{
"level": "info",
"event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"kind": "ASG_LAUNCH_LIFECYCLE",
"node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
"instance-id": "i-004eafb13e0bbb439",
"provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:35Z",
"message": "Requesting instance drain"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:36Z",
"message": "EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"time": "2025-10-28T15:43:36Z",
"message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
"level": "info",
"event": {
"EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"Kind": "ASG_LAUNCH_LIFECYCLE",
"Monitor": "SQS_MONITOR",
"Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
"State": "",
"AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
"NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"NodeLabels": null,
"Pods": null,
"InstanceID": "i-004eafb13e0bbb439",
"ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"InstanceType": "",
"IsManaged": true,
"StartTime": "2025-10-28T15:43:14.87Z",
"EndTime": "0001-01-01T00:00:00Z",
"NodeProcessed": false,
"InProgress": false
},
"time": "2025-10-28T15:43:55Z",
"message": "Adding new event to the event store"
}
{
"level": "info",
"event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"kind": "ASG_LAUNCH_LIFECYCLE",
"node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
"instance-id": "i-004eafb13e0bbb439",
"provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"time": "2025-10-28T15:43:55Z",
"message": "Requesting instance drain"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"nodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"readyStatus": "False",
"readyReason": "KubeletNotReady",
"readyMessage": "container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized",
"time": "2025-10-28T15:43:56Z",
"message": "EC2 instance found in cluster but node not ready - waiting for ready status"
}
{
"level": "info",
"instanceID": "i-004eafb13e0bbb439",
"eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
"time": "2025-10-28T15:43:56Z",
"message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
"level": "info",
"event": {
"EventID": "asg-lifecycle-term-36306636363861652d313831622d656261632d653565342d303132353936316261396332",
"Kind": "ASG_LIFECYCLE",
"Monitor": "SQS_MONITOR",
"Description": "ASG Lifecycle Termination event received. Instance will be interrupted at 2025-10-28 15:53:15.796 +0000 UTC \n",
"State": "",
"AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
"NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
"NodeLabels": null,
"Pods": null,
"InstanceID": "i-004eafb13e0bbb439",
"ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"InstanceType": "r6a.xlarge",
"IsManaged": true,
"StartTime": "2025-10-28T15:53:15.796Z",
"EndTime": "0001-01-01T00:00:00Z",
"NodeProcessed": false,
"InProgress": false
},
"time": "2025-10-28T15:53:15Z",
"message": "Adding new event to the event store"
}
{
"level": "info",
"event-id": "asg-lifecycle-term-36306636363861652d313831622d656261632d653565342d303132353936316261396332",
"kind": "ASG_LIFECYCLE",
"node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
"instance-id": "i-004eafb13e0bbb439",
"provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
"time": "2025-10-28T15:53:16Z",
"message": "Requesting instance drain"
}
{
"level": "info",
"lifecycleHookName": "nth-termination-request",
"instanceID": "i-004eafb13e0bbb439",
"time": "2025-10-28T15:53:20Z",
"message": "Completed ASG Lifecycle Hook"
}
k8s node ip-10-31-16-42.eu-west-1.compute.internal events (click to expand)
$ kubectl get events --all-namespaces --sort-by='.lastTimestamp' \
-o custom-columns=\
NAMESPACE:.metadata.namespace,\
LAST_SEEN:.lastTimestamp,\
TYPE:.type,\
REASON:.reason,\
OBJECT:.involvedObject.kind,\
NAME:.involvedObject.name,\
MESSAGE:.message \
| grep ip-10-31-16-42
default <nil> Normal Starting Node ip-10-31-16-42.eu-west-1.compute.internal <none>
default 2025-10-28T15:43:38Z Normal NodeHasSufficientPID Node ip-10-31-16-4.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasSufficientPID
default 2025-10-28T15:43:38Z Normal Starting Node ip-10-31-16-4.eu-west-1.compute.internal Starting kubelet.
default 2025-10-28T15:43:38Z Warning InvalidDiskCapacity Node ip-10-31-16-4.eu-west-1.compute.internal invalid capacity 0 on image filesystem
default 2025-10-28T15:43:38Z Normal NodeHasSufficientMemory Node ip-10-31-16-4.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasSufficientMemory
default 2025-10-28T15:43:38Z Normal NodeHasNoDiskPressure Node ip-10-31-16-4.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasNoDiskPressure
default 2025-10-28T15:43:38Z Normal NodeAllocatableEnforced Node ip-10-31-16-4.eu-west-1.compute.internal Updated Node Allocatable limit across pods
kube-system 2025-10-28T15:43:44Z Normal Scheduled Pod ebs-csi-node-swwz9 Successfully assigned kube-system/ebs-csi-node-swwz9 to ip-10-31-16-42.eu-west-1.compute.internal
kube-system 2025-10-28T15:43:44Z Normal Scheduled Pod kube-proxy-6gg8t Successfully assigned kube-system/kube-proxy-6gg8t to ip-10-31-16-42.eu-west-1.compute.internal
fluent-bit 2025-10-28T15:43:44Z Normal Scheduled Pod fluent-bit-q4s8d Successfully assigned fluent-bit/fluent-bit-q4s8d to ip-10-31-16-42.eu-west-1.compute.internal
groundcover 2025-10-28T15:43:44Z Normal Scheduled Pod groundcover-sensor-w4dtg Successfully assigned groundcover/groundcover-sensor-w4dtg to ip-10-31-16-42.eu-west-1.compute.internal
default 2025-10-28T15:43:44Z Normal Synced Node ip-10-31-16-4.eu-west-1.compute.internal Node synced successfully
victoria-metrics-k8s-stack 2025-10-28T15:43:44Z Normal Scheduled Pod victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf Successfully assigned victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf to ip-10-31-16-42.eu-west-1.compute.internal
kube-system 2025-10-28T15:43:44Z Normal Scheduled Pod aws-node-llxtd Successfully assigned kube-system/aws-node-llxtd to ip-10-31-16-42.eu-west-1.compute.internal
kube-system 2025-10-28T15:43:44Z Normal Scheduled Pod efs-csi-node-5z5rw Successfully assigned kube-system/efs-csi-node-5z5rw to ip-10-31-16-42.eu-west-1.compute.internal
default 2025-10-28T15:43:45Z Normal RegisteredNode Node ip-10-31-16-4.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal event: Registered Node ip-10-31-16-42.eu-west-1.compute.internal in Controller
default 2025-10-28T15:43:55Z Normal UnknownInterruption Node ip-10-31-16-4.eu-west-1.compute.internal ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC
default 2025-10-28T15:43:59Z Normal NodeReady Node ip-10-31-16-42.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeReady
actions-runners-system 2025-10-28T15:44:00Z Normal Scheduled Pod fm-huge-runners-gm2kf-l7tkd Successfully assigned actions-runners-system/fm-huge-runners-gm2kf-l7tkd to ip-10-31-16-42.eu-west-1.compute.internal
default 2025-10-28T15:53:16Z Normal ASGLifecycle Node ip-10-31-16-42.eu-west-1.compute.internal ASG Lifecycle Termination event received. Instance will be interrupted at 2025-10-28 15:53:15.796 +0000 UTC
default 2025-10-28T15:53:16Z Normal PreDrain Node ip-10-31-16-42.eu-west-1.compute.internal Pre-drain task successfully executed
actions-runners-system 2025-10-28T15:53:17Z Normal PodEviction Pod fm-huge-runners-gm2kf-l7tkd Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system 2025-10-28T15:53:17Z Normal PodEviction Pod kube-proxy-6gg8t Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system 2025-10-28T15:53:17Z Normal PodEviction Pod ebs-csi-node-swwz9 Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system 2025-10-28T15:53:17Z Normal PodEviction Pod efs-csi-node-5z5rw Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
fluent-bit 2025-10-28T15:53:17Z Normal PodEviction Pod fluent-bit-q4s8d Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
groundcover 2025-10-28T15:53:17Z Normal PodEviction Pod groundcover-sensor-w4dtg Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
victoria-metrics-k8s-stack 2025-10-28T15:53:17Z Normal PodEviction Pod victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system 2025-10-28T15:53:17Z Normal PodEviction Pod aws-node-llxtd Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
default 2025-10-28T15:53:20Z Normal CordonAndDrain Node ip-10-31-16-42.eu-west-1.compute.internal Node successfully cordoned and drained
default 2025-10-28T15:53:21Z Normal PostDrain Node ip-10-31-16-42.eu-west-1.compute.internal Post-drain task successfully executed
default 2025-10-28T15:54:10Z Normal NodeNotReady Node ip-10-31-16-42.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeNotReady
default 2025-10-28T15:54:58Z Normal DeletingNode Node ip-10-31-16-42.eu-west-1.compute.internal Deleting node ip-10-31-16-42.eu-west-1.compute.internal because it does not exist in the cloud provider
default 2025-10-28T15:55:00Z Normal RemovingNode Node ip-10-31-16-42.eu-west-1.compute.internal Node ip-10-31-16-42.eu-west-1.compute.internal event: Removing Node ip-10-31-16-42.eu-west-1.compute.internal from Controller
victoria-metrics-k8s-stack 2025-10-28T15:55:39Z Warning FailedToUpdateEndpointSlices Service victoria-metrics-k8s-stack-prometheus-node-exporter Error updating Endpoint Slices for Service victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter: skipping Pod victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf for Service victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter: Node ip-10-31-16-42.eu-west-1.compute.internal Not Found
From the K8s events, the timeline is:
15:43:38Z - Node starts joining and gets various status updates
15:43:44Z - Node synced successfully
15:43:59Z - Node became ready
15:44:00Z - Pods start getting scheduled to the node
But looking at the NTH logs:
15:43:15Z - First NTH check shows no node in K8s for the EC2 instance
15:43:36Z - Second NTH check shows same
15:43:56Z - Last NTH check shows EC2 as node "not ready" with CNI issues
15:53:15Z - Next log is the termination event (10 minutes later - lifecycle hook heartbeat timeout)
Somehow, the node was Ready at 15:43:59Z, but NTH never checked again after 15:43:56Z, and the ASG lifecycle hook eventually timed out.
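While the node still exists, the Ready transition time reported in the events can also be read directly from the node object, e.g.:
$ kubectl get node ip-10-31-16-42.eu-west-1.compute.internal \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'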
Environment
- NTH App Version: v1.25.2
- NTH Mode (IMDS/Queue processor): Queue Processor
- OS/Arch: Amazon Linux 2023 (AMI: amazon-eks-node-al2023-x86_64-standard-1.31-v20251016)
- Kubernetes version: v1.31.13
- Installation method: Helm chart 0.27.2