
ASG launch instance lifecycle hook does not redo node readiness checks #1205

@baburciu

Description


Describe the bug
We use an ASG instance launch lifecycle hook and noticed that NTH logs node readiness checks only 3 times during the first ~40 seconds (after the initial ASG_LAUNCH_LIFECYCLE message is received), all before the node becomes Ready, and then it never checks again.
Even though the node reaches Ready status soon after the last NTH check, the lifecycle hook heartbeat ultimately times out and the ASG terminates the instance (the hook's default action is ABANDON).
This seems to be similar to the report in #1007.

Steps to reproduce

  1. Configure the ASG with a launch lifecycle hook (600s timeout, ABANDON default action); see the example CLI command after these steps
  2. Launch instances that take >40 seconds to become Ready in Kubernetes
  3. Observe in the NTH logs that the node readiness check fails 3 times, with no further retries

Note

The logs below were retrieved from a local build with the following small patch applied, to provide more insight:

patch git diff
--- a/pkg/interruptionevent/asg/launch/handler.go
+++ b/pkg/interruptionevent/asg/launch/handler.go
@@ -70,6 +70,10 @@ func (h *Handler) HandleEvent(drainEvent *monitor.InterruptionEvent) error {
        }
        if !isNodeReady {
                h.commonHandler.InterruptionEventStore.CancelInterruptionEvent(drainEvent.EventID)
+               log.Info().
+                       Str("instanceID", drainEvent.InstanceID).
+                       Str("eventID", drainEvent.EventID).
+                       Msg("Node not ready - canceling launch lifecycle event (ASG will retry or timeout)")
                return nil
        }

@@ -91,7 +95,9 @@ func (h *Handler) isNodeReady(instanceID string) (bool, error) {
        }

        if len(nodes) == 0 {
-               log.Info().Str("instanceID", instanceID).Msg("EC2 instance not found")
+               log.Info().
+                       Str("instanceID", instanceID).
+                       Msg("EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node")
                return false, nil
        }

@@ -99,12 +105,21 @@ func (h *Handler) isNodeReady(instanceID string) (bool, error) {
                conditions := node.Status.Conditions
                for _, condition := range conditions {
                        if condition.Type == "Ready" && condition.Status != "True" {
-                               log.Info().Str("instanceID", instanceID).Msg("EC2 instance found, but not ready")
+                               log.Info().
+                                       Str("instanceID", instanceID).
+                                       Str("nodeName", node.Name).
+                                       Str("readyStatus", string(condition.Status)).
+                                       Str("readyReason", condition.Reason).
+                                       Str("readyMessage", condition.Message).
+                                       Msg("EC2 instance found in cluster but node not ready - waiting for ready status")
                                return false, nil
                        }
                }
        }
-       log.Info().Str("instanceID", instanceID).Msg("EC2 instance is found and ready")
+       log.Info().
+               Str("instanceID", instanceID).
+               Str("nodeName", nodes[0].Name).
+               Msg("EC2 instance is found and ready - will complete ASG launch lifecycle hook")
        return true, nil
 }
  4. Note that after these ~3 checks, no further processing occurs, even once the node becomes Ready
  5. The ASG lifecycle hook eventually times out and the instance is terminated
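
For reference, the launch lifecycle hook from step 1 can be created with the AWS CLI roughly as follows (hook name and ASG name are placeholders, not the ones from this cluster):

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name nth-launch-hook \
  --auto-scaling-group-name my-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
  --heartbeat-timeout 600 \
  --default-result ABANDON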

Expected outcome

NTH should retry checking node readiness without cancelling the event, allowing SQS retries to continue until the node becomes Ready, then send CONTINUE to complete the lifecycle hook successfully.
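
A minimal sketch of that behaviour, for illustration only (this is not NTH's actual code; isNodeReady and completeHookWithContinue are stand-ins for the real handler and AWS calls, and the retry-via-SQS-visibility mechanism is an assumption):

package launch

import "fmt"

// handleLaunchEvent keeps the launch lifecycle event pending while the node
// is not Ready, and only completes the hook with CONTINUE once it is.
func handleLaunchEvent(
    isNodeReady func(instanceID string) (bool, error),
    completeHookWithContinue func(instanceID string) error,
    instanceID string,
) error {
    ready, err := isNodeReady(instanceID)
    if err != nil {
        return fmt.Errorf("checking readiness of %s: %w", instanceID, err)
    }
    if !ready {
        // Do NOT cancel the event here. Returning an error would leave the
        // SQS message undeleted, so it becomes visible again after the
        // visibility timeout and the readiness check is retried.
        return fmt.Errorf("node for instance %s not ready yet, retry later", instanceID)
    }
    // Node is Ready: send CONTINUE so the ASG launch hook completes instead
    // of timing out and ABANDONing (terminating) the instance.
    return completeHookWithContinue(instanceID)
}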

Application Logs

k logs aws-nth-6668f87c7b-brdvp | rg i-004eafb13e0bbb43
{
    "level": "info",
    "event": {
        "EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
        "Kind": "ASG_LAUNCH_LIFECYCLE",
        "Monitor": "SQS_MONITOR",
        "Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
        "State": "",
        "AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
        "NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
        "NodeLabels": null,
        "Pods": null,
        "InstanceID": "i-004eafb13e0bbb439",
        "ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
        "InstanceType": "",
        "IsManaged": true,
        "StartTime": "2025-10-28T15:43:14.87Z",
        "EndTime": "0001-01-01T00:00:00Z",
        "NodeProcessed": false,
        "InProgress": false
    },
    "time": "2025-10-28T15:43:15Z",
    "message": "Adding new event to the event store"
}
{
    "level": "info",
    "event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
    "kind": "ASG_LAUNCH_LIFECYCLE",
    "node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
    "instance-id": "i-004eafb13e0bbb439",
    "provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
    "time": "2025-10-28T15:43:15Z",
    "message": "Requesting instance drain"
}
{
    "level": "info",
    "instanceID": "i-004eafb13e0bbb439",
    "time": "2025-10-28T15:43:15Z",
    "message": "EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node"
}
{
    "level": "info",
    "instanceID": "i-004eafb13e0bbb439",
    "eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
    "time": "2025-10-28T15:43:15Z",
    "message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
    "level": "info",
    "event": {
        "EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
        "Kind": "ASG_LAUNCH_LIFECYCLE",
        "Monitor": "SQS_MONITOR",
        "Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
        "State": "",
        "AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
        "NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
        "NodeLabels": null,
        "Pods": null,
        "InstanceID": "i-004eafb13e0bbb439",
        "ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
        "InstanceType": "",
        "IsManaged": true,
        "StartTime": "2025-10-28T15:43:14.87Z",
        "EndTime": "0001-01-01T00:00:00Z",
        "NodeProcessed": false,
        "InProgress": false
    },
    "time": "2025-10-28T15:43:35Z",
    "message": "Adding new event to the event store"
}
{
    "level": "info",
    "event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
    "kind": "ASG_LAUNCH_LIFECYCLE",
    "node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
    "instance-id": "i-004eafb13e0bbb439",
    "provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
    "time": "2025-10-28T15:43:35Z",
    "message": "Requesting instance drain"
}
{
    "level": "info",
    "instanceID": "i-004eafb13e0bbb439",
    "time": "2025-10-28T15:43:36Z",
    "message": "EC2 instance not found in Kubernetes cluster - waiting for instance to join as a node"
}
{
    "level": "info",
    "instanceID": "i-004eafb13e0bbb439",
    "eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
    "time": "2025-10-28T15:43:36Z",
    "message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
    "level": "info",
    "event": {
        "EventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
        "Kind": "ASG_LAUNCH_LIFECYCLE",
        "Monitor": "SQS_MONITOR",
        "Description": "ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC \n",
        "State": "",
        "AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
        "NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
        "NodeLabels": null,
        "Pods": null,
        "InstanceID": "i-004eafb13e0bbb439",
        "ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
        "InstanceType": "",
        "IsManaged": true,
        "StartTime": "2025-10-28T15:43:14.87Z",
        "EndTime": "0001-01-01T00:00:00Z",
        "NodeProcessed": false,
        "InProgress": false
    },
    "time": "2025-10-28T15:43:55Z",
    "message": "Adding new event to the event store"
}
{
    "level": "info",
    "event-id": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
    "kind": "ASG_LAUNCH_LIFECYCLE",
    "node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
    "instance-id": "i-004eafb13e0bbb439",
    "provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
    "time": "2025-10-28T15:43:55Z",
    "message": "Requesting instance drain"
}
{
    "level": "info",
    "instanceID": "i-004eafb13e0bbb439",
    "nodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
    "readyStatus": "False",
    "readyReason": "KubeletNotReady",
    "readyMessage": "container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized",
    "time": "2025-10-28T15:43:56Z",
    "message": "EC2 instance found in cluster but node not ready - waiting for ready status"
}
{
    "level": "info",
    "instanceID": "i-004eafb13e0bbb439",
    "eventID": "asg-lifecycle-term-64373436363861642d663331612d663838382d396561622d633831623861656561373064",
    "time": "2025-10-28T15:43:56Z",
    "message": "Node not ready - canceling launch lifecycle event (ASG will retry or timeout)"
}
{
    "level": "info",
    "event": {
        "EventID": "asg-lifecycle-term-36306636363861652d313831622d656261632d653565342d303132353936316261396332",
        "Kind": "ASG_LIFECYCLE",
        "Monitor": "SQS_MONITOR",
        "Description": "ASG Lifecycle Termination event received. Instance will be interrupted at 2025-10-28 15:53:15.796 +0000 UTC \n",
        "State": "",
        "AutoScalingGroupName": "on-demand-huge-workers-01-2024112121084029580000002c",
        "NodeName": "ip-10-31-16-42.eu-west-1.compute.internal",
        "NodeLabels": null,
        "Pods": null,
        "InstanceID": "i-004eafb13e0bbb439",
        "ProviderID": "aws:///eu-west-1a/i-004eafb13e0bbb439",
        "InstanceType": "r6a.xlarge",
        "IsManaged": true,
        "StartTime": "2025-10-28T15:53:15.796Z",
        "EndTime": "0001-01-01T00:00:00Z",
        "NodeProcessed": false,
        "InProgress": false
    },
    "time": "2025-10-28T15:53:15Z",
    "message": "Adding new event to the event store"
}
{
    "level": "info",
    "event-id": "asg-lifecycle-term-36306636363861652d313831622d656261632d653565342d303132353936316261396332",
    "kind": "ASG_LIFECYCLE",
    "node-name": "ip-10-31-16-42.eu-west-1.compute.internal",
    "instance-id": "i-004eafb13e0bbb439",
    "provider-id": "aws:///eu-west-1a/i-004eafb13e0bbb439",
    "time": "2025-10-28T15:53:16Z",
    "message": "Requesting instance drain"
}
{
    "level": "info",
    "lifecycleHookName": "nth-termination-request",
    "instanceID": "i-004eafb13e0bbb439",
    "time": "2025-10-28T15:53:20Z",
    "message": "Completed ASG Lifecycle Hook"
}
k8s node ip-10-31-16-42.eu-west-1.compute.internal events
$ kubectl get events --all-namespaces --sort-by='.lastTimestamp' \
  -o custom-columns=\
NAMESPACE:.metadata.namespace,\
LAST_SEEN:.lastTimestamp,\
TYPE:.type,\
REASON:.reason,\
OBJECT:.involvedObject.kind,\
NAME:.involvedObject.name,\
MESSAGE:.message \
  | grep ip-10-31-16-42
default                      <nil>                  Normal    Starting                   Node          ip-10-31-16-42.eu-west-1.compute.internal                   <none>
default                      2025-10-28T15:43:38Z   Normal    NodeHasSufficientPID           Node             ip-10-31-16-4.eu-west-1.compute.internal                                    Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasSufficientPID
default                      2025-10-28T15:43:38Z   Normal    Starting                       Node             ip-10-31-16-4.eu-west-1.compute.internal                                    Starting kubelet.
default                      2025-10-28T15:43:38Z   Warning   InvalidDiskCapacity            Node             ip-10-31-16-4.eu-west-1.compute.internal                                    invalid capacity 0 on image filesystem
default                      2025-10-28T15:43:38Z   Normal    NodeHasSufficientMemory        Node             ip-10-31-16-4.eu-west-1.compute.internal                                    Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasSufficientMemory
default                      2025-10-28T15:43:38Z   Normal    NodeHasNoDiskPressure          Node             ip-10-31-16-4.eu-west-1.compute.internal                                    Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeHasNoDiskPressure
default                      2025-10-28T15:43:38Z   Normal    NodeAllocatableEnforced        Node             ip-10-31-16-4.eu-west-1.compute.internal                                    Updated Node Allocatable limit across pods
kube-system                  2025-10-28T15:43:44Z   Normal    Scheduled                      Pod              ebs-csi-node-swwz9                                                           Successfully assigned kube-system/ebs-csi-node-swwz9 to ip-10-31-16-42.eu-west-1.compute.internal
kube-system                  2025-10-28T15:43:44Z   Normal    Scheduled                      Pod              kube-proxy-6gg8t                                                             Successfully assigned kube-system/kube-proxy-6gg8t to ip-10-31-16-42.eu-west-1.compute.internal
fluent-bit                   2025-10-28T15:43:44Z   Normal    Scheduled                      Pod              fluent-bit-q4s8d                                                             Successfully assigned fluent-bit/fluent-bit-q4s8d to ip-10-31-16-42.eu-west-1.compute.internal
groundcover                  2025-10-28T15:43:44Z   Normal    Scheduled                      Pod              groundcover-sensor-w4dtg                                                     Successfully assigned groundcover/groundcover-sensor-w4dtg to ip-10-31-16-42.eu-west-1.compute.internal
default                      2025-10-28T15:43:44Z   Normal    Synced                         Node             ip-10-31-16-4.eu-west-1.compute.internal                                    Node synced successfully
victoria-metrics-k8s-stack   2025-10-28T15:43:44Z   Normal    Scheduled                      Pod              victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf                    Successfully assigned victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf to ip-10-31-16-42.eu-west-1.compute.internal
kube-system                  2025-10-28T15:43:44Z   Normal    Scheduled                      Pod              aws-node-llxtd                                                               Successfully assigned kube-system/aws-node-llxtd to ip-10-31-16-42.eu-west-1.compute.internal
kube-system                  2025-10-28T15:43:44Z   Normal    Scheduled                      Pod              efs-csi-node-5z5rw                                                           Successfully assigned kube-system/efs-csi-node-5z5rw to ip-10-31-16-42.eu-west-1.compute.internal
default                      2025-10-28T15:43:45Z   Normal    RegisteredNode                 Node             ip-10-31-16-4.eu-west-1.compute.internal                                    Node ip-10-31-16-42.eu-west-1.compute.internal event: Registered Node ip-10-31-16-42.eu-west-1.compute.internal in Controller
default                      2025-10-28T15:43:55Z   Normal    UnknownInterruption            Node             ip-10-31-16-4.eu-west-1.compute.internal                                    ASG Lifecycle Launch event received. Instance was started at 2025-10-28 15:43:14.87 +0000 UTC
default                      2025-10-28T15:43:59Z   Normal    NodeReady                      Node               ip-10-31-16-42.eu-west-1.compute.internal                                    Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeReady
actions-runners-system       2025-10-28T15:44:00Z   Normal    Scheduled                      Pod                fm-huge-runners-gm2kf-l7tkd                                                  Successfully assigned actions-runners-system/fm-huge-runners-gm2kf-l7tkd to ip-10-31-16-42.eu-west-1.compute.internal
default                      2025-10-28T15:53:16Z   Normal    ASGLifecycle                   Node               ip-10-31-16-42.eu-west-1.compute.internal                                    ASG Lifecycle Termination event received. Instance will be interrupted at 2025-10-28 15:53:15.796 +0000 UTC
default                      2025-10-28T15:53:16Z   Normal    PreDrain                       Node               ip-10-31-16-42.eu-west-1.compute.internal                                    Pre-drain task successfully executed
actions-runners-system       2025-10-28T15:53:17Z   Normal    PodEviction                    Pod                fm-huge-runners-gm2kf-l7tkd                                                  Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system                  2025-10-28T15:53:17Z   Normal    PodEviction                    Pod                kube-proxy-6gg8t                                                             Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system                  2025-10-28T15:53:17Z   Normal    PodEviction                    Pod                ebs-csi-node-swwz9                                                           Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system                  2025-10-28T15:53:17Z   Normal    PodEviction                    Pod                efs-csi-node-5z5rw                                                           Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
fluent-bit                   2025-10-28T15:53:17Z   Normal    PodEviction                    Pod                fluent-bit-q4s8d                                                             Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
groundcover                  2025-10-28T15:53:17Z   Normal    PodEviction                    Pod                groundcover-sensor-w4dtg                                                     Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
victoria-metrics-k8s-stack   2025-10-28T15:53:17Z   Normal    PodEviction                    Pod                victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf                    Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
kube-system                  2025-10-28T15:53:17Z   Normal    PodEviction                    Pod                aws-node-llxtd                                                               Pod evicted due to node drain (node ip-10-31-16-42.eu-west-1.compute.internal)
default                      2025-10-28T15:53:20Z   Normal    CordonAndDrain                 Node               ip-10-31-16-42.eu-west-1.compute.internal                                    Node successfully cordoned and drained
default                      2025-10-28T15:53:21Z   Normal    PostDrain                      Node               ip-10-31-16-42.eu-west-1.compute.internal                                    Post-drain task successfully executed
default                      2025-10-28T15:54:10Z   Normal    NodeNotReady                    Node               ip-10-31-16-42.eu-west-1.compute.internal                                    Node ip-10-31-16-42.eu-west-1.compute.internal status is now: NodeNotReady
default                      2025-10-28T15:54:58Z   Normal    DeletingNode                    Node               ip-10-31-16-42.eu-west-1.compute.internal                                    Deleting node ip-10-31-16-42.eu-west-1.compute.internal because it does not exist in the cloud provider
default                      2025-10-28T15:55:00Z   Normal    RemovingNode                    Node               ip-10-31-16-42.eu-west-1.compute.internal                                    Node ip-10-31-16-42.eu-west-1.compute.internal event: Removing Node ip-10-31-16-42.eu-west-1.compute.internal from Controller
victoria-metrics-k8s-stack   2025-10-28T15:55:39Z   Warning   FailedToUpdateEndpointSlices    Service            victoria-metrics-k8s-stack-prometheus-node-exporter                          Error updating Endpoint Slices for Service victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter: skipping Pod victoria-metrics-k8s-stack-prometheus-node-exporter-jh5jf for Service victoria-metrics-k8s-stack/victoria-metrics-k8s-stack-prometheus-node-exporter: Node ip-10-31-16-42.eu-west-1.compute.internal Not Found

From the K8s events, the timeline is:
15:43:38Z - Node starts joining and gets various status updates
15:43:44Z - Node synced successfully
15:43:59Z - Node became ready
15:44:00Z - Pods start getting scheduled to the node

But looking at the NTH logs:
15:43:15Z - First NTH check shows no node in K8s for the EC2 instance
15:43:36Z - Second NTH check shows same
15:43:56Z - Third and last NTH check finds the node registered but not ready, due to a CNI issue
15:53:15Z - Next log is the termination event (10 minutes later - lifecycle hook heartbeat timeout)

So, even though the node was Ready at 15:43:59Z, NTH never checked again after 15:43:56Z, and the ASG lifecycle hook eventually timed out.

Environment

  • NTH App Version: v1.25.2
  • NTH Mode (IMDS/Queue processor): Queue Processor
  • OS/Arch: Amazon Linux 2023 (AMI: amazon-eks-node-al2023-x86_64-standard-1.31-v20251016)
  • Kubernetes version: v1.31.13
  • Installation method: Helm chart 0.27.2
