[BUG] Unresponsive AWSManagedMachinePool with bad bootstrap #5672

@MinhNguyen-at

/kind bug
AWSManagedMachinePool becomes unresponsive after a bad bootstrap (for example, a failing network call to AWS EKS or an invalid command that causes cloud-init to fail on AL2).

How to reproduce:

  1. Create or update a MachinePool, an AWSManagedMachinePool, and an EKSConfig with an invalid bootstrap/pre-bootstrap command, e.g. ./bin/exec-something-does-not-exist.exeonalinuxsurelydoesnotwork
  2. Observe the event on the AWSManagedMachinePool:
     EKSNodegroupReconciliationFailed: failed to wait for nodegroup to be active: failed to wait for EKS nodegroup "eks-pool": request cancelled while waiting, context canceled

Reconciliation is stuck waiting for the EKS node group to leave the "updating" phase. It never will: the update cannot complete because the new nodes never successfully join to replace the existing ones. Since every reconcile blocks on this wait, the invalid bootstrap can no longer be fixed through CAPA resources, and recovery requires running manual commands.

  • Cluster-api-provider-aws version: v2.9.1
  • Kubernetes version: (use kubectl version): 1.30+
  • OS (e.g. from /etc/os-release): AL2

Labels: kind/bug, needs-priority, needs-triage