Skip to content

control plane RollingUpdate config silently breaks cluster creationΒ #9898

@swapzero

Description

@swapzero

What happened:

While following the documented process for Bare Metal RollingUpgrades, I prepared a new cluster.yaml that includes the upgradeRolloutStrategy set to RollingUpdate for both the control plane and worker node groups.

Initially, I encountered a validation error:

Error: the cluster config file provided is invalid: validating upgrade rollout strategy configuration: WorkerNodeGroupConfiguration: upgradeRolloutStrategy.rollingUpdate field is required for upgradeRolloutStrategy.type RollingUpdate

After addressing this by adding the required rollingUpdate fields for the worker group, the configuration passed validation. For the control plane, it appears maxSurge defaults to 1, and since the preflight checks succeed without explicitly setting it, I assumed that was acceptable.

Here is the diff between the original and updated cluster.yaml:

--- cluster.yaml.orig   2025-07-12 21:15:44.266427279 +0000
+++ cluster.yaml        2025-07-12 21:16:23.706199561 +0000
@@ -13,6 +13,8 @@
       cidrBlocks:
       - 10.96.0.0/12
   controlPlaneConfiguration:
+    upgradeRolloutStrategy:
+      type: RollingUpdate
     count: 3
     endpoint:
       host: "10.162.10.140"
@@ -31,6 +33,11 @@
       kind: TinkerbellMachineConfig
       name: eks-a
     name: worker-group
+    upgradeRolloutStrategy:
+      rollingUpdate:
+        maxSurge: 1
+        maxUnavailable: 0
+      type: RollingUpdate
 ---
 apiVersion: anywhere.eks.amazonaws.com/v1alpha1
 kind: TinkerbellDatacenterConfig

When attempting to create the cluster using:

eksctl anywhere create cluster \
  --hardware-csv hardware.csv \
  -f cluster.yaml \
  --no-timeouts

The process failed with an error that suggested a missing TinkerbellMachineConfig:

cluster has an error: Dependent cluster objects don't exist: TinkerbellMachineConfig.anywhere.eks.amazonaws.com "eks-a-cp" not found

However, this resource clearly exists in the bootstrap cluster:

$ kubectl get TinkerbellMachineConfig --all-namespaces
NAMESPACE   NAME       AGE
default     eks-a      54s
default     eks-a-cp   54s

Despite the resource being present, the cluster creation process keeps retrying endlessly. Even with --no-timeouts, which ironically guarantees being stuck in a loop with no helpful feedback.


What you expected to happen:

I expected the cluster creation to either work or fail with a clear and accurate error. If something is misconfigured, the message should explain what's actually wrong so it's possible to fix it without guesswork.


How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster.yaml that includes upgradeRolloutStrategy.type: RollingUpdate for both control plane and worker groups. Ensure rollingUpdate values are included for the worker group but left out for the control plane (since it passes validation).
  2. Run eksctl anywhere create cluster using the configuration.
  3. Observe the error related to a missing TinkerbellMachineConfig, even though the resource is present, and note the endless retry loop behavior.

Anything else we need to know?:


Environment:

  • EKS Anywhere Release
$ eksctl anywhere version
Version: v0.22.6
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/100/manifest.yaml

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions