kubeadm: single-control-plane stacked etcd upgrade can time out waiting for static pod hash change because verification requires apiserver (etcd unavailable/restarting) #3297

@adilGhaffarDev

What happened?

When upgrading etcd on a single control-plane cluster, the etcd static pod upgrade can fail even though etcd itself restarts successfully.

The failure is caused by kubeadm verifying the static pod restart via the Kubernetes API. During a single-member etcd restart, the API server is unhealthy or unreachable because it is still connected to the old etcd instance.

  • etcd is upgraded/restarted, which makes the API server unhealthy.
  • etcd starts up "successfully": it does start, but the API server does not reconnect to it in time for kubelet/kubeadm to verify it.
  • kubeadm checks on etcd through the API server. The new etcd might be healthy, but the API server still tries to connect to the old pod.

What did you expect to happen?

  • kubeadm should be able to complete the etcd static pod upgrade without requiring a healthy API server during the etcd restart window, or
  • kubeadm should use a verification method that does not depend on the API server being available.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a 1-node cluster with Kubernetes v1.34.0, then upgrade it to v1.35.0 using kubeadm v1.35.0.
The failure is flaky: it does not occur on every upgrade attempt.

Anything else we need to know?

Proposed fixes:

Option 1 (fallback when apiserver is unreachable):
If kubeadm cannot reach the API server while waiting for the static pod hash change, fall back to verifying etcd health directly through the etcd client endpoint (kubeadm already has etcd client/TLS handling elsewhere).
Once the etcd endpoint is healthy again, retry the apiserver-based hash verification.

Option 2 (local verification):
Verify static pod restart locally via CRI (pod/container recreated) + manifest/hash on disk, bypassing apiserver.

Kubernetes version

1.35

Cloud provider

Baremetal

OS version

Install tools

Details

Container runtime (CRI) and version (if applicable)

1.35

Related plugins (CNI, CSI, ...) and versions (if applicable)

Details

Labels: area/kubeadm, area/kubelet, kind/bug, priority/awaiting-more-evidence, sig/cluster-lifecycle, sig/node
