kubeadm: single-control-plane stacked etcd upgrade can time out waiting for static pod hash change because verification requires apiserver (etcd unavailable/restarting) #3297
Description
What happened?
When upgrading etcd on a single control-plane cluster, the etcd static pod upgrade can fail even though etcd itself restarts successfully.
The failure is caused by kubeadm verifying the static pod restart via the Kubernetes API, but during a single-member etcd restart the API server is unhealthy/unreachable because it is still connected to the old etcd instance.
- etcd is upgraded/restarted, which makes the API server unhealthy.
- etcd starts up "successfully": it does start, but the API server does not reconnect to it in time for kubelet/kubeadm to verify the restart.
- kubeadm checks on etcd through the API server. The new etcd might be healthy, but the API server still tries to connect to the old pod.
What did you expect to happen?
- kubeadm should be able to complete the etcd static pod upgrade without requiring a healthy API server during the etcd restart window, or
- kubeadm should use a verification method that does not depend on the API server being available.
How can we reproduce it (as minimally and precisely as possible)?
Deploy a 1-node cluster with Kubernetes v1.34.0, then try to upgrade it to v1.35.0 using kubeadm v1.35.0.
The failure is flaky: it does not occur on every upgrade attempt.
Anything else we need to know?
Proposed fixes:
Option 1 (fallback when apiserver is unreachable):
If kubeadm cannot reach the API server while waiting for the static pod hash change, fall back to verifying etcd health directly using the etcd client endpoint (kubeadm already has etcd client/TLS handling elsewhere).
Once the etcd endpoint is healthy again, retry the apiserver-based hash verification.
Option 2 (local verification):
Verify static pod restart locally via CRI (pod/container recreated) + manifest/hash on disk, bypassing apiserver.
Kubernetes version
1.35
Cloud provider
Baremetal
OS version
Install tools
Container runtime (CRI) and version (if applicable)
1.35