Commit 14ac1b3

Merge pull request #44058 from chendave/troubleshooting
Update trouble shooting to include the issue of etcd upgrade
2 parents dbb6146 + 6c6ace0 commit 14ac1b3

File tree

1 file changed: +79 −0 lines changed

content/en/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm.md

Lines changed: 79 additions & 0 deletions
@@ -431,3 +431,82 @@ See [Enabling signed kubelet serving certificates](/docs/tasks/administer-cluste
to understand how to configure the kubelets in a kubeadm cluster to have properly signed serving certificates.

Also see [How to run the metrics-server securely](https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md#how-to-run-metrics-server-securely).

## Upgrade fails due to etcd hash not changing

This only applies when upgrading a control plane node with a kubeadm binary v1.28.3 or later,
where the node is currently managed by kubeadm versions v1.28.0, v1.28.1 or v1.28.2.
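As a quick check of whether a node falls in the affected range, the managing kubeadm version can be matched against the three affected releases. This is a hedged, self-contained sketch: on a real node you would obtain the version with `kubeadm version -o short` (or from whichever kubeadm last wrote the etcd manifest); here it is hard-coded for illustration.

```shell
# Sketch only: replace the hard-coded value with the output of
# "kubeadm version -o short" on the node being checked.
current="v1.28.1"

case "$current" in
  v1.28.0|v1.28.1|v1.28.2)
    echo "affected: the etcd hash workaround may be needed" ;;
  *)
    echo "not affected" ;;
esac
```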

Here is the error message you may encounter:

```
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
[upgrade/etcd] Waiting for previous etcd to become available
I0907 10:10:09.109104    3704 etcd.go:588] [etcd] attempting to see if all cluster endpoints ([https://172.17.0.6:2379/ https://172.17.0.4:2379/ https://172.17.0.3:2379/]) are available 1/10
[upgrade/etcd] Etcd was rolled back and is now available
static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests
	cmd/kubeadm/app/phases/upgrade/staticpods.go:525
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.upgradeComponent
	cmd/kubeadm/app/phases/upgrade/staticpods.go:254
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
	cmd/kubeadm/app/phases/upgrade/staticpods.go:338
...
```

The reason for this failure is that the affected versions generate an etcd manifest file with unwanted defaults in the PodSpec.
This results in a diff during the manifest comparison, so kubeadm expects the Pod hash to change, but the kubelet never updates the hash.
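To see why one stray defaulted field is enough to trigger the mismatch, here is an illustration (not kubeadm's internal mechanism): the static Pod hash is derived from the manifest content, so a manifest written with a default such as `successThreshold: 1` never matches one written without it. The real hash is computed by the kubelet and exposed on the mirror Pod as the `kubernetes.io/config.hash` annotation; a plain `sha256sum` over two toy manifests merely stands in for it.

```shell
# Illustration only: two toy manifests that differ by a single defaulted
# field produce different content hashes. The kubelet's actual static Pod
# hash is internal, but it is likewise content-derived.
printf 'spec:\n  periodSeconds: 10\n' > /tmp/etcd-a.yaml
printf 'spec:\n  periodSeconds: 10\n  successThreshold: 1\n' > /tmp/etcd-b.yaml
sha256sum /tmp/etcd-a.yaml /tmp/etcd-b.yaml   # the two digests differ
```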

There are two ways to work around this issue if you see it in your cluster:

- The etcd upgrade can be skipped between the affected versions and v1.28.3 (or later) by using:

  ```shell
  kubeadm upgrade {apply|node} [version] --etcd-upgrade=false
  ```

  This is not recommended in case a new etcd version was introduced by a later v1.28 patch version.

- Before the upgrade, patch the manifest for the etcd static Pod to remove the problematic defaulted attributes:

  ```patch
  diff --git a/etc/kubernetes/manifests/etcd_defaults.yaml b/etc/kubernetes/manifests/etcd_origin.yaml
  index d807ccbe0aa..46b35f00e15 100644
  --- a/etc/kubernetes/manifests/etcd_defaults.yaml
  +++ b/etc/kubernetes/manifests/etcd_origin.yaml
  @@ -43,7 +43,6 @@ spec:
           scheme: HTTP
         initialDelaySeconds: 10
         periodSeconds: 10
  -      successThreshold: 1
         timeoutSeconds: 15
       name: etcd
       resources:
  @@ -59,26 +58,18 @@ spec:
           scheme: HTTP
         initialDelaySeconds: 10
         periodSeconds: 10
  -      successThreshold: 1
         timeoutSeconds: 15
  -    terminationMessagePath: /dev/termination-log
  -    terminationMessagePolicy: File
       volumeMounts:
       - mountPath: /var/lib/etcd
         name: etcd-data
       - mountPath: /etc/kubernetes/pki/etcd
         name: etcd-certs
  -  dnsPolicy: ClusterFirst
  -  enableServiceLinks: true
     hostNetwork: true
     priority: 2000001000
     priorityClassName: system-node-critical
  -  restartPolicy: Always
  -  schedulerName: default-scheduler
     securityContext:
       seccompProfile:
         type: RuntimeDefault
  -  terminationGracePeriodSeconds: 30
     volumes:
     - hostPath:
         path: /etc/kubernetes/pki/etcd
  ```
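If you prefer scripting that edit, the same defaulted fields can be deleted with `sed`. This is a hedged, self-contained sketch that operates on a throwaway sample file; on a real control plane node you would back up and then target `/etc/kubernetes/manifests/etcd.yaml` instead, and the field list is taken from the diff above.

```shell
# Self-contained sample standing in for /etc/kubernetes/manifests/etcd.yaml
# (only the fields relevant to this workaround are included).
cat > /tmp/etcd-sample.yaml <<'EOF'
spec:
  containers:
  - livenessProbe:
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
    name: etcd
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  restartPolicy: Always
  schedulerName: default-scheduler
  terminationGracePeriodSeconds: 30
EOF

# Delete the PodSpec fields that the affected kubeadm versions defaulted.
sed -i \
  -e '/successThreshold: 1/d' \
  -e '/terminationMessagePath:/d' \
  -e '/terminationMessagePolicy:/d' \
  -e '/dnsPolicy: ClusterFirst/d' \
  -e '/enableServiceLinks: true/d' \
  -e '/restartPolicy: Always/d' \
  -e '/schedulerName: default-scheduler/d' \
  -e '/terminationGracePeriodSeconds: 30/d' \
  /tmp/etcd-sample.yaml

cat /tmp/etcd-sample.yaml
```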

More information can be found in the [tracking issue](https://github.com/kubernetes/kubeadm/issues/2927) for this bug.
