Commit 14ac1b3

Merge pull request #44058 from chendave/troubleshooting
Update trouble shooting to include the issue of etcd upgrade
2 parents dbb6146 + 6c6ace0 commit 14ac1b3

File tree

1 file changed: +79 −0 lines changed

content/en/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm.md

Lines changed: 79 additions & 0 deletions
@@ -431,3 +431,82 @@ See [Enabling signed kubelet serving certificates](/docs/tasks/administer-cluste
to understand how to configure the kubelets in a kubeadm cluster to have properly signed serving certificates.

Also see [How to run the metrics-server securely](https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md#how-to-run-metrics-server-securely).

## Upgrade fails due to etcd hash not changing

This only applies when upgrading a control plane node with a kubeadm binary v1.28.3 or later,
where the node is currently managed by kubeadm versions v1.28.0, v1.28.1 or v1.28.2.
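As a quick check of whether a node falls in the affected range, the managing kubeadm version can be matched against the three affected releases. This is a hedged, self-contained sketch: on a real node you would obtain the version with `kubeadm version -o short` (or from whichever kubeadm last wrote the etcd manifest); here it is hard-coded for illustration.

```shell
# Sketch only: replace the hard-coded value with the output of
# "kubeadm version -o short" on the node being checked.
current="v1.28.1"

case "$current" in
  v1.28.0|v1.28.1|v1.28.2)
    echo "affected: the etcd hash workaround may be needed" ;;
  *)
    echo "not affected" ;;
esac
```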

Here is the error message you may encounter:

```
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
[upgrade/etcd] Waiting for previous etcd to become available
I0907 10:10:09.109104    3704 etcd.go:588] [etcd] attempting to see if all cluster endpoints ([https://172.17.0.6:2379/ https://172.17.0.4:2379/ https://172.17.0.3:2379/]) are available 1/10
[upgrade/etcd] Etcd was rolled back and is now available
static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests
	cmd/kubeadm/app/phases/upgrade/staticpods.go:525
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.upgradeComponent
	cmd/kubeadm/app/phases/upgrade/staticpods.go:254
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
	cmd/kubeadm/app/phases/upgrade/staticpods.go:338
...
```

The reason for this failure is that the affected versions generate an etcd manifest file with unwanted defaults in the PodSpec.
This results in a diff during the manifest comparison, so kubeadm expects the Pod hash to change, but the kubelet never updates the hash.
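To see why one stray defaulted field is enough to trigger the mismatch, here is an illustration (not kubeadm's internal mechanism): the static Pod hash is derived from the manifest content, so a manifest written with a default such as `successThreshold: 1` never matches one written without it. The real hash is computed by the kubelet and exposed on the mirror Pod as the `kubernetes.io/config.hash` annotation; a plain `sha256sum` over two toy manifests merely stands in for it.

```shell
# Illustration only: two toy manifests that differ by a single defaulted
# field produce different content hashes. The kubelet's actual static Pod
# hash is internal, but it is likewise content-derived.
printf 'spec:\n  periodSeconds: 10\n' > /tmp/etcd-a.yaml
printf 'spec:\n  periodSeconds: 10\n  successThreshold: 1\n' > /tmp/etcd-b.yaml
sha256sum /tmp/etcd-a.yaml /tmp/etcd-b.yaml   # the two digests differ
```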

There are two ways to work around this issue if you see it in your cluster:

- The etcd upgrade can be skipped between the affected versions and v1.28.3 (or later) by using:

  ```shell
  kubeadm upgrade {apply|node} [version] --etcd-upgrade=false
  ```

  This is not recommended in case a new etcd version was introduced by a later v1.28 patch version.

- Before the upgrade, patch the manifest for the etcd static Pod to remove the problematic defaulted attributes:

  ```patch
  diff --git a/etc/kubernetes/manifests/etcd_defaults.yaml b/etc/kubernetes/manifests/etcd_origin.yaml
  index d807ccbe0aa..46b35f00e15 100644
  --- a/etc/kubernetes/manifests/etcd_defaults.yaml
  +++ b/etc/kubernetes/manifests/etcd_origin.yaml
  @@ -43,7 +43,6 @@ spec:
           scheme: HTTP
         initialDelaySeconds: 10
         periodSeconds: 10
  -      successThreshold: 1
         timeoutSeconds: 15
       name: etcd
       resources:
  @@ -59,26 +58,18 @@ spec:
           scheme: HTTP
         initialDelaySeconds: 10
         periodSeconds: 10
  -      successThreshold: 1
         timeoutSeconds: 15
  -    terminationMessagePath: /dev/termination-log
  -    terminationMessagePolicy: File
       volumeMounts:
       - mountPath: /var/lib/etcd
         name: etcd-data
       - mountPath: /etc/kubernetes/pki/etcd
         name: etcd-certs
  -  dnsPolicy: ClusterFirst
  -  enableServiceLinks: true
     hostNetwork: true
     priority: 2000001000
     priorityClassName: system-node-critical
  -  restartPolicy: Always
  -  schedulerName: default-scheduler
     securityContext:
       seccompProfile:
         type: RuntimeDefault
  -  terminationGracePeriodSeconds: 30
     volumes:
     - hostPath:
         path: /etc/kubernetes/pki/etcd
  ```
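If you prefer scripting that edit, the same defaulted fields can be deleted with `sed`. This is a hedged, self-contained sketch that operates on a throwaway sample file; on a real control plane node you would back up and then target `/etc/kubernetes/manifests/etcd.yaml` instead, and the field list is taken from the diff above.

```shell
# Self-contained sample standing in for /etc/kubernetes/manifests/etcd.yaml
# (only the fields relevant to this workaround are included).
cat > /tmp/etcd-sample.yaml <<'EOF'
spec:
  containers:
  - livenessProbe:
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
    name: etcd
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  restartPolicy: Always
  schedulerName: default-scheduler
  terminationGracePeriodSeconds: 30
EOF

# Delete the PodSpec fields that the affected kubeadm versions defaulted.
sed -i \
  -e '/successThreshold: 1/d' \
  -e '/terminationMessagePath:/d' \
  -e '/terminationMessagePolicy:/d' \
  -e '/dnsPolicy: ClusterFirst/d' \
  -e '/enableServiceLinks: true/d' \
  -e '/restartPolicy: Always/d' \
  -e '/schedulerName: default-scheduler/d' \
  -e '/terminationGracePeriodSeconds: 30/d' \
  /tmp/etcd-sample.yaml

cat /tmp/etcd-sample.yaml
```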

More information can be found in the [tracking issue](https://github.com/kubernetes/kubeadm/issues/2927) for this bug.
