
ETCD disaster recovery failures on a fresh cluster #971

@Danil-Grigorev

Description

I tried to restore an etcd backup taken from a freshly deployed k0s cluster and perform a disaster recovery on a different cluster deployed from the same template. There appear to be several issues, which I'd like to showcase here. Unfortunately, no manual attempts got the cluster back to a healthy state afterwards, beyond a partially working single-controller instance.

Here is a list of issues I observed:

  1. kube-proxy errors - mitigated by restart
  2. Certificate issues with pod logs - mitigated by specifying --insecure-skip-tls-verify-backend=true for kubectl. Possibly similar to Restoring kubelet config from backup fails #367
  3. ETCD timeouts after first node rollout during second controller join

Steps performed:

k0sctl apply - prepare initial cluster - 3 CP + 3 workers
k0s etcd backup
k0sctl reset
k0sctl apply --restore-from backup.tar.gz - same configuration, different nodes, 3 CP + 3 workers
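For context, both clusters were created from a k0sctl spec roughly like the one below. This is a minimal sketch, not the actual template: the SSH user/key path are placeholders, the k0s version is inferred from the node output, and any k0s config overrides are omitted.

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: mke3
spec:
  k0s:
    version: v1.32.6+k0s.0
  hosts:
    # 3 controllers
    - role: controller
      ssh: {address: 172.31.0.60, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    - role: controller
      ssh: {address: 172.31.0.32, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    - role: controller
      ssh: {address: 172.31.0.125, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    # 3 workers
    - role: worker
      ssh: {address: 172.31.0.233, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    - role: worker
      ssh: {address: 172.31.0.68, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    - role: worker
      ssh: {address: 172.31.0.88, user: ubuntu, keyPath: ~/.ssh/id_rsa}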

Restore step completed without issues:

➜  mke3 git:(main) ✗ kubectl get ns                       
NAME               STATUS   AGE
calico-apiserver   Active   11m
calico-system      Active   11m
default            Active   12m
k0rdent            Active   11m
k0s-autopilot      Active   11m
k0s-system         Active   11m
kube-node-lease    Active   12m
kube-public        Active   12m
kube-system        Active   12m
mgmt               Active   6m46s
mke                Active   6m32s
projectsveltos     Active   7m24s
tigera-operator    Active   11m
➜  mke3 git:(main) ✗ kubectl get nodes
NAME                                            STATUS   ROLES           AGE   VERSION
ip-172-31-0-125.eu-central-1.compute.internal   Ready    control-plane   11m   v1.32.6+k0s
ip-172-31-0-233.eu-central-1.compute.internal   Ready    <none>          11m   v1.32.6+k0s
ip-172-31-0-32.eu-central-1.compute.internal    Ready    control-plane   11m   v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal    Ready    control-plane   12m   v1.32.6+k0s
ip-172-31-0-68.eu-central-1.compute.internal    Ready    <none>          11m   v1.32.6+k0s
ip-172-31-0-88.eu-central-1.compute.internal    Ready    <none>          11m   v1.32.6+k0s

but the overall apply eventually failed while trying to join the second controller instance.

0.5. Old nodes present

It seems that the etcd restore preserves the previous nodes even in the k0sctl apply scenario. Shouldn't these always be removed, since apply will join them again with up-to-date IP addresses and configuration? A manual cleanup sketch follows the node listing below.

➜  mke3 git:(main) ✗ kubectl get nodes
NAME                                            STATUS   ROLES           AGE   VERSION
ip-172-31-0-125.eu-central-1.compute.internal   Ready    control-plane   11m   v1.32.6+k0s
ip-172-31-0-233.eu-central-1.compute.internal   Ready    <none>          11m   v1.32.6+k0s
ip-172-31-0-32.eu-central-1.compute.internal    Ready    control-plane   11m   v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal    Ready    control-plane   12m   v1.32.6+k0s
ip-172-31-0-68.eu-central-1.compute.internal    Ready    <none>          11m   v1.32.6+k0s
ip-172-31-0-88.eu-central-1.compute.internal    Ready    <none>          11m   v1.32.6+k0s
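If manual cleanup is the expected workaround for now, this is roughly what I would run per stale node. The node name and peer address are taken from the old cluster shown in section 1; the k0s etcd leave flag may differ between k0s versions and only applies to stale controller members, which, judging by the member list in section 3, may not even survive the restore:

# remove the stale Kubernetes node object
kubectl delete node ip-172-31-0-206.eu-central-1.compute.internal

# if a stale controller also left an etcd member behind, drop it from a healthy controller
k0s etcd leave --peer-address 172.31.0.206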

1. Kube-proxy errors

Here is the state I was able to observe by manually cancelling the restore before the second node join:

➜  mke3 git:(main) ✗ kubectl get nodes        
NAME                                            STATUS     ROLES           AGE     VERSION
ip-172-31-0-183.eu-central-1.compute.internal   NotReady   <none>          4h8m    v1.32.6+k0s
ip-172-31-0-206.eu-central-1.compute.internal   NotReady   control-plane   5h31m   v1.32.6+k0s
ip-172-31-0-217.eu-central-1.compute.internal   NotReady   control-plane   4h10m   v1.32.6+k0s
ip-172-31-0-224.eu-central-1.compute.internal   NotReady   control-plane   4h9m    v1.32.6+k0s
ip-172-31-0-227.eu-central-1.compute.internal   NotReady   <none>          4h8m    v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal    NotReady   control-plane   17m     v1.32.6+k0s
ip-172-31-0-69.eu-central-1.compute.internal    NotReady   <none>          4h8m    v1.32.6+k0s
➜  mke3 git:(main) ✗ kubectl get pods -n calico-system  
NAME                                       READY   STATUS                  RESTARTS       AGE
calico-kube-controllers-7669678ddb-t2hsw   1/1     Running                 1              5h33m
calico-node-6cgld                          1/1     Running                 0              4h10m
calico-node-99cdh                          1/1     Running                 1              5h33m
calico-node-bv2qr                          1/1     Running                 0              4h10m
calico-node-gzdbb                          1/1     Running                 0              4h10m
calico-node-pf9vc                          1/1     Running                 0              4h11m
calico-node-pwrvb                          0/1     Init:CrashLoopBackOff   8 (4m6s ago)   20m
calico-node-tp8fx                          1/1     Running                 1              5h33m
calico-typha-6b975b87d6-js5st              1/1     Running                 0              4h11m
calico-typha-6b975b87d6-m6ff9              1/1     Running                 1              5h33m
calico-typha-6b975b87d6-xhhvx              1/1     Running                 0              4h10m
csi-node-driver-2tlpr                      2/2     Running                 0              4h10m
csi-node-driver-b7vbf                      2/2     Running                 2              5h33m
csi-node-driver-nflkq                      2/2     Running                 2              5h33m
csi-node-driver-r2sdc                      2/2     Running                 0              4h10m
csi-node-driver-rvb95                      0/2     ContainerCreating       0              20m
csi-node-driver-tln5g                      2/2     Running                 0              4h11m
csi-node-driver-x9ksz                      2/2     Running                 0              4h10m
goldmane-77b796bd9-frrw5                   1/1     Running                 1              5h33m
whisker-5d756b79c5-dfhf8                   2/2     Running                 2              5h33m

The newly added node was not moving to a Ready state.

Cause: Calico did not roll out properly due to kube-proxy. No amount of waiting led to the issue resolving itself.

Here are the logs from the kube-proxy pod:

➜  mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true 
...
2025-10-27T13:40:22.146156841Z stderr F E1027 13:40:22.146060       1 event_broadcaster.go:279] "Unable to write event (may retry after sleeping)" err="Post \"https://84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com:6443/apis/events.k8s.io/v1/namespaces/default/events\": dial tcp: lookup 84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host"

Restarting the kube-proxy pod helps (a one-liner for the restart is sketched after the output):

➜  mke3 git:(main) ✗ kubectl get nodes                
NAME                                           STATUS   ROLES           AGE   VERSION
ip-172-31-0-60.eu-central-1.compute.internal   Ready    control-plane   30m   v1.32.6+k0s
➜  mke3 git:(main) ✗ kubectl get pods -n calico-system              

NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-7669678ddb-z9z7t   1/1     Running   0          10m
calico-node-5gc9h                          1/1     Running   0          9m36s
calico-typha-6b975b87d6-2js7b              1/1     Running   0          10m
csi-node-driver-rvb95                      2/2     Running   0          32m
goldmane-77b796bd9-vtg2t                   1/1     Running   0          10m
whisker-5d756b79c5-bdw9x                   2/2     Running   0          10m
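For reference, the restart above was simply deleting the pod so the DaemonSet recreates it. Assuming the usual k8s-app=kube-proxy label on the k0s-managed DaemonSet, the equivalent one-liner is:

kubectl -n kube-system delete pod -l k8s-app=kube-proxy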

2. k0s kc logs - kubelet certificate issues

k0s kc logs reports a certificate error and requires the --insecure-skip-tls-verify-backend=true flag; a quick way to confirm the certificate mismatch is sketched after the output below.

➜  mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system                                         
Error from server: Get "https://172.31.0.51:10250/containerLogs/kube-system/kube-proxy-dcmfv/kube-proxy": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes-ca")
➜  mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true 
...
2025-10-27T13:40:22.146156841Z stderr F E1027 13:40:22.146060       1 event_broadcaster.go:279] "Unable to write event (may retry after sleeping)" err="Post \"https://84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com:6443/apis/events.k8s.io/v1/namespaces/default/events\": dial tcp: lookup 84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host"
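To confirm the mismatch, the kubelet serving certificate can be checked against the CA that came back with the restore. A sketch: the node IP and port are taken from the error above, and /var/lib/k0s/pki/ca.crt is the default k0s PKI location on a controller (adjust for a custom data dir):

# grab the kubelet serving certificate from the node
openssl s_client -connect 172.31.0.51:10250 </dev/null 2>/dev/null | openssl x509 > /tmp/kubelet.crt
# verify it against the restored cluster CA - fails with the same "unknown authority" error
openssl verify -CAfile /var/lib/k0s/pki/ca.crt /tmp/kubelet.crt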

3. ETCD cluster ID mismatch on second node join

After joining the second controller node, while waiting for node readiness, etcd on the initial node starts to time out. It appears that during the restore the new controller node assumes it is the etcd leader and the only member:

root@ip-172-31-0-57:~# k0s etcd member-list
{"members":{"ip-172-31-0-57.eu-central-1.compute.internal":"https://172.31.0.57:2380"}}

Logs from the k0scontroller service on the second node report the following (a direct way to compare the cluster IDs on both controllers is sketched after the log):

root@ip-172-31-0-57:~# journalctl -u k0scontroller.service | tail -1
Oct 29 16:21:43 ip-172-31-0-57.eu-central-1.compute.internal k0s[37685]: time="2025-10-29 16:21:43" level=info msg="{\"level\":\"warn\",\"ts\":\"2025-10-29T16:21:43.296444Z\",\"caller\":\"rafthttp/http.go:500\",\"msg\":\"request cluster ID mismatch\",\"local-member-id\":\"f1217a169d84161d\",\"local-member-cluster-id\":\"68234dd6ec1a9690\",\"local-member-server-version\":\"3.5.18\",\"local-member-server-minimum-cluster-version\":\"3.0.0\",\"remote-peer-server-name\":\"bb20fd5aa20bce5c\",\"remote-peer-server-version\":\"3.5.18\",\"remote-peer-server-minimum-cluster-version\":\"3.0.0\",\"remote-peer-cluster-id\":\"9258a30bab734a85\"}" component=etcd stream=stderr
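To make the mismatch visible directly, the cluster IDs can be compared by querying the local etcd member on each controller. etcdctl and jq are not shipped with k0s and need to be installed separately; the certificate paths below are the k0s defaults and may differ with a custom data dir:

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/k0s/pki/etcd/ca.crt \
  --cert=/var/lib/k0s/pki/etcd/server.crt \
  --key=/var/lib/k0s/pki/etcd/server.key \
  endpoint status -w json | jq '.[].Status.header.cluster_id'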

Questions

  • Kube-proxy startup issues → require a manual pod restart to recover.
    • Is it possible to automate this step in k0sctl?
  • Certificate verification errors → logs are only accessible with --insecure-skip-tls-verify-backend=true.
    • Could the kubelet serving certificates be restored or regenerated to cover newly added nodes? No other certificates seemed to cause an issue at this stage.
  • ETCD cluster ID mismatch on controller join → causes etcd timeouts and a broken multi-controller state.
    • Should restored clusters assume the current cluster ID, reuse the restored cluster ID, or manage etcd members in some other way?

    Labels

    documentation (Improvements or additions to documentation), enhancement (New feature or request), question (Further information is requested)
