Description
I've tried to restore an etcd backup taken from a freshly deployed k0s cluster and perform a disaster recovery on a different cluster deployed from the same template. It seems there are several issues, which I'd like to showcase here. Unfortunately, no manual attempt allowed me to get the cluster into a healthy state afterwards, beyond a partially working single-controller instance.
Here is a list of issues I observed:
- kube-proxy errors - mitigated by restart
- Certificate issues with pod logs - mitigated by specifying --insecure-skip-tls-verify-backend=true for kubectl. Possibly similar to Restoring kubelet config from backup fails #367
- ETCD timeouts after first node rollout during second controller join
Steps performed:
k0sctl apply - prepare initial cluster - 3 CP + 3 workers
k0s etcd backup
k0sctl reset
k0sctl apply --restore-from backup.tar.gz - same configuration, different nodes, 3 CP + 3 workers
Restore step completed without issues:
➜ mke3 git:(main) ✗ kubectl get ns
NAME STATUS AGE
calico-apiserver Active 11m
calico-system Active 11m
default Active 12m
k0rdent Active 11m
k0s-autopilot Active 11m
k0s-system Active 11m
kube-node-lease Active 12m
kube-public Active 12m
kube-system Active 12m
mgmt Active 6m46s
mke Active 6m32s
projectsveltos Active 7m24s
tigera-operator Active 11m
➜ mke3 git:(main) ✗ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-0-125.eu-central-1.compute.internal Ready control-plane 11m v1.32.6+k0s
ip-172-31-0-233.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
ip-172-31-0-32.eu-central-1.compute.internal Ready control-plane 11m v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal Ready control-plane 12m v1.32.6+k0s
ip-172-31-0-68.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
ip-172-31-0-88.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
but the overall apply eventually failed while trying to join the second controller instance.
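For reference, the same flow condensed into commands (a sketch only; the config file names and the controller used for the backup are placeholders, not copied from my setup):

```sh
# Reproduction sketch (config file names and <controller-0> are placeholders)
k0sctl apply -c k0sctl.yaml                                         # initial cluster: 3 controllers + 3 workers
ssh root@<controller-0> 'k0s etcd backup'                           # take the backup, as in the steps above
k0sctl reset -c k0sctl.yaml                                         # tear down the original cluster
k0sctl apply -c k0sctl-new-nodes.yaml --restore-from backup.tar.gz  # same template, different nodes
```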
0.5. Old nodes present
It seems that the etcd restore preserves the previous cluster's node entries even in the k0sctl apply scenario. Shouldn't these always be removed, since apply joins the nodes again with up-to-date IP addresses and configuration anyway?
➜ mke3 git:(main) ✗ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-0-125.eu-central-1.compute.internal Ready control-plane 11m v1.32.6+k0s
ip-172-31-0-233.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
ip-172-31-0-32.eu-central-1.compute.internal Ready control-plane 11m v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal Ready control-plane 12m v1.32.6+k0s
ip-172-31-0-68.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
ip-172-31-0-88.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
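Until that question is settled, the stale entries can at least be cleaned up manually. A minimal sketch, assuming the leftover nodes can be told apart by name or by being the only NotReady ones:

```sh
# Remove a Node object left over from the pre-restore cluster (name is a placeholder)
kubectl delete node <old-node-name>

# Or, if the stale nodes are the only NotReady ones, remove them in one go
kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}' | xargs -r kubectl delete node
```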
1. Kube-proxy errors
Here is a state I was able to observe by manually cancelling the restore before the second node join:
➜ mke3 git:(main) ✗ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-0-183.eu-central-1.compute.internal NotReady <none> 4h8m v1.32.6+k0s
ip-172-31-0-206.eu-central-1.compute.internal NotReady control-plane 5h31m v1.32.6+k0s
ip-172-31-0-217.eu-central-1.compute.internal NotReady control-plane 4h10m v1.32.6+k0s
ip-172-31-0-224.eu-central-1.compute.internal NotReady control-plane 4h9m v1.32.6+k0s
ip-172-31-0-227.eu-central-1.compute.internal NotReady <none> 4h8m v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal NotReady control-plane 17m v1.32.6+k0s
ip-172-31-0-69.eu-central-1.compute.internal NotReady <none> 4h8m v1.32.6+k0s
➜ mke3 git:(main) ✗ kubectl get pods -n calico-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-7669678ddb-t2hsw 1/1 Running 1 5h33m
calico-node-6cgld 1/1 Running 0 4h10m
calico-node-99cdh 1/1 Running 1 5h33m
calico-node-bv2qr 1/1 Running 0 4h10m
calico-node-gzdbb 1/1 Running 0 4h10m
calico-node-pf9vc 1/1 Running 0 4h11m
calico-node-pwrvb 0/1 Init:CrashLoopBackOff 8 (4m6s ago) 20m
calico-node-tp8fx 1/1 Running 1 5h33m
calico-typha-6b975b87d6-js5st 1/1 Running 0 4h11m
calico-typha-6b975b87d6-m6ff9 1/1 Running 1 5h33m
calico-typha-6b975b87d6-xhhvx 1/1 Running 0 4h10m
csi-node-driver-2tlpr 2/2 Running 0 4h10m
csi-node-driver-b7vbf 2/2 Running 2 5h33m
csi-node-driver-nflkq 2/2 Running 2 5h33m
csi-node-driver-r2sdc 2/2 Running 0 4h10m
csi-node-driver-rvb95 0/2 ContainerCreating 0 20m
csi-node-driver-tln5g 2/2 Running 0 4h11m
csi-node-driver-x9ksz 2/2 Running 0 4h10m
goldmane-77b796bd9-frrw5 1/1 Running 1 5h33m
whisker-5d756b79c5-dfhf8 2/2 Running 2 5h33m
The newly added node was not moving to a Ready state.
Cause: Calico did not roll out properly due to kube-proxy. No amount of waiting led to the issue resolving itself.
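One quick way to confirm that kube-proxy is the culprit is to check its pods and where they run (a sketch; the k8s-app=kube-proxy label is an assumption based on the stock kube-proxy DaemonSet):

```sh
# List kube-proxy pods with the nodes they run on (label selector is an assumption)
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
```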
Here are the logs from the kube-proxy pod:
➜ mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true
...
2025-10-27T13:40:22.146156841Z stderr F E1027 13:40:22.146060 1 event_broadcaster.go:279] "Unable to write event (may retry after sleeping)" err="Post \"https://84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com:6443/apis/events.k8s.io/v1/namespaces/default/events\": dial tcp: lookup 84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host"
Restarting the kube-proxy pod helps (a restart sketch follows the output below):
➜ mke3 git:(main) ✗ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-0-60.eu-central-1.compute.internal Ready control-plane 30m v1.32.6+k0s
➜ mke3 git:(main) ✗ kubectl get pods -n calico-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-7669678ddb-z9z7t 1/1 Running 0 10m
calico-node-5gc9h 1/1 Running 0 9m36s
calico-typha-6b975b87d6-2js7b 1/1 Running 0 10m
csi-node-driver-rvb95 2/2 Running 0 32m
goldmane-77b796bd9-vtg2t 1/1 Running 0 10m
whisker-5d756b79c5-bdw9x 2/2 Running 0 10m
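For reference, the restart itself just recreates the kube-proxy pods so they re-resolve the API server address. A sketch, assuming the stock kube-proxy DaemonSet in kube-system:

```sh
# Roll the whole DaemonSet (DaemonSet name is an assumption)
kubectl rollout restart daemonset/kube-proxy -n kube-system

# Or delete only the affected pod (pod name taken from the logs above)
kubectl delete pod kube-proxy-dcmfv -n kube-system
```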
2. k0s kc logs - kubelet certificate issues
k0s kc logs reports a certificate error and requires the --insecure-skip-tls-verify-backend=true flag:
➜ mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system
Error from server: Get "https://172.31.0.51:10250/containerLogs/kube-system/kube-proxy-dcmfv/kube-proxy": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes-ca")
➜ mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true
...
2025-10-27T13:40:22.146156841Z stderr F E1027 13:40:22.146060 1 event_broadcaster.go:279] "Unable to write event (may retry after sleeping)" err="Post \"https://84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com:6443/apis/events.k8s.io/v1/namespaces/default/events\": dial tcp: lookup 84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host"
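The error suggests the kubelet serving certificate on the new node is not signed by a CA the API server trusts. One way to check is to look at the certificate the kubelet actually presents on port 10250 and compare it with the cluster CA (a sketch; the IP comes from the error above, and /var/lib/k0s/pki/ca.crt is an assumption about where k0s keeps the cluster CA):

```sh
# Inspect the serving certificate presented by the kubelet on the affected node
openssl s_client -connect 172.31.0.51:10250 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates

# Compare with the cluster CA on a controller (path is an assumption)
openssl x509 -noout -subject -in /var/lib/k0s/pki/ca.crt
```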
3. ETCD cluster ID mismatch on second node join
After joining the second controller node, during the wait for node readiness, etcd on the initial node starts to time out. It appears that during restore the new controller node assumes it is the etcd leader and the only member:
root@ip-172-31-0-57:~# k0s etcd member-list
{"members":{"ip-172-31-0-57.eu-central-1.compute.internal":"https://172.31.0.57:2380"}}Logs from k0scontroller on the second node report the following:
root@ip-172-31-0-57:~# journalctl -u k0scontroller.service | tail -1
Oct 29 16:21:43 ip-172-31-0-57.eu-central-1.compute.internal k0s[37685]: time="2025-10-29 16:21:43" level=info msg="{\"level\":\"warn\",\"ts\":\"2025-10-29T16:21:43.296444Z\",\"caller\":\"rafthttp/http.go:500\",\"msg\":\"request cluster ID mismatch\",\"local-member-id\":\"f1217a169d84161d\",\"local-member-cluster-id\":\"68234dd6ec1a9690\",\"local-member-server-version\":\"3.5.18\",\"local-member-server-minimum-cluster-version\":\"3.0.0\",\"remote-peer-server-name\":\"bb20fd5aa20bce5c\",\"remote-peer-server-version\":\"3.5.18\",\"remote-peer-server-minimum-cluster-version\":\"3.0.0\",\"remote-peer-cluster-id\":\"9258a30bab734a85\"}" component=etcd stream=stderr
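The mismatch can be confirmed by comparing the cluster IDs the two etcd instances report. A sketch using etcdctl directly against the k0s-managed etcd, run on each controller (assumes etcdctl is available and that the certificate paths under /var/lib/k0s/pki/etcd/ match the k0s layout):

```sh
# The JSON header contains cluster_id and member_id; compare the values across controllers
ETCDCTL_API=3 etcdctl endpoint status -w json \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/k0s/pki/etcd/ca.crt \
  --cert=/var/lib/k0s/pki/etcd/server.crt \
  --key=/var/lib/k0s/pki/etcd/server.key
```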
Questions
- Kube-proxy startup issues → require a manual pod restart to recover.
  - Is it possible to automate this step in k0sctl?
- Certificate verification errors → logs are only accessible with --insecure-skip-tls-verify-backend=true.
  - Could the kubelet / apiServer certificates be restored or regenerated to include the newly added nodes? No other certificates seemed to cause an issue at this stage.
- ETCD cluster ID mismatch on controller join → causes etcd timeouts and a broken multi-controller state.
  - Should restored clusters assume the current cluster ID, use the restored cluster ID, or manage etcd members in some other way?