
ETCD disaster recovery failures on a fresh cluster #971

@Danil-Grigorev

Description

I tried to restore an etcd backup taken from a freshly deployed k0s cluster and perform a disaster recovery on a different cluster deployed from the same template. There appear to be several issues, which I'd like to showcase here. Unfortunately, no manual attempts got the cluster back to a healthy state afterwards, beyond a partially working single-controller instance.

Here is a list of issues I observed:

  1. kube-proxy errors - mitigated by restart
  2. Certificate issues with pod logs - mitigated by specifying --insecure-skip-tls-verify-backend=true for kubectl. Possibly similar to Restoring kubelet config from backup fails #367
  3. ETCD timeouts after first node rollout during second controller join

Steps performed:

k0sctl apply - prepare initial cluster - 3 CP + 3 workers
k0s etcd backup
k0sctl reset
k0sctl apply --restore-from backup.tar.gz - same configuration, different nodes, 3 CP + 3 workers
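For context, both clusters were created from a k0sctl spec roughly like the one below. This is a minimal sketch, not the actual template: the SSH user/key path are placeholders, the k0s version is inferred from the node output, and any k0s config overrides are omitted.

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: mke3
spec:
  k0s:
    version: v1.32.6+k0s.0
  hosts:
    # 3 controllers
    - role: controller
      ssh: {address: 172.31.0.60, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    - role: controller
      ssh: {address: 172.31.0.32, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    - role: controller
      ssh: {address: 172.31.0.125, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    # 3 workers
    - role: worker
      ssh: {address: 172.31.0.233, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    - role: worker
      ssh: {address: 172.31.0.68, user: ubuntu, keyPath: ~/.ssh/id_rsa}
    - role: worker
      ssh: {address: 172.31.0.88, user: ubuntu, keyPath: ~/.ssh/id_rsa}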

Restore step completed without issues:

➜  mke3 git:(main) ✗ kubectl get ns                       
NAME               STATUS   AGE
calico-apiserver   Active   11m
calico-system      Active   11m
default            Active   12m
k0rdent            Active   11m
k0s-autopilot      Active   11m
k0s-system         Active   11m
kube-node-lease    Active   12m
kube-public        Active   12m
kube-system        Active   12m
mgmt               Active   6m46s
mke                Active   6m32s
projectsveltos     Active   7m24s
tigera-operator    Active   11m
➜  mke3 git:(main) ✗ kubectl get nodes
NAME                                            STATUS   ROLES           AGE   VERSION
ip-172-31-0-125.eu-central-1.compute.internal   Ready    control-plane   11m   v1.32.6+k0s
ip-172-31-0-233.eu-central-1.compute.internal   Ready    <none>          11m   v1.32.6+k0s
ip-172-31-0-32.eu-central-1.compute.internal    Ready    control-plane   11m   v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal    Ready    control-plane   12m   v1.32.6+k0s
ip-172-31-0-68.eu-central-1.compute.internal    Ready    <none>          11m   v1.32.6+k0s
ip-172-31-0-88.eu-central-1.compute.internal    Ready    <none>          11m   v1.32.6+k0s

but the overall apply eventually failed while trying to join the second controller instance.

0.5. Old nodes present

It seems that the etcd restore preserves the previous nodes even in the k0sctl apply scenario. Shouldn't these always be removed, since apply will join them again with up-to-date IP addresses and configuration? A manual cleanup sketch follows the node listing below.

➜  mke3 git:(main) ✗ kubectl get nodes
NAME                                            STATUS   ROLES           AGE   VERSION
ip-172-31-0-125.eu-central-1.compute.internal   Ready    control-plane   11m   v1.32.6+k0s
ip-172-31-0-233.eu-central-1.compute.internal   Ready    <none>          11m   v1.32.6+k0s
ip-172-31-0-32.eu-central-1.compute.internal    Ready    control-plane   11m   v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal    Ready    control-plane   12m   v1.32.6+k0s
ip-172-31-0-68.eu-central-1.compute.internal    Ready    <none>          11m   v1.32.6+k0s
ip-172-31-0-88.eu-central-1.compute.internal    Ready    <none>          11m   v1.32.6+k0s
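If manual cleanup is the expected workaround for now, this is roughly what I would run per stale node. The node name and peer address are taken from the old cluster shown in section 1; the k0s etcd leave flag may differ between k0s versions and only applies to stale controller members, which, judging by the member list in section 3, may not even survive the restore:

# remove the stale Kubernetes node object
kubectl delete node ip-172-31-0-206.eu-central-1.compute.internal

# if a stale controller also left an etcd member behind, drop it from a healthy controller
k0s etcd leave --peer-address 172.31.0.206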

1. Kube-proxy errors

Here is the state I was able to observe by manually cancelling the restore before the second node join:

➜  mke3 git:(main) ✗ kubectl get nodes        
NAME                                            STATUS     ROLES           AGE     VERSION
ip-172-31-0-183.eu-central-1.compute.internal   NotReady   <none>          4h8m    v1.32.6+k0s
ip-172-31-0-206.eu-central-1.compute.internal   NotReady   control-plane   5h31m   v1.32.6+k0s
ip-172-31-0-217.eu-central-1.compute.internal   NotReady   control-plane   4h10m   v1.32.6+k0s
ip-172-31-0-224.eu-central-1.compute.internal   NotReady   control-plane   4h9m    v1.32.6+k0s
ip-172-31-0-227.eu-central-1.compute.internal   NotReady   <none>          4h8m    v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal    NotReady   control-plane   17m     v1.32.6+k0s
ip-172-31-0-69.eu-central-1.compute.internal    NotReady   <none>          4h8m    v1.32.6+k0s
➜  mke3 git:(main) ✗ kubectl get pods -n calico-system  
NAME                                       READY   STATUS                  RESTARTS       AGE
calico-kube-controllers-7669678ddb-t2hsw   1/1     Running                 1              5h33m
calico-node-6cgld                          1/1     Running                 0              4h10m
calico-node-99cdh                          1/1     Running                 1              5h33m
calico-node-bv2qr                          1/1     Running                 0              4h10m
calico-node-gzdbb                          1/1     Running                 0              4h10m
calico-node-pf9vc                          1/1     Running                 0              4h11m
calico-node-pwrvb                          0/1     Init:CrashLoopBackOff   8 (4m6s ago)   20m
calico-node-tp8fx                          1/1     Running                 1              5h33m
calico-typha-6b975b87d6-js5st              1/1     Running                 0              4h11m
calico-typha-6b975b87d6-m6ff9              1/1     Running                 1              5h33m
calico-typha-6b975b87d6-xhhvx              1/1     Running                 0              4h10m
csi-node-driver-2tlpr                      2/2     Running                 0              4h10m
csi-node-driver-b7vbf                      2/2     Running                 2              5h33m
csi-node-driver-nflkq                      2/2     Running                 2              5h33m
csi-node-driver-r2sdc                      2/2     Running                 0              4h10m
csi-node-driver-rvb95                      0/2     ContainerCreating       0              20m
csi-node-driver-tln5g                      2/2     Running                 0              4h11m
csi-node-driver-x9ksz                      2/2     Running                 0              4h10m
goldmane-77b796bd9-frrw5                   1/1     Running                 1              5h33m
whisker-5d756b79c5-dfhf8                   2/2     Running                 2              5h33m

The newly added node was not moving to a Ready state.

Cause: Calico did not roll out properly due to kube-proxy. No amount of waiting led to the issue resolving itself.

Here are the logs from the kube-proxy pod:

➜  mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true 
...
2025-10-27T13:40:22.146156841Z stderr F E1027 13:40:22.146060       1 event_broadcaster.go:279] "Unable to write event (may retry after sleeping)" err="Post \"https://84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com:6443/apis/events.k8s.io/v1/namespaces/default/events\": dial tcp: lookup 84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host"

Restarting the kube-proxy pod helps (a one-liner for the restart is sketched after the output):

➜  mke3 git:(main) ✗ kubectl get nodes                
NAME                                           STATUS   ROLES           AGE   VERSION
ip-172-31-0-60.eu-central-1.compute.internal   Ready    control-plane   30m   v1.32.6+k0s
➜  mke3 git:(main) ✗ kubectl get pods -n calico-system              

NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-7669678ddb-z9z7t   1/1     Running   0          10m
calico-node-5gc9h                          1/1     Running   0          9m36s
calico-typha-6b975b87d6-2js7b              1/1     Running   0          10m
csi-node-driver-rvb95                      2/2     Running   0          32m
goldmane-77b796bd9-vtg2t                   1/1     Running   0          10m
whisker-5d756b79c5-bdw9x                   2/2     Running   0          10m
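For reference, the restart above was simply deleting the pod so the DaemonSet recreates it. Assuming the usual k8s-app=kube-proxy label on the k0s-managed DaemonSet, the equivalent one-liner is:

kubectl -n kube-system delete pod -l k8s-app=kube-proxy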

2. k0s kc logs - kubelet certificate issues

k0s kc logs reports a certificate error and requires the --insecure-skip-tls-verify-backend=true flag; a quick way to confirm the certificate mismatch is sketched after the output below.

➜  mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system                                         
Error from server: Get "https://172.31.0.51:10250/containerLogs/kube-system/kube-proxy-dcmfv/kube-proxy": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes-ca")
➜  mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true 
...
2025-10-27T13:40:22.146156841Z stderr F E1027 13:40:22.146060       1 event_broadcaster.go:279] "Unable to write event (may retry after sleeping)" err="Post \"https://84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com:6443/apis/events.k8s.io/v1/namespaces/default/events\": dial tcp: lookup 84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host"
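To confirm the mismatch, the kubelet serving certificate can be checked against the CA that came back with the restore. A sketch: the node IP and port are taken from the error above, and /var/lib/k0s/pki/ca.crt is the default k0s PKI location on a controller (adjust for a custom data dir):

# grab the kubelet serving certificate from the node
openssl s_client -connect 172.31.0.51:10250 </dev/null 2>/dev/null | openssl x509 > /tmp/kubelet.crt
# verify it against the restored cluster CA - fails with the same "unknown authority" error
openssl verify -CAfile /var/lib/k0s/pki/ca.crt /tmp/kubelet.crt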

3. ETCD cluster ID mismatch on second node join

After joining the second controller node, while waiting for node readiness, etcd on the initial node starts to time out. It appears that during the restore the new controller node assumes it is the etcd leader and the only member:

root@ip-172-31-0-57:~# k0s etcd member-list
{"members":{"ip-172-31-0-57.eu-central-1.compute.internal":"https://172.31.0.57:2380"}}

Logs from the k0scontroller service on the second node report the following (a direct way to compare the cluster IDs on both controllers is sketched after the log):

root@ip-172-31-0-57:~# journalctl -u k0scontroller.service | tail -1
Oct 29 16:21:43 ip-172-31-0-57.eu-central-1.compute.internal k0s[37685]: time="2025-10-29 16:21:43" level=info msg="{\"level\":\"warn\",\"ts\":\"2025-10-29T16:21:43.296444Z\",\"caller\":\"rafthttp/http.go:500\",\"msg\":\"request cluster ID mismatch\",\"local-member-id\":\"f1217a169d84161d\",\"local-member-cluster-id\":\"68234dd6ec1a9690\",\"local-member-server-version\":\"3.5.18\",\"local-member-server-minimum-cluster-version\":\"3.0.0\",\"remote-peer-server-name\":\"bb20fd5aa20bce5c\",\"remote-peer-server-version\":\"3.5.18\",\"remote-peer-server-minimum-cluster-version\":\"3.0.0\",\"remote-peer-cluster-id\":\"9258a30bab734a85\"}" component=etcd stream=stderr
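To make the mismatch visible directly, the cluster IDs can be compared by querying the local etcd member on each controller. etcdctl and jq are not shipped with k0s and need to be installed separately; the certificate paths below are the k0s defaults and may differ with a custom data dir:

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/k0s/pki/etcd/ca.crt \
  --cert=/var/lib/k0s/pki/etcd/server.crt \
  --key=/var/lib/k0s/pki/etcd/server.key \
  endpoint status -w json | jq '.[].Status.header.cluster_id'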

Questions

  • Kube-proxy startup issues → require a manual pod restart to recover.
    • Is it possible to automate this step in k0sctl?
  • Certificate verification errors → logs are only accessible with --insecure-skip-tls-verify-backend=true.
    • Could the kubelet serving certificates be restored or regenerated to cover newly added nodes? No other certificates seemed to cause an issue at this stage.
  • ETCD cluster ID mismatch on controller join → causes etcd timeouts and a broken multi-controller state.
    • Should restored clusters assume the current cluster ID, reuse the restored cluster ID, or manage etcd members in some other way?

    Labels

    documentation (Improvements or additions to documentation), enhancement (New feature or request), question (Further information is requested)
