
Commit beb2c16

adding steps to recover cluster (#3438)
Document steps to recover cluster after accidental ELB deletion
1 parent 7e6731c commit beb2c16


docs/book/src/topics/troubleshooting.md

Lines changed: 148 additions & 0 deletions
@@ -59,3 +59,151 @@ $ aws iam get-instance-profile --instance-profile-name control-plane.cluster-api
```
If the instance profile does not look as expected, you may try recreating the CloudFormation stack using `clusterawsadm` as explained in the sections above.

## Recover a management cluster after losing the api server load balancer

These steps outline the process for recovering a management cluster after losing the load balancer for the api server. They are needed because AWS load balancers have dynamically generated DNS names: when a load balancer is deleted, CAPA will recreate it, but the new load balancer will have a different DNS name that does not match the original, so we need to update some resources as well as the certs to match the new name before the cluster is healthy again. There are a few different scenarios in which this can happen:

* The load balancer gets deleted by some external process or user.

* If a cluster is created with the same name as the management cluster in a different namespace and then deleted, it will delete the existing load balancer. This is due to ownership of AWS resources being managed by tags. See this [issue](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/969#issuecomment-519121056) for reference.

### **Access the api server locally**

1. SSH to a control plane node and modify `/etc/kubernetes/admin.conf` (a scripted version of these edits is sketched at the end of this section):

* Replace the `server` with `server: https://localhost:6443`

* Add `insecure-skip-tls-verify: true`

* Comment out `certificate-authority-data:`

2. Export the kubeconfig and ensure you can connect:

```bash
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
```

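As an illustration, the edits in step 1 can be scripted roughly as follows. This is a minimal sketch that assumes the default kubeadm layout of `/etc/kubernetes/admin.conf` and GNU `sed`; hand-editing the file works just as well.

```bash
# Point the kubeconfig at the local api server instead of the deleted load balancer.
sed -i 's|server: https://.*|server: https://localhost:6443|' /etc/kubernetes/admin.conf

# Skip TLS verification (the serving cert does not list "localhost") and turn the
# certificate-authority-data entry into a trailing comment on the same line.
sed -i 's|certificate-authority-data:|insecure-skip-tls-verify: true # certificate-authority-data:|' /etc/kubernetes/admin.conf
```
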
### **Get rid of the lingering duplicate cluster**

**This step is only needed in the scenario that a duplicate cluster was created and deleted, which caused the API server load balancer to be deleted.**

1. Since the duplicate cluster is stuck deleting (some of its resources cannot be cleaned up because they are still in use), we need to stop the conflicting reconciliation process. Edit the duplicate `awscluster` object and remove the `finalizers` (a non-interactive alternative is sketched at the end of this section):

```bash
kubectl edit awscluster <clustername>
```

2. Next run `kubectl describe awscluster <clustername>` to validate that the finalizers have been removed.

3. Run `kubectl get clusters` to verify the cluster is gone.

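If you prefer not to edit the object interactively, the finalizers can also be cleared with a single patch. A minimal sketch, assuming the duplicate `awscluster` lives in `<namespace>`:

```bash
# Clearing metadata.finalizers lets the stuck deletion complete.
kubectl patch awscluster <clustername> -n <namespace> --type merge -p '{"metadata":{"finalizers":null}}'
```
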
### **Make at least one node `Ready`**

1. Right now all endpoints are down because no nodes are ready, which is particularly problematic for the coredns and cni pods. Let's get one control plane node back to healthy. On the control plane node we logged into, edit `/etc/kubernetes/kubelet.conf`:

* Replace the `server` with `server: https://localhost:6443`

* Add `insecure-skip-tls-verify: true`

* Comment out `certificate-authority-data:`

* Restart the kubelet: `systemctl restart kubelet`

2. Run `kubectl get nodes` and validate that the node is in a `Ready` state.

3. After a few minutes most things should start scheduling themselves on the node again. The pods that did not restart on their own and were causing issues were the core-dns, kube-proxy, and cni pods; those should be restarted manually (one way to do this is sketched at the end of this section).

4. (optional) Tail the capa logs to see the load balancer start to reconcile:

```bash
kubectl logs -f -n capa-system deployments.apps/capa-controller-manager
```

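For the manual restarts mentioned in step 3, something like the following works on a standard kubeadm-based cluster, where kube-proxy runs as a DaemonSet and CoreDNS as a Deployment in `kube-system`. The cni line is only a placeholder, since the object name depends on which CNI you run.

```bash
# Restart the pods that are still pointing at the old api server endpoint.
kubectl -n kube-system rollout restart daemonset kube-proxy
kubectl -n kube-system rollout restart deployment coredns
# Placeholder: substitute the DaemonSet or Deployment installed by your CNI provider.
kubectl -n kube-system rollout restart daemonset <your-cni-daemonset>
```
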
### **Update the control plane nodes with new LB settings**

1. To be safe we will do this on all control plane nodes rather than recreating them, to avoid potential data loss issues. Follow these steps on **each** control plane node.

2. Regenerate the certs for the api server using the new name. Make sure to update the service cidr and endpoint in the command below to match your cluster (a quick way to verify the resulting cert is sketched at the end of this section):

```bash
rm /etc/kubernetes/pki/apiserver.crt
rm /etc/kubernetes/pki/apiserver.key

kubeadm init phase certs apiserver --control-plane-endpoint="mynewendpoint.com" --service-cidr=100.64.0.0/13 -v10
```

3. Update the settings in `/etc/kubernetes/admin.conf`:

* Replace the `server` with `server: https://<your-new-lb.com>:6443`

* Remove `insecure-skip-tls-verify: true`

* Uncomment `certificate-authority-data:`

* Export the kubeconfig and ensure you can connect:

```bash
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
```

4. Update the settings in `/etc/kubernetes/kubelet.conf`:

* Replace the `server` with `server: https://<your-new-lb.com>:6443`

* Remove `insecure-skip-tls-verify: true`

* Uncomment `certificate-authority-data:`

* Restart the kubelet: `systemctl restart kubelet`

5. Just as we did before, we need pods to pick up the api server changes, so force a restart of pods like the cni pods, kube-proxy, core-dns, etc.

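After regenerating the certificates in step 2, you can confirm that the api server serving cert now includes the new load balancer name before moving on. A quick check, assuming `openssl` is available on the node:

```bash
# The new endpoint should appear in the Subject Alternative Name list.
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'
```
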
### Update capi settings for new LB DNS name

1. Update the control plane endpoint on the `awscluster` and `cluster` objects. To do this we need to disable the validating webhooks; we will back them up and then delete them so we can re-apply them later:

```bash
kubectl get validatingwebhookconfigurations capa-validating-webhook-configuration -o yaml > capa-webhook && kubectl delete validatingwebhookconfigurations capa-validating-webhook-configuration

kubectl get validatingwebhookconfigurations capi-validating-webhook-configuration -o yaml > capi-webhook && kubectl delete validatingwebhookconfigurations capi-validating-webhook-configuration
```

2. Edit the `spec.controlPlaneEndpoint.host` field on both the `awscluster` and `cluster` objects to point at the new endpoint.

3. Re-apply your webhooks:

```bash
kubectl apply -f capi-webhook
kubectl apply -f capa-webhook
```

4. Update the following config maps and replace the old control plane endpoint name with the new one:

```bash
kubectl edit cm -n kube-system kubeadm-config
kubectl edit cm -n kube-system kube-proxy
kubectl edit cm -n kube-public cluster-info
```

5. Edit the cluster kubeconfig secret that capi uses to talk to the management cluster. You will need to decode the secret, replace the endpoint, then re-encode and save it (a non-interactive sketch is shown at the end of this section):

```bash
kubectl edit secret -n <namespace> <cluster-name>-kubeconfig
```

6. At this point things should start to reconcile on their own, but we can use the commands in the next section to force it.

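For step 5, one way to do the decode/re-encode without `kubectl edit` is sketched below. It assumes the usual CAPI convention of storing the kubeconfig under the `value` key of the secret and GNU `base64` (`-w0` disables line wrapping).

```bash
# Decode the stored kubeconfig so it can be edited with a normal editor.
kubectl get secret -n <namespace> <cluster-name>-kubeconfig -o jsonpath='{.data.value}' | base64 -d > workload.kubeconfig

# Update the server: line in workload.kubeconfig to the new load balancer DNS name,
# then re-encode the file and write it back into the secret.
kubectl patch secret -n <namespace> <cluster-name>-kubeconfig --type merge -p "{\"data\":{\"value\":\"$(base64 -w0 workload.kubeconfig)\"}}"
```
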
### Roll all of the nodes to make sure everything is fresh

1. Roll the control plane by patching the `KubeadmControlPlane` object:

```bash
kubectl patch kcp <clusternamekcp> -n namespace --type merge -p "{\"spec\":{\"rolloutAfter\":\"`date +'%Y-%m-%dT%TZ'`\"}}"
```

2. Roll the worker nodes by patching the `MachineDeployment` (a command to watch the rollout follows):

```bash
kubectl patch machinedeployment CLUSTER_NAME-md-0 -n namespace --type merge -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"date\":\"`date +'%s'`\"}}}}}"
```
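
To watch the machines get replaced while the control plane and worker nodes roll, a simple watch from the management cluster is enough:

```bash
# Machines should be recreated one by one as the rollout progresses.
kubectl get machines -A -w
```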
