Commit c0b55a2

Merge pull request #67790 from obrown1205/OSDOCS-6831
2 parents 8750440 + fcf99e4 commit c0b55a2

2 files changed: +278 -0 lines changed

hosted_control_planes/hcp-backup-restore-dr.adoc

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ In hosted clusters, etcd pods run as part of a stateful set. The stateful set re
 
 include::modules/hosted-cluster-etcd-status.adoc[leveloffset=+2]
 include::modules/hosted-cluster-single-node-recovery.adoc[leveloffset=+2]
+include::modules/hosted-cluster-etcd-quorum-loss-recovery.adoc[leveloffset=+2]
 
 [id="hcp-backup-restore"]
 == Backing up and restoring etcd on a hosted cluster in AWS

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
// Module included in the following assembly:
//
// * hcp-backup-restore-dr.adoc

:_content-type: PROCEDURE
[id="hosted-cluster-etcd-quorum-loss-recovery_{context}"]
= Recovering an etcd cluster from a quorum loss

If multiple members of the etcd cluster have lost data or are in a `CrashLoopBackOff` status, the etcd cluster can lose quorum. To recover from a quorum loss, you must restore the etcd cluster from a snapshot.

[IMPORTANT]
====
This procedure requires API downtime.
====

.Prerequisites
* The `oc` and `jq` binaries have been installed.

.Procedure

. First, set up your environment variables and scale down the API servers:

.. Set up environment variables for your hosted cluster by entering the following commands, replacing values as necessary:
+
[source,terminal]
----
$ CLUSTER_NAME=my-cluster
----
+
[source,terminal]
----
$ HOSTED_CLUSTER_NAMESPACE=clusters
----
+
[source,terminal]
----
$ CONTROL_PLANE_NAMESPACE="${HOSTED_CLUSTER_NAMESPACE}-${CLUSTER_NAME}"
----
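+
As an optional guard, the namespace derivation can be wrapped in a small shell function that rejects empty inputs, so that an unset variable does not silently produce a namespace such as `-my-cluster`. The following is a minimal sketch; the function name `control_plane_namespace` is hypothetical and is not part of `oc` or of this procedure.
+
```shell
# Hypothetical helper (not part of oc): print "<hosted cluster namespace>-<cluster name>".
# Returns a non-zero status if either argument is missing or empty.
control_plane_namespace() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "error: hosted cluster namespace and cluster name must both be set" >&2
    return 1
  fi
  printf '%s-%s\n' "$1" "$2"
}
```
+
For example, `CONTROL_PLANE_NAMESPACE=$(control_plane_namespace "${HOSTED_CLUSTER_NAMESPACE}" "${CLUSTER_NAME}")` yields the same value as the assignment in the previous command when both variables are set.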

.. Pause reconciliation of the hosted cluster by entering the following command, replacing values as necessary:
+
[source,terminal]
----
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} -p '{"spec":{"pausedUntil":"true"}}' --type=merge
----

.. Scale down the API servers by entering the following commands:

... Scale down the `kube-apiserver`:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/kube-apiserver --replicas=0
----

... Scale down the `openshift-apiserver`:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-apiserver --replicas=0
----

... Scale down the `openshift-oauth-apiserver`:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-oauth-apiserver --replicas=0
----
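+
The three scale commands differ only in the deployment name, so they can also be expressed as a loop. The following is a sketch under the assumption that `oc` is already logged in to the management cluster; the function name `scale_api_servers` is hypothetical.
+
```shell
# Hypothetical helper: scale the three API server deployments named in this
# procedure to the requested replica count. Stops at the first failure.
#   $1 = control plane namespace, $2 = replica count
scale_api_servers() {
  for deployment in kube-apiserver openshift-apiserver openshift-oauth-apiserver; do
    oc scale -n "$1" "deployment/${deployment}" --replicas="$2" || return 1
  done
}
```
+
For example, `scale_api_servers "${CONTROL_PLANE_NAMESPACE}" 0` issues the same three `oc scale` commands as the preceding steps.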

. Next, take a snapshot of etcd by using one of the following methods:

.. Use a previously backed-up snapshot of etcd.
.. If you have an available etcd pod, take a snapshot from the active etcd pod by completing the following steps:

... List etcd pods by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
----

... Take a snapshot of the pod database and save it inside the pod by entering the following commands:
+
[source,terminal]
----
$ ETCD_POD=etcd-0
----
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- env ETCDCTL_API=3 /usr/bin/etcdctl \
  --cacert /etc/etcd/tls/etcd-ca/ca.crt \
  --cert /etc/etcd/tls/client/etcd-client.crt \
  --key /etc/etcd/tls/client/etcd-client.key \
  --endpoints=https://localhost:2379 \
  snapshot save /var/lib/snapshot.db
----

... Verify that the snapshot is successful by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status /var/lib/snapshot.db
----

... Make a local copy of the snapshot by entering the following command:
+
[source,terminal]
----
$ oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/snapshot.db /tmp/etcd.snapshot.db
----

.. Make a copy of the snapshot database from etcd persistent storage:

... List etcd pods by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
----

... Find a pod that is running, set its name as the value of `ETCD_POD`, for example `ETCD_POD=etcd-0`, and then copy its snapshot database by entering the following command:
+
[source,terminal]
----
$ oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/data/member/snap/db /tmp/etcd.snapshot.db
----
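+
Because the steps that follow remove existing etcd data, it can help to confirm that the local snapshot file exists and is not empty before continuing. The following sketch only checks that the file is present and has a non-zero size; it does not validate the etcd database contents. The function name `snapshot_file_present` is hypothetical.
+
```shell
# Hypothetical helper: succeed only if the path is a regular file with a
# size greater than zero. This is a presence check, not an integrity check.
snapshot_file_present() {
  [ -f "$1" ] && [ -s "$1" ]
}
```
+
For example, `snapshot_file_present /tmp/etcd.snapshot.db || echo "snapshot missing or empty" >&2` prints a warning when the copy did not produce a usable file.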

. Next, scale down the etcd statefulset by entering the following command:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=0
----

.. Delete the volumes for the second and third members by entering the following command:
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pvc/data-etcd-2
----

.. Create a pod to access the first etcd member's data:

... Get the etcd image by entering the following command:
+
[source,terminal]
----
$ ETCD_IMAGE=$(oc get -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd -o jsonpath='{ .spec.template.spec.containers[0].image }')
----

... Create a pod that allows access to etcd data:
+
[source,yaml]
----
$ cat << EOF | oc apply -n ${CONTROL_PLANE_NAMESPACE} -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-data
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd-data
  template:
    metadata:
      labels:
        app: etcd-data
    spec:
      containers:
      - name: access
        image: $ETCD_IMAGE
        volumeMounts:
        - name: data
          mountPath: /var/lib
        command:
        - /usr/bin/bash
        args:
        - -c
        - |-
          while true; do
            sleep 1000
          done
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-etcd-0
EOF
----

... Check the status of the `etcd-data` pod and wait for it to be running by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd-data
----

... Get the name of the `etcd-data` pod by entering the following command:
+
[source,terminal]
----
$ DATA_POD=$(oc get -n ${CONTROL_PLANE_NAMESPACE} pods --no-headers -l app=etcd-data -o name | cut -d/ -f2)
----

.. Copy an etcd snapshot into the pod by entering the following command:
+
[source,terminal]
----
$ oc cp /tmp/etcd.snapshot.db ${CONTROL_PLANE_NAMESPACE}/${DATA_POD}:/var/lib/restored.snap.db
----

.. Remove old data from the `etcd-data` pod by entering the following commands:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm -rf /var/lib/data
----
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- mkdir -p /var/lib/data
----

.. Restore the etcd snapshot by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- etcdutl snapshot restore /var/lib/restored.snap.db \
  --data-dir=/var/lib/data --skip-hash-check \
  --name etcd-0 \
  --initial-cluster-token=etcd-cluster \
  --initial-cluster etcd-0=https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-1=https://etcd-1.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-2=https://etcd-2.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380 \
  --initial-advertise-peer-urls https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380
----
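+
The `--initial-cluster` value in the restore command follows a regular pattern, with one `name=peer-URL` entry per etcd member. The following sketch builds that string for a given namespace and member count. The function name `initial_cluster` is hypothetical; the URL pattern is copied from the restore command and is not otherwise verified here.
+
```shell
# Hypothetical helper: print the comma-separated --initial-cluster value for
# members etcd-0 through etcd-(N-1).
#   $1 = control plane namespace, $2 = number of etcd members (N)
initial_cluster() {
  i=0
  value=""
  while [ "$i" -lt "$2" ]; do
    peer="etcd-${i}=https://etcd-${i}.etcd-discovery.$1.svc:2380"
    if [ -z "$value" ]; then
      value="$peer"
    else
      value="${value},${peer}"
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$value"
}
```
+
For example, `initial_cluster "${CONTROL_PLANE_NAMESPACE}" 3` prints the same three-member value that is written out in full in the restore command.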

.. Remove the temporary etcd snapshot from the pod by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm /var/lib/restored.snap.db
----

.. Delete the data access deployment by entering the following command:
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} deployment/etcd-data
----

.. Scale up the etcd cluster by entering the following command:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=3
----

.. Wait for the etcd member pods to return and report as available by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd -w
----
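+
As an alternative to watching the pod list manually, `oc rollout status` can block until the stateful set reports its replicas as ready or a timeout expires. The following is a sketch, assuming that the etcd stateful set uses the `RollingUpdate` update strategy; the function name `wait_for_etcd_rollout` and the default timeout of `10m` are assumptions, not values from this procedure.
+
```shell
# Hypothetical helper: wait for the etcd stateful set rollout to complete.
#   $1 = control plane namespace, $2 = optional timeout (assumed default: 10m)
wait_for_etcd_rollout() {
  oc rollout status -n "$1" statefulset/etcd --timeout="${2:-10m}"
}
```
+
For example, `wait_for_etcd_rollout "${CONTROL_PLANE_NAMESPACE}" 15m` returns a non-zero status if the rollout does not complete within 15 minutes.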

.. Scale up all etcd-writer deployments by entering the following command:
+
[source,terminal]
----
$ oc scale deployment -n ${CONTROL_PLANE_NAMESPACE} --replicas=3 kube-apiserver openshift-apiserver openshift-oauth-apiserver
----

. Restore reconciliation of the hosted cluster by entering the following command:
+
[source,terminal]
----
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} -p '{"spec":{"pausedUntil":""}}' --type=merge
----
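+
To confirm that reconciliation is no longer paused, the `spec.pausedUntil` field can be read back from the `HostedCluster` resource. The following is a minimal sketch; the function name `hosted_cluster_paused_until` is hypothetical, and it assumes that an empty result means that the field is unset or empty.
+
```shell
# Hypothetical helper: print the current spec.pausedUntil value of a hosted
# cluster. Empty output is taken to mean that reconciliation is not paused.
#   $1 = hosted cluster namespace, $2 = cluster name
hosted_cluster_paused_until() {
  oc get -n "$1" "hostedclusters/$2" -o jsonpath='{.spec.pausedUntil}'
}
```
+
For example, `hosted_cluster_paused_until "${HOSTED_CLUSTER_NAMESPACE}" "${CLUSTER_NAME}"` is expected to print `true` while the cluster is paused and nothing after the patch in this step is applied.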
