Commit b73b9cc

[OSDOCS-6831]: Adding etcd recovery docs for hosted control planes
1 parent 1faca0b

File tree: 3 files changed, +116 −2 lines

hosted_control_planes/hcp-backup-restore-dr.adoc

Lines changed: 12 additions & 2 deletions
@@ -11,10 +11,20 @@ If you need to back up and restore etcd on a hosted cluster or provide disaster
 :FeatureName: Hosted control planes
 include::snippets/technology-preview.adoc[]
 
+[id="hosted-etcd-non-disruptive-recovery"]
+== Recovering etcd pods for hosted clusters
+
+In hosted clusters, etcd pods run as part of a stateful set. The stateful set relies on persistent storage to store etcd data for each member. In a highly available control plane, the stateful set contains three pods, and each member has its own persistent volume claim.
+
+include::modules/hosted-cluster-etcd-status.adoc[leveloffset=+2]
+include::modules/hosted-cluster-single-node-recovery.adoc[leveloffset=+2]
+
 [id="hcp-backup-restore"]
-== Backing up and restoring etcd on a hosted cluster
+== Backing up and restoring etcd on a hosted cluster in AWS
+
+If you use hosted control planes for {product-title}, the process to back up and restore etcd is different from xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[the usual etcd backup process].
 
-If you use hosted control planes on {product-title}, the process to back up and restore etcd is different from xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[the usual etcd backup process].
+The following procedures are specific to hosted control planes on AWS.
 
 // Backing up etcd on a hosted cluster
 include::modules/backup-etcd-hosted-cluster.adoc[leveloffset=+2]
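
The stateful set layout described in the new overview section can be inspected directly on the management cluster. The following commands are a minimal sketch, assuming the stateful set is named `etcd` (implied by the `etcd-0` through `etcd-2` pod names) and that you substitute your own control plane namespace for the placeholder:

[source,terminal]
----
$ oc get statefulset/etcd -n <control_plane_namespace>
$ oc get pvc -n <control_plane_namespace>
----

In a highly available control plane, expect three pods, `etcd-0` through `etcd-2`, and one `data-etcd-<index>` claim per member, matching the `pvc/data-etcd-2` name used in the recovery module below.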
modules/hosted-cluster-etcd-status.adoc

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
// Module included in the following assembly:
//
// * hcp-backup-restore-dr.adoc

:_content-type: PROCEDURE
[id="hosted-cluster-etcd-status_{context}"]
= Checking the status of a hosted cluster

To check the status of etcd in your hosted cluster, complete the following steps.

.Procedure

. Open a shell in the running etcd pod that you want to check by entering the following command:
+
[source,terminal]
----
$ oc rsh etcd-0
----

. Set up the etcdctl environment by entering the following commands:
+
[source,terminal]
----
sh-4.4$ export ETCDCTL_API=3
----
+
[source,terminal]
----
sh-4.4$ export ETCDCTL_CACERT=/etc/etcd/tls/etcd-ca/ca.crt
----
+
[source,terminal]
----
sh-4.4$ export ETCDCTL_CERT=/etc/etcd/tls/client/etcd-client.crt
----
+
[source,terminal]
----
sh-4.4$ export ETCDCTL_KEY=/etc/etcd/tls/client/etcd-client.key
----
+
[source,terminal]
----
sh-4.4$ export ETCDCTL_ENDPOINTS=https://etcd-client:2379
----

. Print the health of each cluster member's endpoint by entering the following command:
+
[source,terminal]
----
sh-4.4$ etcdctl endpoint health --cluster -w table
----
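
The four TLS and endpoint exports map one-to-one onto `etcdctl` flags, and `ETCDCTL_API=3` is already the default in recent `etcdctl` releases, so the same health check can be run without an interactive shell. This one-liner is a sketch, assuming the etcd container inside the pod is named `etcd`:

[source,terminal]
----
$ oc exec -n <control_plane_namespace> etcd-0 -c etcd -- \
    etcdctl --cacert /etc/etcd/tls/etcd-ca/ca.crt \
    --cert /etc/etcd/tls/client/etcd-client.crt \
    --key /etc/etcd/tls/client/etcd-client.key \
    --endpoints https://etcd-client:2379 \
    endpoint health --cluster -w table
----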
modules/hosted-cluster-single-node-recovery.adoc

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
// Module included in the following assembly:
//
// * hcp-backup-restore-dr.adoc

:_content-type: PROCEDURE
[id="hosted-cluster-single-node-recovery_{context}"]
= Recovering an etcd member for a hosted cluster

An etcd member of a three-node cluster might fail because of corrupted or missing data. To recover the etcd member, complete the following steps.

.Procedure

. Confirm that the etcd member is failing by entering the following command:
+
[source,terminal]
----
$ oc get pods -l app=etcd -n <control_plane_namespace>
----
+
If the etcd member is failing, the output resembles the following example:
+
.Example output
[source,terminal]
----
NAME     READY   STATUS             RESTARTS     AGE
etcd-0   2/2     Running            0            64m
etcd-1   2/2     Running            0            45m
etcd-2   1/2     CrashLoopBackOff   1 (5s ago)   64m
----

. Delete the persistent volume claim of the failing etcd member and the pod by entering the following command:
+
[source,terminal]
----
$ oc delete pvc/data-etcd-2 pod/etcd-2 --wait=false -n <control_plane_namespace>
----
+
The `--wait=false` flag returns immediately, because the claim cannot finish deleting until the pod that mounts it is removed.

. When the pod restarts, verify that the etcd member is added back to the etcd cluster and is functioning correctly by entering the following command:
+
[source,terminal]
----
$ oc get pods -l app=etcd -n <control_plane_namespace>
----
+
.Example output
[source,terminal]
----
NAME     READY   STATUS    RESTARTS   AGE
etcd-0   2/2     Running   0          67m
etcd-1   2/2     Running   0          48m
etcd-2   2/2     Running   0          2m2s
----
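
A `Running` status confirms that the container came back, but not that the member rejoined the cluster. As a follow-up check, the etcdctl environment from the status module can list the members; this is a sketch, assuming the `ETCDCTL_*` variables are already exported inside a healthy pod such as `etcd-0`:

[source,terminal]
----
sh-4.4$ etcdctl member list -w table
----

The recovered member should appear in the table with a `started` status alongside the other two members.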
