
Commit 4f561ee

Merge pull request #20053 from bergerhoffer/OSDOCS-893-etcd-op

OSDOCS-893: Updating DR procedures for the etcd cluster operator

2 parents 972136c + a5bb050

6 files changed: +188 additions, -396 deletions


backup_and_restore/backing-up-etcd.adoc

Lines changed: 1 addition & 2 deletions

@@ -15,8 +15,7 @@ installation, otherwise the backup will contain expired certificates. It is also
 recommended to take etcd backups during non-peak usage hours, as it is a
 blocking action.
 
-Once you have an etcd backup, you can xref:../backup_and_restore/disaster_recovery/scenario-1-infra-recovery.adoc#dr-infrastructure-recovery[recover from lost master hosts]
-and xref:../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
+Once you have an etcd backup, you can xref:../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
 
 You can perform the xref:../backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[etcd data backup process]
 on any master host that has connectivity to the etcd cluster, where the proper

backup_and_restore/disaster_recovery/about-disaster-recovery.adoc

Lines changed: 4 additions & 13 deletions

@@ -11,11 +11,10 @@ how to recover from several disaster situations that might occur with their
 more of the following procedures in order to return your cluster to a working
 state.
 
-xref:../../backup_and_restore/disaster_recovery/scenario-1-infra-recovery.adoc#dr-infrastructure-recovery[Recovering from lost master hosts]::
-This solution handles situations where you have lost the majority of your master
-hosts, leading to etcd quorum loss and the cluster going offline. As long as you
-have taken an etcd backup and have at least one remaining healthy master host,
-you can follow this procedure to recover your cluster.
+xref:../../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]::
+This solution handles situations where you want to restore your cluster to
+a previous state, for example, if an administrator deletes something critical.
+This also includes situations where you have lost the majority of your master hosts, leading to etcd quorum loss and the cluster going offline. As long as you have taken an etcd backup, you can follow this procedure to restore your cluster to a previous state.
 +
 If applicable, you might also need to xref:../../backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
 +
@@ -24,14 +23,6 @@ If applicable, you might also need to xref:../../backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
 If you have a majority of your masters still available and have an etcd quorum, then follow the procedure to xref:../../backup_and_restore/replacing-failed-master.adoc#replacing-failed-master-host[replace a single failed master host].
 ====
 
-xref:../../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]::
-This solution handles situations where you want to restore your cluster to
-a previous state, for example, if an administrator deletes something critical.
-As long as you have taken an etcd backup, you can follow this procedure to
-restore your cluster to a previous state.
-+
-If applicable, you might also need to xref:../../backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
-
 xref:../../backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[Recovering from expired control plane certificates]::
 This solution handles situations where your control plane certificates have
 expired. For example, if you shut down your cluster before the first certificate
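
The procedures above split on whether etcd still has quorum. As a quick pre-check (not part of this patch), here is a minimal sketch of how you might assess control-plane health before choosing a path; it assumes `oc` access to a responsive API server and, on {product-title} 4.4 or later, the etcd cluster operator that this PR documents:

----
# Count the master nodes that are still Ready; etcd quorum requires a majority.
$ oc get nodes -l node-role.kubernetes.io/master

# On 4.4+, the etcd cluster operator reports Available/Degraded conditions.
$ oc get clusteroperator etcd
----

If a majority of masters are Ready and the operator is Available, replacing the single failed master is the lighter-weight path; otherwise, restore from the etcd backup.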

backup_and_restore/disaster_recovery/scenario-1-infra-recovery.adoc

Lines changed: 1 addition & 14 deletions

@@ -5,22 +5,9 @@ include::modules/common-attributes.adoc[]
 
 toc::[]
 
-This document describes the process to recover from a complete loss of a master host. This includes
-situations where a majority of master hosts have been lost, leading to etcd quorum loss and the cluster going offline. This procedure assumes that you have at least one healthy master host.
-
-At a high level, the procedure is to:
-
-. Restore etcd quorum on a remaining master host.
-. Create new master hosts.
-. Correct DNS and load balancer entries.
-. Grow etcd to full membership.
-
-If the majority of master hosts have been lost, you will need an xref:../../backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[etcd backup] to restore etcd quorum on the remaining master host.
+As of {product-title} 4.4, follow the procedure to xref:../../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state] in order to recover from lost master hosts.
 
 [NOTE]
 ====
 If you have a majority of your masters still available and have an etcd quorum, then follow the procedure to xref:../../backup_and_restore/replacing-failed-master.adoc#replacing-failed-master-host[replace a single failed master host].
 ====
-
-// Recovering from lost master hosts
-include::modules/dr-recover-lost-control-plane-hosts.adoc[leveloffset=+1]

modules/backup-etcd.adoc

Lines changed: 14 additions & 5 deletions

@@ -5,7 +5,7 @@
 [id="backing-up-etcd-data_{context}"]
 = Backing up etcd data
 
-Follow these steps to back up etcd data by creating an etcd snapshot and backing up static Kubernetes API server resources. This backup can be saved and used at a later time if you need to restore etcd.
+Follow these steps to back up etcd data by creating an etcd snapshot and backing up the resources for the static Pods. This backup can be saved and used at a later time if you need to restore etcd.
 
 You should only save a backup from a single master host. You do not need a backup from each master host in the cluster.
 
@@ -15,18 +15,27 @@ You should only save a backup from a single master host. You do not need a backup from each master host in the cluster.
 
 .Procedure
 
-. Access a master host as the root user.
+. Access a master host.
 
-. Run the `etcd-snapshot-backup.sh` script and pass in the location to save the backup to.
+. Run the `cluster-backup.sh` script and pass in the location to save the backup to.
 +
 ----
-$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup
+$ sudo /usr/local/bin/cluster-backup.sh ./assets/backup
+1bf371f1b5a483927cd01bb593b0e12cff406eb8d7d0acf4ab079c36a0abd3f7
+etcdctl version: 3.3.18
+API version: 3.3
+found latest kube-apiserver-pod: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7
+found latest kube-controller-manager-pod: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-8
+found latest kube-scheduler-pod: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
+found latest etcd-pod: /etc/kubernetes/static-pod-resources/etcd-pod-2
+Snapshot saved at /var/home/core/assets/backup/snapshot_2020-03-18_220218.db
+snapshot db and kube resources are successfully saved to /var/home/core/assets/backup
 ----
 +
 In this example, two files are created in the `./assets/backup/` directory on the master host:
 
 * `snapshot_<datetimestamp>.db`: This file is the etcd snapshot.
-* `static_kuberesources_<datetimestamp>.tar.gz`: This file contains the static Kubernetes API server resources. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.
+* `static_kuberesources_<datetimestamp>.tar.gz`: This file contains the resources for the static Pods. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.
 +
 [NOTE]
 ====
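
Outside the scope of this patch, a natural follow-up once `cluster-backup.sh` finishes is to sanity-check the two artifacts it wrote. A minimal sketch, assuming `etcdctl` (v3 API) and `tar` are available on the master host and the backup landed in `./assets/backup/`:

----
# Report the snapshot's hash, revision, and key count; an error here means the
# snapshot is unusable and the backup should be retaken.
$ sudo ETCDCTL_API=3 etcdctl --write-out=table snapshot status ./assets/backup/snapshot_<datetimestamp>.db

# List the static Pod resources bundled alongside the snapshot.
$ tar -tzf ./assets/backup/static_kuberesources_<datetimestamp>.tar.gz
----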
