
Commit 4f561ee

Merge pull request #20053 from bergerhoffer/OSDOCS-893-etcd-op

OSDOCS-893: Updating DR procedures for the etcd cluster operator

2 parents 972136c + a5bb050

6 files changed: +188 additions, -396 deletions


backup_and_restore/backing-up-etcd.adoc

Lines changed: 1 addition & 2 deletions

@@ -15,8 +15,7 @@ installation, otherwise the backup will contain expired certificates. It is also
 recommended to take etcd backups during non-peak usage hours, as it is a
 blocking action.
 
-Once you have an etcd backup, you can xref:../backup_and_restore/disaster_recovery/scenario-1-infra-recovery.adoc#dr-infrastructure-recovery[recover from lost master hosts]
-and xref:../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
+Once you have an etcd backup, you can xref:../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
 
 You can perform the xref:../backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[etcd data backup process]
 on any master host that has connectivity to the etcd cluster, where the proper

backup_and_restore/disaster_recovery/about-disaster-recovery.adoc

Lines changed: 4 additions & 13 deletions

@@ -11,11 +11,10 @@ how to recover from several disaster situations that might occur with their
 more of the following procedures in order to return your cluster to a working
 state.
 
-xref:../../backup_and_restore/disaster_recovery/scenario-1-infra-recovery.adoc#dr-infrastructure-recovery[Recovering from lost master hosts]::
-This solution handles situations where you have lost the majority of your master
-hosts, leading to etcd quorum loss and the cluster going offline. As long as you
-have taken an etcd backup and have at least one remaining healthy master host,
-you can follow this procedure to recover your cluster.
+xref:../../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]::
+This solution handles situations where you want to restore your cluster to
+a previous state, for example, if an administrator deletes something critical.
+This also includes situations where you have lost the majority of your master hosts, leading to etcd quorum loss and the cluster going offline. As long as you have taken an etcd backup, you can follow this procedure to restore your cluster to a previous state.
 +
 If applicable, you might also need to xref:../../backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
 +
@@ -24,14 +23,6 @@ If applicable, you might also need to xref:../../backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
 If you have a majority of your masters still available and have an etcd quorum, then follow the procedure to xref:../../backup_and_restore/replacing-failed-master.adoc#replacing-failed-master-host[replace a single failed master host].
 ====
 
-xref:../../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]::
-This solution handles situations where you want to restore your cluster to
-a previous state, for example, if an administrator deletes something critical.
-As long as you have taken an etcd backup, you can follow this procedure to
-restore your cluster to a previous state.
-+
-If applicable, you might also need to xref:../../backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
-
 xref:../../backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[Recovering from expired control plane certificates]::
 This solution handles situations where your control plane certificates have
 expired. For example, if you shut down your cluster before the first certificate
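
The procedures above split on whether etcd still has quorum. As a quick pre-check (not part of this patch), here is a minimal sketch of how you might assess control-plane health before choosing a path; it assumes `oc` access to a responsive API server and, on {product-title} 4.4 or later, the etcd cluster operator that this PR documents:

----
# Count the master nodes that are still Ready; etcd quorum requires a majority.
$ oc get nodes -l node-role.kubernetes.io/master

# On 4.4+, the etcd cluster operator reports Available/Degraded conditions.
$ oc get clusteroperator etcd
----

If a majority of masters are Ready and the operator is Available, replacing the single failed master is the lighter-weight path; otherwise, restore from the etcd backup.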

backup_and_restore/disaster_recovery/scenario-1-infra-recovery.adoc

Lines changed: 1 addition & 14 deletions

@@ -5,22 +5,9 @@ include::modules/common-attributes.adoc[]
 
 toc::[]
 
-This document describes the process to recover from a complete loss of a master host. This includes
-situations where a majority of master hosts have been lost, leading to etcd quorum loss and the cluster going offline. This procedure assumes that you have at least one healthy master host.
-
-At a high level, the procedure is to:
-
-. Restore etcd quorum on a remaining master host.
-. Create new master hosts.
-. Correct DNS and load balancer entries.
-. Grow etcd to full membership.
-
-If the majority of master hosts have been lost, you will need an xref:../../backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[etcd backup] to restore etcd quorum on the remaining master host.
+As of {product-title} 4.4, follow the procedure to xref:../../backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state] in order to recover from lost master hosts.
 
 [NOTE]
 ====
 If you have a majority of your masters still available and have an etcd quorum, then follow the procedure to xref:../../backup_and_restore/replacing-failed-master.adoc#replacing-failed-master-host[replace a single failed master host].
 ====
-
-// Recovering from lost master hosts
-include::modules/dr-recover-lost-control-plane-hosts.adoc[leveloffset=+1]

modules/backup-etcd.adoc

Lines changed: 14 additions & 5 deletions

@@ -5,7 +5,7 @@
 [id="backing-up-etcd-data_{context}"]
 = Backing up etcd data
 
-Follow these steps to back up etcd data by creating an etcd snapshot and backing up static Kubernetes API server resources. This backup can be saved and used at a later time if you need to restore etcd.
+Follow these steps to back up etcd data by creating an etcd snapshot and backing up the resources for the static Pods. This backup can be saved and used at a later time if you need to restore etcd.
 
 You should only save a backup from a single master host. You do not need a backup from each master host in the cluster.
 
@@ -15,18 +15,27 @@ You should only save a backup from a single master host. You do not need a backup from each master host in the cluster.
 
 .Procedure
 
-. Access a master host as the root user.
+. Access a master host.
 
-. Run the `etcd-snapshot-backup.sh` script and pass in the location to save the backup to.
+. Run the `cluster-backup.sh` script and pass in the location to save the backup to.
 +
 ----
-$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup
+$ sudo /usr/local/bin/cluster-backup.sh ./assets/backup
+1bf371f1b5a483927cd01bb593b0e12cff406eb8d7d0acf4ab079c36a0abd3f7
+etcdctl version: 3.3.18
+API version: 3.3
+found latest kube-apiserver-pod: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7
+found latest kube-controller-manager-pod: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-8
+found latest kube-scheduler-pod: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
+found latest etcd-pod: /etc/kubernetes/static-pod-resources/etcd-pod-2
+Snapshot saved at /var/home/core/assets/backup/snapshot_2020-03-18_220218.db
+snapshot db and kube resources are successfully saved to /var/home/core/assets/backup
 ----
 +
 In this example, two files are created in the `./assets/backup/` directory on the master host:
 
 * `snapshot_<datetimestamp>.db`: This file is the etcd snapshot.
-* `static_kuberesources_<datetimestamp>.tar.gz`: This file contains the static Kubernetes API server resources. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.
+* `static_kuberesources_<datetimestamp>.tar.gz`: This file contains the resources for the static Pods. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.
 +
 [NOTE]
 ====
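
Outside the scope of this patch, a natural follow-up once `cluster-backup.sh` finishes is to sanity-check the two artifacts it wrote. A minimal sketch, assuming `etcdctl` (v3 API) and `tar` are available on the master host and the backup landed in `./assets/backup/`:

----
# Report the snapshot's hash, revision, and key count; an error here means the
# snapshot is unusable and the backup should be retaken.
$ sudo ETCDCTL_API=3 etcdctl --write-out=table snapshot status ./assets/backup/snapshot_<datetimestamp>.db

# List the static Pod resources bundled alongside the snapshot.
$ tar -tzf ./assets/backup/static_kuberesources_<datetimestamp>.tar.gz
----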
