
Commit 7812f0e

Merge pull request #53792 from lahinson/hcp-etcd-backup
[OSDOCS-4273]: adding backup, restore, and dr tasks for hosted control planes
2 parents 036f7ff + ce63dcf commit 7812f0e

9 files changed: +859 -0 lines changed

_topic_maps/_topic_map.yml

Lines changed: 4 additions & 0 deletions
@@ -2553,6 +2553,8 @@ Topics:
     File: backing-up-etcd
   - Name: Replacing an unhealthy etcd member
     File: replacing-unhealthy-etcd-member
+  - Name: Backing up and restoring etcd on a hosted cluster
+    File: etcd-backup-restore-hosted-cluster
   - Name: Disaster recovery
     Dir: disaster_recovery
     Topics:
@@ -2562,6 +2564,8 @@ Topics:
     File: scenario-2-restoring-cluster-state
   - Name: Recovering from expired control plane certificates
     File: scenario-3-expired-certs
+  - Name: Disaster recovery for a hosted cluster within an AWS region
+    File: dr-hosted-cluster-within-aws-region
 ---
 Name: Migrating from version 3 to 4
 Dir: migrating_from_ocp_3_to_4
backup_and_restore/control_plane_backup_and_restore/disaster_recovery/dr-hosted-cluster-within-aws-region.adoc

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
:_content-type: ASSEMBLY
[id="dr-hosted-cluster-within-aws-region"]
= Disaster recovery for a hosted cluster within an AWS region
include::_attributes/common-attributes.adoc[]
:context: dr-hosted-cluster-within-aws-region

toc::[]

In a situation where you need disaster recovery (DR) for a hosted cluster, you can recover the hosted cluster to the same region within AWS. For example, you need DR when the upgrade of a management cluster fails and the hosted cluster is in a read-only state.

:FeatureName: Hosted control planes
include::snippets/technology-preview.adoc[]

The DR process involves three main steps:

. Backing up the hosted cluster on the source management cluster
. Restoring the hosted cluster on a destination management cluster
. Deleting the hosted cluster from the source management cluster

Your workloads remain running during the process. The Cluster API might be unavailable for a period, but that does not affect the services that are running on the worker nodes.

[IMPORTANT]
====
Both the source management cluster and the destination management cluster must have the `--external-dns` flags to maintain the API server URL, as shown in this example:

.Example: External DNS flags
[source,terminal]
----
--external-dns-provider=aws \
--external-dns-credentials=<AWS Credentials location> \
--external-dns-domain-filter=<DNS Base Domain>
----

That way, the server URL ends with `https://api-sample-hosted.sample-hosted.aws.openshift.com`.

If you do not include the `--external-dns` flags to maintain the API server URL, the hosted cluster cannot be migrated.
====

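For illustration only, the following sketch shows where these flags might be passed when you install the HyperShift Operator on each management cluster. It assumes the `hypershift` CLI and the `AWS_CREDS` and `EXTERNAL_DNS_DOMAIN` variables that are defined in the next section; adapt it to your own installation workflow.

.Example: Installing with external DNS flags (sketch)
[source,terminal]
----
# Run against each management cluster (source and destination).
# The external DNS flags keep the hosted API server URL stable across the migration.
hypershift install \
    --external-dns-provider=aws \
    --external-dns-credentials=${AWS_CREDS} \
    --external-dns-domain-filter=${EXTERNAL_DNS_DOMAIN}
----
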
[id="dr-hosted-cluster-env-context"]
== Example environment and context

Consider a scenario where you have three clusters to restore. Two are management clusters, and one is a hosted cluster. You can restore either the control plane only or the control plane and the nodes. Before you begin, you need the following information:

* Source MGMT Namespace: The source management namespace
* Source MGMT ClusterName: The source management cluster name
* Source MGMT Kubeconfig: The source management `kubeconfig` file
* Destination MGMT Kubeconfig: The destination management `kubeconfig` file
* HC Kubeconfig: The hosted cluster `kubeconfig` file
* SSH key file: The SSH public key
* Pull secret: The pull secret file to access the release images
* AWS credentials
* AWS region
* Base domain: The DNS base domain to use as an external DNS
* S3 bucket name: The bucket in the AWS region where you plan to upload the etcd backup

This information is shown in the following example environment variables.

.Example environment variables
[source,terminal]
----
SSH_KEY_FILE=${HOME}/.ssh/id_rsa.pub
BASE_PATH=${HOME}/hypershift
BASE_DOMAIN="aws.sample.com"
PULL_SECRET_FILE="${HOME}/pull_secret.json"
AWS_CREDS="${HOME}/.aws/credentials"
AWS_ZONE_ID="Z02718293M33QHDEQBROL"

CONTROL_PLANE_AVAILABILITY_POLICY=SingleReplica
HYPERSHIFT_PATH=${BASE_PATH}/src/hypershift
HYPERSHIFT_CLI=${HYPERSHIFT_PATH}/bin/hypershift
HYPERSHIFT_IMAGE=${HYPERSHIFT_IMAGE:-"quay.io/${USER}/hypershift:latest"}
NODE_POOL_REPLICAS=${NODE_POOL_REPLICAS:-2}

# MGMT Context
MGMT_REGION=us-west-1
MGMT_CLUSTER_NAME="${USER}-dev"
MGMT_CLUSTER_NS=${USER}
MGMT_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${MGMT_CLUSTER_NS}-${MGMT_CLUSTER_NAME}"
MGMT_KUBECONFIG="${MGMT_CLUSTER_DIR}/kubeconfig"

# MGMT2 Context
MGMT2_CLUSTER_NAME="${USER}-dest"
MGMT2_CLUSTER_NS=${USER}
MGMT2_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${MGMT2_CLUSTER_NS}-${MGMT2_CLUSTER_NAME}"
MGMT2_KUBECONFIG="${MGMT2_CLUSTER_DIR}/kubeconfig"

# Hosted Cluster Context
HC_CLUSTER_NS=clusters
HC_REGION=us-west-1
HC_CLUSTER_NAME="${USER}-hosted"
HC_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${HC_CLUSTER_NS}-${HC_CLUSTER_NAME}"
HC_KUBECONFIG="${HC_CLUSTER_DIR}/kubeconfig"
BACKUP_DIR=${HC_CLUSTER_DIR}/backup

BUCKET_NAME="${USER}-hosted-${MGMT_REGION}"

# DNS
AWS_ZONE_ID="Z07342811SH9AA102K1AC"
EXTERNAL_DNS_DOMAIN="hc.jpdv.aws.kerbeross.com"
----

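With these variables set, you might confirm that the source management cluster and the hosted cluster are reachable before you start the backup. The following check is only a sketch; `oc get hostedclusters` assumes that the HyperShift `HostedCluster` CRDs are installed on the management cluster.

.Example: Verifying access (sketch)
[source,terminal]
----
# Confirm access to the source management cluster and the hosted cluster.
$ export KUBECONFIG=${MGMT_KUBECONFIG}
$ oc get hostedclusters -n ${HC_CLUSTER_NS}

$ export KUBECONFIG=${HC_KUBECONFIG}
$ oc get nodes
----
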
[id="dr-hosted-cluster-process"]
== Overview of the backup and restore process

The backup and restore process works as follows:

. On management cluster 1, which you can think of as the source management cluster, the control plane and workers interact by using the external DNS API.

. You take a snapshot of the hosted cluster, which includes etcd, the control plane, and the worker nodes. The worker nodes are moved to the external DNS, the control plane is saved in a local manifest file, and etcd is backed up to an S3 bucket.

. On management cluster 2, which you can think of as the destination management cluster, you restore etcd from the S3 bucket and restore the control plane from the local manifest file.

. By using the external DNS API, the worker nodes are restored to management cluster 2.

. On management cluster 2, the control plane and worker nodes interact by using the external DNS API.

// When the updated diagram is available, I will add it here and update the first sentence in this section to read, "As shown in the following diagram, the backup and restore process works as follows:"

You can manually back up and restore your hosted cluster, or you can run a script to complete the process. For more information about the script, see "Running a script to back up and restore a hosted cluster".

// Backing up the hosted cluster
include::modules/dr-hosted-cluster-within-aws-region-backup.adoc[leveloffset=+1]

// Restoring the hosted cluster
include::modules/dr-hosted-cluster-within-aws-region-restore.adoc[leveloffset=+1]

// Deleting the hosted cluster
include::modules/dr-hosted-cluster-within-aws-region-delete.adoc[leveloffset=+1]

// Helper script
include::modules/dr-hosted-cluster-within-aws-region-script.adoc[leveloffset=+1]
backup_and_restore/control_plane_backup_and_restore/etcd-backup-restore-hosted-cluster.adoc

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
:_content-type: ASSEMBLY
[id="etcd-backup-restore-hosted-cluster"]
= Backing up and restoring etcd on a hosted cluster
include::_attributes/common-attributes.adoc[]
:context: etcd-backup-restore-hosted-cluster

toc::[]

If you use hosted control planes on {product-title}, the process to back up and restore etcd is different from xref:../../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[the usual etcd backup process].

:FeatureName: Hosted control planes
include::snippets/technology-preview.adoc[]

// Backing up etcd on a hosted cluster
include::modules/backup-etcd-hosted-cluster.adoc[leveloffset=+1]

// Restoring an etcd snapshot on a hosted cluster
include::modules/restoring-etcd-snapshot-hosted-cluster.adoc[leveloffset=+1]

[role="_additional-resources"]
[id="additional-resources_etcd-backup-restore-hosted-cluster"]
== Additional resources
* xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/dr-hosted-cluster-within-aws-region.adoc#dr-hosted-cluster-within-aws-region[Disaster recovery for a hosted cluster within an AWS region]
modules/backup-etcd-hosted-cluster.adoc

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
// Module included in the following assembly:
//
// * control_plane_backup_and_restore/etcd-backup-restore-hosted-cluster.adoc

:_content-type: PROCEDURE
[id="backup-etcd-hosted-cluster_{context}"]
= Taking a snapshot of etcd on a hosted cluster

As part of the process to back up etcd for a hosted cluster, you take a snapshot of etcd. After you take the snapshot, you can restore it, for example, as part of a disaster recovery operation.

[IMPORTANT]
====
This procedure requires API downtime.
====

.Procedure

. Pause reconciliation of the hosted cluster by entering this command:
+
[source,terminal]
----
$ oc patch -n clusters hostedclusters/${CLUSTER_NAME} -p '{"spec":{"pausedUntil":"'${PAUSED_UNTIL}'"}}' --type=merge
----
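+
This command and the rest of this procedure assume that `CLUSTER_NAME`, `HOSTED_CLUSTER_NAMESPACE`, and `PAUSED_UNTIL` are already set for your environment. The following values are only a sketch; here, `HOSTED_CLUSTER_NAMESPACE` refers to the namespace that holds the hosted control plane pods, which is typically `<hosted_cluster_namespace>-<cluster_name>`:
+
[source,terminal]
----
CLUSTER_NAME=my-hosted-cluster  # hypothetical cluster name
HOSTED_CLUSTER_NAMESPACE=clusters-${CLUSTER_NAME}  # hosted control plane namespace
PAUSED_UNTIL="true"  # pausedUntil accepts "true" or an RFC3339 date
----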

. Stop all etcd-writer deployments by entering this command:
+
[source,terminal]
----
$ oc scale deployment -n ${HOSTED_CLUSTER_NAMESPACE} --replicas=0 kube-apiserver openshift-apiserver openshift-oauth-apiserver
----
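+
Optionally, confirm that the API server deployments are scaled down before you take the snapshot. This check is a sketch and is not part of the documented procedure:
+
[source,terminal]
----
# All three deployments should report 0/0 ready replicas.
$ oc get deployment -n ${HOSTED_CLUSTER_NAMESPACE} kube-apiserver openshift-apiserver openshift-oauth-apiserver
----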

. Take an etcd snapshot by using the `exec` command in each etcd container:
+
[source,terminal]
----
$ oc exec -it etcd-0 -n ${HOSTED_CLUSTER_NAMESPACE} -- env ETCDCTL_API=3 /usr/bin/etcdctl --cacert /etc/etcd/tls/client/etcd-client-ca.crt --cert /etc/etcd/tls/client/etcd-client.crt --key /etc/etcd/tls/client/etcd-client.key --endpoints=localhost:2379 snapshot save /var/lib/data/snapshot.db
$ oc exec -it etcd-0 -n ${HOSTED_CLUSTER_NAMESPACE} -- env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status /var/lib/data/snapshot.db
----
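+
As an optional sanity check, assuming the etcd image includes standard shell utilities, you might verify that the snapshot file exists and has a nonzero size before you copy it out of the pod:
+
[source,terminal]
----
# /var/lib/data/snapshot.db is the path used in the previous step.
$ oc exec -it etcd-0 -n ${HOSTED_CLUSTER_NAMESPACE} -- ls -lh /var/lib/data/snapshot.db
----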

. Copy the snapshot data to a location where you can retrieve it later, such as an S3 bucket, as shown in the following example.
+
[NOTE]
====
The following example uses signature version 2. If you are in a region that supports signature version 4, such as the us-east-2 region, use signature version 4. If you use signature version 2 in such a region, the upload fails because signature version 2 is deprecated.
====
+
.Example
[source,terminal]
----
BUCKET_NAME=somebucket
FILEPATH="/${BUCKET_NAME}/${CLUSTER_NAME}-snapshot.db"
CONTENT_TYPE="application/x-compressed-tar"
DATE_VALUE=`date -R`
SIGNATURE_STRING="PUT\n\n${CONTENT_TYPE}\n${DATE_VALUE}\n${FILEPATH}"
ACCESS_KEY=accesskey
SECRET_KEY=secret
SIGNATURE_HASH=`echo -en ${SIGNATURE_STRING} | openssl sha1 -hmac ${SECRET_KEY} -binary | base64`

oc exec -it etcd-0 -n ${HOSTED_CLUSTER_NAMESPACE} -- curl -X PUT -T "/var/lib/data/snapshot.db" \
  -H "Host: ${BUCKET_NAME}.s3.amazonaws.com" \
  -H "Date: ${DATE_VALUE}" \
  -H "Content-Type: ${CONTENT_TYPE}" \
  -H "Authorization: AWS ${ACCESS_KEY}:${SIGNATURE_HASH}" \
  https://${BUCKET_NAME}.s3.amazonaws.com/${CLUSTER_NAME}-snapshot.db
----
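+
Alternatively, if the `aws` CLI is configured with credentials that can write to the bucket, you can copy the snapshot out of the pod and upload it with the CLI, which uses signature version 4 automatically. This alternative is only a sketch and assumes that the etcd image includes `tar`, which `oc cp` requires:
+
[source,terminal]
----
# Copy the snapshot out of the etcd pod, then upload it with the aws CLI.
$ oc cp ${HOSTED_CLUSTER_NAMESPACE}/etcd-0:/var/lib/data/snapshot.db ./snapshot.db
$ aws s3 cp ./snapshot.db s3://${BUCKET_NAME}/${CLUSTER_NAME}-snapshot.db
----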

. If you want to be able to restore the snapshot on a new cluster later, save the encryption secret that the hosted cluster references, as shown in this example:
+
.Example
[source,terminal]
----
oc get hostedcluster $CLUSTER_NAME -o=jsonpath='{.spec.secretEncryption.aescbc}'
{"activeKey":{"name":"CLUSTER_NAME-etcd-encryption-key"}}

# Save this secret, or the key it contains, so that the etcd data can be decrypted later
oc get secret ${CLUSTER_NAME}-etcd-encryption-key -o=jsonpath='{.data.key}'
----
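+
For example, you might redirect the key to a file that you store alongside the snapshot. The file name here is only a suggestion:
+
[source,terminal]
----
# Store the base64-encoded encryption key next to the etcd backup.
oc get secret ${CLUSTER_NAME}-etcd-encryption-key -o=jsonpath='{.data.key}' > ${CLUSTER_NAME}-etcd-encryption-key.b64
----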

.Next steps

Restore the etcd snapshot.
