
Commit ce63dcf

adding backup, restore, and dr tasks for hosted control planes
1 parent 14bbab4 commit ce63dcf

9 files changed: +859 -0 lines changed

_topic_maps/_topic_map.yml

Lines changed: 4 additions & 0 deletions

@@ -2536,6 +2536,8 @@ Topics:
   File: backing-up-etcd
 - Name: Replacing an unhealthy etcd member
   File: replacing-unhealthy-etcd-member
+- Name: Backing up and restoring etcd on a hosted cluster
+  File: etcd-backup-restore-hosted-cluster
 - Name: Disaster recovery
   Dir: disaster_recovery
   Topics:
@@ -2545,6 +2547,8 @@ Topics:
   File: scenario-2-restoring-cluster-state
 - Name: Recovering from expired control plane certificates
   File: scenario-3-expired-certs
+- Name: Disaster recovery for a hosted cluster within an AWS region
+  File: dr-hosted-cluster-within-aws-region
 ---
 Name: Migrating from version 3 to 4
 Dir: migrating_from_ocp_3_to_4

Lines changed: 131 additions & 0 deletions

@@ -0,0 +1,131 @@
:_content-type: ASSEMBLY
[id="dr-hosted-cluster-within-aws-region"]
= Disaster recovery for a hosted cluster within an AWS region
include::_attributes/common-attributes.adoc[]
:context: dr-hosted-cluster-within-aws-region

toc::[]

In a situation where you need disaster recovery (DR) for a hosted cluster, you can recover the hosted cluster to the same region within AWS. For example, you need DR when the upgrade of a management cluster fails and the hosted cluster is in a read-only state.

:FeatureName: Hosted control planes
include::snippets/technology-preview.adoc[]

The DR process involves three main steps:

. Backing up the hosted cluster on the source management cluster
. Restoring the hosted cluster on a destination management cluster
. Deleting the hosted cluster from the source management cluster

Your workloads remain running during the process. The Cluster API might be unavailable for a period, but that does not affect the services that are running on the worker nodes.

[IMPORTANT]
====
Both the source management cluster and the destination management cluster must have the `--external-dns` flags to maintain the API server URL, as shown in this example:

.Example: External DNS flags
[source,terminal]
----
--external-dns-provider=aws \
--external-dns-credentials=<AWS Credentials location> \
--external-dns-domain-filter=<DNS Base Domain>
----

That way, the server URL ends with `https://api-sample-hosted.sample-hosted.aws.openshift.com`.

If you do not include the `--external-dns` flags to maintain the API server URL, the hosted cluster cannot be migrated.
====
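
For reference, here is a minimal sketch of how those flags might be supplied when you install the HyperShift operator on a management cluster. The `hypershift install` invocation and the placeholder values are assumptions for illustration, not part of this commit:

[source,terminal]
----
# Assumption: the external DNS flags are passed at HyperShift operator install time.
# Replace the credentials path and the domain filter with your own values.
$ hypershift install \
    --external-dns-provider=aws \
    --external-dns-credentials=${HOME}/.aws/credentials \
    --external-dns-domain-filter=hc.example.aws.com
----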

[id="dr-hosted-cluster-env-context"]
== Example environment and context

Consider a scenario where you have three clusters to restore. Two are management clusters, and one is a hosted cluster. You can restore either the control plane only or the control plane and the nodes. Before you begin, you need the following information:

* Source MGMT Namespace: The source management namespace
* Source MGMT ClusterName: The source management cluster name
* Source MGMT Kubeconfig: The source management `kubeconfig` file
* Destination MGMT Kubeconfig: The destination management `kubeconfig` file
* HC Kubeconfig: The hosted cluster `kubeconfig` file
* SSH key file: The SSH public key
* Pull secret: The pull secret file to access the release images
* AWS credentials
* AWS region
* Base domain: The DNS base domain to use as an external DNS
* S3 bucket name: The bucket in the AWS region where you plan to upload the etcd backup

This information is shown in the following example environment variables.

.Example environment variables
[source,terminal]
----
SSH_KEY_FILE=${HOME}/.ssh/id_rsa.pub
BASE_PATH=${HOME}/hypershift
BASE_DOMAIN="aws.sample.com"
PULL_SECRET_FILE="${HOME}/pull_secret.json"
AWS_CREDS="${HOME}/.aws/credentials"
AWS_ZONE_ID="Z02718293M33QHDEQBROL"

CONTROL_PLANE_AVAILABILITY_POLICY=SingleReplica
HYPERSHIFT_PATH=${BASE_PATH}/src/hypershift
HYPERSHIFT_CLI=${HYPERSHIFT_PATH}/bin/hypershift
HYPERSHIFT_IMAGE=${HYPERSHIFT_IMAGE:-"quay.io/${USER}/hypershift:latest"}
NODE_POOL_REPLICAS=${NODE_POOL_REPLICAS:-2}

# MGMT Context
MGMT_REGION=us-west-1
MGMT_CLUSTER_NAME="${USER}-dev"
MGMT_CLUSTER_NS=${USER}
MGMT_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${MGMT_CLUSTER_NS}-${MGMT_CLUSTER_NAME}"
MGMT_KUBECONFIG="${MGMT_CLUSTER_DIR}/kubeconfig"

# MGMT2 Context
MGMT2_CLUSTER_NAME="${USER}-dest"
MGMT2_CLUSTER_NS=${USER}
MGMT2_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${MGMT2_CLUSTER_NS}-${MGMT2_CLUSTER_NAME}"
MGMT2_KUBECONFIG="${MGMT2_CLUSTER_DIR}/kubeconfig"

# Hosted Cluster Context
HC_CLUSTER_NS=clusters
HC_REGION=us-west-1
HC_CLUSTER_NAME="${USER}-hosted"
HC_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${HC_CLUSTER_NS}-${HC_CLUSTER_NAME}"
HC_KUBECONFIG="${HC_CLUSTER_DIR}/kubeconfig"
BACKUP_DIR=${HC_CLUSTER_DIR}/backup

BUCKET_NAME="${USER}-hosted-${MGMT_REGION}"

# DNS
AWS_ZONE_ID="Z07342811SH9AA102K1AC"
EXTERNAL_DNS_DOMAIN="hc.jpdv.aws.kerbeross.com"
----
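
If you keep these variables in a file, you can load them into your shell before you run the backup and restore commands. This is a minimal usage sketch; the `env_vars` file name is an assumption:

[source,terminal]
----
# Load the example environment variables into the current shell session.
$ source ./env_vars

# Confirm that a value is set, for example the hosted cluster name.
$ echo ${HC_CLUSTER_NAME}
----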

[id="dr-hosted-cluster-process"]
== Overview of the backup and restore process

The backup and restore process works as follows:

. On management cluster 1, which you can think of as the source management cluster, the control plane and workers interact by using the external DNS API.

. You take a snapshot of the hosted cluster, which includes etcd, the control plane, and the worker nodes. The worker nodes are moved to the external DNS, the control plane is saved in a local manifest file, and etcd is backed up to an S3 bucket.

. On management cluster 2, which you can think of as the destination management cluster, you restore etcd from the S3 bucket and restore the control plane from the local manifest file.

. By using the external DNS API, the worker nodes are restored to management cluster 2.

. On management cluster 2, the control plane and worker nodes interact by using the external DNS API.

// When the updated diagram is available, I will add it here and update the first sentence in this section to read, "As shown in the following diagram, the backup and restore process works as follows:"

You can manually back up and restore your hosted cluster, or you can run a script to complete the process. For more information about the script, see "Running a script to back up and restore a hosted cluster".
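
After the restore completes on the destination management cluster, you might verify the outcome before you delete the hosted cluster from the source management cluster. The following commands are a hedged verification sketch that reuses the environment variables from the previous section; they are not part of the documented procedure:

[source,terminal]
----
# List the hosted cluster and its node pools on the destination management cluster.
$ oc get hostedclusters -n ${HC_CLUSTER_NS} --kubeconfig=${MGMT2_KUBECONFIG}
$ oc get nodepools -n ${HC_CLUSTER_NS} --kubeconfig=${MGMT2_KUBECONFIG}

# Confirm that the worker nodes are reachable through the hosted cluster API.
$ oc get nodes --kubeconfig=${HC_KUBECONFIG}
----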

// Backing up the hosted cluster
include::modules/dr-hosted-cluster-within-aws-region-backup.adoc[leveloffset=+1]

// Restoring the hosted cluster
include::modules/dr-hosted-cluster-within-aws-region-restore.adoc[leveloffset=+1]

// Deleting the hosted cluster
include::modules/dr-hosted-cluster-within-aws-region-delete.adoc[leveloffset=+1]

// Helper script
include::modules/dr-hosted-cluster-within-aws-region-script.adoc[leveloffset=+1]

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
:_content-type: ASSEMBLY
[id="etcd-backup-restore-hosted-cluster"]
= Backing up and restoring etcd on a hosted cluster
include::_attributes/common-attributes.adoc[]
:context: etcd-backup-restore-hosted-cluster

toc::[]

If you use hosted control planes on {product-title}, the process to back up and restore etcd is different from xref:../../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[the usual etcd backup process].

:FeatureName: Hosted control planes
include::snippets/technology-preview.adoc[]

// Backing up etcd on a hosted cluster
include::modules/backup-etcd-hosted-cluster.adoc[leveloffset=+1]

// Restoring an etcd snapshot on a hosted cluster
include::modules/restoring-etcd-snapshot-hosted-cluster.adoc[leveloffset=+1]

[role="_additional-resources"]
[id="additional-resources_etcd-backup-restore-hosted-cluster"]
== Additional resources
* xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/dr-hosted-cluster-within-aws-region.adoc#dr-hosted-cluster-within-aws-region[Disaster recovery for a hosted cluster within an AWS region]

Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
// Module included in the following assembly:
//
// * control_plane_backup_and_restore/etcd-backup-restore-hosted-cluster.adoc

:_content-type: PROCEDURE
[id="backup-etcd-hosted-cluster_{context}"]
= Taking a snapshot of etcd on a hosted cluster

As part of the process to back up etcd for a hosted cluster, you take a snapshot of etcd. After you take the snapshot, you can restore it, for example, as part of a disaster recovery operation.

[IMPORTANT]
====
This procedure requires API downtime.
====

.Procedure

. Pause reconciliation of the hosted cluster by entering this command:
+
[source,terminal]
----
$ oc patch -n clusters hostedclusters/${CLUSTER_NAME} -p '{"spec":{"pausedUntil":"'${PAUSED_UNTIL}'"}}' --type=merge
----
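+
For example, you might set the variables that this command expects before you run it. This is a minimal sketch under the assumption that `pausedUntil` accepts either an RFC3339 timestamp or the string `"true"`; the values shown are illustrative only:
+
[source,terminal]
----
# Name of the hosted cluster, as listed in the clusters namespace.
CLUSTER_NAME=my-hosted-cluster

# Pause reconciliation until this date, or use "true" to pause indefinitely.
PAUSED_UNTIL="2022-12-01T00:00:00Z"
----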

. Stop all etcd-writer deployments by entering this command:
+
[source,terminal]
----
$ oc scale deployment -n ${HOSTED_CLUSTER_NAMESPACE} --replicas=0 kube-apiserver openshift-apiserver openshift-oauth-apiserver
----
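+
In this command, `${HOSTED_CLUSTER_NAMESPACE}` refers to the namespace on the management cluster that holds the hosted control plane. The following lines are a hedged sketch of setting it, assuming the common `<hosted cluster namespace>-<cluster name>` naming convention:
+
[source,terminal]
----
# Assumption: the hosted control plane runs in the clusters-<cluster name> namespace.
HOSTED_CLUSTER_NAMESPACE=clusters-${CLUSTER_NAME}

# Verify the namespace by listing its pods, which include the etcd and API server pods.
$ oc get pods -n ${HOSTED_CLUSTER_NAMESPACE}
----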

. Take an etcd snapshot by using the `exec` command in each etcd container:
+
[source,terminal]
----
$ oc exec -it etcd-0 -n ${HOSTED_CLUSTER_NAMESPACE} -- env ETCDCTL_API=3 /usr/bin/etcdctl --cacert /etc/etcd/tls/client/etcd-client-ca.crt --cert /etc/etcd/tls/client/etcd-client.crt --key /etc/etcd/tls/client/etcd-client.key --endpoints=localhost:2379 snapshot save /var/lib/data/snapshot.db
$ oc exec -it etcd-0 -n ${HOSTED_CLUSTER_NAMESPACE} -- env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status /var/lib/data/snapshot.db
----
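+
The second command prints a summary table for the saved snapshot. The following output is hypothetical and only illustrates the shape of the table:
+
.Example output
[source,terminal]
----
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 92b8c6c4 |    12345 |       8001 |      25 MB |
+----------+----------+------------+------------+
----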

. Copy the snapshot data to a location where you can retrieve it later, such as an S3 bucket, as shown in the following example.
+
[NOTE]
====
The following example uses signature version 2. If you are in a region that supports signature version 4, such as the us-east-2 region, use signature version 4. If you use signature version 2 in a region that requires signature version 4, the upload fails. In addition, signature version 2 is deprecated.
====
+
.Example
[source,terminal]
----
BUCKET_NAME=somebucket
FILEPATH="/${BUCKET_NAME}/${CLUSTER_NAME}-snapshot.db"
CONTENT_TYPE="application/x-compressed-tar"
DATE_VALUE=`date -R`
SIGNATURE_STRING="PUT\n\n${CONTENT_TYPE}\n${DATE_VALUE}\n${FILEPATH}"
ACCESS_KEY=accesskey
SECRET_KEY=secret
SIGNATURE_HASH=`echo -en ${SIGNATURE_STRING} | openssl sha1 -hmac ${SECRET_KEY} -binary | base64`

oc exec -it etcd-0 -n ${HOSTED_CLUSTER_NAMESPACE} -- curl -X PUT -T "/var/lib/data/snapshot.db" \
  -H "Host: ${BUCKET_NAME}.s3.amazonaws.com" \
  -H "Date: ${DATE_VALUE}" \
  -H "Content-Type: ${CONTENT_TYPE}" \
  -H "Authorization: AWS ${ACCESS_KEY}:${SIGNATURE_HASH}" \
  https://${BUCKET_NAME}.s3.amazonaws.com/${CLUSTER_NAME}-snapshot.db
----
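+
If your region requires signature version 4, one alternative is to copy the snapshot out of the etcd pod and upload it with the AWS CLI, which signs requests with signature version 4 by default. This is a hedged sketch, not part of the documented procedure; it assumes that the AWS CLI is installed and configured and that the etcd container in the pod is named `etcd`:
+
[source,terminal]
----
# Copy the snapshot from the etcd pod to the local machine.
$ oc cp -c etcd ${HOSTED_CLUSTER_NAMESPACE}/etcd-0:/var/lib/data/snapshot.db ./snapshot.db

# Upload the snapshot to the S3 bucket by using signature version 4.
$ aws s3 cp ./snapshot.db s3://${BUCKET_NAME}/${CLUSTER_NAME}-snapshot.db
----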

. If you want to be able to restore the snapshot on a new cluster later, save the encryption secret that the hosted cluster references, as shown in this example:
+
.Example
[source,terminal]
----
oc get hostedcluster $CLUSTER_NAME -o=jsonpath='{.spec.secretEncryption.aescbc}'
{"activeKey":{"name":"CLUSTER_NAME-etcd-encryption-key"}}

# Save this secret, or the key that it contains, so that the etcd data can be decrypted later
oc get secret ${CLUSTER_NAME}-etcd-encryption-key -o=jsonpath='{.data.key}'
----
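+
For example, you might save the secret as a manifest so that you can recreate it on the destination cluster before you restore the snapshot. This is a usage sketch; the output file name is an assumption:
+
[source,terminal]
----
# Save the encryption secret so that it can be recreated alongside the restored etcd data.
$ oc get secret ${CLUSTER_NAME}-etcd-encryption-key -o yaml > ${CLUSTER_NAME}-etcd-encryption-key.yaml
----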

.Next steps

Restore the etcd snapshot.
