80 changes: 73 additions & 7 deletions modules/dr-restoring-cluster-state.adoc
@@ -514,6 +514,79 @@ $ oc -n openshift-ovn-kubernetes get pod -l app=ovnkube-node --field-selector=sp
It might take several minutes for the pods to restart.
====

. Delete and re-create the other non-recovery control plane machines, one by one. After the machines are re-created, a new revision is forced and `etcd` automatically scales up.
+
** If you use a user-provisioned bare metal installation, you can re-create a control plane machine by using the same method that you used to originally create it. For more information, see "Installing a user-provisioned cluster on bare metal".
+
[WARNING]
====
Do not delete and re-create the machine for the recovery host.
====
+
** If you are running installer-provisioned infrastructure, or you used the Machine API to create your machines, follow these steps:
+
[WARNING]
====
Do not delete and re-create the machine for the recovery host.

For bare metal installations on installer-provisioned infrastructure, control plane machines are not re-created. For more information, see "Replacing a bare-metal control plane node".
====
.. Obtain the machine for one of the lost control plane hosts.
+
In a terminal that has access to the cluster as a cluster-admin user, run the following command:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide
----
+
.Example output
[source,terminal]
----
NAME                                        PHASE     TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
clustername-8qw5l-master-0                  Running   m4.xlarge   us-east-1   us-east-1a   3h37m   ip-10-0-131-183.ec2.internal   aws:///us-east-1a/i-0ec2782f8287dfb7e   stopped <1>
clustername-8qw5l-master-1                  Running   m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-143-125.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
clustername-8qw5l-master-2                  Running   m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-154-194.ec2.internal   aws:///us-east-1c/i-02626f1dba9ed5bba   running
clustername-8qw5l-worker-us-east-1a-wbtgd   Running   m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
clustername-8qw5l-worker-us-east-1b-lrdxb   Running   m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
clustername-8qw5l-worker-us-east-1c-pkg26   Running   m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
----
<1> This is the control plane machine for the lost control plane host, `ip-10-0-131-183.ec2.internal`.

.. Delete the machine of the lost control plane host by running the following command:
+
[source,terminal]
----
$ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>
----
<1> Specify the name of the control plane machine for the lost control plane host.
+
After you delete the machine of the lost control plane host, a new machine is automatically provisioned.

.. Verify that a new machine has been created by running the following command:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide
----
+
.Example output
[source,terminal]
----
NAME                                        PHASE          TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
clustername-8qw5l-master-1                  Running        m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-143-125.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
clustername-8qw5l-master-2                  Running        m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-154-194.ec2.internal   aws:///us-east-1c/i-02626f1dba9ed5bba   running
clustername-8qw5l-master-3                  Provisioning   m4.xlarge   us-east-1   us-east-1a   85s     ip-10-0-173-171.ec2.internal   aws:///us-east-1a/i-015b0888fe17bc2c8   running <1>
clustername-8qw5l-worker-us-east-1a-wbtgd   Running        m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
clustername-8qw5l-worker-us-east-1b-lrdxb   Running        m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
clustername-8qw5l-worker-us-east-1c-pkg26   Running        m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
----
<1> The new machine, `clustername-8qw5l-master-3`, is being created and is ready after the phase changes from `Provisioning` to `Running`.
+
It might take a few minutes for the new machine to be created. The `etcd` cluster Operator automatically syncs when the machine or node returns to a healthy state.
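+
If you prefer not to rerun the command by hand, the wait can be scripted. The following is a minimal sketch, not part of the documented procedure: `wait_for_machine_running` is a hypothetical helper name, the timeout and interval defaults are arbitrary, and it assumes that the `PHASE` column corresponds to the `.status.phase` field of the machine resource.
+
```shell
# Hypothetical helper (not part of the product): poll a machine until its phase is "Running".
# $1: machine name; $2: timeout in seconds (default 600); $3: poll interval in seconds (default 15).
wait_for_machine_running() {
  local name="$1"
  local timeout="${2:-600}"
  local interval="${3:-15}"
  local elapsed=0
  local phase

  while [ "$elapsed" -lt "$timeout" ]; do
    # Assumption: .status.phase holds the value shown in the PHASE column.
    phase="$(oc get machine "$name" -n openshift-machine-api \
      -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Running" ]; then
      echo "machine $name is Running"
      return 0
    fi
    echo "machine $name phase: ${phase:-unknown}; retrying in ${interval}s"
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "timed out waiting for machine $name" >&2
  return 1
}
```
+
For example, `wait_for_machine_running clustername-8qw5l-master-3` returns once the new machine from the example output reports `Running`, or fails after the timeout.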

.. Repeat these steps for each lost control plane host that is not the recovery host.
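+
When several hosts were lost, it can help to list the candidate machines before you repeat the sub-steps. The following sketch filters the `oc get machines ... -o wide` output shown earlier. It rests on assumptions that might not hold in your cluster: control plane machine names contain `-master-`, `STATE` is the last column, and any state other than `running` marks a lost host. The function name `list_lost_control_plane_machines` is hypothetical. Review the result before you delete anything.
+
```shell
# Hypothetical helper: print control plane machines whose STATE is not "running".
# Reads `oc get machines -n openshift-machine-api -o wide` output on stdin.
# $1: name of the recovery host's machine, which must never be deleted.
list_lost_control_plane_machines() {
  awk -v skip="$1" '
    NR == 1          { next }      # skip the header row
    $1 !~ /-master-/ { next }      # assumption: control plane names contain "-master-"
    $1 == skip       { next }      # never report the machine of the recovery host
    $NF != "running" { print $1 }  # assumption: STATE is the last column
  '
}
```
+
A possible invocation is `oc get machines -n openshift-machine-api -o wide | list_lost_control_plane_machines <recovery_machine_name>`, where `<recovery_machine_name>` is a placeholder for the machine of your recovery host. Delete and re-create the listed machines one at a time, and wait for each replacement to reach `Running` before you continue.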

. Turn off the quorum guard by running the following command:
+
[source,terminal]
@@ -657,13 +730,6 @@ AllNodesAtLatestRevision
+
If the output includes multiple revision numbers, such as `2 nodes are at revision 6; 1 nodes are at revision 7`, this means that the update is still in progress. Wait a few minutes and try again.
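+
The "multiple revision numbers" check can be automated when you script the wait. The following is a rough sketch under stated assumptions: `single_revision` is a hypothetical helper, it reads the status text from the preceding command on standard input, and it assumes that revisions always appear in the form `revision <number>`, as in the example message above.
+
```shell
# Hypothetical helper: succeed only if the status text on stdin mentions
# exactly one distinct revision number.
single_revision() {
  local count
  # Extract every "revision N" token, deduplicate, and count what remains.
  count="$(grep -oE 'revision [0-9]+' | sort -u | wc -l | tr -d ' ')"
  [ "$count" -eq 1 ]
}
```
+
Piping the status output into `single_revision` returns a nonzero exit code while nodes are still split across revisions, so a script can sleep and retry until the check succeeds.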

. If the `keepalived` daemon is in use, restore the configuration on the control plane nodes other than the recovery host by running the following command. Otherwise, the network operator will not advance beyond the "Progressing" state.
+
[source,terminal]
----
$ sudo cp -v /home/core/keepalived.yaml /etc/kubernetes/manifests/
----

. Monitor the platform Operators by running the following command:
+
[source,terminal]