diff --git a/modules/dr-restoring-cluster-state.adoc b/modules/dr-restoring-cluster-state.adoc
index 8b6cb51d97e6..cd75eb4d29ab 100644
--- a/modules/dr-restoring-cluster-state.adoc
+++ b/modules/dr-restoring-cluster-state.adoc
@@ -514,6 +514,79 @@ $ oc -n openshift-ovn-kubernetes get pod -l app=ovnkube-node --field-selector=sp
 It might take several minutes for the pods to restart.
 ====
 
+. Delete and re-create other non-recovery, control plane machines, one by one. After the machines are re-created, a new revision is forced and `etcd` automatically scales up.
++
+** If you use a user-provisioned bare metal installation, you can re-create a control plane machine by using the same method that you used to originally create it. For more information, see "Installing a user-provisioned cluster on bare metal".
++
+[WARNING]
+====
+Do not delete and re-create the machine for the recovery host.
+====
++
+** If you are running installer-provisioned infrastructure, or you used the Machine API to create your machines, follow these steps:
++
+[WARNING]
+====
+Do not delete and re-create the machine for the recovery host.
+
+For bare metal installations on installer-provisioned infrastructure, control plane machines are not re-created. For more information, see "Replacing a bare-metal control plane node".
+====
+.. Obtain the machine for one of the lost control plane hosts.
++
+In a terminal that has access to the cluster as a cluster-admin user, run the following command:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output
+[source,terminal]
+----
+NAME                                        PHASE     TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
+clustername-8qw5l-master-0                  Running   m4.xlarge   us-east-1   us-east-1a   3h37m   ip-10-0-131-183.ec2.internal   aws:///us-east-1a/i-0ec2782f8287dfb7e   stopped <1>
+clustername-8qw5l-master-1                  Running   m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-143-125.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
+clustername-8qw5l-master-2                  Running   m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-154-194.ec2.internal   aws:///us-east-1c/i-02626f1dba9ed5bba   running
+clustername-8qw5l-worker-us-east-1a-wbtgd   Running   m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
+clustername-8qw5l-worker-us-east-1b-lrdxb   Running   m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
+clustername-8qw5l-worker-us-east-1c-pkg26   Running   m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
+----
+<1> This is the control plane machine for the lost control plane host, `ip-10-0-131-183.ec2.internal`.
+
+.. Delete the machine of the lost control plane host by running:
++
+[source,terminal]
+----
+$ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>
+----
+<1> Specify the name of the control plane machine for the lost control plane host.
++
+A new machine is automatically provisioned after deleting the machine of the lost control plane host.
+
+.. Verify that a new machine has been created by running:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output
+[source,terminal]
+----
+NAME                                        PHASE          TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
+clustername-8qw5l-master-1                  Running        m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-143-125.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
+clustername-8qw5l-master-2                  Running        m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-154-194.ec2.internal   aws:///us-east-1c/i-02626f1dba9ed5bba   running
+clustername-8qw5l-master-3                  Provisioning   m4.xlarge   us-east-1   us-east-1a   85s     ip-10-0-173-171.ec2.internal   aws:///us-east-1a/i-015b0888fe17bc2c8   running <1>
+clustername-8qw5l-worker-us-east-1a-wbtgd   Running        m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
+clustername-8qw5l-worker-us-east-1b-lrdxb   Running        m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
+clustername-8qw5l-worker-us-east-1c-pkg26   Running        m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
+----
+<1> The new machine, `clustername-8qw5l-master-3` is being created and is ready after the phase changes from `Provisioning` to `Running`.
++
+It might take a few minutes for the new machine to be created. The `etcd` cluster Operator will automatically sync when the machine or node returns to a healthy state.
+
+.. Repeat these steps for each lost control plane host that is not the recovery host.
+
 . Turn off the quorum guard by running the following command:
 +
 [source,terminal]
@@ -657,13 +730,6 @@ AllNodesAtLatestRevision
 +
 If the output includes multiple revision numbers, such as `2 nodes are at revision 6; 1 nodes are at revision 7`, this means that the update is still in progress. Wait a few minutes and try again.
 
-. If the `keepalived` daemon is in use, restore the configuration on the control plane nodes other than the recovery host by running the following command. Otherwise, the network operator will not advance beyond the "Progressing" state.
-+
-[source,terminal]
-----
-$ sudo cp -v /home/core/keepalived.yaml /etc/kubernetes/manifests/
-----
-
 . Monitor the platform Operators by running the following command:
 +
 [source,terminal]