// Module included in the following assemblies:
// Epic CNF-3901 (CNF-2133) (4.11), Story TELCODOCS-339
// * scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

:_content-type: PROCEDURE
[id="talo-backup-recovery_{context}"]
= Recovering a cluster after a failed upgrade

If an upgrade of a cluster fails, you can manually log in to the cluster and use the backup to return the cluster to its preupgrade state. There are two stages:

Rollback:: If the attempted upgrade included a change to the platform OS deployment, you must roll back to the previous version before running the recovery script.
Recovery:: The recovery shuts down containers and uses files from the backup partition to relaunch containers and restore the cluster.

.Prerequisites

* Install the {cgu-operator-first}.
* Provision one or more managed clusters.
* Install {rh-rhacm-first}.
* Log in as a user with `cluster-admin` privileges.
* Run an upgrade that is configured for backup.
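+
For example, you can confirm that the failed upgrade was configured for backup before you delete the `ClusterGroupUpgrade` CR in the procedure below. This check assumes that the backup is enabled through the `backup` field in the CR `spec`, and it reuses the CR name and namespace from the example in the procedure:
+
[source,terminal]
----
$ oc get cgu du-upgrade-4918 -n ztp-group-du-sno -o jsonpath='{.spec.backup}'
----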
.Procedure

. Delete the previously created `ClusterGroupUpgrade` custom resource (CR) by running the following command:
+
[source,terminal]
----
$ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno
----
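+
Optionally, confirm that the CR is removed by listing any remaining `ClusterGroupUpgrade` CRs in the same namespace as the example above:
+
[source,terminal]
----
$ oc get cgu -n ztp-group-du-sno
----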

. Log in to the cluster that you want to recover.
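+
How you log in depends on your environment. For example, for a single-node cluster you might connect to the node over SSH as the `core` user; the node name here is a placeholder:
+
[source,terminal]
----
$ ssh core@<node_name>
----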

. Check the status of the platform OS deployment by running the following command:
+
[source,terminal]
----
$ ostree admin status
----
+
.Example outputs
[source,terminal]
----
[root@lab-test-spoke2-node-0 core]# ostree admin status
* rhcos c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9.0
    Version: 49.84.202202230006-0
    Pinned: yes <1>
    origin refspec: c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9
----
<1> The current deployment is pinned. A platform OS deployment rollback is not necessary.
+
[source,terminal]
----
[root@lab-test-spoke2-node-0 core]# ostree admin status
* rhcos f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa.0
    Version: 410.84.202204050541-0
    origin refspec: f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa
rhcos ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca.0 (rollback) <1>
    Version: 410.84.202203290245-0
    Pinned: yes <2>
    origin refspec: ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca
----
<1> This platform OS deployment is marked for rollback.
<2> The previous deployment is pinned and can be rolled back.

. To trigger a rollback of the platform OS deployment, run the following command:
+
[source,terminal]
----
$ rpm-ostree rollback -r
----
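+
The `-r` option reboots the node to complete the rollback. After the node restarts, you can optionally rerun the status check from the previous step to confirm that the pinned deployment is now the booted deployment:
+
[source,terminal]
----
$ ostree admin status
----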

. The first phase of the recovery shuts down containers and restores files from the backup partition to the targeted directories. To begin the recovery, run the following command:
+
[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh
----
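+
Optionally, you can inspect the directory that holds the recovery content; this assumes that the backup taken during the upgrade was placed under `/var/recovery`, as the script path suggests:
+
[source,terminal]
----
$ ls /var/recovery
----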

. When prompted, reboot the cluster by running the following command:
+
[source,terminal]
----
$ systemctl reboot
----

. After the reboot, restart the recovery by running the following command:
+
[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh --resume
----
+
[NOTE]
====
If the recovery utility fails, you can retry with the `--restart` option:

[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh --restart
----
====

.Verification
* To check the status of the recovery, run the following command:
+
[source,terminal]
----
$ oc get clusterversion,nodes,clusteroperator
----
+
.Example output
[source,terminal]
----
NAME                                         VERSION     AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.9.23      True        False         86d     Cluster version is 4.9.23 <1>


NAME                          STATUS   ROLES           AGE   VERSION
node/lab-test-spoke1-node-0   Ready    master,worker   86d   v1.22.3+b93fd35 <2>

NAME                                                  VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
clusteroperator.config.openshift.io/authentication    4.9.23    True        False         False      2d7h <3>
clusteroperator.config.openshift.io/baremetal         4.9.23    True        False         False      86d


..............
----
<1> The cluster version is available and has the correct version.
<2> The node status is `Ready`.
<3> The `ClusterOperator` object's availability is `True`.
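* Optionally, from the hub cluster, check that the recovered managed cluster is reporting as available again. The cluster name is a placeholder for the name of the recovered cluster:
+
[source,terminal]
----
$ oc get managedcluster <cluster_name>
----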