
Commit 55eb7bf

Merge pull request #46563 from slovern/TELCODOCS-339
TELCODOCS-339 - SNO Recovery from failed OCP upgrade for RAN/DU Deployments
2 parents 0bff9bc + 394e04e commit 55eb7bf

4 files changed: +280 -0 lines changed

modules/cnf-topology-aware-lifecycle-manager-backup-concept.adoc

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
// Module included in the following assemblies:
// Epic CNF-3901 (CNF-2133) (4.11), Story TELCODOCS-339
// * scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

:_content-type: CONCEPT
[id="talo-backup-feature-concept_{context}"]
= Creating a backup of cluster resources before upgrade

For {sno}, the {cgu-operator-first} can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore the cluster to a working state without reprovisioning applications.

The backup process starts when the `backup` field is set to `true` in the `ClusterGroupUpgrade` CR.
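
For example, the relevant field in the `ClusterGroupUpgrade` spec is the following excerpt; the complete CR is shown in the next module:

[source,yaml]
----
spec:
  backup: true
----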

The backup process can be in the following statuses:

`BackupStatePreparingToStart`:: The first reconciliation pass is in progress. The {cgu-operator} deletes any spoke backup namespaces and hub view resources that were created in a previously failed upgrade attempt.
`BackupStateStarting`:: The backup prerequisites and the backup job are being created.
`BackupStateActive`:: The backup is in progress.
`BackupStateSucceeded`:: The backup has succeeded.
`BackupStateTimeout`:: The artifact backup was only partially completed.
`BackupStateError`:: The backup ended with a non-zero exit code.

[NOTE]
====
If the backup fails and enters the `BackupStateTimeout` or `BackupStateError` state, the cluster upgrade does not proceed.
====
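
You can check the per-cluster backup state from the hub cluster, for example:

[source,terminal]
----
$ oc get cgu <cgu_name> -n <namespace> -o jsonpath='{.status.backup.status}'
----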

modules/cnf-topology-aware-lifecycle-manager-backup-feature.adoc

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
// Module included in the following assemblies:
// Epic CNF-3901 (CNF-2133) (4.11), Story TELCODOCS-339
// * scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

:_content-type: PROCEDURE
[id="talo-backup-start_and_update_{context}"]
= Creating a ClusterGroupUpgrade CR with backup

For {sno}, you can create a backup of a deployment before an upgrade. If the upgrade fails, you can use the `upgrade-recovery.sh` script generated by {cgu-operator-first} to return the system to its preupgrade state.
The backup consists of the following items:

Cluster backup:: A snapshot of `etcd` and static pod manifests.
Content backup:: Backups of folders, for example, `/etc`, `/usr/local`, `/var/lib/kubelet`.
Changed files backup:: Any files managed by `machine-config` that have been changed.
Deployment:: A pinned `ostree` deployment.
Images (Optional):: Any container images that are in use.

.Prerequisites

* Install the {cgu-operator-first}.
* Provision one or more managed clusters.
* Log in as a user with `cluster-admin` privileges.
* Install {rh-rhacm-first}.

[NOTE]
====
It is highly recommended that you create a recovery partition.
The following is an example `SiteConfig` custom resource (CR) for a recovery partition of 50 GB:

[source,yaml]
----
nodes:
  - hostName: "snonode.sno-worker-0.e2e.bos.redhat.com"
    role: "master"
    rootDeviceHints:
      hctl: "0:2:0:0"
      deviceName: /dev/sda
    ........
    ........
    #Disk /dev/sda: 893.3 GiB, 959119884288 bytes, 1873281024 sectors
    diskPartition:
      - device: /dev/sda
        partitions:
          - mount_point: /var/recovery
            size: 51200
            start: 800000
----
====
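
After the node is installed, one quick way to confirm that the recovery partition exists is to check the mount point on the node, for example:

[source,terminal]
----
$ df -h /var/recovery
----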

.Procedure

. Save the contents of the `ClusterGroupUpgrade` CR with the `backup` field set to `true` in the `clustergroupupgrades-group-du.yaml` file:
+
[source,yaml]
----
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: du-upgrade-4918
  namespace: ztp-group-du-sno
spec:
  preCaching: true
  backup: true
  clusters:
  - cnfdb1
  - cnfdb2
  enable: false
  managedPolicies:
  - du-upgrade-platform-upgrade
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
----

. To start the update, apply the `ClusterGroupUpgrade` CR by running the following command:
+
[source,terminal]
----
$ oc apply -f clustergroupupgrades-group-du.yaml
----

.Verification

* Check the status of the upgrade in the hub cluster by running the following command:
+
[source,terminal]
----
$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
----
+
.Example output
+
[source,json]
----
{
    "backup": {
        "clusters": [
            "cnfdb2",
            "cnfdb1"
        ],
        "status": {
            "cnfdb1": "Succeeded",
            "cnfdb2": "Succeeded"
        }
    },
    "computedMaxConcurrency": 1,
    "conditions": [
        {
            "lastTransitionTime": "2022-04-05T10:37:19Z",
            "message": "Backup is completed",
            "reason": "BackupCompleted",
            "status": "True",
            "type": "BackupDone"
        }
    ],
    "precaching": {
        "spec": {}
    },
    "status": {}
}
----
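+
If you only need to confirm that the backup stage has finished, you can also wait for the `BackupDone` condition shown in the output above, for example:
+
[source,terminal]
----
$ oc wait cgu/du-upgrade-4918 -n ztp-group-du-sno --for=condition=BackupDone --timeout=10m
----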

modules/cnf-topology-aware-lifecycle-manager-backup-recovery.adoc

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
// Module included in the following assemblies:
// Epic CNF-3901 (CNF-2133) (4.11), Story TELCODOCS-339
// * scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

:_content-type: PROCEDURE
[id="talo-backup-recovery_{context}"]
= Recovering a cluster after a failed upgrade

If an upgrade of a cluster fails, you can manually log in to the cluster and use the backup to return the cluster to its preupgrade state. There are two stages:

Rollback:: If the attempted upgrade included a change to the platform OS deployment, you must roll back to the previous version before running the recovery script.
Recovery:: The recovery shuts down containers and uses files from the backup partition to relaunch containers and restore the cluster.

.Prerequisites

* Install the {cgu-operator-first}.
* Provision one or more managed clusters.
* Install {rh-rhacm-first}.
* Log in as a user with `cluster-admin` privileges.
* Run an upgrade that is configured for backup.

.Procedure

. Delete the previously created `ClusterGroupUpgrade` custom resource (CR) by running the following command:
+
[source,terminal]
----
$ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno
----

. Log in to the cluster that you want to recover.

. Check the status of the platform OS deployment by running the following command:
+
[source,terminal]
----
$ ostree admin status
----
+
.Example outputs
+
[source,terminal]
----
[root@lab-test-spoke2-node-0 core]# ostree admin status
* rhcos c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9.0
    Version: 49.84.202202230006-0
    Pinned: yes <1>
    origin refspec: c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9
----
<1> The current deployment is pinned. A platform OS deployment rollback is not necessary.
+
[source,terminal]
----
[root@lab-test-spoke2-node-0 core]# ostree admin status
* rhcos f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa.0
    Version: 410.84.202204050541-0
    origin refspec: f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa
  rhcos ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca.0 (rollback) <1>
    Version: 410.84.202203290245-0
    Pinned: yes <2>
    origin refspec: ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca
----
<1> This platform OS deployment is marked for rollback.
<2> The previous deployment is pinned and can be rolled back.

. To trigger a rollback of the platform OS deployment, run the following command:
+
[source,terminal]
----
$ rpm-ostree rollback -r
----
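+
Optionally, after the node reboots you can confirm that it booted from the previous deployment, for example:
+
[source,terminal]
----
$ rpm-ostree status
----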

. The first phase of the recovery shuts down containers and restores files from the backup partition to the targeted directories. To begin the recovery, run the following command:
+
[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh
----

. When prompted, reboot the cluster by running the following command:
+
[source,terminal]
----
$ systemctl reboot
----

. After the reboot, resume the recovery by running the following command:
+
[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh --resume
----

[NOTE]
====
If the recovery utility fails, you can retry with the `--restart` option:

[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh --restart
----
====

.Verification

* To check the status of the recovery, run the following command:
+
[source,terminal]
----
$ oc get clusterversion,nodes,clusteroperator
----
+
.Example output
[source,terminal]
----
NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.9.23    True        False         86d     Cluster version is 4.9.23 <1>


NAME                          STATUS   ROLES           AGE   VERSION
node/lab-test-spoke1-node-0   Ready    master,worker   86d   v1.22.3+b93fd35 <2>

NAME                                                  VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
clusteroperator.config.openshift.io/authentication    4.9.23    True        False         False      2d7h <3>
clusteroperator.config.openshift.io/baremetal         4.9.23    True        False         False      86d


..............
----
<1> The cluster version is available and has the correct version.
<2> The node status is `Ready`.
<3> The `ClusterOperator` object's availability is `True`.
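
If some cluster Operators are still progressing after the recovery, you can optionally wait for them to become available, for example:

[source,terminal]
----
$ oc wait clusteroperators --all --for=condition=Available --timeout=30m
----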

scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

Lines changed: 6 additions & 0 deletions
@@ -32,6 +32,12 @@ For more information about `PolicyGenTemplate` CRD, see xref:../scalability_and_

include::modules/cnf-topology-aware-lifecycle-manager-apply-policies.adoc[leveloffset=+2]

include::modules/cnf-topology-aware-lifecycle-manager-backup-concept.adoc[leveloffset=+1]

include::modules/cnf-topology-aware-lifecycle-manager-backup-feature.adoc[leveloffset=+2]

include::modules/cnf-topology-aware-lifecycle-manager-backup-recovery.adoc[leveloffset=+2]

include::modules/cnf-topology-aware-lifecycle-manager-precache-concept.adoc[leveloffset=+1]

include::modules/cnf-topology-aware-lifecycle-manager-precache-feature.adoc[leveloffset=+2]
