
Commit 55eb7bf

Merge pull request #46563 from slovern/TELCODOCS-339
TELCODOCS-339 - SNO Recovery from failed OCP upgrade for RAN/DU Deployments
2 parents 0bff9bc + 394e04e commit 55eb7bf

4 files changed: +280 -0 lines changed

modules/cnf-topology-aware-lifecycle-manager-backup-concept.adoc

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
// Module included in the following assemblies:
// Epic CNF-3901 (CNF-2133) (4.11), Story TELCODOCS-339
// * scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

:_content-type: CONCEPT
[id="talo-backup-feature-concept_{context}"]
= Creating a backup of cluster resources before upgrade

For {sno}, the {cgu-operator-first} can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore the cluster to a working state without reprovisioning applications.

The backup process starts when the `backup` field is set to `true` in the `ClusterGroupUpgrade` CR.
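
For example, the relevant field in the `ClusterGroupUpgrade` spec is the following excerpt; the complete CR is shown in the next module:

[source,yaml]
----
spec:
  backup: true
----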

The backup process can be in the following statuses:

`BackupStatePreparingToStart`:: The first reconciliation pass is in progress. The {cgu-operator} deletes any spoke backup namespaces and hub view resources that were created in a previously failed upgrade attempt.
`BackupStateStarting`:: The backup prerequisites and the backup job are being created.
`BackupStateActive`:: The backup is in progress.
`BackupStateSucceeded`:: The backup has succeeded.
`BackupStateTimeout`:: The artifact backup was only partially completed.
`BackupStateError`:: The backup ended with a non-zero exit code.

[NOTE]
====
If the backup fails and enters the `BackupStateTimeout` or `BackupStateError` state, the cluster upgrade does not proceed.
====
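
You can check the per-cluster backup state from the hub cluster, for example:

[source,terminal]
----
$ oc get cgu <cgu_name> -n <namespace> -o jsonpath='{.status.backup.status}'
----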

modules/cnf-topology-aware-lifecycle-manager-backup-feature.adoc

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
// Module included in the following assemblies:
// Epic CNF-3901 (CNF-2133) (4.11), Story TELCODOCS-339
// * scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

:_content-type: PROCEDURE
[id="talo-backup-start_and_update_{context}"]
= Creating a ClusterGroupUpgrade CR with backup

For {sno}, you can create a backup of a deployment before an upgrade. If the upgrade fails, you can use the `upgrade-recovery.sh` script generated by {cgu-operator-first} to return the system to its preupgrade state.
The backup consists of the following items:

Cluster backup:: A snapshot of `etcd` and static pod manifests.
Content backup:: Backups of folders, for example, `/etc`, `/usr/local`, `/var/lib/kubelet`.
Changed files backup:: Any files managed by `machine-config` that have been changed.
Deployment:: A pinned `ostree` deployment.
Images (Optional):: Any container images that are in use.

.Prerequisites

* Install the {cgu-operator-first}.
* Provision one or more managed clusters.
* Log in as a user with `cluster-admin` privileges.
* Install {rh-rhacm-first}.

[NOTE]
====
It is highly recommended that you create a recovery partition.
The following is an example `SiteConfig` custom resource (CR) for a recovery partition of 50 GB:

[source,yaml]
----
nodes:
  - hostName: "snonode.sno-worker-0.e2e.bos.redhat.com"
    role: "master"
    rootDeviceHints:
      hctl: "0:2:0:0"
      deviceName: /dev/sda
    ........
    ........
    #Disk /dev/sda: 893.3 GiB, 959119884288 bytes, 1873281024 sectors
    diskPartition:
      - device: /dev/sda
        partitions:
          - mount_point: /var/recovery
            size: 51200
            start: 800000
----
====
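
After the node is installed, one quick way to confirm that the recovery partition exists is to check the mount point on the node, for example:

[source,terminal]
----
$ df -h /var/recovery
----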

.Procedure

. Save the contents of the `ClusterGroupUpgrade` CR with the `backup` field set to `true` in the `clustergroupupgrades-group-du.yaml` file:
+
[source,yaml]
----
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: du-upgrade-4918
  namespace: ztp-group-du-sno
spec:
  preCaching: true
  backup: true
  clusters:
  - cnfdb1
  - cnfdb2
  enable: false
  managedPolicies:
  - du-upgrade-platform-upgrade
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
----

. To start the update, apply the `ClusterGroupUpgrade` CR by running the following command:
+
[source,terminal]
----
$ oc apply -f clustergroupupgrades-group-du.yaml
----

.Verification

* Check the status of the upgrade in the hub cluster by running the following command:
+
[source,terminal]
----
$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
----
+
.Example output
+
[source,json]
----
{
    "backup": {
        "clusters": [
            "cnfdb2",
            "cnfdb1"
        ],
        "status": {
            "cnfdb1": "Succeeded",
            "cnfdb2": "Succeeded"
        }
    },
    "computedMaxConcurrency": 1,
    "conditions": [
        {
            "lastTransitionTime": "2022-04-05T10:37:19Z",
            "message": "Backup is completed",
            "reason": "BackupCompleted",
            "status": "True",
            "type": "BackupDone"
        }
    ],
    "precaching": {
        "spec": {}
    },
    "status": {}
}
----
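+
If you only need to confirm that the backup stage has finished, you can also wait for the `BackupDone` condition shown in the output above, for example:
+
[source,terminal]
----
$ oc wait cgu/du-upgrade-4918 -n ztp-group-du-sno --for=condition=BackupDone --timeout=10m
----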

modules/cnf-topology-aware-lifecycle-manager-backup-recovery.adoc

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
// Module included in the following assemblies:
// Epic CNF-3901 (CNF-2133) (4.11), Story TELCODOCS-339
// * scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

:_content-type: PROCEDURE
[id="talo-backup-recovery_{context}"]
= Recovering a cluster after a failed upgrade

If an upgrade of a cluster fails, you can manually log in to the cluster and use the backup to return the cluster to its preupgrade state. There are two stages:

Rollback:: If the attempted upgrade included a change to the platform OS deployment, you must roll back to the previous version before running the recovery script.
Recovery:: The recovery shuts down containers and uses files from the backup partition to relaunch containers and restore the cluster.

.Prerequisites

* Install the {cgu-operator-first}.
* Provision one or more managed clusters.
* Install {rh-rhacm-first}.
* Log in as a user with `cluster-admin` privileges.
* Run an upgrade that is configured for backup.

.Procedure

. Delete the previously created `ClusterGroupUpgrade` custom resource (CR) by running the following command:
+
[source,terminal]
----
$ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno
----

. Log in to the cluster that you want to recover.

. Check the status of the platform OS deployment by running the following command:
+
[source,terminal]
----
$ ostree admin status
----
+
.Example outputs
+
[source,terminal]
----
[root@lab-test-spoke2-node-0 core]# ostree admin status
* rhcos c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9.0
    Version: 49.84.202202230006-0
    Pinned: yes <1>
    origin refspec: c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9
----
<1> The current deployment is pinned. A platform OS deployment rollback is not necessary.
+
[source,terminal]
----
[root@lab-test-spoke2-node-0 core]# ostree admin status
* rhcos f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa.0
    Version: 410.84.202204050541-0
    origin refspec: f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa
  rhcos ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca.0 (rollback) <1>
    Version: 410.84.202203290245-0
    Pinned: yes <2>
    origin refspec: ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca
----
<1> This platform OS deployment is marked for rollback.
<2> The previous deployment is pinned and can be rolled back.

. To trigger a rollback of the platform OS deployment, run the following command:
+
[source,terminal]
----
$ rpm-ostree rollback -r
----
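+
Optionally, after the node reboots you can confirm that it booted from the previous deployment, for example:
+
[source,terminal]
----
$ rpm-ostree status
----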

. The first phase of the recovery shuts down containers and restores files from the backup partition to the targeted directories. To begin the recovery, run the following command:
+
[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh
----

. When prompted, reboot the cluster by running the following command:
+
[source,terminal]
----
$ systemctl reboot
----

. After the reboot, resume the recovery by running the following command:
+
[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh --resume
----

[NOTE]
====
If the recovery utility fails, you can retry with the `--restart` option:

[source,terminal]
----
$ /var/recovery/upgrade-recovery.sh --restart
----
====

.Verification

* To check the status of the recovery, run the following command:
+
[source,terminal]
----
$ oc get clusterversion,nodes,clusteroperator
----
+
.Example output
[source,terminal]
----
NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.9.23    True        False         86d     Cluster version is 4.9.23 <1>


NAME                          STATUS   ROLES           AGE   VERSION
node/lab-test-spoke1-node-0   Ready    master,worker   86d   v1.22.3+b93fd35 <2>

NAME                                                  VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
clusteroperator.config.openshift.io/authentication    4.9.23    True        False         False      2d7h <3>
clusteroperator.config.openshift.io/baremetal         4.9.23    True        False         False      86d


..............
----
<1> The cluster version is available and has the correct version.
<2> The node status is `Ready`.
<3> The `ClusterOperator` object's availability is `True`.
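
If some cluster Operators are still progressing after the recovery, you can optionally wait for them to become available, for example:

[source,terminal]
----
$ oc wait clusteroperators --all --for=condition=Available --timeout=30m
----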

scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc

Lines changed: 6 additions & 0 deletions
@@ -32,6 +32,12 @@ For more information about `PolicyGenTemplate` CRD, see xref:../scalability_and_

include::modules/cnf-topology-aware-lifecycle-manager-apply-policies.adoc[leveloffset=+2]

include::modules/cnf-topology-aware-lifecycle-manager-backup-concept.adoc[leveloffset=+1]

include::modules/cnf-topology-aware-lifecycle-manager-backup-feature.adoc[leveloffset=+2]

include::modules/cnf-topology-aware-lifecycle-manager-backup-recovery.adoc[leveloffset=+2]

include::modules/cnf-topology-aware-lifecycle-manager-precache-concept.adoc[leveloffset=+1]

include::modules/cnf-topology-aware-lifecycle-manager-precache-feature.adoc[leveloffset=+2]
