Merge pull request #68477 from skopacz1/OSDOCS-2427

ShaunaDiaz · web-flow · commit a833ecc7b630 · 2024-01-15T07:05:00.000-05:00
OSDOCS-2427: Update best practices guidance
diff --git a/modules/update-best-practices.adoc b/modules/update-best-practices.adoc
@@ -0,0 +1,63 @@
+// Module included in the following assemblies:
+//
+// * updating/preparing_for_updates/updating-cluster-prepare.adoc
+
+:_mod-docs-content-type: REFERENCE
+[id="update-best-practices_{context}"]
+= Best practices for cluster updates
+
+{product-title} provides a robust update experience that minimizes workload disruptions during an update.
+Updates will not begin unless the cluster is in an upgradeable state at the time of the update request.
+
+This design enforces some key conditions before initiating an update, but there are a number of actions you can take to increase your chances of a successful cluster update.
+
+[id="recommended-versions_{context}"]
+== Choose versions recommended by the OpenShift Update Service
+
+The OpenShift Update Service (OSUS) provides update recommendations based on cluster characteristics such as the cluster's subscribed channel.
+The Cluster Version Operator saves these recommendations as either recommended or conditional updates.
+While it is possible to attempt an update to a version that is not recommended by OSUS, following a recommended update path protects users from encountering known issues or unintended consequences on the cluster.
+
+Choose only update targets that are recommended by OSUS to ensure a successful update.
+
+[id="critical-alerts_{context}"]
+== Address all critical alerts on the cluster
+
+Critical alerts must always be addressed as soon as possible, but it is especially important to address these alerts and resolve any problems before initiating a cluster update.
+Failing to address critical alerts before beginning an update can cause problematic conditions for the cluster.
+
+In the *Administrator* perspective of the web console, navigate to *Observe* -> *Alerting* to find critical alerts.
+
+[id="cluster-upgradeable_{context}"]
+== Ensure that the cluster is in an Upgradable state
+
+When one or more Operators have not reported their `Upgradeable` condition as `True` for more than an hour, the `ClusterNotUpgradeable` warning alert is triggered in the cluster.
+In most cases this alert does not block patch updates, but you cannot perform a minor version update until you resolve this alert and all Operators report `Upgradeable` as `True`.
+
+For more information about the `Upgradeable` condition, see "Understanding cluster Operator condition types" in the additional resources section.
+
+[id="nodes-ready_{context}"]
+== Ensure that enough spare nodes are available
+
+// Completely guessing the explanation in this section just to have something to start with when this is reviewed by an SME.
+A cluster should not be running with little to no spare node capacity, especially when initiating a cluster update.
+Nodes that are not running and available may limit a cluster's ability to perform an update with minimal disruption to cluster workloads.
+
+Depending on the configured value of the cluster's `maxUnavailable` spec, the cluster might not be able to apply machine configuration changes to nodes if there is an unavailable node.
+Additionally, if compute nodes do not have enough spare capacity, workloads might not be able to temporarily shift to another node while the first node is taken offline for an update.
+
+Make sure that you have enough available nodes in each worker pool, as well as enough spare capacity on your compute nodes, to increase the chance of successful node updates.
+
+[id="pod-disruption-budget_{context}"]
+== Ensure that the cluster's PodDisruptionBudget is properly configured
+
+You can use the `PodDisruptionBudget` object to define the minimum number or percentage of pod replicas that must be available at any given time.
+This configuration protects workloads from disruptions during maintenance tasks such as cluster updates.
+
+However, it is possible to configure the `PodDisruptionBudget` for a given topology in a way that prevents nodes from being drained and updated during a cluster update.
+
+When planning a cluster update, check the configuration of the `PodDisruptionBudget` object for the following factors:
+
+* For highly available workloads, make sure there are replicas that can be temporarily taken offline without being prohibited by the `PodDisruptionBudget`.
+
+* For workloads that aren't highly available, make sure they are either not protected by a `PodDisruptionBudget` or have some alternative mechanism for draining these workloads eventually, such as periodic restart or guaranteed eventual termination.
diff --git a/updating/preparing_for_updates/updating-cluster-prepare.adoc b/updating/preparing_for_updates/updating-cluster-prepare.adoc
@@ -57,3 +57,10 @@ include::modules/update-preparing-conditional.adoc[leveloffset=+1]
 [role="_additional-resources"]
 .Additional resources
 * xref:../../updating/understanding_updates/how-updates-work.adoc#update-evaluate-availability_how-updates-work[Evaluation of update availability]
+
+// Best practices for cluster updates
+include::modules/update-best-practices.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../../updating/understanding_updates/intro-to-updates.adoc#understanding_clusteroperator_conditiontypes_understanding-openshift-updates[Understanding cluster Operator condition types]