Merge pull request #57047 from slovern/TELCODOCS-949

jeana-redhat · web-flow · commit 980e09883f0f · 2023-03-29T14:32:00.000-04:00
TELCODOCS-949 TALM 4.13 updates
diff --git a/modules/cnf-topology-aware-lifecycle-manager-about-cgu-crs.adoc b/modules/cnf-topology-aware-lifecycle-manager-about-cgu-crs.adoc
@@ -55,6 +55,7 @@ After {cgu-operator} completes a cluster update, the cluster does not update aga
 * The `clusters` field specifies a list of clusters to update.
 * The `canaries` field specifies the clusters for canary updates.
 * The `maxConcurrency` field specifies the number of clusters to update in a batch.
+* The `actions` field specifies `beforeEnable` actions that {cgu-operator} takes as it begins the update process, and `afterCompletion` actions that {cgu-operator} takes as it completes policy remediation for each cluster.
 
 You can use the `clusters`, `clusterLabelSelector`, and `clusterSelector` fields together to create a combined list of clusters.
 
@@ -76,43 +77,49 @@ metadata:
   resourceVersion: '40451823'
   uid: cca245a5-4bca-45fa-89c0-aa6af81a596c
 Spec:
-  actions:
-    afterCompletion:
+  actions: 
+    afterCompletion: <1>
+      addClusterLabels:
+        upgrade-done: "" 
+      deleteClusterLabels:
+        upgrade-running: ""
       deleteObjects: true
-    beforeEnable: {}
+    beforeEnable: <2>
+      addClusterLabels:
+        upgrade-running: ""     
   backup: false
-  clusters: <1>
+  clusters: <3>
     - spoke1
-  enable: false <2>
-  managedPolicies: <3>
+  enable: false <4>
+  managedPolicies: <5>
     - talm-policy
   preCaching: false
-  remediationStrategy: <4>
-    canaries: <5>
+  remediationStrategy: <6>
+    canaries: <7>
         - spoke1
-    maxConcurrency: 2 <6>
+    maxConcurrency: 2 <8>
     timeout: 240
-  clusterLabelSelectors: <7>
+  clusterLabelSelectors: <9>
     - matchExpressions:
       - key: label1
       operator: In
       values:
         - value1a
         - value1b
-  batchTimeoutAction: <8>
-status: <9>
+  batchTimeoutAction: <10>
+status: <11>
     computedMaxConcurrency: 2
     conditions:
       - lastTransitionTime: '2022-11-18T16:27:15Z'
         message: All selected clusters are valid
         reason: ClusterSelectionCompleted
         status: 'True'
-        type: ClustersSelected <10>
+        type: ClustersSelected <12>
       - lastTransitionTime: '2022-11-18T16:27:15Z'
         message: Completed validation
         reason: ValidationCompleted
         status: 'True'
-        type: Validated <11>
+        type: Validated <13>
       - lastTransitionTime: '2022-11-18T16:37:16Z'
         message: Not enabled
         reason: NotEnabled
@@ -129,17 +136,19 @@ status: <9>
         - spoke3
     status:
 ----
-<1> Defines the list of clusters to update.
-<2> The `enable` field is set to `false`.
-<3> Lists the user-defined set of policies to remediate.
-<4> Defines the specifics of the cluster updates.
-<5> Defines the clusters for canary updates.
-<6> Defines the maximum number of concurrent updates in a batch. The number of remediation batches is the number of canary clusters, plus the number of clusters, except the canary clusters, divided by the maxConcurrency value. The clusters that are already compliant with all the managed policies are excluded from the remediation plan.
-<7> Displays the parameters for selecting clusters.
-<8> Controls what happens if a batch times out. Possible values are `abort` or `continue`. If unspecified, the default is `continue`.
-<9> Displays information about the status of the updates.
-<10> The `ClustersSelected` condition shows that all selected clusters are valid.
-<11> The `Validated` condition shows that all selected clusters have been validated.
+<1> Specifies the action that {cgu-operator} takes when it completes policy remediation for each cluster.
+<2> Specifies the action that {cgu-operator} takes as it begins the update process.
+<3> Defines the list of clusters to update.
+<4> The `enable` field is set to `false`.
+<5> Lists the user-defined set of policies to remediate.
+<6> Defines the specifics of the cluster updates.
+<7> Defines the clusters for canary updates.
+<8> Defines the maximum number of concurrent updates in a batch. The number of remediation batches is the number of canary clusters, plus the number of clusters, except the canary clusters, divided by the `maxConcurrency` value. The clusters that are already compliant with all the managed policies are excluded from the remediation plan.
+<9> Displays the parameters for selecting clusters.
+<10> Controls what happens if a batch times out. Possible values are `abort` or `continue`. If unspecified, the default is `continue`.
+<11> Displays information about the status of the updates.
+<12> The `ClustersSelected` condition shows that all selected clusters are valid.
+<13> The `Validated` condition shows that all selected clusters have been validated.
 
 [NOTE]
 ====
@@ -168,7 +177,8 @@ Policies are missing or invalid, or an invalid platform image has been specified
 [id="precaching_{context}"]
 == Pre-caching
 
-Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed. On {sno} clusters, you can use pre-caching to avoid this. The container image pre-caching starts when you create a `ClusterGroupUpgrade` CR with the `preCaching` field set to `true`.
+Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed. On {sno} clusters, you can use pre-caching to avoid this. The container image pre-caching starts when you create a `ClusterGroupUpgrade` CR with the `preCaching` field set to `true`. 
+{cgu-operator} compares the available disk space with the estimated {product-title} image size to ensure that there is enough space. If a cluster has insufficient space, {cgu-operator} cancels pre-caching for that cluster and does not remediate policies on it.
 
 {cgu-operator} uses the `PrecacheSpecValid` condition to report status information as follows:
 
diff --git a/modules/cnf-topology-aware-lifecycle-manager-autocreate-cgu-cr-ztp.adoc b/modules/cnf-topology-aware-lifecycle-manager-autocreate-cgu-cr-ztp.adoc
@@ -8,12 +8,9 @@
 
 {cgu-operator} has a controller called `ManagedClusterForCGU` that monitors the `Ready` state of the `ManagedCluster` CRs on the hub cluster and creates the `ClusterGroupUpgrade` CRs for ZTP (zero touch provisioning).
 
-For any managed cluster in the `Ready` state without a "ztp-done" label applied, the `ManagedClusterForCGU` controller automatically creates a `ClusterGroupUpgrade` CR in the `ztp-install` namespace with its associated {rh-rhacm} policies that are created during the ZTP process. {cgu-operator} then remediates the set of configuration policies that are listed in the auto-created `ClusterGroupUpgrade` CR to push the configuration CRs to the managed cluster.
+For any managed cluster in the `Ready` state without a `ztp-done` label applied, the `ManagedClusterForCGU` controller automatically creates a `ClusterGroupUpgrade` CR in the `ztp-install` namespace with its associated {rh-rhacm} policies that are created during the ZTP process. {cgu-operator} then remediates the set of configuration policies that are listed in the auto-created `ClusterGroupUpgrade` CR to push the configuration CRs to the managed cluster.
 
-[NOTE]
-====
-If the managed cluster has no bound policies when the cluster becomes `Ready`, no `ClusterGroupUpgrade` CR is created.
-====
+If there are no policies for the managed cluster when the cluster becomes `Ready`, a `ClusterGroupUpgrade` CR with no policies is created. Upon completion of the `ClusterGroupUpgrade` CR, the managed cluster is labeled as `ztp-done`. If there are policies that you want to apply to that managed cluster, manually create a `ClusterGroupUpgrade` CR as a day-2 operation.
 
 .Example of an auto-created `ClusterGroupUpgrade` CR for ZTP
 
diff --git a/modules/cnf-topology-aware-lifecycle-manager-precache-feature.adoc b/modules/cnf-topology-aware-lifecycle-manager-precache-feature.adoc
@@ -8,6 +8,11 @@
 
 For {sno}, the pre-cache feature allows the required container images to be present on the spoke cluster before the update starts.
 
+[NOTE]
+====
+For pre-caching, {cgu-operator} uses the `spec.remediationStrategy.timeout` value from the `ClusterGroupUpgrade` CR. You must set a `timeout` value that allows sufficient time for the pre-caching job to complete. When you enable the `ClusterGroupUpgrade` CR after pre-caching has completed, you can change the `timeout` value to a duration that is appropriate for the update.
+====
+
 .Prerequisites
 
 * Install the {cgu-operator-first}.
diff --git a/modules/cnf-topology-aware-lifecycle-manager-precache-image-filter.adoc b/modules/cnf-topology-aware-lifecycle-manager-precache-image-filter.adoc
@@ -0,0 +1,33 @@
+// Module included in the following assemblies:
+// Epic CNF-6848 (4.13), Story TELCODOCS-949
+// * scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc
+
+:_content-type: CONCEPT
+[id="talo-precache-feature-image-filter_{context}"]
+= Using the container image pre-cache filter
+
+The pre-cache feature typically downloads more images than a cluster needs for an update. You can control which pre-cache images are downloaded to a cluster. This decreases download time, and saves bandwidth and storage.
+
+You can see a list of all images to be downloaded using the following command:
+
+[source,terminal]
+----
+$ oc adm release info <ocp-version>
+----
+
+The following `ConfigMap` example shows how you can exclude images using the `excludePrecachePatterns` field.
+
+[source,yaml]
+----
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: cluster-group-upgrade-overrides
+data:
+  excludePrecachePatterns: |
+    azure <1>
+    aws 
+    vsphere 
+    alibaba
+----
+<1> {cgu-operator} excludes all images with names that include any of the patterns listed here.
diff --git a/modules/cnf-topology-aware-lifecycle-manager-troubleshooting.adoc b/modules/cnf-topology-aware-lifecycle-manager-troubleshooting.adoc
@@ -442,4 +442,55 @@ This may be because:
 * The CGU was run too soon after a policy was created or updated. 
 * The remediation of a policy affects the compliance of subsequent policies in the `ClusterGroupUpgrade` CR.
 
-Resolution:: Create a new and apply `ClusterGroupUpdate` CR with the same specification .
+Resolution:: Create and apply a new `ClusterGroupUpgrade` CR with the same specification.
+
+[discrete]
+[id="talo-troubleshooting-auto-create-policies_{context}"]
+=== Auto-created `ClusterGroupUpgrade` CR in the ZTP workflow has no managed policies
+
+Issue:: If there are no policies for the managed cluster when the cluster becomes `Ready`, a `ClusterGroupUpgrade` CR with no policies is auto-created. 
+Upon completion of the `ClusterGroupUpgrade` CR, the managed cluster is labeled as `ztp-done`. 
+If the `PolicyGenTemplate` CRs were not pushed to the Git repository within the required time after `SiteConfig` resources were pushed, this might result in no policies being available for the target cluster when the cluster became `Ready`.
+
+Resolution:: Verify that the policies you want to apply are available on the hub cluster, then create a `ClusterGroupUpgrade` CR with the required policies. 
+
+You can either manually create the `ClusterGroupUpgrade` CR or trigger auto-creation again. To trigger auto-creation of the `ClusterGroupUpgrade` CR, remove the `ztp-done` label from the cluster and delete the empty `ClusterGroupUpgrade` CR that was previously created in the `ztp-install` namespace.
+
+[discrete]
+[id="talo-troubleshooting-pre-cache-failed_{context}"]
+=== Pre-caching has failed 
+
+Issue:: Pre-caching might fail for one of the following reasons: 
+* There is not enough free space on the node.
+* For a disconnected environment, the pre-cache image has not been properly mirrored. 
+* There was an issue when creating the pod.
+
+Resolution:: 
+. To check if pre-caching has failed due to insufficient space, check the log of the pre-caching pod on the node.
+.. Find the name of the pod using the following command:
++
+[source,terminal]
+----
+$ oc get pods -n openshift-talo-pre-cache
+----
++
+.. Check the logs to see if the error is related to insufficient space using the following command:
++
+[source,terminal]
+----
+$ oc logs -n openshift-talo-pre-cache <pod_name>
+----
++
+. If there is no log, check the pod status using the following command:
++
+[source,terminal]
+----
+$ oc describe pod -n openshift-talo-pre-cache <pod_name>
+----
++
+. If the pod does not exist, check the job status to see why it could not create a pod using the following command:
++
+[source,terminal]
+----
+$ oc describe job -n openshift-talo-pre-cache pre-cache
+----
diff --git a/scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc b/scalability_and_performance/cnf-talm-for-cluster-upgrades.adoc
@@ -39,6 +39,8 @@ include::modules/cnf-topology-aware-lifecycle-manager-backup-recovery.adoc[level
 
 include::modules/cnf-topology-aware-lifecycle-manager-precache-concept.adoc[leveloffset=+1]
 
+include::modules/cnf-topology-aware-lifecycle-manager-precache-image-filter.adoc[leveloffset=+2]
+
 include::modules/cnf-topology-aware-lifecycle-manager-precache-feature.adoc[leveloffset=+2]
 
 include::modules/cnf-topology-aware-lifecycle-manager-troubleshooting.adoc[leveloffset=+1]
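
Note on the `maxConcurrency` callout in the `ClusterGroupUpgrade` example: the batch-count rule it describes (number of canary clusters, plus the remaining clusters divided by `maxConcurrency`) can be illustrated with a small calculation. The sketch below is not taken from the operator's code. The function name `remediation_batches` is hypothetical, rounding up a partial final batch is an assumption, and the input list is assumed to already exclude clusters that are compliant with all managed policies.

```python
import math


def remediation_batches(clusters, canaries, max_concurrency):
    """Estimate the number of remediation batches for a cluster list.

    Assumptions (not confirmed by the source text):
    - each canary cluster counts as one batch, as the callout states;
    - a partial final batch of non-canary clusters counts as a full
      batch, so the division is rounded up;
    - `clusters` already excludes clusters that are compliant with all
      managed policies.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    # Non-canary clusters are the ones grouped by max_concurrency.
    remaining = [c for c in clusters if c not in canaries]
    return len(canaries) + math.ceil(len(remaining) / max_concurrency)


# Hypothetical cluster names, used only for illustration.
clusters = ["spoke1", "spoke2", "spoke3", "spoke4", "spoke5", "spoke6"]
batches = remediation_batches(clusters, canaries=["spoke1"], max_concurrency=2)
print(batches)
```

With one canary and five remaining clusters at `maxConcurrency: 2`, this sketch gives one canary batch plus three regular batches under the round-up assumption.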