Commit 7626cd5

Merge pull request #50952 from slovern/TELCODOCS-477
TELCODOCS 477 CNF-3882 4.12 ACM Topology Aware Lifecycle Manager
2 parents eaed294 + 140f5cd commit 7626cd5

12 files changed: +510 −218 lines

modules/cnf-about-topology-aware-lifecycle-manager-config.adoc

Lines changed: 3 additions & 0 deletions
@@ -14,5 +14,8 @@ The {cgu-operator-first} manages the deployment of {rh-rhacm-first} policies for
 * The update order of the clusters
 * The set of policies remediated to the cluster
 * The order of policies remediated to the cluster
+* The assignment of a canary cluster
+
+For {sno}, the {cgu-operator-first} can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications.
 
 {cgu-operator} supports the orchestration of the {product-title} y-stream and z-stream updates, and day-two operations on y-streams and z-streams.
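The options listed above correspond to fields in the `ClusterGroupUpgrade` CR spec. The following is a minimal sketch assembled from the field names shown in the modules of this commit; the CR name, cluster names, and policy name are illustrative:

```yaml
# Sketch of a ClusterGroupUpgrade CR; names are illustrative.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-example
  namespace: default
spec:
  backup: true                 # {sno} only: back up the deployment before the update
  enable: false                # set to true to start remediation
  clusters:                    # the set and update order of the clusters
  - spoke1
  - spoke2
  managedPolicies:             # the set and order of policies remediated
  - example-policy
  remediationStrategy:
    canaries:                  # the assignment of a canary cluster
    - spoke1
    maxConcurrency: 2
    timeout: 240
```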

modules/cnf-topology-aware-lifecycle-manager-about-cgu-crs.adoc

Lines changed: 372 additions & 153 deletions
Large diffs are not rendered by default.

modules/cnf-topology-aware-lifecycle-manager-apply-policies.adoc

Lines changed: 21 additions & 9 deletions
@@ -41,11 +41,13 @@ spec:
   remediationStrategy:
     maxConcurrency: 2 <3>
     timeout: 240 <4>
+    batchTimeoutAction: <5>
 ----
 <1> The name of the policies to apply.
 <2> The list of clusters to update.
 <3> The `maxConcurrency` field signifies the number of clusters updated at the same time.
 <4> The update timeout in minutes.
+<5> Controls what happens if a batch times out. Possible values are `abort` or `continue`. If unspecified, the default is `continue`.
 
 . Create the `ClusterGroupUpgrade` CR by running the following command:
 +
@@ -65,8 +67,8 @@ $ oc get cgu --all-namespaces
 +
 [source,terminal]
 ----
-NAMESPACE   NAME    AGE
-default     cgu-1   8m55s
+NAMESPACE   NAME    AGE     STATE        DETAILS
+default     cgu-1   8m55s   NotEnabled   Not Enabled
 ----
 
 .. Check the status of the update by running the following command:
@@ -85,10 +87,10 @@ $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
 "conditions": [
   {
     "lastTransitionTime": "2022-02-25T15:34:07Z",
-    "message": "The ClusterGroupUpgrade CR is not enabled", <1>
-    "reason": "UpgradeNotStarted",
+    "message": "Not enabled", <1>
+    "reason": "NotEnabled",
     "status": "False",
-    "type": "Ready"
+    "type": "Progressing"
   }
 ],
 "copiedPolicies": [
@@ -204,11 +206,21 @@ $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
 "computedMaxConcurrency": 2,
 "conditions": [ <1>
   {
+    "lastTransitionTime": "2022-02-25T15:33:07Z",
+    "message": "All selected clusters are valid",
+    "reason": "ClusterSelectionCompleted",
+    "status": "True",
+    "type": "ClustersSelected"
+  },
+  {
+    "lastTransitionTime": "2022-02-25T15:33:07Z",
+    "message": "Completed validation",
+    "reason": "ValidationCompleted",
+    "status": "True",
+    "type": "Validated"
+  },
+  {
     "lastTransitionTime": "2022-02-25T15:34:07Z",
-    "message": "The ClusterGroupUpgrade CR has upgrade policies that are still non compliant",
-    "reason": "UpgradeNotCompleted",
-    "status": "False",
-    "type": "Ready"
+    "message": "Remediating non-compliant policies",
+    "reason": "InProgress",
+    "status": "True",
+    "type": "Progressing"
   }
 ],
 "copiedPolicies": [
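The `NotEnabled` and `Progressing` states shown in this module persist until remediation is started. As a sketch, assuming that setting the `spec.enable` field to `true` is what starts remediation (as the other modules in this commit describe), the fragment to merge-patch into the `cgu-1` CR would be:

```yaml
# Sketch: merge-patching this fragment into the ClusterGroupUpgrade CR
# starts policy remediation.
spec:
  enable: true
```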

modules/cnf-topology-aware-lifecycle-manager-backup-concept.adoc

Lines changed: 29 additions & 9 deletions
@@ -8,17 +8,37 @@
 
 For {sno}, the {cgu-operator-first} can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications.
 
-The container image backup starts when the `backup` field is set to `true` in the `ClusterGroupUpgrade` CR.
+To use the backup feature, you first create a `ClusterGroupUpgrade` CR with the `backup` field set to `true`. To ensure that the contents of the backup are up to date, the backup is not taken until you set the `enable` field in the `ClusterGroupUpgrade` CR to `true`.
 
-The backup process can be in the following statuses:
+{cgu-operator} uses the `BackupSucceeded` condition to report the status and reasons as follows:
+
+* `true`
++
+Backup is completed for all clusters, or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update does not proceed for that cluster.
+* `false`
++
+Backup is still in progress for one or more clusters or has failed for all clusters. The backup process running in the spoke clusters can have the following statuses:
++
+** `PreparingToStart`
++
+The first reconciliation pass is in progress. The {cgu-operator} deletes any spoke backup namespace and hub view resources that have been created in a failed upgrade attempt.
+** `Starting`
++
+The backup prerequisites and backup job are being created.
+** `Active`
++
+The backup is in progress.
+** `Succeeded`
++
+The backup succeeded.
+** `BackupTimeout`
++
+Artifact backup is partially done.
+** `UnrecoverableError`
++
+The backup has ended with a non-zero exit code.
 
-`BackupStatePreparingToStart`:: The first reconciliation pass is in progress. The {cgu-operator} deletes any spoke backup namespace and hub view resources that have been created in a failed upgrade attempt.
-`BackupStateStarting`:: The backup prerequisites and backup job are being created.
-`BackupStateActive`:: The backup is in progress.
-`BackupStateSucceeded`:: The backup has succeeded.
-`BackupStateTimeout`:: Artifact backup has been partially done.
-`BackupStateError`:: The backup has ended with a non-zero exit code.
 [NOTE]
 ====
-If the backup fails and enters the `BackupStateTimeout` or `BackupStateError` state, the cluster upgrade does not proceed.
+If the backup of a cluster fails and enters the `BackupTimeout` or `UnrecoverableError` state, the cluster update does not proceed for that cluster. Updates to other clusters are not affected and continue.
 ====
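As a sketch of how these statuses surface in the CR, the following YAML excerpt mirrors the `.status` layout shown in the backup feature module of this commit; the cluster names are illustrative and the exact field layout is an assumption based on that example:

```yaml
# Hypothetical excerpt of a ClusterGroupUpgrade .status for two clusters.
status:
  backup:
    status:
      spoke1: Succeeded        # backup finished; the update can proceed
      spoke2: BackupTimeout    # the update does not proceed for this cluster
  conditions:
  - type: BackupSucceeded      # "true" here means the backup run has completed
    status: "True"
    reason: PartiallyDone      # completed, but failed for one or more clusters
    message: Backup failed for 1 cluster
```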

modules/cnf-topology-aware-lifecycle-manager-backup-feature.adoc

Lines changed: 11 additions & 7 deletions
@@ -50,7 +50,7 @@ nodes:
 
 .Procedure
 
-. Save the contents of the `ClusterGroupUpgrade` CR with the `backup` field set to `true` in the `clustergroupupgrades-group-du.yaml` file:
+. Save the contents of the `ClusterGroupUpgrade` CR with the `backup` and `enable` fields set to `true` in the `clustergroupupgrades-group-du.yaml` file:
 +
 [source,yaml]
 ----
@@ -65,7 +65,7 @@ spec:
   clusters:
   - cnfdb1
   - cnfdb2
-  enable: false
+  enable: true
   managedPolicies:
   - du-upgrade-platform-upgrade
   remediationStrategy:
@@ -101,21 +101,25 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
   ],
   "status": {
     "cnfdb1": "Succeeded",
-    "cnfdb2": "Succeeded"
+    "cnfdb2": "Failed" <1>
   }
 },
 "computedMaxConcurrency": 1,
 "conditions": [
   {
     "lastTransitionTime": "2022-04-05T10:37:19Z",
-    "message": "Backup is completed",
-    "reason": "BackupCompleted",
-    "status": "True",
-    "type": "BackupDone"
+    "message": "Backup failed for 1 cluster", <2>
+    "reason": "PartiallyDone", <3>
+    "status": "True", <4>
+    "type": "Succeeded"
   }
 ],
 "precaching": {
   "spec": {}
 },
 "status": {}
 ----
+<1> Backup has failed for one cluster.
+<2> The message confirms that the backup failed for one cluster.
+<3> The backup was partially successful.
+<4> The backup process has finished.

modules/cnf-topology-aware-lifecycle-manager-backup-recovery.adoc

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ $ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno
 +
 [source,terminal]
 ----
-$ oc ostree admin status
+$ ostree admin status
 ----
 .Example outputs
 +

modules/cnf-topology-aware-lifecycle-manager-policies-concept.adoc

Lines changed: 2 additions & 0 deletions
@@ -16,4 +16,6 @@ If a spoke cluster does not report any compliant state to {rh-rhacm}, the manage
 * If a policy's `status.status` is missing, {cgu-operator} produces an error.
 * If a cluster's compliance status is missing in the policy's `status.status` field, {cgu-operator} considers that cluster to be non-compliant with that policy.
 
+The `ClusterGroupUpgrade` CR's `batchTimeoutAction` field determines what happens if an upgrade fails for a cluster. You can specify `continue` to skip the failing cluster and continue to upgrade other clusters, or specify `abort` to stop policy remediation for all clusters. When the timeout elapses, {cgu-operator} removes all enforce policies to ensure that no further updates are made to clusters.
+
 For more information about {rh-rhacm} policies, see link:https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/{rh-rhacm-version}/html-single/governance/index#policy-overview[Policy overview].
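A minimal sketch of where `batchTimeoutAction` sits in the `ClusterGroupUpgrade` spec, based on the example in the apply-policies module of this commit; the surrounding values are illustrative:

```yaml
# Sketch: abort stops policy remediation for all clusters when a batch
# times out; continue (the default) skips the failing cluster instead.
spec:
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240                 # minutes
    batchTimeoutAction: abort    # or continue
```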

modules/cnf-topology-aware-lifecycle-manager-precache-concept.adoc

Lines changed: 42 additions & 10 deletions
@@ -6,23 +6,55 @@
 [id="talo-precache-feature-concept_{context}"]
 = Using the container image pre-cache feature
 
 Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed.
 
 [NOTE]
 ====
 The time of the update is not set by {cgu-operator}. You can apply the `ClusterGroupUpgrade` CR at the beginning of the update by manual application or by external automation.
 ====
 
-The container image pre-caching starts when the `preCaching` field is set to `true` in the `ClusterGroupUpgrade` CR. After a successful pre-caching process, you can start remediating policies. The remediation actions start when the `enable` field is set to `true`.
+The container image pre-caching starts when the `preCaching` field is set to `true` in the `ClusterGroupUpgrade` CR.
+
+{cgu-operator} uses the `PrecacheSpecValid` condition to report status information as follows:
+
+* `true`
++
+The pre-caching spec is valid and consistent.
+* `false`
++
+The pre-caching spec is incomplete.
+
+{cgu-operator} uses the `PrecachingSucceeded` condition to report status information as follows:
+
+* `true`
++
+{cgu-operator} has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.
+* `false`
++
+Pre-caching is still in progress for one or more clusters or has failed for all clusters.
+
+After a successful pre-caching process, you can start remediating policies. The remediation actions start when the `enable` field is set to `true`. If there is a pre-caching failure on a cluster, the upgrade fails for that cluster. The upgrade process continues for all other clusters that have a successful pre-cache.
 
 The pre-caching process can be in the following statuses:
 
-`PrecacheNotStarted`:: This is the initial state all clusters are automatically assigned to on the first reconciliation pass of the `ClusterGroupUpgrade` CR.
-+
-In this state, {cgu-operator} deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. {cgu-operator} then creates a new `ManagedClusterView` resource for the spoke pre-caching namespace to verify its deletion in the `PrecachePreparing` state.
-`PrecachePreparing`:: Cleaning up any remaining resources from previous incomplete updates is in progress.
-`PrecacheStarting`:: Pre-caching job prerequisites and the job are created.
-`PrecacheActive`:: The job is in "Active" state.
-`PrecacheSucceeded`:: The pre-cache job has succeeded.
-`PrecacheTimeout`:: The artifact pre-caching has been partially done.
-`PrecacheUnrecoverableError`:: The job ends with a non-zero exit code.
+* `NotStarted`
++
+This is the initial state all clusters are automatically assigned to on the first reconciliation pass of the `ClusterGroupUpgrade` CR. In this state, {cgu-operator} deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. {cgu-operator} then creates a new `ManagedClusterView` resource for the spoke pre-caching namespace to verify its deletion in the `PrecachePreparing` state.
+* `PreparingToStart`
++
+Cleaning up any remaining resources from previous incomplete updates is in progress.
+* `Starting`
++
+Pre-caching job prerequisites and the job are created.
+* `Active`
++
+The job is in "Active" state.
+* `Succeeded`
++
+The pre-cache job succeeded.
+* `PrecacheTimeout`
++
+The artifact pre-caching is partially done.
+* `UnrecoverableError`
++
+The job ends with a non-zero exit code.
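Per the modules above, pre-caching and remediation are driven by two spec fields. A minimal sketch, with illustrative cluster and policy names:

```yaml
# Sketch: preCaching starts the container image pre-cache; enable is set to
# true afterwards to start policy remediation.
spec:
  preCaching: true
  enable: false      # set to true after pre-caching succeeds
  clusters:
  - spoke1
  managedPolicies:
  - example-policy
```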

modules/cnf-topology-aware-lifecycle-manager-precache-feature.adoc

Lines changed: 11 additions & 17 deletions
@@ -39,7 +39,7 @@ spec:
 ----
 <1> The `preCaching` field is set to `true`, which enables {cgu-operator} to pull the container images before starting the update.
 
-. When you want to start the update, apply the `ClusterGroupUpgrade` CR by running the following command:
+. When you want to start pre-caching, apply the `ClusterGroupUpgrade` CR by running the following command:
 +
 [source,terminal]
 ----
@@ -59,8 +59,8 @@ $ oc get cgu -A
 +
 [source,terminal]
 ----
-NAMESPACE          NAME              AGE
-ztp-group-du-sno   du-upgrade-4918   10s <1>
+NAMESPACE          NAME              AGE   STATE        DETAILS
+ztp-group-du-sno   du-upgrade-4918   10s   InProgress   Precaching is required and not done <1>
 ----
 <1> The CR is created.
 
@@ -77,19 +77,12 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
 ----
 {
   "conditions": [
-    {
-      "lastTransitionTime": "2022-01-27T19:07:24Z",
-      "message": "Precaching is not completed (required)", <1>
-      "reason": "PrecachingRequired",
-      "status": "False",
-      "type": "Ready"
-    },
     {
       "lastTransitionTime": "2022-01-27T19:07:24Z",
       "message": "Precaching is required and not done",
-      "reason": "PrecachingNotDone",
+      "reason": "InProgress",
       "status": "False",
-      "type": "PrecachingDone"
+      "type": "PrecachingSucceeded"
     },
     {
       "lastTransitionTime": "2022-01-27T19:07:34Z",
@@ -101,17 +94,18 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
   ],
   "precaching": {
     "clusters": [
-      "cnfdb1" <2>
+      "cnfdb1", <1>
+      "cnfdb2"
     ],
     "spec": {
       "platformImage": "image.example.io"},
     "status": {
-      "cnfdb1": "Active"}
+      "cnfdb1": "Active",
+      "cnfdb2": "Succeeded"}
   }
 }
 ----
-<1> Displays that the update is in progress.
-<2> Displays the list of identified clusters.
+<1> Displays the list of identified clusters.
 
 . Check the status of the pre-caching job by running the following command on the spoke cluster:
 +
@@ -155,7 +149,7 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
 "message": "Precaching is completed",
 "reason": "PrecachingCompleted",
 "status": "True",
-"type": "PrecachingDone" <1>
+"type": "PrecachingSucceeded" <1>
 }
 ----
 <1> The pre-cache tasks are done.

modules/cnf-topology-aware-lifecycle-manager-troubleshooting.adoc

Lines changed: 18 additions & 10 deletions
@@ -220,9 +220,9 @@ spoke3 true https://api.spoke3.testlab.com:6443 True
 <1> The value of the `AVAILABLE` field is `True` for the managed clusters.
 
 [discrete]
-=== Checking clusterSelector
+=== Checking clusterLabelSelector
 
-Issue:: You want to check if the `clusterSelector` field is specified in the `ClusterGroupUpgrade` CR in at least one of the managed clusters.
+Issue:: You want to check if the `clusterLabelSelector` field specified in the `ClusterGroupUpgrade` CR matches at least one of the managed clusters.
 
 Resolution:: Run the following command:
 +
@@ -250,16 +250,14 @@ Issue:: You want to check if the canary clusters are present in the list of clus
 [source,yaml]
 ----
 spec:
-  clusters:
-  - spoke1
-  - spoke3
-  clusterSelector:
-  - upgrade2=true
   remediationStrategy:
     canaries:
     - spoke3
     maxConcurrency: 2
     timeout: 240
+  clusterLabelSelectors:
+  - matchLabels:
+      upgrade: true
 ----
 
 Resolution:: Run the following commands:
@@ -276,7 +274,7 @@ $ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'
 ["spoke1", "spoke3"]
 ----
 
-. Check if the canary clusters are present in the list of clusters that match `clusterSelector` labels by running the following command:
+. Check if the canary clusters are present in the list of clusters that match `clusterLabelSelector` labels by running the following command:
 +
 [source,terminal]
 ----
@@ -294,7 +292,7 @@ spoke3 true https://api.spoke3.testlab.com:6443 True Tr
 
 [NOTE]
 ====
-A cluster can be present in `spec.clusters` and also be matched by the `spec.clusterSelecter` label.
+A cluster can be present in `spec.clusters` and also be matched by the `spec.clusterLabelSelector` label.
 ====
 
 [discrete]
@@ -367,7 +365,7 @@ $ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
 +
 [source,json]
 ----
-{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"The ClusterGroupUpgrade CR has managed policies that are missing:[policyThatDoesntExist]", "reason":"UpgradeCannotStart", "status":"False", "type":"Ready"}
+{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}
 ----
 
 [discrete]
@@ -435,3 +433,13 @@ ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
 ----
 <1> Displays the error.
+
+[discrete]
+=== Clusters are not compliant with some policies after a `ClusterGroupUpgrade` CR has completed
+
+Issue:: The policy compliance status that {cgu-operator} uses to decide if remediation is needed has not yet fully updated for all clusters. This may be because:
++
+* The CGU was run too soon after a policy was created or updated.
+* The remediation of a policy affects the compliance of subsequent policies in the `ClusterGroupUpgrade` CR.
+
+Resolution:: Create and apply a new `ClusterGroupUpgrade` CR with the same specification.
