Commit f0f1c1c

Update howto-cluster-runtime-upgrade-template.md
Complete upgrade documentations.
1 parent 89fe2d8 commit f0f1c1c

1 file changed

+250
-15
lines changed

articles/operator-nexus/howto-cluster-runtime-upgrade-template.md

Lines changed: 250 additions & 15 deletions
@@ -30,26 +30,25 @@ Runtime changes are categorized as follows:
3030
4. Target Cluster must be healthy in a running state.
3131

3232
## Required Parameters:
33-
- <ENVIRONMENT> - Instance Name
33+
- /<ENVIRONMENT/> - Instance Name
3434
- <AZURE_REGION> - Azure Region of Instance
3535
- <CUSTOMER_SUB_NAME>: Subscription Name
36-
- <CUSTOMER_SUB_TENANT_ID> // From 'az account show'
3736
- <CUSTOMER_SUB_ID>: Subscription ID
3837
- <CLUSTER_NAME>: Cluster Name
3938
- <CLUSTER_RG>: Cluster Resource Group
4039
- <CLUSTER_RID>: Cluster ARM ID
41-
- <CLUSTER_KEYVAULT_ID>: Cluster Keyvault ARM ID
4240
- <CLUSTER_MRG>: Cluster Managed Resource Group
4341
- <CLUSTER_CONTROL_BMM>: Cluster Control plane baremetalmachine
44-
- <CLUSTER_RUNTIME_VERSION>: Runtime version for upgrade
42+
- <CLUSTER_VERSION>: Runtime version for upgrade
4543
- <START_TIME>: Planned start time of upgrade
46-
- <DURATION>: Estimated Duration of upgrade
44+
- /<DURATION/>: Estimated Duration of upgrade
45+
- <DEPLOYMENT_THRESHOLD>: Compute deployment threshold
46+
- <DEPLOYMENT_PAUSE_MINS>: Time to wait before moving to the next Rack once the current Rack meets the Compute deployment threshold
4747
- <NFC_NAME>: Associated NFC
4848
- <CM_NAME>: Associated CM
4949
- <ETCD_LAST_ROTATION_DATE>: Control plane etcd credential last rotation date
5050
- <ETCD_ROTATION_DAYS>: Control plane etcd credential next rotation period
51-
- <FABRIC_NAME>: Associated Fabric
52-
- <NEXUS_VERSION>: Target upgrade version
51+
- <BMM_ISSUE_LIST>: List of BMM with provisioning issues after Cluster upgrade is complete
5352

5453
## Pre-Checks
5554

@@ -59,13 +58,13 @@ Runtime changes are categorized as follows:
5958
- Validate the `lastRotationTime` and `rotationPeriodDays` under the `etcd credential` section:
6059
```
6160
{
62-
"lastRotationTime": "<ETCD_LAST_ROTATION_DATE>",
63-
"rotationPeriodDays": <ETCD_ROTATION_DAYS>,
64-
"secretType": "etcd credential"
65-
}
66-
```
67-
68-
>[!Important]
61+
"lastRotationTime": "<ETCD_LAST_ROTATION_DATE>",
62+
"rotationPeriodDays": <ETCD_ROTATION_DAYS>,
63+
"secretType": "etcd credential"
64+
}
65+
```
66+
67+
>[!Important]
6968
> If the upgrade will occur within three days of the next `etcd credential` rotation (<ETCD_LAST_ROTATION_DATE> + <ETCD_ROTATION_DAYS>), contact Microsoft Support to complete a manual rotation before starting the upgrade.
7069
7170
2. Validate the provisioning and detailed status for the Cluster Manager (CM) and Cluster.
@@ -77,6 +76,10 @@ Runtime changes are categorized as follows:
7776
export CM_NAME=<CM_NAME>
7877
export CLUSTER_RG=<CLUSTER_RG>
7978
export CLUSTER_NAME=<CLUSTER_NAME>
79+
export CLUSTER_RID=<CLUSTER_RID>
80+
export CLUSTER_MRG=<CLUSTER_MRG>
81+
export THRESHOLD=<DEPLOYMENT_THRESHOLD>
82+
export PAUSE_MINS=<DEPLOYMENT_PAUSE_MINS>
8083
```
8184
8285
Check that the CM is in `Succeeded` for `Provisioning state`:
@@ -92,5 +95,237 @@ Runtime changes are categorized as follows:
9295
>[!Note]
9396
> If CM `Provisioning state` is not `Succeeded` or Cluster `Detailed status` is not `Running`, stop the upgrade until the issues are resolved.
9497
95-
3. Review Operator Nexus Release notes for required checks and configuration updates not included in this document.
98+
3. Check that the Bare Metal Machine `Detailed status` is `Provisioned`:
99+
```
100+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
101+
```
102+
103+
Check the following for each BMM:
104+
- ReadyState: True
105+
- ProvisioningState: Succeeded
106+
- DetailedStatus: Provisioned
107+
- CordonStatus: Uncordoned
108+
- PowerState: On
109+
110+
4. Collect a profile of the tenant workloads pre-upgrade:
111+
```
112+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
113+
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
114+
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
115+
```
116+
5. Review Operator Nexus Release notes for required checks and configuration updates not included in this document.
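The `etcd credential` check in step 1 is date arithmetic that can be scripted. The following is a minimal sketch, not part of the upgrade tooling: it assumes GNU `date`, ISO 8601 timestamps, and reads "within three days" as three days before or after the next rotation. The function names `rotation_gap_seconds` and `needs_manual_rotation` are hypothetical.

```shell
# Sketch only: assumes GNU date and ISO 8601 timestamps.
# $1 = lastRotationTime, $2 = rotationPeriodDays, $3 = planned upgrade start.
# Prints the signed number of seconds from the planned start to the next rotation.
rotation_gap_seconds() {
  last_epoch=$(date -u -d "$1" +%s) || return 1
  start_epoch=$(date -u -d "$3" +%s) || return 1
  echo $(( last_epoch + $2 * 86400 - start_epoch ))
}

# Succeeds (exit 0) when the planned start is within three days of the next rotation.
needs_manual_rotation() {
  gap=$(rotation_gap_seconds "$1" "$2" "$3") || return 2
  if [ "$gap" -lt 0 ]; then gap=$(( -gap )); fi
  [ "$gap" -le $(( 3 * 86400 )) ]
}
```

For example, `needs_manual_rotation "<ETCD_LAST_ROTATION_DATE>" <ETCD_ROTATION_DAYS> "<START_TIME>"` returns success when a manual rotation should be requested before the upgrade.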
117+
118+
## Send notification to Operations of upgrade schedule for the Cluster.
119+
120+
The following template can be used through email or support ticket:
121+
```
122+
Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
123+
124+
Operations Support:
125+
126+
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
127+
128+
Subscription: <CUSTOMER_SUB_ID>
129+
NFC: <NFC_NAME>
130+
CM: <CM_NAME>
131+
Fabric: <NF_NAME>
132+
Cluster: <CLUSTER_NAME>
133+
Region: <AZURE_REGION>
134+
Version: <NEXUS_VERSION>
135+
136+
CC: stakeholder-list
137+
```
138+
139+
## Add resource tag on Cluster resource in Azure portal
140+
To help track upgrades, add a tag to the Cluster resource in Azure portal (optional):
141+
```
142+
|Name | Value |
143+
|----------------|-----------------|
144+
|BF in progress |<DE_ID> |
145+
```
146+
147+
## Set deployment strategy and Compute threshold on Cluster if different from default
148+
The default threshold for the percent of Compute BMM to pass hardware validation and provisioning is 80% with a default pause between Racks of one minute.
149+
150+
`update-strategy` can be the following:
151+
* `Rack` - Upgrade each Rack one at a time and move to the next Rack once the Compute threshold is met for the current Rack. Pause for <DEPLOYMENT_PAUSE_MINS> before starting the next Rack.
152+
* `PauseAfterRack` - Wait for user API response to continue to the next Rack once the Compute threshold is met for the current Rack.
153+
154+
If `updateStrategy` is not set, the defaults are as follows:
155+
```
156+
"updateStrategy": {
157+
"maxUnavailable": 32767,
158+
"strategyType": "Rack",
159+
"thresholdType": "PercentSuccess",
160+
"thresholdValue": 80,
161+
"waitTimeMinutes": 1
162+
}
163+
```
164+
165+
### Set a deployment threshold and wait time different than default:
166+
```
167+
az networkcloud cluster update -n $CLUSTER_NAME -g $CLUSTER_RG --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value=$THRESHOLD wait-time-minutes=$PAUSE_MINS --subscription $SUBSCRIPTION_ID
168+
```
169+
>[!Important]
> If a 100% threshold is required, review the BMM status reported during pre-checks and make sure all BMM are healthy before proceeding with the upgrade.
170+
171+
Verify update:
172+
```
173+
az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID| grep -A5 updateStrategy
174+
"updateStrategy": {
175+
"maxUnavailable": 32767,
176+
"strategyType": "Rack",
177+
"thresholdType": "PercentSuccess",
178+
"thresholdValue": 80,
179+
"waitTimeMinutes": 1
180+
}
181+
```
182+
183+
### Running Cluster upgrade with `PauseAfterRack` Strategy
184+
185+
The `PauseAfterRack` strategy lets the customer control the upgrade by requiring an API call to continue to the next Rack after each Compute Rack completes to the configured threshold.
186+
187+
To configure strategy to use `PauseAfterRack`:
188+
```
189+
az networkcloud cluster update -n $CLUSTER_NAME -g $CLUSTER_RG --update-strategy strategy-type="PauseAfterRack" wait-time-minutes=0 threshold-type="PercentSuccess" threshold-value=$THRESHOLD --subscription $SUBSCRIPTION_ID
190+
```
191+
192+
Verify update:
193+
```
194+
az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID| grep -A5 updateStrategy
195+
"updateStrategy": {
196+
"maxUnavailable": 32767,
197+
"strategyType": "PauseAfterRack",
198+
"thresholdType": "PercentSuccess",
199+
"thresholdValue": <DEPLOYMENT_THRESHOLD>,
200+
"waitTimeMinutes": 0
201+
```
202+
203+
## Run upgrade from either portal or cli:
204+
* To start upgrade from Azure portal, go to Cluster resource, click `Update`, select <CLUSTER_VERSION>, then click `Update`
205+
* To run upgrade from Azure CLI, run the following:
206+
```
207+
az networkcloud cluster update-version --subscription $SUBSCRIPTION_ID --cluster-name $CLUSTER_NAME --target-cluster-version $CLUSTER_VERSION --resource-group $CLUSTER_RG --no-wait --debug
208+
```
209+
210+
Gather ASYNC URL and Correlation ID info for further troubleshooting if needed.
211+
```
212+
cli.azure.cli.core.sdk.policies: 'mise-correlation-id': '<MISE_CID>'
213+
cli.azure.cli.core.sdk.policies: 'x-ms-correlation-request-id': '<CORRELATION_ID>'
214+
cli.azure.cli.core.sdk.policies: 'Azure-AsyncOperation': '<ASYNC_URL>'
215+
```
216+
Provide this information to Microsoft Support when opening a support ticket for upgrade issues.
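The three values can be pulled out of saved `--debug` output rather than copied by hand. This is a rough sketch that assumes the debug lines have the shape shown above (a quoted header name followed by a quoted value). The function name `extract_upgrade_ids` and the log file name are hypothetical.

```shell
# Reads az --debug output on stdin and prints the three values as KEY=value lines.
# Header names are the ones shown in the sample debug lines above.
extract_upgrade_ids() {
  sed -n \
    -e "s/.*'mise-correlation-id': '\([^']*\)'.*/MISE_CID=\1/p" \
    -e "s/.*'x-ms-correlation-request-id': '\([^']*\)'.*/CORRELATION_ID=\1/p" \
    -e "s/.*'Azure-AsyncOperation': '\([^']*\)'.*/ASYNC_URL=\1/p"
}

# Example (assumes the debug stream was saved to a hypothetical upgrade-debug.log):
#   extract_upgrade_ids < upgrade-debug.log
```

Keeping the full debug log alongside the extracted values is still advisable, since Microsoft Support may ask for other fields.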
217+
218+
## Monitor status of Cluster:
219+
```
220+
az networkcloud cluster list -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID -o table
221+
```
222+
When the upgrade is complete, the Cluster `Detailed status` will move to `Running` state and the `Detailed status message` will show `Cluster is up and running.`
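Instead of re-running the status command by hand, the wait for `Running` can be wrapped in a polling loop. The sketch below is illustrative only: `wait_for_status` and `cluster_detailed_status` are hypothetical helper names, the attempt count and interval are arbitrary, and the wrapper assumes the Cluster resource exposes a `detailedStatus` field that the global `--query` and `-o tsv` options can print on its own.

```shell
# Polls a status command until it prints the wanted value or attempts run out.
#   $1 = wanted status, $2 = max attempts, $3 = seconds between attempts,
#   remaining arguments = command that prints the current status on stdout.
wait_for_status() {
  wanted=$1; attempts=$2; interval=$3; shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    current=$("$@" 2>/dev/null)
    if [ "$current" = "$wanted" ]; then
      echo "status=$current after $i check(s)"
      return 0
    fi
    sleep "$interval"
    i=$(( i + 1 ))
  done
  echo "status=$current after $attempts check(s)" >&2
  return 1
}

# Hypothetical wrapper: prints only the Cluster detailedStatus field.
cluster_detailed_status() {
  az networkcloud cluster show -n "$CLUSTER_NAME" -g "$CLUSTER_RG" \
    --subscription "$SUBSCRIPTION_ID" --query detailedStatus -o tsv
}

# Example: check once a minute for up to four hours.
#   wait_for_status Running 240 60 cluster_detailed_status
```

Passing the status command as an argument keeps the loop reusable for other resources, such as the Cluster Manager.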
223+
224+
## Monitor status of Bare Metal Machines:
225+
```
226+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID -o table
227+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
228+
```
229+
230+
Validate the following for each BMM:
231+
- ReadyState: True
232+
- ProvisioningState: Succeeded
233+
- DetailedStatus: Provisioned
234+
- CordonStatus: Uncordoned
235+
- PowerState: On
236+
- KubernetesVersion: <NEW_VERSION>
237+
- MachineClusterVersion: <NEXUS_VERSION>
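The status checks in the list above can be applied to every BMM at once. The following sketch covers only the five status fields (not the two version fields) and assumes tab-separated input with the columns name, readyState, provisioningState, detailedStatus, cordonStatus, powerState, in that order. The function name `unhealthy_bmms` is hypothetical, and the exact casing of the boolean in `-o tsv` output is an assumption, so it is compared case-insensitively.

```shell
# Reads tab-separated rows on stdin:
#   name, readyState, provisioningState, detailedStatus, cordonStatus, powerState
# Prints the name of every BMM that fails at least one of the listed checks.
unhealthy_bmms() {
  awk -F '\t' '
    tolower($2) != "true" || $3 != "Succeeded" || $4 != "Provisioned" ||
    $5 != "Uncordoned" || $6 != "On" { print $1 }
  '
}

# Example (field names are the ones used in the query above):
#   az networkcloud baremetalmachine list -g "$CLUSTER_MRG" --subscription "$SUBSCRIPTION_ID" \
#     --query "[].[name,readyState,provisioningState,detailedStatus,cordonStatus,powerState]" \
#     -o tsv | unhealthy_bmms
```

An empty result means every BMM passed the five checks. Any names printed are candidates for <BMM_ISSUE_LIST>.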
238+
239+
If the Cluster upgrade is complete but a BMM did not complete provisioning, add a tag to the BMM resource (optional):
240+
```
241+
|Name | Value |
242+
|--------------------|-----------------|
243+
|BF provision issue |<DE_ID> |
244+
```
245+
246+
## Continuing upgrade during `PauseAfterRack` strategy:
247+
Once a compute rack has met the success threshold, the upgrade will move into a pause until the user signals to the operator to continue the upgrade.
248+
249+
Use the following to continue upgrade once a Compute Rack has met the Compute deployment threshold for the rack:
250+
```
251+
az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME --subscription $SUBSCRIPTION_ID
252+
```
253+
## Troubleshooting Cluster and BMM upgrade failures.
254+
The following troubleshooting documents can help recover BMM upgrade issues:
255+
- [Hardware validation failures](troubleshoot-hardware-validation-failure.md)
256+
- [BMM Provisioning issues](troubleshoot-bare-metal-machine-provisioning.md)
257+
- [BMM Degraded Status](troubleshoot-bare-metal-machine-degraded.md)
258+
- [BMM Warning Status](troubleshoot-bare-metal-machine-warning.md)
259+
260+
If troubleshooting does not resolve the issue, open a Microsoft support ticket:
261+
1. Collect any errors in the Azure CLI output.
262+
2. Collect Cluster and BMM operation state from Azure portal or Azure CLI.
263+
3. Create Azure Support Request for any Cluster or BMM upgrade failures and attach any errors along with ASYNC URL, correlation ID, and operation state of the Cluster and BMMs.
264+
265+
## Post-upgrade Validation
266+
Run the following commands to check the status of the CM, Cluster, and BMM:
267+
268+
1. Check that the CM is in `Succeeded` for `Provisioning state`:
269+
```
270+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
271+
```
272+
273+
2. Check the Cluster status `Detailed status` is `Running`:
274+
```
275+
az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table
276+
```
277+
278+
3. Check the Bare Metal Machine status:
279+
```
280+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
281+
```
282+
283+
Check the following for each BMM:
284+
- ReadyState: True
285+
- ProvisioningState: Succeeded
286+
- DetailedStatus: Provisioned
287+
- CordonStatus: Uncordoned
288+
- PowerState: On
289+
290+
4. Collect a profile of the tenant workloads:
291+
```
292+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
293+
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
294+
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
295+
```
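If the pre-upgrade and post-upgrade workload profiles are saved to files, the two can be compared mechanically. The sketch below assumes both profiles were captured with the same query and output format. The function name `changed_rows` and the file names are hypothetical.

```shell
# Prints every row of the pre-upgrade profile ($1) that has no identical row
# in the post-upgrade profile ($2), i.e. workloads that changed or disappeared.
changed_rows() {
  grep -F -x -v -f "$2" "$1"
}

# Example with hypothetical file names, each produced by redirecting one of the
# list commands above to a file before and after the upgrade:
#   changed_rows pre-upgrade-vms.tsv post-upgrade-vms.tsv
```

No output means every pre-upgrade row is still present unchanged. Any rows printed are workloads to review against the post-upgrade profile.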
296+
297+
## Send notification to Operations of Cluster upgrade completion
298+
299+
The following template can be used through email or ticketing system:
300+
```
301+
Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> Runtime <CLUSTER_VERSION> Upgrade Complete
302+
303+
Operations:
304+
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime <CLUSTER_VERSION> Upgrade Complete
305+
306+
Subscription: <CUSTOMER_SUB_ID>
307+
NFC: <NFC_NAME>
308+
CM: <CM_NAME>
309+
Fabric: <NF_NAME>
310+
Cluster: <CLUSTER_NAME>
311+
Region: <AZURE_REGION>
312+
Version: <NEXUS_VERSION>
313+
314+
The following is a list of BMM with provisioning issues during upgrade:
315+
<BMM_ISSUE_LIST>
316+
317+
CC: stakeholder_list
318+
```
319+
320+
## Remove resource tag on Cluster resource in Azure portal
321+
Remove the resource tag on the Cluster resource tracking the upgrade in Azure portal (if added previously):
322+
```
323+
|Name | Value |
324+
|----------------|-----------------|
325+
|BF in progress |<DE_ID> |
326+
```
96327
328+
## Close out any Work Items in your ticketing system
329+
* Update Task hours for upgrade duration.
330+
* Set Cluster upgrade work item to `Complete`.
331+
* Add any notes on support tickets and issues encountered during the upgrade.
