- <DEPLOYMENT_PAUSE_MINS>: Time to wait before moving to the next rack once the threshold percentage of Compute servers in the current rack completes the upgrade
47
47
- <NFC_NAME>: Associated NFC
48
48
- <CM_NAME>: Associated CM
49
49
- <ETCD_LAST_ROTATION_DATE>: Control plane etcd credential last rotation date
50
50
- <ETCD_ROTATION_DAYS>: Control plane etcd credential next rotation period
51
-
- <FABRIC_NAME>: Associated Fabric
52
-
- <NEXUS_VERSION>: Target upgrade version
51
+
- <BMM_ISSUE_LIST>: List of BMM with provisioning issues after Cluster upgrade is complete
53
52
54
53
## Pre-Checks
55
54
@@ -59,13 +58,13 @@ Runtime changes are categorized as follows:
59
58
- Validate the `lastRotationTime` and `rotationPeriodDays` under the `etcd credential` section:
60
59
```
61
60
{
62
-
"lastRotationTime": "<ETCD_LAST_ROTATION_DATE>",
63
-
"rotationPeriodDays": <ETCD_ROTATION_DAYS>,
64
-
"secretType": "etcd credential"
65
-
}
66
-
```
67
-
68
-
>[!Important]
61
+
"lastRotationTime": "<ETCD_LAST_ROTATION_DATE>",
62
+
"rotationPeriodDays": <ETCD_ROTATION_DAYS>,
63
+
"secretType": "etcd credential"
64
+
}
65
+
```
66
+
67
+
>[!Important]
69
68
> If the upgrade will occur within three days of the next `etcd credential` rotation (<ETCD_LAST_ROTATION_DATE> + <ETCD_ROTATION_DAYS>), contact Microsoft Support to complete a manual rotation before starting the upgrade.
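
The three-day window in the note above can be checked with a little date arithmetic. The following is a minimal sketch, not part of the official procedure: it assumes GNU `date`, ISO 8601 input (only the date part is used), and reads "within three days" as three days before or after the next rotation. The function names are hypothetical.

```shell
# Sketch only: requires GNU date (-d); all dates are handled in UTC.

# Print the next rotation date (YYYY-MM-DD) from lastRotationTime and rotationPeriodDays.
next_rotation_date() {
  local last="${1%%T*}"   # keep only the date part of an ISO 8601 timestamp
  local days="$2"
  local base
  base=$(date -u -d "$last" +%s) || return 2
  date -u -d "@$((base + days * 86400))" +%F
}

# Exit 0 when the planned upgrade date is within three days of the next rotation.
# Arguments: <lastRotationTime> <rotationPeriodDays> <planned upgrade date>
rotation_conflicts_with_upgrade() {
  local next due upgrade diff
  next=$(next_rotation_date "$1" "$2") || return 2
  due=$(date -u -d "$next" +%s) || return 2
  upgrade=$(date -u -d "${3%%T*}" +%s) || return 2
  diff=$(( (due - upgrade) / 86400 ))
  [ "${diff#-}" -le 3 ]   # compare the absolute difference in days
}
```

For example, `rotation_conflicts_with_upgrade "<ETCD_LAST_ROTATION_DATE>" <ETCD_ROTATION_DAYS> 2024-01-29` succeeds when a manual rotation should be requested before the upgrade starts.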
70
69
71
70
2. Validate the provisioning and detailed status for the Cluster Manager (CM) and Cluster.
@@ -77,6 +76,10 @@ Runtime changes are categorized as follows:
77
76
export CM_NAME=<CM_NAME>
78
77
export CLUSTER_RG=<CLUSTER_RG>
79
78
export CLUSTER_NAME=<CLUSTER_NAME>
79
+
export CLUSTER_RID=<CLUSTER_RID>
80
+
export CLUSTER_MRG=<CLUSTER_MRG>
81
+
export THRESHOLD=<DEPLOYMENT_THRESHOLD>
82
+
export PAUSE_MINS=<DEPLOYMENT_PAUSE_MINS>
80
83
```
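
Because every later command depends on these exports, it can help to confirm that none of them is empty or still set to a `<PLACEHOLDER>`. The helper below is an illustrative sketch; `require_vars` is a hypothetical name, not an Azure CLI feature.

```shell
# Return non-zero if any named variable is unset, empty, or still holds a <PLACEHOLDER>.
require_vars() {
  local name value rc=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name-}"
    case "$value" in
      ""|\<*\>)
        echo "missing or placeholder value: $name" >&2
        rc=1
        ;;
    esac
  done
  return "$rc"
}
```

Usage: `require_vars CM_NAME CLUSTER_RG CLUSTER_NAME CLUSTER_RID CLUSTER_MRG THRESHOLD PAUSE_MINS || echo "fix the exports before continuing"`.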
81
84
82
85
Check that the CM is in `Succeeded` for `Provisioning state`:
@@ -92,5 +95,237 @@ Runtime changes are categorized as follows:
92
95
>[!Note]
93
96
> If CM `Provisioning state` is not `Succeeded` or Cluster `Detailed status` is not `Running`, stop the upgrade until the issues are resolved.
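
The stop condition in the note can be expressed as a small gate for use in a script. This is a sketch under assumptions: the two values are passed in as plain strings, and the property names `provisioningState` and `detailedStatus` in the usage comment are assumed to match the CLI's JSON output.

```shell
# Exit 0 only when the CM is Succeeded and the Cluster is Running.
# Arguments: <CM provisioning state> <Cluster detailed status>
upgrade_precheck_ok() {
  local cm_state="$1" cluster_status="$2" rc=0
  if [ "$cm_state" != "Succeeded" ]; then
    echo "CM provisioning state is '$cm_state', expected 'Succeeded'" >&2
    rc=1
  fi
  if [ "$cluster_status" != "Running" ]; then
    echo "Cluster detailed status is '$cluster_status', expected 'Running'" >&2
    rc=1
  fi
  return "$rc"
}

# Hypothetical usage (property names are assumptions):
#   cm_state=$(az networkcloud clustermanager show -g "$CM_RG" --resource-name "$CM_NAME" \
#     --subscription "$SUBSCRIPTION_ID" --query provisioningState -o tsv)
#   cluster_status=$(az networkcloud cluster show -n "$CLUSTER_NAME" -g "$CLUSTER_RG" \
#     --subscription "$SUBSCRIPTION_ID" --query detailedStatus -o tsv)
#   upgrade_precheck_ok "$cm_state" "$cluster_status" || echo "stop the upgrade"
```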
94
97
95
-
3. Review Operator Nexus Release notes for required checks and configuration updates not included in this document.
98
+
3. Check the Bare Metal Machine (BMM) status:
99
+
```
100
+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
101
+
```
102
+
103
+
Check the following for each BMM:
104
+
- ReadyState: True
105
+
- ProvisioningState: Succeeded
106
+
- DetailedStatus: Provisioned
107
+
- CordonStatus: Uncordoned
108
+
- PowerState: On
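
On a large Cluster, the five checks above are easier to apply with a filter than by reading the table. The sketch below assumes tab-separated rows in the column order name, readyState, provisioningState, detailedStatus, cordonStatus, powerState; `unhealthy_bmms` is a hypothetical helper name.

```shell
# Read tab-separated BMM rows on stdin and print the name of every BMM
# that fails at least one of the checks listed above.
unhealthy_bmms() {
  awk -F '\t' '
    tolower($2) != "true" || $3 != "Succeeded" || $4 != "Provisioned" ||
    $5 != "Uncordoned" || $6 != "On" { print $1 }
  '
}

# Hypothetical usage (the column order must match the awk fields above):
#   az networkcloud baremetalmachine list -g "$CLUSTER_MRG" --subscription "$SUBSCRIPTION_ID" \
#     --query "[].[name,readyState,provisioningState,detailedStatus,cordonStatus,powerState]" \
#     -o tsv | unhealthy_bmms
```

An empty result means every BMM passed; any printed name is a candidate for <BMM_ISSUE_LIST>.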
109
+
110
+
4. Collect a profile of the tenant workloads pre-upgrade:
111
+
```
112
+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
113
+
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
114
+
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
115
+
```
116
+
5. Review Operator Nexus Release notes for required checks and configuration updates not included in this document.
117
+
118
+
## Send notification to Operations of upgrade schedule for the Cluster
119
+
120
+
The following template can be used through email or support ticket:
121
+
```
122
+
Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
123
+
124
+
Operations Support:
125
+
126
+
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
127
+
128
+
Subscription: <CUSTOMER_SUB_ID>
129
+
NFC: <NFC_NAME>
130
+
CM: <CM_NAME>
131
+
Fabric: <NF_NAME>
132
+
Cluster: <CLUSTER_NAME>
133
+
Region: <AZURE_REGION>
134
+
Version: <NEXUS_VERSION>
135
+
136
+
CC: stakeholder-list
137
+
```
138
+
139
+
## Add resource tag on Cluster resource in Azure portal
140
+
To help track upgrades, add a tag to the Cluster resource in Azure portal (optional):
141
+
```
142
+
|Name | Value |
143
+
|----------------|-----------------
144
+
|BF in progress |<DE_ID> |
145
+
```
146
+
147
+
## Set deployment strategy and Compute threshold on Cluster if different from default
148
+
The default threshold for the percent of Compute BMM to pass hardware validation and provisioning is 80% with a default pause between Racks of one minute.
149
+
150
+
`update-strategy` can be the following:
151
+
* `Rack` - Upgrade each Rack one at a time and move to the next Rack once the Compute threshold is met for the current Rack. Pause for <DEPLOYMENT_PAUSE_MINS> minutes before starting the next Rack.
152
+
* `PauseAfterRack` - Wait for user API response to continue to the next Rack once the Compute threshold is met for the current Rack.
153
+
154
+
If `updateStrategy` is not set, the defaults are as follows:
155
+
```
156
+
"updateStrategy": {
157
+
"maxUnavailable": 32767,
158
+
"strategyType": "Rack",
159
+
"thresholdType": "PercentSuccess",
160
+
"thresholdValue": 80,
161
+
"waitTimeMinutes": 1
162
+
}
163
+
```
164
+
165
+
### Set a deployment threshold and wait time different than default:
>[!Important] If 100% threshold is required, review the BMM status reported during pre-checks and make sure all BMM are healthy before proceeding with the upgrade.
170
+
171
+
Verify update:
172
+
```
173
+
az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID| grep -A5 updateStrategy
174
+
"updateStrategy": {
175
+
"maxUnavailable": 32767,
176
+
"strategyType": "Rack",
177
+
"thresholdType": "PercentSuccess",
178
+
"thresholdValue": 80,
179
+
"waitTimeMinutes": 1
180
+
}
181
+
```
182
+
183
+
### Running Cluster upgrade with `PauseAfterRack` Strategy
184
+
185
+
The `PauseAfterRack` strategy allows the customer to control the upgrade by requiring an API call to continue to the next Rack after each Compute Rack completes to the configured threshold.
Provide this information to Microsoft Support when opening a support ticket for upgrade issues.
217
+
218
+
## Monitor status of Cluster:
219
+
```
220
+
az networkcloud cluster list -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID -o table
221
+
```
222
+
When the upgrade is complete, the Cluster `Detailed status` will move to the `Running` state and the `Detailed status message` will show `Cluster is up and running.`
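
Rather than rerunning the list command by hand, the wait for `Running` can be scripted as a polling loop. The following is a generic sketch: `wait_for_status` is a hypothetical helper that takes the name of any command printing the current status, so the Azure CLI call stays outside the loop logic. The `detailedStatus` property name in the usage comment is an assumption.

```shell
# Poll a status command until it prints the wanted value.
# Arguments: <status command> <wanted value> <max attempts> <seconds between attempts>
wait_for_status() {
  local getter="$1" wanted="$2" attempts="$3" pause="$4"
  local i=1 current
  while [ "$i" -le "$attempts" ]; do
    current=$($getter)
    echo "attempt $i: ${current:-unknown}"
    if [ "$current" = "$wanted" ]; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$pause"
    fi
  done
  return 1   # the wanted status was never observed
}

# Hypothetical usage (property name is an assumption):
#   cluster_status() {
#     az networkcloud cluster show -n "$CLUSTER_NAME" -g "$CLUSTER_RG" \
#       --subscription "$SUBSCRIPTION_ID" --query detailedStatus -o tsv
#   }
#   wait_for_status cluster_status Running 120 300
```

A bounded attempt count keeps the loop from running forever if the upgrade stalls, at which point the troubleshooting steps apply.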
223
+
224
+
## Monitor status of Bare Metal Machines:
225
+
```
226
+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID -o table
227
+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
228
+
```
229
+
230
+
Validate the following for each BMM:
231
+
- ReadyState: True
232
+
- ProvisioningState: Succeeded
233
+
- DetailedStatus: Provisioned
234
+
- CordonStatus: Uncordoned
235
+
- PowerState: On
236
+
- KubernetesVersion: <NEW_VERSION>
237
+
- MachineClusterVersion: <NEXUS_VERSION>
238
+
239
+
If the Cluster upgrade is complete but a BMM did not complete provisioning, add a tag to the BMM resource (optional):
240
+
```
241
+
|Name | Value |
242
+
|--------------------|-----------------
243
+
|BF provision issue |<DE_ID> |
244
+
```
245
+
246
+
## Continuing upgrade during `PauseAfterRack` strategy:
247
+
Once a compute rack has met the success threshold, the upgrade will move into a pause until the user signals to the operator to continue the upgrade.
248
+
249
+
Use the following to continue upgrade once a Compute Rack has met the Compute deployment threshold for the rack:
250
+
```
251
+
az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME --subscription $SUBSCRIPTION_ID
252
+
```
253
+
## Troubleshooting Cluster and BMM upgrade failures
254
+
The following troubleshooting documents can help recover BMM upgrade issues:
If troubleshooting does not resolve the issue, open a Microsoft support ticket:
261
+
1. Collect any errors in the Azure CLI output.
262
+
2. Collect Cluster and BMM operation state from Azure portal or Azure CLI.
263
+
3. Create an Azure Support Request for any Cluster or BMM upgrade failures and attach any errors along with the ASYNC URL, correlation ID, and operation state of the Cluster and BMMs.
264
+
265
+
## Post-upgrade Validation
266
+
Run the following commands to check the status of the CM, Cluster, and BMM:
267
+
268
+
1. Check that the CM is in `Succeeded` for `Provisioning state`:
269
+
```
270
+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
271
+
```
272
+
273
+
2. Check the Cluster status `Detailed status` is `Running`:
274
+
```
275
+
az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table
276
+
```
277
+
278
+
3. Check the Bare Metal Machine status:
279
+
```
280
+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
281
+
```
282
+
283
+
Check the following for each BMM:
284
+
- ReadyState: True
285
+
- ProvisioningState: Succeeded
286
+
- DetailedStatus: Provisioned
287
+
- CordonStatus: Uncordoned
288
+
- PowerState: On
289
+
290
+
4. Collect a profile of the tenant workloads:
291
+
```
292
+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
293
+
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
294
+
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
295
+
```
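
The post-upgrade profile is most useful when compared against the one collected during pre-checks. The sketch below assumes both profiles were saved to files with identical columns (for example by using `-o tsv` and redirecting the output); the function name and file names are hypothetical.

```shell
# Print workload rows that differ between the pre- and post-upgrade profiles.
# Lines starting with "<" appear only pre-upgrade; ">" only post-upgrade.
# Exit 0 when the two profiles contain the same rows (in any order).
compare_profiles() {
  local pre_sorted post_sorted drift
  pre_sorted=$(mktemp)
  post_sorted=$(mktemp)
  sort "$1" > "$pre_sorted"
  sort "$2" > "$post_sorted"
  # grep succeeds only when diff reported at least one changed row.
  if diff "$pre_sorted" "$post_sorted" | grep '^[<>]'; then
    drift=1
  else
    drift=0
  fi
  rm -f "$pre_sorted" "$post_sorted"
  [ "$drift" -eq 0 ]
}
```

Usage: `compare_profiles vm-profile-pre.tsv vm-profile-post.tsv || echo "review the rows above"`. Fields that legitimately change during an upgrade, such as a Kubernetes version, will also show up as differences.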
296
+
297
+
## Send notification to Operations of Cluster upgrade completion
298
+
299
+
The following template can be used through email or ticketing system: