This how-to guide provides a step-by-step template for upgrading a Nexus Cluster. It is designed to help users manage a reproducible end-to-end upgrade through Azure APIs and standard operating procedures. Regular updates are crucial for maintaining system integrity and accessing the latest product improvements.
15
15
16
16
## Overview
17
+
<details>
18
+
<summary> Overview of Cluster runtime upgrade template </summary>
17
19
18
20
**Runtime bundle components**: These components require operator consent for upgrades that may affect traffic behavior or necessitate server reboots. Nexus Cluster's design allows for updates to be applied while maintaining continuous workload operation.
19
21
@@ -22,33 +24,92 @@ Runtime changes are categorized as follows:
22
24
- **Operating system updates**: Necessary to support new operating system features and resolve security issues.
23
25
- **Platform updates**: Necessary to support new platform features and resolve security issues.
24
26
27
+
</details>
28
+
25
29
## Prerequisites
30
+
<details>
31
+
<summary> Prerequisites for using this template to upgrade a Cluster </summary>
26
32
27
-
- Install the latest version of [Azure CLI](https://aka.ms/azcli).
28
-
- The latest `networkcloud` CLI extension is required. It can be installed following the steps listed in [Install CLI Extension](howto-install-cli-extensions.md).
33
+
- Latest version of [Azure CLI](https://aka.ms/azcli).
To view status of long running asynchronous operations, run the following command with `az rest`:
97
+
```
98
+
az rest -m get -u '<ASYNC_URL>'
99
+
```
100
+
101
+
Command status information is returned along with detailed informational or error messages:
102
+
- `"status": "Accepted"`
103
+
- `"status": "Succeeded"`
104
+
- `"status": "Failed"`
105
+
106
+
If any failures occur, report the `<MISE_CID>`, `<CORRELATION_ID>`, status code, and detailed messages when opening a support request.
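The status check above can be repeated until the operation leaves `Accepted`. The following is a minimal sketch, not an official procedure: the function names, the 30-second interval, and the attempt limit are illustrative assumptions, and it assumes the response carries a top-level `status` field as shown in the list above.

```shell
# Sketch: poll a long-running operation until it reports Succeeded or Failed.
# Function names, interval, and attempt limit are illustrative assumptions.

# Print the top-level "status" value returned for the async URL.
get_async_status() {
  az rest -m get -u "$1" --query status -o tsv
}

# Usage: wait_for_async '<ASYNC_URL>' [interval_seconds] [max_attempts]
# Returns 0 on Succeeded, 1 on Failed, 2 if the attempt limit is reached.
wait_for_async() {
  wfa_url="$1"
  wfa_interval="${2:-30}"
  wfa_attempts="${3:-120}"
  wfa_i=0
  while [ "$wfa_i" -lt "$wfa_attempts" ]; do
    wfa_status="$(get_async_status "$wfa_url")"
    echo "status: $wfa_status"
    case "$wfa_status" in
      Succeeded) return 0 ;;
      Failed) return 1 ;;
    esac
    sleep "$wfa_interval"
    wfa_i=$((wfa_i + 1))
  done
  return 2
}
```

For example, `wait_for_async '<ASYNC_URL>' 60` checks once a minute. On a non-zero return, run the `az rest` command once more without `--query` to capture the full message for the support request.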
107
+
108
+
</details>
50
109
51
110
## Pre-Checks
111
+
<details>
112
+
<summary> Pre-checks before starting Cluster upgrade </summary>
52
113
53
114
1. Validate the provisioning and detailed status for the CM and Cluster.
54
115
@@ -99,43 +160,19 @@ Runtime changes are categorized as follows:
99
160
100
161
3. Collect a profile of the tenant workloads:
101
162
```
102
-
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
103
163
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
104
164
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
105
165
```
106
166
107
167
4. Review Operator Nexus Release notes for required checks and configuration updates not included in this document.
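The workload profile from step 3 is most useful when it can be compared against the same profile taken after the upgrade. The sketch below shows one way to do that, assuming `SUBSCRIPTION_ID` and `CLUSTER_RID` are already set. The function names, file names, and the reduced set of columns are illustrative choices, not part of the documented procedure.

```shell
# Sketch: save a sorted workload profile to a directory so that a
# pre-upgrade and a post-upgrade snapshot can be compared with diff.
# Function names, file names, and selected columns are illustrative.

# Usage: snapshot_workloads <directory>
snapshot_workloads() {
  snap_dir="$1"
  mkdir -p "$snap_dir" || return 1
  az networkcloud virtualmachine list --subscription "$SUBSCRIPTION_ID" \
    --query "[?clusterId=='$CLUSTER_RID'].[name, resourceGroup, powerState, provisioningState, detailedStatus]" \
    -o tsv | sort > "$snap_dir/virtualmachines.tsv"
  az networkcloud kubernetescluster list --subscription "$SUBSCRIPTION_ID" \
    --query "[?clusterId=='$CLUSTER_RID'].[name, resourceGroup, provisioningState, detailedStatus, kubernetesVersion]" \
    -o tsv | sort > "$snap_dir/kubernetesclusters.tsv"
}

# Usage: compare_snapshots <pre_dir> <post_dir>
# Exits 0 when both snapshots are identical, non-zero otherwise.
compare_snapshots() {
  diff -r "$1" "$2"
}
```

A typical run would be `snapshot_workloads ./pre` before the upgrade, `snapshot_workloads ./post` afterwards, then `compare_snapshots ./pre ./post`. Any line in the diff points at a workload whose state changed across the upgrade.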
108
168
109
-
## Send notification to Operations of upgrade schedule for the Cluster
110
-
111
-
The following template can be used through email or support ticket:
112
-
```
113
-
Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
114
-
115
-
Operations Support:
116
-
117
-
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
## Add resource tag on Cluster resource in Azure portal
131
-
To help track upgrades, add a tag to the Cluster resource in Azure portal (optional):
132
-
```
133
-
|Name | Value |
134
-
|----------------|-----------------
135
-
|BF in progress |<DE_ID> |
136
-
```
137
-
138
-
## Set deployment strategy and Compute threshold on Cluster if different from default
175
+
### Cluster upgrade settings defaults
139
176
The default threshold for the percent of Compute BMM to pass hardware validation and provisioning is 80% with a default pause between Racks of one minute.
140
177
141
178
The following settings are available for `update-strategy`:
@@ -173,7 +210,6 @@ az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUB
173
210
```
174
211
175
212
### How to run Cluster upgrade with `PauseAfterRack` Strategy
176
-
177
213
The `PauseAfterRack` strategy lets the customer control the upgrade by requiring an API call to continue to the next Rack after each Compute Rack completes to the configured threshold.
178
214
179
215
To configure strategy to use `PauseAfterRack`:
@@ -192,7 +228,7 @@ az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <C
192
228
"waitTimeMinutes": 0
193
229
```
194
230
195
-
## Run upgrade from either portal or cli
231
+
### Run upgrade from either portal or CLI
196
232
* To start the upgrade from the Azure portal, go to the Cluster resource, click `Update`, select `<CLUSTER_VERSION>`, then click `Update`.
197
233
* To run upgrade from Azure CLI, run the following command:
198
234
```
@@ -207,13 +243,21 @@ az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <C
207
243
```
208
244
Provide this information to Microsoft Support when opening a support ticket for upgrade issues.
209
245
210
-
## Monitor status of Cluster
246
+
### How to continue upgrade during `PauseAfterRack` strategy
247
+
Once a compute Rack meets the success threshold, the upgrade pauses until the user signals to the operator to continue the upgrade.
248
+
249
+
Use the following command to continue upgrade once a Compute Rack is paused after meeting the deployment threshold for the Rack:
250
+
```
251
+
az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME --subscription $SUBSCRIPTION_ID
252
+
```
253
+
254
+
### Monitor status of Cluster
211
255
```
212
256
az networkcloud cluster list -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID -o table
213
257
```
214
258
When the upgrade is complete, the Cluster `Detailed status` shows `Running` and the `Detailed status message` shows `Cluster is up and running.`
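The completion check above can be automated with a simple polling loop. This is a hedged sketch: the function names and the default interval and attempt count are assumptions, and it relies on `detailedStatus` being the property behind the `Detailed status` column. Because a healthy Cluster is presumably also `Running` before the upgrade begins, start polling only after the upgrade is under way.

```shell
# Sketch: wait until the Cluster reports detailedStatus "Running".
# Assumes CLUSTER_RG, CLUSTER_NAME, and SUBSCRIPTION_ID are set.
# Function names and default timings are illustrative assumptions.

# Print the Cluster's current detailedStatus.
cluster_detailed_status() {
  az networkcloud cluster show -g "$CLUSTER_RG" -n "$CLUSTER_NAME" \
    --subscription "$SUBSCRIPTION_ID" --query detailedStatus -o tsv
}

# Usage: wait_for_cluster_running [interval_seconds] [max_attempts]
# Returns 0 once the status is Running, 1 if the attempt limit is reached.
wait_for_cluster_running() {
  wfc_interval="${1:-300}"
  wfc_attempts="${2:-48}"
  wfc_i=0
  while [ "$wfc_i" -lt "$wfc_attempts" ]; do
    wfc_status="$(cluster_detailed_status)"
    echo "detailedStatus: $wfc_status"
    if [ "$wfc_status" = "Running" ]; then
      return 0
    fi
    sleep "$wfc_interval"
    wfc_i=$((wfc_i + 1))
  done
  return 1
}
```

With the `PauseAfterRack` strategy, the loop does not finish on its own, because the upgrade waits for `continue-update-version` after each Compute Rack.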
215
259
216
-
## Monitor status of Bare Metal Machines
260
+
### Monitor status of Bare Metal Machines
217
261
```
218
262
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID -o table
219
263
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
@@ -228,21 +272,7 @@ Validate the following states for each BMM (except spare):
228
272
- KubernetesVersion: <NEW_VERSION>
229
273
- MachineClusterVersion: <NEXUS_VERSION>
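To spot BMMs that miss the expected states without reading the whole table, the list output can be filtered. The sketch below is an illustration under assumptions: the function names are hypothetical, and the expected values (`True`, `Succeeded`, `Provisioned`, `Uncordoned`, `On`) are taken from the BMM state list this guide used for post-upgrade validation, with the exact strings returned by the API assumed to match.

```shell
# Sketch: print the names of BMMs that do not match the expected states.
# Function names are hypothetical; expected values are assumed to match
# the strings returned by the API.

# Emit one tab-separated row per BMM:
# name, readyState, provisioningState, detailedStatus, cordonStatus, powerState
list_bmm_states() {
  az networkcloud baremetalmachine list -g "$CLUSTER_MRG" \
    --subscription "$SUBSCRIPTION_ID" \
    --query "[].[name, readyState, provisioningState, detailedStatus, cordonStatus, powerState]" \
    -o tsv
}

# Read rows from stdin and print the name of every BMM with any
# column that differs from the expected value.
unhealthy_bmms() {
  awk -F '\t' '$2 != "True" || $3 != "Succeeded" || $4 != "Provisioned" ||
               $5 != "Uncordoned" || $6 != "On" { print $1 }'
}
```

Run it as `list_bmm_states | unhealthy_bmms`. The spare control-plane BMM is inactive, so it is expected to appear in the output and can be ignored.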
230
274
231
-
Add a Tag to the BMM resource to track any BMM that fails to complete provisioning (optional):
232
-
```
233
-
|Name | Value |
234
-
|--------------------|-----------------
235
-
|BF provision issue |<DE_ID> |
236
-
```
237
-
238
-
## How to continue upgrade during `PauseAfterRack` strategy
239
-
Once a compute Rack meets the success threshold, the upgrade pauses until the user signals to the operator to continue the upgrade.
240
-
241
-
Use the following command to continue upgrade once a Compute Rack is paused after meeting the deployment threshold for the Rack:
242
-
```
243
-
az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME$ --subscription $SUBSCRIPTION_ID
244
-
```
245
-
## How to troubleshoot Cluster and BMM upgrade failures
275
+
### How to troubleshoot Cluster and BMM upgrade failures
246
276
The following troubleshooting documents can help recover BMM upgrade issues:
@@ -254,80 +284,68 @@ If troubleshooting doesn't resolve the issue, open a Microsoft support ticket:
254
284
- Collect Cluster and BMM operation state from Azure portal or Azure CLI.
255
285
- Create Azure Support Request for any Cluster or BMM upgrade failures and attach any errors along with ASYNC URL, correlation ID, and operation state of the Cluster and BMMs.
256
286
257
-
## Post-upgrade validation
258
-
Run the following commands to check the status of the CM, Cluster, and BMM:
287
+
</details>
259
288
260
-
1. Check that the CM is in `Succeeded` for `Provisioning state`:
261
-
```
262
-
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
263
-
```
289
+
## Post-upgrade tasks
290
+
<details>
291
+
<summary> Detailed steps for post-upgrade tasks </summary>
264
292
265
-
2. Check the Cluster status `Detailed status` is `Running`:
266
-
```
267
-
az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table
268
-
```
293
+
### Review Operator Nexus release notes
294
+
Review the Operator Nexus release notes for any version specific actions required post-upgrade.
269
295
270
-
3. Check the Bare Metal Machine status:
271
-
```
272
-
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
273
-
```
296
+
### Validate Nexus Instance
274
297
275
-
Validate the following resource states for each BMM (except spare)
276
-
- ReadyState: True
277
-
- ProvisioningState: Succeeded
278
-
- DetailedStatus: Provisioned
279
-
- CordonStatus: Uncordoned
280
-
- PowerState: On
298
+
Validate the health and status of all the Nexus Instance resources with the [Nexus Instance Readiness Test (IRT)](howto-run-instance-readiness-testing.md).
281
299
282
-
>[!Note]
283
-
> One control-plane BMM is labeled as a spare and is inactive.
284
-
285
-
4. Collect a profile of the tenant workloads:
286
-
```
287
-
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
288
-
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineIdi,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
289
-
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
290
-
```
291
-
292
-
## Send notification to Operations of Cluster upgrade completion
293
-
294
-
The following template can be used through email or ticketing system:
300
+
To perform a resource validation of the Nexus Instance components post-upgrade through Azure CLI:
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime <CLUSTER_VERSION> Upgrade Complete
300
-
301
-
Subscription: <CUSTOMER_SUB_ID>
302
-
NFC: <NFC_NAME>
303
-
CM: <CM_NAME>
304
-
Fabric: <NF_NAME>
305
-
Cluster: <CLUSTER_NAME>
306
-
Region: <AZURE_REGION>
307
-
Version: <NEXUS_VERSION>
308
-
309
-
The following is a list of BMM with provisioning issues during upgrade:
310
-
<BMM_ISSUE_LIST>
311
-
312
-
CC: stakeholder_list
302
+
# NFC
303
+
az networkfabric controller list --subscription <CUSTOMER_SUB_ID> -o table
304
+
az vm list -o table --query "[?location=='<AZURE_REGION>']" --subscription <CUSTOMER_SUB_ID>
305
+
az customlocation list --subscription <CUSTOMER_SUB_ID> -o table --query "[?location=='<AZURE_REGION>']" | grep <NFC_NAME>
306
+
307
+
# Fabric
308
+
az networkfabric fabric list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
309
+
az networkfabric rack list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
310
+
az networkfabric fabric device list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
311
+
az networkfabric nni list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
312
+
az networkfabric acl list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
313
+
az networkfabric l2domain list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
314
+
315
+
# CM
316
+
az networkcloud clustermanager list --subscription <CUSTOMER_SUB_ID> -o table
317
+
318
+
# Cluster
319
+
az networkcloud cluster list --subscription <CUSTOMER_SUB_ID> -o table
320
+
az networkcloud baremetalmachine list -g <CLUSTER_MRG> --subscription <CUSTOMER_SUB_ID> --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
321
+
az networkcloud storageappliance list -g <CLUSTER_MRG> --subscription <CUSTOMER_SUB_ID> -o table
322
+
323
+
# Tenant Workloads
324
+
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
325
+
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
313
326
```
314
327
315
-
## Remove resource tag on Cluster resource in Azure portal
316
-
Remove the resource tag on the Cluster resource tracking the upgrade in Azure portal (if added previously):
317
-
```
318
-
|Name | Value |
319
-
|----------------|-----------------
320
-
|BF in progress |<DE_ID> |
321
-
```
328
+
> [!Note]
329
+
> IRT validation provides a complete functional test of networking and workloads across all components of the Nexus Instance. Simple validation does not provide functional testing.
322
330
323
-
## Close out any Work Items in your ticketing system
324
-
* Update Task hours for upgrade duration.
325
-
* Set Cluster upgrade work item to `Complete`.
326
-
* Add any notes on support tickets and issues encountered during upgrade
331
+
</details>
327
332
328
333
## Links
329
-
-[Azure portal](https://aka.ms/nexus-portal)
334
+
<details>
335
+
<summary> Reference Links for Cluster upgrade </summary>
336
+
337
+
Reference links for Cluster upgrade:
338
+
- Access the [Azure portal](https://aka.ms/nexus-portal)