Commit 8bceaba

Update howto-cluster-runtime-upgrade-template.md
1 parent 485e867 commit 8bceaba

File tree

1 file changed

+139
-121
lines changed

articles/operator-nexus/howto-cluster-runtime-upgrade-template.md

Lines changed: 139 additions & 121 deletions
Original file line numberDiff line numberDiff line change
@@ -9,11 +9,13 @@ ms.topic: how-to
99
ms.custom: azure-operator-nexus, template-include
1010
---
1111

12-
# Cluster runtime upgrade template
12+
# Cluster Runtime Upgrade Template
1313

1414
This how-to guide provides a step-by-step template for upgrading a Nexus Cluster. It is designed to help users manage a reproducible end-to-end upgrade through Azure APIs and standard operating procedures. Regular updates are crucial for maintaining system integrity and accessing the latest product improvements.
1515

1616
## Overview
17+
<details>
18+
<summary> Overview of Cluster runtime upgrade template </summary>
1719

1820
**Runtime bundle components**: These components require operator consent for upgrades that may affect traffic behavior or necessitate server reboots. Nexus Cluster's design allows for updates to be applied while maintaining continuous workload operation.
1921

@@ -22,33 +24,92 @@ Runtime changes are categorized as follows:
2224
- **Operating system updates**: Necessary to support new Operating system features and resolve security issues.
2325
- **Platform updates**: Necessary to support new platform features and resolve security issues.
2426

27+
</details>
28+
2529
## Prerequisites
30+
<details>
31+
<summary> Prerequisites for using this template to upgrade a Cluster </summary>
2632

27-
- Install the latest version of [Azure CLI](https://aka.ms/azcli).
28-
- The latest `networkcloud` CLI extension is required. It can be installed following the steps listed in [Install CLI Extension](howto-install-cli-extensions.md).
33+
- Latest version of [Azure CLI](https://aka.ms/azcli).
34+
- Latest `managednetworkfabric` [CLI extension](howto-install-cli-extensions.md).
35+
- Latest `networkcloud` [CLI extension](howto-install-cli-extensions.md).
2936
- Subscription access to run the Azure Operator Nexus Network Fabric (NF) and Network Cloud (NC) CLI extension commands.
30-
-Target Cluster must be healthy in a running state.
37+
- Target Cluster must be healthy in a running state.
38+
39+
</details>
3140

3241
## Required Parameters
42+
<details>
43+
<summary> Parameters used in this document </summary>
44+
3345
- \<ENVIRONMENT\>: Instance Name
3446
- <AZURE_REGION>: Azure Region of Instance
3547
- <CUSTOMER_SUB_NAME>: Subscription Name
3648
- <CUSTOMER_SUB_ID>: Subscription ID
49+
- \<NEXUS_VERSION\>: Nexus release version (for example, 2504.1)
50+
- <NNF_VERSION>: Operator Nexus Fabric release version (for example, 8.1)
51+
- <NF_VERSION>: NF runtime version for upgrade (for example, 5.0.0)
52+
- <NFC_NAME>: Associated Network Fabric Controller (NFC)
53+
- <CM_NAME>: Associated Cluster Manager (CM)
3754
- <CLUSTER_NAME>: Cluster Name
3855
- <CLUSTER_RG>: Cluster Resource Group
3956
- <CLUSTER_RID>: Cluster ARM ID
4057
- <CLUSTER_MRG>: Cluster Managed Resource Group
4158
- <CLUSTER_CONTROL_BMM>: Cluster Control plane baremetalmachine
4259
- <CLUSTER_VERSION>: Runtime version for upgrade
43-
- <START_TIME>: Planned start time of upgrade
44-
- \<DURATION\>: Estimated Duration of upgrade
4560
- <DEPLOYMENT_THRESHOLD>: Compute deployment threshold
4661
- <DEPLOYMENT_PAUSE_MINS>: Time to wait before moving to the next Rack once the current Rack meets the deployment threshold
47-
- <NFC_NAME>: Associated Network Fabric Controller (NFC)
48-
- <CM_NAME>: Associated Cluster Manager (CM)
49-
- <BMM_ISSUE_LIST>: List of BMM with provisioning issues after Cluster upgrade is complete
62+
- <MISE_CID>: Microsoft.Identity.ServiceEssentials (MISE) Correlation ID in debug output for Device updates
63+
- <CORRELATION_ID>: Operation Correlation ID in debug output for Device updates
64+
- <ASYNC_URL>: Asynchronous (ASYNC) URL in debug output for Device updates
65+
- <LINK_TO_TELCO_INPUT>: Link to the Instance Telco Input file
66+
67+
</details>
68+
69+
## Deployment Data
70+
<details>
71+
<summary> Deployment data details </summary>
72+
73+
```
74+
- Nexus: <NEXUS_VERSION>
75+
- NC: <NC_VERSION>
76+
- NF: <NF_VERSION>
77+
- Subscription Name: <CUSTOMER_SUB_NAME>
78+
- Subscription ID: <CUSTOMER_SUB_ID>
79+
- Tenant ID: <CUSTOMER_SUB_TENANT_ID>
80+
- Telco Input: <LINK_TO_TELCO_INPUT>
81+
```
82+
83+
</details>
84+
85+
## Debug information for Azure CLI commands
86+
<details>
87+
<summary> How to collect debug information for Azure CLI commands </summary>
88+
89+
Azure CLI deployment commands issued with `--debug` contain the following information in the command output:
90+
```
91+
cli.azure.cli.core.sdk.policies: 'mise-correlation-id': '<MISE_CID>'
92+
cli.azure.cli.core.sdk.policies: 'x-ms-correlation-request-id': '<CORRELATION_ID>'
93+
cli.azure.cli.core.sdk.policies: 'Azure-AsyncOperation': '<ASYNC_URL>'
94+
```
95+
96+
To view the status of a long-running asynchronous operation, run the following `az rest` command:
97+
```
98+
az rest -m get -u '<ASYNC_URL>'
99+
```
100+
101+
Command status information is returned along with detailed informational or error messages:
102+
- `"status": "Accepted"`
103+
- `"status": "Succeeded"`
104+
- `"status": "Failed"`
105+
106+
If any failures occur, report the <MISE_CID>, <CORRELATION_ID>, status code, and detailed messages when opening a support request.
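
The status check above can be repeated until the operation finishes. The following is a minimal sketch, not part of the original template: the function name `wait_for_async_operation`, the default poll interval, and the attempt limit are illustrative assumptions. It relies only on `az rest` and the global `--query` and `-o tsv` arguments, and it treats any status other than `Succeeded` or `Failed` as still in progress.

```shell
# Sketch: poll <ASYNC_URL> until the operation reports Succeeded or Failed.
# ASSUMPTIONS: function name, 60-second interval, and 120-attempt limit are
# illustrative defaults, not values defined by Operator Nexus.
wait_for_async_operation() {
  local async_url="$1"
  local interval_secs="${2:-60}"
  local max_attempts="${3:-120}"
  local attempt=1
  local status

  while [ "$attempt" -le "$max_attempts" ]; do
    # Extract only the "status" field from the operation document.
    status="$(az rest -m get -u "$async_url" --query status -o tsv)"
    case "$status" in
      Succeeded) echo "Succeeded"; return 0 ;;
      Failed)    echo "Failed";    return 1 ;;
    esac
    # "Accepted", an empty result, or any other value: wait and retry.
    sleep "$interval_secs"
    attempt=$((attempt + 1))
  done

  echo "TimedOut"
  return 2
}
```

For example, `wait_for_async_operation '<ASYNC_URL>' 120` checks every two minutes. On a `Failed` result, rerun the plain `az rest` command to capture the detailed error message for the support request.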
107+
108+
</details>
50109

51110
## Pre-Checks
111+
<details>
112+
<summary> Pre-checks before starting Cluster upgrade </summary>
52113

53114
1. Validate the provisioning and detailed status for the CM and Cluster.
54115

@@ -99,43 +160,19 @@ Runtime changes are categorized as follows:
99160

100161
3. Collect a profile of the tenant workloads:
101162
```
102-
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
103163
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
104164
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
105165
```
106166

107167
4. Review Operator Nexus Release notes for required checks and configuration updates not included in this document.
108168

109-
## Send notification to Operations of upgrade schedule for the Cluster
110-
111-
The following template can be used through email or support ticket:
112-
```
113-
Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
114-
115-
Operations Support:
116-
117-
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
169+
</details>
118170

119-
Subscription: <CUSTOMER_SUB_ID>
120-
NFC: <NFC_NAME>
121-
CM: <CM_NAME>
122-
Fabric: <NF_NAME>
123-
Cluster: <CLUSTER_NAME>
124-
Region: <AZURE_REGION>
125-
Version: <NEXUS_VERSION>
126-
127-
CC: stakeholder-list
128-
```
171+
## Upgrade Procedure
172+
<details>
173+
<summary> Cluster runtime upgrade procedure details </summary>
129174

130-
## Add resource tag on Cluster resource in Azure portal
131-
To help track upgrades, add a tag to the Cluster resource in Azure portal (optional):
132-
```
133-
|Name | Value |
134-
|----------------|-----------------
135-
|BF in progress |<DE_ID> |
136-
```
137-
138-
## Set deployment strategy and Compute threshold on Cluster if different from default
175+
### Cluster upgrade settings defaults
139176
The default threshold for the percent of Compute BMM to pass hardware validation and provisioning is 80% with a default pause between Racks of one minute.
140177

141178
The following settings are available for `update-strategy`:
@@ -173,7 +210,6 @@ az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUB
173210
```
174211

175212
### How to run Cluster upgrade with `PauseAfterRack` Strategy
176-
177213
The `PauseAfterRack` strategy lets the customer control the upgrade by requiring an API call to continue to the next Rack after each Compute Rack completes to the configured threshold.
178214

179215
To configure strategy to use `PauseAfterRack`:
@@ -192,7 +228,7 @@ az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <C
192228
"waitTimeMinutes": 0
193229
```
194230

195-
## Run upgrade from either portal or cli
231+
### Run upgrade from either portal or cli
196232
* To start upgrade from Azure portal, go to Cluster resource, click `Update`, select <CLUSTER_VERSION>, then click `Update`
197233
* To run upgrade from Azure CLI, run the following command:
198234
```
@@ -207,13 +243,21 @@ az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <C
207243
```
208244
Provide this information to Microsoft Support when opening a support ticket for upgrade issues.
209245

210-
## Monitor status of Cluster
246+
### How to continue upgrade during `PauseAfterRack` strategy
247+
Once a compute Rack meets the success threshold, the upgrade pauses until the user signals to the operator to continue the upgrade.
248+
249+
Use the following command to continue upgrade once a Compute Rack is paused after meeting the deployment threshold for the Rack:
250+
```
251+
az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME --subscription $SUBSCRIPTION_ID
252+
```
253+
254+
### Monitor status of Cluster
211255
```
212256
az networkcloud cluster list -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID -o table
213257
```
214258
The Cluster `Detailed status` shows `Running` and the `Detailed status message` shows `Cluster is up and running.` when the upgrade is complete.
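
The completion check can also be scripted. The sketch below is an illustration only: the function name `cluster_is_running` is hypothetical, and it assumes the `Detailed status` column corresponds to the `detailedStatus` property of the Cluster resource, as it does in the other queries in this document.

```shell
# Sketch: succeed (exit 0) only when the Cluster detailed status is "Running".
# ASSUMPTIONS: function name is illustrative; CLUSTER_RG, CLUSTER_NAME, and
# SUBSCRIPTION_ID are set as in the rest of this template.
cluster_is_running() {
  local status
  status="$(az networkcloud cluster show -g "$CLUSTER_RG" -n "$CLUSTER_NAME" \
    --subscription "$SUBSCRIPTION_ID" --query detailedStatus -o tsv)"
  [ "$status" = "Running" ]
}
```

A typical use is `cluster_is_running && echo "Upgrade complete"`. A non-zero result only means the status is not `Running` yet; review the `Detailed status message` before drawing conclusions.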
215259

216-
## Monitor status of Bare Metal Machines
260+
### Monitor status of Bare Metal Machines
217261
```
218262
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID -o table
219263
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
@@ -228,21 +272,7 @@ Validate the following states for each BMM (except spare):
228272
- KubernetesVersion: <NEW_VERSION>
229273
- MachineClusterVersion: <NEXUS_VERSION>
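
To find machines that do not match the expected states without reading the whole table, the list can be filtered. This is a hedged sketch: the function name `list_unhealthy_bmms` is hypothetical, and the expected values (`Succeeded`, `Provisioned`, `Uncordoned`, `On`) are taken from the BMM state checks used elsewhere in this template. It does not check `readyState` or version fields, and it does not exclude the spare control-plane BMM, so the spare may appear in the output.

```shell
# Sketch: print the names of BMMs whose state differs from the expected values.
# ASSUMPTIONS: function name is illustrative; CLUSTER_MRG and SUBSCRIPTION_ID
# are set as in the rest of this template.
list_unhealthy_bmms() {
  # A JMESPath list projection keeps the tsv column order fixed:
  # 1=name 2=provisioningState 3=detailedStatus 4=cordonStatus 5=powerState
  az networkcloud baremetalmachine list -g "$CLUSTER_MRG" \
    --subscription "$SUBSCRIPTION_ID" \
    --query "[].[name, provisioningState, detailedStatus, cordonStatus, powerState]" \
    -o tsv |
    awk -F '\t' '$2 != "Succeeded" || $3 != "Provisioned" || $4 != "Uncordoned" || $5 != "On" { print $1 }'
}
```

An empty result means every BMM matched the four checked states. Any names printed are candidates for the troubleshooting documents referenced in this guide.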
230274

231-
Add a Tag to the BMM resource to track any BMM that fails to complete provisioning (optional):
232-
```
233-
|Name | Value |
234-
|--------------------|-----------------
235-
|BF provision issue |<DE_ID> |
236-
```
237-
238-
## How to continue upgrade during `PauseAfterRack` strategy
239-
Once a compute Rack meets the success threshold, the upgrade pauses until the user signals to the operator to continue the upgrade.
240-
241-
Use the following command to continue upgrade once a Compute Rack is paused after meeting the deployment threshold for the Rack:
242-
```
243-
az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME$ --subscription $SUBSCRIPTION_ID
244-
```
245-
## How to troubleshoot Cluster and BMM upgrade failures
275+
### How to troubleshoot Cluster and BMM upgrade failures
246276
The following troubleshooting documents can help recover BMM upgrade issues:
247277
- [Hardware validation failures](troubleshoot-hardware-validation-failure.md)
248278
- [BMM Provisioning issues](troubleshoot-bare-metal-machine-provisioning.md)
@@ -254,80 +284,68 @@ If troubleshooting doesn't resolve the issue, open a Microsoft support ticket:
254284
- Collect Cluster and BMM operation state from Azure portal or Azure CLI.
255285
- Create Azure Support Request for any Cluster or BMM upgrade failures and attach any errors along with ASYNC URL, correlation ID, and operation state of the Cluster and BMMs.
256286

257-
## Post-upgrade validation
258-
Run the following commands to check the status of the CM, Cluster, and BMM:
287+
</details>
259288

260-
1. Check that the CM is in `Succeeded` for `Provisioning state`:
261-
```
262-
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
263-
```
289+
## Post-upgrade tasks
290+
<details>
291+
<summary> Detailed steps for post-upgrade tasks </summary>
264292

265-
2. Check the Cluster status `Detailed status` is `Running`:
266-
```
267-
az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table
268-
```
293+
### Review Operator Nexus release notes
294+
Review the Operator Nexus release notes for any version-specific actions required post-upgrade.
269295

270-
3. Check the Bare Metal Machine status:
271-
```
272-
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
273-
```
296+
### Validate Nexus Instance
274297

275-
Validate the following resource states for each BMM (except spare)
276-
- ReadyState: True
277-
- ProvisioningState: Succeeded
278-
- DetailedStatus: Provisioned
279-
- CordonStatus: Uncordoned
280-
- PowerState: On
298+
Validate the health and status of all the Nexus Instance resources with the [Nexus Instance Readiness Test (IRT)](howto-run-instance-readiness-testing.md).
281299

282-
>[!Note]
283-
> One control-plane BMM is labeled as a spare and is inactive.
284-
285-
4. Collect a profile of the tenant workloads:
286-
```
287-
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
288-
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineIdi,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
289-
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
290-
```
291-
292-
## Send notification to Operations of Cluster upgrade completion
293-
294-
The following template can be used through email or ticketing system:
300+
To perform a resource validation of the Nexus Instance components post-upgrade through Azure CLI:
295301
```
296-
Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> Runtime <CLUSTER_VERSION> Upgrade Complete
297-
298-
Operations:
299-
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime <CLUSTER_VERSION> Upgrade Complete
300-
301-
Subscription: <CUSTOMER_SUB_ID>
302-
NFC: <NFC_NAME>
303-
CM: <CM_NAME>
304-
Fabric: <NF_NAME>
305-
Cluster: <CLUSTER_NAME>
306-
Region: <AZURE_REGION>
307-
Version: <NEXUS_VERSION>
308-
309-
The following is a list of BMM with provisioning issues during upgrade:
310-
<BMM_ISSUE_LIST>
311-
312-
CC: stakeholder_list
302+
# NFC
303+
az networkfabric controller list --subscription <CUSTOMER_SUB_ID> -o table
304+
az vm list -o table --query "[?location=='<AZURE_REGION>']" --subscription <CUSTOMER_SUB_ID>
305+
az customlocation list -o table --query "[?location=='<AZURE_REGION>']" --subscription <CUSTOMER_SUB_ID> | grep <NFC_NAME>
306+
307+
# Fabric
308+
az networkfabric fabric list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
309+
az networkfabric rack list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
310+
az networkfabric device list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
311+
az networkfabric nni list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
312+
az networkfabric acl list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
313+
az networkfabric l2domain list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
314+
315+
# CM
316+
az networkcloud clustermanager list --subscription <CUSTOMER_SUB_ID> -o table
317+
318+
# Cluster
319+
az networkcloud cluster list --subscription <CUSTOMER_SUB_ID> -o table
320+
az networkcloud baremetalmachine list -g <CLUSTER_MRG> --subscription <CUSTOMER_SUB_ID> --query "sort_by([]. {name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
321+
az networkcloud storageappliance list -g <CLUSTER_MRG> --subscription <CUSTOMER_SUB_ID> -o table
322+
323+
# Tenant Workloads
324+
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
325+
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
313326
```
314327

315-
## Remove resource tag on Cluster resource in Azure portal
316-
Remove the resource tag on the Cluster resource tracking the upgrade in Azure portal (if added previously):
317-
```
318-
|Name | Value |
319-
|----------------|-----------------
320-
|BF in progress |<DE_ID> |
321-
```
328+
> [!Note]
329+
> IRT validation provides a complete functional test of networking and workloads across all components of the Nexus Instance. Simple validation does not provide functional testing.
322330
323-
## Close out any Work Items in your ticketing system
324-
* Update Task hours for upgrade duration.
325-
* Set Cluster upgrade work item to `Complete`.
326-
* Add any notes on support tickets and issues encountered during upgrade
331+
</details>
327332

328333
## Links
329-
- [Azure portal](https://aka.ms/nexus-portal)
334+
<details>
335+
<summary> Reference Links for Cluster upgrade </summary>
336+
337+
Reference links for Cluster upgrade:
338+
- Access the [Azure portal](https://aka.ms/nexus-portal)
339+
- [Install Azure CLI](https://aka.ms/azcli)
340+
- [Install CLI Extension](howto-install-cli-extensions.md)
330341
- [Cluster Upgrade](howto-cluster-runtime-upgrade.md)
331342
- [Cluster Upgrade with PauseAfterRack](howto-cluster-runtime-upgrade-with-pauseafterrack-strategy.md)
332-
- [Azure CLI](https://aka.ms/azcli)
333-
- [Install CLI Extension](howto-install-cli-extensions.md)
343+
- [Troubleshoot hardware validation failure](troubleshoot-hardware-validation-failure.md)
344+
- [Troubleshoot BMM provisioning](troubleshoot-bare-metal-machine-provisioning.md)
346+
- [Troubleshoot BMM degraded](troubleshoot-bare-metal-machine-degraded.md)
347+
- [Troubleshoot BMM warning](troubleshoot-bare-metal-machine-warning.md)
348+
- Reference the [Nexus Telco Input Template](concepts-telco-input-template.md)
349+
- Reference the [Nexus Instance Readiness Test (IRT)](howto-run-instance-readiness-testing.md)
350+
351+
</details>
