Commit f6a3487

Merge pull request #298849 from bartpinto/bpinto-cluster-upgrade-template
New Nexus Cluster Upgrade Step-by-Step Template
2 parents db8c3bc + 1497452 commit f6a3487

File tree

2 files changed

+335
-0
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -163,6 +163,8 @@
163163
href: howto-cluster-runtime-upgrade.md
164164
- name: Cluster Upgrades With PauseAfterRack Strategy
165165
href: howto-cluster-runtime-upgrade-with-pauseafterrack-strategy.md
166+
- name: Cluster Upgrades Template
167+
href: howto-cluster-runtime-upgrade-template.md
166168
- name: Network Fabric Upgrades
167169
href: howto-upgrade-nexus-fabric.md
168170
- name: Network Fabric Upgrades Template
Lines changed: 333 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,333 @@
1+
---
2+
title: "Azure Operator Nexus: Cluster runtime upgrade template"
3+
description: Learn the process for upgrading an Operator Nexus Cluster with a step-by-step parameterized template.
4+
author: bartpinto
5+
ms.author: bpinto
6+
ms.service: azure-operator-nexus
7+
ms.date: 04/24/2025
8+
ms.topic: how-to
9+
ms.custom: azure-operator-nexus, template-include
10+
---
11+
12+
# Cluster runtime upgrade template
13+
14+
This how-to guide provides a step-by-step template for upgrading a Nexus Cluster. It's designed to help users manage a reproducible end-to-end upgrade through Azure APIs and standard operating procedures. Regular updates are crucial for maintaining system integrity and accessing the latest product improvements.
15+
16+
## Overview
17+
18+
**Runtime bundle components**: These components require operator consent for upgrades that may affect traffic behavior or necessitate server reboots. Nexus Cluster's design allows for updates to be applied while maintaining continuous workload operation.
19+
20+
Runtime changes are categorized as follows:
21+
- **Firmware/BIOS/BMC updates**: Necessary to support new server control features and resolve security issues.
22+
- **Operating system updates**: Necessary to support new operating system features and resolve security issues.
23+
- **Platform updates**: Necessary to support new platform features and resolve security issues.
24+
25+
## Prerequisites
26+
27+
- Install the latest version of [Azure CLI](https://aka.ms/azcli).
28+
- The latest `networkcloud` CLI extension is required. It can be installed following the steps listed in [Install CLI Extension](howto-install-cli-extensions.md).
29+
- Subscription access to run the Azure Operator Nexus Network Fabric (NF) and Network Cloud (NC) CLI extension commands.
30+
- The target Cluster must be healthy and in a `Running` state.
31+
32+
## Required Parameters
33+
- \<ENVIRONMENT\>: Instance Name
34+
- <AZURE_REGION>: Azure Region of Instance
35+
- <CUSTOMER_SUB_NAME>: Subscription Name
36+
- <CUSTOMER_SUB_ID>: Subscription ID
37+
- <CLUSTER_NAME>: Cluster Name
38+
- <CLUSTER_RG>: Cluster Resource Group
39+
- <CLUSTER_RID>: Cluster ARM ID
40+
- <CLUSTER_MRG>: Cluster Managed Resource Group
41+
- <CLUSTER_CONTROL_BMM>: Cluster Control plane baremetalmachine
42+
- <CLUSTER_VERSION>: Runtime version for upgrade
43+
- <START_TIME>: Planned start time of upgrade
44+
- \<DURATION\>: Estimated Duration of upgrade
45+
- <DEPLOYMENT_THRESHOLD>: Compute deployment threshold
46+
- <DEPLOYMENT_PAUSE_MINS>: Time to wait before moving to the next Rack once the current Rack meets the deployment threshold
47+
- <NFC_NAME>: Associated Network Fabric Controller (NFC)
48+
- <CM_NAME>: Associated Cluster Manager (CM)
- <CM_RG>: Cluster Manager Resource Group
49+
- <BMM_ISSUE_LIST>: List of BMM with provisioning issues after Cluster upgrade is complete
50+
51+
## Pre-Checks
52+
53+
1. Validate the provisioning and detailed status for the CM and Cluster.
54+
55+
Set up the subscription, CM, and Cluster parameters:
56+
```
57+
export SUBSCRIPTION_ID=<CUSTOMER_SUB_ID>
58+
export CM_RG=<CM_RG>
59+
export CM_NAME=<CM_NAME>
60+
export CLUSTER_RG=<CLUSTER_RG>
61+
export CLUSTER_NAME=<CLUSTER_NAME>
62+
export CLUSTER_RID=<CLUSTER_RID>
63+
export CLUSTER_MRG=<CLUSTER_MRG>
64+
export THRESHOLD=<DEPLOYMENT_THRESHOLD>
65+
export PAUSE_MINS=<DEPLOYMENT_PAUSE_MINS>
export CLUSTER_VERSION=<CLUSTER_VERSION>
66+
```
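
Because every later command depends on these variables, it can help to confirm they're all populated before going further. The following is a minimal sketch, assuming a Bash shell; `require_vars` is a hypothetical helper and not part of the Azure CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Azure CLI): return non-zero if any named
# variable is unset, empty, or still holds a <PLACEHOLDER> value.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    value="${!name:-}"   # indirect expansion: value of the variable named by $name
    if [[ -z "$value" || "$value" == "<"*">" ]]; then
      echo "missing or placeholder value: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: stop before running any az command if a parameter is missing.
# require_vars SUBSCRIPTION_ID CM_RG CM_NAME CLUSTER_RG CLUSTER_NAME \
#   CLUSTER_RID CLUSTER_MRG THRESHOLD PAUSE_MINS || exit 1
```

The helper only checks that values are present; it doesn't verify that they refer to real Azure resources.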
67+
68+
Check that the CM `Provisioning state` is `Succeeded`:
69+
```
70+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
71+
```
72+
73+
Check that the Cluster `Detailed status` is `Running`:
74+
```
75+
az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table
76+
```
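
The two checks above can also be combined into a scripted gate. This is a sketch under stated assumptions: `precheck_gate` is a hypothetical function, and the `provisioningState` and `detailedStatus` property names used in the commented `--query` examples are assumed from the other queries in this guide, so confirm them against the JSON output of the `show` commands.

```shell
#!/usr/bin/env bash
# Hypothetical gate: succeed only when both pre-check values match the states
# this guide requires before an upgrade.
precheck_gate() {
  local cm_state="$1" cluster_status="$2"
  if [[ "$cm_state" != "Succeeded" ]]; then
    echo "CM provisioning state is '$cm_state', expected 'Succeeded'" >&2
    return 1
  fi
  if [[ "$cluster_status" != "Running" ]]; then
    echo "Cluster detailed status is '$cluster_status', expected 'Running'" >&2
    return 1
  fi
  echo "pre-checks passed"
}

# Example wiring (property names in --query are assumptions):
# cm_state="$(az networkcloud clustermanager show -g "$CM_RG" --resource-name "$CM_NAME" \
#   --subscription "$SUBSCRIPTION_ID" --query provisioningState -o tsv)"
# cluster_status="$(az networkcloud cluster show -g "$CLUSTER_RG" --resource-name "$CLUSTER_NAME" \
#   --subscription "$SUBSCRIPTION_ID" --query detailedStatus -o tsv)"
# precheck_gate "$cm_state" "$cluster_status" || exit 1
```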
77+
78+
>[!Note]
79+
> If the CM `Provisioning state` isn't `Succeeded` or the Cluster `Detailed status` isn't `Running`, stop the upgrade until the issues are resolved.
80+
81+
2. Check that the Bare Metal Machine (BMM) `Detailed status` is `Provisioned`:
82+
```
83+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
84+
```
85+
86+
Validate the following resource states for each BMM (except spare):
87+
- ReadyState: True
88+
- ProvisioningState: Succeeded
89+
- DetailedStatus: Provisioned
90+
- CordonStatus: Uncordoned
91+
- PowerState: On
92+
93+
One control-plane BMM is labeled as a spare with the following BMM status profile:
94+
- ReadyState: False
95+
- ProvisioningState: Succeeded
96+
- DetailedStatus: Available
97+
- CordonStatus: Uncordoned
98+
- PowerState: Off
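
The two status profiles above can be expressed as a small classifier for scripted checks. This is a sketch, not an official tool: `classify_bmm` is a hypothetical function, and it assumes the five values are passed exactly as spelled in the lists above (the capitalization of `readyState` in other output formats may differ).

```shell
#!/usr/bin/env bash
# Hypothetical classifier for one BMM row. Arguments, in order:
#   readyState provisioningState detailedStatus cordonStatus powerState
# Prints "active", "spare", or "issue" based on the two profiles listed above.
classify_bmm() {
  local profile="$1|$2|$3|$4|$5"
  case "$profile" in
    "True|Succeeded|Provisioned|Uncordoned|On")  echo "active" ;;
    "False|Succeeded|Available|Uncordoned|Off")  echo "spare" ;;
    *)                                           echo "issue" ;;
  esac
}

# Example:
# classify_bmm True Succeeded Provisioned Uncordoned On
```

A pre-check would expect exactly one `spare` result among the control-plane machines and no `issue` results.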
99+
100+
3. Collect a profile of the tenant workloads:
101+
```
102+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
103+
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
104+
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
105+
```
106+
107+
4. Review Operator Nexus Release notes for required checks and configuration updates not included in this document.
108+
109+
## Send notification to Operations of upgrade schedule for the Cluster
110+
111+
The following template can be used through email or support ticket:
112+
```
113+
Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
114+
115+
Operations Support:
116+
117+
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
118+
119+
Subscription: <CUSTOMER_SUB_ID>
120+
NFC: <NFC_NAME>
121+
CM: <CM_NAME>
122+
Fabric: <NF_NAME>
123+
Cluster: <CLUSTER_NAME>
124+
Region: <AZURE_REGION>
125+
Version: <NEXUS_VERSION>
126+
127+
CC: stakeholder-list
128+
```
129+
130+
## Add resource tag on Cluster resource in Azure portal
131+
To help track upgrades, add a tag to the Cluster resource in Azure portal (optional):
132+
```
133+
|Name | Value |
134+
|----------------|----------------|
135+
|BF in progress |<DE_ID> |
136+
```
137+
138+
## Set deployment strategy and Compute threshold on Cluster if different from default
139+
By default, 80% of the Compute BMMs in a Rack must pass hardware validation and provisioning before the upgrade moves on, and the upgrade pauses for one minute between Racks.
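
To make the threshold concrete, the following sketch computes how many Compute BMMs a Rack needs to satisfy a percentage. Both functions are hypothetical, and the rounding shown (rounding up to the next whole machine) is an assumption for illustration; this guide doesn't state how the service rounds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: smallest number of Compute BMMs that satisfies a
# percentage threshold. Arguments: total-compute-bmms threshold-percent
min_bmm_for_threshold() {
  local total="$1" threshold="$2"
  # Integer ceiling of total * threshold / 100.
  echo $(( (total * threshold + 99) / 100 ))
}

# Hypothetical check: does a count of successful BMMs meet the threshold?
# Arguments: successful-count total-compute-bmms threshold-percent
meets_threshold() {
  local ok="$1" total="$2" threshold="$3"
  (( ok * 100 >= total * threshold ))
}

# Example: a Rack with 16 Compute BMMs and the default 80% threshold.
# min_bmm_for_threshold 16 80    # prints 13
```

Under these assumptions, a Rack of 16 Compute BMMs at 80% tolerates up to three machines that fail validation or provisioning.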
140+
141+
The following `strategy-type` values are available for `update-strategy`:
142+
* `Rack` - Upgrade one Rack at a time and move to the next Rack once the Compute threshold is met for the current Rack. Pause for <DEPLOYMENT_PAUSE_MINS> minutes before starting the next Rack.
143+
* `PauseAfterRack` - Wait for user API response to continue to the next Rack once the Compute threshold is met for the current Rack.
144+
145+
If `updateStrategy` isn't set, the default values are as follows:
146+
```
147+
"updateStrategy": {
148+
"maxUnavailable": 32767,
149+
"strategyType": "Rack",
150+
"thresholdType": "PercentSuccess",
151+
"thresholdValue": 80,
152+
"waitTimeMinutes": 1
153+
}
154+
```
155+
156+
### Set a deployment threshold and wait time different from the default
157+
```
158+
az networkcloud cluster update -n $CLUSTER_NAME -g $CLUSTER_RG --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value=$THRESHOLD wait-time-minutes=$PAUSE_MINS --subscription $SUBSCRIPTION_ID
159+
```
160+
>[!Important]
161+
> If 100% threshold is required, review the BMM status reported during pre-checks and make sure all BMM are healthy before proceeding with the upgrade.
162+
163+
Verify update:
164+
```
165+
az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID | grep -A5 updateStrategy
166+
"updateStrategy": {
167+
"maxUnavailable": 32767,
168+
"strategyType": "Rack",
169+
"thresholdType": "PercentSuccess",
170+
"thresholdValue": $THRESHOLD,
171+
"waitTimeMinutes": $PAUSE_MINS
172+
}
173+
```
174+
175+
### How to run Cluster upgrade with `PauseAfterRack` Strategy
176+
177+
The `PauseAfterRack` strategy lets the customer control the upgrade: after each Compute Rack completes to the configured threshold, an API call is required to continue to the next Rack.
178+
179+
To configure strategy to use `PauseAfterRack`:
180+
```
181+
az networkcloud cluster update -n $CLUSTER_NAME -g $CLUSTER_RG --update-strategy strategy-type="PauseAfterRack" wait-time-minutes=0 threshold-type="PercentSuccess" threshold-value=$THRESHOLD --subscription $SUBSCRIPTION_ID
182+
```
183+
184+
Verify update:
185+
```
186+
az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID | grep -A5 updateStrategy
187+
"updateStrategy": {
188+
"maxUnavailable": 32767,
189+
"strategyType": "PauseAfterRack",
190+
"thresholdType": "PercentSuccess",
191+
"thresholdValue": $THRESHOLD,
192+
"waitTimeMinutes": 0
193+
```
194+
195+
## Run upgrade from either Azure portal or Azure CLI
196+
* To start the upgrade from Azure portal, go to the Cluster resource, select `Update`, choose <CLUSTER_VERSION>, then select `Update`.
197+
* To run upgrade from Azure CLI, run the following command:
198+
```
199+
az networkcloud cluster update-version --subscription $SUBSCRIPTION_ID --cluster-name $CLUSTER_NAME --target-cluster-version $CLUSTER_VERSION --resource-group $CLUSTER_RG --no-wait --debug
200+
```
201+
202+
Gather the ASYNC URL and correlation ID from the `--debug` output in case they're needed for troubleshooting:
203+
```
204+
cli.azure.cli.core.sdk.policies: 'mise-correlation-id': '<MISE_CID>'
205+
cli.azure.cli.core.sdk.policies: 'x-ms-correlation-request-id': '<CORRELATION_ID>'
206+
cli.azure.cli.core.sdk.policies: 'Azure-AsyncOperation': '<ASYNC_URL>'
207+
```
208+
Provide this information to Microsoft Support when opening a support ticket for upgrade issues.
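
If the `--debug` output was saved to a file, the values can be pulled out with standard text tools. The sketch below assumes the log lines have the shape shown above; `debug_header_value` and the `upgrade-debug.log` file name are hypothetical, and the example assumes the debug output is written to stderr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last value logged for a response header in
# saved --debug output. Argument: header name. Reads the log from stdin.
debug_header_value() {
  local header="$1"
  grep -F "'${header}':" \
    | sed -E "s/.*'${header}': '([^']*)'.*/\1/" \
    | tail -n 1
}

# Example (file name is illustrative):
# az networkcloud cluster update-version ... --no-wait --debug 2> upgrade-debug.log
# debug_header_value 'Azure-AsyncOperation' < upgrade-debug.log
# debug_header_value 'x-ms-correlation-request-id' < upgrade-debug.log
```

Taking the last match is a design choice for logs that contain several requests; review the full log if the values look unrelated to the upgrade call.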
209+
210+
## Monitor status of Cluster
211+
```
212+
az networkcloud cluster list -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID -o table
213+
```
214+
When the upgrade is complete, the Cluster `Detailed status` shows `Running` and the `Detailed status message` shows `Cluster is up and running.`
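
Rather than rerunning the status command by hand, a simple polling loop can wait for the expected value. This is a minimal sketch: `wait_for_status` is a hypothetical function, the interval and attempt count are illustrative, and the `detailedStatus` property name in the commented example is an assumption to confirm against the command's JSON output.

```shell
#!/usr/bin/env bash
# Hypothetical poller: run a command repeatedly until its output equals the
# expected status. Arguments: expected-status interval-seconds max-attempts command...
wait_for_status() {
  local expected="$1" interval="$2" attempts="$3"
  shift 3
  local i status
  for ((i = 1; i <= attempts; i++)); do
    status="$("$@" 2>/dev/null)" || status=""
    if [[ "$status" == "$expected" ]]; then
      return 0
    fi
    echo "attempt $i/$attempts: status is '${status:-unknown}', waiting ${interval}s" >&2
    sleep "$interval"
  done
  return 1
}

# Example: check every 5 minutes for up to 4 hours.
# wait_for_status Running 300 48 \
#   az networkcloud cluster show -g "$CLUSTER_RG" -n "$CLUSTER_NAME" \
#   --subscription "$SUBSCRIPTION_ID" --query detailedStatus -o tsv
```

A non-zero return only means the expected status wasn't seen within the attempts given; it doesn't by itself indicate that the upgrade failed.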
215+
216+
## Monitor status of Bare Metal Machines
217+
```
218+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID -o table
219+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
220+
```
221+
222+
Validate the following states for each BMM (except spare):
223+
- ReadyState: True
224+
- ProvisioningState: Succeeded
225+
- DetailedStatus: Provisioned
226+
- CordonStatus: Uncordoned
227+
- PowerState: On
228+
- KubernetesVersion: <NEW_VERSION>
229+
- MachineClusterVersion: <NEXUS_VERSION>
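
The list above can be turned into a quick filter that produces candidates for <BMM_ISSUE_LIST>. The sketch assumes tab-separated input with the columns `name`, `detailedStatus`, and `machineClusterVersion`; `list_bmm_issues` is a hypothetical function, and treating every `Available` row as the spare is a simplifying assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read tab-separated rows of
#   name <TAB> detailedStatus <TAB> machineClusterVersion
# from stdin and print the names that aren't Provisioned at the expected version.
# Rows with detailedStatus "Available" are treated as the spare and skipped.
list_bmm_issues() {
  local expected_version="$1"
  awk -F '\t' -v want="$expected_version" '
    $2 == "Available" { next }
    # Appending "" forces string comparison of the version values.
    $2 != "Provisioned" || ($3 "") != (want "") { print $1 }
  '
}

# Example:
# az networkcloud baremetalmachine list -g "$CLUSTER_MRG" --subscription "$SUBSCRIPTION_ID" \
#   --query "[].[name,detailedStatus,machineClusterVersion]" -o tsv \
#   | list_bmm_issues "<NEXUS_VERSION>"
```

Review the result against the full table output before tagging machines or reporting them, because the filter checks only two of the listed states.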
230+
231+
Add a tag to the BMM resource to track any BMM that fails to complete provisioning (optional):
232+
```
233+
|Name | Value |
234+
|--------------------|----------------|
235+
|BF provision issue |<DE_ID> |
236+
```
237+
238+
## How to continue upgrade during `PauseAfterRack` strategy
239+
Once a compute Rack meets the success threshold, the upgrade pauses until the user signals to the operator to continue the upgrade.
240+
241+
Use the following command to continue upgrade once a Compute Rack is paused after meeting the deployment threshold for the Rack:
242+
```
243+
az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME --subscription $SUBSCRIPTION_ID
244+
```
245+
## How to troubleshoot Cluster and BMM upgrade failures
246+
The following troubleshooting documents can help you recover from BMM upgrade issues:
247+
- [Hardware validation failures](troubleshoot-hardware-validation-failure.md)
248+
- [BMM Provisioning issues](troubleshoot-bare-metal-machine-provisioning.md)
249+
- [BMM Degraded Status](troubleshoot-bare-metal-machine-degraded.md)
250+
- [BMM Warning Status](troubleshoot-bare-metal-machine-warning.md)
251+
252+
If troubleshooting doesn't resolve the issue, open a Microsoft support ticket:
253+
- Collect any errors in the Azure CLI output.
254+
- Collect Cluster and BMM operation state from Azure portal or Azure CLI.
255+
- Create an Azure Support Request for any Cluster or BMM upgrade failures and attach any errors along with the ASYNC URL, correlation ID, and operation state of the Cluster and BMMs.
256+
257+
## Post-upgrade validation
258+
Run the following commands to check the status of the CM, Cluster, and BMM:
259+
260+
1. Check that the CM `Provisioning state` is `Succeeded`:
261+
```
262+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
263+
```
264+
265+
2. Check that the Cluster `Detailed status` is `Running`:
266+
```
267+
az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table
268+
```
269+
270+
3. Check the Bare Metal Machine status:
271+
```
272+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
273+
```
274+
275+
Validate the following resource states for each BMM (except spare):
276+
- ReadyState: True
277+
- ProvisioningState: Succeeded
278+
- DetailedStatus: Provisioned
279+
- CordonStatus: Uncordoned
280+
- PowerState: On
281+
282+
>[!Note]
283+
> One control-plane BMM is labeled as a spare and is inactive.
284+
285+
4. Collect a profile of the tenant workloads:
286+
```
287+
az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
288+
az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
289+
az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
290+
```
291+
292+
## Send notification to Operations of Cluster upgrade completion
293+
294+
The following template can be used through email or ticketing system:
295+
```
296+
Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> Runtime <CLUSTER_VERSION> Upgrade Complete
297+
298+
Operations:
299+
Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime <CLUSTER_VERSION> Upgrade Complete
300+
301+
Subscription: <CUSTOMER_SUB_ID>
302+
NFC: <NFC_NAME>
303+
CM: <CM_NAME>
304+
Fabric: <NF_NAME>
305+
Cluster: <CLUSTER_NAME>
306+
Region: <AZURE_REGION>
307+
Version: <NEXUS_VERSION>
308+
309+
The following is a list of BMM with provisioning issues during upgrade:
310+
<BMM_ISSUE_LIST>
311+
312+
CC: stakeholder-list
313+
```
314+
315+
## Remove resource tag on Cluster resource in Azure portal
316+
Remove the resource tag on the Cluster resource tracking the upgrade in Azure portal (if added previously):
317+
```
318+
|Name | Value |
319+
|----------------|----------------|
320+
|BF in progress |<DE_ID> |
321+
```
322+
323+
## Close out any Work Items in your ticketing system
324+
* Update Task hours for upgrade duration.
325+
* Set Cluster upgrade work item to `Complete`.
326+
* Add any notes on support tickets and issues encountered during upgrade.
327+
328+
## Links
329+
- [Azure portal](https://aka.ms/nexus-portal)
330+
- [Cluster Upgrade](howto-cluster-runtime-upgrade.md)
331+
- [Cluster Upgrade with PauseAfterRack](howto-cluster-runtime-upgrade-with-pauseafterrack-strategy.md)
332+
- [Azure CLI](https://aka.ms/azcli)
333+
- [Install CLI Extension](howto-install-cli-extensions.md)
