|
| 1 | +--- |
| 2 | +title: "Azure Operator Nexus: Cluster runtime upgrade template" |
| 3 | +description: Learn how to upgrade an Operator Nexus Cluster by using a step-by-step parameterized template. |
| 4 | +author: bartpinto |
| 5 | +ms.author: bpinto |
| 6 | +ms.service: azure-operator-nexus |
| 7 | +ms.date: 04/24/2025 |
| 8 | +ms.topic: how-to |
| 9 | +ms.custom: azure-operator-nexus, template-include |
| 10 | +--- |
| 11 | + |
| 12 | +# Cluster runtime upgrade template |
| 13 | + |
| 14 | +This how-to guide provides a step-by-step template for upgrading a Nexus Cluster. It's designed to help users manage a reproducible end-to-end upgrade through Azure APIs and standard operating procedures. Regular updates are crucial for maintaining system integrity and accessing the latest product improvements. |
| 15 | + |
| 16 | +## Overview |
| 17 | + |
| 18 | +**Runtime bundle components**: These components require operator consent for upgrades that may affect traffic behavior or necessitate server reboots. Nexus Cluster's design allows for updates to be applied while maintaining continuous workload operation. |
| 19 | + |
| 20 | +Runtime changes are categorized as follows: |
| 21 | +- **Firmware/BIOS/BMC updates**: Necessary to support new server control features and resolve security issues. |
| 22 | +- **Operating system updates**: Necessary to support new operating system features and resolve security issues. |
| 23 | +- **Platform updates**: Necessary to support new platform features and resolve security issues. |
| 24 | + |
| 25 | +## Prerequisites |
| 26 | + |
| 27 | +- Install the latest version of [Azure CLI](https://aka.ms/azcli). |
| 28 | +- The latest `networkcloud` CLI extension is required. It can be installed following the steps listed in [Install CLI Extension](howto-install-cli-extensions.md). |
| 29 | +- Subscription access to run the Azure Operator Nexus Network Fabric (NF) and Network Cloud (NC) CLI extension commands. |
| 30 | +- The target Cluster must be healthy and in a running state. |
| 31 | + |
| 32 | +## Required Parameters |
| 33 | +- \<ENVIRONMENT\>: Instance Name |
| 34 | +- <AZURE_REGION>: Azure Region of Instance |
| 35 | +- <CUSTOMER_SUB_NAME>: Subscription Name |
| 36 | +- <CUSTOMER_SUB_ID>: Subscription ID |
| 37 | +- <CLUSTER_NAME>: Cluster Name |
| 38 | +- <CLUSTER_RG>: Cluster Resource Group |
| 39 | +- <CLUSTER_RID>: Cluster ARM ID |
| 40 | +- <CLUSTER_MRG>: Cluster Managed Resource Group |
| 41 | +- <CLUSTER_CONTROL_BMM>: Cluster Control plane baremetalmachine |
| 42 | +- <CLUSTER_VERSION>: Runtime version for upgrade |
| 43 | +- <START_TIME>: Planned start time of upgrade |
| 44 | +- \<DURATION\>: Estimated Duration of upgrade |
| 45 | +- <DEPLOYMENT_THRESHOLD>: Compute deployment threshold |
| 46 | +- <DEPLOYMENT_PAUSE_MINS>: Time to wait before moving to the next Rack once the current Rack meets the deployment threshold |
| 47 | +- <NFC_NAME>: Associated Network Fabric Controller (NFC) |
| 48 | +- <CM_NAME>: Associated Cluster Manager (CM) |
| 49 | +- <BMM_ISSUE_LIST>: List of BMM with provisioning issues after Cluster upgrade is complete |
| 50 | + |
| 51 | +## Pre-Checks |
| 52 | + |
| 53 | +1. Validate the provisioning and detailed status for the CM and Cluster. |
| 54 | + |
| 55 | + Set up the subscription, CM, and Cluster parameters: |
| 56 | + ``` |
| 57 | + export SUBSCRIPTION_ID=<CUSTOMER_SUB_ID> |
| 58 | + export CM_RG=<CM_RG> |
| 59 | + export CM_NAME=<CM_NAME> |
| 60 | + export CLUSTER_RG=<CLUSTER_RG> |
| 61 | + export CLUSTER_NAME=<CLUSTER_NAME> |
| 62 | + export CLUSTER_RID=<CLUSTER_RID> |
| 63 | + export CLUSTER_MRG=<CLUSTER_MRG> |
| 64 | + export THRESHOLD=<DEPLOYMENT_THRESHOLD> |
| 65 | + export PAUSE_MINS=<DEPLOYMENT_PAUSE_MINS> |
| 66 | + ``` |
| 67 | + |
| 68 | + Check that the CM is in `Succeeded` for `Provisioning state`: |
| 69 | + ``` |
| 70 | + az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table |
| 71 | + ``` |
| 72 | + |
| 73 | + Check the Cluster status `Detailed status` is `Running`: |
| 74 | + ``` |
| 75 | + az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table |
| 76 | + ``` |
| 77 | + |
| 78 | + >[!Note] |
| 79 | + > If the CM `Provisioning state` isn't `Succeeded` or the Cluster `Detailed status` isn't `Running`, stop the upgrade until the issues are resolved. |
| 80 | +
|
| 81 | +2. Check that the Bare Metal Machine (BMM) `Detailed status` is `Provisioned`: |
| 82 | + ``` |
| 83 | + az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table |
| 84 | + ``` |
| 85 | + |
| 86 | + Validate the following resource states for each BMM (except spare): |
| 87 | + - ReadyState: True |
| 88 | + - ProvisioningState: Succeeded |
| 89 | + - DetailedStatus: Provisioned |
| 90 | + - CordonStatus: Uncordoned |
| 91 | + - PowerState: On |
| 92 | + |
| 93 | + One control-plane BMM is labeled as a spare with the following BMM status profile: |
| 94 | + - ReadyState: False |
| 95 | + - ProvisioningState: Succeeded |
| 96 | + - DetailedStatus: Available |
| 97 | + - CordonStatus: Uncordoned |
| 98 | + - PowerState: Off |
| 99 | + |
| 100 | +3. Collect a profile of the tenant workloads: |
| 101 | + ``` |
| 102 | + az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table |
| 103 | + az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table |
| 104 | + az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table |
| 105 | + ``` |
| 106 | + |
| 107 | +4. Review Operator Nexus Release notes for required checks and configuration updates not included in this document. |
| 108 | + |
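The parameters exported in step 1 are reused by every later command, so an unset variable can silently produce a malformed `az` call. The following is a minimal sketch of a guard, assuming a POSIX shell. The function name `require_vars` is hypothetical and isn't part of any Azure tooling:

```shell
# Hypothetical helper: report any named variable that is unset or empty.
# Returns non-zero if at least one variable is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call using the variable names exported in step 1:
# require_vars SUBSCRIPTION_ID CM_RG CM_NAME CLUSTER_RG CLUSTER_NAME \
#   CLUSTER_RID CLUSTER_MRG THRESHOLD PAUSE_MINS || echo "Set the missing parameters first."
```

The helper only checks that each value is non-empty. It doesn't verify that the values refer to existing Azure resources.
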
| 109 | +## Send notification to Operations of upgrade schedule for the Cluster |
| 110 | + |
| 111 | +The following template can be used through email or support ticket: |
| 112 | +``` |
| 113 | +Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION> |
| 114 | +
|
| 115 | +Operations Support: |
| 116 | +
|
| 117 | +Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION> |
| 118 | +
|
| 119 | +Subscription: <CUSTOMER_SUB_ID> |
| 120 | +NFC: <NFC_NAME> |
| 121 | +CM: <CM_NAME> |
| 122 | +Fabric: <NF_NAME> |
| 123 | +Cluster: <CLUSTER_NAME> |
| 124 | +Region: <AZURE_REGION> |
| 125 | +Version: <NEXUS_VERSION> |
| 126 | +
|
| 127 | +CC: stakeholder-list |
| 128 | +``` |
| 129 | + |
| 130 | +## Add resource tag on Cluster resource in Azure portal |
| 131 | +To help track upgrades, add a tag to the Cluster resource in Azure portal (optional): |
| 132 | +``` |
| 133 | +|Name | Value | |
| 134 | +|----------------|----------------- |
| 135 | +|BF in progress |<DE_ID> | |
| 136 | +``` |
| 137 | + |
| 138 | +## Set deployment strategy and Compute threshold on Cluster if different from default |
| 139 | +The default threshold for the percent of Compute BMM to pass hardware validation and provisioning is 80% with a default pause between Racks of one minute. |
| 140 | + |
| 141 | +The following settings are available for `update-strategy`: |
| 142 | +* `Rack` - Upgrade each Rack one at a time and move to the next Rack once the Compute threshold is met for the current Rack. Pause for <DEPLOYMENT_PAUSE_MINS> before starting next Rack. |
| 143 | +* `PauseAfterRack` - Wait for user API response to continue to the next Rack once the Compute threshold is met for the current Rack. |
| 144 | + |
| 145 | +If `updateStrategy` isn't set, the default values are as follows: |
| 146 | +``` |
| 147 | +"updateStrategy": { |
| 148 | + "maxUnavailable": 32767, |
| 149 | + "strategyType": "Rack", |
| 150 | + "thresholdType": "PercentSuccess", |
| 151 | + "thresholdValue": 80, |
| 152 | + "waitTimeMinutes": 1 |
| 153 | +} |
| 154 | +``` |
| 155 | + |
| 156 | +### Set a deployment threshold and wait time different than default |
| 157 | +``` |
| 158 | +az networkcloud cluster update -n $CLUSTER_NAME -g $CLUSTER_RG --update-strategy strategy-type="Rack" threshold-type="PercentSuccess" threshold-value=$THRESHOLD wait-time-minutes=$PAUSE_MINS --subscription $SUBSCRIPTION_ID |
| 159 | +``` |
| 160 | +>[!Important] |
| 161 | +> If 100% threshold is required, review the BMM status reported during pre-checks and make sure all BMM are healthy before proceeding with the upgrade. |
| 162 | +
|
| 163 | +Verify update: |
| 164 | +``` |
| 165 | +az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID| grep -A5 updateStrategy |
| 166 | +"updateStrategy": { |
| 167 | + "maxUnavailable": 32767, |
| 168 | + "strategyType": "Rack", |
| 169 | + "thresholdType": "PercentSuccess", |
| 170 | + "thresholdValue": $THRESHOLD, |
| 171 | + "waitTimeMinutes": $PAUSE_MINS |
| 172 | +} |
| 173 | +``` |
| 174 | + |
| 175 | +### How to run Cluster upgrade with `PauseAfterRack` Strategy |
| 176 | + |
| 177 | +The `PauseAfterRack` strategy allows the customer to control the upgrade by requiring an API call to continue to the next Rack after each Compute Rack completes to the configured threshold. |
| 178 | + |
| 179 | +To configure strategy to use `PauseAfterRack`: |
| 180 | +``` |
| 181 | +az networkcloud cluster update -n $CLUSTER_NAME -g $CLUSTER_RG --update-strategy strategy-type="PauseAfterRack" wait-time-minutes=0 threshold-type="PercentSuccess" threshold-value=$THRESHOLD --subscription $SUBSCRIPTION_ID |
| 182 | +``` |
| 183 | + |
| 184 | +Verify update: |
| 185 | +``` |
| 186 | +az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID | grep -A5 updateStrategy |
| 187 | + "updateStrategy": { |
| 188 | + "maxUnavailable": 32767, |
| 189 | + "strategyType": "PauseAfterRack", |
| 190 | + "thresholdType": "PercentSuccess", |
| 191 | + "thresholdValue": $THRESHOLD, |
| 192 | + "waitTimeMinutes": 0 |
| 193 | +``` |
| 194 | + |
| 195 | +## Run upgrade from either portal or CLI |
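| 196 | +* To start the upgrade from the Azure portal, go to the Cluster resource, click `Update`, select <CLUSTER_VERSION>, then click `Update`. |
| 197 | +* To run the upgrade from Azure CLI, set `CLUSTER_VERSION` to <CLUSTER_VERSION> (for example, `export CLUSTER_VERSION=<CLUSTER_VERSION>`), then run the following command: |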
| 198 | + ``` |
| 199 | + az networkcloud cluster update-version --subscription $SUBSCRIPTION_ID --cluster-name $CLUSTER_NAME --target-cluster-version $CLUSTER_VERSION --resource-group $CLUSTER_RG --no-wait --debug |
| 200 | + ``` |
| 201 | + |
| 202 | + Gather ASYNC URL and Correlation ID info for further troubleshooting if needed. |
| 203 | + ``` |
| 204 | + cli.azure.cli.core.sdk.policies: 'mise-correlation-id': '<MISE_CID>' |
| 205 | + cli.azure.cli.core.sdk.policies: 'x-ms-correlation-request-id': '<CORRELATION_ID>' |
| 206 | + cli.azure.cli.core.sdk.policies: 'Azure-AsyncOperation': '<ASYNC_URL>' |
| 207 | + ``` |
| 208 | + Provide this information to Microsoft Support when opening a support ticket for upgrade issues. |
| 209 | + |
| 210 | +## Monitor status of Cluster |
| 211 | +``` |
| 212 | +az networkcloud cluster list -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID -o table |
| 213 | +``` |
| 214 | +The Cluster `Detailed status` shows `Running` and the `Detailed status message` shows `Cluster is up and running.` when the upgrade is complete. |
| 215 | + |
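Rather than rerunning the status command by hand, the check can be wrapped in a polling loop. The following is a minimal sketch, assuming a POSIX shell. The function name `wait_for_value` is hypothetical, and the example call assumes the Cluster resource exposes a `detailedStatus` property that can be selected with `--query`:

```shell
# Hypothetical helper: run a command repeatedly until its output equals the
# expected value, or give up after a maximum number of attempts.
# Usage: wait_for_value EXPECTED MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
wait_for_value() {
  expected=$1
  max_attempts=$2
  sleep_seconds=$3
  shift 3
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Capture the command output; a failed call is treated as "no value yet".
    current=$("$@" 2>/dev/null) || current=""
    echo "Attempt $attempt/$max_attempts: ${current:-<no value>}"
    if [ "$current" = "$expected" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
  return 1
}

# Example (assumption: `detailedStatus` is the property behind `Detailed status`):
# wait_for_value Running 120 60 az networkcloud cluster show \
#   -g "$CLUSTER_RG" -n "$CLUSTER_NAME" --subscription "$SUBSCRIPTION_ID" \
#   --query detailedStatus -o tsv
```

The attempt count and sleep interval in the example are placeholders. Choose values that match the planned <DURATION> of the upgrade.
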
| 216 | +## Monitor status of Bare Metal Machines |
| 217 | +``` |
| 218 | +az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID -o table |
| 219 | +az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table |
| 220 | +``` |
| 221 | + |
| 222 | +Validate the following states for each BMM (except spare): |
| 223 | +- ReadyState: True |
| 224 | +- ProvisioningState: Succeeded |
| 225 | +- DetailedStatus: Provisioned |
| 226 | +- CordonStatus: Uncordoned |
| 227 | +- PowerState: On |
| 228 | +- KubernetesVersion: <NEW_VERSION> |
| 229 | +- MachineClusterVersion: <NEXUS_VERSION> |
| 230 | + |
| 231 | +Add a Tag to the BMM resource to track any BMM that fails to complete provisioning (optional): |
| 232 | +``` |
| 233 | +|Name | Value | |
| 234 | +|--------------------|----------------- |
| 235 | +|BF provision issue |<DE_ID> | |
| 236 | +``` |
| 237 | + |
| 238 | +## How to continue upgrade during `PauseAfterRack` strategy |
| 239 | +Once a compute Rack meets the success threshold, the upgrade pauses until the user signals to the operator to continue the upgrade. |
| 240 | + |
| 241 | +Use the following command to continue upgrade once a Compute Rack is paused after meeting the deployment threshold for the Rack: |
| 242 | +``` |
| 243 | +az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME --subscription $SUBSCRIPTION_ID |
| 244 | +``` |
| 245 | +## How to troubleshoot Cluster and BMM upgrade failures |
| 246 | +The following troubleshooting documents can help recover BMM upgrade issues: |
| 247 | +- [Hardware validation failures](troubleshoot-hardware-validation-failure.md) |
| 248 | +- [BMM Provisioning issues](troubleshoot-bare-metal-machine-provisioning.md) |
| 249 | +- [BMM Degraded Status](troubleshoot-bare-metal-machine-degraded.md) |
| 250 | +- [BMM Warning Status](troubleshoot-bare-metal-machine-warning.md) |
| 251 | + |
| 252 | +If troubleshooting doesn't resolve the issue, open a Microsoft support ticket: |
| 253 | +- Collect any errors in the Azure CLI output. |
| 254 | +- Collect Cluster and BMM operation state from Azure portal or Azure CLI. |
| 255 | +- Create Azure Support Request for any Cluster or BMM upgrade failures and attach any errors along with ASYNC URL, correlation ID, and operation state of the Cluster and BMMs. |
| 256 | + |
| 257 | +## Post-upgrade validation |
| 258 | +Run the following commands to check the status of the CM, Cluster, and BMM: |
| 259 | + |
| 260 | +1. Check that the CM is in `Succeeded` for `Provisioning state`: |
| 261 | + ``` |
| 262 | + az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table |
| 263 | + ``` |
| 264 | + |
| 265 | +2. Check the Cluster status `Detailed status` is `Running`: |
| 266 | + ``` |
| 267 | + az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table |
| 268 | + ``` |
| 269 | + |
| 270 | +3. Check the Bare Metal Machine status: |
| 271 | + ``` |
| 272 | + az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table |
| 273 | + ``` |
| 274 | + |
| 275 | + Validate the following resource states for each BMM (except spare): |
| 276 | + - ReadyState: True |
| 277 | + - ProvisioningState: Succeeded |
| 278 | + - DetailedStatus: Provisioned |
| 279 | + - CordonStatus: Uncordoned |
| 280 | + - PowerState: On |
| 281 | + |
| 282 | + >[!Note] |
| 283 | + > One control-plane BMM is labeled as a spare and is inactive. |
| 284 | +
|
| 285 | +4. Collect a profile of the tenant workloads: |
| 286 | + ``` |
| 287 | + az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table |
| 288 | + az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table |
| 289 | + az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table |
| 290 | + ``` |
| 291 | + |
| 292 | +## Send notification to Operations of Cluster upgrade completion |
| 293 | + |
| 294 | +The following template can be used through email or ticketing system: |
| 295 | +``` |
| 296 | +Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> Runtime <CLUSTER_VERSION> Upgrade Complete |
| 297 | +
|
| 298 | +Operations: |
| 299 | +Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime <CLUSTER_VERSION> Upgrade Complete |
| 300 | +
|
| 301 | +Subscription: <CUSTOMER_SUB_ID> |
| 302 | +NFC: <NFC_NAME> |
| 303 | +CM: <CM_NAME> |
| 304 | +Fabric: <NF_NAME> |
| 305 | +Cluster: <CLUSTER_NAME> |
| 306 | +Region: <AZURE_REGION> |
| 307 | +Version: <NEXUS_VERSION> |
| 308 | +
|
| 309 | +The following is a list of BMM with provisioning issues during upgrade: |
| 310 | +<BMM_ISSUE_LIST> |
| 311 | + |
| 312 | +CC: stakeholder-list |
| 313 | +``` |
| 314 | + |
| 315 | +## Remove resource tag on Cluster resource in Azure portal |
| 316 | +Remove the resource tag on the Cluster resource tracking the upgrade in Azure portal (if added previously): |
| 317 | +``` |
| 318 | +|Name | Value | |
| 319 | +|----------------|----------------- |
| 320 | +|BF in progress |<DE_ID> | |
| 321 | +``` |
| 322 | + |
| 323 | +## Close out any Work Items in your ticketing system |
| 324 | +* Update Task hours for upgrade duration. |
| 325 | +* Set Cluster upgrade work item to `Complete`. |
| 326 | +* Add any notes on support tickets and issues encountered during upgrade. |
| 327 | + |
| 328 | +## Links |
| 329 | +- [Azure portal](https://aka.ms/nexus-portal) |
| 330 | +- [Cluster Upgrade](howto-cluster-runtime-upgrade.md) |
| 331 | +- [Cluster Upgrade with PauseAfterRack](howto-cluster-runtime-upgrade-with-pauseafterrack-strategy.md) |
| 332 | +- [Azure CLI](https://aka.ms/azcli) |
| 333 | +- [Install CLI Extension](howto-install-cli-extensions.md) |