Update howto-cluster-runtime-upgrade-template.md

bartpinto · web-flow · commit 8bceaba138fa · 2025-05-05T10:40:38.000-07:00
diff --git a/articles/operator-nexus/howto-cluster-runtime-upgrade-template.md b/articles/operator-nexus/howto-cluster-runtime-upgrade-template.md
@@ -9,11 +9,13 @@ ms.topic: how-to
 ms.custom: azure-operator-nexus, template-include
 ---
 
-# Cluster runtime upgrade template
+# Cluster Runtime Upgrade Template
 
 This how-to guide provides a step-by-step template for upgrading a Nexus Cluster. It's designed to help users manage a reproducible end-to-end upgrade through Azure APIs and standard operating procedures. Regular updates are crucial for maintaining system integrity and accessing the latest product improvements.
 
 ## Overview
+<details>
+<summary> Overview of Cluster runtime upgrade template </summary>
 
 **Runtime bundle components**: These components require operator consent for upgrades that may affect traffic behavior or necessitate server reboots. Nexus Cluster's design allows for updates to be applied while maintaining continuous workload operation.
 
@@ -22,33 +24,92 @@ Runtime changes are categorized as follows:
 - **Operating system updates**: Necessary to support new Operating system features and resolve security issues.
 - **Platform updates**: Necessary to support new platform features and resolve security issues.
 
+</details>
+
 ## Prerequisites
+<details>
+<summary> Prerequisites for using this template to upgrade a Cluster </summary>
 
-- Install the latest version of [Azure CLI](https://aka.ms/azcli).
-- The latest `networkcloud` CLI extension is required. It can be installed following the steps listed in [Install CLI Extension](howto-install-cli-extensions.md).
+- Latest version of [Azure CLI](https://aka.ms/azcli).
+- Latest `managednetworkfabric` [CLI extension](howto-install-cli-extensions.md).
+- Latest `networkcloud` [CLI extension](howto-install-cli-extensions.md).
 - Subscription access to run the Azure Operator Nexus Network Fabric (NF) and Network Cloud (NC) CLI extension commands.
--Target Cluster must be healthy in a running state.
+- Target Cluster must be healthy in a running state.
+
+</details>
 
 ## Required Parameters
+<details>
+<summary> Parameters used in this document </summary>
+
 - \<ENVIRONMENT\>: Instance Name
 - <AZURE_REGION>: Azure Region of Instance
 - <CUSTOMER_SUB_NAME>: Subscription Name
 - <CUSTOMER_SUB_ID>: Subscription ID
+- \<NEXUS_VERSION\>: Nexus release version (for example, 2504.1)
+- <NNF_VERSION>: Operator Nexus Fabric release version (for example, 8.1) 
+- <NF_VERSION>: NF runtime version for upgrade (for example, 5.0.0)
+- <NFC_NAME>: Associated Network Fabric Controller (NFC)
+- <CM_NAME>: Associated Cluster Manager (CM)
 - <CLUSTER_NAME>: Cluster Name
 - <CLUSTER_RG>: Cluster Resource Group
 - <CLUSTER_RID>: Cluster ARM ID
 - <CLUSTER_MRG>: Cluster Managed Resource Group
 - <CLUSTER_CONTROL_BMM>: Cluster Control plane baremetalmachine
 - <CLUSTER_VERSION>: Runtime version for upgrade
-- <START_TIME>: Planned start time of upgrade
-- \<DURATION\>: Estimated Duration of upgrade
 - <DEPLOYMENT_THRESHOLD>: Compute deployment threshold
 - <DEPLOYMENT_PAUSE_MINS>: Time to wait before moving to the next Rack once the current Rack meets the deployment threshold
-- <NFC_NAME>: Associated Network Fabric Controller (NFC)
-- <CM_NAME>: Associated Cluster Manager (CM)
-- <BMM_ISSUE_LIST>: List of BMM with provisioning issues after Cluster upgrade is complete
+- <MISE_CID>: Microsoft.Identity.ServiceEssentials (MISE) Correlation ID in debug output for Device updates
+- <CORRELATION_ID>: Operation Correlation ID in debug output for Device updates
+- <ASYNC_URL>: Asynchronous (ASYNC) URL in debug output for Device updates
+- <LINK_TO_TELCO_INPUT>: Link to the Instance Telco Input file
+
+</details>
+
+## Deployment Data
+<details>
+<summary> Deployment data details </summary>
+
+```
+- Nexus: <NEXUS_VERSION>
+- NC: <NC_VERSION>
+- NF: <NF_VERSION>
+- Subscription Name: <CUSTOMER_SUB_NAME>
+- Subscription ID: <CUSTOMER_SUB_ID>
+- Tenant ID: <CUSTOMER_SUB_TENANT_ID>
+- Telco Input: <LINK_TO_TELCO_INPUT>
+```
+
+</details>
+
+## Debug information for Azure CLI commands
+<details>
+<summary> How to collect debug information for Azure CLI commands </summary>
+
+Azure CLI deployment commands issued with `--debug` contain the following information in the command output:
+```
+cli.azure.cli.core.sdk.policies:     'mise-correlation-id': '<MISE_CID>'
+cli.azure.cli.core.sdk.policies:     'x-ms-correlation-request-id': '<CORRELATION_ID>'
+cli.azure.cli.core.sdk.policies:     'Azure-AsyncOperation': '<ASYNC_URL>'
+```
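As an illustrative aside, the three values above can be pulled out of a saved debug log with standard text tools. The sketch below assumes the log lines keep the `'header': 'value'` shape shown above; the function name `extract_debug_field` and any log file name are hypothetical and not part of Azure CLI.

```shell
#!/usr/bin/env bash
# Sketch only: extract one header value from a saved `--debug` log.
# Assumption: each line of interest looks like  ... 'header-name': 'value'

# extract_debug_field LOGFILE HEADER_NAME
# Prints the value from the last line that carries the header, or nothing if absent.
extract_debug_field() {
  local logfile="$1" header="$2"
  # -F matches the quoted header literally, so 'mise-correlation-id' cannot
  # accidentally match inside a longer header name.
  grep -F -- "'${header}':" "$logfile" \
    | tail -n 1 \
    | sed -E "s/.*': *'([^']*)'.*/\1/"
}
```

For example, if the debug output of a command was saved to a file named `upgrade-debug.log` (a hypothetical name), `extract_debug_field upgrade-debug.log Azure-AsyncOperation` would print the `<ASYNC_URL>` value.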
+
+To view the status of long-running asynchronous operations, run the following command with `az rest`:
+```
+az rest -m get -u '<ASYNC_URL>'
+```
+
+Command status information is returned along with detailed informational or error messages:
+- `"status": "Accepted"`
+- `"status": "Succeeded"`
+- `"status": "Failed"`
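The status check can be repeated until the operation reaches a terminal status. This is a minimal sketch, not a supported tool: `wait_for_async` and `fetch_status` are hypothetical helper names, and the sketch assumes the `az rest` response is JSON with a top-level `status` field holding one of the values listed above.

```shell
#!/usr/bin/env bash
# Sketch only: poll an ASYNC URL until the operation succeeds or fails.

# fetch_status ASYNC_URL
# The only function that calls Azure. Assumption: the response has a
# top-level "status" field.
fetch_status() {
  az rest -m get -u "$1" --query status -o tsv
}

# wait_for_async ASYNC_URL [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Returns 0 on Succeeded, 1 on Failed, 2 if no terminal status was seen
# within MAX_ATTEMPTS polls.
wait_for_async() {
  local url="$1" interval="${2:-30}" max_attempts="${3:-120}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(fetch_status "$url")"
    echo "attempt ${attempt}: status=${status}"
    case "$status" in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 2
}
```

Keeping the Azure call in its own small function means the polling logic can be exercised without a live subscription by redefining `fetch_status`.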
+
+If any failures occur, report the <MISE_CID>, <CORRELATION_ID>, status code, and detailed messages when opening a support request.
+
+</details>
 
 ## Pre-Checks
+<details>
+<summary> Pre-checks before starting Cluster upgrade </summary>
 
 1. Validate the provisioning and detailed status for the CM and Cluster.
    
@@ -99,43 +160,19 @@ Runtime changes are categorized as follows:
 
 3. Collect a profile of the tenant workloads:
    ```
-   az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
    az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
    az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
    ```
 
 4. Review Operator Nexus Release notes for required checks and configuration updates not included in this document.
 
-## Send notification to Operations of upgrade schedule for the Cluster
-
-The following template can be used through email or support ticket:
-```
-Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
-
-Operations Support:
-
-Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime upgrade to <CLUSTER_VERSION> <START_TIME> - Completion ETA <DURATION>
+</details>
 
-Subscription: <CUSTOMER_SUB_ID>
-NFC: <NFC_NAME>
-CM: <CM_NAME>
-Fabric: <NF_NAME>
-Cluster: <CLUSTER_NAME>
-Region: <AZURE_REGION>
-Version: <NEXUS_VERSION>
-
-CC: stakeholder-list
-```
+## Upgrade Procedure
+<details>
+<summary> Cluster runtime upgrade procedure details </summary>
 
-## Add resource tag on Cluster resource in Azure portal
-To help track upgrades, add a tag to the Cluster resource in Azure portal (optional):
-```
-|Name            | Value          |
-|----------------|-----------------
-|BF in progress  |<DE_ID>         |
-```
-
-## Set deployment strategy and Compute threshold on Cluster if different from default
+### Cluster upgrade settings defaults
 The default threshold for the percent of Compute BMM to pass hardware validation and provisioning is 80% with a default pause between Racks of one minute.
 
 The following settings are available for `update-strategy`:
@@ -173,7 +210,6 @@ az networkcloud cluster show -n $CLUSTER_NAME -g $CLUSTER_RG --subscription $SUB
 ```
 
 ### How to run Cluster upgrade with `PauseAfterRack` Strategy
-
 `PauseAfterRack` strategy allows the customer to control the upgrade by requiring an API call to continue to the next Rack after each Compute Rack completes to the configured threshold.
 
 To configure strategy to use `PauseAfterRack`:
@@ -192,7 +228,7 @@ az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <C
     "waitTimeMinutes": 0
 ```
 
-## Run upgrade from either portal or cli
+### Run upgrade from either portal or cli
 * To start upgrade from Azure portal, go to Cluster resource, click `Update`, select <CLUSTER_VERSION>, then click `Update`
 * To run upgrade from Azure CLI, run the following command:
   ```
@@ -207,13 +243,21 @@ az networkcloud cluster show -g <CLUSTER_RG> -n <CLUSTER_NAME> --subscription <C
   ```
   Provide this information to Microsoft Support when opening a support ticket for upgrade issues.
 
-## Monitor status of Cluster
+### How to continue upgrade during `PauseAfterRack` strategy
+Once a compute Rack meets the success threshold, the upgrade pauses until the user signals to the operator to continue the upgrade.
+
+Use the following command to continue upgrade once a Compute Rack is paused after meeting the deployment threshold for the Rack:
+```
+az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME --subscription $SUBSCRIPTION_ID
+```
+
+### Monitor status of Cluster
 ```
 az networkcloud cluster list -g $CLUSTER_RG --subscription $SUBSCRIPTION_ID -o table
 ```
 The Cluster `Detailed status` shows `Running` and the `Detailed status message` shows `Cluster is up and running.` when the upgrade is complete.
 
-## Monitor status of Bare Metal Machines
+### Monitor status of Bare Metal Machines
 ```
 az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID -o table
 az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
@@ -228,21 +272,7 @@ Validate the following states for each BMM (except spare):
 - KubernetesVersion: <NEW_VERSION>
 - MachineClusterVersion: <NEXUS_VERSION>
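The per-BMM state checks can also be scripted. The sketch below is an assumption-laden example, not a supported tool. It expects tab-separated rows with the columns name, readyState, provisioningState, detailedStatus, cordonStatus, and powerState, for instance from the BMM list command with a `--query "[].[name,readyState,provisioningState,detailedStatus,cordonStatus,powerState]" -o tsv` projection. It treats `True`, `Succeeded`, `Provisioned`, `Uncordoned`, and `On` as the healthy values. The function name `check_bmm_states` and its optional spare-name argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: flag BMM rows whose states differ from the healthy values.
# Input (stdin): name<TAB>readyState<TAB>provisioningState<TAB>detailedStatus<TAB>cordonStatus<TAB>powerState

# check_bmm_states [SPARE_NAME]
# Prints one "NOT READY" line per unhealthy BMM and returns 1 if any were found.
check_bmm_states() {
  awk -F '\t' -v spare="${1:-}" '
    $1 == spare { next }   # skip the spare BMM when its name is given
    tolower($2) != "true" || $3 != "Succeeded" || $4 != "Provisioned" ||
    $5 != "Uncordoned" || $6 != "On" {
      print "NOT READY: " $0
      bad = 1
    }
    END { exit bad + 0 }
  '
}
```

The ready state is compared case-insensitively because the exact casing of booleans in the output is an assumption here. Version fields are left out on purpose, since the expected values depend on the target release.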
 
-Add a Tag to the BMM resource to track any BMM that fails to complete provisioning (optional):
-```
-|Name                | Value          |
-|--------------------|-----------------
-|BF provision issue  |<DE_ID>         |
-```
-
-## How to continue upgrade during `PauseAfterRack` strategy
-Once a compute Rack meets the success threshold, the upgrade pauses until the user signals to the operator to continue the upgrade.
-
-Use the following command to continue upgrade once a Compute Rack is paused after meeting the deployment threshold for the Rack:
-```
-az networkcloud cluster continue-update-version -g $CLUSTER_RG -n $CLUSTER_NAME$ --subscription $SUBSCRIPTION_ID
-```
-## How to troubleshoot Cluster and BMM upgrade failures
+### How to troubleshoot Cluster and BMM upgrade failures
 The following troubleshooting documents can help recover BMM upgrade issues:
 - [Hardware validation failures](troubleshoot-hardware-validation-failure.md)
 - [BMM Provisioning issues](troubleshoot-bare-metal-machine-provisioning.md)
@@ -254,80 +284,68 @@ If troubleshooting doesn't resolve the issue, open a Microsoft support ticket:
 - Collect Cluster and BMM operation state from Azure portal or Azure CLI.
 - Create Azure Support Request for any Cluster or BMM upgrade failures and attach any errors along with ASYNC URL, correlation ID, and operation state of the Cluster and BMMs.
 
-## Post-upgrade validation
-Run the following commands to check the status of the CM, Cluster, and BMM:
+</details>
 
-1. Check that the CM is in `Succeeded` for `Provisioning state`:
-   ```
-   az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
-   ```
+## Post-upgrade tasks
+<details>
+ <summary> Detailed steps for post-upgrade tasks </summary>
 
-2. Check the Cluster status `Detailed status` is `Running`:
-   ```  
-   az networkcloud cluster show -g $CLUSTER_RG --resource-name $CLUSTER_NAME --subscription $SUBSCRIPTION_ID -o table
-   ```
+### Review Operator Nexus release notes
+Review the Operator Nexus release notes for any version-specific actions required post-upgrade.
 
-3. Check the Bare Metal Machine status:
-   ```
-   az networkcloud baremetalmachine list -g $CLUSTER_MRG --subscription $SUBSCRIPTION_ID --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,kubernetesVersion:kubernetesVersion,machineClusterVersion:machineClusterVersion,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
-   ```
+### Validate Nexus Instance
 
-   Validate the following resource states for each BMM (except spare)
-   - ReadyState: True
-   - ProvisioningState: Succeeded
-   - DetailedStatus: Provisioned
-   - CordonStatus: Uncordoned
-   - PowerState: On
+Validate the health and status of all the Nexus Instance resources with the [Nexus Instance Readiness Test (IRT)](howto-run-instance-readiness-testing.md).
 
-   >[!Note]
-   > One control-plane BMM is labeled as a spare and is inactive.
-
-4. Collect a profile of the tenant workloads:
-   ```
-   az networkcloud clustermanager show -g $CM_RG --resource-name $CM_NAME --subscription $SUBSCRIPTION_ID -o table
-   az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineIdi,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
-   az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
-   ```
-
-## Send notification to Operations of Cluster upgrade completion
-
-The following template can be used through email or ticketing system:
+To perform a resource validation of the Nexus Instance components post-upgrade through Azure CLI:
 ```
-Title: <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> Runtime <CLUSTER_VERSION> Upgrade Complete
-
-Operations:
-Deployment Team notification for <ENVIRONMENT> <AZURE_REGION> <CLUSTER_NAME> runtime <CLUSTER_VERSION> Upgrade Complete
-
-Subscription: <CUSTOMER_SUB_ID>
-NFC: <NFC_NAME>
-CM: <CM_NAME>
-Fabric: <NF_NAME>
-Cluster: <CLUSTER_NAME>
-Region: <AZURE_REGION>
-Version: <NEXUS_VERSION>
-
-The following is a list of BMM with provisioning issues during upgrade:
-<BMM_ISSUE_LIST>
- 
-CC: stakeholder_list
+# NFC
+az networkfabric controller list --subscription <CUSTOMER_SUB_ID> -o table
+az vm list -o table --query "[?location=='<AZURE_REGION>']" --subscription <CUSTOMER_SUB_ID>
+az customlocation list -o table --query "[?location=='<AZURE_REGION>']" --subscription <CUSTOMER_SUB_ID> | grep <NFC_NAME>
+
+# Fabric
+az networkfabric fabric list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
+az networkfabric rack list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
+az networkfabric device list --resource-group <NF_RG> --subscription <CUSTOMER_SUB_ID> -o table
+az networkfabric nni list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
+az networkfabric acl list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
+az networkfabric l2domain list -g <NF_RG> --fabric <NF_NAME> --subscription <CUSTOMER_SUB_ID> -o table
+
+# CM
+az networkcloud clustermanager list --subscription <CUSTOMER_SUB_ID> -o table
+
+# Cluster
+az networkcloud cluster list --subscription <CUSTOMER_SUB_ID> -o table
+az networkcloud baremetalmachine list -g <CLUSTER_MRG> --subscription <CUSTOMER_SUB_ID> --query "sort_by([].{name:name,kubernetesNodeName:kubernetesNodeName,location:location,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,cordonStatus:cordonStatus,powerState:powerState,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" -o table
+az networkcloud storageappliance list -g <CLUSTER_MRG> --subscription <CUSTOMER_SUB_ID> -o table
+
+# Tenant Workloads
+az networkcloud virtualmachine list --sub $SUBSCRIPTION_ID --query "reverse(sort_by([?clusterId=='$CLUSTER_RID'].{name:name, createdAt:systemData.createdAt, resourceGroup:resourceGroup, powerState:powerState, provisioningState:provisioningState, detailedStatus:detailedStatus,bareMetalMachineId:bareMetalMachineId,CPUCount:cpuCores, EmulatorStatus:isolateEmulatorThread}, &createdAt))" -o table
+az networkcloud kubernetescluster list --sub $SUBSCRIPTION_ID --query "[?clusterId=='$CLUSTER_RID'].{name:name, resourceGroup:resourceGroup, provisioningState:provisioningState, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, createdAt:systemData.createdAt, kubernetesVersion:kubernetesVersion}" -o table
 ```
 
-## Remove resource tag on Cluster resource in Azure portal
-Remove the resource tag on the Cluster resource tracking the upgrade in Azure portal (if added previously):
-```
-|Name            | Value          |
-|----------------|-----------------
-|BF in progress  |<DE_ID>         |
-```
+> [!Note]
+> IRT validation provides a complete functional test of networking and workloads across all components of the Nexus Instance. Simple validation does not provide functional testing.
 
-## Close out any Work Items in your ticketing system
-* Update Task hours for upgrade duration.
-* Set Cluster upgrade work item to `Complete`.
-* Add any notes on support tickets and issues encountered during upgrade
+</details>
 
 ## Links
-- [Azure portal](https://aka.ms/nexus-portal)
+<details>
+<summary> Reference Links for Cluster upgrade </summary>
+
+Reference links for Cluster upgrade:
+- Access the [Azure portal](https://aka.ms/nexus-portal)
+- [Install Azure CLI](https://aka.ms/azcli)
+- [Install CLI Extension](howto-install-cli-extensions.md)
 - [Cluster Upgrade](howto-cluster-runtime-upgrade.md)
 - [Cluster Upgrade with PauseAfterRack](howto-cluster-runtime-upgrade-with-pauseafterrack-strategy.md)
-- [Azure CLI](https://aka.ms/azcli)
-- [Install CLI Extension](howto-install-cli-extensions.md)
+- [Troubleshoot hardware validation failure](troubleshoot-hardware-validation-failure.md)
+- [Troubleshoot BMM provisioning](troubleshoot-bare-metal-machine-provisioning.md)
+- [Troubleshoot BMM degraded](troubleshoot-bare-metal-machine-degraded.md)
+- [Troubleshoot BMM warning](troubleshoot-bare-metal-machine-warning.md)
+- Reference the [Nexus Telco Input Template](concepts-telco-input-template.md)
+- Reference the [Nexus Instance Readiness Test (IRT)](howto-run-instance-readiness-testing.md)
+
+</details>