|
| 1 | +--- |
| 2 | +title: "Azure Operator Nexus: Fabric runtime upgrade template" |
| 3 | +description: Learn the process for upgrading Fabric for Operator Nexus with a step-by-step parameterized template. |
| 4 | +author: bartpinto |
| 5 | +ms.author: bpinto |
| 6 | +ms.service: azure-operator-nexus |
| 7 | +ms.date: 04/23/2025 |
| 8 | +ms.topic: how-to |
| 9 | +ms.custom: azure-operator-nexus, template-include |
| 10 | +--- |
| 11 | + |
| 12 | +# Fabric runtime upgrade template |
| 13 | + |
| 14 | +This how-to guide provides a step-by-step template for upgrading a Fabric. It is designed to assist users in enhancing their network infrastructure through Azure APIs, which facilitate the lifecycle management of various network devices. Regular updates are crucial for maintaining system integrity and accessing the latest product improvements. |
| 15 | + |
| 16 | +## Overview |
| 17 | + |
| 18 | +**Runtime bundle components**: These components require operator consent for upgrades that may affect traffic behavior or necessitate device reboots. The network fabric's design allows for updates to be applied while maintaining continuous data traffic flow. |
| 19 | + |
| 20 | +Runtime changes are categorized as follows: |
| 21 | +- **Operating system updates**: Necessary to support new features or resolve issues. |
| 22 | +- **Base configuration updates**: Initial settings applied during device bootstrapping. |
| 23 | +- **Configuration structure updates**: Generated based on user input for conf |
| 24 | + |
| 25 | +## Required Parameters |
| 26 | +- `<START_DATE>`: Planned start date/time of upgrade |
| 27 | +- `<ENVIRONMENT>`: Instance name |
| 28 | +- `<AZURE_REGION>`: Azure region of instance |
| 29 | +- `<CUSTOMER_SUB_NAME>`: Subscription name |
| 30 | +- `<CUSTOMER_SUB_TENANT_ID>`: Tenant ID (from `az account show`) |
| 31 | +- `<CUSTOMER_SUB_ID>`: Subscription ID |
| 32 | +- `<NEXUS_VERSION>`: Operator Nexus release version (e.g. 2504.1) |
| 33 | +- `<NNF_VERSION>`: Operator Nexus Fabric release version (e.g. 8.1) |
| 34 | +- `<NF_VERSION>`: NF runtime version for upgrade (e.g. 5.0.0) |
| 35 | +- `<NF_DEVICE_NAME>`: Network Fabric Device Name |
| 36 | +- `<NF_DEVICE_RID>`: Network Fabric Device Resource ID |
| 37 | +- `<NF_NAME>`: Network Fabric Name |
| 38 | +- `<NF_RG>`: Network Fabric Resource Group |
| 39 | +- `<NF_RID>`: Network Fabric ARM ID |
| 40 | +- `<NFC_NAME>`: Associated NFC |
| 41 | +- `<NFC_RG>`: NFC Resource Group |
| 42 | +- `<NFC_RID>`: NFC ARM ID |
| 43 | +- `<CLUSTER_KEYVAULT_ID>`: Cluster Keyvault ARM ID |
| 44 | +- `<NFC_MRG>`: Cluster Managed Resource Group |
| 45 | +- `<DURATION>`: Estimated duration of upgrade |
| 46 | +- `<DE_ID>`: Deployment Engineer performing upgrade |
| 47 | + |
| 48 | +## Links |
| 49 | +- [Azure Portal](https://aka.ms/nexus-portal) |
| 50 | +- [Operator Nexus Releases and Notes](./release-notes-2402.2) |
| 51 | +- [Network Fabric Upgrade](./howto-upgrade-nexus-fabric) |
| 52 | + |
| 53 | +## Pre-Checks before executing the Fabric upgrade |
| 54 | + |
| 55 | +1. The following role permissions should be assigned to end users responsible for Fabric create, upgrade, and delete operations. These permissions can be granted temporarily, limited to the duration required to perform these operations. |
| 56 | + * Microsoft.NexusIdentity/identitySets/read |
| 57 | + * Microsoft.NexusIdentity/identitySets/write |
| 58 | + * Microsoft.NexusIdentity/identitySets/delete |
| 59 | + * Ensure that Role Based Access Control Administrator is successfully activated. |
| 60 | + * To check: Azure portal -> Network Fabric -> Access control (IAM) -> View my access. In current role assignments, you should see the following two roles: |
| 61 | + - Nexus Contributor |
| 62 | + - Role Based Access Control Administrator |
| 63 | + |
| 64 | +2. Validate Network Fabric Controller and Network Fabric provisioning status. |
| 65 | + |
| 66 | + Set up the subscription, NFC, and NF parameters: |
| 67 | + ``` |
| 68 | + export SUBSCRIPTION_ID=<CUSTOMER_SUB_ID> |
| 69 | + export NFC_RG=<NFC_RG> |
| 70 | + export NFC_NAME=<NFC_NAME> |
| 71 | + export NF_RG=<NF_RG> |
| 72 | + export NF_NAME=<NF_NAME> |
| 73 | + ``` |
| 74 | + |
| 75 | + Check that the NFC is in Provisioned state. |
| 76 | + ``` |
| 77 | + az networkfabric controller show -g $NFC_RG --resource-name $NFC_NAME --subscription $SUBSCRIPTION_ID -o table |
| 78 | + ``` |
| 79 | + |
| 80 | + Check the NF status: |
| 81 | + ``` |
| 82 | + az networkfabric fabric show -g $NF_RG --resource-name $NF_NAME --subscription $SUBSCRIPTION_ID -o table |
| 83 | + ``` |
| 84 | + **Note down the fabricVersion and provisioningState - if provisioningState is not Succeeded then upgrade should not continue until resolved.** |
| 85 | + |
| 86 | +3. Microsoft.NexusIdentity user RP must be registered on the customer subscription. To check: |
| 87 | + ``` |
| 88 | + az provider show --namespace Microsoft.NexusIdentity -o table --subscription $SUBSCRIPTION_ID |
| 89 | + Namespace RegistrationPolicy RegistrationState |
| 90 | + ----------------------- -------------------- ------------------- |
| 91 | + Microsoft.NexusIdentity RegistrationRequired Registered |
| 92 | + ``` |
| 93 | + |
| 94 | + If not registered, run the following: |
| 95 | + ``` |
| 96 | + az provider register --namespace Microsoft.NexusIdentity --wait --subscription $SUBSCRIPTION_ID |
| 97 | +
|
| 98 | + az provider show --namespace Microsoft.NexusIdentity -o table |
| 99 | + Namespace RegistrationPolicy RegistrationState |
| 100 | + ----------------------- -------------------- ------------------- |
| 101 | + Microsoft.NexusIdentity RegistrationRequired Registered |
| 102 | + ``` |
| 103 | + |
| 104 | +4. Minimum available disk space on each device (CE, TOR, NPB, Mgmt Switch) must be more than 3.5 GB for a successful device upgrade. |
| 105 | + |
| 106 | + Verify the available space on all devices using the following admin action. If there isn't enough space, remove archived EOS images and support bundle files. |
| 107 | + ``` |
| 108 | + az networkfabric device run-ro --resource-name <NF_DEVICE_NAME> --resource-group <NF_RG> --ro-command "dir flash" --subscription <CUSTOMER_SUB_ID> --debug |
| 109 | + ``` |
| 110 | + |
| 111 | +5. Check that no other fabric upgrade is running on the same NFC, to prevent contention issues with the NFC storage account: |
| 112 | + ``` |
| 113 | + az networkfabric fabric list --subscription <CUSTOMER_SUB_ID> -o table | grep <NFC_NAME> |
| 114 | + ``` |
| 115 | + |
| 116 | + Verify there are no other Fabrics showing `provisioningState` as `Updating` on the same Network Fabric Controller. |
| 117 | + |
| 118 | +6. Check Network Packet Broker for any orphaned Network Taps: |
| 119 | + In the AZ Portal: |
| 120 | + * Select Network Fabrics -> <NF_NAME>. |
| 121 | + * Click on the Resource group. |
| 122 | + * In the Resources list, filter on "network packet broker". |
| 123 | + * Click on the network packet broker name. |
| 124 | + * Click on "Network Taps". |
| 125 | + * All Network taps should be `Succeeded` for `Configuration State` and `Provisioning State` and `Enabled` for `Administrative State`. |
| 126 | + * Look for any taps with a red `X` and a status of `not found`, `failed` or `error`. |
| 127 | + |
| 128 | + If any taps show a "not found", "failed" or "error" status, do not start the upgrade until the network tap issues are cleared. |
| 129 | + |
| 130 | +7. Run cabling validation report: |
| 131 | + ``` |
| 132 | + az networkfabric fabric validate-configuration --resource-group $NF_RG --resource-name $NF_NAME --validate-action "Cabling" --debug |
| 133 | + ``` |
| 134 | + Follow the link in the output to the Storage Account where the report is uploaded in JSON format. Satya wrote a Python tool to convert it to HTML (see Teams chat). |
| 135 | + |
| 136 | + Attach the zipped HTML validation report to the iTrack, e.g. report-01-05-2024-15-00.zip |
| 137 | + |
| 138 | + Add a comment to the iTrack listing any interfaces with `NotConnected` status or mismatched ports. |
| 139 | + |
| 140 | + The report identifies the following issues: |
| 141 | + Device Name Interface Map Name Validation Result Status Destination Hostname Destination Port Device |
| 142 | + Configuration Error Reason Map Type |
| 143 | + <LIST_FAILURES> |
| 144 | + |
| 145 | + In the `Unknown` section, verify any ports with `Not-Connected` status against the BOM. |
| 146 | + |
| 147 | + List all port connection and cabling issues in the iTrack and notify AT&T Nitro/Team. |
| 148 | + |
| 149 | +11. Notify SRE of Upgrade and ETA: |
| 150 | + |
| 151 | + The DE sends SRE a notification of the production resource upgrade and its ETA using the following template: |
| 152 | + ``` |
| 153 | + Title: <ENVIRONMENT> <REGION> <FABRIC_NAME> Runtime upgrade to <FABRIC_RUNTIME_VERSION> <START_TIME> - |
| 154 | + Completion ETA <DURATION> |
| 155 | +
|
| 156 | + |
| 157 | +
|
| 158 | + Nexus DE Team <ENVIRONMENT> <REGION> <FABRIC_NAME> Runtime upgrade to <FABRIC_RUNTIME_VERSION> <START_TIME> |
| 159 | + - Completion ETA <DURATION> |
| 160 | +
|
| 161 | + Subscription: <CUSTOMER_SUB_ID> |
| 162 | + NFC: <NFC_NAME> |
| 163 | + CM: <CM_NAME> |
| 164 | + Fabric: <FABRIC_NAME> |
| 165 | + Cluster: <CLUSTER_NAME> |
| 166 | + Region: <AZURE_REGION> |
| 167 | + Version: <NEXUS_VERSION> |
| 168 | + |
| 169 | + cc: aods-de- |
| 170 | + |
| 171 | + ``` |
| 172 | + |
| 173 | +12. Azure Resource Tags on Deployment Resources: |
| 174 | + |
| 175 | + ``` |
| 176 | + To help track customer deployments, DE will add tags to DE created Azure resources in Azure portal for |
| 177 | + Fabric: |
| 178 | + | Name           | Value   | |
| 179 | + |----------------|---------| |
| 180 | + | BF in progress | <DE_ID> | |
| 181 | +
|
| 182 | + When deployment is complete or issue is resolved, the DE will remove the tag. |
| 183 | + ``` |
| 184 | + |
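The provisioning-state and disk-space pre-checks above can be turned into simple pass/fail gates. The following is a minimal sketch, not part of the official procedure: the helper names `require_succeeded` and `has_enough_flash` are hypothetical, and the byte threshold assumes the guide's "3.5 GB" is meant in binary units (3.5 GiB). The commented usage relies on the standard Azure CLI `--query` and `-o tsv` global arguments.

```shell
# Hypothetical helpers; names and threshold are assumptions, not az CLI features.

# Fail unless the given provisioningState is exactly "Succeeded".
require_succeeded() {
  label="$1"
  state="$2"
  if [ "$state" != "Succeeded" ]; then
    echo "$label: provisioningState is '$state'; resolve before upgrading" >&2
    return 1
  fi
}

# Assumption: "3.5 GB" is read as 3.5 GiB = 3758096384 bytes.
MIN_FREE_BYTES=3758096384

# Succeed only if the free byte count (read from the "dir flash" output)
# is strictly greater than the minimum.
has_enough_flash() {
  [ "$1" -gt "$MIN_FREE_BYTES" ]
}

# Example usage (requires az login; not executed here):
#   state="$(az networkfabric fabric show -g "$NF_RG" --resource-name "$NF_NAME" \
#     --subscription "$SUBSCRIPTION_ID" --query provisioningState -o tsv)"
#   require_succeeded "$NF_NAME" "$state" || exit 1
```

The free byte count still has to be read from each device's `dir flash` output by hand; parsing that output is deliberately left out because its exact shape in the `run-ro` response is not shown in this guide.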
| 185 | + |
| 186 | +# Procedure |
| 187 | +**STEP 1: TRIGGER UPGRADE ON FABRIC** |
| 188 | +Operator triggers the upgrade POST action on NetworkFabric via AZCLI/Portal with request payload as: |
| 189 | +``` |
| 190 | +az networkfabric fabric upgrade -g $NF_RG --resource-name $NF_NAME --action start --version "5.0.0" --subscription $SUBSCRIPTION_ID --debug |
| 191 | +{} |
| 192 | +``` |
| 193 | +**Note: Output showing `{}` indicates successful execution of upgrade command** |
| 194 | + |
| 195 | +As part of the above POST action request, the RP validates whether the version upgrade is allowed from the existing fabric version. Only an upgrade from 4.0.0 to 5.0.0 is allowed. |
| 196 | +The above command marks the Network Fabric as under maintenance and prevents any other operation on the Fabric. |
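Because the RP rejects any upgrade path other than 4.0.0 to 5.0.0, comparing versions locally first can avoid a failed POST. This is a minimal sketch, assuming a hypothetical helper named `upgrade_allowed`; the supported path is hard-coded from the statement above and would need updating for other releases.

```shell
# Hypothetical local pre-check mirroring the RP rule stated above:
# only 4.0.0 -> 5.0.0 is accepted.
upgrade_allowed() {
  current="$1"
  target="$2"
  if [ "$current" = "4.0.0" ] && [ "$target" = "5.0.0" ]; then
    return 0
  fi
  echo "upgrade from '$current' to '$target' is not supported" >&2
  return 1
}

# Example usage (requires az login; not executed here):
#   current="$(az networkfabric fabric show -g "$NF_RG" --resource-name "$NF_NAME" \
#     --subscription "$SUBSCRIPTION_ID" --query fabricVersion -o tsv)"
#   upgrade_allowed "$current" "5.0.0" || exit 1
```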
| 197 | + |
| 198 | +**STEP 2: TRIGGER UPGRADE PER DEVICE** |
| 199 | +Operator triggers upgrade POST actions per device (in the order recommended by the NNF team). When a device upgrade succeeds, the service marks the device as upgraded to the newer version. |
| 200 | +**NOTE: If a device upgrade fails, the issue must be mitigated manually before the operator can proceed to upgrade the next device. Please raise an Azure portal support request.** |
| 201 | + |
| 202 | +To see steps on how to create an Azure Portal Support Request and to see the flow for ticketing deployment issues, click here: |
| 203 | +https://dev.azure.com/msazuredev/AzureForOperatorsIndustry/_wiki/wikis/AzureForOperatorsIndustry.wiki/27142/Azure-Portal-Support-Request-and-IcM-Ticketing-Process |
| 204 | + |
| 205 | +An `8-rack` environment has the following 30 devices: |
| 206 | +- Aggr Rack: 2 CEs, 2 NPBs, 2 Mgmt Switches (6 devices) |
| 207 | +- 8 Compute Racks: each compute rack has 2 TORs and 1 Mgmt Switch (24 devices) |
| 208 | + |
| 209 | +A `4-rack` environment has the following 17 devices: |
| 210 | +- Aggr Rack: 2 CEs, 1 NPB, 2 Mgmt Switches (5 devices) |
| 211 | +- 4 Compute Racks: each compute rack has 2 TORs and 1 Mgmt Switch (12 devices) |
| 212 | + |
| 213 | +**Device Upgrade Order** |
| 214 | +***Compute Racks:*** |
| 215 | +1. Odd numbered TORs (***NOTE: All DEVICES IN THIS GROUP CAN BE DONE IN PARALLEL; WAIT FOR SUCCESSFUL UPGRADE ON ALL DEVICES BEFORE MOVING TO THE NEXT GROUP***) |
| 216 | + |
| 217 | +2. Even numbered TORs (***NOTE: All DEVICES IN THIS GROUP CAN BE DONE IN PARALLEL; WAIT FOR SUCCESSFUL UPGRADE ON ALL DEVICES BEFORE MOVING TO THE NEXT GROUP***) |
| 218 | + |
| 219 | +3. Compute rack management switches (***NOTE: All DEVICES IN THIS GROUP CAN BE DONE IN PARALLEL; WAIT FOR SUCCESSFUL UPGRADE ON ALL DEVICES BEFORE MOVING TO THE NEXT GROUP***) |
| 220 | + |
| 221 | +***Aggregate Racks:*** |
| 222 | + |
| 223 | +4. CEs are to be upgraded one after the other in a serial manner. ***(NOTE: WAIT FOR SUCCESSFUL UPGRADE ON |
| 224 | +EACH DEVICE BEFORE MOVING TO THE NEXT DEVICE)*** Stop the upgrade procedure if there are any failures |
| 225 | +corresponding to CE upgrade operation. After each CE upgrade, ***wait for a duration of five minutes*** to |
| 226 | +ensure that the recovery process is complete before proceeding to the next device upgrade. ***(NOTE: WAIT |
| 227 | +FOR SUCCESSFUL UPGRADE ON BOTH CE DEVICES BEFORE MOVING TO THE NPBs)*** |
| 228 | + |
| 229 | +5. NPBs are to be upgraded one after the other in a serial manner. ***(NOTE: Most sites will be an 8-rack environment with two NPB devices. If the site has a 4-rack environment, there will only be one NPB device. WAIT FOR SUCCESSFUL UPGRADE ON EACH DEVICE BEFORE MOVING TO THE NEXT DEVICE; WAIT FOR SUCCESSFUL UPGRADE ON BOTH NPB DEVICES BEFORE |
| 230 | +MOVING TO THE Aggr Mgmt Switches)*** |
| 231 | + |
| 232 | +6. Remaining aggr rack mgmt switches are to be upgraded one after the other in a serial manner. ***(NOTE: |
| 233 | +WAIT FOR SUCCESSFUL UPGRADE ON EACH DEVICE BEFORE MOVING TO THE NEXT DEVICE)*** |
| 234 | + |
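Splitting TORs into the odd and even groups from the order above can be scripted when device names end in the TOR number. This is only a sketch: the helper `tor_group` and the naming pattern (for example `rack3-tor1`) are assumptions for illustration, so adapt the match to the site's real device names.

```shell
# Hypothetical helper: classify a TOR by the parity of the number its name ends in.
# Assumption: the device name ends in the TOR number, e.g. "rack3-tor1".
tor_group() {
  case "$1" in
    *[13579]) echo odd ;;
    *[02468]) echo even ;;
    *) echo unknown; return 1 ;;
  esac
}

# Example usage: print only the odd-numbered TORs from a list of names.
#   for d in rack1-tor1 rack1-tor2 rack2-tor1; do
#     [ "$(tor_group "$d")" = "odd" ] && echo "$d"
#   done
```

Only the last digit decides parity, so multi-digit TOR numbers are handled without arithmetic on the name.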
| 235 | + |
| 236 | +Verify all devices have ConfigurationState `Succeeded` and ProvisioningState `Succeeded`. |
| 237 | + ``` |
| 238 | + az networkfabric device list -g $NF_RG -o table --subscription $SUBSCRIPTION_ID |
| 239 | + ``` |
| 240 | + |
| 241 | +Run the following device upgrade command on the devices **following the Device Upgrade Order listed above**. |
| 242 | + ``` |
| 243 | + az networkfabric device upgrade --version 5.0.0 -g $NF_RG --resource-name $NF_DEVICE_NAME --subscription $SUBSCRIPTION_ID --debug |
| 245 | + ``` |
| 246 | +Gather ASYNC URL and Correlation ID info for further troubleshooting if needed. |
| 247 | + ``` |
| 248 | + cli.azure.cli.core.sdk.policies: 'mise-correlation-id': '<MISE_CID>' |
| 249 | + cli.azure.cli.core.sdk.policies: 'x-ms-correlation-request-id': '<CORRELATION_ID>' |
| 250 | + cli.azure.cli.core.sdk.policies: 'Azure-AsyncOperation': '<ASYNC_URL>' |
| 251 | + ``` |
| 252 | +Check the activity log and DGrep for issues on each device in the group before starting the next group. |
| 253 | +``` |
| 254 | + Endpoint: Diagnostics PROD |
| 255 | + Namespace: AfoNetworkFabric |
| 256 | + Events to search: APIValidationErrors, Errors |
| 257 | + Time range: Make sure to encompass the period for when the upgrade action was executed |
| 258 | +
|
| 259 | + Use the following filtering conditions: |
| 260 | + source |
| 261 | + | order by serviceTimestampString asc |
| 262 | + | where * contains "<FABRIC_NAME>" |
| 263 | + | where * contains "<CORRELATION_ID>" |
| 264 | + |
| 265 | + Example DGREP [query](https://portal.microsoftgeneva.com/s/E8AB3E31). |
| 266 | +``` |
| 267 | + |
| 268 | +As part of the upgrade, NF devices are kept in maintenance mode. While in maintenance mode, the device drains traffic and stops advertising routes so that traffic flow to the device stops. |
| 269 | + |
| 270 | +After this step, the NNF service updates the NetworkDevice resource version property to the newer version. Verify the accuracy of this information before moving to the next device. |
| 271 | +If a failure is encountered at any step of this workflow, NNF fails the device upgrade operation. Failures must be mitigated by human intervention. |
| 272 | + |
| 273 | +The operator triggers device upgrades one at a time. |
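The serial sequence used for the CEs, including the five-minute settle time, could be wrapped in a small loop. This is a minimal sketch under stated assumptions: `upgrade_serially` and `SETTLE_SECONDS` are hypothetical names, the `az networkfabric device upgrade` call is the one shown above and is assumed to return only after the operation finishes, and `NF_RG` and `SUBSCRIPTION_ID` are assumed to be exported already.

```shell
# Hypothetical wrapper; stops at the first failure so it can be mitigated manually.
upgrade_serially() {
  version="$1"
  shift
  for device in "$@"; do
    echo "Upgrading $device to $version" >&2
    az networkfabric device upgrade --version "$version" -g "$NF_RG" \
      --resource-name "$device" --subscription "$SUBSCRIPTION_ID" || return 1
    # Wait for recovery before the next device (five minutes by default).
    sleep "${SETTLE_SECONDS:-300}"
  done
}

# Example usage (device names are placeholders; not executed here):
#   upgrade_serially 5.0.0 "<CE1_DEVICE_NAME>" "<CE2_DEVICE_NAME>"
```

The loop does not replace the activity log and DGrep checks described above; it only enforces the one-at-a-time ordering and the pause between devices.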
| 274 | + |
| 275 | +**STEP 3: POST DEVICE UPGRADES** |
| 276 | +After device upgrades are complete, make sure that all the devices are showing as 5.0.0 by running the following command: |
| 277 | +``` |
| 278 | +az networkfabric device list -g $NF_RG --query "[].{name:name,version:version}" -o table --subscription $SUBSCRIPTION_ID |
| 279 | +``` |
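Rather than reading the table by eye, the version check can be reduced to a pass/fail result. This is a sketch, assuming a hypothetical helper `all_at_version` that reads tab-separated `name` and `version` pairs on standard input, as produced by the standard Azure CLI `-o tsv` output for a `[].[name,version]` query.

```shell
# Hypothetical helper: reads "name<TAB>version" lines on stdin and
# fails if any device is not at the expected version.
all_at_version() {
  target="$1"
  status=0
  while read -r name version; do
    [ -n "$name" ] || continue
    if [ "$version" != "$target" ]; then
      echo "$name is at '$version', expected '$target'" >&2
      status=1
    fi
  done
  return "$status"
}

# Example usage (requires az login; not executed here):
#   az networkfabric device list -g "$NF_RG" --subscription "$SUBSCRIPTION_ID" \
#     --query "[].[name,version]" -o tsv | all_at_version 5.0.0
```

Every mismatching device is reported before the function returns, so one run lists all devices that still need attention.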
| 280 | + |
| 281 | +**STEP 4: COMPLETE NETWORK FABRIC UPGRADE** |
| 282 | +Once all the devices are upgraded, run the following command to take the network fabric out of maintenance state. |
| 283 | + |
| 284 | +``` |
| 285 | +az networkfabric fabric upgrade --action Complete --version "5.0.0" -g $NF_RG --resource-name $NF_NAME --debug --subscription $SUBSCRIPTION_ID |
| 286 | +``` |
| 287 | +Once complete, run the following command to check fabric version is showing 5.0.0: |
| 288 | +``` |
| 289 | +az networkfabric fabric list -g $NF_RG --query "[].{name:name,fabricVersion:fabricVersion,configurationState:configurationState,provisioningState:provisioningState}" -o table --subscription $SUBSCRIPTION_ID |
| 290 | + |
| 291 | +az networkfabric fabric show -g $NF_RG --resource-name $NF_NAME --subscription $SUBSCRIPTION_ID |
| 292 | +``` |
| 293 | + |
| 294 | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 295 | + |
| 296 | +**Troubleshooting if device update failed:** |
| 297 | +1. Check the device operation state from the admin API. |
| 298 | +2. Check for errors in the Azure CLI output. |
| 299 | + |
| 300 | +Troubleshoot Network Fabric upgrade TSG docs: https://eng.ms/docs/strategic-missions-and-technologies/strategic-missions-and-technologies-organization/azure-for-operators-industry/network-cloud/afoi-network-cloud/network-cloud-tsgs/doc/undercloud/deployment/how-to-troubleshoot-deployment-run and https://eng.ms/docs/cloud-ai-platform/azure-edge-platform-aep/aep-edge/nexus/nexus-network-fabric/nexus-network-fabric-troubleshooting-guides/networkfabric/networkfabric-upgrade-start-failed |
| 301 | + |
| 302 | + |
| 303 | +# Create Azure Support Request in Portal |
| 304 | + |
| 305 | + For any device upgrade failure issue, please create an Azure Portal Support request to facilitate better tracking. |
| 306 | + |
| 307 | +To see steps on how to create an Azure Portal Support Request and to see the flow for ticketing deployment issues, click here: |
| 308 | +https://dev.azure.com/msazuredev/AzureForOperatorsIndustry/_wiki/wikis/AzureForOperatorsIndustry.wiki/27142/Azure-Portal-Support-Request-and-IcM-Ticketing-Process |
| 309 | + |
| 310 | + |
| 311 | +# Post-upgrade Validation |
| 312 | +1. Validation with prov-val.sh Scripts: |
| 313 | + Clone nc-labs: git clone https://dev.azure.com/msazuredev/AzureForOperatorsIndustry/_git/nc-labs |
| 314 | + ``` |
| 315 | + cd ~/att/nc-labs/scripts/validation-bash-scripts |
| 316 | + ./prov-val-prod.sh |
| 317 | + ``` |
| 318 | + |
| 319 | + Attach `prov-val-prod` output log to the iTrack ticket. |
| 320 | + |
| 321 | + |
| 322 | +2. Notify Operations to perform upgrade validation. |
| 323 | + |
| 324 | + The following template can be used through email or a ticketing system: |
| 325 | + ``` |
| 326 | + Title: <ENVIRONMENT> <REGION> <FABRIC_NAME> Runtime <FABRIC_RUNTIME_VERSION> Upgrade Complete - Validation Requested |
| 327 | + Operations |
| 328 | +
|
| 329 | + Nexus DE Team <ENVIRONMENT> <REGION> <FABRIC_NAME> Runtime <FABRIC_RUNTIME_VERSION> Upgrade Complete - Validation Requested |
| 330 | +
|
| 331 | + Subscription: <CUSTOMER_SUB_ID> |
| 332 | + NFC: <NFC_NAME> |
| 333 | + CM: <CM_NAME> |
| 334 | + Fabric: <FABRIC_NAME> |
| 335 | + Cluster: <CLUSTER_NAME> |
| 336 | + Region: <AZURE_REGION> |
| 337 | + Version: <NEXUS_VERSION> |
| 338 | + |
| 339 | + cc: < |
| 340 | + ``` |
| 341 | + |
| 342 | +# Wait for SRE Validation report |
| 343 | +SRE sends a validation report and gives the OK to hand off to the customer. Wait for this before continuing. |
| 344 | + |
| 345 | +# Remove Fabric Tag |
| 346 | +Remove the following tag in Azure portal on the Fabric resource added for upgrade tracking: |
| 347 | +`BF in progress: <DE_ID>` |
| 348 | + |
| 349 | +## Close out any Work Items in your ticketing system |
| 350 | +* Update Task hours for upgrade duration. |
| 351 | +* Set Fabric upgrade work item to `Complete`. |