Commit 9008f41

Create howto-upgrade-nexus-fabric-template.md
1 parent 524c912 commit 9008f41

1 file changed

+351
-0
lines changed

@@ -0,0 +1,351 @@
1+
---
2+
title: "Azure Operator Nexus: Fabric runtime upgrade template"
3+
description: Learn the process for upgrading Fabric for Operator Nexus with a step-by-step parameterized template.
4+
author: bartpinto
5+
ms.author: bpinto
6+
ms.service: azure-operator-nexus
7+
ms.date: 04/23/2025
8+
ms.topic: how-to
9+
ms.custom: azure-operator-nexus, template-include
10+
---
11+
12+
# Fabric runtime upgrade template
13+
14+
This how-to guide provides a step-by-step template for upgrading a Fabric. It is designed to assist users in enhancing their network infrastructure through Azure APIs, which facilitate the lifecycle management of various network devices. Regular updates are crucial for maintaining system integrity and accessing the latest product improvements.
15+
16+
## Overview
17+
18+
**Runtime bundle components**: These components require operator consent for upgrades that may affect traffic behavior or necessitate device reboots. The network fabric's design allows for updates to be applied while maintaining continuous data traffic flow.
19+
20+
Runtime changes are categorized as follows:
21+
- **Operating system updates**: Necessary to support new features or resolve issues.
22+
- **Base configuration updates**: Initial settings applied during device bootstrapping.
23+
- **Configuration structure updates**: Generated based on user input for conf
24+
25+
## Required Parameters:
26+
- <START_DATE>: Planned start date/time of upgrade
27+
- <ENVIRONMENT>: Instance name
28+
- <AZURE_REGION>: Azure region of instance
29+
- <CUSTOMER_SUB_NAME>: Subscription name
30+
- <CUSTOMER_SUB_TENANT_ID>: Tenant ID (from `az account show`)
31+
- <CUSTOMER_SUB_ID>: Subscription ID
32+
- <NEXUS_VERSION>: Operator Nexus release version (e.g. 2504.1)
33+
- <NNF_VERSION>: Operator Nexus Fabric release version (e.g. 8.1)
34+
- <NF_VERSION>: NF runtime version for upgrade (e.g. 5.0.0)
35+
- <NF_DEVICE_NAME>: Network Fabric Device Name
36+
- <NF_DEVICE_RID>: Network Fabric Device Resource ID
37+
- <NF_NAME>: Network Fabric Name
38+
- <NF_RG>: Network Fabric Resource Group
39+
- <NF_RID>: Network Fabric ARM ID
40+
- <NFC_NAME>: Associated NFC
41+
- <NFC_RG>: NFC Resource Group
42+
- <NFC_RID>: NFC ARM ID
43+
- <CLUSTER_KEYVAULT_ID>: Cluster Keyvault ARM ID
44+
- <NFC_MRG>: Cluster Managed Resource Group
45+
- <DURATION>: Estimated Duration of upgrade
46+
- <DE_ID>: Deployment Engineer performing upgrade
47+
48+
## Links
49+
- [Azure Portal](https://aka.ms/nexus-portal)
- [Operator Nexus Releases and Notes](./release-notes-2402.2)
- [Network Fabric Upgrade](./howto-upgrade-nexus-fabric)
52+
53+
## Pre-Checks before executing the Fabric upgrade
54+
55+
1. The following role permissions should be assigned to end users responsible for Fabric create, upgrade, and delete operations. These permissions can be granted temporarily, limited to the duration required to perform these operations.
56+
* Microsoft.NexusIdentity/identitySets/read
57+
* Microsoft.NexusIdentity/identitySets/write
58+
* Microsoft.NexusIdentity/identitySets/delete
59+
* Ensure that Role Based Access Control Administrator is successfully activated.
60+
* To check: Azure portal -> Network Fabric -> Access control (IAM) -> View my access. Under current role assignments, you should see the following two roles:
61+
- Nexus Contributor
62+
- Role Based Access Control Administrator
63+
64+
2. Validate Network Fabric Controller and Network Fabric provisioning status.
65+
66+
Set up the subscription, NFC, and NF parameters:
67+
```
export SUBSCRIPTION_ID="<CUSTOMER_SUB_ID>"
export NFC_RG="<NFC_RG>"
export NFC_NAME="<NFC_NAME>"
export NF_RG="<NF_RG>"
export NF_NAME="<NF_NAME>"
```
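Every later command in this guide depends on these variables, so it can help to confirm they are set before going further. The following is a minimal sketch in POSIX shell; the function name `check_required_vars` is hypothetical and not part of any Azure tooling.

```shell
# Sketch: report any variable used by this guide that is still empty.
# check_required_vars is an illustrative helper, not an az CLI feature.
check_required_vars() {
  missing=0
  for name in SUBSCRIPTION_ID NFC_RG NFC_NAME NF_RG NF_NAME; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Returns 0 when everything is set, 1 otherwise.
  return "$missing"
}
```

Calling `check_required_vars || echo "fix the exports first"` before the pre-checks surfaces a forgotten export early rather than as a confusing az CLI error.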
74+
75+
Check that the NFC is in Provisioned state.
76+
```
az networkfabric controller show -g $NFC_RG --resource-name $NFC_NAME --subscription $SUBSCRIPTION_ID -o table
```
79+
80+
Check the NF status:
81+
```
az networkfabric fabric show -g $NF_RG --resource-name $NF_NAME --subscription $SUBSCRIPTION_ID -o table
```
84+
**Note down the fabricVersion and provisioningState. If provisioningState is not Succeeded, do not continue the upgrade until the issue is resolved.**
85+
86+
3. Microsoft.NexusIdentity user RP must be registered on the customer subscription. To check:
87+
```
az provider show --namespace Microsoft.NexusIdentity -o table --subscription $SUBSCRIPTION_ID
Namespace RegistrationPolicy RegistrationState
----------------------- -------------------- -------------------
Microsoft.NexusIdentity RegistrationRequired Registered
```
93+
94+
If not registered, run the following:
95+
```
az provider register --namespace Microsoft.NexusIdentity --wait --subscription $SUBSCRIPTION_ID

az provider show --namespace Microsoft.NexusIdentity -o table --subscription $SUBSCRIPTION_ID
Namespace RegistrationPolicy RegistrationState
----------------------- -------------------- -------------------
Microsoft.NexusIdentity RegistrationRequired Registered
```
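The check-then-register sequence above can be folded into one idempotent step. This is a sketch, assuming `SUBSCRIPTION_ID` is exported as shown earlier; `ensure_provider_registered` is a hypothetical helper name, and it reads the `registrationState` property via `--query`.

```shell
# Sketch: register a resource provider only if it is not already registered.
# ensure_provider_registered is an illustrative name, not an az CLI command.
ensure_provider_registered() {
  namespace="$1"
  # Read the current registration state as plain text.
  state=$(az provider show --namespace "$namespace" \
    --subscription "$SUBSCRIPTION_ID" --query registrationState -o tsv)
  if [ "$state" != "Registered" ]; then
    az provider register --namespace "$namespace" --wait \
      --subscription "$SUBSCRIPTION_ID"
  fi
}
```

Usage: `ensure_provider_registered Microsoft.NexusIdentity`. Running it twice is harmless because the register call is skipped once the state reads `Registered`.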
103+
104+
4. Minimum available disk space on each device (CE, TOR, NPB, Mgmt Switch) must be more than 3.5 GB for a successful device upgrade.
105+
106+
Verify the available space on all devices using the following admin action. If there isn't enough space, remove archived EOS images and support bundle files.
107+
```
az networkfabric device run-ro --resource-name "<NF_DEVICE_NAME>" --resource-group "<NF_RG>" --ro-command "dir flash" --subscription "<CUSTOMER_SUB_ID>" --debug
```
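Comparing the reported free space against the 3.5 GB minimum by eye is error-prone across many devices. The sketch below makes two assumptions that are not confirmed by this guide: that you have extracted the raw `dir flash` text from the command response, and that it ends with a summary of the form `N bytes total (M bytes free)`. It also interprets 3.5 GB as 3.5 GiB. The helper names are hypothetical.

```shell
# Assumption: 3.5 GB is interpreted as 3.5 GiB = 3.5 * 1024^3 bytes.
MIN_FREE_BYTES=3758096384

# Sketch: extract the free-byte count from `dir flash` text on stdin.
# Assumes a summary line like "... bytes total (1234 bytes free)".
free_bytes_from_dir_output() {
  sed -n 's/.*(\([0-9][0-9]*\) bytes free).*/\1/p' | tail -n 1
}

# Succeeds only when the free space is strictly above the minimum.
has_enough_flash() {
  free=$(free_bytes_from_dir_output)
  [ -n "$free" ] && [ "$free" -gt "$MIN_FREE_BYTES" ]
}
```

For example, `has_enough_flash < dir-flash-output.txt || echo "free up flash first"`, where `dir-flash-output.txt` is a hypothetical file holding the saved command output.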
110+
111+
5. Check that no other fabric upgrade is running within the same NFC, to prevent contention issues with the NFC storage account:
112+
```
az networkfabric fabric list --subscription "<CUSTOMER_SUB_ID>" -o table | grep "<NFC_NAME>"
```
115+
116+
Verify there are no other Fabrics showing `provisioningState` as `Updating` on the same Network Fabric Controller.
117+
118+
6. Check Network Packet Broker for any orphaned Network Taps:
119+
In the AZ Portal:
120+
* Select Network Fabrics -> <NF_NAME>.
121+
* Click on the Resource group.
122+
* In the Resources list, filter on "network packet broker".
123+
* Click on the network packet broker name.
124+
* Click on "Network Taps".
125+
* All Network taps should be `Succeeded` for `Configuration State` and `Provisioning State` and `Enabled` for `Administrative State`.
126+
* Look for any taps with a red `X` and a status of `not found`, `failed` or `error`.
127+
128+
If any taps show a status of "not found", "failed", or "error", do not start the upgrade until the network tap issues are cleared.
129+
130+
7. Run cabling validation report:
131+
```
az networkfabric fabric validate-configuration --resource-group $NF_RG --resource-name $NF_NAME --validate-action "Cabling" --debug
```
134+
Follow the link in the output to the Storage Account where the report is uploaded in JSON format. Satya wrote a Python tool that converts the report to HTML (see Teams chat).
135+
136+
Attach the zipped HTML validation report to the iTrack, e.g. report-01-05-2024-15-00.zip
137+
138+
Add a comment to the iTrack listing any interfaces with NotConnected status or mismatched ports.
139+
140+
The report identifies the following issues:
141+
Device Name Interface Map Name Validation Result Status Destination Hostname Destination Port Device
142+
Configuration Error Reason Map Type
143+
<LIST_FAILURES>
144+
145+
Under the `Unknown` section, verify any ports with `Not-Connected` status against the BOM.
146+
147+
List all port connection and cabling issues in the iTrack and notify AT&T Nitro/Team.
148+
149+
8. Notify SRE of Upgrade and ETA:
150+
151+
The DE sends a notification to SRE about the production resource upgrade and its ETA using the following template:
152+
```
Title: <ENVIRONMENT> <REGION> <FABRIC_NAME> Runtime upgrade to <FABRIC_RUNTIME_VERSION> <START_TIME> - Completion ETA <DURATION>

Nexus DE Team <ENVIRONMENT> <REGION> <FABRIC_NAME> Runtime upgrade to <FABRIC_RUNTIME_VERSION> <START_TIME> - Completion ETA <DURATION>

Subscription: <CUSTOMER_SUB_ID>
NFC: <NFC_NAME>
CM: <CM_NAME>
Fabric: <FABRIC_NAME>
Cluster: <CLUSTER_NAME>
Region: <AZURE_REGION>
Version: <NEXUS_VERSION>

cc: aods-de-
```
172+
173+
9. Azure Resource Tags on Deployment Resources:
174+
175+
To help track customer deployments, the DE adds the following tag in the Azure portal to DE-created Azure resources for the Fabric:

| Name | Value |
|------|-------|
| BF in progress | `<DE_ID>` |

When the deployment is complete or the issue is resolved, the DE removes the tag.
184+
185+
186+
# Procedure
187+
**STEP 1: TRIGGER UPGRADE ON FABRIC**
188+
The operator triggers the upgrade POST action on the NetworkFabric via Azure CLI or the portal:
189+
```
az networkfabric fabric upgrade -g $NF_RG --resource-name $NF_NAME --action start --version "5.0.0" --subscription $SUBSCRIPTION_ID --debug
{}
```
193+
**Note: Output showing `{}` indicates successful execution of upgrade command**
194+
195+
As part of the above POST action request, RP validates if the version upgrade is allowed from the existing fabric version. We only allow an upgrade from 4.0.0 to 5.0.0.
196+
The above command marks the NetworkFabric Under Maintenance and prevents any other operation on the Fabric.
197+
198+
**STEP 2: TRIGGER UPGRADE PER DEVICE**
199+
Operator triggers upgrade POST actions per device (in order as recommended by NNF team). The service completes the device upgrade to success, and marks it upgraded to a newer version.
200+
**NOTE: If a device upgrade fails, the issue must be mitigated manually before the operator can proceed to upgrade the next device. Please raise an Azure portal support request.**
201+
202+
To see steps on how to create an Azure Portal Support Request and to see the flow for ticketing deployment issues, click here:
203+
https://dev.azure.com/msazuredev/AzureForOperatorsIndustry/_wiki/wikis/AzureForOperatorsIndustry.wiki/27142/Azure-Portal-Support-Request-and-IcM-Ticketing-Process
204+
205+
An `8-rack` environment will have the following 30 devices:
206+
Aggr Rack - 2 CE's, 2 NPB's, 2 Mgmt Switches (6 devices)
207+
8 Compute Racks - Each compute rack has 2 TOR's and 1 Mgmt Switch (24 devices)
208+
209+
A `4-rack` environment will have the following 17 devices:
210+
Aggr Rack - 2 CE's, 1 NPB, 2 Mgmt Switches (5 devices)
211+
4 Compute Racks - Each compute rack has 2 TOR's and 1 Mgmt Switch (12 devices)
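
The device totals above follow from a simple formula: the aggregation rack holds 2 CEs, the NPBs, and 2 management switches, and each compute rack holds 2 TORs and 1 management switch. A minimal sketch (the function name `expected_device_count` is hypothetical) that can be used to sanity-check the length of the device list before starting:

```shell
# Sketch: expected number of fabric devices for a given topology.
# $1 = number of compute racks, $2 = number of NPBs in the aggregation rack.
expected_device_count() {
  compute_racks="$1"
  npb_count="$2"
  # Aggregation rack: 2 CEs + NPBs + 2 management switches.
  aggr=$((2 + npb_count + 2))
  # Each compute rack: 2 TORs + 1 management switch.
  compute=$((compute_racks * 3))
  echo $((aggr + compute))
}
```

`expected_device_count 8 2` prints 30 and `expected_device_count 4 1` prints 17, matching the two environments described above.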
212+
213+
**Device Upgrade Order**
214+
***Compute Racks:***
215+
1. Odd numbered TORs (***NOTE: All DEVICES IN THIS GROUP CAN BE DONE IN PARALLEL; WAIT FOR SUCCESSFUL UPGRADE ON ALL DEVICES BEFORE MOVING TO THE NEXT GROUP***)
216+
217+
2. Even numbered TORs (***NOTE: All DEVICES IN THIS GROUP CAN BE DONE IN PARALLEL; WAIT FOR SUCCESSFUL UPGRADE ON ALL DEVICES BEFORE MOVING TO THE NEXT GROUP***)
218+
219+
3. Compute rack management switches (***NOTE: All DEVICES IN THIS GROUP CAN BE DONE IN PARALLEL; WAIT FOR SUCCESSFUL UPGRADE ON ALL DEVICES BEFORE MOVING TO THE NEXT GROUP***)
220+
221+
***Aggregate Racks:***
222+
223+
4. CEs are to be upgraded one after the other in a serial manner. ***(NOTE: WAIT FOR SUCCESSFUL UPGRADE ON EACH DEVICE BEFORE MOVING TO THE NEXT DEVICE)*** Stop the upgrade procedure if there are any failures corresponding to the CE upgrade operation. After each CE upgrade, ***wait for a duration of five minutes*** to ensure that the recovery process is complete before proceeding to the next device upgrade. ***(NOTE: WAIT FOR SUCCESSFUL UPGRADE ON BOTH CE DEVICES BEFORE MOVING TO THE NPBs)***
228+
229+
5. NPBs are to be upgraded one after the other in a serial manner. ***(NOTE: Most sites will be an 8-rack environment with two NPB devices. If the site has a 4-rack environment, there will only be one NPB device. WAIT FOR SUCCESSFUL UPGRADE ON EACH DEVICE BEFORE MOVING TO THE NEXT DEVICE; WAIT FOR SUCCESSFUL UPGRADE ON BOTH NPB DEVICES BEFORE MOVING TO THE Aggr Mgmt Switches)***
231+
232+
6. Remaining aggr rack mgmt switches are to be upgraded one after the other in a serial manner. ***(NOTE: WAIT FOR SUCCESSFUL UPGRADE ON EACH DEVICE BEFORE MOVING TO THE NEXT DEVICE)***
234+
235+
236+
Verify all devices have ConfigurationState `Succeeded` and ProvisioningState `Succeeded`.
237+
```
az networkfabric device list -g $NF_RG -o table --subscription $SUBSCRIPTION_ID
```
240+
241+
Run the following device upgrade command on the devices **following the Device Upgrade Order listed above**.
242+
```
az networkfabric device upgrade --version 5.0.0 -g $NF_RG --resource-name $NF_DEVICE_NAME --subscription $SUBSCRIPTION_ID --debug
```
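For the groups in the Device Upgrade Order that may run in parallel, the "start all, then wait for all" rule can be sketched with shell background jobs. This is an illustration only, assuming `NF_RG` and `SUBSCRIPTION_ID` are exported and that the device names are passed by the operator; `upgrade_group` is a hypothetical helper, and `--debug` is omitted here for brevity, so add it back if you need the correlation IDs.

```shell
# Sketch: upgrade every device in one group in parallel and act as a barrier.
# $1 = target version, remaining arguments = device names in the group.
upgrade_group() {
  version="$1"
  shift
  pids=""
  for device in "$@"; do
    # Start each device upgrade as a background job.
    az networkfabric device upgrade --version "$version" -g "$NF_RG" \
      --resource-name "$device" --subscription "$SUBSCRIPTION_ID" &
    pids="$pids $!"
  done
  failed=0
  # Wait for every job; remember whether any of them failed.
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  return "$failed"
}
```

A non-zero return means at least one device in the group failed, so the next group should not be started. Serial groups (CEs, NPBs, aggregation rack management switches) can call the same helper with a single device name at a time.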
246+
Gather the ASYNC URL and Correlation ID from the `--debug` output for further troubleshooting if needed.
247+
```
cli.azure.cli.core.sdk.policies: 'mise-correlation-id': '<MISE_CID>'
cli.azure.cli.core.sdk.policies: 'x-ms-correlation-request-id': '<CORRELATION_ID>'
cli.azure.cli.core.sdk.policies: 'Azure-AsyncOperation': '<ASYNC_URL>'
```
252+
Check the activity log and DGrep for issues on each device in the group before starting the next group.
253+
```
Endpoint: Diagnostics PROD
Namespace: AfoNetworkFabric
Events to search: APIValidationErrors, Errors
Time range: Make sure to encompass the period for when the upgrade action was executed

Use the following filtering conditions
source
| order by serviceTimestampString asc
| where * contains "<FABRIC_NAME>"
| where * contains "<CORRELATION_ID>"

Example DGREP [query](https://portal.microsoftgeneva.com/s/E8AB3E31).
```
267+
268+
As part of the upgrade, NF devices are kept in maintenance mode. While in maintenance mode, a device drains its traffic and stops advertising routes so that traffic flow to the device stops.
269+
270+
After this step, the NNF service updates the NetworkDevice resource version property to the newer version. Verify the accuracy of this information before moving to the next device.
271+
During this entire workflow, if there is a failure encountered at any step then NNF fails the device upgrade operation. Failures need to be mitigated by human intervention.
272+
273+
The operator triggers device upgrades one at a time.
274+
275+
**STEP 3: POST DEVICE UPGRADES**
276+
After device upgrades are complete, make sure that all the devices are showing as 5.0.0 by running the following command:
277+
```
az networkfabric device list -g $NF_RG --query "[].{name:name,version:version}" -o table --subscription $SUBSCRIPTION_ID
```
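With 17 or 30 devices, scanning the table for a stale version is easy to get wrong. A small filter can list only the devices that are not yet at the target version. This is a sketch; `lagging_devices` is a hypothetical helper that reads tab-separated `name` and `version` pairs on stdin.

```shell
# Sketch: print the names of devices whose version differs from the target.
# stdin: one "name<TAB>version" pair per line; $1 = target version.
lagging_devices() {
  awk -v target="$1" '$2 != target { print $1 }'
}
```

One way to feed it is `az networkfabric device list -g $NF_RG --query "[].[name,version]" -o tsv --subscription $SUBSCRIPTION_ID | lagging_devices 5.0.0`. Empty output means every device reports the target version.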
280+
281+
**STEP 4: COMPLETE NETWORK FABRIC UPGRADE**
282+
Once all the devices are upgraded, run the following command to take the network fabric out of maintenance state.
283+
284+
```
az networkfabric fabric upgrade --action Complete --version "5.0.0" -g $NF_RG --resource-name $NF_NAME --debug --subscription $SUBSCRIPTION_ID
```
287+
Once complete, run the following command to check fabric version is showing 5.0.0:
288+
```
az networkfabric fabric list -g $NF_RG --query "[].{name:name,fabricVersion:fabricVersion,configurationState:configurationState,provisioningState:provisioningState}" -o table --subscription $SUBSCRIPTION_ID

az networkfabric fabric show -g $NF_RG --resource-name $NF_NAME --subscription $SUBSCRIPTION_ID
```
293+
294+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
295+
296+
**Troubleshooting if device update failed:**
297+
1. Check the device operation state from the admin API.
298+
2. Check errors in AzCli output.
299+
300+
Troubleshoot Network Fabric upgrade TSG docs: [How to troubleshoot a deployment run](https://eng.ms/docs/strategic-missions-and-technologies/strategic-missions-and-technologies-organization/azure-for-operators-industry/network-cloud/afoi-network-cloud/network-cloud-tsgs/doc/undercloud/deployment/how-to-troubleshoot-deployment-run) and [NetworkFabric upgrade start failed](https://eng.ms/docs/cloud-ai-platform/azure-edge-platform-aep/aep-edge/nexus/nexus-network-fabric/nexus-network-fabric-troubleshooting-guides/networkfabric/networkfabric-upgrade-start-failed)
301+
302+
303+
# Create Azure Support Request in Portal
304+
305+
For any device upgrade failure issue, please create an Azure Portal Support request to facilitate better tracking.
306+
307+
To see steps on how to create an Azure Portal Support Request and to see the flow for ticketing deployment issues, click here:
308+
https://dev.azure.com/msazuredev/AzureForOperatorsIndustry/_wiki/wikis/AzureForOperatorsIndustry.wiki/27142/Azure-Portal-Support-Request-and-IcM-Ticketing-Process
309+
310+
311+
# Post-upgrade Validation
312+
1. Validation with prov-val.sh Scripts:
313+
Clone nc-labs: git clone https://[email protected]/msazuredev/AzureForOperatorsIndustry/_git/nc-labs
314+
```
cd ~/att/nc-labs/scripts/validation-bash-scripts
$ ./prov-val-prod.sh
```
318+
319+
Attach `prov-val-prod` output log to the iTrack ticket.
320+
321+
322+
2. Notify Operations to perform upgrade validation.
323+
324+
The following template can be used through email or a ticketing system:
325+
```
Title: <ENVIRONMENT> <REGION> <FABRIC_NAME> Runtime <FABRIC_RUNTIME_VERSION> Upgrade Complete - Validation Requested
Operations

Nexus DE Team <ENVIRONMENT> <REGION> <FABRIC_NAME> Runtime <FABRIC_RUNTIME_VERSION> Upgrade Complete - Validation Requested

Subscription: <CUSTOMER_SUB_ID>
NFC: <NFC_NAME>
CM: <CM_NAME>
Fabric: <FABRIC_NAME>
Cluster: <CLUSTER_NAME>
Region: <AZURE_REGION>
Version: <NEXUS_VERSION>

cc: <
```
341+
342+
# Wait for SRE Validation report
343+
SRE will send a validation report and give the OK to hand off to the customer before continuing.
344+
345+
# Remove Fabric Tag
346+
Remove the following tag in Azure portal on the Fabric resource added for upgrade tracking:
347+
`BF in progress: <DE_ID>`
348+
349+
## Close out any Work Items in your ticketing system
350+
* Update Task hours for upgrade duration.
351+
* Set Fabric upgrade work item to `Complete`.
