Commit dd5c4be

Merge pull request #223802 from robswain/reliability
Draft: Add Reliability article
2 parents 6b40678 + 08b47b6 commit dd5c4be

4 files changed

+134
-0
lines changed

Lines changed: 127 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,127 @@
1+
---
2+
title: Reliability in Azure Private 5G Core
3+
description: Find out about reliability in Azure Private 5G Core
4+
author: robswain
5+
ms.author: robswain
6+
ms.service: private-5g-core
7+
ms.topic: overview
8+
ms.custom: subject-reliability, references_regions
9+
ms.date: 01/31/2022
10+
---
11+
12+
# Reliability for Azure Private 5G Core
13+
14+
This article describes reliability support in Azure Private 5G Core. It covers both regional resiliency with availability zones and cross-region resiliency with disaster recovery. For an overview of reliability in Azure, see [Azure reliability](/azure/architecture/framework/resiliency/overview.md).
15+
16+
## Availability zone support
17+
18+
The Azure Private 5G Core service is automatically deployed as zone-redundant in Azure regions that support availability zones, as listed in [Availability zone service and regional support](/azure/reliability/availability-zones-service-support). If a region supports availability zones, all Azure Private 5G Core resources created in that region can be managed from any of its availability zones.
19+
20+
No further work is required to configure or manage availability zones. Failover between availability zones is automatic.
21+
22+
Azure Private 5G Core is currently available in the EastUS and WestEurope regions.
23+
24+
### Zone down experience
25+
26+
In a zone-wide outage scenario, users should experience no impact because the service automatically moves to a healthy zone. At the start of a zone-wide outage, you may see in-progress ARM requests time out or fail. Retry any failed operations; new requests are directed to healthy nodes with no impact on users. You can still create new resources and update, monitor, and manage existing resources during the outage.
27+
28+
### Safe deployment techniques
29+
30+
The application ensures that all cloud state is replicated between availability zones in the region, so all management operations continue without interruption. The packet core runs at the Edge and is unaffected by the zone failure, so it continues to provide service for users.
31+
32+
## Disaster recovery: cross-region failover
33+
34+
Azure Private 5G Core is only available in multi-region (3+N) geographies. The service automatically replicates SIM credentials to a backup region in the same geography. This means that there's no loss of data in the event of region failure. Within four hours of the failure, all resources in the failed region are available to view through the Azure portal and ARM tools but will be read-only until the failed region is recovered. The packet core running at the Edge continues to operate without interruption, and network connectivity is maintained.
35+
36+
To view all regions that support Azure Private 5G Core, see [Products available by region](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/).
37+
38+
### Cross-region disaster recovery in multi-region geography
39+
40+
Microsoft is responsible for outage detection, notification and support for the Azure cloud aspects of the Azure Private 5G Core service.
41+
42+
#### Outage detection, notification, and management
43+
44+
Microsoft monitors the underlying resources providing the Azure Private 5G Core service in each region. If those resources start to show failures or health monitoring alerts that aren't restricted to a single availability zone, Microsoft will move the service to another supported region in the same geography. This is an Active-Active pattern. The service health for a particular region can be found on [Azure Service Health](https://status.azure.com/status) (Azure Private 5G Core is listed in the **Networking** section). You'll be notified of any region failures through normal Azure communications channels.
45+
46+
The service automatically replicates SIM credentials owned by the service to the backup region using Cosmos DB multi-region writes, so there's no loss of data in the event of region failure.
47+
48+
Azure Private 5G Core resources deployed in the failed region will become read-only, but resources in all other regions will continue to operate unaffected. If you need to be able to write resources at all times, follow the instructions in [Set up disaster recovery and outage detection](#set-up-disaster-recovery-and-outage-detection) to perform your own disaster recovery operation and set up the service in another region.
49+
50+
The packet core running at the Edge continues to operate without interruption and network connectivity will be maintained.
51+
52+
### Set up disaster recovery and outage detection
53+
54+
This section describes what action you can take to ensure you have a fully active management plane for the Azure Private 5G Core service in the event of a region failure. This is required if you want to be able to modify your resources in the event of a region failure.
55+
56+
Note that this procedure will cause an outage of your packet core service and interrupt network connectivity to your UEs for up to eight hours, so we recommend you only use it if you have a business-critical reason to manage resources while the Azure region is down.
57+
58+
In advance of a disaster recovery event, you must back up your resource configuration to another region that supports Azure Private 5G Core. When the region failure occurs, you can redeploy the packet core using the resources in your backup region.
59+
60+
##### Preparation
61+
62+
There are two types of Azure Private 5G Core configuration data that need to be backed up for disaster recovery: mobile network configuration and SIM credentials. We recommend that you:
63+
64+
- Update the SIM credentials in the backup region every time you add new SIMs to the primary region.
65+
- Back up the mobile network configuration at least once a week, or more often if you're making frequent or large changes to the configuration such as creating a new site.
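
As a rough illustration of the weekly backup recommendation above, the following Python sketch flags a mobile network configuration backup that is older than seven days. The function name, the constant, and the timestamps are hypothetical; the sketch assumes you record the time of each backup yourself, because nothing in this article describes a service API for that.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy taken from the recommendation above: back up at least weekly.
MAX_BACKUP_AGE = timedelta(days=7)


def backup_is_stale(last_backup, now, max_age=MAX_BACKUP_AGE):
    """Return True if the last configuration backup is older than max_age."""
    return now - last_backup > max_age


# Hypothetical timestamps, recorded by your own backup process.
last_backup = datetime(2023, 1, 1, tzinfo=timezone.utc)
now = datetime(2023, 1, 10, tzinfo=timezone.utc)

if backup_is_stale(last_backup, now):
    print("Mobile network configuration backup is overdue")
```

If you make frequent or large configuration changes, you could pass a shorter `max_age`.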
66+
67+
**Mobile network configuration**
68+
<br></br>
69+
Follow the instructions in [Move resources to a different region](/azure/private-5g-core/region-move) to export your Azure Private 5G Core resource configuration and upload it to the new region. We recommend that you use a new resource group for your backup configuration to clearly separate it from the active configuration. You must give the resources new names to distinguish them from the resources in your primary region. This new region is a passive backup, so to avoid conflicts you must not link the packet core configuration to your edge hardware yet. Instead, store the values from the **packetCoreControlPlanes.platform** field for every packet core in a safe location that can be accessed by whoever will perform the recovery procedure (such as a storage account referenced by internal documentation).
70+
71+
**SIM data**
72+
<br></br>
73+
For security reasons, Azure Private 5G Core will never return the SIM credentials that are provided to the service as part of SIM creation. Therefore, it isn't possible to export the SIM configuration in the same way as other Azure resources. We recommend that whenever new SIMs are added to the primary service, the same SIMs are also added to the backup service by repeating the [Provision new SIMs](/azure/private-5g-core/provision-sims-azure-portal) process for the backup mobile network.
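
Because SIM credentials can't be exported, one way to spot drift between the two regions is to compare the SIM names provisioned in each. The following Python sketch is illustrative only: the function name and SIM names are hypothetical, and it assumes you keep your own record of the SIMs provisioned in each mobile network.

```python
def sims_missing_from_backup(primary_sims, backup_sims):
    """Return the SIM names present in the primary list but absent from the backup list."""
    return sorted(set(primary_sims) - set(backup_sims))


# Hypothetical SIM names taken from your own provisioning records.
primary = ["sim-001", "sim-002", "sim-003"]
backup = ["sim-001", "sim-003"]

for name in sims_missing_from_backup(primary, backup):
    print(f"Provision {name} in the backup mobile network")
```

Running a check like this after each provisioning batch keeps the backup service aligned with the first recommendation in this section.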
74+
75+
**Other resources**
76+
<br></br>
77+
Your Azure Private 5G Core deployment may make use of Azure Key Vaults for storing [SIM encryption keys](/azure/private-5g-core/security#customer-managed-key-encryption-at-rest) or HTTPS certificates for [local monitoring](/azure/private-5g-core/security#access-to-local-monitoring-tools). You must follow the [Azure Key Vault documentation](/azure/key-vault/general/disaster-recovery-guidance) to ensure that your keys and certificates will be available in the backup region.
78+
79+
##### Recovery
80+
In the event of a region failure, first validate that all the resources in your backup region are present by querying the configuration through the Azure portal or API (see [Move resources to a different region](/azure/private-5g-core/region-move)). If all the resources aren't present, stop here and don't follow the rest of this procedure. You may not be able to recover service at the edge site without the resource configuration.
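
The validation step above amounts to comparing the resources you expect in the backup region with the resources you can see there. The following Python sketch shows one way to structure that comparison. The resource type labels and names are hypothetical, and the sketch assumes you have already collected both lists yourself, for example from your backup records and from the Azure portal.

```python
from collections import defaultdict


def missing_by_type(expected, actual):
    """Group the expected (resource_type, name) pairs that are absent from actual by type."""
    missing = defaultdict(list)
    for resource_type, name in sorted(set(expected) - set(actual)):
        missing[resource_type].append(name)
    return dict(missing)


# Hypothetical inventories: what was backed up versus what is visible now.
expected = [
    ("mobileNetwork", "backup-network"),
    ("site", "backup-site-1"),
    ("site", "backup-site-2"),
]
actual = [
    ("mobileNetwork", "backup-network"),
    ("site", "backup-site-1"),
]

gaps = missing_by_type(expected, actual)
if gaps:
    print("Stop: the backup region is incomplete:", gaps)
else:
    print("All backed-up resources are present")
```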
81+
82+
The recovery process is split into three stages for each packet core:
83+
84+
1. Disconnect the Azure Stack Edge device from the failed region by performing a reset
85+
1. Connect the Azure Stack Edge device to the backup region
86+
1. Reinstall and validate the installation
87+
88+
You must repeat this process for every packet core in your mobile network.
89+
90+
> [!CAUTION]
91+
> The recovery procedure will cause an outage of your packet core service and interrupt network connectivity to your UEs for up to eight hours for each packet core. We recommend that you only perform this procedure where you have a business-critical need to manage the Azure Private 5G Core deployment through Azure during the region failure.
92+
93+
**Disconnect the Azure Stack Edge device from the failed region**
94+
<br></br>
95+
The Azure Stack Edge device is currently running the packet core software and is controlled from the failed region. To disconnect the Azure Stack Edge device from the failed region and remove the running packet core, you must follow the reset and reactivate instructions in [Reset and reactivate your Azure Stack Edge device](/azure/databox-online/azure-stack-edge-reset-reactivate-device). Note that this will remove ALL software currently running on your Azure Stack Edge device, not just the packet core software, so ensure that you have the capability to reinstall any other software on the device. This will start a network outage for all devices connected to the packet core on this Azure Stack Edge device.
96+
97+
**Connect the Azure Stack Edge device to the backup region**
98+
<br></br>
99+
Follow the instructions in [Commission the AKS cluster](/azure/private-5g-core/commission-cluster) to redeploy the Azure Kubernetes Service cluster on your Azure Stack Edge device. Ensure that you use a different name for this new installation to avoid clashes when the failed region recovers. As part of this process you'll get a new custom location ID for the cluster, which you should note down.
100+
101+
**Reinstall and validate**
102+
<br></br>
103+
Take a copy of the **packetCoreControlPlanes.platform** values you stored in [Preparation](#preparation) and update the **packetCoreControlPlane.platform.customLocation** field with the custom location ID you noted above. Ensure that **packetCoreControlPlane.platform.azureStackEdgeDevice** matches the ID of the Azure Stack Edge device you want to install the packet core on. Now follow [Modify a packet core](/azure/private-5g-core/modify-packet-core) to update the backup packet core with the platform values. This will trigger a packet core deployment onto the Azure Stack Edge device.
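
The update described above can be sketched in Python as follows. This is a minimal illustration, not the service's API: it assumes the stored platform values are held as a dictionary whose `customLocation` and `azureStackEdgeDevice` entries each carry an `id` key, and the function name and placeholder IDs are hypothetical.

```python
import copy


def platform_for_recovery(stored_platform, new_custom_location_id, ase_device_id):
    """Return a copy of the stored platform values that points at the new custom location."""
    platform = copy.deepcopy(stored_platform)  # leave the stored backup values untouched
    platform["customLocation"] = {"id": new_custom_location_id}
    # Refuse to continue if the stored values name a different Azure Stack Edge device.
    if platform.get("azureStackEdgeDevice", {}).get("id") != ase_device_id:
        raise ValueError("azureStackEdgeDevice does not match the target device")
    return platform


# Placeholder IDs; real values are full Azure resource IDs.
stored = {
    "azureStackEdgeDevice": {"id": "example-ase-device-id"},
    "customLocation": {"id": "old-custom-location-id"},
}
updated = platform_for_recovery(stored, "new-custom-location-id", "example-ase-device-id")
print(updated["customLocation"]["id"])
```

Working on a copy keeps the original backup values available, which is useful later when you need the old custom location ID to clean up the recovered region.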
104+
105+
You should follow your normal process for validating a new site install to confirm that UE connectivity has been restored and all network functionality is operational. In particular, you should confirm that the site dashboards in the Azure portal show UE registrations and that data is flowing through the data plane.
106+
107+
##### Failed region restored
108+
109+
When the failed region recovers, you should ensure the configuration in the two regions is in sync by performing a backup from the active backup region to the recovered primary region, following the steps in [Preparation](#preparation).
110+
111+
You must also check for and remove any resources in the recovered region that haven't been destroyed by the preceding steps:
112+
113+
- For each Azure Stack Edge device that you moved to the backup region (following the steps in [Recovery](#recovery)), you must find and delete the old Arc cluster resource. The ID of this resource is in the **packetCoreControlPlane.platform.customLocation** field from the values you backed up in [Preparation](#preparation). The state of this resource will be **disconnected** because the corresponding Kubernetes cluster was deleted as part of the recovery process.
114+
- For each packet core that you moved to the backup region (following the steps in [Recovery](#recovery)), you must find and delete any NFM objects in the recovered region. These will be listed in the same resource group as the packet core control plane resources and the **Region** value will match the recovered region.
115+
116+
You then have two choices for ongoing management:
117+
118+
1. Use the operational backup region as the new primary region and use the recovered region as a backup. No further action is required.
119+
1. Make the recovered region the new active primary region by following the instructions in [Move resources to a different region](/azure/private-5g-core/region-move) to switch back to the recovered region.
120+
121+
##### Testing
122+
123+
If you want to test your disaster recovery plans, you can follow the recovery procedure for a single packet core at any time. Note that this will cause an outage of your packet core service and interrupt network connectivity to your UEs for up to four hours, so we recommend only doing this with non-production packet core deployments or at a time when an outage won't adversely affect your business.
124+
125+
## Next steps
126+
127+
- [Resiliency in Azure](/azure/availability-zones/overview.md)

articles/private-5g-core/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -48,6 +48,8 @@ items:
  href: distributed-tracing.md
  - name: Packet core dashboards
  href: packet-core-dashboards.md
+ - name: Reliability
+ href: reliability-private-5g-core.md
  - name: Security
  href: security.md
  - name: Azure Stack Edge disconnects

articles/reliability/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -215,6 +215,8 @@
  href: ../network-watcher/frequently-asked-questions.yml?bc=%2fazure%2freliability%2fbreadcrumb%2ftoc.json&toc=%2fazure%2freliability%2ftoc.json#service-availability-and-redundancy
  - name: Azure Notification Hubs
  href: ../notification-hubs/availability-zones.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
+ - name: Azure Private 5G Core
+ href: ../private-5g-core/reliability-private-5g-core.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
  - name: Azure Private Link
  href: ../private-link/availability.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
  - name: Azure Public IP

articles/reliability/breadcrumb/toc.yml

Lines changed: 3 additions & 0 deletions
@@ -143,4 +143,7 @@ items:
  topicHref: /azure/reliability/index
  - name: Reliability
  tocHref: /azure/azure-government/
+ topicHref: /azure/reliability/index
+ - name: Reliability
+ tocHref: /azure/private-5g-core/
  topicHref: /azure/reliability/index
