|
| 1 | +--- |
| 2 | +title: "Azure Operator Nexus: Availability" |
| 3 | +description: Overview of the availability features of Azure Operator Nexus. |
| 4 | +author: joemarshallmsft |
| 5 | +ms.author: joemarshall |
| 6 | +ms.service: azure-operator-nexus |
| 7 | +ms.topic: conceptual |
| 8 | +ms.date: 02/15/2024 |
| 9 | +ms.custom: template-concept |
| 10 | +--- |
| 11 | + |
| 12 | +# Introduction to Availability |
| 13 | + |
| 14 | +When it comes to availability, there are two areas to consider: |
| 15 | + |
| 16 | +- Availability of the Nexus platform itself, including: |
| 17 | + |
| 18 | + - Capacity and Redundancy Planning |
| 19 | + |
| 20 | + - Considering Workload Redundancy Requirements |
| 21 | + |
| 22 | + - Site Deployment and Connection |
| 23 | + |
| 24 | + - Other Networking Considerations for Availability |
| 25 | + |
| 26 | + - Identity and Authentication |
| 27 | + |
| 28 | + - Managing Platform Upgrade |
| 29 | + |
| 30 | +- Availability of the Network Functions (NFs) running on the platform, including: |
| 31 | + |
| 32 | + - Configuration Updates |
| 33 | + |
| 34 | + - Workload Upgrade |
| 35 | + |
| 36 | + - Workload Healing |
| 37 | + |
| 38 | +## Deploy and Configure Nexus for High Availability |
| 39 | + |
| 40 | +[Reliability in Azure Operator Nexus \| Microsoft Learn](https://learn.microsoft.com/en-us/azure/reliability/reliability-operator-nexus) describes how to deploy the Nexus services that run in Azure to maximize availability. |
| 41 | + |
| 42 | +### Capacity and Redundancy Planning |
| 43 | + |
| 44 | +Each on-premises deployment is a multi-rack design, providing physical redundancy at all levels of the stack. |
| 45 | + |
| 46 | +Go through the following steps to help plan a Nexus deployment. |
| 47 | + |
| 48 | +1. Determine the initial set of workloads (Network Functions) that the deployment should be sized to host. |
| 49 | + |
| 50 | +2. Determine the capacity requirements for each of these workloads, allowing for redundancy for each one. |
| 51 | + |
| 52 | +3. If your workloads support a split between control-plane and data-plane elements, consider whether to separately design control-plane sites that can control a larger number of more widely distributed data-plane sites. This option is only likely to be attractive for larger deployments. For smaller deployments, or deployments with workloads that don't support separating the control plane and the data plane, you're more likely to use a homogeneous site architecture where all sites are identical. |
| 53 | + |
| 54 | + |
| 55 | +4. Plan the distribution of workload instances to determine the number of racks needed in each site type, allowing for the fact that each rack is a Nexus zone. The platform can enforce affinity/anti-affinity rules at the scope of these zones, so that workload instances are distributed in a way that is resilient to failures of individual servers or racks. For more on affinity/anti-affinity rules, see [the virtual machine placement hints article](https://learn.microsoft.com/en-us/azure/operator-nexus/howto-virtual-machine-placement-hints). The Nexus Azure Kubernetes Service (NAKS) controller automatically distributes nodes within a cluster across the available servers in a zone as uniformly as possible, within other constraints. As a result, failure of any single server has the minimum impact on the total capacity remaining. |
| 56 | + |
| 57 | +5. Factor in the [threshold redundancy](https://learn.microsoft.com/en-us/azure/operator-nexus/howto-cluster-runtime-upgrade#configure-compute-threshold-parameters-for-runtime-upgrade-using-cluster-updatestrategy) that is required within each site on upgrade. This configuration option tells the orchestration engine the minimum number of worker nodes that must be available for a platform upgrade to be considered successful and allowed to proceed. Reserving these nodes reduces any capacity headroom. Setting a higher bar decreases the overall deployment's resilience to failure of individual nodes, but improves the efficiency with which the available capacity is used. |
| 58 | + |
| 59 | +6. Nexus supports between 1 and 8 racks per site inclusive, with each rack containing 4, 8, 12, or 16 servers. All racks must contain the same number of servers. See [the near-edge compute reference](https://learn.microsoft.com/en-us/azure/operator-nexus/reference-near-edge-compute) for specifics of the resources available for workloads. See the following diagram, and also [the limits and quotas reference](https://learn.microsoft.com/en-us/azure/operator-nexus/reference-limits-and-quotas) for other limits and quotas that might have an impact. |
| 60 | + |
| 61 | +7. Nexus supports one or two Pure Storage arrays. Currently, these arrays are available to workload NFs running as Kubernetes nodes. Workloads running as VMs use local storage from the server they're instantiated on. |
| 62 | + |
| 63 | +8. Other factors to consider are the number of available physical sites, and any per-site limitations such as bandwidth or power. |
| 64 | + |
| 65 | +:::image type="content" source="media/nexus-availability-1.png" alt-text="Diagram of a typical server and rack structure in an Operator Nexus deployment."::: |
| 66 | + |
| 67 | +**Figure 1 - Nexus elements in a single site** |
| 68 | + |
| 69 | +In most cases, capacity planning is an iterative process. Work with your Microsoft account team, which has tooling to make this process more straightforward. |
| 70 | + |
| 71 | +As the demand on the infrastructure increases over time, whether from subscriber growth or from workloads being migrated to the platform, you can scale the Nexus deployment by adding further racks to existing sites or by adding new sites. Which option fits depends on criteria such as the limitations of any single site (power, space, bandwidth, and so on). |
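
The rack limits in the planning steps above (1 to 8 racks per site, with 4, 8, 12, or 16 servers per rack, identical across racks) lend themselves to a quick sanity check during early sizing. The following Python sketch is illustrative only: `SitePlan`, `survives_rack_loss`, and the `servers_needed` figure are hypothetical names, not part of any Nexus API or Microsoft tooling. It also assumes, as a simplification, that every server in a rack is available to workloads, which ignores any servers reserved for platform use.

```python
from dataclasses import dataclass

# Limits quoted in the planning steps above.
MIN_RACKS, MAX_RACKS = 1, 8
ALLOWED_SERVERS_PER_RACK = (4, 8, 12, 16)


@dataclass(frozen=True)
class SitePlan:
    """Hypothetical description of one site; not a Nexus API type."""

    racks: int
    servers_per_rack: int  # identical for every rack in the site

    def validate(self) -> None:
        """Raise ValueError if the plan falls outside the quoted limits."""
        if not MIN_RACKS <= self.racks <= MAX_RACKS:
            raise ValueError(f"racks must be between {MIN_RACKS} and {MAX_RACKS}")
        if self.servers_per_rack not in ALLOWED_SERVERS_PER_RACK:
            raise ValueError(
                f"servers_per_rack must be one of {ALLOWED_SERVERS_PER_RACK}"
            )

    @property
    def total_servers(self) -> int:
        return self.racks * self.servers_per_rack

    def servers_after_rack_loss(self, failed_racks: int = 1) -> int:
        """Servers left if `failed_racks` whole racks (zones) are lost."""
        return max(self.racks - failed_racks, 0) * self.servers_per_rack


def survives_rack_loss(plan: SitePlan, servers_needed: int) -> bool:
    """True if the workloads still fit after losing one rack.

    Assumption: every server is usable by workloads (a simplification).
    """
    plan.validate()
    return plan.servers_after_rack_loss(1) >= servers_needed


if __name__ == "__main__":
    plan = SitePlan(racks=4, servers_per_rack=8)
    print(plan.total_servers)            # 32
    print(survives_rack_loss(plan, 24))  # True
    print(survives_rack_loss(plan, 25))  # False
```

A check like this only covers the arithmetic of rack-level redundancy. It doesn't model the upgrade threshold, storage, or per-site power and bandwidth limits, which still need to be worked through with your Microsoft account team.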