articles/operator-nexus/concepts-nexus-availability.md
+6 −6 (6 additions & 6 deletions)
@@ -53,13 +53,13 @@ Go through the following steps to help plan an Operator Nexus deployment.
3. If your workloads support a split between control-plane and data-plane elements, consider whether to separately design control-plane sites that can control a larger number of more widely distributed data-plane sites. This option is only likely to be attractive for larger deployments. For smaller deployments, or deployments with workloads that don't support separating the control-plane and the data-plane, you're more likely to use a homogeneous site architecture where all sites are identical.
- 4. Plan the distribution of workload instances to determine the number of racks needed in each site type, allowing for the fact that each rack is an Operatorn Operator Nexus zone. The platform can enforce affinity/anti-affinity rules at the scope of these zones, to ensure workload instances are distributed in such a way as to be resilient to failures of individual servers or racks. See [this article](https://learn.microsoft.com/azure/operator-nexus/howto-virtual-machine-placement-hints) for more on affinity/anti-affinity rules. The Operator Nexus Azure Kubernetes Service (NAKS) controller automatically distributes nodes within a cluster across the available servers in a zone as uniformly as possible, within other constraints. As a result, failure of any single server has the minimum impact on the total capacity remaining.
+ 4. Plan the distribution of workload instances to determine the number of racks needed in each site type, allowing for the fact that each rack is an Operator Nexus zone. The platform can enforce affinity/anti-affinity rules at the scope of these zones, to ensure workload instances are distributed in such a way as to be resilient to failures of individual servers or racks. See [this article](https://learn.microsoft.com/azure/operator-nexus/howto-virtual-machine-placement-hints) for more on affinity/anti-affinity rules. The Operator Nexus Azure Kubernetes Service (NAKS) controller automatically distributes nodes within a cluster across the available servers in a zone as uniformly as possible, within other constraints. As a result, failure of any single server has the minimum impact on the total capacity remaining.
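To make the uniform-spread claim concrete, here's a minimal sketch (illustrative figures only, not an Operator Nexus API) of the worst-case capacity loss when a single server in a zone fails:

```python
import math

def worst_case_loss(nodes: int, servers: int) -> int:
    """Maximum nodes lost when one server fails, assuming a uniform spread
    of cluster nodes across the servers in a zone."""
    return math.ceil(nodes / servers)

# e.g. 24 cluster nodes spread across a 16-server rack:
print(worst_case_loss(nodes=24, servers=16))  # at most 2 nodes lost
```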
5. Factor in the [threshold redundancy](https://learn.microsoft.com/azure/operator-nexus/howto-cluster-runtime-upgrade#configure-compute-threshold-parameters-for-runtime-upgrade-using-cluster-updatestrategy) that is required within each site on upgrade. This configuration option indicates to the orchestration engine the minimum number of worker nodes that must be available for a platform upgrade to be considered successful and allowed to proceed. Reserving these nodes eats into any capacity headroom. Setting a higher bar decreases the overall deployment's resilience to failure of individual nodes, but improves utilization of the available capacity.
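As a rough illustration of this trade-off (the parameter names and values here are assumptions for the sketch, not the exact `updateStrategy` schema from the linked how-to), the threshold arithmetic looks like this:

```python
import math

def min_available_nodes(total_nodes: int, threshold_percent: int) -> int:
    """Worker nodes that must remain available for a runtime upgrade to be
    declared successful; the remainder is the tolerated loss."""
    return math.ceil(total_nodes * threshold_percent / 100)

# Example: 16 workers with an 80% threshold -> 13 must stay up,
# so at most 3 node failures are tolerated during the upgrade.
print(min_available_nodes(16, 80))  # 13
```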
6. Operator Nexus supports between 1 and 8 racks per site inclusive, with each rack containing 4, 8, 12, or 16 servers. All racks must be identical in terms of number of servers. See [here](https://learn.microsoft.com/azure/operator-nexus/reference-near-edge-compute) for specifics of the resources available for workloads. See the following diagram, and also [this article](https://learn.microsoft.com/azure/operator-nexus/reference-limits-and-quotas) for other limits and quotas that might have an impact.
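A minimal capacity sketch under the limits just stated (1 to 8 identical racks per site; 4, 8, 12, or 16 servers per rack):

```python
VALID_SERVERS_PER_RACK = {4, 8, 12, 16}

def site_capacity(racks: int, servers_per_rack: int) -> int:
    """Total servers in a site, validating the published per-site limits."""
    if not 1 <= racks <= 8:
        raise ValueError("Operator Nexus supports 1 to 8 racks per site")
    if servers_per_rack not in VALID_SERVERS_PER_RACK:
        raise ValueError("each rack must contain 4, 8, 12, or 16 servers")
    return racks * servers_per_rack

print(site_capacity(4, 16))  # 64 servers available to the site
```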
- 7. Operator Nexus supports one or two Pure storage arrays. Currently, these arrays are available to workload NFs running as Kubernetes nodes. Workloads running as VMs use local storage from the server they're instantiated on.
+ 7. Operator Nexus supports one or two storage appliances. Currently, these appliances are available to workload NFs running as Kubernetes nodes. Workloads running as VMs use local storage from the server they're instantiated on.
8. Other factors to consider are the number of available physical sites, and any per-site limitations such as bandwidth or power.
@@ -91,11 +91,11 @@ Although the initial requirement was for 400 nodes across the deployment, the de
:::image type="content" source="media/nexus-availability-2.png" alt-text="A graph of the number of nodes needed for the workload, and the additional requirements for redundancy.":::
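As a hedged reconstruction of the layering arithmetic the graph depicts: the 400-node base requirement comes from the worked example, but the site and rack counts below are assumptions for illustration, not figures from it. Each redundancy layer inflates the total:

```python
import math

base_nodes = 400      # base requirement, from the worked example
sites = 4             # assumed split for this sketch
racks_per_site = 8    # assumed

# Survive the loss of one whole site: remaining sites carry the base load.
per_site = math.ceil(base_nodes / (sites - 1))          # 134 nodes per site

# Within each site, survive the loss of one rack.
per_rack = math.ceil(per_site / (racks_per_site - 1))   # 20 nodes per rack
total = per_rack * racks_per_site * sites               # 640 nodes overall

print(per_site, per_rack, total)  # 134 20 640
```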
- For another workload, you might choose not to "layer" the multiple levels of redundancy, taking the view that designing for concurrent failure of one site, a rack in another site and a server in another rack in that same site is overkill. Ultimately, the optimum design depends on the specific service offered by the workload, and details of the workload itself, in particular its load-balancing functionality. Modeling the service using Markov chains to identify the various error modes, with associated probabilities, would also help determine which errors might realistically occur simultaneously. For example, a workload that is able to apply back-pressure when a given site is suffering from reduced capacity due to a server failure might then be able to redirect traffic to one of the remaining sites which still have full redundancy.
+ For another workload, you might choose not to "layer" the multiple levels of redundancy, taking the view that designing for concurrent failure of one site, a rack in another site and a server in another rack in that same site is overkill. Ultimately, the optimum design depends on the specific service offered by the workload, and details of the workload itself, in particular its load-balancing functionality. Modeling the service using Markov chains to identify the various error modes, with associated probabilities, would also help determine which errors might realistically occur simultaneously. For example, a workload that is able to apply back-pressure when a given site is suffering from reduced capacity due to a server failure might then be able to redirect traffic to one of the remaining sites that still have full redundancy.
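For a flavor of the Markov-style modeling suggested here, a toy two-state (up/down) model per server yields a steady-state availability and, from it, the likelihood of concurrent failures; the failure and repair rates below are invented for illustration:

```python
from math import comb

mtbf_hours = 10_000.0  # assumed mean time between failures per server
mttr_hours = 4.0       # assumed mean time to repair

# Steady state of the two-state (up/down) Markov chain.
availability = mtbf_hours / (mtbf_hours + mttr_hours)
p_down = 1.0 - availability

def p_concurrent_failures(n: int, k: int) -> float:
    """Probability that exactly k of n independent servers are down at once."""
    return comb(n, k) * p_down**k * availability**(n - k)

# e.g. chance that 2 of the 16 servers in a rack are down simultaneously
print(f"{p_concurrent_failures(16, 2):.2e}")
```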
### Site Deployment and Connection
- Each Operator Nexus site is connected to an Azure region that hosts the in-Azure resources such as Cluster Manager, Operator Nexus Fabric Controller etc. Ideally, connect each Operator Nexus site to a different Azure region in order to maximize the resilience of the Operator Nexus deployment to any interruption of the Azure regions. Depending on the geography, there is likely to be a trade-off between maximizing the number of distinct Azure regions the deployment is taking a dependency on, and any other restrictions around data residency or sovereignty. Note also that the relationship between the on-premise NAKS clusters and Cluster Manager is not necessarily 1:1. A single Cluster Manager can manage clusters in multiple sites.
+ Each Operator Nexus site is connected to an Azure region that hosts the in-Azure resources such as the Cluster Manager and the Operator Nexus Fabric Controller. Ideally, connect each Operator Nexus site to a different Azure region to maximize the resilience of the Operator Nexus deployment to any interruption of the Azure regions. Depending on the geography, there's likely to be a trade-off between maximizing the number of distinct Azure regions the deployment is taking a dependency on, and any other restrictions around data residency or sovereignty. Note also that the relationship between the on-premises instances and Cluster Manager isn't necessarily 1:1. A single Cluster Manager can manage instances in multiple sites.
Virtual machines, including Virtual Network Functions (VNFs) and Operator Nexus Azure Kubernetes Service (NAKS) nodes, as well as services hosted on-premises within Operator Nexus, are provided with highly available links to the network fabric. This connectivity is achieved through redundant physical connections, facilitated by Single Root Input/Output Virtualization (SR-IOV) interfaces employing Virtual Function Link Aggregation (VF-Lag) technology.
@@ -113,7 +113,7 @@ During a disconnection event, the on-premises infrastructure and workloads aren'
### Managing Platform Upgrade
- Operator Nexus platform upgrade is a fairly lengthy process. The customer initiates the upgrade, but it's then managed by the platform itself. From an availability perspective, the following points are key:
+ Operator Nexus upgrade is initiated by the customer, but it's then managed by the platform itself. From an availability perspective, the following points are key:
- The customer decides when to initiate the upgrade. They can opt, for example, to initiate the upgrade in a maintenance window.
@@ -147,7 +147,7 @@ If updates in production are a common requirement, the workloads need to provide
### Workload Upgrade
- Unlike a Public Cloud environment, as an Edge platform, Operator Nexus is more restricted in terms of the available capacity. This restriction needs to be taken into consideration when designing the process for upgrade of the workload instances, which needs to be managed by the customer, or potentially the provider of the workload, depending on the details of the arrangement between the Telco customer and the workload provider. Microsoft is responsible for upgrade of the Operator Nexus platform infrastructure.
+ As a Hybrid Cloud platform, Operator Nexus is more restricted in terms of available capacity than a Public Cloud environment. This restriction needs to be taken into account when designing the process for upgrading workload instances, which is managed by the customer, or potentially by the provider of the workload, depending on the details of the arrangement between the Telco customer and the workload provider.
There are various options available for workload upgrade. The most efficient in terms of capacity, and the least impactful, is to use standard Kubernetes processes supported by NAKS to apply a rolling upgrade of each workload cluster "in-place." This is the process adopted by the Operator Nexus undercloud itself. It's recommended that the customer have lab and staging environments available, so that the uplevel workload software can be validated against lab traffic in the customer's precise network configuration, and then at limited scale, before rolling out across the entire production estate.
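As a schematic of that capacity-constrained rolling approach (a pure simulation, not a Kubernetes or Operator Nexus API; `max_unavailable` mirrors the knob a rolling update strategy exposes):

```python
def rolling_upgrade(nodes: list[str], max_unavailable: int = 1) -> None:
    """Upgrade nodes in batches, keeping the rest of the cluster serving
    so only max_unavailable nodes' capacity is ever out at once."""
    for i in range(0, len(nodes), max_unavailable):
        batch = nodes[i:i + max_unavailable]
        print(f"cordon and drain: {batch}")    # take the batch out of service
        print(f"upgrade and rejoin: {batch}")  # restore capacity before the next batch

rolling_upgrade([f"worker-{n}" for n in range(4)])
```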