Commit be689fb (2 parents: c2d79fd + cc2e016)

Merge pull request #272747 from jaypipes/jaypipes/placement: add documentation on Nexus Kubernetes placement

File tree: 7 files changed (+314 −3 lines)

7 files changed

+314
-3
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 6 additions & 1 deletion
@@ -42,7 +42,12 @@
   - name: Access Control Lists
     href: concepts-access-control-lists.md
   - name: Nexus Kubernetes
-    href: concepts-nexus-kubernetes-cluster.md
+    expanded: false
+    items:
+      - name: Overview
+        href: concepts-nexus-kubernetes-cluster.md
+      - name: Resource Placement
+        href: concepts-nexus-kubernetes-placement.md
   - name: Observability
     expanded: false
     items:
articles/operator-nexus/concepts-nexus-kubernetes-placement.md

Lines changed: 304 additions & 0 deletions
@@ -0,0 +1,304 @@
---
title: "Resource Placement in Azure Operator Nexus Kubernetes"
description: An explanation of how Operator Nexus schedules Nexus Kubernetes resources.
author: jaypipes
ms.author: jaypipes
ms.service: azure-operator-nexus
ms.topic: conceptual
ms.date: 04/19/2024
ms.custom: template-concept
---
# Background

Operator Nexus instances are deployed at the customer premises. Each instance comprises one or more racks of bare metal servers.

When a user creates a Nexus Kubernetes Cluster (NAKS), they specify a count and a [stock keeping unit](./reference-nexus-kubernetes-cluster-sku.md) (SKU) for the virtual machines (VMs) that make up the Kubernetes Control Plane and one or more Agent Pools. Agent Pools are the sets of Worker Nodes on which a customer's containerized network functions run.

The Nexus platform is responsible for deciding the bare metal server on which each NAKS VM launches.
## How the Nexus Platform Schedules a NAKS VM

Nexus first identifies the set of potential bare metal servers that meet all of the resource requirements of the NAKS VM SKU. For example, if the user specified an `NC_G48_224_v1` VM SKU for their agent pool, Nexus collects the bare metal servers that have available capacity for 48 vCPU, 224Gi of RAM, and so on.

Nexus then examines the `AvailabilityZones` field for the Agent Pool or Control Plane being scheduled. If this field isn't empty, Nexus filters the list of potential bare metal servers to only those servers in the specified availability zones (racks). This behavior is a *hard scheduling constraint*. If there are no bare metal servers in the filtered list, Nexus *doesn't schedule* the NAKS VM and the cluster fails to provision.

Once Nexus identifies the list of potential bare metal servers on which to place the NAKS VM, it picks one of them after applying the following sorting rules:

1. Prefer bare metal servers in availability zones (racks) that don't have NAKS VMs from this NAKS Cluster. In other words, *spread the NAKS VMs for a NAKS Cluster across availability zones*.

1. Prefer bare metal servers within a single availability zone (rack) that don't have other NAKS VMs from the same NAKS Cluster. In other words, *spread the NAKS VMs for a NAKS Cluster across bare metal servers within an availability zone*.

1. If the NAKS VM SKU is either `NC_G48_224_v1` or `NC_P46_224_v1`, prefer bare metal servers that already house `NC_G48_224_v1` or `NC_P46_224_v1` NAKS VMs from other NAKS Clusters. In other words, *group the extra-large VMs from different NAKS Clusters on the same bare metal servers*. This rule "bin packs" the extra-large VMs to reduce fragmentation of the available compute resources.
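The filter-then-sort logic can be sketched in a few lines of Python. This is an illustrative model only, not the actual Nexus scheduler; the `Server` shape, capacity fields, and tie-breaking behavior are assumptions made for demonstration.

```python
from dataclasses import dataclass, field

# Extra-large SKUs that the third sorting rule bin-packs together.
XL_SKUS = {"NC_G48_224_v1", "NC_P46_224_v1"}

@dataclass
class Server:
    name: str
    rack: int                 # availability zone
    free_cpu: int
    free_ram_gib: int
    vms: list = field(default_factory=list)   # list of (cluster, sku) pairs

def place(servers, cluster, sku, cpu, ram_gib, zones=None):
    """Pick a server for one NAKS VM, or return None if nothing fits."""
    # Hard constraints: resource capacity, then the AvailabilityZones filter.
    fit = [s for s in servers if s.free_cpu >= cpu and s.free_ram_gib >= ram_gib]
    if zones:
        fit = [s for s in fit if s.rack in zones]
    if not fit:
        return None           # the cluster would fail to provision

    racks_used = {s.rack for s in servers for c, _ in s.vms if c == cluster}

    def sort_key(s):
        same_cluster_here = sum(1 for c, _ in s.vms if c == cluster)
        other_cluster_xl = any(c != cluster and k in XL_SKUS for c, k in s.vms)
        return (
            s.rack in racks_used,                       # rule 1: spread across racks
            same_cluster_here,                          # rule 2: spread within a rack
            not (sku in XL_SKUS and other_cluster_xl),  # rule 3: bin-pack XL VMs
        )

    best = min(fit, key=sort_key)
    best.free_cpu -= cpu
    best.free_ram_gib -= ram_gib
    best.vms.append((cluster, sku))
    return best
```

In this toy model, placing three extra-large VMs for one cluster into two racks of two servers lands one VM in each rack first, then reuses a rack on a different server, while an unmatchable `zones` filter returns `None` (the hard constraint).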
## Example Placement Scenarios

The following sections highlight behavior that Nexus users should expect when creating NAKS Clusters against an Operator Nexus environment.

> **Hint**: You can see which bare metal server your NAKS VMs were scheduled to by examining the `nodes.bareMetalMachineId` property of the NAKS KubernetesCluster resource or by viewing the "Host" column in the Azure portal's display of Kubernetes Cluster Nodes.
:::image type="content" source="media/nexus-kubernetes/show-baremetal-host.png" alt-text="A screenshot showing bare metal server for NAKS VMs.":::

The example Operator Nexus environment has these specifications:

* Eight racks of 16 bare metal servers
* Each bare metal server contains two [Non-Uniform Memory Access][numa] (NUMA) cells
* Each NUMA cell provides 48 CPU and 224Gi RAM

[numa]: https://en.wikipedia.org/wiki/Non-uniform_memory_access
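As a sanity check on the scenarios that follow, the aggregate capacity of this example environment is simple arithmetic over the bullets above:

```python
# Aggregate capacity of the example environment described above.
racks = 8
servers_per_rack = 16
numa_cells_per_server = 2
cpu_per_cell, ram_gib_per_cell = 48, 224

servers = racks * servers_per_rack                             # 128 bare metal servers
total_cpu = servers * numa_cells_per_server * cpu_per_cell     # 12288 CPU
total_ram_gib = servers * numa_cells_per_server * ram_gib_per_cell  # 57344 Gi
print(servers, total_cpu, total_ram_gib)
```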
### Empty Environment

Given an empty Operator Nexus environment with the given capacity, we create three differently sized Nexus Kubernetes Clusters.

The NAKS Clusters have these specifications, and we assume for the purposes of this exercise that the user creates the three Clusters in the following order:

Cluster A

* Control plane, `NC_G12_56_v1` SKU, three count
* Agent pool #1, `NC_P46_224_v1` SKU, 24 count
* Agent pool #2, `NC_G6_28_v1` SKU, six count

Cluster B

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 48 count
* Agent pool #2, `NC_P22_112_v1` SKU, 24 count

Cluster C

* Control plane, `NC_G12_56_v1` SKU, three count
* Agent pool #1, `NC_P46_224_v1` SKU, 12 count, `AvailabilityZones = [1,4]`

Here's a table summarizing what the user should see after launching Clusters A, B, and C on an empty Operator Nexus environment.
| Cluster | Pool          | SKU             | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| A       | Control Plane | `NC_G12_56_v1`  | 3           | 3                | 3              | 1                       | 1                     |
| A       | Agent Pool #1 | `NC_P46_224_v1` | 24          | 8                | 8              | 3                       | 3                     |
| A       | Agent Pool #2 | `NC_G6_28_v1`   | 6           | 6                | 6              | 1                       | 1                     |
| B       | Control Plane | `NC_G24_112_v1` | 5           | 5                | 5              | 1                       | 1                     |
| B       | Agent Pool #1 | `NC_P46_224_v1` | 48          | 8                | 8              | 6                       | 6                     |
| B       | Agent Pool #2 | `NC_P22_112_v1` | 24          | 8                | 8              | 3                       | 3                     |
| C       | Control Plane | `NC_G12_56_v1`  | 3           | 3                | 3              | 1                       | 1                     |
| C       | Agent Pool #1 | `NC_P46_224_v1` | 12          | 2                | 2              | 6                       | 6                     |

There are eight racks, so the VMs for each pool are spread over up to eight racks. Pools with more than eight VMs require multiple VMs per rack, spread across different bare metal servers.

Cluster C Agent Pool #1 has 12 VMs restricted to AvailabilityZones [1, 4], so it has 12 VMs on 12 bare metal servers, six in each of racks 1 and 4.

Extra-large VMs (the `NC_P46_224_v1` SKU) from different clusters are placed on the same bare metal servers (see rule #3 in [How the Nexus Platform Schedules a NAKS VM](#how-the-nexus-platform-schedules-a-naks-vm)).
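The "Expected" columns in the table follow from a simple round-robin spread: VMs go one per rack across the eligible racks (all eight, or only the pool's AvailabilityZones) until the pool's count is exhausted. A small helper (not a Nexus API, just a model of the documented behavior) reproduces them:

```python
import math

def expected_spread(vm_count, total_racks=8, zones=None):
    """Return (racks used, max VMs per rack) for a round-robin spread."""
    eligible = len(zones) if zones else total_racks
    racks_used = min(vm_count, eligible)
    vms_per_rack = math.ceil(vm_count / racks_used)
    return racks_used, vms_per_rack

# Rows from the table above:
assert expected_spread(3) == (3, 1)                  # Cluster A control plane
assert expected_spread(24) == (8, 3)                 # Cluster A agent pool #1
assert expected_spread(48) == (8, 6)                 # Cluster B agent pool #1
assert expected_spread(12, zones=[1, 4]) == (2, 6)   # Cluster C agent pool #1
```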
130+
Here's a visualization of a layout the user might see after deploying Clusters
131+
A, B, and C into an empty environment.
132+
133+
:::image type="content" source="media/nexus-kubernetes/after-first-deployment.png" alt-text="Diagram showing possible layout of VMs after first deployment.":::
134+
135+
### Half-full Environment

We now run through an example of launching another NAKS Cluster when the target environment is half-full, as it is after Clusters A, B, and C are deployed.

Cluster D has the following specifications:

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 24 count, `AvailabilityZones = [7,8]`
* Agent pool #2, `NC_P22_112_v1` SKU, 24 count

Here's a table summarizing what the user should see after launching Cluster D into the half-full Operator Nexus environment that exists after launching Clusters A, B, and C.

| Cluster | Pool          | SKU             | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| D       | Control Plane | `NC_G24_112_v1` | 5           | 5                | 5              | 1                       | 1                     |
| D       | Agent Pool #1 | `NC_P46_224_v1` | 24          | 2                | 2              | 12                      | 12                    |
| D       | Agent Pool #2 | `NC_P22_112_v1` | 24          | 8                | 8              | 3                       | 3                     |
Cluster D Agent Pool #1 has 24 VMs restricted to AvailabilityZones [7, 8], so it has 24 VMs on 24 bare metal servers, 12 in each of racks 7 and 8. Those VMs land on bare metal servers also housing extra-large VMs from other clusters, due to the sorting rule that groups extra-large VMs from different clusters onto the same bare metal servers.

If a Cluster D control plane VM lands on rack 7 or 8, it's likely that one Cluster D Agent Pool #1 VM lands on the same bare metal server as that Cluster D control plane VM. This behavior is due to Agent Pool #1 being "pinned" to racks 7 and 8. Capacity constraints in those racks cause the scheduler to collocate a control plane VM and an Agent Pool #1 VM from the same NAKS Cluster.

Cluster D's Agent Pool #2 has three VMs on different bare metal servers in each of the eight racks. Because Cluster D's Agent Pool #1 is pinned to racks 7 and 8, capacity constraints in those racks cause VMs from Cluster D's Agent Pool #1 and Agent Pool #2 to be collocated on the same bare metal servers there.
176+
Here's a visualization of a layout the user might see after deploying Cluster
177+
D into the target environment.
178+
179+
:::image type="content" source="media/nexus-kubernetes/after-second-deployment.png" alt-text="Diagram showing possible layout of VMs after second deployment.":::
180+
### Nearly full Environment

In our example target environment, four of the eight racks are close to capacity. Let's try to launch another NAKS Cluster.

Cluster E has the following specifications:

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 32 count

Here's a table summarizing what the user should see after launching Cluster E into the target environment.

| Cluster | Pool          | SKU             | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| E       | Control Plane | `NC_G24_112_v1` | 5           | 5                | 5              | 1                       | 1                     |
| E       | Agent Pool #1 | `NC_P46_224_v1` | 32          | 8                | 8              | **4**                   | **3, 4 or 5**         |

Cluster E's Agent Pool #1 spreads unevenly over all eight racks. Racks 7 and 8 have three NAKS VMs from Agent Pool #1 instead of the expected four NAKS VMs, because there's no more capacity for the extra-large SKU VMs in those racks after scheduling Clusters A through D. Because racks 7 and 8 don't have capacity for the fourth extra-large SKU VM in Agent Pool #1, five NAKS VMs land on each of the two least-utilized racks. In our example, those least-utilized racks were racks 3 and 6.
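Cluster E's uneven outcome can be modeled as the same round-robin spread plus per-rack capacity caps. The cap numbers below are assumptions chosen to match the narrative (racks 7 and 8 only fit three more extra-large VMs; racks 3 and 6 are the least utilized), not values read from the platform:

```python
def spread_with_caps(vm_count, caps):
    """Round-robin VMs onto the rack with the fewest VMs so far that still has room."""
    placed = {rack: 0 for rack in caps}
    for _ in range(vm_count):
        open_racks = [r for r in placed if placed[r] < caps[r]]
        target = min(open_racks, key=lambda r: placed[r])
        placed[target] += 1
    return placed

caps = {rack: 4 for rack in range(1, 9)}   # room for the expected 4 per rack...
caps[7] = caps[8] = 3                      # ...except racks 7 and 8 (nearly full)
caps[3] = caps[6] = 5                      # least-utilized racks absorb the overflow
layout = spread_with_caps(32, caps)
print(layout)   # racks 7 and 8 get 3; racks 3 and 6 get 5; the rest get 4
```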
Here's a visualization of a layout the user might see after deploying Cluster E into the target environment.

:::image type="content" source="media/nexus-kubernetes/after-third-deployment.png" alt-text="Diagram showing possible layout of VMs after third deployment.":::
## Placement during a Runtime Upgrade

As of April 2024 (Network Cloud 2403.1 release), runtime upgrades are performed using a rack-by-rack strategy. Bare metal servers in a rack are reimaged all at once. The upgrade process pauses until all the bare metal servers successfully restart and tell Nexus that they're ready to receive workloads.

> **Note**: It's possible to instruct Operator Nexus to reimage only a portion of the bare metal servers in a rack at once; however, the default is to reimage all bare metal servers in a rack in parallel.

When an individual bare metal server is reimaged, all workloads running on that bare metal server, including all NAKS VMs, lose power and connectivity. Workload containers running on NAKS VMs, in turn, lose power and connectivity. After one minute of not being able to reach those workload containers, the NAKS Cluster's Kubernetes Control Plane marks the corresponding Pods as unhealthy. If the Pods are members of a Deployment or StatefulSet, the NAKS Cluster's Kubernetes Control Plane attempts to launch replacement Pods to bring the observed replica count of the Deployment or StatefulSet back to the desired replica count.

New Pods only launch if there's available capacity for the Pod on the remaining healthy NAKS VMs. As of April 2024 (Network Cloud 2403.1 release), new NAKS VMs aren't created to replace NAKS VMs that were on the bare metal server being reimaged.

Once the bare metal server is successfully reimaged and able to accept new NAKS VMs, the NAKS VMs that were originally on that bare metal server relaunch on the newly reimaged bare metal server. Workload containers may then be scheduled to those NAKS VMs, potentially restoring the Deployments or StatefulSets that had Pods on NAKS VMs that were on the bare metal server.

> **Note**: This behavior may seem to the user as if the NAKS VMs didn't "move" from the bare metal server, when in fact a new instance of an identical NAKS VM was launched on the newly reimaged bare metal server, which retained the same bare metal server name as before reimaging.
## Best Practices

When working with Operator Nexus, keep the following best practices in mind.

* Avoid specifying `AvailabilityZones` for an Agent Pool.
* Launch larger NAKS Clusters before smaller ones.
* Reduce the Agent Pool's Count before reducing the VM SKU size.
### Avoid specifying AvailabilityZones for an Agent Pool

As the placement scenarios above show, specifying `AvailabilityZones` for an Agent Pool is the primary reason that NAKS VMs from the same NAKS Cluster end up on the same bare metal server. By specifying `AvailabilityZones`, you "pin" the Agent Pool to a subset of racks and therefore limit the number of potential bare metal servers in that set of racks available to other NAKS Clusters and to other Agent Pool VMs in the same NAKS Cluster.

Therefore, our first best practice is to avoid specifying `AvailabilityZones` for an Agent Pool. If you must pin an Agent Pool to a set of Availability Zones, make that set as large as possible to minimize the imbalance that can occur.

The one exception to this best practice is a scenario with only two or three VMs in an agent pool. You might consider setting `AvailabilityZones` for that agent pool to `[1,3,5,7]` or `[0,2,4,6]` to increase availability during runtime upgrades.
### Launch larger NAKS Clusters before smaller ones

As of April 2024 and the Network Cloud 2403.1 release, NAKS Clusters are scheduled in the order in which they're created. To pack your target environment most efficiently, we recommend you create larger NAKS Clusters before smaller ones. Likewise, we recommend you schedule larger Agent Pools before smaller ones.

This recommendation is important for Agent Pools using the extra-large `NC_G48_224_v1` or `NC_P46_224_v1` SKU. Scheduling the Agent Pools with the greatest count of these extra-large SKU VMs first creates a larger set of bare metal servers on which other extra-large SKU VMs from Agent Pools in other NAKS Clusters can collocate.
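A deliberately simplified model shows why the ordering matters. The capacities below are hypothetical and the placement heuristic is a toy, not the Nexus algorithm: an extra-large VM consumes nearly a whole 48-CPU NUMA cell, so small VMs that spread out first can strand capacity that large-first scheduling would have used:

```python
def schedule(vm_specs, cells=(48, 48, 48, 48)):
    """Place VMs on NUMA cells; 'spreading' VMs pick the emptiest cell that fits,
    others pack onto the fullest cell that fits. Returns how many VMs were placed."""
    free = list(cells)
    placed = 0
    for cpu_needed, spread in vm_specs:
        fits = [i for i, f in enumerate(free) if f >= cpu_needed]
        if not fits:
            continue            # this VM can't be placed
        pick = max(fits, key=lambda i: free[i]) if spread else min(fits, key=lambda i: free[i])
        free[pick] -= cpu_needed
        placed += 1
    return placed

XL = (46, False)     # extra-large VM: consumes nearly a whole cell
SMALL = (6, True)    # small VM: spreads to the emptiest cell

assert schedule([SMALL, SMALL, XL, XL, XL]) == 4   # small-first strands one XL VM
assert schedule([XL, XL, XL, SMALL, SMALL]) == 5   # large-first fits everything
```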
### Reduce the Agent Pool's Count before reducing the VM SKU size

If you run into capacity constraints when launching a NAKS Cluster or Agent Pool, reduce the Count of the Agent Pool before adjusting the VM SKU size. For example, if you attempt to create a NAKS Cluster with an Agent Pool that has a VM SKU size of `NC_P46_224_v1` and a Count of 24, and provisioning fails due to insufficient resources, you might be tempted to use a VM SKU size of `NC_P36_168_v1` and keep the Count of 24. However, because workload VMs must be aligned to a single NUMA cell on a bare metal server, it's likely that the same request results in similar insufficient-resource failures. Instead of reducing the VM SKU size, consider reducing the Count of the Agent Pool to 20. There's a better chance your request fits within the target environment's resource capacity, and your overall deployment has more CPU cores than if you downsized the VM SKU.
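The trade-off is plain arithmetic, assuming the SKU names encode vCPU counts (46 for `NC_P46_224_v1`, 36 for `NC_P36_168_v1`) and that each workload VM occupies a single 48-CPU NUMA cell regardless of its size:

```python
vcpu = {"NC_P46_224_v1": 46, "NC_P36_168_v1": 36}   # assumed from the SKU names

original  = 24 * vcpu["NC_P46_224_v1"]   # 1104 vCPU, needs 24 free NUMA cells
downsized = 24 * vcpu["NC_P36_168_v1"]   # 864 vCPU, still needs 24 free cells
reduced   = 20 * vcpu["NC_P46_224_v1"]   # 920 vCPU, needs only 20 free cells

# Reducing Count eases the capacity constraint (20 cells vs 24) AND keeps
# more total vCPU than downsizing the SKU does.
assert reduced > downsized
```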

articles/operator-nexus/index.yml

Lines changed: 4 additions & 2 deletions
@@ -6,9 +6,9 @@ summary: Operator Nexus runs network-intensive workloads under stringent securit
 metadata:
   title: Operator Nexus
   description: Azure Operator Nexus, the next-generation hybrid cloud platform for operators,
-    proven to run network-intensive workloads and mission-critical applications under the stringent security,
+    is proven to run network-intensive workloads and mission-critical applications under the stringent security,
     resiliency, observability, manageability, and performance required by telecommunication' operators.
-    Learn how to deploy, manage and use Operator Nexus with these concepts and quickstarts.
+    Learn how to deploy, manage, and use Operator Nexus with these concepts and quickstarts.
   ms.service: azure-operator-nexus
   ms.topic: landing-page
   author: jashobhit
@@ -36,6 +36,8 @@ landingContent:
       url: concepts-storage.md
     - text: Nexus Kubernetes overview
       url: concepts-nexus-kubernetes-cluster.md
+    - text: Nexus Kubernetes resource placement
+      url: concepts-nexus-kubernetes-placement.md
     - text: Observability
       url: concepts-observability.md
     - text: Security
