Commit f0ddf49

Author: Jay Pipes

add documentation on Nexus Kubernetes placement

Adds public documentation on how Nexus Kubernetes Cluster nodes are scheduled in an Operator Nexus instance.

Signed-off-by: Jay Pipes <[email protected]>
1 parent 76ea992 commit f0ddf49

File tree: 7 files changed (+312 −1 lines)


`articles/operator-nexus/TOC.yml` (6 additions, 1 deletion):

```diff
@@ -38,7 +38,12 @@
 - name: Access Control Lists
   href: concepts-access-control-lists.md
 - name: Nexus Kubernetes
-  href: concepts-nexus-kubernetes-cluster.md
+  expanded: false
+  items:
+  - name: Overview
+    href: concepts-nexus-kubernetes-cluster.md
+  - name: Resource Placement
+    href: concepts-nexus-kubernetes-placement.md
 - name: Observability
   expanded: false
   items:
```
`articles/operator-nexus/concepts-nexus-kubernetes-placement.md` (new file, 304 additions):
---
title: "Resource Placement in Azure Operator Nexus Kubernetes"
description: An explanation of how Operator Nexus schedules Nexus Kubernetes resources.
author: jaypipes
ms.author: jaypipes
ms.service: azure-operator-nexus
ms.topic: conceptual
ms.date: 04/19/2024
ms.custom: template-concept
---

# Background

Operator Nexus instances are deployed at the customer premises. Each instance comprises one or more racks of bare metal servers.

When a user creates a Nexus Kubernetes Cluster (NAKS), they specify a count and a [stock keeping unit](./reference-nexus-kubernetes-cluster-sku.md) (SKU) for the virtual machines (VMs) that make up the Kubernetes Control Plane and one or more Agent Pools. Agent Pools are the set of Worker Nodes on which a customer's containerized network functions run.

The Nexus platform is responsible for deciding the bare metal server on which each NAKS VM launches.

## How the Nexus Platform Schedules a NAKS VM
27+
28+
Nexus first identifies the set of potential bare metal servers that meet all of
29+
the resource requirements of the NAKS VM SKU. For example, if the user
30+
specified an `NC_G48_224_v1` VM SKU for their agent pool, Nexus collects the
31+
bare metal servers that have available capacity for 48 vCPU, 224Gi of RAM, etc.
32+
33+
Nexus then examines the `AvailabilityZones` field for the Agent Pool or Control
34+
Plane being scheduled. If this field isn't empty, Nexus filters the list of
35+
potential bare metal servers to only those servers in the specified
36+
availability zones (racks). This behavior is a *hard scheduling constraint*. If
37+
there's no bare metal servers in the filtered list, Nexus *doesn't schedule*
38+
the NAKS VM and the cluster fails to provision.
Once Nexus identifies a list of potential bare metal servers on which to place the NAKS VM, Nexus then picks one of the bare metal servers after applying the following sorting rules:

1. Prefer bare metal servers in availability zones (racks) that don't have NAKS VMs from this NAKS Cluster. In other words, *spread the NAKS VMs for a NAKS Cluster across availability zones*.

1. Prefer bare metal servers within a single availability zone (rack) that don't have other NAKS VMs from the same NAKS Cluster. In other words, *spread the NAKS VMs for a NAKS Cluster across bare metal servers within an availability zone*.

1. If the NAKS VM SKU is either `NC_G48_224_v1` or `NC_P46_224_v1`, prefer bare metal servers that already house `NC_G48_224_v1` or `NC_P46_224_v1` NAKS VMs from other NAKS Clusters. In other words, *group the extra-large VMs from different NAKS Clusters on the same bare metal servers*. This rule "bin packs" the extra-large VMs in order to reduce fragmentation of the available compute resources.
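The filter-and-sort flow described above can be sketched as a small model. This is an illustrative approximation of the placement rules, not the actual Nexus scheduler; the `Server` type, its field names, and the capacity figures are invented for the example.

```python
from dataclasses import dataclass, field

# SKUs subject to the "bin pack extra-large VMs" rule.
EXTRA_LARGE_SKUS = {"NC_G48_224_v1", "NC_P46_224_v1"}

@dataclass
class Server:
    name: str
    zone: int            # rack / availability zone (hypothetical field names)
    free_vcpu: int
    free_ram_gi: int
    vms: list = field(default_factory=list)  # (cluster, sku) tuples

def pick_server(servers, cluster, sku, vcpu, ram_gi, zones=None):
    """Pick a bare metal server for one NAKS VM: filter, then sort."""
    # Hard constraints: resource fit, then the optional AvailabilityZones filter.
    candidates = [s for s in servers
                  if s.free_vcpu >= vcpu and s.free_ram_gi >= ram_gi]
    if zones:
        candidates = [s for s in candidates if s.zone in zones]
    if not candidates:
        raise RuntimeError("no eligible servers: the cluster fails to provision")

    zones_used = {s.zone for s in servers
                  if any(c == cluster for c, _ in s.vms)}

    def rank(s):
        # Tuples of booleans; False sorts first, so "preferred" wins with min().
        return (
            s.zone in zones_used,                  # rule 1: spread across zones
            any(c == cluster for c, _ in s.vms),   # rule 2: spread within a zone
            not (sku in EXTRA_LARGE_SKUS and       # rule 3: bin-pack extra-large
                 any(k in EXTRA_LARGE_SKUS for c, k in s.vms if c != cluster)),
        )

    best = min(candidates, key=rank)
    best.free_vcpu -= vcpu
    best.free_ram_gi -= ram_gi
    best.vms.append((cluster, sku))
    return best
```

With two racks modeled this way, two VMs from the same cluster land in different zones, and an extra-large VM from a second cluster prefers the server already hosting another cluster's extra-large VM.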
## Example Placement Scenarios

The following sections highlight behavior that Nexus users should expect when creating NAKS Clusters against an Operator Nexus environment.

> [!TIP]
> You can see which bare metal server your NAKS VMs were scheduled to by examining the `nodes.bareMetalMachineId` property of the NAKS KubernetesCluster resource or by viewing the "Host" column in the Azure portal's display of Kubernetes Cluster Nodes.

:::image type="content" source="media/nexus-kubernetes/show-baremetal-host.png" alt-text="A screenshot showing bare metal server for NAKS VMs.":::

The example Operator Nexus environment has these specifications:

* Eight racks of 16 bare metal servers
* Each bare metal server contains two [Non-Uniform Memory Access][numa] (NUMA) cells
* Each NUMA cell provides 48 CPU and 224Gi RAM

[numa]: https://en.wikipedia.org/wiki/Non-uniform_memory_access
### Empty Environment

Given an empty Operator Nexus environment with the given capacity, we create three differently sized Nexus Kubernetes Clusters.

The NAKS Clusters have these specifications, and we assume for the purposes of this exercise that the user creates the three Clusters in the following order:

Cluster A

* Control plane, `NC_G12_56_v1` SKU, three count
* Agent pool #1, `NC_P46_224_v1` SKU, 24 count
* Agent pool #2, `NC_G6_28_v1` SKU, six count

Cluster B

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 48 count
* Agent pool #2, `NC_P22_112_v1` SKU, 24 count

Cluster C

* Control plane, `NC_G12_56_v1` SKU, three count
* Agent pool #1, `NC_P46_224_v1` SKU, 12 count, `AvailabilityZones = [1,4]`
Here's a table summarizing what the user should see after launching Clusters A, B, and C on an empty Operator Nexus environment.

| Cluster | Pool          | SKU             | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| A       | Control Plane | `NC_G12_56_v1`  | 3           | 3                | 3              | 1                       | 1                     |
| A       | Agent Pool #1 | `NC_P46_224_v1` | 24          | 8                | 8              | 3                       | 3                     |
| A       | Agent Pool #2 | `NC_G6_28_v1`   | 6           | 6                | 6              | 1                       | 1                     |
| B       | Control Plane | `NC_G24_112_v1` | 5           | 5                | 5              | 1                       | 1                     |
| B       | Agent Pool #1 | `NC_P46_224_v1` | 48          | 8                | 8              | 6                       | 6                     |
| B       | Agent Pool #2 | `NC_P22_112_v1` | 24          | 8                | 8              | 3                       | 3                     |
| C       | Control Plane | `NC_G12_56_v1`  | 3           | 3                | 3              | 1                       | 1                     |
| C       | Agent Pool #1 | `NC_P46_224_v1` | 12          | 2                | 2              | 6                       | 6                     |

There are eight racks, so the VMs for each pool are spread over up to eight racks. Pools with more than eight VMs require multiple VMs per rack, spread across different bare metal servers.

Cluster C Agent Pool #1 has 12 VMs restricted to `AvailabilityZones` [1, 4], so it has 12 VMs on 12 bare metal servers, six in each of racks 1 and 4.

Extra-large VMs (the `NC_P46_224_v1` SKU) from different clusters are placed on the same bare metal servers (see rule #3 in [How the Nexus Platform Schedules a NAKS VM](#how-the-nexus-platform-schedules-a-naks-vm)).
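The Expected columns in the table follow from simple round-robin arithmetic: a pool's VMs spread evenly across its eligible racks. A small helper illustrates the calculation (illustrative only; the platform doesn't expose such a function):

```python
def vms_per_rack(total_vms: int, eligible_racks: int) -> list:
    """Spread a pool's VMs round-robin across its eligible racks and
    return the per-rack VM counts (ignoring capacity constraints)."""
    racks_used = min(total_vms, eligible_racks)
    base, extra = divmod(total_vms, racks_used)
    # 'extra' racks receive one more VM than the others
    return [base + 1] * extra + [base] * (racks_used - extra)

# Cluster A Agent Pool #1: 24 VMs over all 8 racks -> 3 per rack
print(vms_per_rack(24, 8))  # [3, 3, 3, 3, 3, 3, 3, 3]
# Cluster C Agent Pool #1: 12 VMs pinned to racks 1 and 4 -> 6 per rack
print(vms_per_rack(12, 2))  # [6, 6]
```

A pool with fewer VMs than racks, such as Cluster A's six-VM Agent Pool #2, simply occupies one server in each of six racks.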
Here's a visualization of a layout the user might see after deploying Clusters A, B, and C into an empty environment.

:::image type="content" source="media/nexus-kubernetes/after-first-deployment.png" alt-text="Diagram showing possible layout of VMs after first deployment.":::
### Half-full Environment

We now run through an example of launching another NAKS Cluster when the target environment is half-full, as it is after Clusters A, B, and C are deployed.

Cluster D has the following specifications:

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 24 count, `AvailabilityZones = [7,8]`
* Agent pool #2, `NC_P22_112_v1` SKU, 24 count

Here's a table summarizing what the user should see after launching Cluster D into the half-full Operator Nexus environment that exists after launching Clusters A, B, and C.

| Cluster | Pool          | SKU             | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| D       | Control Plane | `NC_G24_112_v1` | 5           | 5                | 5              | 1                       | 1                     |
| D       | Agent Pool #1 | `NC_P46_224_v1` | 24          | 2                | 2              | 12                      | 12                    |
| D       | Agent Pool #2 | `NC_P22_112_v1` | 24          | 8                | 8              | 3                       | 3                     |
Cluster D Agent Pool #1 has 24 VMs restricted to `AvailabilityZones` [7, 8], so it has 24 VMs on 24 bare metal servers, 12 in each of racks 7 and 8. Those VMs land on bare metal servers also housing extra-large VMs from other clusters due to the sorting rule that groups extra-large VMs from different clusters onto the same bare metal servers.

If a Cluster D control plane VM lands on rack 7 or 8, it's likely that one Cluster D Agent Pool #1 VM lands on the same bare metal server as that Cluster D control plane VM. This behavior is due to Agent Pool #1 being "pinned" to racks 7 and 8. Capacity constraints in those racks cause the scheduler to collocate a control plane VM and an Agent Pool #1 VM from the same NAKS Cluster.

Cluster D's Agent Pool #2 has three VMs on different bare metal servers on each of the eight racks. Capacity constraints resulted from Cluster D's Agent Pool #1 being pinned to racks 7 and 8. Therefore, VMs from Cluster D's Agent Pool #1 and Agent Pool #2 are collocated on the same bare metal servers in racks 7 and 8.

Here's a visualization of a layout the user might see after deploying Cluster D into the target environment.

:::image type="content" source="media/nexus-kubernetes/after-second-deployment.png" alt-text="Diagram showing possible layout of VMs after second deployment.":::
### Nearly full Environment

In our example target environment, four of the eight racks are close to capacity. Let's try to launch another NAKS Cluster.

Cluster E has the following specifications:

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 32 count

Here's a table summarizing what the user should see after launching Cluster E into the target environment.

| Cluster | Pool          | SKU             | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| E       | Control Plane | `NC_G24_112_v1` | 5           | 5                | 5              | 1                       | 1                     |
| E       | Agent Pool #1 | `NC_P46_224_v1` | 32          | 8                | 8              | **4**                   | **3, 4, or 5**        |

Cluster E's Agent Pool #1 spreads unevenly over all eight racks. Racks 7 and 8 have three NAKS VMs from Agent Pool #1 instead of the expected four NAKS VMs because there's no more capacity for the extra-large SKU VMs in those racks after scheduling Clusters A through D. Because racks 7 and 8 can't fit a fourth extra-large SKU VM from Agent Pool #1, the two leftover VMs land on the two least-utilized racks, bringing each of those racks to five NAKS VMs. In our example, those least-utilized racks were racks 3 and 6.
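The uneven 3/4/5 spread can be reproduced with a greedy, capacity-aware variant of the spread calculation. The remaining per-rack slot counts below are hypothetical numbers chosen to match the example, not values reported by the platform:

```python
def spread_with_capacity(total_vms, rack_slots):
    """Greedily place VMs one at a time on the rack with the fewest VMs
    from this pool so far; rack_slots maps rack -> remaining slots for
    this SKU. Returns a dict of rack -> placed VM count."""
    placed = {rack: 0 for rack in rack_slots}
    for _ in range(total_vms):
        eligible = [r for r in placed if placed[r] < rack_slots[r]]
        if not eligible:
            raise RuntimeError("insufficient capacity to place all VMs")
        target = min(eligible, key=lambda r: placed[r])
        placed[target] += 1
    return placed

# Hypothetical remaining extra-large slots after Clusters A-D:
# racks 7 and 8 have room for only three more VMs, the rest for five.
slots = {rack: (3 if rack in (7, 8) else 5) for rack in range(1, 9)}
layout = spread_with_capacity(32, slots)
```

Running this places three VMs in each of racks 7 and 8, four in most other racks, and five in the two racks that absorb the overflow, matching the bolded actual counts in the table.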
Here's a visualization of a layout the user might see after deploying Cluster E into the target environment.

:::image type="content" source="media/nexus-kubernetes/after-third-deployment.png" alt-text="Diagram showing possible layout of VMs after third deployment.":::
## Placement during a Runtime Upgrade

As of April 2024 (Network Cloud 2403.1 release), runtime upgrades are performed using a rack-by-rack strategy. Bare metal servers in rack 1 are reimaged all at once. The upgrade process pauses until all the bare metal servers successfully restart and tell Nexus that they're ready to receive workloads.

> [!NOTE]
> It's possible to instruct Operator Nexus to only reimage a portion of the bare metal servers in a rack at once; however, the default is to reimage all bare metal servers in a rack in parallel.

When an individual bare metal server is reimaged, all workloads running on that bare metal server, including all NAKS VMs, lose power and connectivity. Workload containers running on NAKS VMs, in turn, lose power and connectivity. After one minute of not being able to reach those workload containers, the NAKS Cluster's Kubernetes Control Plane marks the corresponding Pods as unhealthy. If the Pods are members of a Deployment or StatefulSet, the NAKS Cluster's Kubernetes Control Plane attempts to launch replacement Pods to bring the observed replica count of the Deployment or StatefulSet back to the desired replica count.
New Pods only launch if there's available capacity for the Pod in the remaining healthy NAKS VMs. As of April 2024 (Network Cloud 2403.1 release), new NAKS VMs aren't created to replace NAKS VMs that were on the bare metal server being reimaged.

Once the bare metal server is successfully reimaged and able to accept new NAKS VMs, the NAKS VMs that were originally on that bare metal server relaunch on the newly reimaged bare metal server. Workload containers may then be scheduled to those NAKS VMs, potentially restoring the Deployments or StatefulSets that had Pods on NAKS VMs on that bare metal server.

> [!NOTE]
> This behavior may seem to the user as if the NAKS VMs didn't "move" from the bare metal server, when in fact a new instance of an identical NAKS VM was launched on the newly reimaged bare metal server that retained the same bare metal server name as before reimaging.
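The capacity dependence described above can be modeled in a short sketch. It's an illustrative model of replica availability while one rack's servers are reimaged, with made-up numbers, not an API of the platform:

```python
def replicas_during_reimage(pods_per_rack, reimaged_rack, spare_pod_slots):
    """Model how many replicas of a Deployment stay running while one
    rack is reimaged. pods_per_rack maps rack -> this Deployment's Pods
    on NAKS VMs in that rack; spare_pod_slots is free Pod capacity on
    healthy NAKS VMs elsewhere (replacement Pods need free capacity)."""
    lost = pods_per_rack.get(reimaged_rack, 0)
    surviving = sum(n for rack, n in pods_per_rack.items()
                    if rack != reimaged_rack)
    # No new NAKS VMs are created during the upgrade, so replacements
    # are capped by the spare capacity on surviving NAKS VMs.
    replaced = min(lost, spare_pod_slots)
    return surviving + replaced

# Eight replicas spread one per rack; rack 1 is reimaged.
print(replicas_during_reimage({r: 1 for r in range(1, 9)}, 1, spare_pod_slots=0))  # 7
print(replicas_during_reimage({r: 1 for r in range(1, 9)}, 1, spare_pod_slots=4))  # 8
```

The second call shows why leaving headroom on worker nodes matters: with spare capacity, the Deployment returns to its desired replica count even while a rack is down.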
## Best Practices

When working with Operator Nexus, keep the following best practices in mind.

* Avoid specifying `AvailabilityZones` for an Agent Pool.
* Launch larger NAKS Clusters before smaller ones.
* Reduce the Agent Pool's Count before reducing the VM SKU size.

### Avoid specifying AvailabilityZones for an Agent Pool

As the preceding placement scenarios show, specifying `AvailabilityZones` for an Agent Pool is the primary reason that NAKS VMs from the same NAKS Cluster end up on the same bare metal server. Specifying `AvailabilityZones` "pins" the Agent Pool to a subset of racks, which limits the number of potential bare metal servers in those racks that remain available to other NAKS Clusters and to other Agent Pool VMs in the same NAKS Cluster.

Therefore, our first best practice is to avoid specifying `AvailabilityZones` for an Agent Pool. If you must pin an Agent Pool to a set of Availability Zones, make that set as large as possible to minimize the imbalance that can occur.

The one exception to this best practice is a scenario with only two or three VMs in an agent pool. You might consider setting `AvailabilityZones` for that agent pool to `[1,3,5,7]` or `[2,4,6,8]` to increase availability during runtime upgrades.
### Launch larger NAKS Clusters before smaller ones

As of April 2024 and the Network Cloud 2403.1 release, NAKS Clusters are scheduled in the order in which they're created. To most efficiently pack your target environment, we recommend you create larger NAKS Clusters before smaller ones. Likewise, we recommend you schedule larger Agent Pools before smaller ones.

This recommendation is important for Agent Pools using the extra-large `NC_G48_224_v1` or `NC_P46_224_v1` SKU. Scheduling the Agent Pools with the greatest count of these extra-large SKU VMs first creates a larger set of bare metal servers upon which other extra-large SKU VMs from Agent Pools in other NAKS Clusters can collocate.
### Reduce the Agent Pool's Count before reducing the VM SKU size

If you run into capacity constraints when launching a NAKS Cluster or Agent Pool, reduce the Count of the Agent Pool before adjusting the VM SKU size. For example, if you attempt to create a NAKS Cluster with an Agent Pool with a VM SKU size of `NC_P46_224_v1` and a Count of 24 and get back a failure to provision the NAKS Cluster due to insufficient resources, you may be tempted to use a VM SKU size of `NC_P36_168_v1` and continue with a Count of 24. However, because workload VMs must be aligned to a single NUMA cell on a bare metal server, it's likely that the same request results in similar insufficient-resource failures. Instead of reducing the VM SKU size, consider reducing the Count of the Agent Pool to 20. There's a better chance your request fits within the target environment's resource capacity, and your overall deployment has more CPU cores than if you downsized the VM SKU.
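The arithmetic behind this recommendation is straightforward, taking the vCPU counts from the SKU names (46 and 36 vCPU; treat these as assumptions read off the SKU naming convention):

```python
# Option 1: keep the NC_P46_224_v1 SKU, reduce Count from 24 to 20.
keep_sku_vcpu = 46 * 20

# Option 2: downsize to NC_P36_168_v1, keep Count at 24.
downsize_vcpu = 36 * 24

print(keep_sku_vcpu, downsize_vcpu)  # 920 864
```

Reducing the Count yields more total vCPU (920 versus 864), and each `NC_P46_224_v1` VM still fits within a single 48-CPU NUMA cell, so the smaller request is more likely to schedule without sacrificing capacity.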

`articles/operator-nexus/index.yml` (2 additions):

```diff
@@ -36,6 +36,8 @@ landingContent:
       url: concepts-storage.md
     - text: Nexus Kubernetes overview
       url: concepts-nexus-kubernetes-cluster.md
+    - text: Nexus Kubernetes resource placement
+      url: concepts-nexus-kubernetes-placement.md
     - text: Observability
       url: concepts-observability.md
     - text: Security
```
