---
title: "Resource Placement in Azure Operator Nexus Kubernetes"
description: An explanation of how Operator Nexus schedules Nexus Kubernetes resources.
author: jaypipes
ms.author: jaypipes
ms.service: azure-operator-nexus
ms.topic: conceptual
ms.date: 04/19/2024
ms.custom: template-concept
---

# Background

Operator Nexus instances are deployed at the customer premises. Each instance
comprises one or more racks of bare metal servers.

When a user creates a Nexus Kubernetes Cluster (NAKS), they specify a count and
a [stock keeping unit](./reference-nexus-kubernetes-cluster-sku.md) (SKU) for
the virtual machines (VMs) that make up the Kubernetes Control Plane and one or
more Agent Pools. Agent Pools are the set of Worker Nodes on which a customer's
containerized network functions run.

The Nexus platform is responsible for deciding the bare metal server on which
each NAKS VM launches.

## How the Nexus Platform Schedules a NAKS VM

Nexus first identifies the set of potential bare metal servers that meet all of
the resource requirements of the NAKS VM SKU. For example, if the user
specified an `NC_G48_224_v1` VM SKU for their agent pool, Nexus collects the
bare metal servers that have available capacity for 48 vCPU, 224Gi of RAM, and
so on.

Nexus then examines the `AvailabilityZones` field for the Agent Pool or Control
Plane being scheduled. If this field isn't empty, Nexus filters the list of
potential bare metal servers to only those servers in the specified
availability zones (racks). This behavior is a *hard scheduling constraint*. If
there are no bare metal servers in the filtered list, Nexus *doesn't schedule*
the NAKS VM and the cluster fails to provision.

Once Nexus identifies a list of potential bare metal servers on which to place
the NAKS VM, Nexus then picks one of the bare metal servers after applying the
following sorting rules:
| 43 | + |
| 44 | +1. Prefer bare metal servers in availability zones (racks) that don't have NAKS |
| 45 | + VMs from this NAKS Cluster. In other words, *spread the NAKS VMs for a NAKS |
| 46 | + Cluster across availability zones*. |
| 47 | + |
| 48 | +1. Prefer bare metal servers within a single availability zone (rack) that |
| 49 | + don't have other NAKS VMs from the same NAKS Cluster. In other words, |
| 50 | + *spread the NAKS VMs for a NAKS Cluster across bare metal servers within an |
| 51 | + availability zone*. |
| 52 | + |
| 53 | +1. If the NAKS VM SKU is either `NC_G48_224_v1` or `NC_P46_224_v1`, prefer |
| 54 | + bare metal servers that already house `NC_G48_224_v1` or `NC_P46_224_v1` |
| 55 | + NAKS VMs from other NAKS Clusters. In other words, *group the extra-large |
| 56 | + VMs from different NAKS Clusters on the same bare metal servers*. This rule |
| 57 | + "bin packs" the extra-large VMs in order to reduce fragmentation of the |
| 58 | + available compute resources. |
| 59 | + |
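The filtering and sorting steps above can be sketched in a few lines of Python. This is an illustrative model only, not the platform's implementation; the `Server` fields and the scoring tuple are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    rack: int                     # availability zone (rack) number
    free_cpu: int                 # vCPUs still available
    free_ram_gi: int              # RAM (Gi) still available
    vms: list = field(default_factory=list)  # (cluster, sku) pairs already placed

EXTRA_LARGE = {"NC_G48_224_v1", "NC_P46_224_v1"}

def place_vm(servers, cluster, sku, cpu, ram_gi, zones=None):
    # Hard constraints: resource fit, then the optional AvailabilityZones filter.
    candidates = [s for s in servers
                  if s.free_cpu >= cpu and s.free_ram_gi >= ram_gi
                  and (not zones or s.rack in zones)]
    if not candidates:
        return None  # no eligible server: the cluster fails to provision

    def rank(s):
        rack_has_cluster = any(c == cluster
                               for srv in servers if srv.rack == s.rack
                               for c, _ in srv.vms)
        server_has_cluster = any(c == cluster for c, _ in s.vms)
        packs_extra_large = (sku in EXTRA_LARGE and
                             any(k in EXTRA_LARGE for c, k in s.vms if c != cluster))
        # Rule 1: spread across racks; rule 2: spread within a rack;
        # rule 3: bin-pack extra-large SKUs from different clusters.
        return (rack_has_cluster, server_has_cluster, not packs_extra_large)

    best = min(candidates, key=rank)
    best.free_cpu -= cpu
    best.free_ram_gi -= ram_gi
    best.vms.append((cluster, sku))
    return best
```

For example, placing two control plane VMs for the same cluster into a two-rack environment lands them in different racks (rule 1), and restricting the zones to a rack that doesn't exist yields no placement at all (the hard constraint).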
## Example Placement Scenarios

The following sections highlight behavior that Nexus users should expect
when creating NAKS Clusters against an Operator Nexus environment.

> **Hint**: You can see which bare metal server your NAKS VMs were scheduled to
> by examining the `nodes.bareMetalMachineId` property of the NAKS
> KubernetesCluster resource or viewing the "Host" column in the Azure portal's
> display of Kubernetes Cluster Nodes.

:::image type="content" source="media/nexus-kubernetes/show-baremetal-host.png" alt-text="A screenshot showing bare metal server for NAKS VMs.":::

The example Operator Nexus environment has these specifications:

* Eight racks of 16 bare metal servers
* Each bare metal server contains two [Non-Uniform Memory Access][numa] (NUMA) cells
* Each NUMA cell provides 48 CPU and 224Gi RAM

[numa]: https://en.wikipedia.org/wiki/Non-uniform_memory_access

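Those specifications imply the aggregate capacity used in the scenarios below. A quick arithmetic sanity check (illustrative only; in practice some resources are reserved for the platform itself):

```python
racks, servers_per_rack = 8, 16
numa_per_server = 2
cpu_per_numa, ram_gi_per_numa = 48, 224

total_servers = racks * servers_per_rack
total_cpu = total_servers * numa_per_server * cpu_per_numa
total_ram_gi = total_servers * numa_per_server * ram_gi_per_numa

# Each 48-CPU NUMA cell can hold at most one extra-large (46-48 vCPU) NAKS VM,
# so each server can host at most two of them.
print(total_servers, total_cpu, total_ram_gi)  # 128 12288 57344
```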
### Empty Environment

Given an empty Operator Nexus environment with the given capacity, we create
three differently sized Nexus Kubernetes Clusters.

The NAKS Clusters have these specifications, and we assume for the purposes of
this exercise that the user creates the three Clusters in the following order:

Cluster A

* Control plane, `NC_G12_56_v1` SKU, three count
* Agent pool #1, `NC_P46_224_v1` SKU, 24 count
* Agent pool #2, `NC_G6_28_v1` SKU, six count

Cluster B

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 48 count
* Agent pool #2, `NC_P22_112_v1` SKU, 24 count

Cluster C

* Control plane, `NC_G12_56_v1` SKU, three count
* Agent pool #1, `NC_P46_224_v1` SKU, 12 count, `AvailabilityZones = [1,4]`

Here's a table summarizing what the user should see after launching Clusters
A, B, and C on an empty Operator Nexus environment.

| Cluster | Pool | SKU | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ---------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| A | Control Plane | `NC_G12_56_v1` | 3 | 3 | 3 | 1 | 1 |
| A | Agent Pool #1 | `NC_P46_224_v1` | 24 | 8 | 8 | 3 | 3 |
| A | Agent Pool #2 | `NC_G6_28_v1` | 6 | 6 | 6 | 1 | 1 |
| B | Control Plane | `NC_G24_112_v1` | 5 | 5 | 5 | 1 | 1 |
| B | Agent Pool #1 | `NC_P46_224_v1` | 48 | 8 | 8 | 6 | 6 |
| B | Agent Pool #2 | `NC_P22_112_v1` | 24 | 8 | 8 | 3 | 3 |
| C | Control Plane | `NC_G12_56_v1` | 3 | 3 | 3 | 1 | 1 |
| C | Agent Pool #1 | `NC_P46_224_v1` | 12 | 2 | 2 | 6 | 6 |

There are eight racks, so the VMs for each pool are spread over up to eight
racks. Pools with more than eight VMs require multiple VMs per rack, spread
across different bare metal servers.

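The "Expected # Racks" and "Expected # VMs per Rack" columns follow from simple round-robin arithmetic over the allowed racks. A minimal helper, assuming an even spread and sufficient capacity:

```python
import math

def expected_spread(vm_count, allowed_racks):
    """Spread vm_count VMs as evenly as possible across up to allowed_racks racks."""
    racks_used = min(vm_count, allowed_racks)
    vms_per_rack = math.ceil(vm_count / racks_used)
    return racks_used, vms_per_rack

print(expected_spread(24, 8))  # Cluster A Agent Pool #1 -> (8, 3)
print(expected_spread(3, 8))   # Cluster A Control Plane -> (3, 1)
print(expected_spread(12, 2))  # Cluster C Agent Pool #1, zones [1, 4] -> (2, 6)
```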
Cluster C Agent Pool #1 has 12 VMs restricted to AvailabilityZones [1, 4], so
it has 12 VMs on 12 bare metal servers, six in each of racks 1 and 4.

Extra-large VMs (the `NC_P46_224_v1` SKU) from different clusters are placed
on the same bare metal servers (see rule #3 in
[How the Nexus Platform Schedules a NAKS VM](#how-the-nexus-platform-schedules-a-naks-vm)).

Here's a visualization of a layout the user might see after deploying Clusters
A, B, and C into an empty environment.

:::image type="content" source="media/nexus-kubernetes/after-first-deployment.png" alt-text="Diagram showing possible layout of VMs after first deployment.":::

### Half-full Environment

We now run through an example of launching another NAKS Cluster when the
target environment is half-full, as it is after Clusters A, B, and C are
deployed.

Cluster D has the following specifications:

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 24 count, `AvailabilityZones = [7,8]`
* Agent pool #2, `NC_P22_112_v1` SKU, 24 count

Here's a table summarizing what the user should see after launching Cluster D
into the half-full Operator Nexus environment that exists after launching
Clusters A, B, and C.

| Cluster | Pool | SKU | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ---------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| D | Control Plane | `NC_G24_112_v1` | 5 | 5 | 5 | 1 | 1 |
| D | Agent Pool #1 | `NC_P46_224_v1` | 24 | 2 | 2 | 12 | 12 |
| D | Agent Pool #2 | `NC_P22_112_v1` | 24 | 8 | 8 | 3 | 3 |

Cluster D Agent Pool #1 has 24 VMs restricted to AvailabilityZones [7, 8], so
it has 24 VMs on 24 bare metal servers, 12 in each of racks 7 and 8. Those VMs
land on bare metal servers also housing extra-large VMs from other clusters due
to the sorting rule that groups extra-large VMs from different clusters onto
the same bare metal servers.

If a Cluster D control plane VM lands on rack 7 or 8, it's likely that one
Cluster D Agent Pool #1 VM lands on the same bare metal server as that Cluster
D control plane VM. This behavior is due to Agent Pool #1 being "pinned" to
racks 7 and 8. Capacity constraints in those racks cause the scheduler to
collocate a control plane VM and an Agent Pool #1 VM from the same NAKS
Cluster.

Cluster D's Agent Pool #2 has three VMs on different bare metal servers in each
of the eight racks. Because Cluster D's Agent Pool #1 is pinned to racks 7 and
8, capacity constraints in those racks cause VMs from Cluster D's Agent Pool #1
and Agent Pool #2 to be collocated on the same bare metal servers in racks 7
and 8.

Here's a visualization of a layout the user might see after deploying Cluster
D into the target environment.

:::image type="content" source="media/nexus-kubernetes/after-second-deployment.png" alt-text="Diagram showing possible layout of VMs after second deployment.":::

### Nearly full Environment

In our example target environment, four of the eight racks are
close to capacity. Let's try to launch another NAKS Cluster.

Cluster E has the following specifications:

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 32 count

Here's a table summarizing what the user should see after launching Cluster E
into the target environment.

| Cluster | Pool | SKU | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ---------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| E | Control Plane | `NC_G24_112_v1` | 5 | 5 | 5 | 1 | 1 |
| E | Agent Pool #1 | `NC_P46_224_v1` | 32 | 8 | 8 | **4** | **3, 4 or 5** |

Cluster E's Agent Pool #1 spreads unevenly over all eight racks. Racks 7
and 8 have three NAKS VMs from Agent Pool #1 instead of the expected four,
because there's no more capacity for the extra-large SKU VMs in those racks
after scheduling Clusters A through D. Because racks 7 and 8 can't take a
fourth extra-large SKU VM from Agent Pool #1, the two displaced NAKS VMs land
on the two least-utilized racks, giving those racks five VMs each. In our
example, those least-utilized racks were racks 3 and 6.

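The uneven result is just leftover arithmetic: 32 VMs over eight racks would be four per rack, but racks 7 and 8 only have room for three extra-large VMs each, so the two displaced VMs push the least-utilized racks to five. A sketch, with the per-rack free extra-large slots assumed for illustration:

```python
# Free extra-large (NC_P46_224_v1-sized) slots per rack after Clusters A-D.
# These counts are assumed for illustration; racks 7 and 8 are nearly full,
# and racks 3 and 6 are the least utilized.
capacity = {1: 8, 2: 8, 3: 9, 4: 8, 5: 8, 6: 9, 7: 3, 8: 3}

def spread(vm_count, capacity):
    """Place VMs one at a time on the rack holding the fewest VMs from this
    pool that still has room, breaking ties toward the least-utilized rack."""
    placed = {r: 0 for r in capacity}
    for _ in range(vm_count):
        rack = min((r for r in capacity if placed[r] < capacity[r]),
                   key=lambda r: (placed[r], placed[r] - capacity[r]))
        placed[rack] += 1
    return placed

print(spread(32, capacity))
```

Running this for Cluster E's 32-VM pool reproduces the "3, 4 or 5" column: racks 7 and 8 stop at three, racks 3 and 6 absorb the overflow at five, and the rest get four.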
Here's a visualization of a layout the user might see after deploying Cluster
E into the target environment.

:::image type="content" source="media/nexus-kubernetes/after-third-deployment.png" alt-text="Diagram showing possible layout of VMs after third deployment.":::

## Placement during a Runtime Upgrade

As of April 2024 (Network Cloud 2403.1 release), runtime upgrades are performed
using a rack-by-rack strategy. Bare metal servers in rack 1 are reimaged all at
once. The upgrade process pauses until all the bare metal servers successfully
restart and tell Nexus that they're ready to receive workloads.

> **Note**: It's possible to instruct Operator Nexus to reimage only a portion
> of the bare metal servers in a rack at once; however, the default is to
> reimage all bare metal servers in a rack in parallel.

When an individual bare metal server is reimaged, all workloads running on that
bare metal server, including all NAKS VMs, lose power and connectivity.
Workload containers running on NAKS VMs, in turn, lose power and connectivity.
After one minute of not being able to reach those workload containers, the NAKS
Cluster's Kubernetes Control Plane marks the corresponding Pods as unhealthy.
If the Pods are members of a Deployment or StatefulSet, the NAKS Cluster's
Kubernetes Control Plane attempts to launch replacement Pods to bring the
observed replica count of the Deployment or StatefulSet back to the desired
replica count.

New Pods only launch if there's available capacity for the Pod in the remaining
healthy NAKS VMs. As of April 2024 (Network Cloud 2403.1 release), new NAKS VMs
aren't created to replace NAKS VMs that were on the bare metal server being
reimaged.

Once the bare metal server is successfully reimaged and able to accept new NAKS
VMs, the NAKS VMs that were originally on that bare metal server relaunch on
the newly reimaged bare metal server. Workload containers may then be scheduled
to those NAKS VMs, potentially restoring the Deployments or StatefulSets that
had Pods on NAKS VMs that were on the bare metal server.

> **Note**: This behavior may seem to the user as if the NAKS VMs didn't
> "move" from the bare metal server, when in fact a new instance of an
> identical NAKS VM was launched on the newly reimaged bare metal server, which
> retained the same bare metal server name as before reimaging.

## Best Practices

When working with Operator Nexus, keep the following best practices in mind.

* Avoid specifying `AvailabilityZones` for an Agent Pool.
* Launch larger NAKS Clusters before smaller ones.
* Reduce the Agent Pool's Count before reducing the VM SKU size.

### Avoid specifying AvailabilityZones for an Agent Pool

As the placement scenarios above show, specifying `AvailabilityZones` for an
Agent Pool is the primary reason that NAKS VMs from the same NAKS Cluster end
up on the same bare metal server. By specifying `AvailabilityZones`, you "pin"
the Agent Pool to a subset of racks and therefore limit the number of potential
bare metal servers in that set of racks available to other NAKS Clusters and to
other Agent Pool VMs in the same NAKS Cluster.

Therefore, our first best practice is to avoid specifying `AvailabilityZones`
for an Agent Pool. If you must pin an Agent Pool to a set of Availability
Zones, make that set as large as possible to minimize the imbalance that can
occur.

The one exception to this best practice is a scenario with only two or three
VMs in an agent pool. You might consider setting `AvailabilityZones` for that
agent pool to `[1,3,5,7]` or `[2,4,6,8]` to increase availability during
runtime upgrades.

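The reasoning behind the exception: a runtime upgrade reimages one rack at a time, so what matters for a tiny pool is that its VMs sit in different racks. Sorting rule #1 spreads the pool across distinct racks when the zone set is larger than the pool, so a single-rack reimage takes down at most one VM. A small illustration (the rack assignments are assumed):

```python
# Two agent pool VMs pinned to zones [1, 3, 5, 7]; rule #1 spreads them
# across distinct racks within that set (assumed placement below).
vm_racks = [1, 5]

# During a rack-by-rack upgrade, each of the eight racks is reimaged in turn.
worst_case_down = max(sum(1 for r in vm_racks if r == reimaged)
                      for reimaged in range(1, 9))
print(worst_case_down)  # 1 -> at least one VM stays up at every upgrade step
```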
### Launch larger NAKS Clusters before smaller ones

As of April 2024 and the Network Cloud 2403.1 release, NAKS Clusters are
scheduled in the order in which they're created. To most efficiently pack your
target environment, we recommend you create larger NAKS Clusters before
smaller ones. Likewise, we recommend you schedule larger Agent Pools before
smaller ones.

This recommendation is important for Agent Pools using the extra-large
`NC_G48_224_v1` or `NC_P46_224_v1` SKU. Scheduling the Agent Pools with the
greatest count of these extra-large SKU VMs first creates a larger set of bare
metal servers upon which other extra-large SKU VMs from Agent Pools in other
NAKS Clusters can collocate.

### Reduce the Agent Pool's Count before reducing the VM SKU size

If you run into capacity constraints when launching a NAKS Cluster or Agent
Pool, reduce the Count of the Agent Pool before adjusting the VM SKU size. For
example, if you attempt to create a NAKS Cluster with an Agent Pool with a VM
SKU size of `NC_P46_224_v1` and a Count of 24 and get back a failure to
provision the NAKS Cluster due to insufficient resources, you may be tempted to
use a VM SKU size of `NC_P36_168_v1` and continue with a Count of 24. However,
because workload VMs must be aligned to a single NUMA cell on a bare metal
server, it's likely that the same request results in similar insufficient
resource failures. Instead of reducing the VM SKU size, consider reducing the
Count of the Agent Pool to 20. There's a better chance your request fits within
the target environment's resource capacity, and your overall deployment has
more CPU cores than if you downsized the VM SKU.
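
The arithmetic behind this recommendation, using the vCPU counts implied by the SKU names (illustrative only): both SKUs must fit within a single 48-vCPU NUMA cell, so shrinking the SKU rarely unlocks servers whose cells are already partly consumed, while shrinking the count preserves more total cores.

```python
numa_cell_cpu = 48
sku_cpu = {"NC_P46_224_v1": 46, "NC_P36_168_v1": 36}

# Both SKUs consume most of a NUMA cell, so a cell without room for the
# 46-vCPU VM often has no room for the 36-vCPU VM either.
assert all(cpu <= numa_cell_cpu for cpu in sku_cpu.values())

downsized_sku = sku_cpu["NC_P36_168_v1"] * 24  # keep Count, shrink SKU
reduced_count = sku_cpu["NC_P46_224_v1"] * 20  # keep SKU, shrink Count
print(downsized_sku, reduced_count)  # 864 920 -> more cores with reduced Count
```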