---
title: "Resource Placement in Azure Operator Nexus Kubernetes"
description: An explanation of how Operator Nexus schedules Nexus Kubernetes resources.
author: jaypipes
ms.author: jaypipes
ms.service: azure-operator-nexus
ms.topic: conceptual
ms.date: 04/19/2024
ms.custom: template-concept
---

# Background

Operator Nexus instances are deployed at the customer premises. Each instance
comprises one or more racks of bare metal servers.

When a user creates a Nexus Kubernetes Cluster (NAKS), they specify a count and
a [stock keeping unit](./reference-nexus-kubernetes-cluster-sku.md) (SKU) for
the virtual machines (VMs) that make up the Kubernetes Control Plane and one or
more Agent Pools. Agent Pools are the set of Worker Nodes on which a customer's
containerized network functions run.

The Nexus platform is responsible for deciding the bare metal server on which
each NAKS VM launches.

## How the Nexus Platform Schedules a NAKS VM

Nexus first identifies the set of potential bare metal servers that meet all of
the resource requirements of the NAKS VM SKU. For example, if the user
specified an `NC_G48_224_v1` VM SKU for their agent pool, Nexus collects the
bare metal servers that have available capacity for 48 vCPU, 224Gi of RAM, and
so on.

Nexus then examines the `AvailabilityZones` field for the Agent Pool or Control
Plane being scheduled. If this field isn't empty, Nexus filters the list of
potential bare metal servers to only those servers in the specified
availability zones (racks). This behavior is a *hard scheduling constraint*. If
there are no bare metal servers in the filtered list, Nexus *doesn't schedule*
the NAKS VM and the cluster fails to provision.

Once Nexus identifies a list of potential bare metal servers on which to place
the NAKS VM, Nexus then picks one of the bare metal servers after applying the
following sorting rules:
| 43 | + |
| 44 | +1. Prefer bare metal servers in availability zones (racks) that don't have NAKS |
| 45 | + VMs from this NAKS Cluster. In other words, *spread the NAKS VMs for a NAKS |
| 46 | + Cluster across availability zones*. |
| 47 | + |
| 48 | +1. Prefer bare metal servers within a single availability zone (rack) that |
| 49 | + don't have other NAKS VMs from the same NAKS Cluster. In other words, |
| 50 | + *spread the NAKS VMs for a NAKS Cluster across bare metal servers within an |
| 51 | + availability zone*. |
| 52 | + |
| 53 | +1. If the NAKS VM SKU is either `NC_G48_224_v1` or `NC_P46_224_v1`, prefer |
| 54 | + bare metal servers that already house `NC_G48_224_v1` or `NC_P46_224_v1` |
| 55 | + NAKS VMs from other NAKS Clusters. In other words, *group the extra-large |
| 56 | + VMs from different NAKS Clusters on the same bare metal servers*. This rule |
| 57 | + "bin packs" the extra-large VMs in order to reduce fragmentation of the |
| 58 | + available compute resources. |
| 59 | + |
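The filtering and sorting steps above can be sketched in a few lines of Python. This is an illustrative model only, not the platform's implementation; the `Server` fields and the scoring tuple are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    rack: int                     # availability zone (rack) number
    free_cpu: int                 # vCPUs still available
    free_ram_gi: int              # RAM (Gi) still available
    vms: list = field(default_factory=list)  # (cluster, sku) pairs already placed

EXTRA_LARGE = {"NC_G48_224_v1", "NC_P46_224_v1"}

def place_vm(servers, cluster, sku, cpu, ram_gi, zones=None):
    # Hard constraints: resource fit, then the optional AvailabilityZones filter.
    candidates = [s for s in servers
                  if s.free_cpu >= cpu and s.free_ram_gi >= ram_gi
                  and (not zones or s.rack in zones)]
    if not candidates:
        return None  # no eligible server: the cluster fails to provision

    def rank(s):
        rack_has_cluster = any(c == cluster
                               for srv in servers if srv.rack == s.rack
                               for c, _ in srv.vms)
        server_has_cluster = any(c == cluster for c, _ in s.vms)
        packs_extra_large = (sku in EXTRA_LARGE and
                             any(k in EXTRA_LARGE for c, k in s.vms if c != cluster))
        # Rule 1: spread across racks; rule 2: spread within a rack;
        # rule 3: bin-pack extra-large SKUs from different clusters.
        return (rack_has_cluster, server_has_cluster, not packs_extra_large)

    best = min(candidates, key=rank)
    best.free_cpu -= cpu
    best.free_ram_gi -= ram_gi
    best.vms.append((cluster, sku))
    return best
```

For example, placing two control plane VMs for the same cluster into a two-rack environment lands them in different racks (rule 1), and restricting the zones to a rack that doesn't exist yields no placement at all (the hard constraint).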
## Example Placement Scenarios

The following sections highlight behavior that Nexus users should expect
when creating NAKS Clusters against an Operator Nexus environment.

> **Hint**: You can see which bare metal server your NAKS VMs were scheduled to
> by examining the `nodes.bareMetalMachineId` property of the NAKS
> KubernetesCluster resource or viewing the "Host" column in the Azure portal's
> display of Kubernetes Cluster Nodes.

:::image type="content" source="media/nexus-kubernetes/show-baremetal-host.png" alt-text="A screenshot showing bare metal server for NAKS VMs.":::

The example Operator Nexus environment has these specifications:

* Eight racks of 16 bare metal servers
* Each bare metal server contains two [Non-Uniform Memory Access][numa] (NUMA) cells
* Each NUMA cell provides 48 CPU and 224Gi RAM

[numa]: https://en.wikipedia.org/wiki/Non-uniform_memory_access

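Those specifications imply the aggregate capacity used in the scenarios below. A quick arithmetic sanity check (illustrative only; in practice some resources are reserved for the platform itself):

```python
racks, servers_per_rack = 8, 16
numa_per_server = 2
cpu_per_numa, ram_gi_per_numa = 48, 224

total_servers = racks * servers_per_rack
total_cpu = total_servers * numa_per_server * cpu_per_numa
total_ram_gi = total_servers * numa_per_server * ram_gi_per_numa

# Each 48-CPU NUMA cell can hold at most one extra-large (46-48 vCPU) NAKS VM,
# so each server can host at most two of them.
print(total_servers, total_cpu, total_ram_gi)  # 128 12288 57344
```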
### Empty Environment

Given an empty Operator Nexus environment with the given capacity, we create
three differently sized Nexus Kubernetes Clusters.

The NAKS Clusters have these specifications, and we assume for the purposes of
this exercise that the user creates the three Clusters in the following order:

Cluster A

* Control plane, `NC_G12_56_v1` SKU, three count
* Agent pool #1, `NC_P46_224_v1` SKU, 24 count
* Agent pool #2, `NC_G6_28_v1` SKU, six count

Cluster B

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 48 count
* Agent pool #2, `NC_P22_112_v1` SKU, 24 count

Cluster C

* Control plane, `NC_G12_56_v1` SKU, three count
* Agent pool #1, `NC_P46_224_v1` SKU, 12 count, `AvailabilityZones = [1,4]`

Here's a table summarizing what the user should see after launching Clusters
A, B, and C on an empty Operator Nexus environment.

| Cluster | Pool | SKU | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ---------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| A | Control Plane | `NC_G12_56_v1` | 3 | 3 | 3 | 1 | 1 |
| A | Agent Pool #1 | `NC_P46_224_v1` | 24 | 8 | 8 | 3 | 3 |
| A | Agent Pool #2 | `NC_G6_28_v1` | 6 | 6 | 6 | 1 | 1 |
| B | Control Plane | `NC_G24_112_v1` | 5 | 5 | 5 | 1 | 1 |
| B | Agent Pool #1 | `NC_P46_224_v1` | 48 | 8 | 8 | 6 | 6 |
| B | Agent Pool #2 | `NC_P22_112_v1` | 24 | 8 | 8 | 3 | 3 |
| C | Control Plane | `NC_G12_56_v1` | 3 | 3 | 3 | 1 | 1 |
| C | Agent Pool #1 | `NC_P46_224_v1` | 12 | 2 | 2 | 6 | 6 |

There are eight racks, so the VMs for each pool are spread over up to eight
racks. Pools with more than eight VMs require multiple VMs per rack, spread
across different bare metal servers.

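The "Expected # Racks" and "Expected # VMs per Rack" columns follow from simple round-robin arithmetic over the allowed racks. A minimal helper, assuming an even spread and sufficient capacity:

```python
import math

def expected_spread(vm_count, allowed_racks):
    """Spread vm_count VMs as evenly as possible across up to allowed_racks racks."""
    racks_used = min(vm_count, allowed_racks)
    vms_per_rack = math.ceil(vm_count / racks_used)
    return racks_used, vms_per_rack

print(expected_spread(24, 8))  # Cluster A Agent Pool #1 -> (8, 3)
print(expected_spread(3, 8))   # Cluster A Control Plane -> (3, 1)
print(expected_spread(12, 2))  # Cluster C Agent Pool #1, zones [1, 4] -> (2, 6)
```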
Cluster C Agent Pool #1 has 12 VMs restricted to AvailabilityZones [1, 4], so
it has 12 VMs on 12 bare metal servers, six in each of racks 1 and 4.

Extra-large VMs (the `NC_P46_224_v1` SKU) from different clusters are placed
on the same bare metal servers (see rule #3 in
[How the Nexus Platform Schedules a NAKS VM](#how-the-nexus-platform-schedules-a-naks-vm)).

Here's a visualization of a layout the user might see after deploying Clusters
A, B, and C into an empty environment.

:::image type="content" source="media/nexus-kubernetes/after-first-deployment.png" alt-text="Diagram showing possible layout of VMs after first deployment.":::

### Half-full Environment

We now run through an example of launching another NAKS Cluster when the
target environment is half-full, as it is after Clusters A, B, and C are
deployed.

Cluster D has the following specifications:

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 24 count, `AvailabilityZones = [7,8]`
* Agent pool #2, `NC_P22_112_v1` SKU, 24 count

Here's a table summarizing what the user should see after launching Cluster D
into the half-full Operator Nexus environment that exists after launching
Clusters A, B, and C.

| Cluster | Pool | SKU | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ---------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| D | Control Plane | `NC_G24_112_v1` | 5 | 5 | 5 | 1 | 1 |
| D | Agent Pool #1 | `NC_P46_224_v1` | 24 | 2 | 2 | 12 | 12 |
| D | Agent Pool #2 | `NC_P22_112_v1` | 24 | 8 | 8 | 3 | 3 |

Cluster D Agent Pool #1 has 24 VMs restricted to AvailabilityZones [7, 8], so
it has 24 VMs on 24 bare metal servers, 12 in each of racks 7 and 8. Those VMs
land on bare metal servers also housing extra-large VMs from other clusters due
to the sorting rule that groups extra-large VMs from different clusters onto
the same bare metal servers.

If a Cluster D control plane VM lands on rack 7 or 8, it's likely that one
Cluster D Agent Pool #1 VM lands on the same bare metal server as that Cluster
D control plane VM. This behavior is due to Agent Pool #1 being "pinned" to
racks 7 and 8. Capacity constraints in those racks cause the scheduler to
collocate a control plane VM and an Agent Pool #1 VM from the same NAKS
Cluster.

Cluster D's Agent Pool #2 has three VMs on different bare metal servers in each
of the eight racks. Because Cluster D's Agent Pool #1 is pinned to racks 7 and
8, capacity constraints in those racks cause VMs from Cluster D's Agent Pool #1
and Agent Pool #2 to be collocated on the same bare metal servers in racks 7
and 8.

Here's a visualization of a layout the user might see after deploying Cluster
D into the target environment.

:::image type="content" source="media/nexus-kubernetes/after-second-deployment.png" alt-text="Diagram showing possible layout of VMs after second deployment.":::

### Nearly full Environment

In our example target environment, four of the eight racks are
close to capacity. Let's try to launch another NAKS Cluster.

Cluster E has the following specifications:

* Control plane, `NC_G24_112_v1` SKU, five count
* Agent pool #1, `NC_P46_224_v1` SKU, 32 count

Here's a table summarizing what the user should see after launching Cluster E
into the target environment.

| Cluster | Pool | SKU | Total Count | Expected # Racks | Actual # Racks | Expected # VMs per Rack | Actual # VMs per Rack |
| ------- | ---------------- | --------------- | ----------- | ---------------- | -------------- | ----------------------- | --------------------- |
| E | Control Plane | `NC_G24_112_v1` | 5 | 5 | 5 | 1 | 1 |
| E | Agent Pool #1 | `NC_P46_224_v1` | 32 | 8 | 8 | **4** | **3, 4 or 5** |

Cluster E's Agent Pool #1 spreads unevenly over all eight racks. Racks 7
and 8 have three NAKS VMs from Agent Pool #1 instead of the expected four,
because there's no more capacity for the extra-large SKU VMs in those racks
after scheduling Clusters A through D. Because racks 7 and 8 can't take a
fourth extra-large SKU VM from Agent Pool #1, the two displaced NAKS VMs land
on the two least-utilized racks, giving those racks five VMs each. In our
example, those least-utilized racks were racks 3 and 6.

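The uneven result is just leftover arithmetic: 32 VMs over eight racks would be four per rack, but racks 7 and 8 only have room for three extra-large VMs each, so the two displaced VMs push the least-utilized racks to five. A sketch, with the per-rack free extra-large slots assumed for illustration:

```python
# Free extra-large (NC_P46_224_v1-sized) slots per rack after Clusters A-D.
# These counts are assumed for illustration; racks 7 and 8 are nearly full,
# and racks 3 and 6 are the least utilized.
capacity = {1: 8, 2: 8, 3: 9, 4: 8, 5: 8, 6: 9, 7: 3, 8: 3}

def spread(vm_count, capacity):
    """Place VMs one at a time on the rack holding the fewest VMs from this
    pool that still has room, breaking ties toward the least-utilized rack."""
    placed = {r: 0 for r in capacity}
    for _ in range(vm_count):
        rack = min((r for r in capacity if placed[r] < capacity[r]),
                   key=lambda r: (placed[r], placed[r] - capacity[r]))
        placed[rack] += 1
    return placed

print(spread(32, capacity))
```

Running this for Cluster E's 32-VM pool reproduces the "3, 4 or 5" column: racks 7 and 8 stop at three, racks 3 and 6 absorb the overflow at five, and the rest get four.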
Here's a visualization of a layout the user might see after deploying Cluster
E into the target environment.

:::image type="content" source="media/nexus-kubernetes/after-third-deployment.png" alt-text="Diagram showing possible layout of VMs after third deployment.":::

## Placement during a Runtime Upgrade

As of April 2024 (Network Cloud 2403.1 release), runtime upgrades are performed
using a rack-by-rack strategy. Bare metal servers in rack 1 are reimaged all at
once. The upgrade process pauses until all the bare metal servers successfully
restart and tell Nexus that they're ready to receive workloads.

> **Note**: It's possible to instruct Operator Nexus to reimage only a portion
> of the bare metal servers in a rack at once; however, the default is to
> reimage all bare metal servers in a rack in parallel.

When an individual bare metal server is reimaged, all workloads running on that
bare metal server, including all NAKS VMs, lose power and connectivity.
Workload containers running on NAKS VMs, in turn, lose power and connectivity.
After one minute of not being able to reach those workload containers, the NAKS
Cluster's Kubernetes Control Plane marks the corresponding Pods as unhealthy.
If the Pods are members of a Deployment or StatefulSet, the NAKS Cluster's
Kubernetes Control Plane attempts to launch replacement Pods to bring the
observed replica count of the Deployment or StatefulSet back to the desired
replica count.

New Pods only launch if there's available capacity for the Pod in the remaining
healthy NAKS VMs. As of April 2024 (Network Cloud 2403.1 release), new NAKS VMs
aren't created to replace NAKS VMs that were on the bare metal server being
reimaged.

Once the bare metal server is successfully reimaged and able to accept new NAKS
VMs, the NAKS VMs that were originally on that bare metal server relaunch on
the newly reimaged bare metal server. Workload containers may then be scheduled
to those NAKS VMs, potentially restoring the Deployments or StatefulSets that
had Pods on NAKS VMs that were on the bare metal server.

> **Note**: This behavior may seem to the user as if the NAKS VMs didn't
> "move" from the bare metal server, when in fact a new instance of an
> identical NAKS VM was launched on the newly reimaged bare metal server, which
> retained the same bare metal server name as before reimaging.

## Best Practices

When working with Operator Nexus, keep the following best practices in mind.

* Avoid specifying `AvailabilityZones` for an Agent Pool.
* Launch larger NAKS Clusters before smaller ones.
* Reduce the Agent Pool's Count before reducing the VM SKU size.

### Avoid specifying AvailabilityZones for an Agent Pool

As the placement scenarios above show, specifying `AvailabilityZones` for an
Agent Pool is the primary reason that NAKS VMs from the same NAKS Cluster end
up on the same bare metal server. By specifying `AvailabilityZones`, you "pin"
the Agent Pool to a subset of racks and therefore limit the number of potential
bare metal servers in that set of racks available to other NAKS Clusters and to
other Agent Pool VMs in the same NAKS Cluster.

Therefore, our first best practice is to avoid specifying `AvailabilityZones`
for an Agent Pool. If you must pin an Agent Pool to a set of Availability
Zones, make that set as large as possible to minimize the imbalance that can
occur.

The one exception to this best practice is a scenario with only two or three
VMs in an agent pool. You might consider setting `AvailabilityZones` for that
agent pool to `[1,3,5,7]` or `[2,4,6,8]` to increase availability during
runtime upgrades.

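The reasoning behind the exception: a runtime upgrade reimages one rack at a time, so what matters for a tiny pool is that its VMs sit in different racks. Sorting rule #1 spreads the pool across distinct racks when the zone set is larger than the pool, so a single-rack reimage takes down at most one VM. A small illustration (the rack assignments are assumed):

```python
# Two agent pool VMs pinned to zones [1, 3, 5, 7]; rule #1 spreads them
# across distinct racks within that set (assumed placement below).
vm_racks = [1, 5]

# During a rack-by-rack upgrade, each of the eight racks is reimaged in turn.
worst_case_down = max(sum(1 for r in vm_racks if r == reimaged)
                      for reimaged in range(1, 9))
print(worst_case_down)  # 1 -> at least one VM stays up at every upgrade step
```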
### Launch larger NAKS Clusters before smaller ones

As of April 2024 and the Network Cloud 2403.1 release, NAKS Clusters are
scheduled in the order in which they're created. To most efficiently pack your
target environment, we recommend you create larger NAKS Clusters before
smaller ones. Likewise, we recommend you schedule larger Agent Pools before
smaller ones.

This recommendation is important for Agent Pools using the extra-large
`NC_G48_224_v1` or `NC_P46_224_v1` SKU. Scheduling the Agent Pools with the
greatest count of these extra-large SKU VMs first creates a larger set of bare
metal servers upon which other extra-large SKU VMs from Agent Pools in other
NAKS Clusters can collocate.

### Reduce the Agent Pool's Count before reducing the VM SKU size

If you run into capacity constraints when launching a NAKS Cluster or Agent
Pool, reduce the Count of the Agent Pool before adjusting the VM SKU size. For
example, if you attempt to create a NAKS Cluster with an Agent Pool with a VM
SKU size of `NC_P46_224_v1` and a Count of 24 and get back a failure to
provision the NAKS Cluster due to insufficient resources, you may be tempted to
use a VM SKU size of `NC_P36_168_v1` and continue with a Count of 24. However,
because workload VMs must be aligned to a single NUMA cell on a bare metal
server, it's likely that the same request results in similar insufficient
resource failures. Instead of reducing the VM SKU size, consider reducing the
Count of the Agent Pool to 20. There's a better chance your request fits within
the target environment's resource capacity, and your overall deployment has
more CPU cores than if you downsized the VM SKU.
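
The arithmetic behind this recommendation, using the vCPU counts implied by the SKU names (illustrative only): both SKUs must fit within a single 48-vCPU NUMA cell, so shrinking the SKU rarely unlocks servers whose cells are already partly consumed, while shrinking the count preserves more total cores.

```python
numa_cell_cpu = 48
sku_cpu = {"NC_P46_224_v1": 46, "NC_P36_168_v1": 36}

# Both SKUs consume most of a NUMA cell, so a cell without room for the
# 46-vCPU VM often has no room for the 36-vCPU VM either.
assert all(cpu <= numa_cell_cpu for cpu in sku_cpu.values())

downsized_sku = sku_cpu["NC_P36_168_v1"] * 24  # keep Count, shrink SKU
reduced_count = sku_cpu["NC_P46_224_v1"] * 20  # keep SKU, shrink Count
print(downsized_sku, reduced_count)  # 864 920 -> more cores with reduced Count
```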