Commit 660fae8

Merge pull request #263566 from matternst7258/matternst7258/rack-resiliency

[operator-nexus] Documents the rack resiliency of the control plane

2 parents 9d168db + eb6b601

File tree

5 files changed: +132 additions, 0 deletions

articles/operator-nexus/TOC.yml

Lines changed: 4 additions & 0 deletions

@@ -28,6 +28,8 @@
       href: concepts-observability.md
     - name: Security
       href: concepts-security.md
+    - name: Control Plane Resiliency
+      href: concepts-rack-resiliency.md
     - name: Quickstarts
       items:
       - name: Before you start workload deployment
@@ -174,6 +176,8 @@
       href: howto-baremetal-run-read.md
     - name: BareMetal Run-Data-Extract Execution
       href: howto-baremetal-run-data-extract.md
+    - name: Troubleshoot Control Plane Quorum
+      href: troubleshoot-control-plane-quorum.md
     - name: FAQ
       href: azure-operator-nexus-faq.md
     - name: Reference
articles/operator-nexus/concepts-rack-resiliency.md

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@
---
title: Operator Nexus rack resiliency
description: Document how rack resiliency works in Operator Nexus Near Edge
ms.topic: article
ms.date: 01/05/2024
author: matthewernst
ms.author: matthewernst
ms.service: azure-operator-nexus
---

# Ensuring control plane resiliency with the Operator Nexus service

The Operator Nexus service is engineered to maintain control plane resiliency across various compute rack configurations.

## Instances with three or more compute racks

Operator Nexus maintains three active control plane nodes, plus one spare node, in instances with three or more compute racks. When possible, these nodes are distributed across different racks to preserve control plane resiliency.

During runtime upgrades, Operator Nexus upgrades the control plane nodes sequentially, preserving resiliency throughout the upgrade process.

Three compute racks (KCP = Kubernetes control plane node, MGMT = management node):

| Rack 1    | Rack 2 | Rack 3 |
|-----------|--------|--------|
| KCP       | KCP    | KCP    |
| KCP-spare | MGMT   | MGMT   |

Four or more compute racks:

| Rack 1 | Rack 2 | Rack 3 | Rack 4    |
|--------|--------|--------|-----------|
| KCP    | KCP    | KCP    | KCP-spare |
| MGMT   | MGMT   | MGMT   | MGMT      |
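If you have kubectl access to the instance's Kubernetes API, one way to see where the active control plane nodes landed is to list them with the standard control-plane node label. This is only a sketch; access paths and labels can differ on your instance.

~~~bash
# Sketch: list the control plane nodes and where they're scheduled.
# Assumes kubectl connectivity to the instance and the standard Kubernetes
# control-plane node label; both can differ in your environment.
kubectl get nodes -o wide --selector node-role.kubernetes.io/control-plane
~~~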
## Instances with fewer than three compute racks

Operator Nexus maintains an active control plane node and, when available, a spare control plane node. For example, a two-rack configuration has one active Kubernetes control plane (KCP) node and one spare node.

Two compute racks:

| Rack 1 | Rack 2    |
|--------|-----------|
| KCP    | KCP-spare |
| MGMT   | MGMT      |

> [!NOTE]
> Operator Nexus supports control plane resiliency in single-rack configurations by placing three management nodes within the rack. For example, a single-rack configuration with three management servers provides three active control plane nodes, ensuring resiliency within the rack.
## Impacts to the on-premises instance

In disaster situations where the control plane loses quorum, the Kubernetes API is impacted across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and to communicate across racks.
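For context, an etcd-backed control plane stays writable only while a majority of its voting members are healthy: with three active KCP nodes, at least two must remain up (⌊3/2⌋ + 1 = 2). A minimal sketch for checking whether the Kubernetes API is still responding, assuming kubectl connectivity to the instance:

~~~bash
# Sketch: probe the standard Kubernetes API health endpoints.
# Reachability depends on your connectivity and RBAC.
kubectl get --raw='/readyz?verbose'   # readiness checks, including etcd
kubectl get nodes                     # errors or hangs if the API has lost quorum
~~~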
## Related links

[Determining Control Plane Role](./reference-near-edge-baremetal-machine-roles.md)

[Troubleshooting failed Control Plane Quorum](./troubleshoot-control-plane-quorum.md)
Two image files added (91.7 KB and 173 KB): media/troubleshoot-control-plane-quorum/graceful-shutdown.png and media/troubleshoot-control-plane-quorum/graceful-power-on.png.
articles/operator-nexus/troubleshoot-control-plane-quorum.md

Lines changed: 71 additions & 0 deletions

@@ -0,0 +1,71 @@
---
title: Troubleshoot control plane quorum loss
description: Document how to restore control plane quorum after it's lost
ms.topic: article
ms.date: 01/18/2024
author: matthewernst
ms.author: matthewernst
ms.service: azure-operator-nexus
---

# Troubleshoot control plane quorum loss

Follow this troubleshooting guide when multiple control plane nodes are offline or unavailable.
## Prerequisites

- Install the latest version of the [appropriate Azure CLI extensions](./howto-install-cli-extensions.md).
- Gather the following information:
  - Subscription ID
  - Cluster name and resource group
  - Bare metal machine name
- Ensure you're logged in using `az login` (see the sketch after this list).
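A minimal sketch of that setup, assuming the Azure CLI is already installed and that `networkcloud` is the extension name covered by the install guide linked above:

~~~bash
# Sign in and target the subscription that hosts the cluster.
az login
az account set --subscription <Subscription_ID>

# Install or update the Operator Nexus (network cloud) CLI extension, then confirm it.
az extension add --name networkcloud --upgrade
az extension list --query "[?name=='networkcloud']" -o table
~~~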
## Symptoms

- Kubernetes API isn't available
- Multiple control plane nodes are offline or unavailable
## Procedure

1. Identify the Nexus management node.
   - To identify the management nodes, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name>`.
   - Log in to the identified server.
   - Confirm that the ironic-conductor service is present on this node by using `crictl ps -a | grep -i ironic-conductor`.

   Example output:

   ~~~
   testuser@<servername> [ ~ ]$ sudo crictl ps -a | grep -i ironic-conductor
   <id>    <id>    6 hours ago    Running    ironic-conductor    0    <id>
   ~~~
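If the resource group contains many machines, a filtered listing can help narrow the candidates before you log in. This is only a sketch: the `machineRoles` property name is an assumption and may differ across CLI and API versions, so fall back to the raw `-o json` output if it comes back empty.

~~~bash
# Sketch: list machine names alongside their reported roles to spot management nodes.
# machineRoles is an assumed property name; inspect the raw JSON if it isn't present.
az networkcloud baremetalmachine list -g <ResourceGroup_Name> \
  --query "[].{name:name, roles:machineRoles}" -o json
~~~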
2. Determine the iDRAC IP of the server.
   - Run the command `az networkcloud cluster list -g <RG_Name>`.
   - The output is JSON; the iDRAC IP appears in the `bmcConnectionString` field for each machine.

   ~~~
   {
     "bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
     "bmcCredentials": {
       "username": "<username>"
     },
     "bmcMacAddress": "<bmcMacAddress>",
     "bootMacAddress": "<bootMacAddress>",
     "machineDetails": "extraDetails",
     "machineName": "<machineName>",
     "rackSlot": <rackSlot>,
     "serialNumber": "<serialNumber>"
   },
   ~~~
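If the cluster JSON is long, a `jq` filter along the following lines can pull out just the machine names and BMC connection strings. This is only a sketch: it assumes `jq` is installed and that the machine entries appear under `computeRackDefinitions[].bareMetalMachineConfigurationData[]` (single-rack instances may expose them under a different rack definition field), so verify the paths against your own output.

~~~bash
# Sketch: extract machine names and BMC (iDRAC) connection strings from the cluster JSON.
# The JSON paths below are assumptions -- adjust them to match your actual output.
az networkcloud cluster list -g <RG_Name> -o json \
  | jq -r '.[] | (.properties // .)
      | .computeRackDefinitions[]?.bareMetalMachineConfigurationData[]?
      | "\(.machineName)  \(.bmcConnectionString)"'
# The iDRAC IP is the host portion of the redfish+https:// URL in bmcConnectionString.
~~~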
3. Access the iDRAC GUI by browsing to that IP, and gracefully shut down the impacted management servers.

   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot of an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png":::

4. When all impacted management servers are down, turn them back on using the iDRAC GUI.

   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot of an iDRAC GUI and the button to perform a power-on command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png":::

5. The servers should now be restored. If they aren't, engage Microsoft support.
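As an optional follow-up check (a sketch only), the bare metal machine resource can be queried to confirm the affected servers report as powered on and ready again. The property names in the `--query` (such as `powerState`, `readyState`, and `detailedStatus`) are assumptions, so fall back to the full `-o json` output if they don't match your API version.

~~~bash
# Sketch: confirm a machine reports as powered on and ready after recovery.
# Property names in the --query are assumptions; use -o json to inspect the full resource.
az networkcloud baremetalmachine show -n <BareMetalMachine_Name> -g <ResourceGroup_Name> \
  --query "{name:name, power:powerState, ready:readyState, status:detailedStatus}" -o table
~~~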
