Commit 660fae8

Merge pull request #263566 from matternst7258/matternst7258/rack-resiliency

[operator-nexus] Documents the rack resiliency of the control plane

2 parents 9d168db + eb6b601

File tree

5 files changed: +132 additions, 0 deletions

articles/operator-nexus/TOC.yml

Lines changed: 4 additions & 0 deletions

@@ -28,6 +28,8 @@
       href: concepts-observability.md
     - name: Security
       href: concepts-security.md
+    - name: Control Plane Resiliency
+      href: concepts-rack-resiliency.md
     - name: Quickstarts
       items:
       - name: Before you start workload deployment
@@ -174,6 +176,8 @@
       href: howto-baremetal-run-read.md
     - name: BareMetal Run-Data-Extract Execution
       href: howto-baremetal-run-data-extract.md
+    - name: Troubleshoot Control Plane Quorum
+      href: troubleshoot-control-plane-quorum.md
     - name: FAQ
       href: azure-operator-nexus-faq.md
     - name: Reference
articles/operator-nexus/concepts-rack-resiliency.md

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@
---
title: Operator Nexus rack resiliency
description: Document how rack resiliency works in Operator Nexus Near Edge
ms.topic: article
ms.date: 01/05/2024
author: matthewernst
ms.author: matthewernst
ms.service: azure-operator-nexus
---

# Ensuring control plane resiliency with the Operator Nexus service

The Operator Nexus service is engineered to maintain control plane resiliency across various compute rack configurations.

## Instances with three or more compute racks

Operator Nexus maintains three active control plane nodes, plus one spare node, in instances with three or more compute racks. When possible, these nodes are distributed across different racks to preserve control plane resiliency.

During runtime upgrades, Operator Nexus upgrades the control plane nodes sequentially, preserving resiliency throughout the upgrade process.

Three compute racks (KCP = Kubernetes control plane node, MGMT = management node):

| Rack 1    | Rack 2 | Rack 3 |
|-----------|--------|--------|
| KCP       | KCP    | KCP    |
| KCP-spare | MGMT   | MGMT   |

Four or more compute racks:

| Rack 1 | Rack 2 | Rack 3 | Rack 4    |
|--------|--------|--------|-----------|
| KCP    | KCP    | KCP    | KCP-spare |
| MGMT   | MGMT   | MGMT   | MGMT      |
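If you have kubectl access to the instance's Kubernetes API, one way to see where the active control plane nodes landed is to list them with the standard control-plane node label. This is only a sketch; access paths and labels can differ on your instance.

~~~bash
# Sketch: list the control plane nodes and where they're scheduled.
# Assumes kubectl connectivity to the instance and the standard Kubernetes
# control-plane node label; both can differ in your environment.
kubectl get nodes -o wide --selector node-role.kubernetes.io/control-plane
~~~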
## Instances with fewer than three compute racks

Operator Nexus maintains an active control plane node and, when available, a spare control plane node. For example, a two-rack configuration has one active Kubernetes control plane (KCP) node and one spare node.

Two compute racks:

| Rack 1 | Rack 2    |
|--------|-----------|
| KCP    | KCP-spare |
| MGMT   | MGMT      |

> [!NOTE]
> Operator Nexus supports control plane resiliency in single-rack configurations by placing three management nodes within the rack. For example, a single-rack configuration with three management servers provides three active control plane nodes, ensuring resiliency within the rack.
## Impacts to the on-premises instance

In disaster situations where the control plane loses quorum, the Kubernetes API is impacted across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and to communicate across racks.
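For context, an etcd-backed control plane stays writable only while a majority of its voting members are healthy: with three active KCP nodes, at least two must remain up (⌊3/2⌋ + 1 = 2). A minimal sketch for checking whether the Kubernetes API is still responding, assuming kubectl connectivity to the instance:

~~~bash
# Sketch: probe the standard Kubernetes API health endpoints.
# Reachability depends on your connectivity and RBAC.
kubectl get --raw='/readyz?verbose'   # readiness checks, including etcd
kubectl get nodes                     # errors or hangs if the API has lost quorum
~~~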
## Related links

[Determining Control Plane Role](./reference-near-edge-baremetal-machine-roles.md)

[Troubleshooting failed Control Plane Quorum](./troubleshoot-control-plane-quorum.md)
Two image files added (91.7 KB and 173 KB): media/troubleshoot-control-plane-quorum/graceful-shutdown.png and media/troubleshoot-control-plane-quorum/graceful-power-on.png.
articles/operator-nexus/troubleshoot-control-plane-quorum.md

Lines changed: 71 additions & 0 deletions

@@ -0,0 +1,71 @@
---
title: Troubleshoot control plane quorum loss
description: Document how to restore control plane quorum after it's lost
ms.topic: article
ms.date: 01/18/2024
author: matthewernst
ms.author: matthewernst
ms.service: azure-operator-nexus
---

# Troubleshoot control plane quorum loss

Follow this troubleshooting guide when multiple control plane nodes are offline or unavailable.
## Prerequisites

- Install the latest version of the [appropriate Azure CLI extensions](./howto-install-cli-extensions.md).
- Gather the following information:
  - Subscription ID
  - Cluster name and resource group
  - Bare metal machine name
- Ensure you're logged in using `az login` (see the sketch after this list).
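A minimal sketch of that setup, assuming the Azure CLI is already installed and that `networkcloud` is the extension name covered by the install guide linked above:

~~~bash
# Sign in and target the subscription that hosts the cluster.
az login
az account set --subscription <Subscription_ID>

# Install or update the Operator Nexus (network cloud) CLI extension, then confirm it.
az extension add --name networkcloud --upgrade
az extension list --query "[?name=='networkcloud']" -o table
~~~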
## Symptoms

- Kubernetes API isn't available
- Multiple control plane nodes are offline or unavailable
## Procedure

1. Identify the Nexus management node.
   - To identify the management nodes, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name>`.
   - Log in to the identified server.
   - Confirm that the ironic-conductor service is present on this node by using `crictl ps -a | grep -i ironic-conductor`.

   Example output:

   ~~~
   testuser@<servername> [ ~ ]$ sudo crictl ps -a | grep -i ironic-conductor
   <id>    <id>    6 hours ago    Running    ironic-conductor    0    <id>
   ~~~
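If the resource group contains many machines, a filtered listing can help narrow the candidates before you log in. This is only a sketch: the `machineRoles` property name is an assumption and may differ across CLI and API versions, so fall back to the raw `-o json` output if it comes back empty.

~~~bash
# Sketch: list machine names alongside their reported roles to spot management nodes.
# machineRoles is an assumed property name; inspect the raw JSON if it isn't present.
az networkcloud baremetalmachine list -g <ResourceGroup_Name> \
  --query "[].{name:name, roles:machineRoles}" -o json
~~~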
2. Determine the iDRAC IP of the server.
   - Run the command `az networkcloud cluster list -g <RG_Name>`.
   - The output is JSON; the iDRAC IP appears in the `bmcConnectionString` field for each machine.

   ~~~
   {
     "bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
     "bmcCredentials": {
       "username": "<username>"
     },
     "bmcMacAddress": "<bmcMacAddress>",
     "bootMacAddress": "<bootMacAddress>",
     "machineDetails": "extraDetails",
     "machineName": "<machineName>",
     "rackSlot": <rackSlot>,
     "serialNumber": "<serialNumber>"
   },
   ~~~
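If the cluster JSON is long, a `jq` filter along the following lines can pull out just the machine names and BMC connection strings. This is only a sketch: it assumes `jq` is installed and that the machine entries appear under `computeRackDefinitions[].bareMetalMachineConfigurationData[]` (single-rack instances may expose them under a different rack definition field), so verify the paths against your own output.

~~~bash
# Sketch: extract machine names and BMC (iDRAC) connection strings from the cluster JSON.
# The JSON paths below are assumptions -- adjust them to match your actual output.
az networkcloud cluster list -g <RG_Name> -o json \
  | jq -r '.[] | (.properties // .)
      | .computeRackDefinitions[]?.bareMetalMachineConfigurationData[]?
      | "\(.machineName)  \(.bmcConnectionString)"'
# The iDRAC IP is the host portion of the redfish+https:// URL in bmcConnectionString.
~~~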
3. Access the iDRAC GUI by browsing to that IP, and gracefully shut down the impacted management servers.

   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot of an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png":::

4. When all impacted management servers are down, turn them back on using the iDRAC GUI.

   :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot of an iDRAC GUI and the button to perform a power-on command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png":::

5. The servers should now be restored. If they aren't, engage Microsoft support.
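As an optional follow-up check (a sketch only), the bare metal machine resource can be queried to confirm the affected servers report as powered on and ready again. The property names in the `--query` (such as `powerState`, `readyState`, and `detailedStatus`) are assumptions, so fall back to the full `-o json` output if they don't match your API version.

~~~bash
# Sketch: confirm a machine reports as powered on and ready after recovery.
# Property names in the --query are assumptions; use -o json to inspect the full resource.
az networkcloud baremetalmachine show -n <BareMetalMachine_Name> -g <ResourceGroup_Name> \
  --query "{name:name, power:powerState, ready:readyState, status:detailedStatus}" -o table
~~~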
