Commit 9d7ddc4

Merge pull request #300530 from eak13/main
Updates to add info on automated remediation
2 parents f1e35c8 + dffdb8d commit 9d7ddc4

File tree

1 file changed (+47 −26 lines)


articles/operator-nexus/concepts-rack-resiliency.md

@@ -2,9 +2,9 @@
 title: Operator Nexus rack resiliency
 description: Document how rack resiliency works in Operator Nexus Near Edge
 ms.topic: article
-ms.date: 01/05/2024
-author: matthewernst
-ms.author: matthewernst
+ms.date: 05/28/2025
+author: eak13
+ms.author: ekarandjeff
 ms.service: azure-operator-nexus
 ---
 
@@ -19,48 +19,69 @@ Operator Nexus ensures the availability of three active Kubernetes control plane
 > [!TIP]
 > The Kubernetes control plane is a set of components that manage the state of a Kubernetes cluster, schedule workloads, and respond to cluster events. It includes the API server, etcd storage, scheduler, and controller managers.
 >
-> The remaining management nodes contain various operators which run the platform software as well as other components performing support capabilities for monitoring, storage and networking.
+> The remaining management nodes contain various operators which run the platform software and other components performing support capabilities for monitoring, storage, and networking.
 
-During runtime upgrades, Operator Nexus implements a sequential upgrade of the control plane nodes, thereby preserving resiliency throughout the upgrade process.
+During runtime upgrades, Operator Nexus implements a sequential upgrade of the control plane nodes, which preserves resiliency throughout the upgrade process.
 
 Three compute racks:
-
-| Rack 1 | Rack 2 | Rack 3 |
-|------------|---------|----------|
-| KCP | KCP | KCP |
-| KCP-spare | MGMT | MGMT |
+
+KCP = Kubernetes Control Plane Node
+MGMT = Management Node Pool Node
+
+| Rack 1    | Rack 2 | Rack 3 |
+| --------- | ------ | ------ |
+| KCP       | KCP    | KCP    |
+| KCP-spare | MGMT   | MGMT   |
 
 Four or more compute racks:
 
-| Rack 1 | Rack 2 | Rack 3 | Rack 4 |
-|---------|---------|----------|----------|
-| KCP | KCP | KCP | KCP-spare|
-| MGMT | MGMT | MGMT | MGMT |
+| Rack 1 | Rack 2 | Rack 3 | Rack 4    |
+| ------ | ------ | ------ | --------- |
+| KCP    | KCP    | KCP    | KCP-spare |
+| MGMT   | MGMT   | MGMT   | MGMT      |
 
 ## Instances with less than three compute racks
 
 Operator Nexus maintains an active control plane node and, if available, a spare control plane instance. For instance, a two-rack configuration has one active Kubernetes Control Plane (KCP) node and one spare node.
 
 Two compute racks:
-
-| Rack 1 | Rack 2 |
-|------------|----------|
-| KCP | KCP-spare|
-| MGMT | MGMT |
+
+| Rack 1 | Rack 2    |
+| ------ | --------- |
+| KCP    | KCP-spare |
+| MGMT   | MGMT      |
 
 Single compute rack:
 
-Operator Nexus supports control plane resiliency in single rack configurations by having three management nodes within the rack. For example, a single rack configuration with three management servers will provide an equivalent number of active control planes to ensure resiliency within a rack.
+Operator Nexus supports control plane resiliency in single rack configurations by having three management nodes within the rack. For example, a single rack configuration with three management servers provides an equivalent number of active control planes to ensure resiliency within a rack.
 
-| Rack 1 |
-|------------|
-| KCP |
-| KCP |
-| KCP |
+| Rack 1 |
+| ------ |
+| KCP    |
+| KCP    |
+| KCP    |
 
 ## Resiliency implications of lost quorum
 
-In disaster situations when the control plane loses quorum, there are impacts to the Kubernetes API across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and talk across racks.
+In disaster situations when the control plane loses quorum, there are impacts to the Kubernetes API across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and talk across racks.
+
+## Automated remediation for Kubernetes Control Plane, Management Plane, and Compute nodes
+
+To avoid losing Kubernetes control plane (KCP) quorum, Operator Nexus provides automated remediation when certain server issues are detected. In certain situations, this automated remediation extends to Management Plane and Compute nodes as well.
+
+As a general overview of server resilience, here are the triggers for automated remediation:
+
+- For all servers: if a server fails to provision successfully after four hours, automated remediation occurs.
+- For all servers: if a running node is stuck in a read-only root filesystem mode for 10 minutes, automated remediation occurs.
+- For KCP and Management Plane servers: if a Kubernetes node is in an Unknown state for 30 minutes, automated remediation occurs.
+
+Remediation process:
+
+- Remediation of a Compute node is one reprovisioning attempt. If the reprovisioning fails, the node is marked `Unhealthy`.
+- Remediation of a Management Plane node is one reboot attempt followed by one reprovisioning attempt. If those steps fail, the node is marked `Unhealthy`.
+- Remediation of a KCP node is one reboot attempt. If the reboot fails, the node is marked `Unhealthy`, which triggers the immediate provisioning of the spare KCP node.
+
+Ongoing control plane resiliency requires a spare KCP node. When a KCP node fails remediation and is marked `Unhealthy`, the node is deprovisioned and exchanged with a suitable healthy Management Plane server, which becomes the new spare KCP node. The failed KCP node is relabeled as a Management Plane node. Once the label changes, an attempt to provision the newly labeled Management Plane node occurs; if it fails to provision, the Management Plane remediation process takes over. If that remediation also fails or doesn't run successfully, the machine's status remains unhealthy and the user must fix it. The unhealthy condition surfaces in the Bare Metal Machine's (BMM) `detailedStatus` fields in Azure and clears through a BMM Replace action.
 
 ## Related Links
 
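The remediation rules this commit documents (triggers per server role, then a fixed sequence of recovery steps) can be sketched as a small decision function. This is an illustrative model only, not Operator Nexus code: the role names, function, and constants below are hypothetical; only the thresholds and step sequences come from the doc text.

```python
from datetime import timedelta

# Thresholds as described in the doc (illustrative constants).
PROVISION_TIMEOUT = timedelta(hours=4)         # all servers: failed provisioning
READONLY_FS_TIMEOUT = timedelta(minutes=10)    # all servers: read-only root filesystem
UNKNOWN_STATE_TIMEOUT = timedelta(minutes=30)  # KCP and Management Plane only

def needs_remediation(role, provisioning_for, readonly_fs_for, unknown_for):
    """Return True if any documented trigger fires.

    Each *_for argument is a timedelta giving how long the node has been in
    that condition (zero if it is not in that condition). `role` is one of
    'kcp', 'mgmt', or 'compute' (hypothetical labels).
    """
    if provisioning_for >= PROVISION_TIMEOUT:
        return True
    if readonly_fs_for >= READONLY_FS_TIMEOUT:
        return True
    if role in ("kcp", "mgmt") and unknown_for >= UNKNOWN_STATE_TIMEOUT:
        return True
    return False

# Per-role recovery steps; if every step fails, the node is marked Unhealthy.
REMEDIATION_STEPS = {
    "compute": ["reprovision"],         # one reprovisioning attempt
    "mgmt": ["reboot", "reprovision"],  # one reboot, then one reprovision
    "kcp": ["reboot"],                  # one reboot; failure promotes the spare KCP
}
```

Modeling the triggers as durations-in-condition mirrors how the doc phrases them ("stuck ... for 10 minutes", "Unknown state for 30 minutes"); a real controller would derive these from node condition transition timestamps.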
