articles/operator-nexus/concepts-rack-resiliency.md
12 additions & 9 deletions
@@ -2,7 +2,7 @@
title: Operator Nexus rack resiliency
description: Document how rack resiliency works in Operator Nexus Near Edge
ms.topic: article
-ms.date: 05/28/2024
+ms.date: 05/28/2025
author: eak13
ms.author: ekarandjeff
ms.service: azure-operator-nexus
@@ -19,12 +19,15 @@ Operator Nexus ensures the availability of three active Kubernetes control plane
> [!TIP]
> The Kubernetes control plane is a set of components that manage the state of a Kubernetes cluster, schedule workloads, and respond to cluster events. It includes the API server, etcd storage, scheduler, and controller managers.
>
-> The remaining management nodes contain various operators which run the platform software as well as other components performing support capabilities for monitoring, storage and networking.
+> The remaining management nodes contain various operators that run the platform software and other components performing support capabilities for monitoring, storage, and networking.
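
To see these control plane components on a live cluster, here's a minimal sketch (an editorial aside, not part of the changed file; it assumes a reachable cluster and a local kubeconfig) using the official Kubernetes Python client:

```python
# Minimal sketch: list control plane nodes and their readiness.
# Assumes `pip install kubernetes` and valid cluster credentials.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Control plane nodes carry this well-known Kubernetes role label.
nodes = v1.list_node(label_selector="node-role.kubernetes.io/control-plane")
for node in nodes.items:
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
    )
    print(f"{node.metadata.name}: Ready={ready}")
```

On a healthy Operator Nexus instance this would report three Ready control plane nodes, matching the resiliency model described here.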
-During runtime upgrades, Operator Nexus implements a sequential upgrade of the control plane nodes, thereby preserving resiliency throughout the upgrade process.
+During runtime upgrades, Operator Nexus upgrades the control plane nodes sequentially, which preserves resiliency throughout the upgrade process.
Three compute racks:

+KCP = Kubernetes Control Plane Node
+MGMT = Management Node Pool Node
+
| Rack 1 | Rack 2 | Rack 3 |
| --------- | ------ | ------ |
| KCP | KCP | KCP |
@@ -50,7 +53,7 @@ Two compute racks:
Single compute rack:
-Operator Nexus supports control plane resiliency in single rack configurations by having three management nodes within the rack. For example, a single rack configuration with three management servers will provide an equivalent number of active control planes to ensure resiliency within a rack.
+Operator Nexus supports control plane resiliency in single rack configurations by having three management nodes within the rack. For example, a single rack configuration with three management servers provides three active control plane nodes, ensuring resiliency within the rack.
| Rack 1 |
| ------ |
@@ -62,23 +65,23 @@ Operator Nexus supports control plane resiliency in single rack configurations b
In disaster situations when the control plane loses quorum, there are impacts to the Kubernetes API across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and talk across racks.
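
To make the quorum arithmetic concrete (an illustrative aside, not part of the changed file): etcd stays writable only while a majority of its members, floor(n/2) + 1, are healthy, so three control plane nodes tolerate exactly one failure.

```python
# Illustrative sketch of etcd-style majority quorum, not from the article.
def quorum(members: int) -> int:
    """Votes needed for a write-capable majority of `members`."""
    return members // 2 + 1

for n in (1, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={n - quorum(n)}")
# 3 members: quorum=2, tolerated failures=1 -- losing two KCP nodes breaks the API
```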
-## Automated remediation for Kubernetes Control Plane, Management Plane and Compute nodes
+## Automated remediation for Kubernetes Control Plane, Management Plane, and Compute nodes
To avoid losing Kubernetes control plane (KCP) quorum, Operator Nexus provides automated remediation when certain server issues are detected. In certain situations, this automated remediation extends to Management Plane and Compute nodes as well.
As a general overview of server resilience, here are the triggers for automated remediation (a monitoring sketch follows the list):
- For all servers: if a server fails to provision successfully after four hours, automated remediation occurs.
-- For all servers: if a running node is stuck in a Readonly Root Filesystem mode for ten minutes, automated remediation occurs.
+- For all servers: if a running node is stuck in a read-only root filesystem mode for 10 minutes, automated remediation occurs.
- For KCP and Management Plane servers, if a Kubernetes node is in an Unknown state for 30 minutes, automated remediation occurs.
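
The Unknown-state trigger corresponds to the node's `Ready` condition as reported by the Kubernetes API. A minimal sketch of how one could watch for it (illustrative only; this is not the platform's actual remediation code, and the threshold below simply mirrors the 30-minute rule above):

```python
# Illustrative monitoring sketch, not Operator Nexus' remediation logic.
# Flags nodes whose Ready condition has reported Unknown for over 30 minutes.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

THRESHOLD = timedelta(minutes=30)  # mirrors the trigger described above

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next((c for c in node.status.conditions or [] if c.type == "Ready"), None)
    if ready and ready.status == "Unknown":
        stuck = datetime.now(timezone.utc) - ready.last_transition_time
        if stuck > THRESHOLD:
            print(f"{node.metadata.name}: Ready=Unknown for {stuck}")
```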
Remediation Process:
-- Remediation of a Compute node is one re-provisioning attempt. If the re-provisioning fails, the node is marked `Unhealthy`.
-- Remediation of a Management Plane node is to attempt one reboot and then one re-provisioning attempt. If those steps fail, the node is marked `Unhealthy`.
+- Remediation of a Compute node is one reprovisioning attempt. If the reprovisioning fails, the node is marked `Unhealthy`.
+- Remediation of a Management Plane node is one reboot attempt followed by one reprovisioning attempt. If those steps fail, the node is marked `Unhealthy`.
- Remediation of a KCP node is one reboot attempt. If the reboot fails, the node is marked `Unhealthy`, which triggers the immediate provisioning of the spare KCP node.
-A spare KCP node is required to ensure ongoing control plane resiliency. When KCP node fails remediation and is marked `Unhealthy`, it is deprovisioned and then swapped with a suitable healthy Management Plane host. This Management Plane host becomes the new spare KCP node. The failed KCP node is updated to be labeled as a Management Plane node. If it continues to fail to provision or run successfully, it is left in an unhealthy state for the customer to fix the underlying issue. The unhealthy condition is surfaced to the BMM `detailedStatus` in Azure, and it can be cleared by a BMM Replace action.
+A spare KCP node is required to ensure ongoing control plane resiliency. When a KCP node fails remediation and is marked `Unhealthy`, it's deprovisioned and then swapped with a suitable healthy Management Plane host. This Management Plane host becomes the new spare KCP node. The failed KCP node is relabeled as a Management Plane node. If it continues to fail to provision or run successfully, it's left in an unhealthy state for the customer to fix the underlying issue. The unhealthy condition surfaces in the Bare Metal Machine's (BMM) `detailedStatus` field in Azure, and clears through a BMM Replace action.
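
To read that status programmatically (an editorial sketch, not from the article; the client class, operation names, and resource names below follow the Azure Python SDK's standard pattern for the `azure-mgmt-networkcloud` package and should be treated as assumptions):

```python
# Illustrative sketch for reading a BMM's detailedStatus; not from the article.
# Assumes `pip install azure-identity azure-mgmt-networkcloud`; all resource
# names below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.networkcloud import NetworkCloudMgmtClient

client = NetworkCloudMgmtClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",  # placeholder
)

# Hypothetical resource group and machine names for illustration.
bmm = client.bare_metal_machines.get("example-rg", "example-bmm")
print(bmm.detailed_status, "-", bmm.detailed_status_message)
```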