
Commit e6bd8ae

Andrew committed
Updates to add info on automated remediation
1 parent 1c57913 commit e6bd8ae

1 file changed: +41 -23 lines changed

articles/operator-nexus/concepts-rack-resiliency.md

Lines changed: 41 additions & 23 deletions
@@ -2,9 +2,9 @@
title: Operator Nexus rack resiliency
description: Document how rack resiliency works in Operator Nexus Near Edge
ms.topic: article
-ms.date: 01/05/2024
-author: matthewernst
-ms.author: matthewernst
+ms.date: 05/28/2024
+author: eak13
+ms.author: ekarandjeff
ms.service: azure-operator-nexus
---

@@ -24,43 +24,61 @@ Operator Nexus ensures the availability of three active Kubernetes control plane
During runtime upgrades, Operator Nexus implements a sequential upgrade of the control plane nodes, thereby preserving resiliency throughout the upgrade process.

Three compute racks:
-
-| Rack 1 | Rack 2 | Rack 3 |
-|------------|---------|----------|
-| KCP | KCP | KCP |
-| KCP-spare | MGMT | MGMT |
+
+| Rack 1    | Rack 2 | Rack 3 |
+| --------- | ------ | ------ |
+| KCP       | KCP    | KCP    |
+| KCP-spare | MGMT   | MGMT   |

Four or more compute racks:

-| Rack 1 | Rack 2 | Rack 3 | Rack 4 |
-|---------|---------|----------|----------|
-| KCP | KCP | KCP | KCP-spare|
-| MGMT | MGMT | MGMT | MGMT |
+| Rack 1 | Rack 2 | Rack 3 | Rack 4    |
+| ------ | ------ | ------ | --------- |
+| KCP    | KCP    | KCP    | KCP-spare |
+| MGMT   | MGMT   | MGMT   | MGMT      |

## Instances with less than three compute racks

Operator Nexus maintains an active control plane node and, if available, a spare control plane instance. For instance, a two-rack configuration has one active Kubernetes Control Plane (KCP) node and one spare node.

Two compute racks:
-
-| Rack 1 | Rack 2 |
-|------------|----------|
-| KCP | KCP-spare|
-| MGMT | MGMT |
+
+| Rack 1 | Rack 2    |
+| ------ | --------- |
+| KCP    | KCP-spare |
+| MGMT   | MGMT      |

Single compute rack:

Operator Nexus supports control plane resiliency in single rack configurations by having three management nodes within the rack. For example, a single rack configuration with three management servers will provide an equivalent number of active control planes to ensure resiliency within a rack.

-| Rack 1 |
-|------------|
-| KCP |
-| KCP |
-| KCP |
+| Rack 1 |
+| ------ |
+| KCP    |
+| KCP    |
+| KCP    |

## Resiliency implications of lost quorum

-In disaster situations when the control plane loses quorum, there are impacts to the Kubernetes API across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and talk across racks.
+In disaster situations when the control plane loses quorum, there are impacts to the Kubernetes API across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and talk across racks.
+
+## Automated remediation for Kubernetes Control Plane, Management Plane, and Compute nodes
+
+To avoid losing Kubernetes control plane (KCP) quorum, Operator Nexus provides automated remediation when certain server issues are detected. In certain situations, this automated remediation extends to Management Plane and Compute nodes as well.
+
+As a general overview of server resilience, here are the triggers for automated remediation:
+
+- For all servers: if a server fails to provision successfully after four hours, automated remediation occurs.
+- For all servers: if a running node is stuck in a read-only root filesystem mode for 10 minutes, automated remediation occurs.
+- For KCP and Management Plane servers: if a Kubernetes node is in an Unknown state for 30 minutes, automated remediation occurs.
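The Unknown state referenced in the last trigger is the node's `Ready` condition as reported by Kubernetes when the kubelet stops responding. Here is a minimal sketch of how such nodes can be listed, assuming Python with the official `kubernetes` client and kubeconfig access to the cluster (both assumptions for illustration only):

```python
# Sketch: list nodes whose Ready condition is Unknown (kubelet unreachable).
# Assumes the official "kubernetes" Python client and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next((c for c in (node.status.conditions or []) if c.type == "Ready"), None)
    if ready is not None and ready.status == "Unknown":
        # last_transition_time shows how long the node has been unreachable;
        # per the triggers above, KCP and Management Plane nodes are remediated
        # after roughly 30 minutes in this state.
        print(node.metadata.name, ready.status, ready.last_transition_time)
```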
+
+Remediation process:
+
+- Remediation of a Compute node is one re-provisioning attempt. If the re-provisioning fails, the node is marked `Unhealthy`.
+- Remediation of a Management Plane node is one reboot attempt followed by one re-provisioning attempt. If those steps fail, the node is marked `Unhealthy`.
+- Remediation of a KCP node is one reboot attempt. If the reboot fails, the node is marked `Unhealthy`, which triggers the immediate provisioning of the spare KCP node.
+
+A spare KCP node is required to ensure ongoing control plane resiliency. When a KCP node fails remediation and is marked `Unhealthy`, it is deprovisioned and then swapped with a suitable healthy Management Plane host, which becomes the new spare KCP node. The failed KCP node is relabeled as a Management Plane node. If it continues to fail to provision or run successfully, it is left in an unhealthy state for the customer to fix the underlying issue. The unhealthy condition is surfaced in the bare metal machine (BMM) detailedStatus in Azure, and it can be cleared by a BMM Replace action.
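For the detailedStatus check described above, here is a minimal sketch that reads the value through the Azure CLI `networkcloud` extension, invoked from Python; the resource group and machine name are placeholders, and the exact output shape is an assumption based on the doc text:

```python
# Sketch: read an Operator Nexus bare metal machine's detailedStatus via the
# Azure CLI "networkcloud" extension (assumed installed and logged in).
import json
import subprocess

def bmm_detailed_status(resource_group: str, machine_name: str) -> str:
    """Return the detailedStatus of a bare metal machine (placeholder names)."""
    result = subprocess.run(
        [
            "az", "networkcloud", "baremetalmachine", "show",
            "--resource-group", resource_group,
            "--name", machine_name,
            "--query", "detailedStatus",
            "--output", "json",
        ],
        capture_output=True, text=True, check=True,
    )
    # If the query returns null, inspect the full `show` output; the property
    # path is an assumption here.
    return json.loads(result.stdout)

# Example with placeholder names. Once the underlying hardware issue is fixed,
# a BMM Replace action (`az networkcloud baremetalmachine replace`; see the
# extension's reference for its required parameters) clears the condition.
print(bmm_detailed_status("myNexusResourceGroup", "rack1compute01"))
```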

## Related Links
