---
title: Automatically repair Azure Kubernetes Service (AKS) nodes
description: Learn about node auto-repair functionality and how AKS fixes broken worker nodes.
ms.topic: conceptual
ms.date: 05/30/2023
---

# Azure Kubernetes Service (AKS) node auto-repair

Azure Kubernetes Service (AKS) continuously monitors the health state of worker nodes and performs automatic node repair if they become unhealthy. The Azure virtual machine (VM) platform [performs maintenance on VMs][vm-updates] experiencing issues. AKS and Azure VMs work together to minimize service disruptions for clusters.

In this article, you learn how the automatic node repair functionality behaves for Windows and Linux nodes.

## How AKS checks for unhealthy nodes

AKS uses the following rules to determine if a node is unhealthy and needs repair:

* The node reports the **NotReady** status on consecutive checks within a 10-minute time frame.
* The node doesn't report any status within 10 minutes.

You can manually check the health state of your nodes with the `kubectl get nodes` command.
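
As a quick check, you can list your nodes and inspect the `Ready` condition that AKS evaluates. The node name below is a hypothetical example; substitute one of your own node names:

```bash
# List all nodes; the STATUS column shows Ready or NotReady.
kubectl get nodes

# Print one node's Ready condition (status, reason, last transition time).
# The node name is a hypothetical example.
kubectl get node aks-nodepool1-12345678-vmss000000 \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
```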

## How automatic repair works

> [!NOTE]
> AKS initiates repair operations with the user account **aks-remediator**.

If AKS identifies an unhealthy node that remains unhealthy for *five* minutes, AKS performs the following actions:

1. Attempts to restart the node.
2. If the node restart is unsuccessful, AKS reimages the node.
3. If the reimage is unsuccessful and the node is a Linux node, AKS redeploys the node.

AKS engineers investigate alternative remediations if auto-repair is unsuccessful.

If you want the remediator to reimage a node, you can add the node condition `"customerMarkedAsUnhealthy": true`, and the remediator reimages the node.
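
One way to set that condition is to patch the node's status directly. This is a sketch under stated assumptions: it assumes kubectl 1.24 or later (for the `--subresource` flag), and the node name is a hypothetical example:

```bash
# Append the customerMarkedAsUnhealthy condition to the node's status.
# Assumes kubectl 1.24+ (--subresource=status); hypothetical node name.
kubectl patch node aks-nodepool1-12345678-vmss000000 \
  --subresource=status --type=json \
  -p '[{"op": "add", "path": "/status/conditions/-", "value": {"type": "customerMarkedAsUnhealthy", "status": "True"}}]'
```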

## Node auto-drain

[Scheduled events][scheduled-events] can occur on the underlying VMs in any of your node pools. For [spot node pools][spot-node-pools], scheduled events may cause a *preempt* node event for the node. Certain node events, such as *preempt*, cause AKS node auto-drain to attempt a cordon and drain of the affected node, which enables a graceful reschedule of any affected workloads on that node. You might notice the node receives a taint with `"remediator.aks.microsoft.com/unschedulable"` because of `"kubernetes.azure.com/scalesetpriority: spot"`.
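
To see whether auto-drain has cordoned a node, you can check its scheduling state and taints. The node name below is a hypothetical example:

```bash
# A cordoned node shows SchedulingDisabled in the STATUS column.
kubectl get nodes

# List the taints on one node (hypothetical node name); an auto-drained
# node can carry the remediator.aks.microsoft.com/unschedulable taint.
kubectl get node aks-nodepool1-12345678-vmss000000 \
  -o jsonpath='{.spec.taints}'
```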

The following table shows the node events and the actions they cause for AKS node auto-drain:

| Event | Description | Action |
| --- | --- | --- |
| Freeze | The VM is scheduled to pause for a few seconds. CPU and network connectivity may be suspended, but there's no impact on memory or open files. | No action. |
| Reboot | The VM is scheduled for reboot. The VM's non-persistent memory is lost. | No action. |
| Redeploy | The VM is scheduled to move to another node. The VM's ephemeral disks are lost. | Cordon and drain. |
| Preempt | The spot VM is being deleted. The VM's ephemeral disks are lost. | Cordon and drain. |
| Terminate | The VM is scheduled for deletion. | Cordon and drain. |

## Limitations

In many cases, AKS can determine if a node is unhealthy and attempt to repair the issue. However, there are cases where AKS either can't repair the issue or can't detect that an issue exists. For example, AKS can't detect issues in the following scenarios:

* A node status isn't being reported due to an error in network configuration.
* A node failed to initially register as a healthy node.

## Next steps

Use [availability zones][availability-zones] to increase high availability with your AKS cluster workloads.

<!-- LINKS - Internal -->
[availability-zones]: ./availability-zones.md
[vm-updates]: ../virtual-machines/maintenance-and-updates.md