Commit 33cbab3

Merge pull request #107132 from mlearned/mdl-1662862-autorepair-nodes
Mdl 1662862 autorepair nodes
2 parents f1db56a + 1a99131 commit 33cbab3

2 files changed

+56
-0
lines changed

articles/aks/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -74,6 +74,8 @@
7474
href: concepts-storage.md
7575
- name: Scale
7676
href: concepts-scale.md
77+
- name: Node auto-repair
78+
href: node-auto-repair.md
7779
- name: Best practices
7880
items:
7981
- name: Overview

articles/aks/node-auto-repair.md

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
---
2+
title: Automatically repairing Azure Kubernetes Service (AKS) nodes
3+
description: Learn about node auto-repair functionality, and how AKS fixes broken worker nodes.
4+
services: container-service
5+
ms.topic: conceptual
6+
ms.date: 03/10/2020
7+
---
8+
9+
# Azure Kubernetes Service (AKS) node auto-repair
10+
11+
AKS continuously checks the health state of worker nodes and automatically repairs them if they become unhealthy. This article describes how Azure Kubernetes Service (AKS) monitors worker nodes and repairs unhealthy ones, so that AKS operators know how the node repair functionality behaves. The Azure platform also [performs maintenance on virtual machines][vm-updates] that experience issues. AKS and Azure work together to minimize service disruptions for your clusters.
12+
13+
> [!Important]
14+
> Node auto-repair functionality isn't currently supported for Windows Server node pools.
15+
16+
## How AKS checks for unhealthy nodes
17+
18+
> [!Note]
19+
> AKS takes repair action on nodes with the user account **aks-remediator**.
20+
21+
AKS uses the following rules to determine whether a node is in an unhealthy state and needs automatic repair:
22+
23+
* The node reports a status of **NotReady** on consecutive checks within a 10-minute timeframe
24+
* The node doesn't report a status within 10 minutes
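
The two rules above can be sketched as a small predicate. This is only an illustration, not AKS's implementation: the `HealthCheck` record and the `needs_repair` function are hypothetical names, and the sketch assumes that "consecutive checks" means at least two back-to-back **NotReady** results inside the 10-minute window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# 10-minute timeframe taken from the rules above.
WINDOW = timedelta(minutes=10)


@dataclass
class HealthCheck:
    """Hypothetical record of one status report from a node."""
    at: datetime
    status: str  # for example "Ready" or "NotReady"


def needs_repair(checks: List[HealthCheck], now: datetime) -> bool:
    """Return True if either rule flags the node as unhealthy."""
    # Keep only the reports that fall inside the last 10 minutes, oldest first.
    recent = sorted(
        (c for c in checks if now - WINDOW <= c.at <= now),
        key=lambda c: c.at,
    )
    # Rule 2: the node didn't report any status within 10 minutes.
    if not recent:
        return True
    # Rule 1 (assumption: two or more back-to-back NotReady reports).
    return any(
        a.status == "NotReady" and b.status == "NotReady"
        for a, b in zip(recent, recent[1:])
    )
```

A single **NotReady** report between **Ready** reports doesn't trigger repair in this sketch, because the rule asks for consecutive failures.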
25+
26+
You can manually check the health state of your nodes with `kubectl`:
27+
28+
```
29+
kubectl get nodes
30+
```
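
If you want to filter that output in a script, a minimal sketch might look like the following. It assumes the default table layout of `kubectl get nodes` (a header row with `NAME` and `STATUS` columns). The `not_ready_nodes` helper and the node names in `SAMPLE_OUTPUT` are made up for illustration.

```python
from typing import List

# Illustrative sample only; the node names, ages, and versions are invented.
SAMPLE_OUTPUT = """\
NAME                                STATUS     ROLES   AGE   VERSION
aks-nodepool1-00000000-vmss000000   Ready      agent   5d    v1.15.7
aks-nodepool1-00000000-vmss000001   NotReady   agent   5d    v1.15.7
"""


def not_ready_nodes(kubectl_output: str) -> List[str]:
    """Return the names of nodes whose STATUS column doesn't include Ready."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    unhealthy = []
    for line in lines[1:]:
        fields = line.split()
        # STATUS can be comma-separated, for example "Ready,SchedulingDisabled".
        conditions = fields[status_col].split(",")
        if "Ready" not in conditions:
            unhealthy.append(fields[name_col])
    return unhealthy
```

For the sample above, `not_ready_nodes(SAMPLE_OUTPUT)` returns `["aks-nodepool1-00000000-vmss000001"]`.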
31+
32+
## How automatic repair works
33+
34+
> [!Note]
35+
> AKS takes repair action on nodes with the user account **aks-remediator**.
36+
37+
This behavior applies to **Virtual Machine Scale Sets**. If AKS determines that a node is unhealthy, it attempts the following remediation steps, in this order:
38+
39+
1. After the container runtime becomes unresponsive for 10 minutes, the failing runtime services are restarted on the node.
40+
2. If the node is not ready within 10 minutes, the node is rebooted.
41+
3. If the node is not ready within 30 minutes, the node is re-imaged.
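
The escalation order above can be modeled as a small lookup table. This is a sketch under stated assumptions, not the AKS implementation: the step names and the `next_step` function are hypothetical, and each wait is assumed to be measured from the previous action (for the first step, from when the container runtime became unresponsive).

```python
from typing import Dict, Optional, Tuple

# previous step -> (minutes to wait, next step). Names are illustrative only.
ESCALATION: Dict[Optional[str], Tuple[int, str]] = {
    None: (10, "restart_runtime"),      # step 1: runtime unresponsive for 10 minutes
    "restart_runtime": (10, "reboot"),  # step 2: still not ready after 10 minutes
    "reboot": (30, "reimage"),          # step 3: still not ready after 30 minutes
}


def next_step(last_step: Optional[str], minutes_waited: float,
              node_ready: bool) -> Optional[str]:
    """Return the next remediation step, or None if nothing should happen yet."""
    if node_ready:
        return None  # the node recovered, so escalation stops
    rule = ESCALATION.get(last_step)
    if rule is None:
        return None  # no step is described after a re-image
    wait_minutes, step = rule
    return step if minutes_waited >= wait_minutes else None
```

For example, `next_step("restart_runtime", 12, node_ready=False)` returns `"reboot"`, while `next_step("reboot", 12, node_ready=False)` returns `None` because the 30-minute wait hasn't elapsed.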
42+
43+
> [!Note]
44+
> If multiple nodes are unhealthy, they are repaired one by one.
45+
46+
## Next steps
47+
48+
Use [Availability Zones][availability-zones] to increase the availability of your AKS cluster workloads.
49+
50+
<!-- LINKS - External -->
51+
52+
<!-- LINKS - Internal -->
53+
[availability-zones]: ./availability-zones.md
54+
[vm-updates]: ../virtual-machines/maintenance-and-updates.md
