#Customer intent: As an Azure Kubernetes user, I want to prevent an Azure Kubernetes Service (AKS) cluster node from regressing to a Not Ready status so that I can continue to use the cluster node successfully.
ms.custom: sap:Node/node pool availability and performance, innovation-engine
---
# Troubleshoot a change in a healthy node to Not Ready status
This article discusses a scenario in which the status of an Azure Kubernetes Service (AKS) cluster node changes to **Not Ready** after the node is in a healthy state for some time. This article outlines the particular cause and provides a possible solution.
Before you begin, set `$RESOURCE_GROUP` and `$AKS_CLUSTER` to your resource group and cluster name, and then get the cluster credentials so that `kubectl` can connect to the cluster:

```bash
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER --overwrite-existing
```
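As an optional quick check, you can list the nodes to confirm that `kubectl` can reach the cluster and to see each node's status at a glance:

```bash
# The STATUS column shows Ready or NotReady for each node
kubectl get nodes -o wide
```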
## Symptoms
The status of a cluster node that has a healthy state (all services running) unexpectedly changes to **Not Ready**. To view the status of a node, run the following [kubectl describe](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#describe) command:
```bash
kubectl describe nodes
```
In the command output, you might find that the [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) stopped posting its **Ready** status.
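If you want to check just the **Ready** condition for every node, one option is a JSONPath query (a minimal sketch; adjust the fields as needed):

```bash
# Print each node's name together with the status and message of its Ready condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].message}{"\n"}{end}'
```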
Examine the output of the `kubectl describe nodes` command to find the [Conditions](https://kubernetes.io/docs/reference/node/node-status/#condition) field and the [Capacity and Allocatable](https://kubernetes.io/docs/reference/node/node-status/#capacity) blocks. Does the content of these fields appear as expected? (For example, in the **Conditions** field, does the `message` property contain the "kubelet is posting ready status" string?) In this case, if you have direct Secure Shell (SSH) access to the node, check the recent events to understand the error. Look within the */var/log/syslog* file (the */var/log/messages* file isn't available on all distributions). Or, generate the kubelet and containerd daemon log files by running the following shell commands:
```bash
# First, identify the NotReady node so that you know which node to connect to
export NODE_NAME=$(kubectl get nodes --no-headers | grep NotReady | awk '{print $1}' | head -1)

# On the node (over SSH), check the syslog file
cat /var/log/syslog

# On the node, generate the kubelet and containerd daemon log files
journalctl -u kubelet > kubelet.log
journalctl -u containerd > containerd.log
```
After you run these commands, examine the syslog and daemon log files for more information about the error.
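For example, you can filter the generated log files for recent error and failure messages (a simple starting point; the strings that matter depend on your specific failure):

```bash
# Surface recent error and failure entries from the generated logs
grep -iE "error|fail" kubelet.log | tail -n 20
grep -iE "error|fail" containerd.log | tail -n 20
```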
## Solution
Instead, identify the offending application, and then take the appropriate action.
To check the thread count for each control group (cgroup) and print the top eight cgroups, run the following shell command:
```bash
# Show current thread count for each cgroup (top 8)
ps -e -w -o "thcount,cgname" --no-headers | awk '{a[$2] += $1} END{for (i in a) print a[i], i}' | sort --numeric-sort --reverse | head --lines=8
```
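If one of the top cgroups belongs to a pod, you can map it back to the pod by matching the pod UID that's embedded in the `cgname` value (a minimal sketch; `POD_UID` is a placeholder, and in systemd-managed cgroups the hyphens in the UID appear as underscores):

```bash
# Hypothetical placeholder: replace with the pod UID extracted from the cgname column
POD_UID="00000000-0000-0000-0000-000000000000"

# List every pod with its UID, and match the one that you're looking for
kubectl get pods --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,UID:.metadata.uid \
  | grep -i "$POD_UID"
```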
For more information, see [Process ID limits and reservations](https://kubernetes.io/docs/concepts/policy/pid-limiting/).
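If you suspect PID exhaustion, you can also inspect the PID limits directly on the node (a minimal sketch that assumes SSH access; the exact cgroup path depends on the cgroup version and driver that the node uses):

```bash
# Kernel-wide PID ceiling on the node
cat /proc/sys/kernel/pid_max

# PID limit and current usage for the kubepods cgroup (path is an assumption; adjust for your node)
CGROUP_PATH="/sys/fs/cgroup/kubepods.slice"
cat "$CGROUP_PATH/pids.max" "$CGROUP_PATH/pids.current"
```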
- To view the health and performance of the AKS API server and kubelets, see [Managed AKS components](/azure/aks/monitor-aks#level-2---managed-aks-components).
- For general troubleshooting steps, see [Basic troubleshooting of node not ready failures](node-not-ready-basic-troubleshooting.md).