Commit 3363891

Merge pull request #9234 from naman-msft/docs-editor/node-not-ready-after-being-hea-1751316149
AB#6462: Update node-not-ready-after-being-healthy.md
2 parents a8f75f3 + 79ef6bb commit 3363891

1 file changed

support/azure/azure-kubernetes/availability-performance/node-not-ready-after-being-healthy.md

Lines changed: 43 additions & 11 deletions
@@ -5,8 +5,9 @@ ms.date: 08/27/2024
 ms.reviewer: rissing, chiragpa, momajed, v-leedennis
 ms.service: azure-kubernetes-service
 #Customer intent: As an Azure Kubernetes user, I want to prevent an Azure Kubernetes Service (AKS) cluster node from regressing to a Not Ready status so that I can continue to use the cluster node successfully.
-ms.custom: sap:Node/node pool availability and performance
+ms.custom: sap:Node/node pool availability and performance, innovation-engine
 ---
+
 # Troubleshoot a change in a healthy node to Not Ready status

 This article discusses a scenario in which the status of an Azure Kubernetes Service (AKS) cluster node changes to **Not Ready** after the node is in a healthy state for some time. This article outlines the particular cause and provides a possible solution.
@@ -24,6 +25,17 @@ This article discusses a scenario in which the status of an Azure Kubernetes Ser
 - [sort](https://man7.org/linux/man-pages/man1/sort.1.html)
 - [watch](https://man7.org/linux/man-pages/man1/watch.1.html)

+## Connect to the AKS cluster
+
+Before you can troubleshoot the issue, you must connect to the AKS cluster. To do so, run the following commands:
+
+```bash
+export RANDOM_SUFFIX=$(head -c 3 /dev/urandom | xxd -p)
+export RESOURCE_GROUP="my-resource-group$RANDOM_SUFFIX"
+export AKS_CLUSTER="my-aks-cluster$RANDOM_SUFFIX"
+az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER --overwrite-existing
+```
+
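After the credentials are merged, you can optionally confirm that `kubectl` now points at the intended cluster and get a first look at node health. A minimal sketch, assuming `kubectl` is installed locally and that the context created by `az aks get-credentials` is the current one:

```bash
# Confirm that the current context points at the expected AKS cluster
kubectl config current-context

# List every node with its status; a NotReady value in the STATUS column identifies the affected node
kubectl get nodes -o wide
```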
 ## Symptoms

 The status of a cluster node that has a healthy state (all services running) unexpectedly changes to **Not Ready**. To view the status of a node, run the following [kubectl describe](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#describe) command:
@@ -36,18 +48,37 @@ kubectl describe nodes

 The [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) stopped posting its **Ready** status.

-Examine the output of the `kubectl describe nodes` command to find the [Conditions](https://kubernetes.io/docs/reference/node/node-status/#condition) field and the [Capacity and Allocatable](https://kubernetes.io/docs/reference/node/node-status/#capacity) blocks. Do the content of these fields appear as expected? (For example, in the **Conditions** field, does the `message` property contain the "kubelet is posting ready status" string?) In this case, if you have direct Secure Shell (SSH) access to the node, check the recent events to understand the error. Look within the */var/log/messages* file. Or, generate the kubelet and container daemon log files by running the following shell commands:
+Examine the output of the `kubectl describe nodes` command to find the [Conditions](https://kubernetes.io/docs/reference/node/node-status/#condition) field and the [Capacity and Allocatable](https://kubernetes.io/docs/reference/node/node-status/#capacity) blocks. Does the content of these fields appear as expected? (For example, in the **Conditions** field, does the `message` property contain the "kubelet is posting ready status" string?) In this case, if you have direct Secure Shell (SSH) access to the node, check the recent events to understand the error. Look within the */var/log/syslog* file instead of the */var/log/messages* file, which isn't available on all distributions. Or, generate the kubelet and container daemon log files by running the following shell commands:

 ```bash
-# To check messages file,
-cat /var/log/messages
-
-# To check kubelet and containerd daemon logs,
-journalctl -u kubelet > kubelet.log
-journalctl -u containerd > containerd.log
+# First, identify the NotReady node
+export NODE_NAME=$(kubectl get nodes --no-headers | grep NotReady | awk '{print $1}' | head -1)
+
+if [ -z "$NODE_NAME" ]; then
+  echo "No NotReady nodes found"
+  kubectl get nodes
+else
+  echo "Found NotReady node: $NODE_NAME"
+
+  # Use kubectl debug to access the node
+  kubectl debug node/$NODE_NAME -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0 -- chroot /host bash -c "
+    echo '=== Checking syslog ==='
+    if [ -f /var/log/syslog ]; then
+      tail -100 /var/log/syslog
+    else
+      echo 'syslog not found'
+    fi
+
+    echo '=== Checking kubelet logs ==='
+    journalctl -u kubelet --no-pager | tail -100
+
+    echo '=== Checking containerd logs ==='
+    journalctl -u containerd --no-pager | tail -100
+  "
+fi
 ```

-After you run these commands, examine the messages and daemon log files for more information about the error.
+After you run these commands, examine the syslog and daemon log files for more information about the error.
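If you can't open a debugging session on the node, the same **Conditions** details and recent node events are also available through the API server. A minimal sketch, assuming `NODE_NAME` is set as in the snippet above:

```bash
# Print the message that the kubelet last reported for the Ready condition
kubectl get node "$NODE_NAME" -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'; echo

# List recent events that reference the node, sorted by time
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name="$NODE_NAME" --sort-by=.lastTimestamp
```

Also note that `kubectl debug node/...` leaves its helper pod on the cluster after the session ends; delete that pod with `kubectl delete pod` once you've collected the logs.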

 ## Solution

@@ -124,7 +155,8 @@ Instead, identify the offending application, and then take the appropriate actio
 To monitor the thread count for each control group (cgroup) and print the top eight cgroups, run the following shell command:

 ```bash
-watch 'ps -e -w -o "thcount,cgname" --no-headers | awk "{a[\$2] += \$1} END{for (i in a) print a[i], i}" | sort --numeric-sort --reverse | head --lines=8'
+# Show current thread count for each cgroup (top 8)
+ps -e -w -o "thcount,cgname" --no-headers | awk '{a[$2] += $1} END{for (i in a) print a[i], i}' | sort --numeric-sort --reverse | head --lines=8
 ```

 For more information, see [Process ID limits and reservations](https://kubernetes.io/docs/concepts/policy/pid-limiting/).
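To put the per-cgroup numbers in context, it can help to compare the node's total thread count against the kernel-wide ceilings. A minimal sketch, run on the node itself (for example, from the `chroot /host` session shown earlier):

```bash
# Total number of threads currently running on the node
ps -eLf --no-headers | wc -l

# Kernel-wide limits on threads and process IDs
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max
```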
@@ -146,4 +178,4 @@ You can make sure that the AKS API server has high availability by using a highe

 - To view the health and performance of the AKS API server and kubelets, see [Managed AKS components](/azure/aks/monitor-aks#level-2---managed-aks-components).

-- For general troubleshooting steps, see [Basic troubleshooting of node not ready failures](node-not-ready-basic-troubleshooting.md).
+- For general troubleshooting steps, see [Basic troubleshooting of node not ready failures](node-not-ready-basic-troubleshooting.md).
