
Commit e33634c

Merge pull request #301882 from jeremyhouser-ms/main
Add public Learn document for repairing cordoned nodes within KubernetesClusters within Nexus AKS
2 parents f4fb89b + 6e3baa0 commit e33634c

File tree

2 files changed: +66 −0 lines changed


articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -428,6 +428,8 @@
   items:
   - name: Due To Bare Metal Machine Power Failure
     href: troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md
+  - name: Troubleshoot a Kubernetes Cluster Node in NotReady,Scheduling Disabled after Runtime Upgrade
+    href: troubleshoot-kubernetes-cluster-node-cordoned.md
   - name: Storage Appliance
     expanded: false
     items:
articles/operator-nexus/troubleshoot-kubernetes-cluster-node-cordoned.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
---
title: Troubleshoot a Kubernetes Cluster Node in NotReady,Scheduling Disabled after Runtime Upgrade
description: Learn what to do when your Kubernetes Cluster Node is in the state NotReady,Scheduling Disabled after a runtime upgrade.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 06/25/2025
ms.author: jeremyhouser
author: jeremyhouser-ms
---

# Troubleshoot a Kubernetes Cluster Node in NotReady,Scheduling Disabled state

This guide helps you troubleshoot a Kubernetes Cluster when one or more of its Nodes fail to uncordon after a runtime upgrade. It applies only if the Node remains in the state `Ready,SchedulingDisabled`.

## Prerequisites

- Ability to run kubectl commands against the Kubernetes Cluster
- Familiarity with the capabilities referenced in this article; see [how to connect to Kubernetes Clusters](howto-kubernetes-cluster-connect.md)

## Typical Cause

During a Nexus Cluster runtime upgrade, the system cordons and drains the Virtual Machine resources scheduled to any Bare Metal Machine hosting tenant workloads, then shuts down the Bare Metal Machine to complete the reimaging process. Once the Bare Metal Machine completes the runtime upgrade and reboots, the system is expected to reschedule the Virtual Machines to that Bare Metal Machine and uncordon them, with the Kubernetes Cluster Node each Virtual Machine backs returning to the `Ready` state.

However, a race condition can occur in which the system fails to find the Virtual Machines that should be scheduled to that Bare Metal Machine. Each Virtual Machine is deployed using a virt-launcher pod, and that pod isn't schedulable to a Bare Metal Machine until its image pull job completes. If the system examines a virt-launcher pod during the uncordon action while the image pull is still in progress, it can't determine which Bare Metal Machine the pod is scheduled to, so it skips uncordoning the Virtual Machine that the pod represents.
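
To confirm this race on a live cluster before remediating, you could inspect the virt-launcher pods that back the affected Virtual Machines. The sketch below is a hypothetical diagnostic, not part of the official procedure: the `kubevirt.io=virt-launcher` label is the standard KubeVirt pod label, assumed here, and the filter expects the default column layout of `kubectl get pods -A --no-headers`.

~~~bash
# Hypothetical helper: from a `kubectl get pods -A --no-headers` listing,
# print the namespace, name, and status of pods that are not yet Running
# (for example, pods still waiting on an image pull).
# Expected columns: NAMESPACE NAME READY STATUS RESTARTS AGE
not_running_virt_launchers() {
  awk '$4 != "Running" {print $1, $2, $4}'
}

# Against a live cluster (label assumed from KubeVirt):
#   kubectl get pods -A -l kubevirt.io=virt-launcher --no-headers \
#     | not_running_virt_launchers
~~~

A virt-launcher pod that isn't `Running` at uncordon time matches the condition described above, so the Node backed by its Virtual Machine is a likely candidate for manual uncordoning.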

## Procedure

If Kubernetes Cluster Nodes are found in the `Ready,SchedulingDisabled` state, use the following remediation.

1. Use kubectl to list the nodes with the wide output flag. Observe the nodes in **Ready,SchedulingDisabled** status.

    ~~~bash
    $ kubectl get nodes -o wide
    NAME                                     STATUS                     ROLES           AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                    KERNEL-VERSION    CONTAINER-RUNTIME
    example-naks-control-plane-tgmw8         Ready,SchedulingDisabled   control-plane   2d6h   v1.30.12   10.4.32.10    <none>        Microsoft Azure Linux 3.0   6.6.85.1-2.azl3   containerd://2.0.0
    example-naks-agentpool1-md-s8vp4-xp98x   Ready,SchedulingDisabled   <none>          2d6h   v1.30.12   10.4.32.11    <none>        Microsoft Azure Linux 3.0   6.6.85.1-2.azl3   containerd://2.0.0
    ~~~

1. Issue the kubectl command to uncordon the Node in the undesired state.

    ~~~bash
    $ kubectl uncordon example-naks-agentpool1-md-s8vp4-xp98x
    node/example-naks-agentpool1-md-s8vp4-xp98x uncordoned
    ~~~

    Alternatively, because this issue is more common in larger deployments, it may be preferable to remediate in bulk. In that case, issue the uncordon command in a loop that finds and uncordons all affected Nodes.

    ~~~bash
    cordoned_nodes=$(kubectl get nodes -o wide --no-headers | awk '/SchedulingDisabled/ {print $1}')
    for node in $cordoned_nodes; do
      kubectl uncordon "$node"
    done
    ~~~

1. Use kubectl to list the nodes with the wide output flag again. Observe the nodes now in **Ready** status.

    ~~~bash
    $ kubectl get nodes -o wide
    NAME                                     STATUS   ROLES           AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                    KERNEL-VERSION    CONTAINER-RUNTIME
    example-naks-control-plane-tgmw8         Ready    control-plane   2d6h   v1.30.12   10.4.32.10    <none>        Microsoft Azure Linux 3.0   6.6.85.1-2.azl3   containerd://2.0.0
    example-naks-agentpool1-md-s8vp4-xp98x   Ready    <none>          2d6h   v1.30.12   10.4.32.11    <none>        Microsoft Azure Linux 3.0   6.6.85.1-2.azl3   containerd://2.0.0
    ~~~

If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
