
Commit 503521a

Update for reviewer comments
1 parent 7e3f00f commit 503521a

2 files changed: +6 -4 lines changed


articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -428,6 +428,8 @@
 items:
 - name: Due To Bare Metal Machine Power Failure
 href: troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md
+- name: Troubleshoot a Kubernetes Cluster Node in NotReady,Scheduling Disabled after Runtime Upgrade
+href: troubleshoot-kubernetes-cluster-node-cordoned.md
 - name: Storage Appliance
 expanded: false
 items:

articles/operator-nexus/troubleshoot-kubernetes-cluster-node-cordoned.md

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
 ---
 title: Troubleshoot a Kubernetes Cluster Node in NotReady,Scheduling Disabled after Runtime Upgrade
-description: Learn what to do when you a Kubernetes Cluster Node is in the state NotReady,Scheduling Disabled after a runtime upgrade.
+description: Learn what to do when your Kubernetes Cluster Node is in the state NotReady,Scheduling Disabled after a runtime upgrade.
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
@@ -19,9 +19,9 @@ The purpose of this guide is to troubleshoot a Kubernetes Cluster when 1 or more
 
 ## Typical Cause
 
-After a runtime upgrade, before a Baremetal Machine is shutdown for reimaging, the machine lifecycle controller will cordon and drain Virtual Machine resources scheduled to that Baremetal Machine. Once the Baremetal Machine resolves the reimaging process, the expectation is that Virtual Machines reschedule to the Baremetal Machine, and then be uncordoned by the machine lifecycle controller, reflecting the appropriate state `Ready`.
+After a runtime upgrade, before a Baremetal Machine is shut down for reimaging, the machine lifecycle controller will cordon and drain Virtual Machine resources scheduled to that Baremetal Machine. Once the Baremetal Machine resolves the reimaging process, the expectation is that Virtual Machine resources reschedule to the Baremetal Machine, and then be uncordoned by the machine lifecycle controller, reflecting the appropriate state `Ready`.
 
-However, a race condition may occur wherein the machine lifecycle controller fails to find the virt-launcher pods responsible for deploying Virtual Machines. This is because the virt-launcher pod's image pull job is not yet complete. Only after the image pull job is complete will the pod be schedulable to a Baremetal Machine. When the machine lifecycle controller examines these virt-launcher pods during the uncordon action execution, it cannot find which Baremetal Machine the pod is tied to, and skips the pod and the Virtual Machine it represents.
+However, a race condition may occur wherein the machine lifecycle controller fails to find the virt-launcher pods responsible for deploying Virtual Machines. This is because the virt-launcher pod's image pull job isn't yet complete. Only after the image pull job is complete will the pod be schedulable to a Baremetal Machine. When the machine lifecycle controller examines these virt-launcher pods during the uncordon action execution, it can't find which Baremetal Machine the pod is tied to, and skips the pod and the Virtual Machine it represents.
 
 ## Procedure
 
@@ -42,7 +42,7 @@ After Kubernetes Cluster Nodes are discovered in the `Ready,SchedulingDisabled`
 node/example-naks-agentpool1-md-s8vp4-xp98x uncordoned
 ~~~
 
-Alternatively, as this is more common in larger scale deployments, it may be desirable to perform this action in bulk. In this case, issue the uncordon command as part of a loop to find and uncordon all affected Nodes.
+Alternatively, as this issue is more common in larger scale deployments, it may be desirable to perform this action in bulk. In this case, issue the uncordon command as part of a loop to find and uncordon all affected Nodes.
 
 ~~~bash
 cordoned_nodes=$(kubectl get nodes -o wide --no-headers | awk '/SchedulingDisabled/ {print $1}')
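
The diff excerpt ends at the first line of the bulk snippet, so the rest of the article's loop isn't visible here. As a rough illustration of the "find and uncordon all affected Nodes" idea, here is a minimal sketch. It is not the article's actual script: the function names `list_cordoned_nodes` and `uncordon_all_cordoned_nodes` are hypothetical, and it matches on the STATUS column (the second field) rather than anywhere in the line as the excerpt's `awk` pattern does.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical and not taken from the article.

# Read `kubectl get nodes --no-headers` output on stdin and print the name of
# every Node whose STATUS column (field 2) contains SchedulingDisabled.
list_cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ {print $1}'
}

# Uncordon each Node currently reported as SchedulingDisabled.
uncordon_all_cordoned_nodes() {
  local node
  kubectl get nodes --no-headers | list_cordoned_nodes | while read -r node; do
    kubectl uncordon "$node"
  done
}
```

Calling `uncordon_all_cordoned_nodes` would act on whatever cluster the current kubeconfig context points at. Note that it uncordons every cordoned Node, including any that were cordoned deliberately for other maintenance, so reviewing the output of `kubectl get nodes --no-headers | list_cordoned_nodes` first is probably wise.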
