Commit 7513ba5

Apply suggestions from review, fix examples
1 parent d2091b4 commit 7513ba5

File tree: 1 file changed (+14, -4 lines)


articles/operator-nexus/troubleshoot-kubernetes-cluster-node-cordoned.md

Lines changed: 14 additions & 4 deletions
@@ -19,15 +19,15 @@ The purpose of this guide is to troubleshoot a Kubernetes Cluster when 1 or more
## Typical Cause

-After a runtime upgrade, before a Baremetal Machine is shutdown for reimaging, the machine lifecycle controller will cordon and attempt to drain Virtual Machine resources scheduled to that Baremetal Machine. Once the Baremetal Machine resolves the reimaging process, the expectation is that Virtual Machines reschedule to the Baremetal Machine, and then be uncordoned by the machine lifecycle controller, reflecting the appropriate state `Ready`.
+After a runtime upgrade, before a Baremetal Machine is shut down for reimaging, the machine lifecycle controller cordons and drains Virtual Machine resources scheduled to that Baremetal Machine. Once the Baremetal Machine completes the reimaging process, the expectation is that Virtual Machines reschedule to the Baremetal Machine and are then uncordoned by the machine lifecycle controller, reflecting the appropriate `Ready` state.
However, a race condition may occur wherein the machine lifecycle controller fails to find the virt-launcher pods responsible for deploying Virtual Machines. This happens because the virt-launcher pod's image pull job is not yet complete; only after the image pull job completes is the pod schedulable to a Baremetal Machine. When the machine lifecycle controller examines these virt-launcher pods during the uncordon action, it cannot determine which Baremetal Machine the pod is tied to, so it skips the pod and the Virtual Machine it represents.

-This problem should only appear during uncordon actions initiated by the machine lifecycle controller after runtime upgrades. It should occur infrequently on small clusters but frequently for large clusters, as multiple concurrent image pulls tends to result in longer scheduling times.
+This problem only appears during uncordon actions initiated by the machine lifecycle controller after runtime upgrades. It occurs infrequently on small clusters but frequently on large clusters, as multiple concurrent image pulls tend to result in longer scheduling times.
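To see whether this race window is open, it can help to look for virt-launcher pods that are still `Pending`. The sketch below filters such pods from captured pod-listing output so the logic can be checked offline; the namespace, pod names, and column layout are illustrative assumptions, not taken from this article.

~~~bash
# Hypothetical output resembling `kubectl get pods -A --no-headers` for
# virt-launcher pods; real names and namespaces will differ.
sample='vm-ns   virt-launcher-vm1-abcde   0/1   Pending   0   2m
vm-ns   virt-launcher-vm2-fghij   1/1   Running   0   5m'

# A Pending virt-launcher pod has not yet been scheduled to a Baremetal
# Machine, which is the window in which the uncordon action can skip its
# Virtual Machine. Field 4 is the STATUS column; field 2 is the pod name.
printf '%s\n' "$sample" | awk '$4 == "Pending" {print $2}'
~~~

In a live cluster, the same awk filter can be applied to real `kubectl get pods` output instead of the captured sample.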
## Procedure
-After KubernetesCluster Nodes are discovered in the `Ready,SchedulingDisabled` state, the following remediation may be engaged.
+After Kubernetes Cluster Nodes are discovered in the `Ready,SchedulingDisabled` state, the following remediation may be applied.

1. Use kubectl to list the nodes using the wide flag. Observe the node in **Ready,SchedulingDisabled** status.
~~~bash
@@ -40,10 +40,20 @@ After KubernetesCluster Nodes are discovered in the `Ready,SchedulingDisabled` s
1. Issue the kubectl command to uncordon the Node in the undesired state.

~~~bash
-$ kubectl uncordon example-naks-control-plane-tgmw8
+$ kubectl uncordon example-naks-agentpool1-md-s8vp4-xp98x
node/example-naks-agentpool1-md-s8vp4-xp98x uncordoned
~~~

+Alternatively, as this is more common in larger deployments, it may be preferable to perform this action in bulk. In that case, issue the following loop to find and uncordon all cordoned Nodes.
+
+~~~bash
+cordoned_nodes=$(kubectl get nodes -o wide --no-headers | awk '/SchedulingDisabled/ {print $1}')
+for node in $cordoned_nodes; do
+  kubectl uncordon "$node"
+done
+~~~
+
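The node-selection step of that loop can be sanity-checked offline against captured node-listing output. This is a minimal sketch; the sample lines below are hypothetical, reusing the example node names from this article.

~~~bash
# Hypothetical output resembling `kubectl get nodes -o wide --no-headers`;
# node names are illustrative.
sample='example-naks-control-plane-tgmw8          Ready                      control-plane   10d   v1.27.1
example-naks-agentpool1-md-s8vp4-xp98x    Ready,SchedulingDisabled   <none>          10d   v1.27.1'

# Same awk filter as the loop: keep only cordoned nodes (STATUS contains
# SchedulingDisabled) and print their names (field 1).
printf '%s\n' "$sample" | awk '/SchedulingDisabled/ {print $1}'
~~~

Only the node in `Ready,SchedulingDisabled` status is printed, so the subsequent `kubectl uncordon` loop never touches healthy nodes.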
1. Use kubectl to list the nodes using the wide flag. Observe the node in **Ready** status.

~~~bash
$ kubectl get nodes -o wide
