articles/operator-nexus/troubleshoot-kubernetes-cluster-node-cordoned.md: 5 additions, 5 deletions
@@ -10,7 +10,7 @@ author: jeremyhouser-ms
---
# Troubleshoot a Kubernetes Cluster Node in Ready,SchedulingDisabled state
-The purpose of this guide is to troubleshoot a Kubernetes Cluster when 1 or more of it's nodes fail to uncordon after a runtime upgrade. This guide is only applicable if that Node remains in the state `Ready,SchedulingDisabled`.
+The purpose of this guide is to troubleshoot a Kubernetes Cluster when one or more of its nodes fail to uncordon after a runtime upgrade. This guide applies only if the node remains in the `Ready,SchedulingDisabled` state.
## Prerequisites
@@ -19,15 +19,15 @@ The purpose of this guide is to troubleshoot a Kubernetes Cluster when 1 or more
## Typical Cause
-After a runtime upgrade, before a Baremetal Machine is shutdown for reimaging, the machine lifecycle controller will cordon and attempt to drain Virtual Machine resources scheduled to that Baremetal Machine. Once the Baremetal Machine has resolved the reimaging process, the expectation is that Virtual Machines tied to that Baremetal Machine will reschedule to the Baremetal Machine, and then be uncordoned by the machine lifecycle controller, reflecting the appropriate state `Ready`.
+After a runtime upgrade, before a Baremetal Machine is shut down for reimaging, the machine lifecycle controller cordons and attempts to drain the Virtual Machine resources scheduled to that Baremetal Machine. Once the Baremetal Machine completes the reimaging process, the expectation is that the Virtual Machines reschedule to the Baremetal Machine and are then uncordoned by the machine lifecycle controller, returning to the expected `Ready` state.
-However, a race condition may occur wherein the machine lifecycle controller will fail to find the virt-launcher pods responsible for deploying Virtual Machines. This is because the virt-launcher pod's image pull job is not yet complete. Only after the image pull job is complete will the pod be schedulable to a Baremetal Machine. When the machine lifecycle controller examines these virt-launcher pods during the uncordon action execution, it cannot find which Baremetal Machine the pod is tied to, and skips the pod and the Virtual Machine it represents.
+However, a race condition can occur in which the machine lifecycle controller fails to find the virt-launcher pods responsible for deploying Virtual Machines, because the virt-launcher pod's image pull job is not yet complete. Only after the image pull job completes is the pod schedulable to a Baremetal Machine. When the machine lifecycle controller examines these virt-launcher pods during the uncordon action, it cannot determine which Baremetal Machine each pod is tied to, so it skips the pod and the Virtual Machine it represents.
-This problem should only appear during uncordon actions initiated by the machine lifecycle controller after runtime upgrades. It should occur infrequently on small clusters but frequently for large clusters, as multiple concurrent image pulls will result in longer scheduling times.
+This problem should appear only during uncordon actions initiated by the machine lifecycle controller after runtime upgrades. It should occur infrequently on small clusters but frequently on large clusters, because many concurrent image pulls tend to lengthen scheduling times.
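To make the race condition concrete, here is a toy Python model of the lookup described above. It is not the machine lifecycle controller's code: `VirtLauncherPod`, `bmm_name`, and `resolve_vms_for_uncordon` are hypothetical names, and a virt-launcher pod whose image pull job has not finished is represented by `bmm_name=None`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VirtLauncherPod:
    # Hypothetical model of a virt-launcher pod, not a real API object.
    vm_name: str
    # Baremetal Machine the pod is scheduled to; None while the image pull job is still running.
    bmm_name: Optional[str]


def resolve_vms_for_uncordon(
    pods: List[VirtLauncherPod], bmm: str
) -> Tuple[List[str], List[str]]:
    """Return (VMs matched to `bmm`, VMs skipped because their pod has no Baremetal Machine yet)."""
    matched: List[str] = []
    skipped: List[str] = []
    for pod in pods:
        if pod.bmm_name is None:
            # The pod cannot be tied to any Baremetal Machine, so its VM is skipped
            # and, in this model, its node is never uncordoned.
            skipped.append(pod.vm_name)
        elif pod.bmm_name == bmm:
            matched.append(pod.vm_name)
    return matched, skipped
```

In this model, a pod that finishes its image pull after the lookup still leaves its VM in the skipped list, which mirrors why the node stays in `Ready,SchedulingDisabled`.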
## Procedure
-After KubernetesCluster Nodes have been discovered in the `Ready,SchedulingDisabled` state, the following remediation may be engaged.
+After KubernetesCluster Nodes are found in the `Ready,SchedulingDisabled` state, apply the following remediation.
1. Use kubectl to list the nodes with the wide output flag (`-o wide`). Observe the node in **Ready,SchedulingDisabled** status.
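The node check in step 1 can be scripted. The following Python sketch is a hypothetical helper, not part of the Operator Nexus tooling; it assumes `kubectl` is on the PATH with its current context pointing at the Kubernetes Cluster, and that STATUS is the second column of `kubectl get nodes -o wide` output.

```python
import subprocess
from typing import List


def list_nodes_wide() -> str:
    """Return the raw text of `kubectl get nodes -o wide` (requires kubectl and cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "wide"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl exits non-zero
    )
    return result.stdout


def find_cordoned_ready_nodes(output: str) -> List[str]:
    """Return node names whose STATUS column is exactly Ready,SchedulingDisabled."""
    names: List[str] = []
    lines = output.strip().splitlines()
    for line in lines[1:]:  # skip the NAME/STATUS header row
        columns = line.split()
        # Exact match: NotReady,SchedulingDisabled nodes are out of scope for this guide.
        if len(columns) >= 2 and columns[1] == "Ready,SchedulingDisabled":
            names.append(columns[0])
    return names
```

Usage: `find_cordoned_ready_nodes(list_nodes_wide())` returns the names of the affected nodes. Parsing is split from the `kubectl` call so the filter can be checked against saved output without cluster access.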