
Commit e0cf4cd

Author: Nyeem Akhtar (committed)
Adding TSG for remediating NAKS stuck workloads due to power failure
In the event of a power failure, Undercloud nodes are powered off, causing the NAKS nodes running on those powered-off UC nodes to become not-ready. As a result, stateful workloads remain stuck on the not-ready nodes. This TSG helps move those stuck workloads to healthy nodes.
1 parent 4122b56 commit e0cf4cd

7 files changed: +78 -57 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 5 additions & 2 deletions
@@ -371,8 +371,11 @@
   href: troubleshoot-neighbor-group-creation-error.md
 - name: Troubleshoot NAKS Cluster Node Packet Loss
   href: troubleshoot-packet-loss.md
-- name: Troubleshooting Nexus Kubernetes Cluster stuck (unable to reschedule) workloads due to power failure
-  href: troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md
+- name: Troubleshoot Nexus Kubernetes Cluster stuck (unable to reschedule) workloads
+  expanded: false
+  items:
+  - name: Due To Bare Metal Machine Power Failure
+    href: troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md
 - name: FAQ
   href: azure-operator-nexus-faq.md
 - name: Reference
4 binary media files changed (33.5 KB, 35.9 KB, -68.1 KB, 209 KB); binary files not shown.

articles/operator-nexus/troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md

Lines changed: 73 additions & 55 deletions
@@ -4,92 +4,110 @@ description: Troubleshooting Nexus Kubernetes Cluster workloads stuck (unable to
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
-ms.date: 01/30/2025
+ms.date: 02/18/2025
 ms.author: nyeemakhtar
 author: mdnyeemakhtar
 ---
-# Troubleshooting stuck (unable to reschedule) workloads in a Nexus Kubernetes Cluster
+# Troubleshooting stuck (unable to reschedule) workloads in a Nexus Kubernetes Cluster due to power failure

-This guide provides detailed steps for troubleshooting issues related to stuck workloads on Nexus Kubernetes Cluster not-ready nodes. If you're experiencing these issues due to bare-metal node power failure, this guide helps you identify and resolve the problem.
+What is a stuck workload?
+
+A stuck workload is a pod that is unable to reschedule to another node in a Kubernetes Cluster because the node it runs on is not-ready. This issue can happen for many reasons, including node power failure.
+
+Kubernetes, by design, doesn't move workloads that are stateful in nature if the node they're running on becomes not-ready (for example, due to power failure). For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown).
+
+This guide details troubleshooting steps for cases where workloads on a Nexus Kubernetes Cluster become stuck due to bare-metal machine power failures. It also explains how to restart a stuck pod so that it can be rescheduled on a different node.
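For illustration, here is roughly what the symptom can look like. This sketch is editorial (not part of the commit); the node and pod names are hypothetical:

```bash
# Hypothetical illustration of a stateful pod stuck on a node that lost power.
# Node and pod names are made up for this sketch.
kubectl get nodes
# NAME                  STATUS     ROLES    AGE   VERSION
# example-agentpool-1   NotReady   <none>   30d   v1.27.3

kubectl get pods -o wide
# NAME           READY   STATUS        RESTARTS   AGE   NODE
# example-db-0   1/1     Terminating   0          10d   example-agentpool-1
```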

 ## Prerequisites

 * Permissions to view Azure resources in the subscription where the Nexus Kubernetes Cluster is deployed
-* Access to Azure monitoring
 * Necessary permissions to make changes using `kubectl` commands in Nexus Kubernetes Cluster (for example, deleting nodes)

-## Symptoms
+## Diagnosing Stuck Workloads

-* Nexus Kubernetes Cluster nodes are not-ready
-* Pods that aren't daemon-set pods stuck on the not-ready nodes
+If you observe that your applications aren't responding as expected, your workloads might be stuck on not-ready nodes. To diagnose stuck workloads on Nexus Kubernetes Cluster not-ready nodes, look for the following symptoms:

-## Cause
+* To check whether Nexus Kubernetes Cluster nodes are not-ready, run the following `kubectl` command in the Nexus Kubernetes Cluster:

-Kubernetes, by design, doesn't move workloads that are stateful in nature if the node they're running on becomes not-ready. For more information, see [Kubernetes documentation](https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown).
+  ```bash
+  kubectl get nodes | awk '$2 != "Ready" {print $1, $2}' | column -t
+  ```
+  ![kubectl get nodes output](media/naks-nodes-not-ready.png)

-The Nexus Kubernetes Cluster not-ready node could be caused by a power failure on the bare-metal machine in the cluster. You can verify by checking the "Idrac Power On" metric in Azure monitoring. If you have alerts set up for this metric, you might have received an alert indicating that the bare-metal machines are powered off.
+  If the command returns no results, all the nodes are ready. Nodes sometimes take a few minutes to become not-ready after a power failure, so you might need to run the command again after a few minutes. If the nodes continue to show as ready after a reasonable time (5-10 minutes) and your applications are still not responding, the issue might be different; contact support for further assistance.

-## Warning
+* To list pods stuck on the not-ready nodes, run the following `kubectl` command in the Nexus Kubernetes Cluster:

-### Nexus Kubernetes Cluster nodes Rack Spread
+  ```bash
+  kubectl get nodes -o json | jq -r '.items[]
+  | select(.status.conditions[] | select(.type=="Ready") | .status != "True")
+  | .metadata.name' | xargs -I {} sh -c 'kubectl get pods --all-namespaces --field-selector spec.nodeName={} --no-headers | awk -v node={} "{print \$1, \$2, \$4, node}"' | sort -k1,1 | column -t -N "NAMESPACE,NAME,STATUS,NODE"
+  ```
+  This command returns pods (both stateful and daemon-set) that are stuck on the not-ready nodes. Pod status might show `Pending`, `Terminating`, or `Running`. (A sketch for filtering out daemon-set pods follows this list.)

-This guide requires deleting nodes from the Nexus Kubernetes Cluster. This action can cause Nexus Kubernetes Cluster nodes Rack Spread to be impacted.
+  ![stuck workload on not-ready nodes](media/naks-workload-stuck-on-not-ready-nodes.png)
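Because the command above also returns daemon-set pods, which are bound to their node and aren't rescheduled when the node is deleted, a variant that keeps only non-daemon-set pods can be useful. This sketch is editorial (not part of the commit) and assumes `jq` is available, as in the command above:

```bash
# Sketch: list only pods on not-ready nodes that are NOT owned by a DaemonSet,
# since those are the ones that can move once the node is deleted.
kubectl get nodes -o json | jq -r '.items[]
  | select(.status.conditions[] | select(.type=="Ready") | .status != "True")
  | .metadata.name' |
while read -r node; do
  kubectl get pods --all-namespaces --field-selector spec.nodeName="$node" -o json |
    jq -r --arg node "$node" '.items[]
      | select(([.metadata.ownerReferences[]?.kind] | index("DaemonSet")) | not)
      | [.metadata.namespace, .metadata.name, .status.phase, $node] | @tsv'
done | column -t
```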

-### Host-Path storage
+## Diagnosing Power Failure

-If the pods are configured to use host-path storage on the node, deleting the node deletes the data too.
+Once you have confirmed that the workloads are stuck on the not-ready nodes, the next steps help you diagnose whether the Nexus Kubernetes Cluster nodes are not-ready because of a power failure on one or more bare-metal machines.

-## Solution
+To diagnose a power failure on the bare-metal machines, look for the following symptoms:

-To resolve the issue, follow these steps:
+* To list the bare-metal machines where not-ready Nexus Kubernetes Cluster nodes are running, run the following `kubectl` command in the Nexus Kubernetes Cluster:

-1. Check the "Idrac Power On" metric in Azure monitoring to verify that the bare-metal machine is powered off. Following is an example screenshot of the metric in Azure monitoring:
+  ```bash
+  kubectl get nodes -o json | jq -r '.items[]
+  | select(.status.conditions[] | select(.type=="Ready") | .status != "True")
+  | [.metadata.name, .metadata.labels["topology.kubernetes.io/baremetalmachine"]]
+  | @tsv' | column -t -s $'\t' -N "Node Name,BMM Name"
+  ```
+  This command returns the list of nodes and the bare-metal machine name where each node is running.

-![Idrac Power On metric in Azure monitoring](media/idrac-power-on-metric.png)
+  ![bmm names where not-ready nodes reside](media/bmm-where-not-ready-naks-reside.png)

-2. If the "Idrac Power On" metric indicates that the bare-metal machine is powered on (showing a value of 1), **don't proceed with the following steps**.
+* To list bare-metal machines in the cluster that are powered off, run the following command at the cluster managed resource group level:

-3. If the "Idrac Power On" metric indicates that the bare-metal machine is powered off (showing a value of 0), take the following actions on all impacted Nexus Kubernetes Clusters within that environment:
+  ```bash
+  az networkcloud baremetalmachine list \
+    --subscription <subscription-id> \
+    --resource-group <managed-resource-group> \
+    --query "[? powerState == 'Off' && detailedStatus != 'Available'].{BMMName: name, ResourceGroup: resourceGroup, PowerState: powerState, DetailedStatus: detailedStatus}" \
+    -o table
+  ```
+  Replace `<subscription-id>` with the subscription ID and `<managed-resource-group>` with the managed resource group of the cluster where the bare-metal resources reside.

-* Note down the name of the bare-metal machine that is powered off and managed-resource-group name. In the example screenshot in step 1, the machine name is `a12345bcde1co1` and managed-resource-group name is `poc02-19cf0b39e1e5-HostedResources-540F8D4E`. If this bare-metal machine is a compute node, then proceed with the following steps.
-* To find out impacted Nexus Kubernetes Clusters due to the powered-off bare-metal machine, run the following `az networkcloud` command:
+  for example:
+  ```bash
+  az networkcloud baremetalmachine list \
+    --subscription 00000000-0000-0000-0000-000000000000 \
+    --resource-group poc02-19cf0b39e1e5-HostedResources-540F8D4E \
+    --query "[? powerState == 'Off' && detailedStatus != 'Available'].{BMMName: name, ResourceGroup: resourceGroup, PowerState: powerState, DetailedStatus: detailedStatus}" \
+    -o table
+  ```

-```bash
-az networkcloud kubernetescluster list \
-    --subscription <subscription-id> \
-    --query "[?extendedLocation.name && contains(extendedLocation.name, '<managed-resource-group>') && nodes[? bareMetalMachineId == \`null\` || contains(bareMetalMachineId, '<powered-off-baremetal-machine-name>') && detailedStatus != 'Running']].{ClusterName: name}" \
-    -o table
-```
-Replace `<subscription-id>` with the subscription ID and `<managed-resource-group>` where the bare-metal machine is located and `<powered-off-baremetal-machine-name>` with the name of the powered-off bare-metal machine noted earlier.
+  ![powered off bare-metal machines](media/list-of-powered-off-bmms.png)
+  **Note:** This screenshot doesn't show the subscription ID because it was already set in the Azure CLI session by using the `az account set --subscription <subscription-id>` command.

-for example:
-```bash
-az networkcloud kubernetescluster list \
-    --subscription 00000000-0000-0000-0000-000000000000 \
-    --query "[?extendedLocation.name && contains(extendedLocation.name, 'poc02-19cf0b39e1e5-HostedResources-540F8D4E') && nodes[? bareMetalMachineId == \`null\` || contains(bareMetalMachineId, 'a12345bcde1co1') && detailedStatus != 'Running']].{ClusterName: name}" \
-    -o table
-```
-**Note:** The prior command might return Nexus Kubernetes Clusters that aren't impacted by the powered-off bare-metal machine. Further steps help you identify the impacted Nexus Kubernetes Clusters.
-* If there are no Nexus Kubernetes Clusters in the list from prior command, then don't proceed with the next steps. However, if there are Nexus Kubernetes Clusters, run the following steps on each Nexus Kubernetes Cluster.
-* Run following `kubectl` command in the Nexus Kubernetes Cluster to get the list of all the nodes that are running on powered off bare-metal machines:
+  If the command returns no results, rerun it after a few minutes. If it still returns no results after a reasonable time (5-10 minutes) and your workloads are still stuck, the issue might be different; contact support for further assistance.

-```bash
-kubectl get nodes -l topology.kubernetes.io/baremetalmachine=<powered-off-baremetal-machine-name>
-```
-Replace `<powered-off-baremetal-machine-name>` with the name of the powered-off bare-metal machine noted earlier.
+Cross-check the powered-off bare-metal machine names against the list of bare-metal machines where the not-ready nodes are running. If the bare-metal machines hosting the not-ready nodes are powered off, the issue is due to power failure, and you can proceed to the next section to resolve it.
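To make the cross-check mechanical, one option (editorial sketch, not part of the commit) is to intersect the two lists. The file names below are hypothetical; create them from the output of the two commands above, one bare-metal machine name per line:

```bash
# Sketch: BMM names that appear in BOTH lists point to not-ready nodes caused by power failure.
# not-ready-bmms.txt   - BMM names hosting not-ready nodes (from the kubectl command above)
# powered-off-bmms.txt - powered-off BMM names (from the az CLI command above)
comm -12 <(sort -u not-ready-bmms.txt) <(sort -u powered-off-bmms.txt)
```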

-for example:
-```bash
-kubectl get nodes -l topology.kubernetes.io/baremetalmachine=a12345bcde1co1
-```
-![kubectl get nodes output](media/naks-nodes-not-ready.png)
+## Warning
+
+### Nexus Kubernetes Cluster virtual machine (VM) placement

-If prior command doesn't list any nodes, no further action is needed for the current Nexus Kubernetes Cluster.
+This guide requires deleting nodes from the Nexus Kubernetes Cluster. This action can affect the rack placement of Nexus Kubernetes Cluster VMs. For more information, see [how the Nexus platform schedules a Nexus Kubernetes Cluster VM](./concepts-nexus-kubernetes-placement.md#how-the-nexus-platform-schedules-a-nexus-kubernetes-cluster-vm).
+
+### Host-Path storage
+
+If the pods are configured to use host-path storage on the node, deleting the node deletes the data too.
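Before deleting a node, you might want to identify pods that use host-path storage on it, since that data is lost with the node. The following is an editorial sketch (not part of the commit); it assumes `jq` is available, and `<not-ready-node-name>` is a placeholder:

```bash
# Sketch: list pods on a given not-ready node that mount hostPath volumes.
kubectl get pods --all-namespaces --field-selector spec.nodeName=<not-ready-node-name> -o json |
  jq -r '.items[]
    | select(any(.spec.volumes[]?; has("hostPath")))
    | [.metadata.namespace, .metadata.name] | @tsv' |
  column -t
```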
+
+## Solution

-Don't proceed with the next steps until all of the nodes statuses are not-ready from prior command. If the nodes are ready, then wait and run the command again to check the status. When all the nodes statuses are not-ready, proceed to the next step.
+To move the stuck workloads to other nodes in the Nexus Kubernetes Cluster, delete the Nexus Kubernetes Cluster nodes that are not-ready due to power failure. The workloads that are stuck on those nodes are then rescheduled to other nodes in the cluster. Additionally, new nodes are created automatically to replace the deleted nodes if there's enough capacity available in the cluster.

-* Double check by refreshing the "Idrac Power On" metric in Azure monitoring to verify that the bare-metal machine is still powered off. If the machine is showing as **powered on, don't proceed with the next steps**. If the machine is still showing as powered off, proceed to the next step.
-* To delete the nodes that are not-ready, run following `kubectl` command in the Nexus Kubernetes Cluster:
+* From the prior steps, note down the names of the powered-off bare-metal machines where the not-ready nodes are running.
+* To delete the nodes that are not-ready, run the following `kubectl` command in the Nexus Kubernetes Cluster **for each powered-off bare-metal machine**:

   ```bash
   kubectl delete node -l topology.kubernetes.io/baremetalmachine=<powered-off-baremetal-machine-name>
@@ -98,6 +116,6 @@ To resolve the issue, follow these steps:

   for example:
   ```bash
-  kubectl delete node -l topology.kubernetes.io/baremetalmachine=a12345bcde1co1
+  kubectl delete node -l topology.kubernetes.io/baremetalmachine=b37100gipc1co01
   ```
-* After you deleted the nodes, the workloads that were stuck on the not-ready nodes will be rescheduled to other nodes in the Nexus Kubernetes Cluster. **Note, this process might take upwards of 30 minutes to complete**. Additionally, new nodes are created to replace the deleted nodes automatically if there's enough capacity available in the cluster.
+* After you delete the nodes, the workloads that were stuck on the not-ready nodes should be rescheduled to other nodes in the Nexus Kubernetes Cluster. Run the prior `kubectl` command for all remaining powered-off bare-metal machines noted earlier.
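To verify the recovery, here is a short editorial sketch (not part of the commit) that reuses the earlier checks:

```bash
# Sketch: confirm the deleted nodes are gone, replacement nodes become Ready,
# and no pods remain in a non-running state after rescheduling.
kubectl get nodes
kubectl get pods --all-namespaces --no-headers -o wide | grep -Ev 'Running|Completed'
# An empty result from the second command means nothing is left stuck
# (grep exits non-zero when there are no matches, which is expected here).
```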
