
Commit 25e7aa0

Merge pull request #293677 from mdnyeemakhtar/main

Adding TSG for remediating NAKS stuck workloads due to power failure

2 parents: 11583cc + 247f648

6 files changed (+126 -0 lines)

articles/operator-nexus/TOC.yml (5 additions, 0 deletions)

@@ -377,6 +377,11 @@
        href: troubleshoot-neighbor-group-creation-error.md
      - name: Troubleshoot NAKS Cluster Node Packet Loss
        href: troubleshoot-packet-loss.md
+     - name: Troubleshoot Nexus Kubernetes Cluster stuck (unable to reschedule) workloads
+       expanded: false
+       items:
+       - name: Due To Bare Metal Machine Power Failure
+         href: troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md
      - name: FAQ
        href: azure-operator-nexus-faq.md
      - name: Reference
Four binary image files added (33.5 KB, 35.9 KB, 19 KB, 209 KB).
articles/operator-nexus/troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md (new file, 121 additions, 0 deletions)
---
title: Troubleshoot Nexus Kubernetes Cluster Stuck (Unable to Reschedule) Workloads Due to Power Failure
description: Troubleshooting Nexus Kubernetes Cluster workloads stuck (unable to reschedule) on not-ready nodes due to bare-metal node power failure.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 02/18/2025
ms.author: nyeemakhtar
author: mdnyeemakhtar
---

# Troubleshoot stuck (unable to reschedule) workloads in a Nexus Kubernetes Cluster due to power failure
What is a stuck workload?

A stuck workload is a pod that can't be rescheduled to another node in a Kubernetes cluster because the node it's running on is not-ready. This issue can happen for many reasons, including node power failure.

By design, Kubernetes doesn't move stateful workloads off a node that becomes not-ready (for example, because of a power failure). For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown).

This guide details troubleshooting steps for cases where workloads on a Nexus Kubernetes Cluster become stuck because of bare-metal machine power failures. It also explains how to restart a stuck pod so that it can be rescheduled on a different node.
## Prerequisites

* Permissions to view Azure resources in the subscription where the Nexus Kubernetes Cluster is deployed
* Permissions to make changes in the Nexus Kubernetes Cluster by using `kubectl` commands (for example, deleting nodes)
## Diagnose stuck workloads

If you observe that your applications aren't responding as expected, your workloads might be stuck on not-ready nodes. To diagnose stuck workloads on not-ready Nexus Kubernetes Cluster nodes, look for the following symptoms:

* To check whether any Nexus Kubernetes Cluster nodes are not-ready, run the following `kubectl` command in the Nexus Kubernetes Cluster:

  ```bash
  kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}' | column -t
  ```

  ![kubectl get nodes output](media/naks-nodes-not-ready.png)

  If the command returns no results, all the nodes are ready. Nodes can take a few minutes to become not-ready after a power failure, so you might need to run the command again after a few minutes. If the nodes still show as ready after a reasonable time (5-10 minutes) and your applications are still not responding, the issue might be something else; contact support for further assistance.
* To list the pods stuck on the not-ready nodes, run the following `kubectl` command in the Nexus Kubernetes Cluster:

  ```bash
  kubectl get nodes -o json | jq -r '.items[]
    | select(.status.conditions[] | select(.type=="Ready") | .status != "True")
    | .metadata.name' | xargs -I {} sh -c 'kubectl get pods --all-namespaces --field-selector spec.nodeName={} --no-headers | awk -v node={} "{print \$1, \$2, \$4, node}"' | sort -k1,1 | column -t -N "NAMESPACE,NAME,STATUS,NODE"
  ```

  This command returns the pods (stateful and daemon-set) that are stuck on the not-ready nodes. The pods' status might show `Pending`, `Terminating`, or `Running`.

  ![stuck workload on not-ready nodes](media/naks-workload-stuck-on-not-ready-nodes.png)
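The `jq` selection used in these commands can be sanity-checked offline against a minimal node-list document shaped like `kubectl get nodes -o json` output. A sketch with hypothetical node names:

```shell
# Minimal JSON in the shape `kubectl get nodes -o json` returns, trimmed to the
# fields the filter reads (node names are hypothetical).
nodes='{"items":[
  {"metadata":{"name":"naks-node-1"},"status":{"conditions":[{"type":"Ready","status":"True"}]}},
  {"metadata":{"name":"naks-node-2"},"status":{"conditions":[{"type":"Ready","status":"Unknown"}]}}
]}'

# Same selection as in the troubleshooting steps: keep nodes whose
# Ready condition isn't "True".
echo "$nodes" | jq -r '.items[]
  | select(.status.conditions[] | select(.type=="Ready") | .status != "True")
  | .metadata.name'
```

Only `naks-node-2` survives the filter, because its `Ready` condition is `Unknown` rather than `True`.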
## Diagnose power failure

After you confirm that the workloads are stuck on not-ready nodes, the next step is to determine whether the Nexus Kubernetes Cluster nodes are not-ready because of a power failure on one or more bare-metal machines.

To diagnose a power failure on a bare-metal machine, look for the following symptoms:

* To list the bare-metal machines where the not-ready Nexus Kubernetes Cluster nodes are running, run the following `kubectl` command in the Nexus Kubernetes Cluster:

  ```bash
  kubectl get nodes -o json | jq -r '.items[]
    | select(.status.conditions[] | select(.type=="Ready") | .status != "True")
    | [.metadata.name, .metadata.labels["topology.kubernetes.io/baremetalmachine"]]
    | @tsv' | column -t -s $'\t' -N "Node Name,BMM Name"
  ```

  This command returns the list of not-ready nodes and the name of the bare-metal machine that each node is running on.

  ![bmm names where not-ready nodes reside](media/bmm-where-not-ready-naks-reside.png)
* To list the bare-metal machines in the cluster that are powered off, run the following command at the cluster's managed resource group level:

  ```bash
  az networkcloud baremetalmachine list \
    --subscription <subscription-id> \
    --resource-group <managed-resource-group> \
    --query "[? powerState == 'Off' && detailedStatus != 'Available'].{BMMName: name, ResourceGroup: resourceGroup, PowerState: powerState, DetailedStatus: detailedStatus}" \
    -o table
  ```

  Replace `<subscription-id>` with the subscription ID and `<managed-resource-group>` with the managed resource group of the cluster where the bare-metal resources reside.

  For example:

  ```bash
  az networkcloud baremetalmachine list \
    --subscription 00000000-0000-0000-0000-000000000000 \
    --resource-group poc02-19cf0b39e1e5-HostedResources-540F8D4E \
    --query "[? powerState == 'Off' && detailedStatus != 'Available'].{BMMName: name, ResourceGroup: resourceGroup, PowerState: powerState, DetailedStatus: detailedStatus}" \
    -o table
  ```

  ![powered off bare-metal machines](media/list-of-powered-off-bmms.png)

  **Note:** This screenshot doesn't show a subscription ID because one was already set in the Azure CLI session by using the `az account set --subscription <subscription-id>` command.

  If the command returns no results, rerun it after a few minutes. If it still returns no results after a reasonable time (5-10 minutes) and your workloads are still stuck, the issue might be something else; contact support for further assistance.

Cross-check the names of the powered-off bare-metal machines against the list of bare-metal machines where the not-ready nodes are running. If the machines hosting the not-ready nodes are powered off, the issue is due to power failure, and you can proceed to the next section to resolve it.
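The cross-check itself can be scripted: write the two name lists to sorted files and intersect them with `comm`. A minimal sketch with hypothetical machine names:

```shell
# Bare-metal machines hosting not-ready nodes (from the kubectl step) and
# powered-off machines (from the az CLI step). Names are hypothetical.
printf '%s\n' b37100gipc1co01 b37100gipc1co03 | sort > not-ready-bmms.txt
printf '%s\n' b37100gipc1co01 b37100gipc1co02 | sort > powered-off-bmms.txt

# Lines common to both sorted files: the machines whose power failure
# caused the stuck workloads.
comm -12 not-ready-bmms.txt powered-off-bmms.txt
```

Here only `b37100gipc1co01` appears in both lists, so that machine is the one to act on in the next section.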
## Warning

### Nexus Kubernetes Cluster virtual machine (VM) placement

This guide requires deleting nodes from the Nexus Kubernetes Cluster. This action can affect the rack placement of Nexus Kubernetes Cluster VMs. For more information, see [how the Nexus platform schedules a Nexus Kubernetes Cluster VM](./concepts-nexus-kubernetes-placement.md#how-the-nexus-platform-schedules-a-nexus-kubernetes-cluster-vm).

### Host-path storage

If a pod is configured to use host-path storage on a node, deleting the node also deletes that data.
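To see up front whether any pods use host-path storage, you can filter pod specs for `hostPath` volumes. The `jq` filter below is a sketch run against a minimal pod-list document shaped like `kubectl get pods -o json` output (pod and volume names are hypothetical); in a live cluster, you'd pipe `kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> -o json` into the same filter.

```shell
# Minimal pod-list JSON trimmed to the fields the filter reads (names hypothetical).
pods='{"items":[
  {"metadata":{"name":"web-0"},"spec":{"volumes":[{"name":"data","hostPath":{"path":"/mnt/data"}}]}},
  {"metadata":{"name":"web-1"},"spec":{"volumes":[{"name":"cfg","configMap":{"name":"app-cfg"}}]}}
]}'

# Names of pods that mount at least one hostPath volume.
echo "$pods" | jq -r '.items[] | select(.spec.volumes[]? | has("hostPath")) | .metadata.name'
```

In this sample, only `web-0` is printed, so its host-path data would be lost if its node were deleted.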
## Solution

To move the stuck workloads to other nodes in the Nexus Kubernetes Cluster, delete the Nexus Kubernetes Cluster nodes that are not-ready because of the power failure. The workloads stuck on those nodes are then rescheduled to other nodes in the cluster. Additionally, new nodes are created automatically to replace the deleted nodes if there's enough capacity available in the cluster.

* From the prior steps, note down the names of the powered-off bare-metal machines where the not-ready nodes are running.
* To delete the not-ready nodes, run the following `kubectl` command in the Nexus Kubernetes Cluster **for each powered-off bare-metal machine**:

  ```bash
  kubectl delete node -l topology.kubernetes.io/baremetalmachine=<powered-off-baremetal-machine-name>
  ```

  Replace `<powered-off-baremetal-machine-name>` with the name of a powered-off bare-metal machine noted earlier.

  For example:

  ```bash
  kubectl delete node -l topology.kubernetes.io/baremetalmachine=b37100gipc1co01
  ```

* Run the prior `kubectl` command for each remaining powered-off bare-metal machine noted earlier. After you delete the nodes, the workloads that were stuck on the not-ready nodes should be rescheduled to other nodes in the Nexus Kubernetes Cluster.
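When several machines are affected, the per-machine deletion can be wrapped in a small loop. A sketch with hypothetical machine names, using `echo` as a dry run so the commands are printed rather than executed:

```shell
# Powered-off bare-metal machines noted in the diagnosis steps (names hypothetical).
bmm_list='b37100gipc1co01
b37100gipc1co02'

# Print the node-deletion command for each machine.
# echo makes this a dry run; remove it to run the deletions for real.
for bmm in $bmm_list; do
  echo kubectl delete node -l "topology.kubernetes.io/baremetalmachine=$bmm"
done
```

The dry run lets you review exactly which label selectors will be used before any node is deleted.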
