
Commit 73cf5b6

Nyeem Akhtar committed
Adding TSG for remediating NAKS stuck workloads
In the event of a power failure, Undercloud nodes are powered off, causing the NAKS nodes running on the powered-off UC nodes to become not-ready, which results in stateful workloads being stuck on those not-ready nodes. This TSG helps to unstick those workloads.
1 parent 180acb7 commit 73cf5b6

File tree

4 files changed: +109 -2 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 6 additions & 2 deletions

@@ -213,7 +213,7 @@
     - name: How to upgrade os of terminal server
       href: howto-upgrade-os-of-terminal-server.md
     - name: How to restrict serial port access and set timeout on terminal-server
-      href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
+      href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
     - name: Cluster
       expanded: false
       items:
@@ -369,6 +369,10 @@
       href: troubleshoot-kubernetes-cluster-dual-stack-configuration.md
     - name: Troubleshoot Neighbor Group Creation Error
       href: troubleshoot-neighbor-group-creation-error.md
+    - name: Troubleshoot NAKS Cluster Node Packet Loss
+      href: troubleshoot-packet-loss.md
+    - name: Troubleshooting Nexus Kubernetes Cluster stuck (unable to reschedule) workloads due to power failure
+      href: troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md
     - name: FAQ
       href: azure-operator-nexus-faq.md
     - name: Reference
@@ -429,4 +433,4 @@
     expanded: false
     items:
     - name: 2404.2
-      href: release-notes-2404.2.md
+      href: release-notes-2404.2.md
2 binary image files changed (270 KB and 87.1 KB); previews not rendered.
articles/operator-nexus/troubleshoot-kubernetes-cluster-stuck-workloads-due-to-power-failure.md

Lines changed: 103 additions & 0 deletions

@@ -0,0 +1,103 @@
---
title: Troubleshooting Nexus Kubernetes Cluster stuck (unable to reschedule) workloads due to power failure
description: Troubleshooting Nexus Kubernetes Cluster workloads stuck (unable to reschedule) on not-ready nodes due to bare-metal node power failure.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 01/30/2025
ms.author: nyeemakhtar
author: mdnyeemakhtar
---

# Troubleshooting stuck (unable to reschedule) workloads in a Nexus Kubernetes Cluster

This guide provides detailed steps for troubleshooting stuck workloads on not-ready Nexus Kubernetes Cluster nodes. If you're experiencing these issues because of a bare-metal node power failure, this guide helps you identify and resolve the problem.

## Prerequisites

* Permissions to view Azure resources in the subscription where the Nexus Kubernetes Cluster is deployed
* Access to Azure monitoring
* Necessary permissions to make changes using `kubectl` commands in the Nexus Kubernetes Cluster, for example, deleting nodes (one way to get `kubectl` access is sketched after this list)

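If you still need `kubectl` access to the cluster, one common route on Operator Nexus is the Azure Arc cluster connect proxy. The following is a minimal sketch, assuming the `connectedk8s` Azure CLI extension is installed and that the placeholder names are replaced with your cluster's values:

```bash
# Open a cluster connect proxy to the Nexus Kubernetes Cluster (placeholder names).
az connectedk8s proxy \
  --name <nexus-kubernetes-cluster-name> \
  --resource-group <cluster-resource-group> \
  --file ./kubeconfig-proxy &

# Point kubectl at the proxied kubeconfig for the rest of this guide.
export KUBECONFIG=./kubeconfig-proxy
kubectl get nodes
```
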
## Symptoms

* Nexus Kubernetes Cluster nodes are not-ready
* Pods that aren't daemon-set pods are stuck on the not-ready nodes (both symptoms can be confirmed with the commands shown below)

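For example, you can check for both symptoms with standard `kubectl` commands:

```bash
# Nodes hosted on a powered-off bare-metal machine report a NotReady status.
kubectl get nodes

# Stuck pods stay bound to the NotReady nodes; the NODE column shows where each pod is scheduled.
kubectl get pods --all-namespaces -o wide
```
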
## Cause

Kubernetes, by design, doesn't move workloads that are stateful in nature if the node they're running on becomes not-ready. For more information, see [Kubernetes documentation](https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown).

A not-ready Nexus Kubernetes Cluster node can be caused by a power failure on a bare-metal machine in the cluster. You can verify it by checking the "Idrac Power On" metric in Azure monitoring. If you have alerts set up for this metric, you might have received an alert indicating that the bare-metal machines are powered off. A CLI-based version of the same check is sketched below.

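If you prefer the Azure CLI over the portal, a query along the following lines can retrieve the same metric. Treat it as a sketch: the metric name (`IdracPowerOn`) and the resource ID format are assumptions to confirm against your environment, for example by browsing the bare-metal machine's Metrics blade in the portal.

```bash
# Query the iDRAC power-on metric for a single bare-metal machine (assumed metric name).
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<managed-resource-group>/providers/Microsoft.NetworkCloud/bareMetalMachines/<bare-metal-machine-name>" \
  --metric "IdracPowerOn" \
  --aggregation Maximum \
  --output table

# As in the portal chart, a value of 1 means powered on and 0 means powered off.
```
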
## Warning

### Nexus Kubernetes Cluster nodes Rack Spread

This guide requires deleting nodes from the Nexus Kubernetes Cluster. This action can affect the rack spread of the Nexus Kubernetes Cluster nodes. You can check how the nodes are currently distributed with the command shown below.

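As a quick check, the node label that this guide uses later (`topology.kubernetes.io/baremetalmachine`) shows which bare-metal machine each node runs on, which gives an indication of how the nodes are spread before you delete any of them:

```bash
# Show each node together with the bare-metal machine that hosts it.
kubectl get nodes -L topology.kubernetes.io/baremetalmachine
```
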
### Host-Path storage

If the pods are configured to use host-path storage on the node, deleting the node deletes that data too.

## Solution

To resolve the issue, follow these steps:

1. Check the "Idrac Power On" metric in Azure monitoring to verify that the bare-metal machine is powered off. The following is an example screenshot of the metric in Azure monitoring:

   ![Idrac Power On metric in Azure monitoring](media/idrac-power-on-metric.png)

2. If the "Idrac Power On" metric indicates that the bare-metal machine is powered on (showing a value of 1), **don't proceed with the following steps**.

3. If the "Idrac Power On" metric indicates that the bare-metal machine is powered off (showing a value of 0), take the following actions on all impacted Nexus Kubernetes Clusters within that environment:

   * Note down the name of the powered-off bare-metal machine and the managed-resource-group name. In the example screenshot in step 1, the machine name is `a12345bcde1co1` and the managed-resource-group name is `poc02-19cf0b39e1e5-HostedResources-540F8D4E`. If this bare-metal machine is a compute node, proceed with the following steps.
   * To find the Nexus Kubernetes Clusters impacted by the powered-off bare-metal machine, run the following `az networkcloud` command:

     ```bash
     az networkcloud kubernetescluster list \
         --subscription <subscription-id> \
         --query "[?extendedLocation.name && contains(extendedLocation.name, '<managed-resource-group>') && nodes[? bareMetalMachineId == \`null\` || contains(bareMetalMachineId, '<powered-off-baremetal-machine-name>') && detailedStatus != 'Running']].{ClusterName: name}" \
         -o table
     ```

     Replace `<subscription-id>` with the subscription ID, `<managed-resource-group>` with the managed resource group where the bare-metal machine is located, and `<powered-off-baremetal-machine-name>` with the name of the powered-off bare-metal machine noted earlier.

     For example:

     ```bash
     az networkcloud kubernetescluster list \
         --subscription 00000000-0000-0000-0000-000000000000 \
         --query "[?extendedLocation.name && contains(extendedLocation.name, 'poc02-19cf0b39e1e5-HostedResources-540F8D4E') && nodes[? bareMetalMachineId == \`null\` || contains(bareMetalMachineId, 'a12345bcde1co1') && detailedStatus != 'Running']].{ClusterName: name}" \
         -o table
     ```

     **Note:** The prior command might return Nexus Kubernetes Clusters that aren't impacted by the powered-off bare-metal machine. The further steps help you identify the impacted Nexus Kubernetes Clusters.
   * If there are no Nexus Kubernetes Clusters in the list from the prior command, don't proceed with the next steps. If there are Nexus Kubernetes Clusters in the list, run the following steps on each of them.
   * Run the following `kubectl` command in the Nexus Kubernetes Cluster to get the list of all the nodes that are running on the powered-off bare-metal machine:

     ```bash
     kubectl get nodes -l topology.kubernetes.io/baremetalmachine=<powered-off-baremetal-machine-name>
     ```

     Replace `<powered-off-baremetal-machine-name>` with the name of the powered-off bare-metal machine noted earlier.

     For example:

     ```bash
     kubectl get nodes -l topology.kubernetes.io/baremetalmachine=a12345bcde1co1
     ```

     ![kubectl get nodes output](media/naks-nodes-not-ready.png)

     If the prior command doesn't list any nodes, no further action is needed for the current Nexus Kubernetes Cluster.

     Don't proceed with the next steps until all of the node statuses from the prior command are not-ready. If any nodes are still ready, wait and run the command again to check the status. When all of the node statuses are not-ready, proceed to the next step.
   * Double-check by refreshing the "Idrac Power On" metric in Azure monitoring to verify that the bare-metal machine is still powered off. If the machine is showing as **powered on, don't proceed with the next steps**. If the machine is still showing as powered off, proceed to the next step.
   * To delete the nodes that are not-ready, run the following `kubectl` command in the Nexus Kubernetes Cluster:

     ```bash
     kubectl delete node -l topology.kubernetes.io/baremetalmachine=<powered-off-baremetal-machine-name>
     ```

     Replace `<powered-off-baremetal-machine-name>` with the name of the powered-off bare-metal machine noted earlier.

     For example:

     ```bash
     kubectl delete node -l topology.kubernetes.io/baremetalmachine=a12345bcde1co1
     ```
   * After you delete the nodes, the workloads that were stuck on the not-ready nodes are rescheduled to other nodes in the Nexus Kubernetes Cluster. **Note: this process might take upwards of 30 minutes to complete.** Additionally, new nodes are created automatically to replace the deleted nodes if there's enough capacity available in the cluster. You can watch this recovery with the commands shown after these steps.
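
For example, after the node deletion you can watch the recovery with standard `kubectl` commands; the `grep` filter is just one convenient way to spot pods that haven't been rescheduled yet:

```bash
# Watch replacement nodes register and become Ready (this can take upwards of 30 minutes).
kubectl get nodes --watch

# List pods that aren't Running yet; previously stuck pods should disappear from this list
# as they're rescheduled onto healthy nodes.
kubectl get pods --all-namespaces -o wide | grep -v Running
```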
