
Commit c44a3c2

add centralized resource health article as jumping off point
1 parent dc26f21 commit c44a3c2

5 files changed, +90 -27 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 5 additions & 0 deletions
@@ -366,6 +366,11 @@
 - name: Troubleshooting
   expanded: true
   items:
+  - name: Resource Health
+    expanded: false
+    items:
+    - name:
+      href: troubleshoot-resource-health-alerts.md
   - name: Network Fabric
     expanded: false
     items:

articles/operator-nexus/troubleshoot-bare-metal-machine-not-ready-state.md

Lines changed: 4 additions & 4 deletions
@@ -1,15 +1,15 @@
 ---
-title: Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state
-description: Examine common and known issues with BareMetal Machine resources.
+title: Troubleshoot Azure Operator Nexus Bare Metal Machine in not ready state
+description: Examine common and known issues with Bare Metal Machine resources.
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
-ms.date: 10/09/2024
+ms.date: 04/29/2025
 ms.author: omarrivera
 author: omarrivera
 ---

-# Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state
+# Troubleshoot Bare Metal Machine in not ready state

 This guide attempts to provide steps to troubleshoot when a BareMetal Machine is declared to be `Not Ready` state.

articles/operator-nexus/troubleshoot-cluster-heartbeat-connection-status-disconnected.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ For a Cluster, the `ClusterConnectionStatus` represents the stability in the con
 > The `ClusterConnectionStatus` **doesn't** represent or is related to the health or connectivity of the Arc Connected Kubernetes Cluster.
 > The `ClusterConnectionStatus` indicates that the Cluster is successful in sending heartbeats and receiving acknowledgment from the Cluster Manager.

-[!include[prereq-az-cli](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]
+[!include[prereqAzCLI](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]

 ## Understanding the Cluster connection status signal

articles/operator-nexus/troubleshoot-etcd-cluster-possible-quorum-lost.md

Lines changed: 40 additions & 22 deletions
@@ -18,56 +18,74 @@ This guide attempts to provide steps to follow when an `etcd` quorum is lost for
 > Feature enhancements are ongoing for a future release to help address this scenario without contacting support.
 > Open a support ticket via [contact support].

+[!include[prereqAzCLI](./includes/prereq-az-cli.md)]
+
+> [!NOTE]
+> The commands can be executed from the Azure portal or with the Azure CLI.
+
+**TODO**: Add any possible steps that can be taken to help with this scenario. Perhaps there's a list of things to check. The steps above are just a starting point and should be expanded upon.
+
 ## Ensure that all control plane nodes are online

 [Troubleshoot control plane quorum loss when multiple nodes are offline](./troubleshoot-control-plane-quorum.md) provides steps to follow when multiple control plane nodes are offline or unavailable.

 ## Check the status of the etcd pods

 It's possible that the etcd pods aren't running or are in a crash loop.
-This can happen if the control plane nodes aren't able to communicate with each other or if there are network issues.
-
+Record the `etcd` pod names; other commands require `<etcd-pod-name>` to be populated with the name of one of the etcd pods.
 If you have access to the control plane nodes, you can check the status of the etcd pods by running the following command:

-```bash
-kubectl get pods -n kube-system -l app=etcd
+```azurecli
+az networkcloud baremetalmachine run-read-command \
+  --resource-group "$CLUSTER_MRG" \
+  --name "$BMM_NAME" \
+  --subscription "$SUBSCRIPTION" \
+  --limit-time-seconds 60 \
+  --commands "[{command:'kubectl get',arguments:[pods,-n,kube-system,-l,component=etcd,-o,wide]}]"
+
+====Action Command Output====
++ kubectl get pods -n kube-system -l component=etcd -o wide
+NAME                  READY   STATUS    RESTARTS   AGE    IP           NODE             NOMINATED NODE   READINESS GATES
+etcd-<bmmMachineNe>   1/1     Running   0          4d6h   10.1.6.101   rack1control01   <none>           <none>
 ```

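The `run-read-command` examples above and below assume the shell variables `$SUBSCRIPTION`, `$CLUSTER_MRG`, and `$BMM_NAME` are already set. A minimal sketch of how they might be populated, assuming the Azure CLI `networkcloud` extension is installed; the resource group and machine name values are placeholders, not values from this commit:

```azurecli
# Subscription that contains the Cluster's managed resource group.
SUBSCRIPTION="$(az account show --query id --output tsv)"

# Managed resource group of the Operator Nexus Cluster (placeholder value).
CLUSTER_MRG="<cluster-managed-resource-group>"

# List the bare metal machines to pick a control plane node by name.
az networkcloud baremetalmachine list \
  --resource-group "$CLUSTER_MRG" \
  --subscription "$SUBSCRIPTION" \
  --output table

BMM_NAME="<control-plane-bare-metal-machine-name>"
```
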
 ## Check network connectivity between control plane nodes
-If the etcd pods are running, but the KCP is still not stable, it's possible that there are network connectivity issues between the control plane nodes.
-You can check the network connectivity between the control plane nodes by running the following command:

-```bash
-kubectl exec -it <etcd-pod-name> -n kube-system -- ping <other-control-plane-node-ip>
-```
+If the etcd pods are running, but the KCP is still not stable, it's possible that there are network connectivity issues between the control plane nodes.

-Replace `<etcd-pod-name>` with the name of one of the etcd pods and `<other-control-plane-node-ip>` with the IP address of one of the other control plane nodes.
+Replace `<other-control-plane-node-ip>` with the IP address of one of the other control plane nodes.
 If the ping command fails, it indicates that there are network connectivity issues between the control plane nodes.

+```azurecli
+az networkcloud baremetalmachine run-read-command \
+  --resource-group "$CLUSTER_MRG" \
+  --name "$BMM_NAME" \
+  --subscription "$SUBSCRIPTION" \
+  --limit-time-seconds 60 \
+  --commands "[{command:'ping',arguments:[<other-control-plane-node-ip>]}]"
+```
+
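The ping example above needs `<other-control-plane-node-ip>`. One way those addresses might be discovered, reusing the same read-only command pattern, is sketched here; it isn't part of this commit:

```azurecli
# List the Kubernetes nodes with their internal IP addresses (INTERNAL-IP column).
az networkcloud baremetalmachine run-read-command \
  --resource-group "$CLUSTER_MRG" \
  --name "$BMM_NAME" \
  --subscription "$SUBSCRIPTION" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl get',arguments:[nodes,-o,wide]}]"
```
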
 ## Check for storage issues
+
 If the etcd pods are running and there are no network connectivity issues, it's possible that there are storage issues with the etcd pods.
-You can check the storage issues by running the following command:

-```bash
-kubectl describe pod <etcd-pod-name> -n kube-system
-```

 Replace `<etcd-pod-name>` with the name of one of the etcd pods.
 This command provides detailed information about the etcd pod, including any storage issues that might be present.
+You can check the storage issues by running the following command:
+
+```azurecli
+```
+

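The `azurecli` block added above is still empty in this commit. A sketch of what it might contain, wrapping the removed `kubectl describe` command in the same `run-read-command` pattern used in the earlier sections (the variable names carry over from the previous examples):

```azurecli
# Describe one etcd pod to surface its volumes, mounts, and recent events.
az networkcloud baremetalmachine run-read-command \
  --resource-group "$CLUSTER_MRG" \
  --name "$BMM_NAME" \
  --subscription "$SUBSCRIPTION" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl describe',arguments:[pod,<etcd-pod-name>,-n,kube-system]}]"
```
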
 ## Check for resource issues or saturation
+
 If the etcd pods are running and there are no network connectivity or storage issues, it's possible that there are resource issues or saturation on the control plane nodes.
-You can check the resource usage on the control plane nodes by running the following command:
+If the CPU or memory usage is high, it might indicate that the control plane nodes are overscheduled and unable to process requests.

-```bash
-kubectl top nodes
+```azurecli
 ```

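The `azurecli` block above is likewise empty in this commit; it could wrap the removed `kubectl top nodes` command. A sketch under the same assumptions as the previous examples:

```azurecli
# Show CPU and memory usage per node; requires metrics-server on the cluster.
az networkcloud baremetalmachine run-read-command \
  --resource-group "$CLUSTER_MRG" \
  --name "$BMM_NAME" \
  --subscription "$SUBSCRIPTION" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl top',arguments:[nodes]}]"
```
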
-This command provides information about the CPU and memory usage on the control plane nodes.
-If the CPU or memory usage is high, it might indicate that the control plane nodes are saturated and unable to process requests.
-
-
-**TODO**: Add any possible steps that can be taken to help with this scenario. Perhaps there's a list of things to check. The steps above are just a starting point and should be expanded upon.

 [!include[stillHavingIssues](./includes/contact-support.md)]

articles/operator-nexus/troubleshoot-resource-health-alerts.md

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
+---
+title: Troubleshoot resource health alerts
+description: Find troubleshooting guides for platform-emitted resource health alerts.
+ms.service: azure-operator-nexus
+ms.custom: troubleshooting
+ms.topic: troubleshooting
+ms.date: 04/29/2025
+ms.author: omarrivera
+author: omarrivera
+---
+
+# Troubleshoot resource health alerts
+
+This guide provides a breakdown of the resource health alerts emitted by the Azure Operator Nexus platform.
+It includes a description of each alert and links to troubleshooting guides for each alert.
+
+Resource health alerts are emitted by the platform to indicate the health of a particular resource.
+These alerts are generated based on the status of the resource and its dependencies.
+
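Not part of this commit, but as a hedged illustration of where these events surface: the current availability status behind resource health alerts can be read for a single resource through the Azure Resource Health endpoint with `az rest`. The resource ID and `api-version` below are assumptions; substitute values that are valid in your environment:

```azurecli
# Hypothetical resource ID of an Operator Nexus Cluster; replace every placeholder.
RESOURCE_ID="/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.NetworkCloud/clusters/<cluster-name>"

# Read the current availability status reported by Azure Resource Health.
# The api-version shown is an assumption; use one supported in your environment.
az rest --method get \
  --url "https://management.azure.com${RESOURCE_ID}/providers/Microsoft.ResourceHealth/availabilityStatuses/current?api-version=2022-10-01"
```
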
+## Cluster
+
+| Resource Health Event Name | Troubleshooting Guide |
+|----------------------------|-----------------------|
+| `ClusterHeartbeatConnectionStatusDisconnectedClusterManagerOperationsAreAffectedPossibleNetworkIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected](./troubleshoot-cluster-heartbeat-connection-status-disconnected.md) |
+| `ClusterHeartbeatConnectionStatusTimedoutPossiblePerformanceIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected](./troubleshoot-cluster-heartbeat-connection-status-disconnected.md) |
+| `ETCDPossibleQuorumLossClusterOperationsAreAffected`<br>`ETCDPossibleQuorumLossDegradedProposalsProcessing`<br>`ETCDPossibleQuorumLossIncreasedProposalsProcessingFailures`<br>`ETCDPossibleQuorumLossNoClusterLeader` | [Troubleshoot Cluster Manager Not Reachable](./troubleshoot-cluster-manager-not-reachable.md) |
+
+## Bare Metal Machine
+
+| Resource Health Event Name | Troubleshooting Guide |
+|----------------------------|-----------------------|
+| `BMMHasHardwareValidationFailures` | [Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-this-machine-has-failed-hardware-validation) |
+| `BMMHasLACPDownStatusCondition` | [Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-lacp-status-is-down) |
+| `BMMHasNodeReadinessProblem` | [Troubleshoot Bare Metal Machine in not ready state](troubleshoot-bare-metal-machine-not-ready-state.md) |
+| `BMMHasPortDownStatusCondition` | [Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-port-down) |
+| `BMMHasPortFlappingStatusCondition` | [Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-port-flapping) |
+| `BMMPowerStateDoesNotMatchExpected` | [Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-bmm-power-state-doesnt-match-expected-state) |
+| `BMMPxePortIsUnhealthy` | [Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-pxe-port-is-unhealthy) |
+
+[!include[stillHavingIssues](./includes/contact-support.md)]
