
Commit c44a3c2

add centralized resource health article as jumping off point
1 parent dc26f21 commit c44a3c2

5 files changed, +90 -27 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 5 additions & 0 deletions
@@ -366,6 +366,11 @@
 - name: Troubleshooting
   expanded: true
   items:
+  - name: Resource Health
+    expanded: false
+    items:
+    - name:
+      href: troubleshoot-resource-health-alerts.md
   - name: Network Fabric
     expanded: false
     items:

articles/operator-nexus/troubleshoot-bare-metal-machine-not-ready-state.md

Lines changed: 4 additions & 4 deletions
@@ -1,15 +1,15 @@
 ---
-title: Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state
-description: Examine common and known issues with BareMetal Machine resources.
+title: Troubleshoot Azure Operator Nexus Bare Metal Machine in not ready state
+description: Examine common and known issues with Bare Metal Machine resources.
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
-ms.date: 10/09/2024
+ms.date: 04/29/2025
 ms.author: omarrivera
 author: omarrivera
 ---

-# Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state
+# Troubleshoot Bare Metal Machine in not ready state

 This guide attempts to provide steps to troubleshoot when a BareMetal Machine is declared to be `Not Ready` state.

articles/operator-nexus/troubleshoot-cluster-heartbeat-connection-status-disconnected.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ For a Cluster, the `ClusterConnectionStatus` represents the stability in the con
 > The `ClusterConnectionStatus` **doesn't** represent or is related to the health or connectivity of the Arc Connected Kubernetes Cluster.
 > The `ClusterConnectionStatus` indicates that the Cluster is successful in sending heartbeats and receiving acknowledgment from the Cluster Manager.

-[!include[prereq-az-cli](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]
+[!include[prereqAzCLI](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]

 ## Understanding the Cluster connection status signal

articles/operator-nexus/troubleshoot-etcd-cluster-possible-quorum-lost.md

Lines changed: 40 additions & 22 deletions
@@ -18,56 +18,74 @@ This guide attempts to provide steps to follow when an `etcd` quorum is lost for
 > Feature enhancements are ongoing for a future release to help address this scenario without contacting support.
 > Open a support ticket via [contact support].

+[!include[prereqAzCLI](./includes/prereq-az-cli.md)]
+
+> [!NOTE]
+> The commands can be executed from the Azure portal or with the Azure CLI.
+
+**TODO**: Add any possible steps that can be taken to help with this scenario. Perhaps there's a list of things to check. The steps above are just a starting point and should be expanded upon.
+
 ## Ensure that all control plane nodes are online

 [Troubleshoot control plane quorum loss when multiple nodes are offline](./troubleshoot-control-plane-quorum.md) provides steps to follow when multiple control plane nodes are offline or unavailable.

 ## Check the status of the etcd pods

 It's possible that the etcd pods aren't running or are in a crash loop.
-This can happen if the control plane nodes aren't able to communicate with each other or if there are network issues.
-
+Record the `etcd` pod names; other commands require `<etcd-pod-name>` to be populated with the name of one of the etcd pods.
 If you have access to the control plane nodes, you can check the status of the etcd pods by running the following command:

-```bash
-kubectl get pods -n kube-system -l app=etcd
+```azurecli
+az networkcloud baremetalmachine run-read-command \
+  --resource-group "$CLUSTER_MRG" \
+  --name "$BMM_NAME" \
+  --subscription "$SUBSCRIPTION" \
+  --limit-time-seconds 60 \
+  --commands "[{command:'kubectl get',arguments:[pods,-n,kube-system,-l,component=etcd,-o,wide]}]"
+
+====Action Command Output====
++ kubectl get pods -n kube-system -l component=etcd -o wide
+NAME                  READY   STATUS    RESTARTS   AGE    IP           NODE             NOMINATED NODE   READINESS GATES
+etcd-<bmmMachineNe>   1/1     Running   0          4d6h   10.1.6.101   rack1control01   <none>           <none>
 ```

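The `run-read-command` examples above and below assume the shell variables `$SUBSCRIPTION`, `$CLUSTER_MRG`, and `$BMM_NAME` are already set. A minimal sketch of how they might be populated, assuming the Azure CLI `networkcloud` extension is installed; the resource group and machine name values are placeholders, not values from this commit:

```azurecli
# Subscription that contains the Cluster's managed resource group.
SUBSCRIPTION="$(az account show --query id --output tsv)"

# Managed resource group of the Operator Nexus Cluster (placeholder value).
CLUSTER_MRG="<cluster-managed-resource-group>"

# List the bare metal machines to pick a control plane node by name.
az networkcloud baremetalmachine list \
  --resource-group "$CLUSTER_MRG" \
  --subscription "$SUBSCRIPTION" \
  --output table

BMM_NAME="<control-plane-bare-metal-machine-name>"
```
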
 ## Check network connectivity between control plane nodes
-If the etcd pods are running, but the KCP is still not stable, it's possible that there are network connectivity issues between the control plane nodes.
-You can check the network connectivity between the control plane nodes by running the following command:

-```bash
-kubectl exec -it <etcd-pod-name> -n kube-system -- ping <other-control-plane-node-ip>
-```
+If the etcd pods are running, but the KCP is still not stable, it's possible that there are network connectivity issues between the control plane nodes.

-Replace `<etcd-pod-name>` with the name of one of the etcd pods and `<other-control-plane-node-ip>` with the IP address of one of the other control plane nodes.
+Replace `<other-control-plane-node-ip>` with the IP address of one of the other control plane nodes.
 If the ping command fails, it indicates that there are network connectivity issues between the control plane nodes.

+```azurecli
+az networkcloud baremetalmachine run-read-command \
+  --resource-group "$CLUSTER_MRG" \
+  --name "$BMM_NAME" \
+  --subscription "$SUBSCRIPTION" \
+  --limit-time-seconds 60 \
+  --commands "[{command:'ping',arguments:[<other-control-plane-node-ip>]}]"
+```
+
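The ping example above needs `<other-control-plane-node-ip>`. One way those addresses might be discovered, reusing the same read-only command pattern, is sketched here; it isn't part of this commit:

```azurecli
# List the Kubernetes nodes with their internal IP addresses (INTERNAL-IP column).
az networkcloud baremetalmachine run-read-command \
  --resource-group "$CLUSTER_MRG" \
  --name "$BMM_NAME" \
  --subscription "$SUBSCRIPTION" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl get',arguments:[nodes,-o,wide]}]"
```
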
 ## Check for storage issues
+
 If the etcd pods are running and there are no network connectivity issues, it's possible that there are storage issues with the etcd pods.
-You can check the storage issues by running the following command:

-```bash
-kubectl describe pod <etcd-pod-name> -n kube-system
-```

 Replace `<etcd-pod-name>` with the name of one of the etcd pods.
 This command provides detailed information about the etcd pod, including any storage issues that might be present.
+You can check the storage issues by running the following command:
+
+```azurecli
+```
+

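The `azurecli` block added above is still empty in this commit. A sketch of what it might contain, wrapping the removed `kubectl describe` command in the same `run-read-command` pattern used in the earlier sections (the variable names carry over from the previous examples):

```azurecli
# Describe one etcd pod to surface its volumes, mounts, and recent events.
az networkcloud baremetalmachine run-read-command \
  --resource-group "$CLUSTER_MRG" \
  --name "$BMM_NAME" \
  --subscription "$SUBSCRIPTION" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl describe',arguments:[pod,<etcd-pod-name>,-n,kube-system]}]"
```
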
 ## Check for resource issues or saturation
+
 If the etcd pods are running and there are no network connectivity or storage issues, it's possible that there are resource issues or saturation on the control plane nodes.
-You can check the resource usage on the control plane nodes by running the following command:
+If the CPU or memory usage is high, it might indicate that the control plane nodes are overscheduled and unable to process requests.

-```bash
-kubectl top nodes
+```azurecli
 ```

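The `azurecli` block above is likewise empty in this commit; it could wrap the removed `kubectl top nodes` command. A sketch under the same assumptions as the previous examples:

```azurecli
# Show CPU and memory usage per node; requires metrics-server on the cluster.
az networkcloud baremetalmachine run-read-command \
  --resource-group "$CLUSTER_MRG" \
  --name "$BMM_NAME" \
  --subscription "$SUBSCRIPTION" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl top',arguments:[nodes]}]"
```
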
-This command provides information about the CPU and memory usage on the control plane nodes.
-If the CPU or memory usage is high, it might indicate that the control plane nodes are saturated and unable to process requests.
-
-
-**TODO**: Add any possible steps that can be taken to help with this scenario. Perhaps there's a list of things to check. The steps above are just a starting point and should be expanded upon.

 [!include[stillHavingIssues](./includes/contact-support.md)]

articles/operator-nexus/troubleshoot-resource-health-alerts.md

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
+---
+title: Troubleshoot resource health alerts
+description: Find troubleshooting guides for platform-emitted resource health alerts.
+ms.service: azure-operator-nexus
+ms.custom: troubleshooting
+ms.topic: troubleshooting
+ms.date: 04/29/2025
+ms.author: omarrivera
+author: omarrivera
+---
+
+# Troubleshoot resource health alerts
+
+This guide provides a breakdown of the resource health alerts emitted by the Azure Operator Nexus platform.
+It includes a description of each alert and links to troubleshooting guides for each alert.
+
+Resource health alerts are emitted by the platform to indicate the health of a particular resource.
+These alerts are generated based on the status of the resource and its dependencies.
+
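Not part of this commit, but as a hedged illustration of where these events surface: the current availability status behind resource health alerts can be read for a single resource through the Azure Resource Health endpoint with `az rest`. The resource ID and `api-version` below are assumptions; substitute values that are valid in your environment:

```azurecli
# Hypothetical resource ID of an Operator Nexus Cluster; replace every placeholder.
RESOURCE_ID="/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.NetworkCloud/clusters/<cluster-name>"

# Read the current availability status reported by Azure Resource Health.
# The api-version shown is an assumption; use one supported in your environment.
az rest --method get \
  --url "https://management.azure.com${RESOURCE_ID}/providers/Microsoft.ResourceHealth/availabilityStatuses/current?api-version=2022-10-01"
```
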
+## Cluster
+
+| Resource Health Event Name | Troubleshooting Guide |
+|----------------------------|-----------------------|
+| `ClusterHeartbeatConnectionStatusDisconnectedClusterManagerOperationsAreAffectedPossibleNetworkIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected](./troubleshoot-cluster-heartbeat-connection-status-disconnected.md) |
+| `ClusterHeartbeatConnectionStatusTimedoutPossiblePerformanceIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected](./troubleshoot-cluster-heartbeat-connection-status-disconnected.md) |
+| `ETCDPossibleQuorumLossClusterOperationsAreAffected`<br>`ETCDPossibleQuorumLossDegradedProposalsProcessing`<br>`ETCDPossibleQuorumLossIncreasedProposalsProcessingFailures`<br>`ETCDPossibleQuorumLossNoClusterLeader` | [Troubleshoot Cluster Manager Not Reachable](./troubleshoot-cluster-manager-not-reachable.md) |
+
+## Bare Metal Machine
+
+| Resource Health Event Name | Troubleshooting Guide |
+|----------------------------|-----------------------|
+| `BMMHasHardwareValidationFailures` | [Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-this-machine-has-failed-hardware-validation) |
+| `BMMHasLACPDownStatusCondition` | [Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-lacp-status-is-down) |
+| `BMMHasNodeReadinessProblem` | [Troubleshoot Bare Metal Machine in not ready state](troubleshoot-bare-metal-machine-not-ready-state.md) |
+| `BMMHasPortDownStatusCondition` | [Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-port-down) |
+| `BMMHasPortFlappingStatusCondition` | [Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-port-flapping) |
+| `BMMPowerStateDoesNotMatchExpected` | [Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-bmm-power-state-doesnt-match-expected-state) |
+| `BMMPxePortIsUnhealthy` | [Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-pxe-port-is-unhealthy) |
+
+[!include[stillHavingIssues](./includes/contact-support.md)]
