Commit 19469f9

Merge pull request #302166 from g0r1v3r4/add-cluster-resourcehealth-tsg
Add Cluster Connection Status Resource Health TSG
2 parents 660afb2 + 3c6532c commit 19469f9

7 files changed

+243
-29
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 16 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -249,7 +249,7 @@
249249
href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
250250
- name: How to configure BGP prefix limit on Customer Edge (CE) devices for Azure Operator Nexus
251251
href: howto-configure-bgp-prefix-limit-on-customer-edge-devices.md
252-
- name: BMP log streaming in Azure Operator Nexus Network Fabric
252+
- name: BMP log streaming in Azure Operator Nexus Network Fabric
253253
href: concepts-bmp-log-streaming.md
254254
- name: How to enable / disable BMP log streaming Azure Operator Nexus
255255
href: howto-enable-log-streaming.md
@@ -310,7 +310,6 @@
310310
href: howto-kubernetes-cluster-install-microsoft-defender.md
311311
- name: Kubernetes cluster features
312312
href: howto-kubernetes-cluster-features.md
313-
314313
- name: Nexus Virtual Machine
315314
expanded: false
316315
items:
@@ -367,6 +366,11 @@
367366
- name: Troubleshooting
368367
expanded: true
369368
items:
369+
- name: Resource Health
370+
expanded: false
371+
items:
372+
- name: Troubleshoot Resource Health alerts
373+
href: troubleshoot-resource-health-alerts.md
370374
- name: Network Fabric
371375
expanded: false
372376
items:
@@ -378,7 +382,16 @@
378382
href: troubleshoot-dns-issues.md
379383
- name: Troubleshoot TWAMP (UDP) not working
380384
href: troubleshoot-twamp-udp-not-working.md
381-
- name: Cluster or BMM
385+
- name: Cluster
386+
expanded: false
387+
items:
388+
- name: Troubleshoot Accepted Cluster Resource
389+
href: troubleshoot-accepted-cluster-hydration.md
390+
- name: Troubleshoot Control Plane Quorum
391+
href: troubleshoot-control-plane-quorum.md
392+
- name: Troubleshoot Cluster heartbeat connection status disconnected
393+
href: troubleshoot-cluster-heartbeat-connection-status-disconnected.md
394+
- name: Bare Metal Machine
382395
expanded: false
383396
items:
384397
- name: Troubleshoot Bare Metal Server Problems
@@ -391,10 +404,6 @@
391404
href: troubleshoot-bare-metal-machine-degraded.md
392405
- name: Troubleshoot Warning status
393406
href: troubleshoot-bare-metal-machine-warning.md
394-
- name: Troubleshoot Control Plane Quorum
395-
href: troubleshoot-control-plane-quorum.md
396-
- name: Troubleshoot Accepted Cluster Resource
397-
href: troubleshoot-accepted-cluster-hydration.md
398407
- name: Troubleshoot Out of Memory Pods
399408
href: troubleshoot-memory-limits.md
400409
- name: Tenant Workload
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
---
2+
author: omarrivera
3+
ms.author: omarrivera
4+
ms.date: 07/02/2025
5+
ms.topic: include
6+
ms.service: azure-operator-nexus
7+
---
8+
9+
## Still having issues?
10+
11+
If the steps outlined didn't provide a path to resolve the issue, or if you still have questions, [contact support].
12+
Provide as much detail as possible about the issue you're experiencing, including any relevant error messages or logs.
13+
This detail helps the support team assist you more effectively.
14+
15+
You can open a support request through the [Azure portal][contact support].
16+
17+
For more information about support plans, see [Azure Support plans].
18+
19+
[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
20+
[Azure Support plans]: https://azure.microsoft.com/support/plans/response/
Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
1+
---
2+
title: Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected
3+
description: Provide steps to investigate and possibly resolve circumstances that are preventing the Cluster from sending heartbeats to the Cluster Manager.
4+
ms.service: azure-operator-nexus
5+
ms.custom: troubleshooting
6+
ms.topic: troubleshooting
7+
ms.date: 07/02/2025
8+
ms.author: omarrivera
9+
author: omarrivera
10+
---
11+
12+
# Troubleshoot Cluster heartbeat connection status shows disconnected
13+
14+
This guide describes steps to troubleshoot a Cluster with a `ClusterConnectionStatus` in `Disconnected` state.
15+
For a Cluster, the `ClusterConnectionStatus` represents the stability of the connection between the on-premises Cluster and the Cluster Manager.
16+
17+
> [!IMPORTANT]
18+
> The `ClusterConnectionStatus` **doesn't** represent, and isn't related to, the health or connectivity of the Arc Connected Kubernetes Cluster.
19+
> The `ClusterConnectionStatus` indicates whether the Cluster is successfully sending heartbeats and receiving acknowledgments from the Cluster Manager.
20+
21+
[!include[prereqAzCLI](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]
22+
23+
## Understanding the Cluster connection status signal
24+
25+
The `ClusterConnectionStatus` represents the ability of the on-premises Cluster to send heartbeats and receive acknowledgments from the Cluster Manager, indicating the health of the network connection between them.
26+
`ClusterConnectionStatus` is distinct from the connectivity of the Arc Connected Kubernetes Cluster, though network issues affect both.
27+
28+
A Cluster resource has the property `ClusterConnectionStatus` set to the value `Connected` if the heartbeats are continuously received and acknowledged.
29+
The `ClusterConnectionStatus` becomes `Connected` once the Cluster is in a healthy state and network connectivity issues are resolved.
30+
The Cluster shows `Timeout` only as a transitional state between `Connected` and `Disconnected`.
31+
The Cluster `ClusterConnectionStatus` value becomes `Disconnected` if the Cluster Manager detects continuously missed heartbeats.
32+
Heartbeats are considered missed if they aren't received within the specified time thresholds.
33+
Once the Cluster is in a healthy state and there are no network connectivity issues, the `ClusterConnectionStatus` automatically moves to `Connected`.
34+
35+
During the Cluster deployment process, the Cluster is in an `Undefined` state until the Cluster is fully deployed and operational.
36+
37+
The following table shows the possible values of `ClusterConnectionStatus` and their definitions:
38+
39+
| Status | Definition |
40+
|----------------|-----------------------------------------------------------------------------------------------------------------------|
41+
| `Connected` | Heartbeats received, indicates healthy Cluster and Cluster Manager connectivity |
42+
| `Disconnected` | Heartbeats missed for **over 5 minutes**, indicates likely connectivity issue between Cluster Manager and Cluster |
43+
| `Timeout` | Heartbeats missed for **over 2 minutes but less than 5 minutes**, Cluster connectivity is uncertain, possibly degraded |
44+
| `Undefined` | Cluster not yet deployed or running a version without the heartbeats feature |
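
The thresholds in the table can be sketched as a small shell function. This is only an illustration of the documented 2-minute and 5-minute windows, not the Cluster Manager's actual logic. The function name `classify_connection_status` and its input (seconds since the last acknowledged heartbeat) are hypothetical, and the table doesn't say how a gap of exactly 5 minutes is treated, so the sketch assumes it still counts as `Timeout`.

```shell
# Hypothetical helper: map the seconds elapsed since the last acknowledged
# heartbeat to a ClusterConnectionStatus value, using the documented thresholds.
# Assumption: exactly 300 seconds is treated as Timeout (the table leaves it open).
classify_connection_status() {
  elapsed="$1"
  if [ "$elapsed" -gt 300 ]; then
    echo "Disconnected"   # missed for over 5 minutes
  elif [ "$elapsed" -gt 120 ]; then
    echo "Timeout"        # missed for over 2 minutes
  else
    echo "Connected"      # heartbeats arriving within the threshold
  fi
}
```

For example, `classify_connection_status 200` falls in the transitional window between the two thresholds. `Undefined` is not modeled here because it depends on deployment state rather than on elapsed time.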
45+
46+
## Check the value of the Cluster's ClusterConnectionStatus property
47+
48+
The value of `ClusterConnectionStatus` is visible in the Azure portal in the Cluster resource view.
49+
50+
:::image type="content" source="media/troubleshoot-cluster-heartbeat-connection-status/azure-portal-cluster-connection-status.png" alt-text="Screenshot of ClusterConnectionStatus property as shown in the Azure portal." lightbox="media/troubleshoot-cluster-heartbeat-connection-status/azure-portal-cluster-connection-status.png":::
51+
52+
Or, you can use the Azure CLI to see the value of `ClusterConnectionStatus`:
53+
54+
```azurecli
55+
az networkcloud cluster show \
56+
-g "$CLUSTER_RG" \
57+
-n "$CLUSTER_NAME" \
58+
--subscription "$SUBSCRIPTION_ID" \
59+
--query "{ClusterConnectionStatus:clusterConnectionStatus}" \
60+
--output table
61+
62+
ClusterConnectionStatus
63+
-------------------------
64+
Connected
65+
```
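
For scripted checks, the returned value can be mapped to a next step. The following is a minimal sketch based only on the status definitions in this guide. The function name `next_step_for_status` and its messages are hypothetical, and the commented usage assumes the same `$CLUSTER_RG`, `$CLUSTER_NAME`, and `$SUBSCRIPTION_ID` variables as the command above.

```shell
# Hypothetical helper: print a suggested next step for a ClusterConnectionStatus value.
next_step_for_status() {
  case "$1" in
    Connected)    echo "healthy: no action needed" ;;
    Timeout)      echo "transitional: re-check in a few minutes" ;;
    Disconnected) echo "investigate: follow the common investigation steps" ;;
    Undefined)    echo "not deployed yet, or version without heartbeats" ;;
    *)            echo "unknown status: $1" ;;
  esac
}

# Usage (requires the Azure CLI; not run as part of this sketch):
# status="$(az networkcloud cluster show -g "$CLUSTER_RG" -n "$CLUSTER_NAME" \
#   --subscription "$SUBSCRIPTION_ID" --query clusterConnectionStatus --output tsv)"
# next_step_for_status "$status"
```

Keeping the Azure CLI call outside the function lets the mapping be checked without access to a Cluster.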
66+
67+
## Understanding the NexusClusterConnectionStatus metric
68+
69+
Use Azure Resource Health to build alerts for Cluster health, as it provides a comprehensive and supported view of resource status.
70+
The `NexusClusterConnectionStatus` metric integrates into the Cluster's Azure Resource Health.
71+
If you use the `NexusClusterConnectionStatus` metric directly, understand how it functions and what it represents.
72+
73+
The Cluster Manager, not the on-premises Cluster, emits the metric based on the `ClusterConnectionStatus` property.
74+
A pod running on the on-premises Cluster sends heartbeat messages to the Cluster Manager through the infrastructure proxy.
75+
The metric emits a value of "1" for all time series, starting from when the Cluster resource's connectionStatus is set for the first time.
76+
The metric emitting process never sends "0" values. Any "0" values seen in graphs are due to graphing tools filling gaps.
77+
The detection of state changes requires the Cluster Manager's reconciliation process to update the Cluster resource's `ClusterConnectionStatus` property accordingly.
78+
79+
There might be a delay between the actual loss of heartbeats and the metric reflecting the `Disconnected` state, due to the reconciliation loop and other operational factors.
80+
The `NexusClusterConnectionStatus` metric is used as a health indicator for the Cluster, but delays in status changes can occur due to reconciliation timing and operational constraints.
81+
Timeout events can occur if heartbeats aren't received within a 2-minute threshold, but a single successful heartbeat resets the timer.
82+
The status can transition between `Connected`, `Timeout`, and `Disconnected` based on heartbeat activity.
83+
84+
The image shows a general representation of the components responsible for emitting the `NexusClusterConnectionStatus` metric.
85+
86+
:::image type="content" source="media/troubleshoot-cluster-heartbeat-connection-status/cluster-connection-status-components-for-metric.png" alt-text="Diagram that shows the components responsible for emitting the NexusClusterConnectionStatus metric." lightbox="media/troubleshoot-cluster-heartbeat-connection-status/cluster-connection-status-components-for-metric.png":::
87+
88+
### ClusterConnectionStatus isn't the same as Arc Connected Cluster status
89+
90+
The Cluster's `ClusterConnectionStatus` and Arc Connected Cluster status are separate signals and shouldn't be treated interchangeably.
91+
Although the two signals aren't related, both rely on network connectivity for the Cluster.
92+
It's possible for a Cluster to be Arc `Disconnected` but still have a Heartbeat Status of `Connected`.
93+
Both signals depend on network connectivity, but they serve different purposes and are managed by different systems.
94+
95+
## Common investigation steps
96+
97+
Infrastructure networking issues, permission changes in the Managed Identity, or other issues that might not be obvious at first can affect the Cluster resource connection status.
98+
The following sections provide some common investigation steps and references to help troubleshoot.
99+
100+
> [!IMPORTANT]
101+
> The `ClusterConnectionStatus` indicates general instability, not the root cause.
102+
> This guide provides general resource health checks that might help locate the problem or at least help collect information useful for customer support.
103+
104+
### Cluster Network Fabric health and connectivity
105+
106+
It's useful to start with the Network Fabric [controller][Network Fabric Controller] and [services][Network Fabric Services] resources.
107+
Verify the [network configuration][How to Configure Network Fabric] or any other network-related settings that might be affecting the connectivity.
108+
Verify the physical network setup including rack cabling, IP addresses, DNS settings, routing rules, firewall rules, etc.
109+
110+
[How to Configure Network Fabric]: ./howto-configure-network-fabric.md
111+
[Network Fabric Controller]: ./concepts-network-fabric-controller.md
112+
[Network Fabric Services]: ./concepts-network-fabric-services.md
113+
114+
Evaluate any configured monitoring or metrics for the Network Fabric resources.
115+
For more information, see the following links:
116+
117+
- [Nexus Network Fabric configuration monitoring overview](./concepts-network-fabric-configuration-monitoring.md)
118+
- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric](./howto-configure-diagnostic-settings-monitor-configuration-differences.md)
119+
- [Azure Operator Nexus Network Fabric internal network BGP metrics](./concepts-internal-network-bgp-metrics.md)
120+
- [How to monitor interface In and Out packet rate for network fabric devices](./howto-monitor-interface-packet-rate.md)
121+
122+
### Recent changes to the Managed Identity permissions
123+
124+
Changes to the Managed Identity permissions for the Cluster Manager or Cluster can affect the Cluster's ability to authenticate against the Cluster Manager.
125+
The Managed Identities (MI) and their permissions are used for service-to-service authentication.
126+
A change in the permissions results in authentication failures for the heartbeat messages.
127+
Even when network connectivity is healthy, the Cluster's `ClusterConnectionStatus` shows `Disconnected` when heartbeats aren't successfully received and acknowledged.
128+
129+
### Check control-plane BareMetal Machines health
130+
131+
The control-plane BareMetal Machines host the component that emits the heartbeats to the Cluster Manager.
132+
In most cases, the pods running on the control-plane reschedule automatically to a different BareMetal Machine within the control-plane node pool.
133+
However, if the BareMetal Machines aren't healthy, the pods can't reschedule and the Cluster is unable to send heartbeats.
134+
135+
To check the BareMetal Machines, use the following command:
136+
137+
```azurecli
138+
az networkcloud baremetalmachine list \
139+
--resource-group "$CLUSTER_RG" \
140+
--cluster-name "$CLUSTER_NAME" \
141+
--subscription "$SUBSCRIPTION_ID" \
142+
--output table
143+
```
144+
145+
Review the status of the control-plane BareMetal Machines. If any are unhealthy or unavailable, investigate further or contact support.
146+
147+
[!include[stillHavingIssues](./includes/contact-support.md)]

articles/operator-nexus/troubleshoot-control-plane-quorum.md

Lines changed: 22 additions & 22 deletions
@@ -1,10 +1,10 @@
11
---
2-
title: Troubleshoot control plane quorum loss
3-
description: Learn how to restore control plane quorum loss.
2+
title: Troubleshoot control plane quorum loss when multiple nodes are offline
3+
description: Learn how to restore control plane quorum loss when multiple nodes are offline.
44
ms.topic: article
5-
ms.date: 01/18/2024
6-
author: matthewernst
5+
ms.date: 07/02/2025
76
ms.author: matthewernst
7+
author: matternst7258
88
ms.service: azure-operator-nexus
99
---
1010

@@ -34,29 +34,29 @@ Follow the steps in this troubleshooting article when multiple control plane nod
3434
- Sign in to the identified server.
3535
- Ensure that the ironic-conductor service is present on this node by using `crictl ps -a |grep -i ironic-conductor`. Here's example output:
3636

37-
~~~
38-
testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
39-
<id> <id> 6 hours ago Running ironic-conductor 0 <id>
40-
~~~
37+
```shell
38+
testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
39+
<id> <id> 6 hours ago Running ironic-conductor 0 <id>
40+
```
4141

4242
1. Determine the integrated Dell remote access controller (iDRAC) IP of the server:
4343
- Run the command `az networkcloud cluster list -g <RG_Name>`.
4444
- The output of the command is JSON with the iDRAC IP.
4545

46-
~~~
47-
{
48-
"bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
49-
"bmcCredentials": {
50-
"username": "<username>"
51-
},
52-
"bmcMacAddress": "<bmcMacAddress>",
53-
"bootMacAddress": "<bootMacAddress",
54-
"machineDetails": "extraDetails",
55-
"machineName": "<machineName>",
56-
"rackSlot": <rackSlot>,
57-
"serialNumber": "<serialNumber>"
58-
},
59-
~~~
46+
```json
47+
{
48+
"bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
49+
"bmcCredentials": {
50+
"username": "<username>"
51+
},
52+
"bmcMacAddress": "<bmcMacAddress>",
53+
"bootMacAddress": "<bootMacAddress>",
54+
"machineDetails": "extraDetails",
55+
"machineName": "<machineName>",
56+
"rackSlot": "<rackSlot>",
57+
"serialNumber": "<serialNumber>"
58+
},
59+
```
6060

6161
1. Access the integrated iDRAC graphical user interface (GUI) by using the IP in your browser to shut down affected management servers.
6262

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
---
2+
title: Troubleshoot Azure Operator Nexus resource health alerts
3+
titleSuffix: Azure Operator Nexus
4+
description: Find troubleshooting guides for platform-emitted resource health alerts.
5+
ms.service: azure-operator-nexus
6+
ms.custom: troubleshooting
7+
ms.topic: troubleshooting
8+
ms.date: 07/02/2025
9+
ms.author: omarrivera
10+
author: omarrivera
11+
---
12+
13+
# Troubleshoot resource health alerts
14+
15+
This guide provides a breakdown of the resource health alerts emitted by the Azure Operator Nexus platform.
16+
It includes a description of each alert and links to troubleshooting guides for each alert.
17+
18+
Resource health alerts are emitted by the platform to indicate the health of a particular resource.
19+
These alerts are generated based on the status of the resource and its dependencies.
20+
21+
## Cluster
22+
23+
| Resource Health Event Name | Troubleshooting Guide |
24+
|--------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
25+
| `1PExtensionsFailedInstall` | [Contact support](#please-contact-support) |
26+
| `ClusterHeartbeatConnectionStatusDisconnectedClusterManagerOperationsAreAffectedPossibleNetworkIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected] |
27+
| `ClusterHeartbeatConnectionStatusTimedoutPossiblePerformanceIssues` | [Troubleshoot Cluster heartbeat connection status shows disconnected] |
28+
29+
[Troubleshoot Cluster heartbeat connection status shows disconnected]: ./troubleshoot-cluster-heartbeat-connection-status-disconnected.md
30+
31+
## Please contact support
32+
33+
For some resource health alerts, troubleshooting guides aren't available.
34+
If you encounter these alerts, [contact Azure support] for further assistance.
35+
36+
[contact Azure support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
37+
38+
[!include[stillHavingIssues](./includes/contact-support.md)]
