articles/operator-nexus/troubleshoot-cluster-heartbeat-connection-status-disconnected.md
46 additions & 9 deletions
@@ -4,14 +4,14 @@ description: Provide steps to investigate and possibly resolve circumstances tha
4
4
ms.service: azure-operator-nexus
5
5
ms.custom: troubleshooting
6
6
ms.topic: troubleshooting
7
-
ms.date: 04/28/2025
7
+
ms.date: 07/02/2025
8
8
ms.author: omarrivera
9
9
author: omarrivera
10
10
---
11
11
12
12
# Troubleshoot Cluster heartbeat connection status shows disconnected
13
13
14
-
This guide attempts to provide steps to troubleshoot a Cluster with a `clusterConnectionStatus` in `Disconnected` state.
14
+
This guide provides steps to troubleshoot a Cluster whose `ClusterConnectionStatus` is in the `Disconnected` state.
15
15
For a Cluster, the `ClusterConnectionStatus` represents the stability of the connection between the on-premises Cluster and the Cluster Manager.
16
16
17
17
> [!IMPORTANT]
@@ -25,7 +25,7 @@ For a Cluster, the `ClusterConnectionStatus` represents the stability in the con
25
25
The `ClusterConnectionStatus` represents the ability of the on-premises Cluster to send heartbeats and receive acknowledgments from the Cluster Manager, indicating the health of the network connection between them.
26
26
`ClusterConnectionStatus` is distinct from the connectivity of the Arc Connected Kubernetes Cluster, though network issues affect both.
27
27
28
-
A Cluster resource has the property `ClusterConnectionStatus`which is set to the value `Connected` as the heartbeats are continuously received and acknowledged.
28
+
A Cluster resource has the property `ClusterConnectionStatus` set to the value `Connected` while heartbeats are continuously received and acknowledged.
29
29
The `ClusterConnectionStatus` becomes `Connected` once the Cluster is in a healthy state and network connectivity issues are resolved.
30
30
The Cluster shows `Timeout` only as a transitional state between `Connected` and `Disconnected`.
31
31
The Cluster's `ClusterConnectionStatus` value becomes `Disconnected` when the Cluster Manager detects continuously missed heartbeats.
@@ -38,15 +38,15 @@ The following table shows the possible values of `ClusterConnectionStatus` and t
Or, you can use the Azure CLI to see the value of `ClusterConnectionStatus`:
52
52
@@ -63,6 +63,34 @@ ClusterConnectionStatus
63
63
Connected
64
64
```
65
65
66
+
## Understanding the NexusClusterConnectionStatus metric
67
+
68
+
Use Azure Resource Health to build alerts for cluster health, as it provides a comprehensive and supported view of resource status.
69
+
The `NexusClusterConnectionStatus` metric integrates into the Cluster's Azure Resource Health.
70
+
If you use the `NexusClusterConnectionStatus` metric directly, understand how it functions and what it represents.
71
+
72
+
The Cluster Manager, not the on-premises Cluster, emits the metric based on the `ClusterConnectionStatus` property.
73
+
A pod running on the on-premises Cluster sends heartbeat messages to the Cluster Manager through the infrastructure proxy.
74
+
The metric emits a value of "1" for all time series, starting from when the Cluster resource's connectionStatus is set for the first time.
75
+
The metric emitting process never sends "0" values. Any "0" values seen in graphs are due to graphing tools filling gaps.
76
+
The detection of state changes requires the Cluster Manager's reconciliation process to update the Cluster resource's `ClusterConnectionStatus` property accordingly.
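
Because the emitter only ever sends "1", it can help to separate real samples from gap-filled points before reasoning about the metric. The following Python sketch illustrates the idea. The `(timestamp, value)` point shape and the function name are hypothetical, chosen for illustration only, and don't reflect any specific Azure Monitor API.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical point shape: (timestamp, value). A value of None or 0 is
# assumed to come from a graphing tool filling a gap, not from the emitter.
Point = Tuple[datetime, Optional[float]]


def split_emitted_and_gaps(points: List[Point]) -> Tuple[List[datetime], List[datetime]]:
    """Separate timestamps with an emitted sample (value 1) from gap-filled ones."""
    emitted: List[datetime] = []
    gaps: List[datetime] = []
    for timestamp, value in points:
        if value == 1:
            emitted.append(timestamp)
        else:
            # The emitting process never sends 0, so treat this as "no data".
            gaps.append(timestamp)
    return emitted, gaps


if __name__ == "__main__":
    sample = [
        (datetime(2025, 7, 2, 12, 0), 1),
        (datetime(2025, 7, 2, 12, 1), 0),     # filled by the graphing tool
        (datetime(2025, 7, 2, 12, 2), None),  # no data returned
        (datetime(2025, 7, 2, 12, 3), 1),
    ]
    emitted, gaps = split_emitted_and_gaps(sample)
    print(f"{len(emitted)} emitted samples, {len(gaps)} gap-filled points")
```

Treating gaps as "no data" rather than as a zero reading avoids interpreting a charting artifact as a value the Cluster Manager reported.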
77
+
78
+
There might be a delay between the actual loss of heartbeats and the metric reflecting the `Disconnected` state, due to the reconciliation loop and other operational factors.
79
+
The `NexusClusterConnectionStatus` metric is used as a health indicator for the cluster, but delays in status changes can occur due to reconciliation timing and operational constraints.
80
+
Timeout events can occur if heartbeats aren't received within a 2-minute threshold, but a single successful heartbeat resets the timer.
81
+
The status can transition between `Connected`, `Timeout`, and `Disconnected` based on heartbeat activity.
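
The transitions described above can be sketched as a small Python function. This is a simplified model under stated assumptions, not the Cluster Manager's implementation: the 2-minute `Timeout` threshold comes from this article, while `DISCONNECT_AFTER` is a hypothetical placeholder because the article doesn't state when `Timeout` becomes `Disconnected`. The sketch also ignores the reconciliation delay.

```python
from datetime import datetime, timedelta

# From the article: Timeout can occur when no heartbeat arrives within 2 minutes.
TIMEOUT_AFTER = timedelta(minutes=2)
# ASSUMPTION: illustrative value only; the real threshold isn't documented here.
DISCONNECT_AFTER = timedelta(minutes=5)


def connection_status(
    last_heartbeat: datetime,
    now: datetime,
    timeout_after: timedelta = TIMEOUT_AFTER,
    disconnect_after: timedelta = DISCONNECT_AFTER,
) -> str:
    """Derive a status from the time elapsed since the last successful heartbeat."""
    silence = now - last_heartbeat
    if silence < timeout_after:
        return "Connected"
    if silence < disconnect_after:
        # Transitional state between Connected and Disconnected.
        return "Timeout"
    return "Disconnected"


if __name__ == "__main__":
    last = datetime(2025, 7, 2, 12, 0)
    for minutes in (1, 3, 10):
        print(minutes, connection_status(last, last + timedelta(minutes=minutes)))
    # A single successful heartbeat resets the timer.
    last = datetime(2025, 7, 2, 12, 9)
    print(connection_status(last, datetime(2025, 7, 2, 12, 10)))
```

Keeping only the timestamp of the last successful heartbeat is enough to model the reset behavior: any new heartbeat moves `last_heartbeat` forward and the status returns to `Connected`.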
82
+
83
+
The image shows a general representation of the components responsible for emitting the `NexusClusterConnectionStatus` metric.
### ClusterConnectionStatus isn't the same as Arc Connected Cluster status
88
+
89
+
The Cluster's `ClusterConnectionStatus` and Arc Connected Cluster status are separate signals and shouldn't be treated interchangeably.
90
+
Although the two signals aren't related, both rely on network connectivity for the Cluster.
91
+
It's possible for a Cluster to be Arc `Disconnected` but still have a `ClusterConnectionStatus` of `Connected`.
92
+
Both signals depend on network connectivity, but they serve different purposes and are managed by different systems.
93
+
66
94
## Common investigation steps
67
95
68
96
Infrastructure networking issues, permission changes in the Managed Identity, or other issues that might not be obvious at first can affect the Cluster resource connection status.
@@ -75,7 +103,8 @@ The following sections provide some common investigation steps and references to
75
103
### Cluster Network Fabric health and connectivity
76
104
77
105
It's useful to start with the Network Fabric [controller][Network Fabric Controller] and [services][Network Fabric Services] resources.
78
-
Verify the [network configuration][How to Configure Network Fabric], including rack cabling, IP addresses, DNS settings, routing rules, firewall rules, and any other network-related settings that might be affecting the connectivity.
106
+
Verify the [network configuration][How to Configure Network Fabric] and any other network-related settings that might be affecting the connectivity.
107
+
Verify the physical network setup, including rack cabling, IP addresses, DNS settings, routing rules, and firewall rules.
79
108
80
109
[How to Configure Network Fabric]: ./howto-configure-network-fabric.md