# Troubleshoot Cluster heartbeat connection status shows disconnected
12
13
13
14
This guide provides steps to troubleshoot a Cluster with a `clusterConnectionStatus` in the `Disconnected` state.
14
15
For a Cluster, the `ClusterConnectionStatus` represents the stability of the connection between the on-premises Cluster and the Cluster Manager.
@@ -17,27 +18,22 @@ For a Cluster, the `ClusterConnectionStatus` represents the stability in the con
17
18
> The `ClusterConnectionStatus` **doesn't** represent, and isn't related to, the health or connectivity of the Arc Connected Kubernetes Cluster.
18
19
> The `ClusterConnectionStatus` indicates that the Cluster is successful in sending heartbeats and receiving acknowledgment from the Cluster Manager.
19
20
20
-
> [!CAUTION]
21
-
> The information the `ClusterConnectionStatus` provides is an indication of a symptom of instability, not the root cause.
22
-
> This guide focuses on identifying basic signals and components that might help locate the problem but might not cover all scenarios.
## Understanding the ClusterConnectionStatus signal
23
+
## Understanding the Cluster connection status signal
27
24
28
-
The `ClusterConnectionStatus` represents the ability for the on-premises Cluster to successfully send heartbeats and receive acknowledgments from the Cluster Manager.
29
-
The continuous heartbeat messages are meant to detect the network connection health between the on-premises Cluster and the corresponding Cluster Manager.
30
-
The `ClusterConnectionStatus`**isn't** the same as the connectivity of the Arc Connected Kubernetes Cluster.
31
-
If there's network related issues, it's possible that the Arc Connected Kubernetes Cluster might also be affected.
25
+
The `ClusterConnectionStatus` represents the ability of the on-premises Cluster to send heartbeats and receive acknowledgments from the Cluster Manager, indicating the health of the network connection between them.
26
+
`ClusterConnectionStatus` is distinct from the connectivity of the Arc Connected Kubernetes Cluster, though network issues may affect both.
32
27
33
28
A Cluster resource has the property `ClusterConnectionStatus`, which is set to `Connected` while heartbeats are continuously received and acknowledged.
34
29
The `ClusterConnectionStatus` becomes `Connected` once the Cluster is in a healthy state and network connectivity issues are resolved.
35
30
The Cluster shows `Timeout` only as a transitional state between `Connected` and `Disconnected`.
36
31
The Cluster `ClusterConnectionStatus` value becomes `Disconnected` when the Cluster Manager detects continuously missed heartbeats.
32
+
Once the Cluster is in a healthy state and there are no network connectivity issues, the `ClusterConnectionStatus` automatically moves to `Connected`.
37
33
38
34
During the Cluster deployment process, the Cluster is in `Undefined` state until the Cluster is fully deployed and operational.
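The status values described above can be read as a small decision rule. The following is a minimal sketch only, not the Cluster Manager's actual logic: the function name `classify_status` and both missed-heartbeat thresholds are hypothetical, since this guide does not state how many missed heartbeats trigger each state.

```python
from enum import Enum


class ConnectionStatus(Enum):
    """The four ClusterConnectionStatus values named in this guide."""
    UNDEFINED = "Undefined"
    CONNECTED = "Connected"
    TIMEOUT = "Timeout"
    DISCONNECTED = "Disconnected"


# Hypothetical thresholds, chosen only for illustration.
TIMEOUT_AFTER_MISSED = 1
DISCONNECTED_AFTER_MISSED = 5


def classify_status(deployed: bool, consecutive_missed: int) -> ConnectionStatus:
    """Map deployment state and missed heartbeats to a status value."""
    if not deployed:
        # The Cluster stays Undefined until it is fully deployed and operational.
        return ConnectionStatus.UNDEFINED
    if consecutive_missed >= DISCONNECTED_AFTER_MISSED:
        # Continuously missed heartbeats.
        return ConnectionStatus.DISCONNECTED
    if consecutive_missed >= TIMEOUT_AFTER_MISSED:
        # Transitional state between Connected and Disconnected.
        return ConnectionStatus.TIMEOUT
    return ConnectionStatus.CONNECTED
```

Because the counter resets when heartbeats resume, the same rule also covers the automatic return to `Connected` once connectivity is restored.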
39
35
40
-
The following table shows which status is displayed depending on the state of the undercloud cluster:
36
+
The following table shows the possible values of `ClusterConnectionStatus` and their definitions:
The Cluster resource might be affected by infrastructure networking issues (such as DNS, BGP, InfraProxy, etc.), permission changes in the Managed Identity, or other issues that might not be obvious at first.
69
+
The following sections provide some common investigation steps and references to help troubleshoot.
70
+
71
+
> [!IMPORTANT]
72
+
> The `ClusterConnectionStatus` indicates general instability, not the root cause.
73
+
> This guide provides general resource health checks that might help locate the problem or at least help collect information useful for customer support.
74
+
75
+
### Cluster Network Fabric health and connectivity
71
76
72
-
### 1. Ensure Network Connectivity for the Cluster
77
+
It is useful to start with the Network Fabric [controller][Network Fabric Controller] and [services][Network Fabric Services] resources.
78
+
Verify the [network configuration][How to Configure Network Fabric], firewall rules, and any other network-related settings that might be affecting the connectivity.
79
+
Ensure there have not been any recent cabling or network configuration changes that could affect the network connectivity.
73
80
74
-
TODO - what steps could be done here?
81
+
[How to Configure Network Fabric]: https://learn.microsoft.com/en-us/azure/operator-nexus/howto-configure-network-fabric
-[How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric](https://learn.microsoft.com/en-us/azure/operator-nexus/howto-configure-diagnostic-settings-monitor-configuration-differences)
-[How to monitor interface In and Out packet rate for network fabric devices](https://learn.microsoft.com/en-us/azure/operator-nexus/howto-monitor-interface-packet-rate)
91
+
92
+
### Recent changes to the Managed Identity permissions
77
93
78
94
- Are there recent changes to the Managed Identity permissions for the Cluster Manager or Cluster?
79
95
- The Managed Identities (MI) and their permissions are used for service-to-service authentication. A change in the permissions results in authentication failures for the heartbeat messages. Cluster Managers must both receive and acknowledge heartbeats; failure to do either also results in a `ClusterConnectionStatus` of `Disconnected`.
80
96
81
-
If the Cluster is expected to be healthy but the `ClusterConnectionStatus` remains in `Disconnected` state [contact support] after following the steps in this guide.
The control-plane BareMetal Machines host the component that emits the heartbeats to the Cluster Manager.
100
+
In most cases, the pods running on the control-plane will reschedule automatically to a different BareMetal Machine within the control-plane node pool.
101
+
However, if the BareMetal Machines are not healthy, the pods will not be able to reschedule and the Cluster will be unable to send heartbeats.
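As a rough illustration of that health check, the sketch below filters a list of BareMetal Machine records for control-plane machines that look unhealthy. It assumes the records have already been exported as JSON-like dictionaries. The field names `machineRoles`, `readyState`, and `powerState`, the `control-plane` role marker, and the helper name are assumptions made for this example and should be checked against the actual resource output.

```python
def unhealthy_control_plane_machines(machines):
    """Return the names of control-plane machines that are not ready or not powered on.

    `machines` is a list of dicts. The keys used here are assumed field
    names, not a confirmed schema.
    """
    flagged = []
    for machine in machines:
        roles = machine.get("machineRoles") or []
        # Assumption: control-plane machines carry a role string containing "control-plane".
        if not any("control-plane" in role for role in roles):
            continue
        ready = str(machine.get("readyState", "")).lower() == "true"
        powered_on = str(machine.get("powerState", "")).lower() == "on"
        if not (ready and powered_on):
            flagged.append(machine.get("name", "<unnamed>"))
    return flagged
```

If the function returns every control-plane machine, no healthy machine is left for the heartbeat pods to reschedule onto, which matches the failure case described above.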
102
+
103
+
To check the BareMetal Machines, use the following command: