Commit 352d828

updates to the heartbeat connection status
1 parent 03118e0 commit 352d828

2 files changed, +46 -19 lines changed

articles/operator-nexus/includes/contact-support.md

Lines changed: 6 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -9,7 +9,12 @@ ms.service: azure-operator-nexus
99
## Still Having Issues?
1010

1111
If the steps outlined didn't provide a path to resolve the issue or if you still have questions [contact support].
12+
Please provide as much detail as possible about the issue you're experiencing, including any error messages or logs that may be relevant.
13+
This will help the support team to assist you more effectively.
14+
15+
You can open a support request through the [Azure portal][contact support].
16+
1217
For more information about support plans, see [Azure Support plans].
1318

1419
[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
15-
[Azure Support plans]: https://azure.microsoft.com/support/plans/response/
20+
[Azure Support plans]: https://azure.microsoft.com/support/plans/response/

articles/operator-nexus/troubleshoot-cluster-heartbeat-connection-status-disconnected.md

Lines changed: 40 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,8 @@ ms.date: 04/28/2025
88
ms.author: omarrivera
99
author: omarrivera
1010
---
11-
# Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected
11+
12+
# Troubleshoot Cluster heartbeat connection status shows disconnected
1213

1314
This guide attempts to provide steps to troubleshoot a Cluster with a `clusterConnectionStatus` in `Disconnected` state.
1415
For a Cluster, the `ClusterConnectionStatus` represents the stability in the connection between the on-premises Cluster and its ability to reach the Cluster Manager.
@@ -17,27 +18,22 @@ For a Cluster, the `ClusterConnectionStatus` represents the stability in the con
1718
> The `ClusterConnectionStatus` **doesn't** represent, and isn't related to, the health or connectivity of the Arc Connected Kubernetes Cluster.
1819
> The `ClusterConnectionStatus` indicates that the Cluster is successful in sending heartbeats and receiving acknowledgment from the Cluster Manager.
1920
20-
> [!CAUTION]
21-
> The information the `ClusterConnectionStatus` provides is an indication of a symptom of instability, not the root cause.
22-
> This guide focuses on identifying basic signals and components that might help locate the problem but might not cover all scenarios.
23-
2421
[!include[prereq-az-cli](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]
2522

26-
## Understanding the ClusterConnectionStatus signal
23+
## Understanding the Cluster connection status signal
2724

28-
The `ClusterConnectionStatus` represents the ability for the on-premises Cluster to successfully send heartbeats and receive acknowledgments from the Cluster Manager.
29-
The continuous heartbeat messages are meant to detect the network connection health between the on-premises Cluster and the corresponding Cluster Manager.
30-
The `ClusterConnectionStatus` **isn't** the same as the connectivity of the Arc Connected Kubernetes Cluster.
31-
If there's network related issues, it's possible that the Arc Connected Kubernetes Cluster might also be affected.
25+
The `ClusterConnectionStatus` represents the ability of the on-premises Cluster to send heartbeats and receive acknowledgments from the Cluster Manager, indicating the health of the network connection between them.
26+
`ClusterConnectionStatus` is distinct from the connectivity of the Arc Connected Kubernetes Cluster, though network issues may affect both.
3227

3328
A Cluster resource has the property `ClusterConnectionStatus` which is set to the value `Connected` as the heartbeats are continuously received and acknowledged.
3429
The `ClusterConnectionStatus` becomes `Connected` once the Cluster is in a healthy state and network connectivity issues are resolved.
3530
The Cluster shows `Timeout` only as a transitional state between `Connected` and `Disconnected`.
3631
The Cluster `ClusterConnectionStatus` value becomes `Disconnected` as Cluster Manager detects continuously missed heartbeats.
32+
Once the Cluster is in a healthy state and there are no network connectivity issues, the `ClusterConnectionStatus` automatically moves to `Connected`.
3733

3834
During the Cluster deployment process, the Cluster is in `Undefined` state until the Cluster is fully deployed and operational.
3935

40-
The following table shows which status is displayed depending on the state of the undercloud cluster:
36+
The following table shows the possible values of `ClusterConnectionStatus` and their definitions:
4137

4238
| Status | Definition |
4339
|----------------|-----------------------------------------------------------------------------------------------------------------------|
@@ -67,19 +63,45 @@ ClusterConnectionStatus
6763
Connected
6864
```
6965

70-
## Basic Investigation Steps
66+
## Common investigation steps
67+
68+
The Cluster resource might be affected by infrastructure networking issues (such as DNS, BGP, InfraProxy, etc.), permission changes in the Managed Identity, or other issues that might not be obvious at first.
69+
The following sections provide some common investigation steps and references to help troubleshoot.
70+
71+
> [!IMPORTANT]
72+
> The `ClusterConnectionStatus` indicates general instability, not the root cause.
73+
> This guide provides general resource health checks that might help locate the problem or at least help collect information useful for customer support.
74+
75+
### Cluster Network Fabric health and connectivity
7176

72-
### 1. Ensure Network Connectivity for the Cluster
77+
It is useful to start with the Network Fabric [controller][Network Fabric Controller] and [services][Network Fabric Services] resources.
78+
Verify the [network configuration][How to Configure Network Fabric], firewall rules, and any other network-related settings that might be affecting the connectivity.
79+
Ensure there have not been any recent cabling or network configuration changes that could affect the network connectivity.
7380

74-
TODO - what steps could be done here?
81+
[How to Configure Network Fabric]: https://learn.microsoft.com/en-us/azure/operator-nexus/howto-configure-network-fabric
82+
[Network Fabric Controller]: https://learn.microsoft.com/en-us/azure/operator-nexus/concepts-network-fabric-controller
83+
[Network Fabric Services]: https://learn.microsoft.com/en-us/azure/operator-nexus/concepts-network-fabric-services
7584

76-
### Other possible causes to evaluate
85+
Evaluate any configured monitoring or metrics for the Network Fabric resources.
86+
See the following links for more information:
87+
- [Nexus Network Fabric configuration monitoring overview](https://learn.microsoft.com/en-us/azure/operator-nexus/concepts-network-fabric-configuration-monitoring)
88+
- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric](https://learn.microsoft.com/en-us/azure/operator-nexus/howto-configure-diagnostic-settings-monitor-configuration-differences)
89+
- [Azure Operator Nexus Network Fabric internal network BGP metrics](https://learn.microsoft.com/en-us/azure/operator-nexus/concepts-internal-network-bgp-metrics)
90+
- [How to monitor interface In and Out packet rate for network fabric devices](https://learn.microsoft.com/en-us/azure/operator-nexus/howto-monitor-interface-packet-rate)
91+
92+
### Recent changes to the Managed Identity permissions
7793

7894
- Are there recent changes to the Managed Identity permissions for the Cluster Manager or Cluster?
7995
- The Managed Identities (MI) and their permissions are used for service-to-service authentication. A change in the permissions results in authentication failures for the heartbeat messages. Cluster Managers must both receive and acknowledge heartbeats; failure to do so also results in a `ClusterConnectionStatus` of `Disconnected`.
8096

81-
If the Cluster is expected to be healthy but the `ClusterConnectionStatus` remains in `Disconnected` state [contact support] after following the steps in this guide.
97+
### Check control-plane BareMetal Machines health
8298

83-
[!include[stillHavingIssues](./includes/contact-support.md)]
99+
The control-plane BareMetal Machines host the component that emits the heartbeats to the Cluster Manager.
100+
In most cases, the pods running on the control-plane will reschedule automatically to a different BareMetal Machine within the control-plane node pool.
101+
However, if the BareMetal Machines are not healthy, the pods will not be able to reschedule and the Cluster will be unable to send heartbeats.
102+
103+
To check the BareMetal Machines, use the following command:
84104

85-
[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
105+
**TBD**: Need to add the command to check BareMetal Machines
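
Until the command above is filled in, the following is a minimal sketch of how the output of a BareMetal Machine listing could be triaged. It assumes the JSON comes from something like `az networkcloud baremetalmachine list --resource-group <resource-group> --output json`, and that each machine exposes `name`, `readyState`, `powerState`, and `cordonStatus` fields. The command, the field names, and the helper `unhealthy_machines` are assumptions that are not confirmed by this guide; verify them against the installed CLI extension before relying on them.

```python
import json


def unhealthy_machines(cli_json: str) -> list[str]:
    """Return names of machines that look unable to host heartbeat pods.

    Assumed (unverified) fields per machine: name, readyState,
    powerState, cordonStatus. Fields may sit at the top level or
    under a nested "properties" object, depending on the output shape.
    """
    flagged = []
    for machine in json.loads(cli_json):
        # Fall back to the top-level object when "properties" is absent.
        props = machine.get("properties") or machine
        ready = str(props.get("readyState", "")).lower() == "true"
        powered = str(props.get("powerState", "")).lower() == "on"
        cordoned = str(props.get("cordonStatus", "")).lower() == "cordoned"
        # A machine is flagged if it is not ready, not powered on, or cordoned.
        if not (ready and powered) or cordoned:
            flagged.append(machine.get("name", "<unnamed>"))
    return flagged
```

Any machine this sketch flags is a candidate for further inspection. It doesn't distinguish control-plane machines from other roles, so filter the input to the control-plane node pool first.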
106+
107+
[!include[stillHavingIssues](./includes/contact-support.md)]
