|
| 1 | +--- |
| 2 | +title: Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected |
| 3 | +description: Provide steps to investigate and possibly resolve circumstances that are preventing the Cluster from sending heartbeats to the Cluster Manager. |
| 4 | +ms.service: azure-operator-nexus |
| 5 | +ms.custom: troubleshooting |
| 6 | +ms.topic: troubleshooting |
| 7 | +ms.date: 07/02/2025 |
| 8 | +ms.author: omarrivera |
| 9 | +author: omarrivera |
| 10 | +--- |
| 11 | + |
| 12 | +# Troubleshoot Cluster heartbeat connection status shows disconnected |
| 13 | + |
| 14 | +This guide describes steps to troubleshoot a Cluster with a `ClusterConnectionStatus` in `Disconnected` state. |
| 15 | +For a Cluster, the `ClusterConnectionStatus` represents the stability of the connection between the on-premises Cluster and the Cluster Manager. |
| 16 | + |
| 17 | +> [!IMPORTANT] |
| 18 | +> The `ClusterConnectionStatus` **doesn't** represent, and isn't related to, the health or connectivity of the Arc Connected Kubernetes Cluster. |
| 19 | +> The `ClusterConnectionStatus` indicates whether the Cluster is successfully sending heartbeats to the Cluster Manager and receiving acknowledgments from it. |
| 20 | + |
| 21 | +[!include[prereqAzCLI](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)] |
| 22 | + |
| 23 | +## Understanding the Cluster connection status signal |
| 24 | + |
| 25 | +The `ClusterConnectionStatus` represents the ability of the on-premises Cluster to send heartbeats and receive acknowledgments from the Cluster Manager, indicating the health of the network connection between them. |
| 26 | +`ClusterConnectionStatus` is distinct from the connectivity of the Arc Connected Kubernetes Cluster, though network issues affect both. |
| 27 | + |
| 28 | +A Cluster resource has the property `ClusterConnectionStatus` set to the value `Connected` if the heartbeats are continuously received and acknowledged. |
| 29 | +The `ClusterConnectionStatus` becomes `Connected` once the Cluster is in a healthy state and network connectivity issues are resolved. |
| 30 | +The Cluster shows `Timeout` only as a transitional state between `Connected` and `Disconnected`. |
| 31 | +The Cluster `ClusterConnectionStatus` value becomes `Disconnected` if the Cluster Manager detects continuously missed heartbeats. |
| 32 | +Heartbeats are considered missed if they aren't received within the specified time thresholds. |
| 33 | +Once the Cluster is in a healthy state and there are no network connectivity issues, the `ClusterConnectionStatus` automatically returns to `Connected`. |
| 34 | + |
| 35 | +During the Cluster deployment process, the Cluster is in an `Undefined` state until the Cluster is fully deployed and operational. |
| 36 | + |
| 37 | +The following table shows the possible values of `ClusterConnectionStatus` and their definitions: |
| 38 | + |
| 39 | +| Status | Definition | |
| 40 | +|----------------|-----------------------------------------------------------------------------------------------------------------------| |
| 41 | +| `Connected` | Heartbeats received, indicates healthy Cluster and Cluster Manager connectivity | |
| 42 | +| `Disconnected` | Heartbeats missed for **over 5 minutes**, indicates likely connectivity issue between Cluster Manager and Cluster | |
| 43 | +| `Timeout`      | Heartbeats missed for **over 2 minutes but less than 5 minutes**, Cluster connectivity uncertain, possibly degraded   |
| 44 | +| `Undefined` | Cluster not yet deployed or running a version without the heartbeats feature | |
| 45 | + |
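The thresholds in the table can be read as a simple mapping from the time since the last acknowledged heartbeat to a status value. The following is a minimal sketch of that mapping, not the Cluster Manager's implementation. The function name `classify_heartbeat_age` is hypothetical, and the handling of exact boundary values (for example, exactly 300 seconds) is an assumption because the table doesn't specify it. The `Undefined` state isn't covered because it doesn't depend on heartbeat timing.

```bash
# Illustrative only: maps the number of seconds since the last acknowledged
# heartbeat to the status values described in the table.
# Assumption: boundaries are exclusive ("over" 2 minutes, "over" 5 minutes).
classify_heartbeat_age() {
  age_seconds="$1"
  if [ "$age_seconds" -gt 300 ]; then
    # Heartbeats missed for over 5 minutes
    echo "Disconnected"
  elif [ "$age_seconds" -gt 120 ]; then
    # Heartbeats missed for over 2 minutes but not over 5 minutes
    echo "Timeout"
  else
    echo "Connected"
  fi
}

classify_heartbeat_age 30
classify_heartbeat_age 180
classify_heartbeat_age 600
```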
| 46 | +## Check the value of the Cluster's ClusterConnectionStatus property |
| 47 | + |
| 48 | +The value of `ClusterConnectionStatus` is visible in the Azure portal in the Cluster resource view. |
| 49 | + |
| 50 | +:::image type="content" source="media/troubleshoot-cluster-heartbeat-connection-status/azure-portal-cluster-connection-status.png" alt-text="Screenshot of ClusterConnectionStatus property as shown in the Azure portal." lightbox="media/troubleshoot-cluster-heartbeat-connection-status/azure-portal-cluster-connection-status.png"::: |
| 51 | + |
| 52 | +Or, you can use the Azure CLI to see the value of `ClusterConnectionStatus`: |
| 53 | + |
| 54 | +```azurecli |
| 55 | +az networkcloud cluster show \ |
| 56 | + -g "$CLUSTER_RG" \ |
| 57 | + -n "$CLUSTER_NAME" \ |
| 58 | + --subscription "$SUBSCRIPTION_ID" \ |
| 59 | + --query "{ClusterConnectionStatus:clusterConnectionStatus}" \ |
| 60 | + --output table |
| 61 | + |
| 62 | +ClusterConnectionStatus |
| 63 | +------------------------- |
| 64 | +Connected |
| 65 | +``` |
| 66 | + |
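Because the status returns to `Connected` automatically once connectivity recovers, it can be convenient to poll the property while you investigate. The following is a minimal sketch that wraps the same `az networkcloud cluster show` command. The helper names `get_cluster_connection_status` and `wait_for_connected`, the default attempt count, and the default interval are hypothetical choices for illustration. The sketch assumes `CLUSTER_RG`, `CLUSTER_NAME`, and `SUBSCRIPTION_ID` are already set.

```bash
# Hypothetical helper: prints the current status as a bare string
# (for example, Connected) by using tab-separated output.
get_cluster_connection_status() {
  az networkcloud cluster show \
    -g "$CLUSTER_RG" \
    -n "$CLUSTER_NAME" \
    --subscription "$SUBSCRIPTION_ID" \
    --query "clusterConnectionStatus" \
    --output tsv
}

# Hypothetical helper: polls until the status is Connected or attempts run out.
# $1 = maximum attempts (default 10), $2 = seconds between attempts (default 30).
# Returns 0 if Connected was observed, 1 otherwise.
wait_for_connected() {
  max_attempts="${1:-10}"
  interval_seconds="${2:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(get_cluster_connection_status)"
    echo "Attempt $attempt of $max_attempts: ClusterConnectionStatus=$status"
    if [ "$status" = "Connected" ]; then
      return 0
    fi
    # Wait only if another attempt follows.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, `wait_for_connected 20 30` checks the status every 30 seconds for roughly 10 minutes and returns a nonzero exit code if the Cluster never reports `Connected` in that window.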
| 67 | +## Understanding the NexusClusterConnectionStatus metric |
| 68 | + |
| 69 | +Use Azure Resource Health to build alerts for Cluster health, as it provides a comprehensive and supported view of resource status. |
| 70 | +The `NexusClusterConnectionStatus` metric integrates into the Cluster's Azure Resource Health. |
| 71 | +If you use the `NexusClusterConnectionStatus` metric directly, understand how it functions and what it represents. |
| 72 | + |
| 73 | +The Cluster Manager, not the on-premises Cluster, emits the metric based on the `ClusterConnectionStatus` property. |
| 74 | +A pod running on the on-premises Cluster sends heartbeat messages to the Cluster Manager through the infrastructure proxy. |
| 75 | +The metric emits a value of "1" for all time series, starting when the Cluster resource's connection status is set for the first time. |
| 76 | +The metric emitting process never sends "0" values. Any "0" values seen in graphs are due to graphing tools filling gaps. |
| 77 | +The detection of state changes requires the Cluster Manager's reconciliation process to update the Cluster resource's `ClusterConnectionStatus` property accordingly. |
| 78 | + |
| 79 | +There might be a delay between the actual loss of heartbeats and the metric reflecting the `Disconnected` state, due to the reconciliation loop and other operational factors. |
| 80 | +Treat the `NexusClusterConnectionStatus` metric as a health indicator for the Cluster rather than a real-time signal, because status changes can lag behind the underlying events. |
| 81 | +Timeout events can occur if heartbeats aren't received within a 2-minute threshold, but a single successful heartbeat resets the timer. |
| 82 | +The status can transition between `Connected`, `Timeout`, and `Disconnected` based on heartbeat activity. |
| 83 | + |
| 84 | +The image shows a general representation of the components responsible for emitting the `NexusClusterConnectionStatus` metric. |
| 85 | + |
| 86 | +:::image type="content" source="media/troubleshoot-cluster-heartbeat-connection-status/cluster-connection-status-components-for-metric.png" alt-text="Diagram that shows the components responsible for emitting the NexusClusterConnectionStatus metric." lightbox="media/troubleshoot-cluster-heartbeat-connection-status/cluster-connection-status-components-for-metric.png"::: |
| 87 | + |
| 88 | +### ClusterConnectionStatus isn't the same as Arc Connected Cluster status |
| 89 | + |
| 90 | +The Cluster's `ClusterConnectionStatus` and Arc Connected Cluster status are separate signals and shouldn't be treated interchangeably. |
| 91 | +Although the two signals aren't related, both rely on network connectivity for the Cluster. |
| 92 | +It's possible for a Cluster to be Arc `Disconnected` but still have a `ClusterConnectionStatus` of `Connected`. |
| 93 | +The two signals serve different purposes and are managed by different systems. |
| 94 | + |
| 95 | +## Common investigation steps |
| 96 | + |
| 97 | +Infrastructure networking issues, permission changes in the Managed Identity, or other issues that might not be obvious at first can affect the Cluster resource's connection status. |
| 98 | +The following sections provide some common investigation steps and references to help troubleshoot. |
| 99 | + |
| 100 | +> [!IMPORTANT] |
| 101 | +> The `ClusterConnectionStatus` indicates general instability, not the root cause. |
| 102 | +> This guide provides general resource health checks that might help locate the problem or at least help collect information useful for customer support. |
| 103 | + |
| 104 | +### Cluster Network Fabric health and connectivity |
| 105 | + |
| 106 | +It's useful to start with the Network Fabric [controller][Network Fabric Controller] and [services][Network Fabric Services] resources. |
| 107 | +Verify the [network configuration][How to Configure Network Fabric] or any other network-related settings that might be affecting the connectivity. |
| 108 | +Verify the physical network setup, including rack cabling, IP addresses, DNS settings, routing rules, and firewall rules. |
| 109 | + |
| 110 | +[How to Configure Network Fabric]: ./howto-configure-network-fabric.md |
| 111 | +[Network Fabric Controller]: ./concepts-network-fabric-controller.md |
| 112 | +[Network Fabric Services]: ./concepts-network-fabric-services.md |
| 113 | + |
| 114 | +Evaluate any configured monitoring or metrics for the Network Fabric resources. |
| 115 | +For more information, see the following links: |
| 116 | + |
| 117 | +- [Nexus Network Fabric configuration monitoring overview](./concepts-network-fabric-configuration-monitoring.md) |
| 118 | +- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric](./howto-configure-diagnostic-settings-monitor-configuration-differences.md) |
| 119 | +- [Azure Operator Nexus Network Fabric internal network BGP metrics](./concepts-internal-network-bgp-metrics.md) |
| 120 | +- [How to monitor interface In and Out packet rate for network fabric devices](./howto-monitor-interface-packet-rate.md) |
| 121 | + |
| 122 | +### Recent changes to the Managed Identity permissions |
| 123 | + |
| 124 | +Changes to the Managed Identity permissions for the Cluster Manager or Cluster can affect the Cluster's ability to authenticate against the Cluster Manager. |
| 125 | +The Managed Identities (MI) and their permissions are used for service-to-service authentication. |
| 126 | +A change in the permissions can result in authentication failures for the heartbeat messages. |
| 127 | +Even when network connectivity is healthy, the Cluster's `ClusterConnectionStatus` shows `Disconnected` if heartbeats aren't successfully received and acknowledged. |
| 128 | + |
| 129 | +### Check control-plane BareMetal Machines health |
| 130 | + |
| 131 | +The control-plane BareMetal Machines host the component that emits the heartbeats to the Cluster Manager. |
| 132 | +In most cases, the pods running on the control-plane reschedule automatically to a different BareMetal Machine within the control-plane node pool. |
| 133 | +However, if the BareMetal Machines aren't healthy, the pods can't reschedule and the Cluster is unable to send heartbeats. |
| 134 | + |
| 135 | +To check the BareMetal Machines, use the following command: |
| 136 | + |
| 137 | +```azurecli |
| 138 | +az networkcloud baremetalmachine list \ |
| 139 | + --resource-group "$CLUSTER_RG" \ |
| 140 | + --cluster-name "$CLUSTER_NAME" \ |
| 141 | + --subscription "$SUBSCRIPTION_ID" \ |
| 142 | + --output table |
| 143 | +``` |
| 144 | + |
| 145 | +Review the status of the control-plane BareMetal Machines. If any are unhealthy or unavailable, investigate further or contact support. |
| 146 | + |
| 147 | +[!include[stillHavingIssues](./includes/contact-support.md)] |