Commit db63a7f

reduced the PR changes to only focus on the cluster heartbeat article
1 parent 5e9ad4c commit db63a7f

7 files changed: +69 -175 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 0 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -389,8 +389,6 @@
389389
href: troubleshoot-accepted-cluster-hydration.md
390390
- name: Troubleshoot Control Plane Quorum
391391
href: troubleshoot-control-plane-quorum.md
392-
- name: Troubleshoot ETCD Cluster quorum loss and recovery
393-
href: troubleshoot-etcd-cluster-possible-quorum-lost.md
394392
- name: Troubleshoot Cluster heartbeat connection status disconnected
395393
href: troubleshoot-cluster-heartbeat-connection-status-disconnected.md
396394
- name: Bare Metal Machine
@@ -408,8 +406,6 @@
408406
href: troubleshoot-bare-metal-machine-warning.md
409407
- name: Troubleshoot Out of Memory Pods
410408
href: troubleshoot-memory-limits.md
411-
- name: Troubleshoot Bare Metal Machine in not ready state
412-
href: troubleshoot-bare-metal-machine-not-ready-state.md
413409
- name: Tenant Workload
414410
expanded: false
415411
items:

articles/operator-nexus/troubleshoot-bare-metal-machine-not-ready-state.md

Lines changed: 0 additions & 30 deletions
This file was deleted.

articles/operator-nexus/troubleshoot-cluster-heartbeat-connection-status-disconnected.md

Lines changed: 46 additions & 9 deletions
@@ -4,14 +4,14 @@ description: Provide steps to investigate and possibly resolve circumstances tha
44
ms.service: azure-operator-nexus
55
ms.custom: troubleshooting
66
ms.topic: troubleshooting
7-
ms.date: 04/28/2025
7+
ms.date: 07/02/2025
88
ms.author: omarrivera
99
author: omarrivera
1010
---
1111

1212
# Troubleshoot Cluster heartbeat connection status shows disconnected
1313

14-
This guide attempts to provide steps to troubleshoot a Cluster with a `clusterConnectionStatus` in `Disconnected` state.
14+
This guide attempts to provide steps to troubleshoot a Cluster with a `ClusterConnectionStatus` in `Disconnected` state.
1515
For a Cluster, the `ClusterConnectionStatus` represents the stability of the connection between the on-premises Cluster and the Cluster Manager.
1616

1717
> [!IMPORTANT]
@@ -25,7 +25,7 @@ For a Cluster, the `ClusterConnectionStatus` represents the stability in the con
2525
The `ClusterConnectionStatus` represents the ability of the on-premises Cluster to send heartbeats and receive acknowledgments from the Cluster Manager, indicating the health of the network connection between them.
2626
`ClusterConnectionStatus` is distinct from the connectivity of the Arc Connected Kubernetes Cluster, though network issues affect both.
2727

28-
A Cluster resource has the property `ClusterConnectionStatus` which is set to the value `Connected` as the heartbeats are continuously received and acknowledged.
28+
A Cluster resource has the property `ClusterConnectionStatus` set to the value `Connected` as the heartbeats are continuously received and acknowledged.
2929
The `ClusterConnectionStatus` becomes `Connected` once the Cluster is in a healthy state and network connectivity issues are resolved.
3030
The Cluster shows `Timeout` only as a transitional state between `Connected` and `Disconnected`.
3131
The Cluster `ClusterConnectionStatus` value becomes `Disconnected` as Cluster Manager detects continuously missed heartbeats.
@@ -38,15 +38,15 @@ The following table shows the possible values of `ClusterConnectionStatus` and t
3838
| Status | Definition |
3939
|----------------|-----------------------------------------------------------------------------------------------------------------------|
4040
| `Connected` | Heartbeats received, indicates healthy cluster and cluster manager connectivity |
41-
| `Disconnected` | Heartbeats missed for __over 5 minutes__, indicates likely connectivity issue between Cluster Manager and Cluster |
42-
| `Timeout` | Heartbeats missed for __over 2 minutes but less than 5 minutes__, cluster connectivity is uncertain possibly degraded |
41+
| `Disconnected` | Heartbeats missed for **over 5 minutes**, indicates likely connectivity issue between Cluster Manager and Cluster |
42+
| `Timeout` | Heartbeats missed for **over 2 minutes but less than 5 minutes**, cluster connectivity is uncertain, possibly degraded |
4343
| `Undefined` | Cluster not yet deployed or running a version without the heartbeats feature |
4444

45-
## Check the ClusterConnectionStatus
45+
## Check the value of the Cluster's ClusterConnectionStatus property
4646

4747
The value of `ClusterConnectionStatus` is visible in the Azure portal in the Cluster resource view.
4848

49-
![!include[clusterConnectionStatus](./media/troubleshoot-cluster-heartbeat-connection-status/az-portal-cluster-connection-status.png)]
49+
![!include[ClusterConnectionStatus](./media/troubleshoot-cluster-heartbeat-connection-status/az-portal-cluster-connection-status.png)]
5050

5151
Or, you can use the Azure CLI to see the value of `ClusterConnectionStatus`:
5252

@@ -63,6 +63,34 @@ ClusterConnectionStatus
6363
Connected
6464
```
6565

66+
## Understanding the NexusClusterConnectionStatus metric
67+
68+
Use Azure Resource Health to build alerts for cluster health, as it provides a comprehensive and supported view of resource status.
69+
The `NexusClusterConnectionStatus` metric integrates into the Cluster's Azure Resource Health.
70+
If you use the `NexusClusterConnectionStatus` metric directly, understand how it functions and what it represents.
71+
72+
The Cluster Manager, not the on-premises Cluster, emits the metric based on the `ClusterConnectionStatus` property.
73+
A pod running on the on-premises Cluster sends heartbeat messages to the Cluster Manager through the infrastructure proxy.
74+
The metric emits a value of "1" for all time series, starting from when the Cluster resource's connectionStatus is set for the first time.
75+
The metric emitting process never sends "0" values. Any "0" values seen in graphs are due to graphing tools filling gaps.
76+
The detection of state changes requires the Cluster Manager's reconciliation process to update the Cluster resource's `ClusterConnectionStatus` property accordingly.
77+
78+
There might be a delay between the actual loss of heartbeats and the metric reflecting the `Disconnected` state, due to the reconciliation loop and other operational factors.
79+
The `NexusClusterConnectionStatus` metric is used as a health indicator for the cluster, but delays in status changes can occur due to reconciliation timing and operational constraints.
80+
Timeout events can occur if heartbeats aren't received within a 2-minute threshold, but a single successful heartbeat resets the timer.
81+
The status can transition between `Connected`, `Timeout`, and `Disconnected` based on heartbeat activity.
82+
83+
The image shows a general representation of the components responsible for emitting the `NexusClusterConnectionStatus` metric.
84+
85+
![!include[ClusterHeartbeatComponents](./media/troubleshoot-cluster-heartbeat-connection-status/cluster-connection-status-components-for-metric.png)]
86+
87+
### ClusterConnectionStatus isn't the same as Arc Connected Cluster status
88+
89+
The Cluster's `ClusterConnectionStatus` and Arc Connected Cluster status are separate signals and shouldn't be treated interchangeably.
90+
Although the two signals aren't related, both rely on network connectivity for the Cluster.
91+
It's possible for a Cluster to be Arc `Disconnected` but still have a Heartbeat Status of `Connected`.
92+
Both signals depend on network connectivity, but they serve different purposes and are managed by different systems.
93+
6694
## Common investigation steps
6795

6896
Infrastructure networking issues, permission changes in the Managed Identity, or other issues that might not be obvious at first, affect the Cluster resource connection status.
@@ -75,7 +103,8 @@ The following sections provide some common investigation steps and references to
75103
### Cluster Network Fabric health and connectivity
76104

77105
It's useful to start with the Network Fabric [controller][Network Fabric Controller] and [services][Network Fabric Services] resources.
78-
Verify the [network configuration][How to Configure Network Fabric], including rack cabling, IP addresses, DNS settings, routing rules, firewall rules, and any other network-related settings that might be affecting the connectivity.
106+
Verify the [network configuration][How to Configure Network Fabric] or any other network-related settings that might be affecting the connectivity.
107+
Verify the physical network setup including rack cabling, IP addresses, DNS settings, routing rules, firewall rules, etc.
79108

80109
[How to Configure Network Fabric]: ./howto-configure-network-fabric.md
81110
[Network Fabric Controller]: ./concepts-network-fabric-controller.md
@@ -104,6 +133,14 @@ However, if the BareMetal Machines aren't healthy, the pods can't reschedule and
104133

105134
To check the BareMetal Machines, use the following command:
106135

107-
**TBD**: Need to add the command to check BareMetal Machines
136+
```azurecli
137+
az networkcloud baremetalmachine list \
138+
--resource-group "$CLUSTER_RG" \
139+
--cluster-name "$CLUSTER_NAME" \
140+
--subscription "$SUBSCRIPTION_ID" \
141+
--output table
142+
```
143+
144+
Review the status of the control-plane BareMetal Machines. If any are unhealthy or unavailable, investigate further or contact support.
108145

109146
[!include[stillHavingIssues](./includes/contact-support.md)]
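
Reviewer note: the threshold table in this article can be read as a simple classification on the time since the last acknowledged heartbeat. The following is a minimal sketch of that reading, not the Cluster Manager's actual implementation. The helper name `classify_heartbeat_gap` is hypothetical, and treating a gap of exactly 2 or 5 minutes as the lower status is an assumption, because the table only says "over".

```shell
# Hypothetical helper: maps the seconds elapsed since the last heartbeat
# to a status, following the documented thresholds:
#   over 5 minutes            -> Disconnected
#   over 2 and up to 5 minutes -> Timeout
#   otherwise                 -> Connected
# Assumption: a gap of exactly 120 s or 300 s stays in the lower status.
classify_heartbeat_gap() {
  local gap_seconds="$1"
  if [ "$gap_seconds" -gt 300 ]; then
    echo "Disconnected"
  elif [ "$gap_seconds" -gt 120 ]; then
    echo "Timeout"
  else
    echo "Connected"
  fi
}

classify_heartbeat_gap 30    # recent heartbeat
classify_heartbeat_gap 180   # 3 minutes without a heartbeat
classify_heartbeat_gap 600   # 10 minutes without a heartbeat
```

Because a single successful heartbeat resets the timer, the gap in this sketch would drop back to zero on the next acknowledged heartbeat. The reported status can still lag behind, since it depends on the reconciliation process described in the article.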

articles/operator-nexus/troubleshoot-control-plane-quorum.md

Lines changed: 19 additions & 19 deletions
@@ -2,7 +2,7 @@
22
title: Troubleshoot control plane quorum loss when multiple nodes are offline
33
description: Learn how to restore control plane quorum loss when multiple nodes are offline.
44
ms.topic: article
5-
ms.date: 04/29/2025
5+
ms.date: 01/18/2024
66
author: matthewernst
77
ms.author: matthewernst
88
ms.service: azure-operator-nexus
@@ -34,29 +34,29 @@ Follow the steps in this troubleshooting article when multiple control plane nod
3434
- Sign in to the identified server.
3535
- Ensure that the ironic-conductor service is present on this node by using `crictl ps -a |grep -i ironic-conductor`. Here's example output:
3636

37-
~~~
38-
testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
39-
<id> <id> 6 hours ago Running ironic-conductor 0 <id>
40-
~~~
37+
```shell
38+
testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
39+
<id> <id> 6 hours ago Running ironic-conductor 0 <id>
40+
```
4141

4242
1. Determine the integrated Dell remote access controller (iDRAC) IP of the server:
4343
- Run the command `az networkcloud cluster list -g <RG_Name>`.
4444
- The output of the command is JSON with the iDRAC IP.
4545

46-
~~~
47-
{
48-
"bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
49-
"bmcCredentials": {
50-
"username": "<username>"
51-
},
52-
"bmcMacAddress": "<bmcMacAddress>",
53-
"bootMacAddress": "<bootMacAddress",
54-
"machineDetails": "extraDetails",
55-
"machineName": "<machineName>",
56-
"rackSlot": <rackSlot>,
57-
"serialNumber": "<serialNumber>"
58-
},
59-
~~~
46+
```json
47+
{
48+
"bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
49+
"bmcCredentials": {
50+
"username": "<username>"
51+
},
52+
"bmcMacAddress": "<bmcMacAddress>",
53+
"bootMacAddress": "<bootMacAddress>",
54+
"machineDetails": "extraDetails",
55+
"machineName": "<machineName>",
56+
"rackSlot": <rackSlot>,
57+
"serialNumber": "<serialNumber>"
58+
},
59+
```
6060

6161
1. Access the integrated iDRAC graphical user interface (GUI) by using the IP in your browser to shut down affected management servers.
6262

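
Reviewer note: the iDRAC IP used in the steps above is the host portion of the `bmcConnectionString` value. The following is a minimal sketch of pulling it out with shell parameter expansion. The helper name `extract_bmc_host` is hypothetical, and `192.0.2.10` is a documentation-only placeholder address, not a value from the source.

```shell
# Hypothetical helper: prints the host portion of a bmcConnectionString value.
extract_bmc_host() {
  local conn="$1"
  local rest="${conn#*://}"   # drop the scheme prefix, e.g. "redfish+https://"
  echo "${rest%%/*}"          # keep everything before the first "/"
}

# 192.0.2.10 is a placeholder from the documentation address range.
extract_bmc_host "redfish+https://192.0.2.10/redfish/v1/Systems/System.Embedded.1"
```

If the connection string carried an explicit port, this sketch would return it together with the host (for example `host:443`), so strip it separately if only the IP is needed.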
articles/operator-nexus/troubleshoot-etcd-cluster-possible-quorum-lost.md

Lines changed: 0 additions & 90 deletions
This file was deleted.
