Commit ebca591

committed
add troubleshooting articles for resource health monitors
1 parent 20f8045 commit ebca591

5 files changed

+127
-1
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 5 additions & 1 deletion
@@ -249,7 +249,7 @@
249249
href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
250250
- name: How to configure BGP prefix limit on Customer Edge (CE) devices for Azure Operator Nexus
251251
href: howto-configure-bgp-prefix-limit-on-customer-edge-devices.md
252-
- name: BMP log streaming in Azure Operator Nexus Network Fabric
252+
- name: BMP log streaming in Azure Operator Nexus Network Fabric
253253
href: concepts-bmp-log-streaming.md
254254
- name: How to enable / disable BMP log streaming Azure Operator Nexus
255255
href: howto-enable-log-streaming.md
@@ -397,6 +397,10 @@
397397
href: troubleshoot-accepted-cluster-hydration.md
398398
- name: Troubleshoot Out of Memory Pods
399399
href: troubleshoot-memory-limits.md
400+
- name: Troubleshoot Cluster heartbeat connection status disconnected
401+
href: troubleshoot-cluster-heartbeat-connection-status-disconnected.md
402+
- name: Troubleshoot Bare Metal Machine in not ready state
403+
href: troubleshoot-bare-metal-machine-not-ready-state.md
400404
- name: Tenant Workload
401405
expanded: false
402406
items:
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
---
2+
author: omarrivera
3+
ms.author: omarrivera
4+
ms.date: 10/09/2024
5+
ms.topic: include
6+
ms.service: azure-operator-nexus
7+
---
8+
## Still Having Issues?
9+
10+
If the steps outlined didn't resolve the issue or if you still have questions, [contact support].
11+
For more information about support plans, see [Azure Support plans].
12+
13+
[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
14+
[Azure Support plans]: https://azure.microsoft.com/support/plans/response/
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
---
2+
title: Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state
3+
description: Examine common and known issues with BareMetal Machine resources.
4+
ms.service: azure-operator-nexus
5+
ms.custom: troubleshooting
6+
ms.topic: troubleshooting
7+
ms.date: 10/09/2024
8+
ms.author: omarrivera
9+
author: omarrivera
10+
---
11+
# Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state
12+
13+
This guide provides steps to troubleshoot a BareMetal Machine that is reported in a `Not Ready` state.
14+
15+
> [!NOTE]
16+
> A BareMetal Machine can be in a `Not Ready` state for multiple reasons.
17+
> The best approach is to determine if some of the common reasons apply.
18+
> Although this guide covers historically known issues, it can't cover all possible error scenarios.
19+
[!include[prereqAzCLI](./includes/prereq-az-cli.md)]
20+
21+
22+
TODO - use the article that exists as reference and add only the preconditions and we'll have
23+
articles/operator-nexus/troubleshoot-bare-metal-machine-provisioning.md
24+
25+
>[!NOTE]
26+
> NC 3.14 has OnpremLogs in the LAW - it would need to use that for reference https://teams.microsoft.com/l/message/19:99bdf627-579c-46bb-a2e1-20215be79888_e5ef5aef-6faf-4e93-ae99-d353f173d715@unq.gbl.spaces/1729106818629?context=%7B%22contextType%22%3A%22chat%22%7D
27+
[!include[stillHavingIssues](./includes/contact-support.md)]
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
1+
---
2+
title: Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected
3+
description: Provide steps to investigate and possibly resolve circumstances that are preventing the Cluster from sending heartbeats to the Cluster Manager.
4+
ms.service: azure-operator-nexus
5+
ms.custom: troubleshooting
6+
ms.topic: troubleshooting
7+
ms.date: 10/09/2024
8+
ms.author: omarrivera
9+
author: omarrivera
10+
---
11+
# Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected
12+
13+
This guide provides steps to troubleshoot a Cluster whose `clusterConnectionStatus` has a value of `Disconnected`.
14+
15+
> [!CAUTION]
16+
> The `ClusterConnectionStatus` is likely a symptom or signal, not the root cause, and this guide can't provide answers for all scenarios.
17+
> This guide focuses on common issues and signals that you can inspect to determine where the issue might be.
18+
## Understanding the Issue
19+
20+
Cluster Managers ensure continuous Cluster network connectivity through a heartbeat agent running within the target Cluster.
21+
The cluster-heartbeat agent sends periodic HTTP messages to the Cluster Manager and expects an acknowledgment in response.
22+
A Cluster has the property `ClusterConnectionStatus`, which is set to `Connected` while heartbeats are continuously received and acknowledged.
23+
24+
The `ClusterConnectionStatus` becomes `Connected` once the cluster is in a healthy state and network connectivity issues are resolved.
25+
If the Cluster is expected to be healthy but the `ClusterConnectionStatus` remains `Disconnected` after you follow the steps in this guide, [contact support].
26+
27+
> [!IMPORTANT]
28+
> `ClusterConnectionStatus` is **not** the same as Arc Connected Kubernetes Clusters.
29+
The following command shows the value of `ClusterConnectionStatus`. The value is also visible in the Azure portal in the Cluster resource's JSON view.
30+
31+
```azurecli
32+
az networkcloud cluster show --subscription "$SUBSCRIPTION_ID" -g "$CLUSTER_RG" -n "$CLUSTER_NAME" --output table --query "{ClusterConnectionStatus:clusterConnectionStatus}"
33+
ClusterConnectionStatus
34+
-------------------------
35+
Connected
36+
```
37+
38+
The following table shows which status is displayed depending on the state of the undercloud cluster:
39+
40+
| Status | Definition |
41+
|----------------|-----------------------------------------------------------------------------------------------------------------------|
42+
| `Connected` | Heartbeats received, indicates healthy cluster and cluster manager connectivity |
43+
| `Disconnected` | Heartbeats missed for __over 5 minutes__, indicates likely connectivity issue between Cluster Manager and Cluster |
44+
| `Timeout`      | Heartbeats missed for __over 2 minutes but less than 5 minutes__, cluster connectivity is uncertain, possibly degraded |
45+
| `Undefined` | Cluster not yet deployed or running a version without the heartbeats feature |
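
To act on these statuses from a script, the check can be wrapped in a couple of shell functions. The following is a minimal sketch, not part of the product: `get_connection_status` and `interpret_status` are hypothetical helper names, and the sketch assumes that `SUBSCRIPTION_ID`, `CLUSTER_RG`, and `CLUSTER_NAME` are already set and that the `networkcloud` CLI extension is installed. The descriptions are taken from the status table.

```bash
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical helpers, not Azure CLI commands.

# Print the raw clusterConnectionStatus value for the Cluster.
# Assumes SUBSCRIPTION_ID, CLUSTER_RG, and CLUSTER_NAME are set by the caller.
get_connection_status() {
  az networkcloud cluster show \
    --subscription "$SUBSCRIPTION_ID" \
    -g "$CLUSTER_RG" \
    -n "$CLUSTER_NAME" \
    --query "clusterConnectionStatus" \
    --output tsv
}

# Map a status value to the definition given in the status table.
interpret_status() {
  case "$1" in
    Connected)    echo "heartbeats received" ;;
    Timeout)      echo "heartbeats missed for over 2 minutes but less than 5 minutes" ;;
    Disconnected) echo "heartbeats missed for over 5 minutes" ;;
    Undefined)    echo "cluster not yet deployed or running a version without heartbeats" ;;
    *)            echo "unrecognized status: $1" ;;
  esac
}
```

A possible invocation is `interpret_status "$(get_connection_status)"`. A `Timeout` result might be worth rechecking after a few minutes before treating it as a disconnection, because the table describes that state as uncertain.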
46+
47+
## Basic Investigation Steps
48+
49+
### 1. Ensure Network Connectivity for the Cluster
50+
51+
TODO - what steps could be done here?
52+
53+
### Other possible causes to evaluate
54+
55+
- Are there recent changes to the Managed Identity permissions for the Cluster Manager or Cluster?
56+
- The Managed Identities (MI) and their permissions are used for service-to-service authentication. A change in the permissions can result in authentication failures for the heartbeat messages. Cluster Managers must both receive and acknowledge heartbeats; failure to do so also results in a `ClusterConnectionStatus` of `Disconnected`.
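
One way to review the permissions currently granted to a Managed Identity is to list its role assignments. The following is a minimal sketch under stated assumptions: `list_identity_role_assignments` is a hypothetical helper name, and the principal ID of the Managed Identity used by the Cluster Manager or Cluster is assumed to be known already. The source doesn't say which roles are required, so the output is only a starting point for comparison against a known-good configuration.

```bash
#!/usr/bin/env bash
# Sketch only: list_identity_role_assignments is a hypothetical helper name.

# List every role assignment for the given principal ID, across all scopes.
# $1: principal (object) ID of the Managed Identity, assumed to be known.
list_identity_role_assignments() {
  local principal_id="$1"
  az role assignment list \
    --assignee "$principal_id" \
    --all \
    --output table
}
```

For example, `list_identity_role_assignments "$PRINCIPAL_ID"`, where `PRINCIPAL_ID` is a placeholder variable for the identity being inspected.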
57+
58+
[!include[stillHavingIssues](./includes/contact-support.md)]
59+
60+
[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
---
2+
title: Troubleshoot Azure Operator Nexus Cluster has ETCD Quorum Lost
3+
description: Provides steps to follow in the event that an `etcd` quorum is lost for an extended period of time and the KCP did not successfully return to a stable state.
4+
ms.service: azure-operator-nexus
5+
ms.custom: troubleshooting
6+
ms.topic: troubleshooting
7+
ms.date: 10/09/2024
8+
ms.author: omarrivera
9+
author: omarrivera
10+
---
11+
# Troubleshoot Azure Operator Nexus Cluster has ETCD Quorum Lost
12+
13+
This guide provides steps to follow when `etcd` quorum is lost for an extended period of time and the Kubernetes Control Plane (KCP) doesn't return to a stable state.
14+
15+
> [!IMPORTANT]
16+
> At this time there is no supported approach that can be executed through customer tools.
17+
> There will be a feature enhancement for a future release to help address this scenario.
18+
> Open a support ticket via [contact support].
19+
[!include[stillHavingIssues](./includes/contact-support.md)]
20+
21+
[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
