Commit dc26f21

adding improvements to the etcd quorum lost article
1 parent 19ad681 commit dc26f21

3 files changed

+65
-11
lines changed


articles/operator-nexus/troubleshoot-bare-metal-machine-not-ready-state.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ ms.date: 10/09/2024
88
ms.author: omarrivera
99
author: omarrivera
1010
---
11+
1112
# Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state
1213

1314
This guide attempts to provide steps to troubleshoot when a BareMetal Machine is declared to be in a `Not Ready` state.

articles/operator-nexus/troubleshoot-control-plane-quorum.md

Lines changed: 3 additions & 3 deletions
@@ -1,8 +1,8 @@
11
---
2-
title: Troubleshoot control plane quorum loss
3-
description: Learn how to restore control plane quorum loss.
2+
title: Troubleshoot control plane quorum loss when multiple nodes are offline
3+
description: Learn how to restore control plane quorum loss when multiple nodes are offline.
44
ms.topic: article
5-
ms.date: 01/18/2024
5+
ms.date: 04/29/2025
66
author: matthewernst
77
ms.author: matthewernst
88
ms.service: azure-operator-nexus
Lines changed: 61 additions & 8 deletions
@@ -1,21 +1,74 @@
11
---
2-
title: Troubleshoot Azure Operator Nexus Cluster has ETCD Quorum Lost
3-
description: Provides steps to follow in the event that an `etcd` quorum is lost for an extended period of time and the KCP did not successfully return to a stable state.
2+
title: Troubleshoot Azure Operator Nexus Cluster lost etcd quorum
3+
description: Steps to follow when `etcd` quorum is lost for an extended period of time and the KCP didn't successfully return to a stable state.
44
ms.service: azure-operator-nexus
55
ms.custom: troubleshooting
66
ms.topic: troubleshooting
7-
ms.date: 10/09/2024
7+
ms.date: 04/29/2024
88
ms.author: omarrivera
99
author: omarrivera
1010
---
11-
# Troubleshoot Azure Operator Nexus Cluster has ETCD Quorum Lost
1211

13-
This guide attempts to provide steps to follow in the event that an `etcd` quorum is lost for an extended period of time and the Kubernetes Control Plane (KCP) did not successfully return to stable state.
12+
# Troubleshoot Azure Operator Nexus Cluster lost etcd quorum
13+
14+
This guide provides steps to follow when `etcd` quorum is lost for an extended period of time and the Kubernetes Control Plane (KCP) doesn't return to a stable state.
1415

1516
> [!IMPORTANT]
16-
> At this time there is no supported approach that can be executed through customer tools.
17-
> There will be a feature enhancement for a future release to help address this scenario.
18-
> Please, open a support ticket via [contact support].
17+
> At this time, there's no supported approach that can be executed through customer tools.
18+
> Feature enhancements are planned for a future release to help address this scenario without contacting support.
19+
> Open a support ticket via [contact support].
20+
21+
## Ensure that all control plane nodes are online
22+
23+
[Troubleshoot control plane quorum loss when multiple nodes are offline](./troubleshoot-control-plane-quorum.md) provides steps to follow when multiple control plane nodes are offline or unavailable.
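
For context, etcd keeps quorum only while a majority of its voting members are available. The following is a minimal sketch of that arithmetic, useful for judging how many control plane nodes can be offline before quorum is lost. The helper names `quorum_size` and `fault_tolerance` are hypothetical and not part of any Nexus or etcd tooling.

```bash
# Sketch: quorum arithmetic for an etcd cluster with N voting members.
# quorum_size and fault_tolerance are hypothetical helper names.

quorum_size() {
  # A majority of N members: floor(N / 2) + 1
  echo $(( $1 / 2 + 1 ))
}

fault_tolerance() {
  # Members that can be unavailable while a majority remains
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For example, `quorum_size 3` prints `2` and `fault_tolerance 3` prints `1`, so a three-member cluster loses quorum as soon as two members are unavailable.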
24+
25+
## Check the status of the etcd pods
26+
27+
It's possible that the etcd pods aren't running or are in a crash loop.
28+
This can happen if the control plane nodes aren't able to communicate with each other or if there are network issues.
29+
30+
If you have access to the control plane nodes, you can check the status of the etcd pods by running the following command:
31+
32+
```bash
33+
kubectl get pods -n kube-system -l app=etcd
34+
```
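
To spot pods that aren't healthy at a glance, the output of the command above can be filtered. The following is a rough sketch that assumes the default `kubectl get pods --no-headers` column layout, where the third column is the pod status. The helper name `not_running` is hypothetical.

```bash
# Sketch: list pods whose STATUS column is anything other than "Running".
# Reads `kubectl get pods --no-headers` output on stdin.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
not_running() {
  awk '$3 != "Running" { print $1, $3 }'
}

# Example use (label selector taken from the command above):
# kubectl get pods -n kube-system -l app=etcd --no-headers | not_running
```

If the selector returns no pods at all, the label might differ on your cluster. Running `kubectl get pods -n kube-system --show-labels` lists the labels that are actually in use.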
35+
36+
## Check network connectivity between control plane nodes
37+
If the etcd pods are running, but the KCP is still not stable, it's possible that there are network connectivity issues between the control plane nodes.
38+
You can check the network connectivity between the control plane nodes by running the following command:
39+
40+
```bash
41+
kubectl exec -it <etcd-pod-name> -n kube-system -- ping <other-control-plane-node-ip>
42+
```
43+
44+
Replace `<etcd-pod-name>` with the name of one of the etcd pods and `<other-control-plane-node-ip>` with the IP address of one of the other control plane nodes.
45+
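If the ping command fails, it suggests that there are network connectivity issues between the control plane nodes, although it can also fail if the container image doesn't include `ping` or if ICMP traffic is blocked.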
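
As a complementary check, etcd itself can report whether each member endpoint is reachable and healthy. The sketch below only builds the `--endpoints` value from a list of control plane IP addresses, assuming etcd serves clients over TLS on its default client port 2379. The helper name `etcd_endpoints` is hypothetical, and the certificate paths in the commented `etcdctl` call are placeholders that depend on how the cluster was deployed.

```bash
# Sketch: build a comma-separated etcd endpoint list from node IP addresses.
# Assumes etcd listens for clients on the default port 2379 over TLS.
# etcd_endpoints is a hypothetical helper name.
etcd_endpoints() {
  out=""
  for ip in "$@"; do
    # Append a comma only when the list already has an entry
    out="${out:+$out,}https://${ip}:2379"
  done
  echo "$out"
}

# Example use from inside an etcd pod (certificate paths are placeholders):
# etcdctl --endpoints="$(etcd_endpoints <ip-1> <ip-2> <ip-3>)" \
#   --cacert=<ca-cert> --cert=<client-cert> --key=<client-key> \
#   endpoint health
```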
46+
47+
## Check for storage issues
48+
If the etcd pods are running and there are no network connectivity issues, it's possible that there are storage issues with the etcd pods.
49+
You can check for storage issues by running the following command:
50+
51+
```bash
52+
kubectl describe pod <etcd-pod-name> -n kube-system
53+
```
54+
55+
Replace `<etcd-pod-name>` with the name of one of the etcd pods.
56+
This command provides detailed information about the etcd pod, including its volumes, mounts, and recent events, which can point to storage issues.
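
Because the `describe` output is long, it can help to narrow it to the warning events. The following is a minimal sketch that assumes the usual Events table layout, where each row starts with the event type. The helper name `warning_events` is hypothetical.

```bash
# Sketch: keep only Warning rows from the Events table of
# `kubectl describe pod` output read on stdin.
# warning_events is a hypothetical helper name.
warning_events() {
  awk '$1 == "Warning"'
}

# Example use:
# kubectl describe pod <etcd-pod-name> -n kube-system | warning_events
```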
57+
58+
## Check for resource issues or saturation
59+
If the etcd pods are running and there are no network connectivity or storage issues, it's possible that there are resource issues or saturation on the control plane nodes.
60+
You can check the resource usage on the control plane nodes by running the following command:
61+
62+
```bash
63+
kubectl top nodes
64+
```
65+
66+
This command provides information about the CPU and memory usage on the control plane nodes. It requires the Kubernetes Metrics API (for example, through metrics-server) to be available.
67+
If the CPU or memory usage is high, it might indicate that the control plane nodes are saturated and unable to process requests.
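
To make "high" concrete, the output can be compared against a threshold. The sketch below assumes the default `kubectl top nodes --no-headers` column layout (name, CPU cores, CPU percentage, memory bytes, memory percentage). The helper name `saturated_nodes` and the default threshold of 80 percent are illustrative assumptions, not recommended values.

```bash
# Sketch: print nodes whose CPU% or MEMORY% is at or above a threshold.
# Reads `kubectl top nodes --no-headers` output on stdin.
# Assumes columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%.
# saturated_nodes and the default of 80 are illustrative assumptions.
saturated_nodes() {
  awk -v t="${1:-80}" '{
    cpu = $3; mem = $5
    gsub(/%/, "", cpu); gsub(/%/, "", mem)   # strip the percent signs
    if (cpu + 0 >= t || mem + 0 >= t) print $1, $3, $5
  }'
}

# Example use:
# kubectl top nodes --no-headers | saturated_nodes 80
```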
68+
69+
70+
**TODO**: Add any possible steps that can be taken to help with this scenario. Perhaps there's a list of things to check. The steps above are just a starting point and should be expanded upon.
71+
1972
[!include[stillHavingIssues](./includes/contact-support.md)]
2073

2174
[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
