---
title: Troubleshoot Azure Operator Nexus Cluster lost etcd quorum
description: Steps to follow when `etcd` quorum is lost for an extended period of time and the KCP didn't successfully return to a stable state.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 04/29/2024
ms.author: omarrivera
author: omarrivera
---

# Troubleshoot Azure Operator Nexus Cluster lost etcd quorum

This guide provides steps to follow when an `etcd` quorum is lost for an extended period of time and the Kubernetes Control Plane (KCP) didn't successfully return to a stable state.

> [!IMPORTANT]
> At this time, there's no supported approach that can be executed through customer tools.
> Feature enhancements are ongoing for a future release to help address this scenario without contacting support.
> Open a support ticket via [contact support].

## Ensure that all control plane nodes are online

[Troubleshoot control plane quorum loss when multiple nodes are offline](./troubleshoot-control-plane-quorum.md) provides steps to follow when multiple control plane nodes are offline or unavailable.
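
Before digging into `etcd` itself, a quick node-level check confirms whether every control plane node is registered and `Ready`. This sketch assumes the standard kubeadm `node-role.kubernetes.io/control-plane` label; adjust the selector if your nodes are labeled differently.

```bash
# List only the control plane nodes and their readiness; a NotReady or missing
# node here is a more likely root cause than etcd itself.
kubectl get nodes -l node-role.kubernetes.io/control-plane -o wide
```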

## Check the status of the etcd pods

It's possible that the etcd pods aren't running or are in a crash loop.
This state can occur if the control plane nodes can't communicate with each other or if there are network issues.

If you have access to the control plane nodes, you can check the status of the etcd pods by running the following command:

```bash
# etcd runs as static pods on each control plane node; kubeadm labels them
# component=etcd. Adjust the selector if your etcd pods use a different label.
kubectl get pods -n kube-system -l component=etcd
```
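
If a pod shows `CrashLoopBackOff`, its logs usually state why it can't start, for example, that it can't reach its peers or that its data directory is unhealthy. A minimal follow-up check:

```bash
# Inspect the current container logs and, if the pod restarted,
# the logs from the previous container instance.
kubectl logs -n kube-system <etcd-pod-name>
kubectl logs -n kube-system <etcd-pod-name> --previous
```

Replace `<etcd-pod-name>` with the name of one of the etcd pods from the previous command.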

## Check network connectivity between control plane nodes

If the etcd pods are running but the KCP is still not stable, there might be network connectivity issues between the control plane nodes.
You can check the network connectivity between the control plane nodes by running the following command:

```bash
# Note: minimal etcd images might not include ping; if the command fails with
# "executable file not found", run the check from the node itself instead.
kubectl exec -it <etcd-pod-name> -n kube-system -- ping <other-control-plane-node-ip>
```

Replace `<etcd-pod-name>` with the name of one of the etcd pods and `<other-control-plane-node-ip>` with the IP address of one of the other control plane nodes.
If the ping command fails, there might be network connectivity issues between the control plane nodes, or ICMP might be blocked; the health check below gives a more direct signal.
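
Because `ping` only proves ICMP reachability, a more direct test is to ask etcd itself whether it can reach every member over its client port. This is a sketch that assumes the kubeadm-default certificate paths under `/etc/kubernetes/pki/etcd/`; adjust the paths to match your etcd pod's mounts.

```bash
# Query the health of every etcd endpoint in the cluster from inside one pod.
kubectl exec -it <etcd-pod-name> -n kube-system -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
```

An unhealthy or unreachable endpoint in this output points at the member, or the network path to it, that's preventing quorum.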

## Check for storage issues

If the etcd pods are running and there are no network connectivity issues, it's possible that the etcd pods have storage issues.
You can check for storage issues by running the following command:

```bash
# The events near the end of the output surface volume mount failures,
# disk pressure evictions, and failing probes.
kubectl describe pod <etcd-pod-name> -n kube-system
```

Replace `<etcd-pod-name>` with the name of one of the etcd pods.
This command provides detailed information about the etcd pod, including events for any storage issues that might be present.
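
The kubelet also reports storage starvation as a node condition. This sketch lists the `DiskPressure` condition for every node; a value of `True` on a control plane node means etcd writes are likely failing or slow.

```bash
# Print each node name with the status of its DiskPressure condition.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
```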

## Check for resource issues or saturation

If the etcd pods are running and there are no network connectivity or storage issues, it's possible that the control plane nodes have resource issues or are saturated.
You can check the resource usage on the control plane nodes by running the following command:

```bash
# Requires metrics-server to be running in the cluster.
kubectl top nodes
```

This command provides information about the CPU and memory usage on each node, including the control plane nodes.
If the CPU or memory usage is high, it might indicate that the control plane nodes are saturated and unable to process requests.
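
To see which workloads are consuming those resources, the same metrics are available per pod. A quick follow-up, again assuming metrics-server is available:

```bash
# Show the heaviest memory consumers in kube-system; etcd and the API server
# near the top under load is expected, but sustained saturation is not.
kubectl top pods -n kube-system --sort-by=memory
```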

[!include[stillHavingIssues](./includes/contact-support.md)]

[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade