> The commands can be executed from the Azure portal or with the Azure CLI.
**TODO**: Add any possible steps that can be taken to help with this scenario. Perhaps there's a list of things to check. The steps below are just a starting point and should be expanded upon.
## Ensure that all control plane nodes are online
[Troubleshoot control plane quorum loss when multiple nodes are offline](./troubleshoot-control-plane-quorum.md) provides steps to follow when multiple control plane nodes are offline or unavailable.
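A quick way to confirm node availability from the CLI is to list the bare metal machines and review their status columns. The following is a minimal sketch; the resource group name is a placeholder for the Cluster's managed resource group:

```azurecli
# List bare metal machines with their power state and detailed status.
# <cluster-managed-resource-group> is a placeholder; replace it with the
# Cluster's managed resource group.
az networkcloud baremetalmachine list \
  --resource-group "<cluster-managed-resource-group>" \
  --output table
```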
## Check the status of the etcd pods
It's possible that the etcd pods aren't running or are in a crash loop. This can happen if the control plane nodes aren't able to communicate with each other or if there are network issues.

You can check the status of the etcd pods by running the following command:

```azurecli
az networkcloud baremetalmachine run-read-command \
  --name "<control-plane-bmm-name>" \
  --resource-group "<resource-group>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl get',arguments:[pods,-n,kube-system,-l,'app=etcd']}]"
```

Replace `<control-plane-bmm-name>` and `<resource-group>` with the name of one of the control plane bare metal machines and its resource group. The `--limit-time-seconds` value is only an example.

Record the `etcd` pod names; other commands require `<etcd-pod-name>` to be populated with the name of one of the etcd pods.
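If you want to keep the output for later review, `run-read-command` can also save the results to a local directory through its `--output-directory` parameter. The directory path in this sketch is only an example:

```azurecli
# Same etcd pod query, with the results downloaded to a local directory.
az networkcloud baremetalmachine run-read-command \
  --name "<control-plane-bmm-name>" \
  --resource-group "<resource-group>" \
  --limit-time-seconds 60 \
  --output-directory "./run-read-output" \
  --commands "[{command:'kubectl get',arguments:[pods,-n,kube-system,-l,'app=etcd']}]"
```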
## Check network connectivity between control plane nodes
If the etcd pods are running, but the KCP is still not stable, it's possible that there are network connectivity issues between the control plane nodes.

You can check the network connectivity between the control plane nodes by running the following command:

```azurecli
az networkcloud baremetalmachine run-read-command \
  --name "<control-plane-bmm-name>" \
  --resource-group "<resource-group>" \
  --limit-time-seconds 60 \
  --commands "[{command:'ping',arguments:[<other-control-plane-node-ip>,-c,3]}]"
```

Replace `<other-control-plane-node-ip>` with the IP address of one of the other control plane nodes. The `-c 3` packet count is only an example.

If the ping command fails, it indicates that there are network connectivity issues between the control plane nodes.
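When the ping fails, it can also help to inspect the interface state on the node. The following is a sketch that assumes `ip address show` is in the `run-read-command` allowlist for your environment:

```azurecli
# Show interface state on the node. The availability of "ip address show"
# through run-read-command is an assumption; check your allowlist.
az networkcloud baremetalmachine run-read-command \
  --name "<control-plane-bmm-name>" \
  --resource-group "<resource-group>" \
  --limit-time-seconds 60 \
  --commands "[{command:'ip address show'}]"
```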
## Check for storage issues with the etcd pods

If the etcd pods are running and there are no network connectivity issues, it's possible that there are storage issues with the etcd pods.

You can check for storage issues by running the following command:

```azurecli
az networkcloud baremetalmachine run-read-command \
  --name "<control-plane-bmm-name>" \
  --resource-group "<resource-group>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl describe',arguments:[pod,<etcd-pod-name>,-n,kube-system]}]"
```

Replace `<etcd-pod-name>` with the name of one of the etcd pods.

This command provides detailed information about the etcd pod, including any storage issues that might be present.
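Storage problems often surface in the etcd pod logs as well. The following is a sketch that assumes `kubectl logs` is permitted by `run-read-command`:

```azurecli
# Retrieve the etcd pod logs. The availability of "kubectl logs" through
# run-read-command is an assumption; check your allowlist.
az networkcloud baremetalmachine run-read-command \
  --name "<control-plane-bmm-name>" \
  --resource-group "<resource-group>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl logs',arguments:[<etcd-pod-name>,-n,kube-system]}]"
```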
## Check for resource issues or saturation
If the etcd pods are running and there are no network connectivity or storage issues, it's possible that there are resource issues or saturation on the control plane nodes.

You can check the resource usage on the control plane nodes by running the following command. This sketch assumes `kubectl top` is permitted by `run-read-command`; if it isn't, run `kubectl top nodes` directly on a control plane node:

```azurecli
az networkcloud baremetalmachine run-read-command \
  --name "<control-plane-bmm-name>" \
  --resource-group "<resource-group>" \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl top',arguments:[nodes]}]"
```

This command provides information about the CPU and memory usage on the control plane nodes.

If the CPU or memory usage is high, it might indicate that the control plane nodes are overscheduled and unable to process requests.
## Cluster

| Resource Health Event Name | Troubleshooting Guide |
| -------------------------- | --------------------- |
|`ClusterHeartbeatConnectionStatusDisconnectedClusterManagerOperationsAreAffectedPossibleNetworkIssues`|[Troubleshoot Cluster heartbeat connection status shows disconnected](./troubleshoot-cluster-heartbeat-connection-status-disconnected.md)|
|`ClusterHeartbeatConnectionStatusTimedoutPossiblePerformanceIssues`|[Troubleshoot Cluster heartbeat connection status shows disconnected](./troubleshoot-cluster-heartbeat-connection-status-disconnected.md)|
|`ETCDPossibleQuorumLossClusterOperationsAreAffected`<br>`ETCDPossibleQuorumLossDegradedProposalsProcessing`<br>`ETCDPossibleQuorumLossIncreasedProposalsProcessingFailures`<br>`ETCDPossibleQuorumLossNoClusterLeader`|[Troubleshoot Cluster Manager Not Reachable](./troubleshoot-cluster-manager-not-reachable.md)|
## Bare Metal Machine
| Resource Health Event Name | Troubleshooting Guide |
| -------------------------- | --------------------- |
|`BMMHasHardwareValidationFailures`|[Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-this-machine-has-failed-hardware-validation)|
|`BMMHasLACPDownStatusCondition`|[Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-lacp-status-is-down)|
|`BMMHasNodeReadinessProblem`|[Troubleshoot Bare Metal Machine in not ready state](troubleshoot-bare-metal-machine-not-ready-state.md)|
|`BMMHasPortDownStatusCondition`|[Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-port-down)|
|`BMMHasPortFlappingStatusCondition`|[Troubleshoot Degraded status errors on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-degraded.md#degraded-port-flapping)|
|`BMMPowerStateDoesNotMatchExpected`|[Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-bmm-power-state-doesnt-match-expected-state)|
|`BMMPxePortIsUnhealthy`|[Troubleshoot 'Warning' detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine](troubleshoot-bare-metal-machine-warning.md#warning-pxe-port-is-unhealthy)|