|
| 1 | +--- |
| 2 | +title: Troubleshoot control plane quorum loss |
| 3 | +description: Learn how to recover from control plane quorum loss |
| 4 | +ms.topic: article |
| 5 | +ms.date: 01/18/2024 |
| 6 | +author: matthewernst |
| 7 | +ms.author: matthewernst |
| 8 | +ms.service: azure-operator-nexus |
| 9 | +--- |
| 10 | + |
| 11 | +# Troubleshoot control plane quorum loss |
| 12 | + |
| 13 | +Follow this troubleshooting guide when multiple control plane nodes are offline or unavailable. |
| 14 | + |
| 15 | +## Prerequisites |
| 16 | + |
| 17 | +- Install the latest version of the |
| 18 | + [appropriate Azure CLI extensions](./howto-install-cli-extensions.md). |
| 19 | +- Gather the following information: |
| 20 | + - Subscription ID |
| 21 | + - Cluster name and resource group |
| 22 | + - Bare metal machine name |
| 23 | +- Ensure you're logged in using `az login` |
| 24 | + |
| 25 | + |
| 26 | +## Symptoms |
| 27 | + |
| 28 | +- Kubernetes API isn't available |
| 29 | +- Multiple control plane nodes are offline or unavailable |
| 30 | + |
| 31 | +## Procedure |
| 32 | + |
| 33 | +1. Identify the Nexus management node |
| 34 | +- To list the bare metal machines and identify the management nodes, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name>` |
| 35 | +- Log in to the identified server |
| 36 | +- Confirm that the ironic-conductor service is present on this node by running `sudo crictl ps -a | grep -i ironic-conductor` |
| 37 | + Example output: |
| 38 | + |
| 39 | +~~~ |
| 40 | +testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor |
| 41 | +<id> <id> 6 hours ago Running ironic-conductor 0 <id> |
| 42 | +~~~ |
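
The check in step 1 can also be scripted. The following is a minimal sketch, assuming the `crictl ps -a` output has the column layout shown in the example above; `has_running_ironic_conductor` is a hypothetical helper name, not part of `crictl` or the Azure CLI.

```shell
# Sketch only: has_running_ironic_conductor is a hypothetical helper.
# It reads `crictl ps -a` output on stdin and succeeds (exit status 0)
# only if at least one line mentions ironic-conductor and the state Running.
has_running_ironic_conductor() {
  awk 'tolower($0) ~ /ironic-conductor/ && /Running/ { found = 1 }
       END { exit !found }'
}

# Usage on the node (assumes sudo access and crictl on the PATH):
#   sudo crictl ps -a | has_running_ironic_conductor && echo "ironic-conductor is running"
```

A non-zero exit status means no running ironic-conductor container was found in the input, which suggests the server isn't the management node you're looking for.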
| 43 | + |
| 44 | +2. Determine the iDRAC IP address of the server |
| 45 | +- Run the command `az networkcloud cluster list -g <ResourceGroup_Name>` |
| 46 | +- The output is JSON that includes the iDRAC IP address in the `bmcConnectionString` field |
| 47 | + |
| 48 | + ~~~ |
| 49 | + { |
| 50 | + "bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1", |
| 51 | + "bmcCredentials": { |
| 52 | + "username": "<username>" |
| 53 | + }, |
| 54 | + "bmcMacAddress": "<bmcMacAddress>", |
| 55 | + "bootMacAddress": "<bootMacAddress>", |
| 56 | + "machineDetails": "extraDetails", |
| 57 | + "machineName": "<machineName>", |
| 58 | + "rackSlot": <rackSlot>, |
| 59 | + "serialNumber": "<serialNumber>" |
| 60 | + }, |
| 61 | + ~~~ |
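
The iDRAC IP address is the host portion of `bmcConnectionString`. The following is a minimal sketch of pulling it out with POSIX shell parameter expansion, assuming the value has the `redfish+https://<ip>/...` shape shown in the sample output; `extract_bmc_host` is a hypothetical helper name.

```shell
# Sketch only: extract_bmc_host is a hypothetical helper.
# It prints the host part of a connection string such as
# "redfish+https://<ip>/redfish/v1/Systems/System.Embedded.1".
extract_bmc_host() {
  # Drop the scheme, that is, everything up to and including "://".
  _rest="${1#*://}"
  # Keep only the text before the first "/" that follows the host.
  printf '%s\n' "${_rest%%/*}"
}

# Usage with a value copied from the command output:
#   extract_bmc_host "<bmcConnectionString value>"
```

The function only manipulates the string it's given and doesn't call the Azure CLI, so copy the `bmcConnectionString` value for the impacted machine from the command output yourself.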
| 62 | + |
| 63 | +3. In your browser, open the iDRAC GUI at that IP address and shut down the impacted management servers |
| 64 | + |
| 65 | + :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-shutdown.png" alt-text="Screenshot of an iDRAC GUI and the button to perform a graceful shutdown." lightbox="media\troubleshoot-control-plane-quorum\graceful-shutdown.png"::: |
| 66 | + |
| 67 | +4. When all impacted management servers are down, turn on the servers using the iDRAC GUI |
| 68 | + |
| 69 | + :::image type="content" source="media\troubleshoot-control-plane-quorum\graceful-power-on.png" alt-text="Screenshot of an iDRAC GUI and the button to perform power on command." lightbox="media\troubleshoot-control-plane-quorum\graceful-power-on.png"::: |
| 70 | + |
| 71 | +5. The servers should now be restored. If they aren't, contact Microsoft support. |