Commit 163dfff

Merge pull request #280203 from bartpinto/main
Create troubleshoot-baremetalmachine-provisioning.md
2 parents 31a1d7e + 789a853 commit 163dfff

2 files changed

+229
-0
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -286,6 +286,8 @@
286286
href: howto-baremetal-run-data-extract.md
287287
- name: Troubleshoot Control Plane Quorum
288288
href: troubleshoot-control-plane-quorum.md
289+
- name: Troubleshoot Bare Metal Machine Provisioning
290+
href: troubleshoot-bare-metal-machine-provisioning.md
289291
- name: FAQ
290292
href: azure-operator-nexus-faq.md
291293
- name: Reference
Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
1+
---
2+
title: Azure Operator Nexus troubleshoot bare metal machine provisioning
3+
description: Troubleshoot bare metal machine provisioning for Azure Operator Nexus.
4+
ms.service: azure-operator-nexus
5+
ms.custom: troubleshooting
6+
ms.topic: troubleshooting
7+
ms.date: 07/19/2024
8+
author: bpinto
9+
ms.author: bpinto
10+
---
11+
12+
# Troubleshoot BMM provisioning in an Azure Operator Nexus cluster
13+
14+
As part of the cluster deploy action, bare metal machines (BMMs) are provisioned with the roles required to participate in the cluster. This document covers troubleshooting for common provisioning issues by using Azure CLI, the Azure portal, and the server baseboard management controller (BMC). On the Azure Operator Nexus platform, the underlying server hardware uses the integrated Dell Remote Access Controller (iDRAC) as the BMC. Provisioning uses the Preboot eXecution Environment (PXE) interface to load the operating system (OS) on the BMM.
15+
16+
## Prerequisites
17+
1. Install the latest version of the [appropriate CLI extensions](howto-install-cli-extensions.md).
18+
2. Collect the following information:
19+
- Subscription ID (SUBSCRIPTION)
20+
- Cluster name (CLUSTER)
21+
- Resource group (CLUSTER_RG)
22+
- Managed resource group (CLUSTER_MRG)
23+
3. Request subscription access to run Azure Operator Nexus network fabric (NF) and network cloud (NC) CLI extension commands.
24+
4. Log in to Azure CLI and select the subscription where the cluster is deployed.
25+
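The values collected above are used as shell variables in the commands throughout this article. As a minimal sketch, the hypothetical helper `require_vars` below (not part of any Azure tooling) checks that they're set before you run anything against the cluster:

```bash
# Hypothetical helper: verify that each named shell variable is set and non-empty.
# Returns 0 when all are set, 1 otherwise, and reports each missing name on stderr.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use (requires Azure CLI and subscription access):
# require_vars SUBSCRIPTION CLUSTER CLUSTER_RG CLUSTER_MRG \
#   && az login \
#   && az account set --subscription "$SUBSCRIPTION"
```

The `az login` and `az account set` calls are left commented out so that the check can be sourced and run without contacting Azure.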
26+
## BMM roles
27+
For a given SKU, specific roles are required to manage and operate the underlying Kubernetes cluster.
28+
29+
The following roles are assigned to BMM resources (see [BMM roles reference](reference-near-edge-baremetal-machine-roles.md)):
30+
31+
- `Control plane`: BMM responsible for running the Kubernetes control plane agents for the cluster.
32+
- `Management plane`: BMM responsible for running the platform agents including controllers and extensions.
33+
- `Compute plane`: BMM responsible for running actual tenant workloads, including Kubernetes clusters and virtual machines.
34+
35+
## Listing BMM status
36+
The following command lists all BMM resources in the managed resource group with a simple status:
37+
38+
```azurecli
39+
az networkcloud baremetalmachine list -g $CLUSTER_MRG -o table
40+
41+
Name ResourceGroup DetailedStatus DetailedStatusMessage
42+
------------ ----------------------------- ---------------- ---------------------------------------
43+
BMM_NAME CLUSTER_MRG STATUS STATUS_MSG
44+
```
45+
46+
Where `STATUS` progresses through the following phases during BMM provisioning (see [BMM Status in Azure Operator Nexus Compute Concepts](concepts-compute.md)):
47+
48+
`Registering` -> `Preparing` -> `Inspecting` -> `Available` -> `Provisioning` -> `Provisioned`
49+
50+
These phases are defined as follows:
51+
52+
| Phase | Actions |
53+
| --- | --- |
54+
| `Registering` | Verifying BMC connectivity/BMC credentials and adding BMM to provisioning service. |
55+
| `Preparing` | Rebooting BMM, resetting BMC, and verifying power state. |
56+
| `Inspecting` | Updating firmware, applying BIOS settings, and configuring storage. |
57+
| `Available` | BMM is ready to install OS. |
58+
| `Provisioning` | OS image installing on the BMM. After OS is installed, BMM attempts to join cluster. |
59+
| `Provisioned` | BMM successfully provisioned and joined to cluster. |
60+
| `Deprovisioning` | BMM provisioning failed. Provisioning service is cleaning up resource for retry. |
61+
| `Failed` | BMM provisioning failed and manual recovery is required. All retries exhausted. |
62+
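When you script around these phases, it can help to collapse them into a few coarse categories. The following sketch does that with a hypothetical `classify_bmm_status` function. The category labels (`in-progress`, `ready`, `retrying`, `failed`, `unknown`) are illustrative assumptions, not values returned by Azure.

```bash
# Hypothetical helper: map a BMM DetailedStatus value from the table above
# to a coarse category that a script can branch on.
classify_bmm_status() {
  case "$1" in
    Registering|Preparing|Inspecting|Available|Provisioning)
      echo "in-progress" ;;   # still moving through the provisioning phases
    Provisioned)
      echo "ready" ;;         # provisioned and joined to the cluster
    Deprovisioning)
      echo "retrying" ;;      # provisioning service is cleaning up for a retry
    Failed)
      echo "failed" ;;        # retries exhausted, manual recovery required
    *)
      echo "unknown" ;;       # anything not listed in the table
  esac
}

# Example use (requires Azure CLI and access to the cluster):
# classify_bmm_status "$(az networkcloud baremetalmachine show \
#   -g "$CLUSTER_MRG" -n "$BMM_NAME" --query detailedStatus -o tsv)"
```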
63+
During any phase, the BMM detailed status is set to failed and the phase is blocked if any of the following occurs:
64+
- BMC is unavailable
65+
- Network port is down
66+
- Hardware component fails
67+
68+
To get a more detailed status of the BMM:
69+
```azurecli
70+
az networkcloud baremetalmachine list -g $CLUSTER_MRG --query "sort_by([].{name:name,readyState:readyState,provisioningState:provisioningState,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage,powerState:powerState,machineRoles:machineRoles| join(', ', @),createdAt:systemData.createdAt}, &name)" --output table
71+
72+
Name ReadyState ProvisioningState DetailedStatus DetailedStatusMessage PowerState MachineRoles CreatedAt
73+
------------ ---------- ----------------- -------------- ----------------------------------------- ---------- ------------------------------------------------ -----------
74+
BMM_NAME RSTATE PROV_STATE STATUS STATUS_MSG POWER_STATE BMM_ROLE CREATE_DATE
75+
```
76+
77+
Where the output is defined as follows:
78+
79+
| Output | Definition |
80+
| --- | --- |
81+
| BMM_NAME | BMM name |
82+
| RSTATE | Cluster participation status (`True`,`False`). |
83+
| PROV_STATE | Provisioning state (`Succeeded`,`Failed`). |
84+
| STATUS | Provisioning detailed status (`Registering`,`Preparing`,`Inspecting`,`Available`,`Provisioning`,`Provisioned`,`Deprovisioning`,`Failed`). |
85+
| STATUS_MSG | Detailed provisioning status message. |
86+
| POWER_STATE | Power state of BMM (`On`,`Off`). |
87+
| BMM_ROLE | BMM cluster role (`control-plane`,`management-plane`,`compute-plane`). |
88+
| CREATE_DATE | BMM creation date. |
89+
90+
For example:
91+
```azurecli
92+
x01dev01c01w01 True Succeeded Provisioned The OS is provisioned to the machine On platform.afo-nc.microsoft.com/compute-plane=true 2024-05-03T15:12:48.0934793Z
93+
x01dev01c01w01 False Failed Preparing Preparing for provisioning of the machine Off platform.afo-nc.microsoft.com/compute-plane=true 2024-05-03T15:12:48.0934793Z
94+
```
95+
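On a large cluster, most rows in this output are healthy. As a rough sketch, the hypothetical `filter_unprovisioned` function below keeps only the machines that aren't yet `Provisioned`. It assumes tab-separated input with the name, detailed status, and detailed status message in that order, which is the shape that `-o tsv` produces for the query shown in the comment.

```bash
# Hypothetical helper: read tab-separated lines (name, detailedStatus,
# detailedStatusMessage) on stdin and print only the rows whose
# detailed status is not "Provisioned".
filter_unprovisioned() {
  awk -F'\t' '$2 != "Provisioned"'
}

# Example use (requires Azure CLI and access to the cluster):
# az networkcloud baremetalmachine list -g "$CLUSTER_MRG" \
#   --query "[].[name,detailedStatus,detailedStatusMessage]" -o tsv \
#   | filter_unprovisioned
```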
96+
## BMM details
97+
To show details and status of a single BMM:
98+
```azurecli
99+
az networkcloud baremetalmachine show -g $CLUSTER_MRG -n $BMM_NAME
100+
```
101+
For BMM details specific to troubleshooting:
102+
```azurecli
103+
az networkcloud baremetalmachine show -g $CLUSTER_MRG -n $BMM_NAME --query "{name:name,BootMAC:bootMacAddress,BMCMAC:bmcMacAddress,Connect:bmcConnectionString,SN:serialNumber,rackId:rackId,RackSlot:rackSlot}" -o table
104+
```
105+
106+
## Troubleshooting failed provisioning state
107+
108+
The following conditions can cause provisioning failures:
109+
110+
| Error Type | Resolution |
111+
| ---------- | ---------- |
112+
| BMC shows `Backplane Comm` critical error. | 1) Execute BMM remote flea drain. 2) Perform BMM physical flea drain. 3) Execute BMM `replace` action. |
113+
| Boot (PXE) network data response empty from BMC. | 1) Reset port on fabric device. 2) Execute BMM remote flea drain. 3) Perform BMM physical flea drain. 4) Execute BMM `replace` action. |
114+
| Boot (PXE) MAC address mismatch. | 1) Validate BMM MAC address data against BMC data. 2) Execute BMM remote flea drain. 3) Perform BMM physical flea drain. 4) Execute BMM `replace` action. |
115+
| BMC MAC address mismatch | 1) Validate BMM MAC address data against BMC data. 2) Execute BMM remote flea drain. 3) Perform BMM physical flea drain. 4) Execute BMM `replace` action. |
116+
| Disk data response empty from BMC. | 1) Remove/replace disk. 2) Remove/replace storage controller. 3) Execute BMM remote flea drain. 4) Perform BMM physical flea drain. 5) Execute BMM `replace` action. |
117+
| BMC unreachable. | 1) Reset port on fabric device. 2) Remove/replace cable. 3) Execute BMM remote flea drain. 4) Perform BMM physical flea drain. 5) Execute BMM `replace` action. |
118+
| BMC login fails. | 1) Update credentials on BMC. 2) Execute BMM `replace` action. |
119+
| Memory, CPU, OEM critical errors on BMC. | 1) Resolve hardware issue with remove/replace. 2) Execute BMM remote flea drain. 3) Perform BMM physical flea drain. 4) Execute BMM `replace` action. |
120+
| Console stuck at boot loader (GRUB) menu. | 1) Execute NVRAM reset. 2) Execute BMM `replace` action. |
121+
122+
### Azure BMM activity log
123+
124+
1. Log in to [Azure portal](https://portal.azure.com/).
125+
2. Search on the BMM name in the top `Search` box.
126+
3. Select the `Bare Metal Machine (Operator Nexus)` from the search results.
127+
4. Select `Activity log` on the left side menu.
128+
5. Make sure the `Timespan` encompasses the provisioning period.
129+
6. Expand the `BareMetalMachines_Update` operation and select any that show `Failed` status.
130+
7. Select `JSON` tab to get the detailed status message.
131+
132+
Look for failures related to invalid credentials or BMC unavailable.
133+
134+
### Determine BMC IPv4 address
135+
The IPv4 address of the BMC (BMC_IP) is in the `Connect` value returned from the previous `BMM Details` section.
136+
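To pull the address out of the `Connect` value in a script, a sketch like the following may help. It assumes the connection string has a URL-like form such as `redfish+https://10.0.0.5/redfish/v1/Systems/System.Embedded.1`. That format and the function name `bmc_ip_from_connect` are assumptions for illustration, so check them against the value your cluster actually returns.

```bash
# Hypothetical helper: extract the host part of a URL-like BMC connection string.
# Assumes an IPv4 host; an IPv6 literal would be cut at its first colon.
bmc_ip_from_connect() {
  local s="$1"
  s="${s#*://}"   # drop the scheme, for example "redfish+https://"
  s="${s%%/*}"    # drop the path after the host
  s="${s%%:*}"    # drop an optional ":port" suffix
  echo "$s"
}

# Example use (the address shown is a placeholder, not real data):
# BMC_IP="$(bmc_ip_from_connect 'redfish+https://10.0.0.5/redfish/v1/Systems/System.Embedded.1')"
```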
137+
### Validate MAC address of BMM against BMC data
138+
139+
To get the MAC address information from the BMM:
140+
```azurecli
141+
az networkcloud baremetalmachine show -g $CLUSTER_MRG -n $BMM_NAME --query "{name:name,BootMAC:bootMacAddress,BMCMAC:bmcMacAddress,SN:serialNumber,rackId:rackId,RackSlot:rackSlot}" -o table
142+
```
143+
144+
Verify the MAC address data against the BMC through the web UI:
145+
`BMC` -> `Dashboard` - Shows BMC MAC address
146+
`BMC` -> `System Info` -> `Network` -> `Embedded.1-1-1` - Shows Boot MAC address
147+
148+
Verify the MAC address using `racadm` from a jumpbox that has access to the BMC network:
149+
```bash
150+
racadm --nocertwarn -r $BMC_IP -u $BMC_USR -p $BMC_PWD getsysinfo | grep "MAC Address " #BMC MAC
151+
racadm --nocertwarn -r $BMC_IP -u $BMC_USR -p $BMC_PWD getsysinfo | grep "NIC.Embedded.1-1-1" #Boot MAC
152+
```
153+
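The two sources can print the same MAC address with different letter case or separators, which makes a visual comparison error-prone. The following sketch normalizes both values before comparing them. The function names `normalize_mac` and `mac_matches` are hypothetical helpers, not part of `racadm` or Azure CLI.

```bash
# Hypothetical helper: lowercase a MAC address and strip ':', '-' and '.' separators.
normalize_mac() {
  printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d ':.-'
}

# Hypothetical helper: succeed (exit 0) when two MAC addresses are the same
# after normalization, fail (exit 1) otherwise.
mac_matches() {
  [ "$(normalize_mac "$1")" = "$(normalize_mac "$2")" ]
}

# Example use, with the value from Azure CLI and the value read from the BMC:
# mac_matches "$BMM_BOOT_MAC" "$BMC_REPORTED_BOOT_MAC" || echo "Boot MAC mismatch"
```

`BMM_BOOT_MAC` and `BMC_REPORTED_BOOT_MAC` are placeholder variable names for the two values that you collected.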
154+
If the MAC address supplied to the cluster is incorrect, use the BMM `replace` action at [BMM actions](howto-baremetal-functions.md) to correct the addresses.
155+
156+
### Ping test BMC connectivity
157+
158+
Attempt to run ping against the BMC IPv4 address:
159+
1. Obtain the IPv4 address (BMC_IP) from the previous `Determine BMC IPv4 address`.
160+
2. Test ping to the BMC:
161+
162+
To test from a jumpbox that has access to the BMC network:
163+
```bash
164+
ping $BMC_IP -c 3
165+
```
166+
167+
To test from a BMM control-plane host using Azure CLI:
168+
```azurecli
169+
az networkcloud baremetalmachine run-read-command -g $CLUSTER_MRG -n $BMM_NAME --limit-time-seconds 60 --commands "[{command:'ping',arguments:['$BMC_IP',-c,3]}]"
170+
```
171+
172+
### Reset port on fabric device
173+
If the BMC_IP isn't responsive, a reset of the fabric device port retriggers autonegotiation on the port and may bring it back online.
174+
175+
To find the `Network Fabric` port from Azure:
176+
1. Obtain the `RackID` and `RackSlot` from the previous `BMM Details` section.
177+
2. In Azure portal, drill down to the `Network Rack` RackID for the BMM.
178+
3. Select `Network Devices` tab and the management (Mgmt) switch for the rack.
179+
4. Under `Resources`, select `Network Interfaces` and then the BMC (iDRAC) or boot (PXE) interface for the port that requires reset.
180+
181+
Collect the following information:
182+
- Network fabric resource group (NF_RG)
183+
- Device name (NF_DEVICE_NAME)
184+
- Interface name (NF_DEVICE_INTERFACE_NAME)
185+
186+
5. Reset the port:
187+
188+
To reset the port using Azure CLI:
189+
```azurecli
190+
az networkfabric interface update-admin-state -g $NF_RG --network-device-name $NF_DEVICE_NAME --resource-name $NF_DEVICE_INTERFACE_NAME --state Disable
191+
az networkfabric interface update-admin-state -g $NF_RG --network-device-name $NF_DEVICE_NAME --resource-name $NF_DEVICE_INTERFACE_NAME --state Enable
192+
```
193+
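The disable and enable steps can be wrapped in one function so that they always run as a pair. The following is a minimal sketch: the function name `reset_fabric_port`, the `DRY_RUN` switch, and the 30-second default pause between the two calls are all assumptions added for illustration, not documented requirements.

```bash
# Hypothetical helper: disable and then re-enable a fabric interface.
# Arguments: resource group, device name, interface name, optional pause in seconds.
# Set DRY_RUN=1 to print the az commands instead of running them.
reset_fabric_port() {
  local rg="$1" device="$2" iface="$3" wait_s="${4:-30}"
  local run="" state
  if [ -n "${DRY_RUN:-}" ]; then
    run="echo"
  fi
  for state in Disable Enable; do
    $run az networkfabric interface update-admin-state -g "$rg" \
      --network-device-name "$device" --resource-name "$iface" \
      --state "$state" || return 1
    # Pause after Disable so the port is down before it is re-enabled
    # (skipped in dry-run mode; the duration is an assumption).
    if [ "$state" = "Disable" ] && [ -z "${DRY_RUN:-}" ]; then
      sleep "$wait_s"
    fi
  done
}

# Example use:
# DRY_RUN=1 reset_fabric_port "$NF_RG" "$NF_DEVICE_NAME" "$NF_DEVICE_INTERFACE_NAME"
```

Running with `DRY_RUN=1` first lets you confirm the device and interface names before changing the admin state of a live port.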
194+
### BMM remote power drain (flea drain)
195+
Perform a remote flea drain against the BMM through the BMC UI:
196+
`BMC` -> `Configuration` -> `BIOS Settings` -> `Miscellaneous Settings` -> `Select "Full Power Cycle" under Power Cycle Request` -> `Apply and reboot`
197+
198+
Perform a remote flea drain using `racadm` from a jumpbox that has access to the BMC network:
199+
```bash
200+
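racadm --nocertwarn -r $BMC_IP -u $BMC_USR -p $BMC_PWD set bios.miscsettings.powercyclerequest FullPowerCycle
201+
racadm --nocertwarn -r $BMC_IP -u $BMC_USR -p $BMC_PWD jobqueue create BIOS.Setup.1-1
202+
racadm --nocertwarn -r $BMC_IP -u $BMC_USR -p $BMC_PWD serveraction powercycle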
203+
```
204+
205+
### BMM physical power drain (flea drain)
206+
For a physical flea drain, on-site personnel physically disconnect the power cables from both power adapters for 5 minutes and then restore power. This process ensures that power is completely removed from the server, capacitors, and all components, and that all cached data is cleared.
207+
208+
### Reset NVRAM
209+
If provisioning failed due to an OEM or hardware error, the boot order stored in NVRAM may remain locked to `PXE boot` instead of listing `hdd` or `hard drive` first.
210+
211+
In this condition, the console typically shows the BMM stopped at the boot loader stage, and the boot can't continue without manual keystroke intervention.
212+
213+
To reset the NVRAM, use the following sequence in the BMC UI:
214+
`Maintenance` -> `Diagnostics` -> `Reset iDrac to Factory Defaults` -> `Discard All Settings, but preserve user and network settings` -> `Apply and reboot`
215+
216+
### Reset BMC password
217+
If the activity log indicates invalid credentials on the BMC, run the following command from a jumpbox that has access to the BMC network:
218+
```bash
219+
racadm -r $BMC_IP -u $BMC_USR -p $CURRENT_PASSWORD set iDRAC.Users.2.Password $BMC_PWD
220+
```
221+
222+
## Adding servers back into the cluster after a repair
223+
224+
After the hardware is fixed, run the BMM `replace` action by following the instructions in [BMM actions](howto-baremetal-functions.md).
225+
226+
If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
227+
For more information about support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).