Commit 7a564c2

Update and rename troubleshoot-baremetalmachine-provisioning.md to troubleshoot-bmm-provisioning.md
Update document name, corrections from PR review and acrolinx corrections.
1 parent 03024a5 commit 7a564c2

1 file changed

+71
-68
lines changed

@@ -1,6 +1,6 @@
11
---
2-
title: Azure Operator Nexus Troubleshooting BareMetal Machine Provisioning
3-
description: Troubleshoot BareMetal Machine Provisioning for Azure Operator Nexus.
2+
title: Azure Operator Nexus troubleshooting BMM provisioning
3+
description: Troubleshoot BMM provisioning for Azure Operator Nexus.
44
ms.service: azure-operator-nexus
55
ms.custom: troubleshooting
66
ms.topic: troubleshooting
@@ -9,29 +9,31 @@ author: bpinto
99
ms.author: bpinto
1010
---
1111

12-
# Troubleshoot BareMetal Machine Provisioning in Nexus Cluster
12+
# Troubleshoot BMM provisioning in Azure Operator Nexus cluster
1313

14-
As part of Cluster deploy action, BareMetal machines (BMM) are provisioned with required roles to participate in the Nexus Cluster. This document supports troubleshooting for common provisioning issues using Azure CLI, Azure Portal, and the server Baseboard Management Controller (BMC). For the Operator Nexus Platform, the underlying Dell server hardware uses Integrated Dell Remote Access Controller (iDRAC) as the BMC.
14+
As part of the cluster deploy action, bare metal machines (BMM) are provisioned with the roles required to participate in the cluster. This document supports troubleshooting of common provisioning issues using Azure CLI, the Azure portal, and the server baseboard management controller (BMC). For the Azure Operator Nexus platform, the underlying server hardware uses the Integrated Dell Remote Access Controller (iDRAC) as the BMC.
1515

1616
## Prerequisites
1717
1. Install the latest version of the [appropriate CLI extensions](howto-install-cli-extensions.md)
1818
2. Gather the following information:
19-
- Subscription ID (SUBSCRIPTION)
20-
- Cluster name (CLUSTER), Resource Group (CLUSTER_RG), and Managed Resource Group (CLUSTER_MRG)
21-
3. The user needs access to the subscription to run Nexus Network Fabric (NF) and Network Cloud (NC) CLI extension commands
19+
- Subscription ID (SUBSCRIPTION)
20+
- Cluster name (CLUSTER)
21+
- Resource group (CLUSTER_RG)
22+
- Managed resource group (CLUSTER_MRG)
23+
3. The user needs access to the subscription to run Azure Operator Nexus network fabric (NF) and network cloud (NC) CLI extension commands
2224
4. Log in to Azure CLI and select the subscription where the cluster is deployed
2325

2426
## BMM roles
25-
For a given SKU, there are required roles to manage and operate the underlying Kubernetes cluster.
27+
For a given SKU, there are required roles to manage and operate the underlying Kubernetes cluster.
2628

27-
The following roles are assigned to BMM resources (see [BMM Roles](reference-near-edge-baremetal-machine-roles.md)):
29+
The following roles are assigned to BMM resources (see [BMM roles reference](reference-near-edge-baremetal-machine-roles.md)):
2830

29-
- `Control plane`: BMM responsible for running the Kubernetes control plane agents for Nexus platform cluster.
30-
- `Management plane`: BMM responsible for running the Nexus platform agents including controllers and extensions.
31-
- `Compute plane`: BMM responsible for running actual tenant workloads including Nexus Kubernetes Clusters and Virtual Machines.
31+
- `Control plane`: BMM responsible for running the Kubernetes control plane agents for the cluster.
32+
- `Management plane`: BMM responsible for running the platform agents including controllers and extensions.
33+
- `Compute plane`: BMM responsible for running actual tenant workloads including Kubernetes clusters and virtual machines.
3234

33-
## Listing BareMetal Machine status
34-
This command will `list` all `bareMetalMachineName` resources in the Managed Resource Group with simple status:
35+
## Listing BMM status
36+
This command will `list` all `bareMetalMachineName` resources in the managed resource group with simple status:
3537

3638
```azurecli
3739
az networkcloud baremetalmachine list -g $CLUSTER_MRG -o table
@@ -41,19 +43,20 @@ Name ResourceGroup DetailedStatus DetailedStatusMes
4143
BMM_NAME CLUSTER_MRG STATUS STATUS_MSG
4244
```
4345

44-
Where `STATUS` goes through the following phases through the BareMetal Machine provisioning process (see [BMM Status in Azure Operator Nexus Compute Concepts](concepts-compute.md)):
46+
Where `STATUS` goes through the following phases through the BMM provisioning process (see [BMM Status in Azure Operator Nexus Compute Concepts](concepts-compute.md)):
4547

4648
`Registering` -> `Preparing` -> `Inspecting` -> `Available` -> `Provisioning` -> `Provisioned`
4749

4850
These phases are defined as follows:
51+
4952
| Phase | Definition |
5053
| --- | --- |
51-
| `Registering` | Verify BMC Connectivity and BMC Credentials, Add BMM to Provisioning Service |
54+
| `Registering` | Verify BMC connectivity and BMC credentials, add BMM to provisioning service |
5255
| `Preparing` | Reboot BMM, reset BMC, verify power state |
53-
| `Inspecting` | Update firmware, apply BIOS settings, and configure RAID |
56+
| `Inspecting` | Update firmware, apply BIOS settings, and configure storage |
5457
| `Available` | BMM ready to install OS |
55-
| `Provisioning` | OS image installing on the BMM, BMM attempts to join Cluster |
56-
| `Provisioned` | BMM successfully provisioned and joined Cluster |
58+
| `Provisioning` | OS image is installing on the BMM, and the BMM attempts to join the cluster |
59+
| `Provisioned` | BMM successfully provisioned and joined the cluster |
5760
| `Deprovisioning` | BMM provisioning failed and retrying |
5861
| `Failed` | BMM provisioning failed and requires recovery action, all retries exhausted |
5962

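The phase table above can be turned into a small triage helper. The following is a minimal sketch, not part of the commit: the function name `bmm_phase_class` and the labels it prints are hypothetical, and the grouping of phases is only one reasonable reading of the table.

```bash
# Hypothetical helper (not part of the Azure CLI): classify a BMM
# detailed status value using the phases listed in the table above.
# Prints one of: done, in-progress, retrying, failed, unknown.
bmm_phase_class() {
  case "$1" in
    Provisioned)
      echo "done" ;;
    Registering|Preparing|Inspecting|Available|Provisioning)
      echo "in-progress" ;;
    Deprovisioning)
      # Per the table: provisioning failed and is being retried.
      echo "retrying" ;;
    Failed)
      # Per the table: all retries exhausted, recovery action needed.
      echo "failed" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For example, `bmm_phase_class "$STATUS"`, where `STATUS` holds the value copied from the `DetailedStatus` column of the list output, prints `failed` only when a recovery action is required.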
@@ -72,29 +75,30 @@ BMM_NAME RSTATE PROV_STATE STATUS STATUS_MSG
7275
```
7376

7477
Where the output is defined as follows:
78+
7579
| Output | Definition |
7680
| --- | --- |
77-
| BMM_NAME | BareMetal Machine Name |
78-
| RSTATE | Cluster Participation Status (`True`,`False`) |
79-
| PROV_STATE | Provisioning State (`Succeeded`,`Failed`) |
80-
| STATUS | Provisioning Detailed Status (`Registering`,`Preparing`,`Inspecting`,`Available`,`Provisioning`,`Provisioned`,`Deprovisioning`,`Failed`) |
81-
| STATUS_MSG | Detailed Provisioning Status Message |
82-
| POWER_STATE | Power State of BMM (`On`,`Off`) |
83-
| BMM_ROLE | BMM Cluster Role contains (`control-plane`,`management-plane`,`compute-plane`) |
84-
| CREATE_DATE | BMM Creation Date |
81+
| BMM_NAME | BMM name |
82+
| RSTATE | Cluster participation status (`True`,`False`) |
83+
| PROV_STATE | Provisioning state (`Succeeded`,`Failed`) |
84+
| STATUS | Provisioning detailed status (`Registering`,`Preparing`,`Inspecting`,`Available`,`Provisioning`,`Provisioned`,`Deprovisioning`,`Failed`) |
85+
| STATUS_MSG | Detailed provisioning status message |
86+
| POWER_STATE | Power state of BMM (`On`,`Off`) |
87+
| BMM_ROLE | BMM cluster role contains (`control-plane`,`management-plane`,`compute-plane`) |
88+
| CREATE_DATE | BMM creation date |
8589

8690
For example:
8791
```azurecli
8892
x01dev01c01w01 True Succeeded Provisioned The OS is provisioned to the machine On platform.afo-nc.microsoft.com/compute-plane=true 2024-05-03T15:12:48.0934793Z
8993
x01dev01c01w01 False Failed Preparing Preparing for provisioning of the machine Off platform.afo-nc.microsoft.com/compute-plane=true 2024-05-03T15:12:48.0934793Z
9094
```
9195

92-
## BareMetal Machine details
96+
## BMM details
9397
To show details and status of a single BMM:
9498
```azurecli
9599
az networkcloud baremetalmachine show -g $CLUSTER_MRG -n $BMM_NAME
96100
```
97-
For important BareMetal Machine details:
101+
For additional BMM details used in troubleshooting:
98102
```azurecli
99103
az networkcloud baremetalmachine show -g $CLUSTER_MRG -n $BMM_NAME --query "{name:name,BootMAC:bootMacAddress,BMCMAC:bmcMacAddress,Connect:bmcConnectionString,SN:serialNumber,rackId:rackId,RackSlot:rackSlot}" -o table
100104
```
@@ -105,20 +109,18 @@ The following conditions can cause provisioning failures:
105109

106110
| Error Type | Resolution |
107111
| ---------- | ---------- |
108-
| BMC shows Backplane Comm | Remote Flea drain, Physical Flea Drain, BareMetal Machine Replace |
109-
| Preboot eXecution Environment (PXE) MAC Address mismatch | BareMetal Machine Replace |
110-
| BMC MAC Address mismatch | BareMetal Machine Replace |
111-
| Boot Network Data not Retrieved from Redfish | Bounce Port, Remote Flea drain, Physical Flea Drain, BareMetal Machine Replace |
112-
| Disk Data not retrieved from Redfish | Re-seat Disk, Re-seat RAID Controller (PERC), Remote Flea drain, Physical Flea Drain, BareMetal Machine Replace |
113-
| BMC Unreachable | Bounce Port, Reseat Cable, Remote Flea drain, Physical Flea Drain, BareMetal Machine Replace |
114-
| BMC fails log in | Update Credentials on BMC, BareMetal Machine Replace |
115-
| DIMM, CPU, OEM Critical Errors | Resolve Hardware Issue, BareMetal Machine Replace |
116-
| Stuck at Grub Loader | Reset NVRAM, BareMetal Machine Replace |
117-
118-
### Azure Bare Metal Machine activity log
112+
| BMC shows `Backplane Comm` critical error | Remote flea drain, physical flea drain, BMM replace action |
113+
| Boot network data response empty from BMC | Bounce port on fabric device, remote flea drain, physical flea drain, BMM replace action |
114+
| Disk data response empty from BMC | Reseat disk, reseat storage controller, remote flea drain, physical flea drain, BMM replace action |
115+
| BMC unreachable | Bounce port on fabric device, reseat cable, remote flea drain, physical flea drain, BMM replace action |
116+
| BMC fails log in | Update credentials on BMC, BMM replace action |
117+
| DIMM, CPU, OEM critical errors | Resolve hardware issue, BMM replace action |
118+
| Console stuck at grub menu | Reset NVRAM, BMM replace action |
119+
120+
### Azure BMM activity log
119121

120122
1. Log in to [Azure Portal](https://portal.azure.com/).
121-
2. Search on the BMM Name in the top `Search` box.
123+
2. Search on the BMM name in the top `Search` box.
122124
3. Select the `Bare Metal Machine (Operator Nexus)` from the search results.
123125
4. Select `Activity log` on the left side menu.
124126
5. Make sure the `Timespan` encompasses the provisioning period.
@@ -128,56 +130,57 @@ The following conditions can cause provisioning failures:
128130
Look for failures related to invalid credentials or BMC unavailable.
129131

130132
### Determine BMC IPv4 address
131-
The IPv4 address of the BMC (BMC_IP) is the `Connect` value returned from the `BareMetal Machine Details` section.
133+
The IPv4 address of the BMC (BMC_IP) is in the `Connect` value returned from the previous `BMM Details` section.
132134

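Extracting the address from the `Connect` value can be scripted. The sketch below assumes the value is URL-like, for example `redfish+https://10.0.0.5/redfish/v1/...`; that format and the function name `bmc_ip_from_connect` are assumptions for illustration, so check the actual string returned for your BMM.

```bash
# Hypothetical helper: print the host portion of a BMC connection string.
# Assumes a URL-like value such as "redfish+https://10.0.0.5/redfish/v1/...".
bmc_ip_from_connect() {
  local rest="${1#*://}"   # drop the scheme through "://", if present
  rest="${rest%%/*}"       # drop the path after the host
  echo "${rest%%:*}"       # drop an optional ":port" suffix (IPv4 only)
}
```

A possible use is `BMC_IP=$(bmc_ip_from_connect "$CONNECT")`, where `CONNECT` holds the `Connect` value from the BMM details query.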
133135
### Validate MAC address of BMM against BMC data
134136

135-
To get the MAC address information from the BareMetal Machine:
137+
To get the MAC address information from the BMM:
136138
```azurecli
137139
az networkcloud baremetalmachine show -g $CLUSTER_MRG -n $BMM_NAME --query "{name:name,BootMAC:bootMacAddress,BMCMAC:bmcMacAddress,SN:serialNumber,rackId:rackId,RackSlot:rackSlot}" -o table
138140
```
139141

140142
Verify the MAC address data against the BMC through the WEB UI:
141-
`BMC` -> `Dashboard` - Shows BMC MAC Address
142-
`BMC` -> `System Info` -> `Network` -> `Embedded.1-1-1` - Shows Boot MAC Address
143+
`BMC` -> `Dashboard` - Shows BMC MAC address
144+
`BMC` -> `System Info` -> `Network` -> `Embedded.1-1-1` - Shows Boot MAC address
143145

144-
Verify the MAC address using `racadm` from a Jumpbox that has access to the BMC network:
146+
Verify the MAC address using `racadm` from a jumpbox that has access to the BMC network:
145147
```bash
146148
racadm --nocertwarn -r $BMC_IP -u $BMC_USR -p $BMC_PWD getsysinfo | grep "MAC Address " #BMC MAC
147149
racadm --nocertwarn -r $BMC_IP -u $BMC_USR -p $BMC_PWD getsysinfo | grep "NIC.Embedded.1-1-1" #Boot MAC
148150
```
149151

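The MAC addresses reported by Azure and by the BMC may differ in case or separator, so a direct string comparison can give a false mismatch. A minimal sketch of a normalized comparison follows; the function names `normalize_mac` and `mac_matches` are hypothetical and not part of `racadm` or the Azure CLI.

```bash
# Hypothetical helper: normalize a MAC address to lowercase hex digits
# with the common separators (":", "-", ".") removed.
normalize_mac() {
  printf '%s' "$1" | tr -d ':.-' | tr '[:upper:]' '[:lower:]'
}

# Returns 0 when both addresses match after normalization, 1 otherwise.
mac_matches() {
  [ "$(normalize_mac "$1")" = "$(normalize_mac "$2")" ]
}
```

For example, `mac_matches "$AZURE_MAC" "$BMC_MAC" || echo "MAC mismatch"`, where both variables are placeholders holding values copied from the command outputs above.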
150-
If the MAC address supplied to the cluster is incorrect, use the BareMetal Machine replace action at [BMM actions](howto-baremetal-functions.md) to correct the addresses.
152+
If the MAC address supplied to the cluster is incorrect, use the BMM replace action at [BMM actions](howto-baremetal-functions.md) to correct the addresses.
151153

152154
### Ping test BMC connectivity
153155

154156
Attempt to run ping against the BMC IPv4 address:
155-
1. Obtain the IPv4 address (BMC_IP) from `Determine BMC IPv4 Address` above.
157+
1. Obtain the IPv4 address (BMC_IP) from the previous `Determine BMC IPv4 address`.
156158
2. Test ping to the BMC:
157159

158-
To test from a Jumpbox that has access to the BMC network:
160+
To test from a jumpbox that has access to the BMC network:
159161
```bash
160162
ping $BMC_IP -c 3
161163
```
162164

163-
To test from a BareMetal Machine control-plane host:
165+
To test from a BMM control-plane host using Azure CLI:
164166
```azurecli
165167
az networkcloud baremetalmachine run-read-command -g $CLUSTER_MRG -n $BMM_NAME --limit-time-seconds 60 --commands "[{command:'ping',arguments:['$BMC_IP',-c,3]}]"
166168
```
167169

168-
### Reset Port on Fabric Device
169-
If the BMC_IP is not responsive, a reset of the fabric port retriggers autonegotiation on the port and may bring it back online.
170+
### Reset port on fabric device
171+
If the BMC_IP is not responsive, a reset of the fabric device port retriggers autonegotiation on the port and may bring it back online.
170172

171173
To find the `Network Fabric` port from Azure:
172-
1. Obtain the RackID and RackSlot from the previous `BareMetal Machine Details` section.
173-
2. In `Azure Portal`, drill down to the `Network Rack` RackID for the BareMetal Machine Rack.
174-
3. Select `Network Devices` tab and the Management (Mgmt) switch for the rack.
175-
4. Under `Resources`, select `Network Interfaces` and then the interface for the BMC (iDRAC) or Boot (PXE) for the port that requires reset.
174+
1. Obtain the `RackID` and `RackSlot` from the previous `BMM Details` section.
175+
2. In `Azure Portal`, drill down to the `Network Rack` RackID for the BMM.
176+
3. Select `Network Devices` tab and the management (Mgmt) switch for the rack.
177+
4. Under `Resources`, select `Network Interfaces` and then the BMC (iDRAC) or boot (PXE) interface for the port that requires reset.
176178

177179
Collect the following information:
178-
- Network Fabric Resource Group (NF_RG)
179-
- Device Name (NF_DEVICE_NAME)
180-
- Interface Name (NF_DEVICE_INTERFACE_NAME).
180+
- Network fabric resource group (NF_RG)
181+
- Device name (NF_DEVICE_NAME)
182+
- Interface name (NF_DEVICE_INTERFACE_NAME)
183+
181184
5. Reset the port:
182185

183186
To reset the port using Azure CLI:
@@ -187,36 +190,36 @@ To find the `Network Fabric` port from Azure:
187190
```
188191

189192
### BMM remote power drain (flea drain)
190-
Perform a remote Flea Drain against the BareMetal Machine through the WEB UI:
193+
Perform a remote flea drain against the BMM through the BMC UI:
191194
`BMC` -> `Configuration` -> `BIOS Settings` -> `Miscellaneous Settings` -> `Select "Full Power Cycle" under Power Cycle Request` -> `Apply and reboot`
192195

193-
Perform a remote flea drain using `racadm` from a Jumpbox that has access to the BMC network:
196+
Perform a remote flea drain using `racadm` from a jumpbox that has access to the BMC network:
194197
```bash
195198
racadm set bios.miscsettings.powercyclerequest FullPowerCycle
196199
racadm jobqueue create BIOS.Setup.1-1
197200
racadm serveraction powercycle
198201
```
199202

200203
### BMM physical power drain (flea drain)
201-
For a physical flea drain, the local site hands physically disconnect the power cables from both power adapters for 5 minutes and then restore power. This process ensures the server, capacitors, and all components have complete power removal and all cached data are cleared.
204+
For a physical flea drain, the local site hands physically disconnect the power cables from both power adapters for 5 minutes and then restore power. This process ensures the server, capacitors, and all components have complete power removal and all cached data is cleared.
202205

203206
### Reset NVRAM
204-
If provisioning failed due to an OEM or hardware error, the boot sequence may be locked in NVRAM to `PXE boot` instead of `hdd` or `hard drive` listed first in the boot order.
207+
If provisioning failed due to an OEM or hardware error, the boot sequence may be locked in NVRAM to `PXE boot` instead of showing `hdd` or `hard drive` listed first in the boot order.
205208

206-
This condition typically shows the BareMetal Machine at the GRUB Bootloader on the console and is blocked without intervention.
209+
This condition typically shows the BMM at the bootloader stage on the console and is blocked without manual keystroke intervention.
207210

208-
To reset the NVRAM, use the following BMC Sequence:
211+
To reset the NVRAM, use the following sequence in the BMC UI:
209212
`Maintenance` -> `Diagnostics` -> `Reset iDrac to Factory Defaults` -> `Discard All Settings, but preserve user and network settings` -> `Apply and reboot`
210213

211214
### Reset BMC password
212-
If the Activity Log indicates invalid credentials on the BMC, run the following command from a Jumpbox that has access to the BMC network:
215+
If the activity log indicates invalid credentials on the BMC, run the following command from a jumpbox that has access to the BMC network:
213216
```bash
214217
racadm -r $BMC_IP -u $BMC_USER -p $CURRENT_PASSWORD set iDRAC.Users.2.Password $BMC_PWD
215218
```
216219

217220
## Adding servers back into the Cluster after a repair
218221

219-
After Hardware is fixed, run BMM Replace following instructions from the following page [BMM actions](howto-baremetal-functions.md).
222+
After the hardware is fixed, run the BMM replace action by following the instructions in [BMM actions](howto-baremetal-functions.md).
220223

221224
If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
222-
For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
225+
For more information about support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
