description: Troubleshoot BMM provisioning for Azure Operator Nexus.
4
4
ms.service: azure-operator-nexus
5
5
ms.custom: troubleshooting
6
6
ms.topic: troubleshooting
@@ -9,29 +9,31 @@ author: bpinto
9
9
ms.author: bpinto
10
10
---
11
11
12
-
# Troubleshoot BareMetal Machine Provisioning in Nexus Cluster
12
+
# Troubleshoot BMM provisioning in Azure Operator Nexus cluster
13
13
14
-
As part of Cluster deploy action, BareMetal machines (BMM) are provisioned with required roles to participate in the Nexus Cluster. This document supports troubleshooting for common provisioning issues using Azure CLI, Azure Portal, and the server Baseboard Management Controller (BMC). For the Operator Nexus Platform, the underlying Dell server hardware uses Integrated Dell Remote Access Controller (iDRAC) as the BMC.
14
+
As part of the cluster deploy action, bare metal machines (BMMs) are provisioned with the required roles to participate in the cluster. This document covers troubleshooting of common provisioning issues using Azure CLI, Azure portal, and the server baseboard management controller (BMC). For the Azure Operator Nexus platform, the underlying server hardware uses the Integrated Dell Remote Access Controller (iDRAC) as the BMC.
15
15
16
16
## Prerequisites
17
17
1. Install the latest version of the [appropriate CLI extensions](howto-install-cli-extensions.md)
18
18
2. Gather the following information:
19
-
- Subscription ID (SUBSCRIPTION)
20
-
- Cluster name (CLUSTER), Resource Group (CLUSTER_RG), and Managed Resource Group (CLUSTER_MRG)
21
-
3. The user needs access to the subscription to run Nexus Network Fabric (NF) and Network Cloud (NC) CLI extension commands
19
+
- Subscription ID (SUBSCRIPTION)
20
+
- Cluster name (CLUSTER)
21
+
- Resource group (CLUSTER_RG)
22
+
- Managed resource group (CLUSTER_MRG)
23
+
3. The user needs access to the subscription to run Azure Operator Nexus network fabric (NF) and network cloud (NC) CLI extension commands
22
24
4. Log in to Azure CLI and select the subscription where the cluster is deployed
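
The prerequisites above can be checked before running any commands. The following is a minimal sketch, not part of the original document: `check_prereqs` is a hypothetical helper name, and it only assumes the variable names listed in step 2.

```bash
# Hypothetical helper: report which prerequisite variables are still unset.
check_prereqs() {
  local name value missing=0
  for name in SUBSCRIPTION CLUSTER CLUSTER_RG CLUSTER_MRG; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (placeholder values, replace with your own):
#   export SUBSCRIPTION="<subscription-id>" CLUSTER="<cluster-name>"
#   export CLUSTER_RG="<resource-group>" CLUSTER_MRG="<managed-resource-group>"
#   check_prereqs && az login && az account set --subscription "$SUBSCRIPTION"
```

The helper returns a non-zero exit code when any variable is empty, so it can gate the login step as shown in the usage comment.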
23
25
24
26
## BMM roles
25
-
For a given SKU, there are required roles to manage and operate the underlying Kubernetes cluster.
27
+
For a given SKU, there are required roles to manage and operate the underlying Kubernetes cluster.
26
28
27
-
The following roles are assigned to BMM resources (see [BMM Roles](reference-near-edge-baremetal-machine-roles.md)):
29
+
The following roles are assigned to BMM resources (see [BMM roles reference](reference-near-edge-baremetal-machine-roles.md)):
28
30
29
-
-`Control plane`: BMM responsible for running the Kubernetes control plane agents for Nexus platform cluster.
30
-
-`Management plane`: BMM responsible for running the Nexus platform agents including controllers and extensions.
31
-
-`Compute plane`: BMM responsible for running actual tenant workloads including Nexus Kubernetes Clusters and Virtual Machines.
31
+
-`Control plane`: BMM responsible for running the Kubernetes control plane agents for the cluster.
32
+
-`Management plane`: BMM responsible for running the platform agents including controllers and extensions.
33
+
-`Compute plane`: BMM responsible for running actual tenant workloads including Kubernetes clusters and virtual machines.
32
34
33
-
## Listing BareMetal Machine status
34
-
This command will `list` all `bareMetalMachineName` resources in the Managed Resource Group with simple status:
35
+
## Listing BMM status
36
+
This command will `list` all `bareMetalMachineName` resources in the managed resource group with simple status:
35
37
36
38
```azurecli
37
39
az networkcloud baremetalmachine list -g $CLUSTER_MRG -o table
@@ -41,19 +43,20 @@ Name ResourceGroup DetailedStatus DetailedStatusMes
41
43
BMM_NAME CLUSTER_MRG STATUS STATUS_MSG
42
44
```
43
45
44
-
Where `STATUS` goes through the following phases through the BareMetal Machine provisioning process (see [BMM Status in Azure Operator Nexus Compute Concepts](concepts-compute.md)):
46
+
Where `STATUS` goes through the following phases through the BMM provisioning process (see [BMM Status in Azure Operator Nexus Compute Concepts](concepts-compute.md)):
|`Registering`| Verify BMC Connectivity and BMC Credentials, Add BMM to Provisioning Service|
54
+
|`Registering`| Verify BMC connectivity and BMC credentials, add BMM to provisioning service|
52
55
|`Preparing`| Reboot BMM, reset BMC, verify power state |
53
-
|`Inspecting`| Update firmware, apply BIOS settings, and configure RAID|
56
+
|`Inspecting`| Update firmware, apply BIOS settings, and configure storage|
54
57
|`Available`| BMM ready to install OS |
55
-
|`Provisioning`| OS image installing on the BMM, BMM attempts to join Cluster|
56
-
|`Provisioned`| BMM successfully provisioned and joined Cluster|
58
+
|`Provisioning`| OS image installing on the BMM, BMM attempts to join the cluster|
59
+
|`Provisioned`| BMM successfully provisioned and joined the cluster|
57
60
|`Deprovisioning`| BMM provisioning failed and retrying |
58
61
|`Failed`| BMM provisioning failed and requires recovery action, all retries exhausted |
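
As an illustration of how the phases in this table might drive a script, the sketch below maps each `STATUS` value to a coarse next step. The function name `bmm_next_step` and the labels `done`, `recover`, `wait`, and `unknown` are hypothetical and are not defined by the service.

```bash
# Hypothetical helper: map a detailed status to a coarse next step.
bmm_next_step() {
  case "$1" in
    Provisioned)
      echo "done" ;;      # BMM has joined the cluster
    Failed)
      echo "recover" ;;   # retries exhausted, recovery action required
    Registering|Preparing|Inspecting|Available|Provisioning|Deprovisioning)
      echo "wait" ;;      # provisioning or an automatic retry is in progress
    *)
      echo "unknown" ;;   # status not listed in the table
  esac
}

# Usage:
#   bmm_next_step "Inspecting"
```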
59
62
@@ -72,29 +75,30 @@ BMM_NAME RSTATE PROV_STATE STATUS STATUS_MSG
72
75
```
73
76
74
77
Where the output is defined as follows:
78
+
75
79
| Output | Definition |
76
80
| --- | --- |
77
-
| BMM_NAME |BareMetal Machine Name|
78
-
| RSTATE | Cluster Participation Status (`True`,`False`) |
79
-
| PROV_STATE | Provisioning State (`Succeeded`,`Failed`) |
80
-
| STATUS | Provisioning Detailed Status (`Registering`,`Preparing`,`Inspecting`,`Available`,`Provisioning`,`Provisioned`,`Deprovisioning`,`Failed`) |
81
-
| STATUS_MSG | Detailed Provisioning Status Message|
82
-
| POWER_STATE | Power State of BMM (`On`,`Off`) |
83
-
| BMM_ROLE | BMM Cluster Role contains (`control-plane`,`management-plane`,`compute-plane`) |
84
-
| CREATE_DATE | BMM Creation Date|
81
+
| BMM_NAME |BMM name|
82
+
| RSTATE | Cluster participation status (`True`,`False`) |
83
+
| PROV_STATE | Provisioning state (`Succeeded`,`Failed`) |
84
+
| STATUS | Provisioning detailed status (`Registering`,`Preparing`,`Inspecting`,`Available`,`Provisioning`,`Provisioned`,`Deprovisioning`,`Failed`) |
85
+
| STATUS_MSG | Detailed provisioning status message|
86
+
| POWER_STATE | Power state of BMM (`On`,`Off`) |
87
+
| BMM_ROLE | BMM cluster role contains (`control-plane`,`management-plane`,`compute-plane`) |
88
+
| CREATE_DATE | BMM creation date|
85
89
86
90
For example:
87
91
```azurecli
88
92
x01dev01c01w01 True Succeeded Provisioned The OS is provisioned to the machine On platform.afo-nc.microsoft.com/compute-plane=true 2024-05-03T15:12:48.0934793Z
89
93
x01dev01c01w01 False Failed Preparing Preparing for provisioning of the machine Off platform.afo-nc.microsoft.com/compute-plane=true 2024-05-03T15:12:48.0934793Z
90
94
```
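
Rows like the ones above can be filtered to surface only the machines that need attention. This is a rough sketch under stated assumptions: `flag_unhealthy_bmm` is a hypothetical helper, the input contains data rows only (no header or separator lines), and columns are whitespace-separated in the order shown in the example.

```bash
# Hypothetical helper: read data rows on stdin and print "<name> <status>"
# for each BMM that is not fully provisioned.
# Assumed column order, as in the example rows:
#   $1 = BMM_NAME, $2 = RSTATE, $3 = PROV_STATE, $4 = STATUS
flag_unhealthy_bmm() {
  # NF >= 4 skips blank or truncated lines.
  awk 'NF >= 4 && ($2 != "True" || $3 != "Succeeded" || $4 != "Provisioned") { print $1, $4 }'
}

# Usage (bmm_rows.txt is a hypothetical file holding saved data rows):
#   flag_unhealthy_bmm < bmm_rows.txt
```

Only the first four columns are inspected, so the free-text status message that follows them does not affect the result.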
91
95
92
-
## BareMetal Machine details
96
+
## BMM details
93
97
To show details and status of a single BMM:
94
98
```azurecli
95
99
az networkcloud baremetalmachine show -g $CLUSTER_MRG -n $BMM_NAME
96
100
```
97
-
For important BareMetal Machine details:
101
+
For additional BMM details used in troubleshooting:
98
102
```azurecli
99
103
az networkcloud baremetalmachine show -g $CLUSTER_MRG -n $BMM_NAME --query "{name:name,BootMAC:bootMacAddress,BMCMAC:bmcMacAddress,Connect:bmcConnectionString,SN:serialNumber,rackId:rackId,RackSlot:rackSlot}" -o table
100
104
```
@@ -105,20 +109,18 @@ The following conditions can cause provisioning failures:
If the MAC address supplied to the cluster is incorrect, use the BareMetal Machine replace action at [BMM actions](howto-baremetal-functions.md) to correct the addresses.
152
+
If the MAC address supplied to the cluster is incorrect, use the BMM replace action at [BMM actions](howto-baremetal-functions.md) to correct the addresses.
151
153
152
154
### Ping test BMC connectivity
153
155
154
156
Attempt to run ping against the BMC IPv4 address:
155
-
1. Obtain the IPv4 address (BMC_IP) from `Determine BMC IPv4 Address` above.
157
+
1. Obtain the IPv4 address (BMC_IP) from the previous `Determine BMC IPv4 address` section.
156
158
2. Test ping to the BMC:
157
159
158
-
To test from a Jumpbox that has access to the BMC network:
160
+
To test from a jumpbox that has access to the BMC network:
159
161
```bash
160
162
ping $BMC_IP -c 3
161
163
```
162
164
163
-
To test from a BareMetal Machine control-plane host:
165
+
To test from a BMM control-plane host using Azure CLI:
If the BMC_IP is not responsive, a reset of the fabric port retriggers autonegotiation on the port and may bring it back online.
170
+
### Reset port on fabric device
171
+
If the BMC_IP is not responsive, a reset of the fabric device port retriggers autonegotiation on the port and may bring it back online.
170
172
171
173
To find the `Network Fabric` port from Azure:
172
-
1. Obtain the RackID and RackSlot from the previous `BareMetal Machine Details` section.
173
-
2. In `Azure Portal`, drill down to the `Network Rack` RackID for the BareMetal Machine Rack.
174
-
3. Select `Network Devices` tab and the Management (Mgmt) switch for the rack.
175
-
4. Under `Resources`, select `Network Interfaces` and then the interface for the BMC (iDRAC) or Boot (PXE) for the port that requires reset.
174
+
1. Obtain the `RackID` and `RackSlot` from the previous `BMM details` section.
175
+
2. In `Azure Portal`, drill down to the `Network Rack` RackID for the BMM.
176
+
3. Select `Network Devices` tab and the management (Mgmt) switch for the rack.
177
+
4. Under `Resources`, select `Network Interfaces` and then the BMC (iDRAC) or boot (PXE) interface for the port that requires reset.
176
178
177
179
Collect the following information:
178
-
- Network Fabric Resource Group (NF_RG)
179
-
- Device Name (NF_DEVICE_NAME)
180
-
- Interface Name (NF_DEVICE_INTERFACE_NAME).
180
+
- Network fabric resource group (NF_RG)
181
+
- Device name (NF_DEVICE_NAME)
182
+
- Interface name (NF_DEVICE_INTERFACE_NAME)
183
+
181
184
5. Reset the port:
182
185
183
186
To reset the port using Azure CLI:
@@ -187,36 +190,36 @@ To find the `Network Fabric` port from Azure:
187
190
```
188
191
189
192
### BMM remote power drain (flea drain)
190
-
Perform a remote Flea Drain against the BareMetal Machine through the WEB UI:
193
+
Perform a remote flea drain against the BMM through the BMC UI:
191
194
`BMC` -> `Configuration` -> `BIOS Settings` -> `Miscellaneous Settings` -> `Select "Full Power Cycle" under Power Cycle Request` -> `Apply and reboot`
192
195
193
-
Perform a remote flea drain using `racadm` from a Jumpbox that has access to the BMC network:
196
+
Perform a remote flea drain using `racadm` from a jumpbox that has access to the BMC network:
194
197
```bash
195
198
racadm set bios.miscsettings.powercyclerequest FullPowerCycle
196
199
racadm jobqueue create BIOS.Setup.1-1
197
200
racadm serveraction powercycle
198
201
```
199
202
200
203
### BMM physical power drain (flea drain)
201
-
For a physical flea drain, the local site hands physically disconnect the power cables from both power adapters for 5 minutes and then restore power. This process ensures the server, capacitors, and all components have complete power removal and all cached data are cleared.
204
+
For a physical flea drain, the local site hands physically disconnect the power cables from both power adapters for 5 minutes and then restore power. This process ensures the server, capacitors, and all components have complete power removal and all cached data is cleared.
202
205
203
206
### Reset NVRAM
204
-
If provisioning failed due to an OEM or hardware error, the boot sequence may be locked in NVRAM to `PXE boot` instead of `hdd` or `hard drive` listed first in the boot order.
207
+
If provisioning failed due to an OEM or hardware error, the boot sequence may be locked in NVRAM to `PXE boot` instead of having `hdd` or `hard drive` listed first in the boot order.
205
208
206
-
This condition typically shows the BareMetal Machine at the GRUB Bootloader on the console and is blocked without intervention.
209
+
In this condition, the console typically shows the BMM stopped at the bootloader stage, and the boot stays blocked until someone intervenes with a manual keystroke.
207
210
208
-
To reset the NVRAM, use the following BMC Sequence:
211
+
To reset the NVRAM, use the following sequence in the BMC UI:
209
212
`Maintenance` -> `Diagnostics` -> `Reset iDrac to Factory Defaults` -> `Discard All Settings, but preserve user and network settings` -> `Apply and reboot`
210
213
211
214
### Reset BMC password
212
-
If the Activity Log indicates invalid credentials on the BMC, run the following command from a Jumpbox that has access to the BMC network:
215
+
If the activity log indicates invalid credentials on the BMC, run the following command from a jumpbox that has access to the BMC network: