Skip to content

Commit 81be99d

Browse files
committed
Updates for Clarity
1 parent 9e37b4a commit 81be99d

File tree

1 file changed

+31
-29
lines changed

1 file changed

+31
-29
lines changed

articles/operator-nexus/troubleshoot-hardware-validation-failure.md

Lines changed: 31 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -11,14 +11,16 @@ ms.author: vanjanikolin
1111

1212
# Troubleshoot hardware validation failure in Nexus Cluster
1313

14-
This article describes how to troubleshoot a failed server hardware validation. Hardware validation (HWV) is run as part of cluster deploy action as well as a bare metal replace action. HWV validates a bare metal machine (BMM) by executing test cases against the baseboard management controller (BMC). The Azure Operator Nexus platform is deployed on Dell servers. Dell servers use the integrated Dell remote access controller (iDRAC) which is the equivalent of a BMC.
14+
This article describes how to troubleshoot a failed server hardware validation. Hardware validation (HWV) is run as part of cluster deploy action and a bare metal replace action. HWV validates a bare metal machine (BMM) by executing test cases against the baseboard management controller (BMC). The Azure Operator Nexus platform is deployed on Dell servers. Dell servers use the integrated Dell remote access controller (iDRAC) which is the equivalent of a BMC.
1515

1616
## Prerequisites
1717

18-
- Gather the following information:
18+
1. Collect the following information:
1919
- Subscription ID
20-
- Cluster name and resource group
21-
- The user needs access to the Cluster's Log Analytics Workspace (LAW)
20+
- Cluster name
21+
- Resource group
22+
2. Request access to the Cluster's Log Analytics Workspace (LAW).
23+
3. Access to BMC webui and/or jumpbox that allows running of racadm utility.
2224

2325
## Locating hardware validation results
2426

@@ -46,7 +48,7 @@ Expanding `result_detail` for a given category shows detailed results.
4648
### System info category
4749

4850
* Memory/RAM Related Failure (memory_capacity_GB)
49-
* Memory specs are defined in the SKU. Memory below threshold value indicates missing or failed DIMM(s). Failed DIMM(s) would also be reflected in the `health_info` category. The following example shows a failed memory check.
51+
* Memory specs are defined in the SKU. Memory below threshold value indicates missing or failed Dual In-Line Memory Module (DIMM). A failed DIMM would also be reflected in the `health_info` category. The following example shows a failed memory check.
5052

5153
```json
5254
{
@@ -115,7 +117,7 @@ Expanding `result_detail` for a given category shows detailed results.
115117
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD getsysinfo | grep Model
116118
```
117119

118-
* To troubleshoot this problem ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
120+
* To troubleshoot this problem, ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
119121

120122
* Serial Number Check Failure (Serial_Number)
121123
* The server's serial number, also referred as the service tag, is defined in the cluster. Failed `Serial_Number` check indicates a mismatch between the serial number in the cluster and the actual serial number of the machine. The following example shows a failed serial number check.
@@ -139,10 +141,10 @@ Expanding `result_detail` for a given category shows detailed results.
139141
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD getsysinfo | grep "Service Tag"
140142
```
141143

142-
* To troubleshoot this problem ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
144+
* To troubleshoot this problem, ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
143145

144146
* iDRAC License Check Failure
145-
* To enable necessary functionality all iDRACs require a perpetual/production iDRAC9 datacenter or enterprise license. Trial licenses are valid for only 30 days. A failed `iDRAC License Check` indicates that the required iDRAC license is missing. The following examples show a failed iDRAC license check for a trial license and missing license respectively.
147+
* All iDRACs require a perpetual/production iDRAC datacenter or enterprise license. Trial licenses are valid for only 30 days. A failed `iDRAC License Check` indicates that the required iDRAC license is missing. The following examples show a failed iDRAC license check for a trial license and missing license respectively.
146148

147149
```json
148150
{
@@ -170,7 +172,7 @@ Expanding `result_detail` for a given category shows detailed results.
170172
### Drive info category
171173

172174
* Disk Check Failure
173-
* Drive specs are defined in the SKU. Mismatched capacity values indicate incorrect drives or drives inserted in to incorrect slots. Missing capacity and type fetched values indicate drives that are failed, missing or inserted in to incorrect slots.
175+
* Drive specs are defined in the SKU. Mismatched capacity values indicate incorrect drives or drives inserted in to incorrect slots. Missing capacity and type fetched values indicate drives that are failed, missing, or inserted in to incorrect slots.
174176

175177
```json
176178
{
@@ -209,12 +211,12 @@ Expanding `result_detail` for a given category shows detailed results.
209211
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD raid get pdisks -o -p State,Size
210212
```
211213

212-
* To troubleshoot ensure that disks are inserted int he correct slots. If the problem persists engage vendor.
214+
* To troubleshoot, ensure that disks are inserted in the correct slots. If the problem persists engage vendor.
213215

214216
### Network info category
215217

216-
* NIC Check Failure
217-
* Dell server NIC specs are defined in the SKU. A mismatched link status indicates loose or faulty cabling or crossed cables. A mismatched model indicates incorrect NIC card is inserted in to slot. Missing link/model fetched values indicate NICs that are failed, missing or inserted in to incorrect slots.
218+
* Network Interface Card (NIC) Check Failure
219+
* Dell server NIC specs are defined in the SKU. A mismatched link status indicates loose or faulty cabling or crossed cables. A mismatched model indicates incorrect NIC card is inserted in to slot. Missing link/model fetched values indicate NICs that are failed, missing, or inserted in to incorrect slots.
218220

219221
```json
220222
{
@@ -271,16 +273,16 @@ Expanding `result_detail` for a given category shows detailed results.
271273
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD hwinventory NIC
272274
```
273275

274-
* To check a specific NIC with racadm provide the FQDD:
276+
* To check a specific NIC with racadm provide the Fully Qualified Device Descriptor (FQDD):
275277

276278
```bash
277279
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD hwinventory NIC.Embedded.1-1-1
278280
```
279281

280-
* To troubleshoot ensure that servers are cabled correctly and that ports are linked up. Bounce port on the fabric. Perform flea drain. If the problem persists engage vendor.
282+
* To troubleshoot, ensure that servers are cabled correctly and that ports are linked up. Bounce port on the fabric. Perform flea drain. If the problem persists engage vendor.
281283

282284
* NIC Check L2 Switch Information
283-
* HW Validation reports L2 switch information for each of the server interfaces. The switch connection ID (switch interface MAC) and switch port connection ID (switch interface label) are informational.
285+
* HWV reports L2 switch information for each of the server interfaces. The switch connection ID (switch interface MAC) and switch port connection ID (switch interface label) are informational.
284286

285287
```json
286288
{
@@ -301,7 +303,7 @@ Expanding `result_detail` for a given category shows detailed results.
301303
```
302304

303305
* Cabling Checks for Bonded Interfaces
304-
* Mismatched cabling is reported in the result_log. Cable check validates that that bonded NICs connect to switch ports with same Port ID. In the following example PCI 3/1 and 3/2 connect to "Ethernet1/1" and "Ethernet1/3" respectively on TOR, triggering a failure for HWV.
306+
* Mismatched cabling is reported in the result_log. Cable check validates that that bonded NICs connect to switch ports with same Port ID. In the following example Peripheral Component Interconnect (PCI) 3/1 and 3/2 connect to "Ethernet1/1" and "Ethernet1/3" respectively on TOR, triggering a failure for HWV.
305307

306308
```json
307309
{
@@ -338,9 +340,9 @@ Expanding `result_detail` for a given category shows detailed results.
338340
}
339341
```
340342

341-
* To troubleshoot this problem ensure that correct MAC address is defined in the cluster. If MAC is correct attempt a flea drain. If problem persists ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
343+
* To troubleshoot this problem, ensure that correct MAC address is defined in the cluster. If MAC is correct attempt a flea drain. If problem persists ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
342344

343-
* PXE MAC Address Check Failure
345+
* Preboot execution environment (PXE) MAC Address Check Failure
344346
* The PXE MAC address is defined in the cluster for each BMM. A failed `PXE_MAC` check indicates a mismatch between the PXE MAC in the cluster and the actual MAC address retrieved from the machine.
345347

346348
```json
@@ -352,12 +354,12 @@ Expanding `result_detail` for a given category shows detailed results.
352354
}
353355
```
354356

355-
* To troubleshoot this problem ensure that correct MAC address is defined in the cluster. If MAC is correct attempt a flea drain. If problem persists ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
357+
* To troubleshoot this problem, ensure that correct MAC address is defined in the cluster. If MAC is correct attempt a flea drain. If problem persists ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
356358

357359
### Health info category
358360

359361
* Health Check Sensor Failure
360-
* Server health checks cover various hardware component sensors. A failed health sensor indicates a problem with the corresponding hardware component. The following examples indicate fan, drive and CPU failures respectively.
362+
* Server health checks cover various hardware component sensors. A failed health sensor indicates a problem with the corresponding hardware component. The following examples indicate fan, drive, and CPU failures respectively.
361363

362364
```json
363365
{
@@ -398,7 +400,7 @@ Expanding `result_detail` for a given category shows detailed results.
398400

399401
* To troubleshoot a server health failure engage vendor.
400402

401-
* Health Check Lifecycle Log (LC Log) Failures
403+
* Health Check Lifecycle (LC) Log Failures
402404
* Dell server health checks fail for recent Critical LC Log Alarms. The hardware validation plugin logs the alarm ID, name, and timestamp. Recent LC Log critical alarms indicate need for further investigation. The following example shows a failure for a critical backplane voltage alarm.
403405

404406
```json
@@ -420,7 +422,7 @@ Expanding `result_detail` for a given category shows detailed results.
420422
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD lclog view -s critical
421423
```
422424

423-
* If `Backplane Comm` critical errors are encountered attempt a flea drain. To troubleshoot any other LC log critical failures engage vendor.
425+
* If `Backplane Comm` critical errors are logged, perform flea drain. Engage vendor to troubleshoot any other LC log critical failures.
424426

425427
* Health Check Server Power Action Failures
426428
* Dell server health checks fail for failed server power-up or failed iDRAC reset. A failed server control action indicates an underlying hardware issue. The following example shows failed power on attempt.
@@ -450,10 +452,10 @@ Expanding `result_detail` for a given category shows detailed results.
450452
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD serveraction powerup
451453
```
452454

453-
* To troubleshoot server power on failure attempt a flea drain. If problem persists engage vendor.
455+
* To troubleshoot server power-on failure attempt a flea drain. If problem persists engage vendor.
454456

455457
* Health Check Power Supply Failure and Redundancy Considerations
456-
* Dell server health checks warn when one power supply is missing or failed. Power supply "field_name" might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies respectively. A failure of one power supply doesn't trigger an HW validation device failure.
458+
* Dell server health checks warn when one power supply is missing or failed. Power supply "field_name" might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies respectively. A failure of one power supply doesn't trigger an HWV device failure.
457459

458460
```json
459461
{
@@ -483,7 +485,7 @@ Expanding `result_detail` for a given category shows detailed results.
483485
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD getsensorinfo | grep PS
484486
```
485487

486-
* To troubleshoot attempt a power supply reseat. If problem persists engage vendor.
488+
* Reseating the power supply might fix the problem. If alarms persist engage vendor.
487489

488490
### Boot info category
489491

@@ -523,12 +525,12 @@ Expanding `result_detail` for a given category shows detailed results.
523525
}
524526
```
525527

526-
* To update the PXE device state and name in BMC webui set the value in the following location below then select `Apply` followed by `Apply And Reboot`:
528+
* To update the PXE device state and name in BMC webui, set the value then select `Apply` followed by `Apply And Reboot`:
527529

528530
`BMC` -> `Configuration` -> `BIOS Settings` -> `Network Settings` -> `PXE Device1` -> `Enabled`
529531
`BMC` -> `Configuration` -> `BIOS Settings` -> `Network Settings` -> `PXE Device1 Settings` -> `Interface` -> `Embedded NIC 1 Port 1 Partition 1`
530532

531-
* To update the PXE device state and name with racadm perform the following:
533+
* To update the PXE device state and name with racadm run the following commands:
532534

533535
```bash
534536
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD set bios.NetworkSettings.PxeDev1EnDis Enabled
@@ -540,7 +542,7 @@ Expanding `result_detail` for a given category shows detailed results.
540542
### Device login check
541543

542544
* Device Login Check Considerations
543-
* The `device_login` check will fail if the iDRAC is not accessible or if the hardware validaton plugin is not able to log in.
545+
* The `device_login` check fails if the iDRAC isn't accessible or if the hardware validation plugin isn't able to sign-in.
544546

545547
```json
546548
{
@@ -558,7 +560,7 @@ Expanding `result_detail` for a given category shows detailed results.
558560
racadm -r $BMC_IP -u $BMC_USER -p $CURRENT_PASSWORD set iDRAC.Users.2.Password $BMC_PWD
559561
```
560562

561-
* To troubleshoot ping the iDRAC from a jumpserver with access to the BMC network. If iDRAC pings check that passwords match.
563+
* To troubleshoot, ping the iDRAC from a jumpbox with access to the BMC network. If iDRAC pings check that passwords match.
562564

563565
## Adding servers back into the Cluster after a repair
564566

0 commit comments

Comments
 (0)