You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/operator-nexus/troubleshoot-hardware-validation-failure.md
+31-29Lines changed: 31 additions & 29 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -11,14 +11,16 @@ ms.author: vanjanikolin
11
11
12
12
# Troubleshoot hardware validation failure in Nexus Cluster
13
13
14
-
This article describes how to troubleshoot a failed server hardware validation. Hardware validation (HWV) is run as part of cluster deploy action as well as a bare metal replace action. HWV validates a bare metal machine (BMM) by executing test cases against the baseboard management controller (BMC). The Azure Operator Nexus platform is deployed on Dell servers. Dell servers use the integrated Dell remote access controller (iDRAC) which is the equivalent of a BMC.
14
+
This article describes how to troubleshoot a failed server hardware validation. Hardware validation (HWV) is run as part of cluster deploy action and a bare metal replace action. HWV validates a bare metal machine (BMM) by executing test cases against the baseboard management controller (BMC). The Azure Operator Nexus platform is deployed on Dell servers. Dell servers use the integrated Dell remote access controller (iDRAC) which is the equivalent of a BMC.
15
15
16
16
## Prerequisites
17
17
18
-
- Gather the following information:
18
+
1. Collect the following information:
19
19
- Subscription ID
20
-
- Cluster name and resource group
21
-
- The user needs access to the Cluster's Log Analytics Workspace (LAW)
20
+
- Cluster name
21
+
- Resource group
22
+
2. Request access to the Cluster's Log Analytics Workspace (LAW).
23
+
3. Access to BMC webui and/or jumpbox that allows running of racadm utility.
22
24
23
25
## Locating hardware validation results
24
26
@@ -46,7 +48,7 @@ Expanding `result_detail` for a given category shows detailed results.
46
48
### System info category
47
49
48
50
* Memory/RAM Related Failure (memory_capacity_GB)
49
-
* Memory specs are defined in the SKU. Memory below threshold value indicates missing or failed DIMM(s). Failed DIMM(s) would also be reflected in the `health_info` category. The following example shows a failed memory check.
51
+
* Memory specs are defined in the SKU. Memory below threshold value indicates missing or failed Dual In-Line Memory Module (DIMM). A failed DIMM would also be reflected in the `health_info` category. The following example shows a failed memory check.
50
52
51
53
```json
52
54
{
@@ -115,7 +117,7 @@ Expanding `result_detail` for a given category shows detailed results.
* To troubleshoot this problem ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
120
+
* To troubleshoot this problem, ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
119
121
120
122
* Serial Number Check Failure (Serial_Number)
121
123
* The server's serial number, also referred as the service tag, is defined in the cluster. Failed `Serial_Number` check indicates a mismatch between the serial number in the cluster and the actual serial number of the machine. The following example shows a failed serial number check.
@@ -139,10 +141,10 @@ Expanding `result_detail` for a given category shows detailed results.
* To troubleshoot this problem ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
144
+
* To troubleshoot this problem, ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
143
145
144
146
* iDRAC License Check Failure
145
-
* To enable necessary functionality all iDRACs require a perpetual/production iDRAC9 datacenter or enterprise license. Trial licenses are valid for only 30 days. A failed `iDRAC License Check` indicates that the required iDRAC license is missing. The following examples show a failed iDRAC license check for a trial license and missing license respectively.
147
+
* All iDRACs require a perpetual/production iDRAC datacenter or enterprise license. Trial licenses are valid for only 30 days. A failed `iDRAC License Check` indicates that the required iDRAC license is missing. The following examples show a failed iDRAC license check for a trial license and missing license respectively.
146
148
147
149
```json
148
150
{
@@ -170,7 +172,7 @@ Expanding `result_detail` for a given category shows detailed results.
170
172
### Drive info category
171
173
172
174
* Disk Check Failure
173
-
* Drive specs are defined in the SKU. Mismatched capacity values indicate incorrect drives or drives inserted in to incorrect slots. Missing capacity and type fetched values indicate drives that are failed, missing or inserted in to incorrect slots.
175
+
* Drive specs are defined in the SKU. Mismatched capacity values indicate incorrect drives or drives inserted in to incorrect slots. Missing capacity and type fetched values indicate drives that are failed, missing, or inserted in to incorrect slots.
174
176
175
177
```json
176
178
{
@@ -209,12 +211,12 @@ Expanding `result_detail` for a given category shows detailed results.
* To troubleshoot ensure that disks are inserted int he correct slots. If the problem persists engage vendor.
214
+
* To troubleshoot, ensure that disks are inserted in the correct slots. If the problem persists engage vendor.
213
215
214
216
### Network info category
215
217
216
-
* NIC Check Failure
217
-
* Dell server NIC specs are defined in the SKU. A mismatched link status indicates loose or faulty cabling or crossed cables. A mismatched model indicates incorrect NIC card is inserted in to slot. Missing link/model fetched values indicate NICs that are failed, missing or inserted in to incorrect slots.
218
+
* Network Interface Card (NIC) Check Failure
219
+
* Dell server NIC specs are defined in the SKU. A mismatched link status indicates loose or faulty cabling or crossed cables. A mismatched model indicates incorrect NIC card is inserted in to slot. Missing link/model fetched values indicate NICs that are failed, missing, or inserted in to incorrect slots.
218
220
219
221
```json
220
222
{
@@ -271,16 +273,16 @@ Expanding `result_detail` for a given category shows detailed results.
271
273
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD hwinventory NIC
272
274
```
273
275
274
-
* To check a specific NIC with racadm provide the FQDD:
276
+
* To check a specific NIC with racadm provide the Fully Qualified Device Descriptor (FQDD):
* To troubleshoot ensure that servers are cabled correctly and that ports are linked up. Bounce port on the fabric. Perform flea drain. If the problem persists engage vendor.
282
+
* To troubleshoot, ensure that servers are cabled correctly and that ports are linked up. Bounce port on the fabric. Perform flea drain. If the problem persists engage vendor.
281
283
282
284
* NIC Check L2 Switch Information
283
-
* HW Validation reports L2 switch information for each of the server interfaces. The switch connection ID (switch interface MAC) and switch port connection ID (switch interface label) are informational.
285
+
* HWV reports L2 switch information for each of the server interfaces. The switch connection ID (switch interface MAC) and switch port connection ID (switch interface label) are informational.
284
286
285
287
```json
286
288
{
@@ -301,7 +303,7 @@ Expanding `result_detail` for a given category shows detailed results.
301
303
```
302
304
303
305
* Cabling Checks for Bonded Interfaces
304
-
* Mismatched cabling is reported in the result_log. Cable check validates that that bonded NICs connect to switch ports with same Port ID. In the following example PCI 3/1 and 3/2 connect to "Ethernet1/1" and "Ethernet1/3" respectively on TOR, triggering a failure for HWV.
306
+
* Mismatched cabling is reported in the result_log. Cable check validates that that bonded NICs connect to switch ports with same Port ID. In the following example Peripheral Component Interconnect (PCI) 3/1 and 3/2 connect to "Ethernet1/1" and "Ethernet1/3" respectively on TOR, triggering a failure for HWV.
305
307
306
308
```json
307
309
{
@@ -338,9 +340,9 @@ Expanding `result_detail` for a given category shows detailed results.
338
340
}
339
341
```
340
342
341
-
* To troubleshoot this problem ensure that correct MAC address is defined in the cluster. If MAC is correct attempt a flea drain. If problem persists ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
343
+
* To troubleshoot this problem, ensure that correct MAC address is defined in the cluster. If MAC is correct attempt a flea drain. If problem persists ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
342
344
343
-
* PXE MAC Address Check Failure
345
+
* Preboot execution environment (PXE) MAC Address Check Failure
344
346
* The PXE MAC address is defined in the cluster for each BMM. A failed `PXE_MAC` check indicates a mismatch between the PXE MAC in the cluster and the actual MAC address retrieved from the machine.
345
347
346
348
```json
@@ -352,12 +354,12 @@ Expanding `result_detail` for a given category shows detailed results.
352
354
}
353
355
```
354
356
355
-
* To troubleshoot this problem ensure that correct MAC address is defined in the cluster. If MAC is correct attempt a flea drain. If problem persists ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
357
+
* To troubleshoot this problem, ensure that correct MAC address is defined in the cluster. If MAC is correct attempt a flea drain. If problem persists ensure that server is racked in the correct location, cabled accordingly, and that the correct IP is assigned.
356
358
357
359
### Health info category
358
360
359
361
* Health Check Sensor Failure
360
-
* Server health checks cover various hardware component sensors. A failed health sensor indicates a problem with the corresponding hardware component. The following examples indicate fan, drive and CPU failures respectively.
362
+
* Server health checks cover various hardware component sensors. A failed health sensor indicates a problem with the corresponding hardware component. The following examples indicate fan, drive, and CPU failures respectively.
361
363
362
364
```json
363
365
{
@@ -398,7 +400,7 @@ Expanding `result_detail` for a given category shows detailed results.
398
400
399
401
* To troubleshoot a server health failure engage vendor.
400
402
401
-
* Health Check Lifecycle Log (LC Log) Failures
403
+
* Health Check Lifecycle (LC) Log Failures
402
404
* Dell server health checks fail for recent Critical LC Log Alarms. The hardware validation plugin logs the alarm ID, name, and timestamp. Recent LC Log critical alarms indicate need for further investigation. The following example shows a failure for a critical backplane voltage alarm.
403
405
404
406
```json
@@ -420,7 +422,7 @@ Expanding `result_detail` for a given category shows detailed results.
* If `Backplane Comm` critical errors are encountered attempt a flea drain. To troubleshoot any other LC log critical failures engage vendor.
425
+
* If `Backplane Comm` critical errors are logged, perform flea drain. Engage vendor to troubleshoot any other LC log critical failures.
424
426
425
427
* Health Check Server Power Action Failures
426
428
* Dell server health checks fail for failed server power-up or failed iDRAC reset. A failed server control action indicates an underlying hardware issue. The following example shows failed power on attempt.
@@ -450,10 +452,10 @@ Expanding `result_detail` for a given category shows detailed results.
* To troubleshoot server poweron failure attempt a flea drain. If problem persists engage vendor.
455
+
* To troubleshoot server power-on failure attempt a flea drain. If problem persists engage vendor.
454
456
455
457
* Health Check Power Supply Failure and Redundancy Considerations
456
-
* Dell server health checks warn when one power supply is missing or failed. Power supply "field_name" might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies respectively. A failure of one power supply doesn't trigger an HW validation device failure.
458
+
* Dell server health checks warn when one power supply is missing or failed. Power supply "field_name" might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies respectively. A failure of one power supply doesn't trigger an HWV device failure.
457
459
458
460
```json
459
461
{
@@ -483,7 +485,7 @@ Expanding `result_detail` for a given category shows detailed results.
* To troubleshoot attempt a power supply reseat. If problem persists engage vendor.
488
+
* Reseating the power supply might fix the problem. If alarms persist engage vendor.
487
489
488
490
### Boot info category
489
491
@@ -523,12 +525,12 @@ Expanding `result_detail` for a given category shows detailed results.
523
525
}
524
526
```
525
527
526
-
* To update the PXE device state and name in BMC webui set the value in the following location below then select `Apply` followed by `Apply And Reboot`:
528
+
* To update the PXE device state and name in BMC webui, set the value then select `Apply` followed by `Apply And Reboot`:
0 commit comments