Commit 4d78028

Update Hardware Validation Troubleshooting Document
1 parent a1dbbaa commit 4d78028

1 file changed

+111
-21
lines changed

articles/operator-nexus/troubleshoot-hardware-validation-failure.md

Lines changed: 111 additions & 21 deletions
Original file line number | Diff line number | Diff line change
@@ -83,6 +83,44 @@ Expanding `result_detail` for a given category shows detailed results.
8383
}
8484
```
8585

86+
* Serial Number Check Failure (Serial_Number)
87+
* The server's serial number is defined in the cluster.
88+
* A failed `Serial_Number` check indicates a mismatch between the serial number defined in the cluster and the actual serial number of the machine.
89+
90+
```json
91+
{
92+
"field_name": "Serial_Number",
93+
"comparison_result": "Fail",
94+
"expected": "1234567",
95+
"fetched": "7654321"
96+
}
97+
```
98+
99+
* iDRAC License Check
100+
* To enable the necessary functionality, every iDRAC requires a perpetual/production iDRAC9 datacenter or enterprise license.
101+
* Trial licenses are valid for only 30 days.
102+
* A failed `iDRAC License Check` indicates that the required iDRAC license is missing or is only a trial license.
103+
* The following examples show a failed iDRAC license check for a trial license and for a missing license, respectively.
104+
105+
```json
106+
{
107+
"field_name": "iDRAC License Check",
108+
"comparison_result": "Fail",
109+
"expected": "idrac9 x5 datacenter license or idrac9 x5 enterprise license - perpetual or production",
110+
"fetched": "iDRAC9 x5 Datacenter Trial License - Trial"
111+
}
112+
```
113+
114+
```json
115+
{
116+
"field_name": "iDRAC License Check",
117+
"comparison_result": "Fail",
118+
"expected": "idrac9 x5 datacenter license or idrac9 x5 enterprise license - perpetual or production",
119+
"fetched": ""
120+
}
121+
```
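
The two failure modes above can be told apart from the `fetched` value alone. The following Python snippet is a minimal sketch of that triage, not part of the hardware validation tooling. The function name `classify_license` and the substring test for "trial" are assumptions made for illustration.

```python
def classify_license(fetched):
    """Classify the `fetched` value of a failed iDRAC License Check.

    Assumption: a trial license can be recognized by the word "trial"
    in the fetched string, as in the example output above.
    """
    value = (fetched or "").strip()
    if not value:
        return "missing"
    if "trial" in value.lower():
        return "trial"
    return "other"


if __name__ == "__main__":
    # Values copied from the two example results above.
    print(classify_license("iDRAC9 x5 Datacenter Trial License - Trial"))
    print(classify_license(""))
```

In both cases the remedy is the same: install a perpetual/production datacenter or enterprise license before rerunning validation.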
122+
123+
86124
### Drive info category
87125

88126
* Disk Check Failure
@@ -177,22 +215,22 @@ Expanding `result_detail` for a given category shows detailed results.
177215
```json
178216
{
179217
"field_name": "NIC.Slot.3-1-1_SwitchConnectionID",
218+
"comparison_result": "Info",
180219
"expected": "unknown",
181-
"fetched": "c0:d6:82:23:0c:7d",
182-
"comparison_result": "Info"
220+
"fetched": "c0:d6:82:23:0c:7d"
183221
}
184222
```
185223

186224
```json
187225
{
188226
"field_name": "NIC.Slot.3-1-1_SwitchPortConnectionID",
227+
"comparison_result": "Info",
189228
"expected": "unknown",
190-
"fetched": "Ethernet10/1",
191-
"comparison_result": "Info"
229+
"fetched": "Ethernet10/1"
192230
}
193231
```
194232

195-
* Release 3.6 introduced cable checks for bonded interfaces.
233+
* Cabling Checks for Bonded Interfaces
196234
* Mismatched cabling is reported in the `result_log`.
197235
* The cable check validates that bonded NICs connect to switch ports with the same Port ID. In the following example, PCI 3/1 and 3/2 connect to "Ethernet1/1" and "Ethernet1/3" respectively on the TOR, triggering a failure for HWV.
198236

@@ -211,12 +249,38 @@ Expanding `result_detail` for a given category shows detailed results.
211249
}
212250
],
213251
"result_log": [
214-
"Cabling problem detected on PCI Slot 3"
252+
"Cabling problem detected on PCI Slot 3 - server NIC.Slot.3-1-1 connected to switch Ethernet1/1 - server NIC.Slot.3-2-1 connected to switch Ethernet1/3"
215253
]
216254
},
217255
}
218256
```
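
The rule described above (NICs in one bond must report the same switch Port ID) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the validation tool's implementation: the function name `find_cabling_mismatches` is hypothetical, and grouping NICs by the slot number in names such as `NIC.Slot.3-1-1` is an assumption about how bonded pairs are identified.

```python
from collections import defaultdict


def find_cabling_mismatches(port_ids):
    """Return slots whose NICs report different switch Port IDs.

    `port_ids` maps a NIC name (for example "NIC.Slot.3-1-1") to the
    switch port it is connected to (for example "Ethernet1/1").
    Assumption: NICs that share a slot number form one bond.
    """
    by_slot = defaultdict(dict)
    for nic, port in port_ids.items():
        # "NIC.Slot.3-1-1" -> "3-1-1" -> slot "3"
        slot = nic.rsplit(".", 1)[-1].split("-")[0]
        by_slot[slot][nic] = port
    return {
        slot: nics
        for slot, nics in by_slot.items()
        if len(set(nics.values())) > 1
    }


if __name__ == "__main__":
    # Values taken from the result_log example above.
    print(find_cabling_mismatches({
        "NIC.Slot.3-1-1": "Ethernet1/1",
        "NIC.Slot.3-2-1": "Ethernet1/3",
    }))
```

An empty result means every slot's NICs agree on the Port ID; any slot that is returned corresponds to a "Cabling problem detected" entry.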
219257

258+
* iDRAC (BMC) MAC Address Check Failure
259+
* The iDRAC MAC address is defined in the cluster for each BMM.
260+
* A failed `iDRAC_MAC` check indicates a mismatch between the iDRAC/BMC MAC in the cluster and the actual MAC address retrieved from the machine.
261+
262+
```json
263+
{
264+
"field_name": "iDRAC_MAC",
265+
"comparison_result": "Fail",
266+
"expected": "aa:bb:cc:dd:ee:ff",
267+
"fetched": "aa:bb:cc:dd:ee:gg"
268+
}
269+
```
270+
271+
* PXE MAC Address Check Failure
272+
* The PXE MAC address is defined in the cluster for each BMM.
273+
* A failed `PXE_MAC` check indicates a mismatch between the PXE MAC in the cluster and the actual MAC address retrieved from the machine.
274+
275+
```json
276+
{
277+
"field_name": "NIC.Embedded.1-1_PXE_MAC",
278+
"comparison_result": "Fail",
279+
"expected": "aa:bb:cc:dd:ee:ff",
280+
"fetched": "aa:bb:cc:dd:ee:gg"
281+
}
282+
```
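
When comparing the cluster definition with a value read from the machine, it can help to normalize both MAC addresses first so that case or separator differences are not mistaken for a real mismatch. The Python sketch below illustrates one way to do that. The helper names `normalize_mac` and `macs_match` are hypothetical, and whether the validation tool itself normalizes values is not stated here. Note that the placeholder `aa:bb:cc:dd:ee:gg` in the examples above is not valid hexadecimal, so the sketch reports it as malformed.

```python
import re

# Six colon-separated pairs of lowercase hexadecimal digits.
_MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def normalize_mac(value):
    """Return the MAC in lowercase colon-separated form, or None if malformed."""
    mac = value.strip().lower().replace("-", ":")
    return mac if _MAC_RE.match(mac) else None


def macs_match(expected, fetched):
    """True only when both values are well-formed and equal after normalization."""
    normalized_expected = normalize_mac(expected)
    normalized_fetched = normalize_mac(fetched)
    return normalized_expected is not None and normalized_expected == normalized_fetched


if __name__ == "__main__":
    print(macs_match("AA-BB-CC-DD-EE-FF", "aa:bb:cc:dd:ee:ff"))
    print(normalize_mac("aa:bb:cc:dd:ee:gg"))
```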
283+
220284
### Health info category
221285

222286
* Health Check Sensor Failure
@@ -260,9 +324,9 @@ Expanding `result_detail` for a given category shows detailed results.
260324
```json
261325
{
262326
"field_name": "LCLog_Critical_Alarms",
327+
"comparison_result": "Fail",
263328
"expected": "No Critical Errors",
264-
"fetched": "53539 2023-07-22T23:44:06-05:00 The system board BP1 PG voltage is outside of range.",
265-
"comparison_result": "Fail"
329+
"fetched": "53539 2023-07-22T23:44:06-05:00 The system board BP1 PG voltage is outside of range."
266330
}
267331
```
268332

@@ -274,9 +338,9 @@ Expanding `result_detail` for a given category shows detailed results.
274338
```json
275339
{
276340
"field_name": "Server Control Actions",
341+
"comparison_result": "Fail",
277342
"expected": "Success",
278-
"fetched": "Failed",
279-
"comparison_result": "Fail"
343+
"fetched": "Failed"
280344
}
281345
```
282346

@@ -294,33 +358,45 @@ Expanding `result_detail` for a given category shows detailed results.
294358
```json
295359
{
296360
"field_name": "Power Supply 1",
361+
"comparison_result": "Warning",
297362
"expected": "Enabled-OK",
298-
"fetched": "UnavailableOffline-Critical",
299-
"comparison_result": "Warning"
363+
"fetched": "UnavailableOffline-Critical"
300364
}
301365
```
302366

303367
```json
304368
{
305369
"field_name": "System Board PS Redundancy",
370+
"comparison_result": "Warning",
306371
"expected": "Enabled-OK",
307-
"fetched": "Enabled-Critical",
308-
"comparison_result": "Warning"
372+
"fetched": "Enabled-Critical"
309373
}
310374
```
311375

312376
### Boot info category
313377

314-
* Boot Device Check Considerations
378+
* Boot Device Name Check Considerations
315379
* The `boot_device_name` check is currently informational.
316380
* Mismatched boot device name shouldn't trigger a device failure.
317381

318382
```json
319383
{
384+
"field_name": "boot_device_name",
320385
"comparison_result": "Info",
321386
"expected": "NIC.PxeDevice.1-1",
322-
"fetched": "NIC.PxeDevice.1-1",
323-
"field_name": "boot_device_name"
387+
"fetched": "NIC.PxeDevice.1-1"
388+
}
389+
```
390+
391+
* Boot Device State Check Considerations
392+
* A failed `boot_device_state` check indicates that the boot device is in a disabled state.
393+
394+
```json
395+
{
396+
"field_name": "boot_device_state",
397+
"comparison_result": "Fail",
398+
"expected": "enabled",
399+
"fetched": "disabled"
324400
}
325401
```
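
Because results carry different `comparison_result` values (`Info`, `Warning`, `Fail`), grouping them is a quick way to see which checks need attention. The following Python snippet is a minimal sketch that assumes the detailed results are available as a list of objects shaped like the examples in this article. The function name `group_by_result` is hypothetical.

```python
from collections import defaultdict


def group_by_result(entries):
    """Group field names by their comparison_result value."""
    groups = defaultdict(list)
    for entry in entries:
        result = entry.get("comparison_result", "Unknown")
        groups[result].append(entry["field_name"])
    return dict(groups)


if __name__ == "__main__":
    # Entries shaped like the examples in this article.
    sample = [
        {"field_name": "boot_device_name", "comparison_result": "Info",
         "expected": "NIC.PxeDevice.1-1", "fetched": "NIC.PxeDevice.1-1"},
        {"field_name": "boot_device_state", "comparison_result": "Fail",
         "expected": "enabled", "fetched": "disabled"},
        {"field_name": "Power Supply 1", "comparison_result": "Warning",
         "expected": "Enabled-OK", "fetched": "UnavailableOffline-Critical"},
    ]
    for result, fields in group_by_result(sample).items():
        print(result, fields)
```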
326402

@@ -332,21 +408,35 @@ Expanding `result_detail` for a given category shows detailed results.
332408
```json
333409
{
334410
"field_name": "pxe_device_1_name",
411+
"comparison_result": "Fail",
335412
"expected": "NIC.Embedded.1-1-1",
336-
"fetched": "NIC.Embedded.1-2-1",
337-
"comparison_result": "Fail"
413+
"fetched": "NIC.Embedded.1-2-1"
338414
}
339415
```
340416

341417
```json
342418
{
343419
"field_name": "pxe_device_1_state",
420+
"comparison_result": "Fail",
344421
"expected": "Enabled",
345-
"fetched": "Disabled",
346-
"comparison_result": "Fail"
422+
"fetched": "Disabled"
347423
}
348424
```
349425

426+
* To update the PXE device state and name in the BMC web UI, set the values at the following locations, then select `Apply` followed by `Apply And Reboot`:
427+
428+
`BMC` -> `Configuration` -> `BIOS Settings` -> `Network Settings` -> `PXE Device1` -> `Enabled`
429+
`BMC` -> `Configuration` -> `BIOS Settings` -> `Network Settings` -> `PXE Device1 Settings` -> `Interface` -> `Embedded NIC 1 Port 1 Partition 1`
430+
431+
* To update the PXE device name and state with `racadm`, run the following commands:
432+
433+
```bash
434+
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD set bios.NetworkSettings.PxeDev1EnDis Enabled
435+
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD set bios.PxeDev1Settings.PxeDev1Interface NIC.Embedded.1-1-1
436+
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD jobqueue create BIOS.Setup.1-1
437+
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD serveraction powercycle
438+
```
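
When the same fix has to be applied to several machines, the four commands above can be generated from a small helper instead of being typed by hand. The Python sketch below only builds and prints the argument lists and does not run `racadm`. The function name `build_pxe_fix_commands` and the sample address and credentials are placeholders chosen for illustration; the subcommands and attribute names are copied from the commands above.

```python
import shlex


def build_pxe_fix_commands(ip, user, password, interface="NIC.Embedded.1-1-1"):
    """Build the racadm argument lists for the PXE device fix shown above."""
    base = ["racadm", "--nocertwarn", "-r", ip, "-u", user, "-p", password]
    steps = [
        ["set", "bios.NetworkSettings.PxeDev1EnDis", "Enabled"],
        ["set", "bios.PxeDev1Settings.PxeDev1Interface", interface],
        ["jobqueue", "create", "BIOS.Setup.1-1"],
        ["serveraction", "powercycle"],
    ]
    return [base + step for step in steps]


if __name__ == "__main__":
    # Placeholder address and credentials; replace with real values.
    for argv in build_pxe_fix_commands("192.0.2.10", "bmc-user", "example-password"):
        print(shlex.join(argv))
```

Keeping the commands as argument lists avoids shell quoting problems if a password contains special characters.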
439+
350440
## Adding servers back into the Cluster after a repair
351441

352442
After the hardware is fixed, run BMM Replace by following the instructions in [BMM actions](howto-baremetal-functions.md).
