Commit de76526

Add Additional Troubleshooting Steps

1 parent 8d6108b

1 file changed: +55 -0

articles/operator-nexus/troubleshoot-hardware-validation-failure.md

@@ -168,6 +168,32 @@ Expanding `result_detail` for a given category shows detailed results.
`BMC` -> `Configuration` -> `Licenses`

* Firmware Version Checks
  * Firmware version checks were introduced in release 3.9. The following example shows the expected log for release versions earlier than 3.9.

```json
{
  "system_info": {
    "system_info_result": "Pass",
    "result_log": [
      "Firmware validation not supported in release 3.8"
    ]
  }
}
```

  * Firmware versions are determined from the `cluster version` value in the cluster object. The following example shows a failed check due to an indeterminate cluster version. If this problem occurs, verify the version in the cluster object (see the sketch after the example).

```json
{
  "system_info": {
    "system_info_result": "Fail",
    "result_log": [
      "Unable to determine firmware release"
    ]
  }
}
```
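
As a minimal sketch, the cluster version can be read with the Azure CLI `networkcloud` extension; the resource group and cluster names below are placeholders.

```bash
# Sketch: read the cluster version from the cluster object.
# Assumes the networkcloud CLI extension is installed (az extension add --name networkcloud).
# $RESOURCE_GROUP and $CLUSTER_NAME are placeholders for your environment.
az networkcloud cluster show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --query "clusterVersion" --output tsv
# If the field is nested in your CLI version, try --query "properties.clusterVersion".
```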

### Drive info category

@@ -412,6 +438,17 @@ Expanding `result_detail` for a given category shows detailed results.
}
```

* Virtual disk errors typically indicate a RAID cleanup false positive and are logged because of the timing of RAID cleanup and system power-off before hardware validation (HWV). The following example shows an LC log critical error on virtual disk 238. If multiple errors block deployment, delete the cluster, wait two hours, then reattempt cluster deployment. If the failures aren't deployment blocking, wait two hours, then run a BMM replace.

```json
{
  "field_name": "LCLog_Critical_Alarms",
  "comparison_result": "Fail",
  "expected": "No Critical Errors",
  "fetched": "104473 2024-07-26T16:05:19-05:00 Virtual Disk 238 on RAID Controller in SL 3 has failed."
}
```

* To check LC logs in BMC webui:

`BMC` -> `Maintenance` -> `Lifecycle Log`
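
As a sketch, the same log can also be pulled remotely with racadm; `$IP`, `$BMC_USR`, and `$BMC_PWD` are placeholders for the iDRAC address and credentials.

```bash
# Sketch: view the Lifecycle Controller (LC) log remotely with racadm.
# $IP, $BMC_USR, and $BMC_PWD are placeholders.
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD lclog view
```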
@@ -562,6 +599,24 @@ Expanding `result_detail` for a given category shows detailed results.

* To troubleshoot, ping the iDRAC from a jumpbox with access to the BMC network. If the iDRAC pings, check that the passwords match (see the sketch below).
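
A minimal sketch of both checks, assuming racadm is available on the jumpbox; `$IP`, `$BMC_USR`, and `$BMC_PWD` are placeholders:

```bash
# Sketch: confirm the iDRAC is reachable, then confirm the credentials are accepted.
ping -c 4 $IP
# getsysinfo succeeds only when the username/password pair is accepted,
# so an authentication error here points to a password mismatch.
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD getsysinfo
```
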
### Special considerations

* Servers Failing Multiple Health and Network Checks
  * RAID deletion is performed during cluster deploy and cluster delete actions for all releases, including 3.12.
  * If servers power off during hardware validation with multiple failed health and network checks, reattempt cluster deployment.
  * If issues persist, perform RAID deletion manually on the `control` nodes in the cluster.

* To clear RAID in BMC webui:

`BMC` -> `Storage` -> `Virtual Disks` -> `Action` -> `Delete` -> `Apply Now`

* To clear RAID with racadm:

```bash
# Delete the failed virtual disk (example: virtual disk 239 on the RAID controller in slot 3).
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD raid deletevd:Disk.Virtual.239:RAID.SL.3-1
# Queue a real-time job against the controller so the deletion is applied immediately.
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD jobqueue create RAID.SL.3-1 --realtime
```
565620
## Adding servers back into the Cluster after a repair
566621

567622
After the hardware is fixed, run a BMM replace following the instructions in [BMM actions](howto-baremetal-functions.md), sketched below.
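
As a minimal sketch, the replace is issued through the Azure CLI `networkcloud` extension; every value below is a placeholder, and the linked page remains the authoritative procedure.

```bash
# Sketch: replace a bare metal machine after a hardware repair.
# All values are placeholders; follow the BMM actions page for the full procedure.
az networkcloud baremetalmachine replace \
  --name "$BMM_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --bmc-credentials username="$BMC_USR" password="$BMC_PWD" \
  --bmc-mac-address "$BMC_MAC" \
  --boot-mac-address "$BOOT_MAC" \
  --machine-name "$MACHINE_NAME" \
  --serial-number "$SERIAL"
```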
