Commit f9a61da

Merge pull request #264380 from vnikolin/vnikolin-hwv
Troubleshooting Hardware Validation Failures for Azure Operator Nexus
2 parents ca10ad9 + e868af1 commit f9a61da

3 files changed: +357 -0 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -163,6 +163,8 @@
   href: troubleshoot-bmm-node-reboot.md
 - name: Troubleshoot Resolve CSN storage pod stuck in ContainerCreating
   href: troubleshoot-csn-storage-pod-container-stuck-in-creating.md
+- name: Troubleshoot Hardware Validation Failure
+  href: troubleshoot-hardware-validation-failure.md
 - name: Enable node down cleaner
   href: troubleshoot-enable-node-down-cleaner.md
 - name: BareMetal Actions
[Binary image file added: 179 KB]
articles/operator-nexus/troubleshoot-hardware-validation-failure.md

Lines changed: 355 additions & 0 deletions
@@ -0,0 +1,355 @@
---
title: Azure Operator Nexus troubleshooting hardware validation failure
description: Troubleshoot Hardware Validation Failure for Azure Operator Nexus.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 01/26/2024
author: vnikolin
ms.author: vanjanikolin
---

# Troubleshoot hardware validation failure in Nexus Cluster

This article describes how to troubleshoot a failed server hardware validation. Hardware validation is run as part of the cluster deploy action.

## Prerequisites

- Gather the following information:
  - Subscription ID
  - Cluster name and resource group
- The user needs access to the Cluster's Log Analytics Workspace (LAW)

## Locating hardware validation results

1. Navigate to the cluster resource group in the subscription
2. Expand the Log Analytics Workspace (LAW) resource for the cluster
3. Navigate to the Logs tab
4. Fetch hardware validation results with a query against the `HWVal_CL` table, as in the following example

:::image type="content" source="media/hardware-validation-cluster-law.png" alt-text="Screenshot of cluster LAW custom table query." lightbox="media/hardware-validation-cluster-law.png":::
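
As a reference point (not a reproduction of the query in the screenshot), a minimal Kusto sketch against that table might look like the following. Only the `HWVal_CL` table name comes from this article; `TimeGenerated` is assumed to be present as the standard Log Analytics timestamp column.

```kusto
// Pull hardware validation records from the last seven days, newest first.
// HWVal_CL is the custom table named above; TimeGenerated is the standard
// Log Analytics timestamp column (assumed, not documented in this article).
HWVal_CL
| where TimeGenerated > ago(7d)
| order by TimeGenerated desc
```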

## Examining hardware validation results

The Hardware Validation result for a given server includes the following categories.

- system_info
- drive_info
- network_info
- health_info
- boot_info

Expanding `result_detail` for a given category shows detailed results.
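
As a non-authoritative sketch, failed checks can also be narrowed down directly in the query. Because this article doesn't document the table's column names, the untyped `search` operator is used here instead of referencing a specific column.

```kusto
// Surface recent records whose payload mentions a failed comparison result.
// The search operator scans all columns, so no column names are assumed.
HWVal_CL
| where TimeGenerated > ago(7d)
| search "comparison_result" and "Fail"
| order by TimeGenerated desc
```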

## Troubleshooting specific failures

### System info category

* Memory/RAM related failure (memory_capacity_GB)
  * Memory specs are defined in the SKU.
  * Memory below the threshold value indicates missing or failed DIMM(s). Failed DIMM(s) would also be reflected in the `health_info` category.

```json
{
    "field_name": "memory_capacity_GB",
    "comparison_result": "Fail",
    "expected": "512",
    "fetched": "480"
}
```

* CPU Related Failure (cpu_sockets)
  * CPU specs are defined in the SKU.
  * A failed `cpu_sockets` check indicates a failed CPU or a CPU count mismatch.

```json
{
    "field_name": "cpu_sockets",
    "comparison_result": "Fail",
    "expected": "2",
    "fetched": "1"
}
```

* Model Check Failure (Model)
  * A failed `Model` check indicates that the wrong server is racked in the slot or that there's a cabling mismatch.

```json
{
    "field_name": "Model",
    "comparison_result": "Fail",
    "expected": "R750",
    "fetched": "R650"
}
```

### Drive info category

* Disk Check Failure
  * Drive specs are defined in the SKU.
  * Mismatched capacity values indicate incorrect drives or drives inserted into incorrect slots.
  * Missing capacity and type fetched values indicate drives that are failed, missing, or inserted into incorrect slots.

```json
{
    "field_name": "Disk_0_Capacity_GB",
    "comparison_result": "Fail",
    "expected": "893",
    "fetched": "3576"
}
```

```json
{
    "field_name": "Disk_0_Capacity_GB",
    "comparison_result": "Fail",
    "expected": "893",
    "fetched": ""
}
```

```json
{
    "field_name": "Disk_0_Type",
    "comparison_result": "Fail",
    "expected": "SSD",
    "fetched": ""
}
```

### Network info category

* NIC Check Failure
  * Dell server NIC specs are defined in the SKU.
  * Mismatched link status indicates loose or faulty cabling or crossed cables.
  * Mismatched model indicates that an incorrect NIC card is inserted into the slot.
  * Missing link/model fetched values indicate NICs that are failed, missing, or inserted into incorrect slots.

```json
{
    "field_name": "NIC.Slot.3-1-1_LinkStatus",
    "comparison_result": "Fail",
    "expected": "Up",
    "fetched": "Down"
}
```

```json
{
    "field_name": "NIC.Embedded.2-1-1_LinkStatus",
    "comparison_result": "Fail",
    "expected": "Down",
    "fetched": "Up"
}
```

```json
{
    "field_name": "NIC.Slot.3-1-1_Model",
    "comparison_result": "Fail",
    "expected": "ConnectX-6",
    "fetched": "BCM5720"
}
```

```json
{
    "field_name": "NIC.Slot.3-1-1_LinkStatus",
    "comparison_result": "Fail",
    "expected": "Up",
    "fetched": ""
}
```

```json
{
    "field_name": "NIC.Slot.3-1-1_Model",
    "comparison_result": "Fail",
    "expected": "ConnectX-6",
    "fetched": ""
}
```

* NIC Check L2 Switch Information
  * HW Validation reports L2 switch information for each of the server interfaces.
  * The switch connection ID (switch interface MAC) and switch port connection ID (switch interface label) are informational.

```json
{
    "field_name": "NIC.Slot.3-1-1_SwitchConnectionID",
    "expected": "unknown",
    "fetched": "c0:d6:82:23:0c:7d",
    "comparison_result": "Info"
}
```

```json
{
    "field_name": "NIC.Slot.3-1-1_SwitchPortConnectionID",
    "expected": "unknown",
    "fetched": "Ethernet10/1",
    "comparison_result": "Info"
}
```

* Release 3.6 introduced cable checks for bonded interfaces.
* Mismatched cabling is reported in the `result_log`.
* The cable check validates that bonded NICs connect to switch ports with the same port ID. In the following example, PCI 3/1 and 3/2 connect to "Ethernet1/1" and "Ethernet1/3" respectively on the TOR, triggering a failure for HWV.

```json
{
    "network_info": {
        "network_info_result": "Fail",
        "result_detail": [
            {
                "field_name": "NIC.Slot.3-1-1_SwitchPortConnectionID",
                "fetched": "Ethernet1/1"
            },
            {
                "field_name": "NIC.Slot.3-2-1_SwitchPortConnectionID",
                "fetched": "Ethernet1/3"
            }
        ],
        "result_log": [
            "Cabling problem detected on PCI Slot 3"
        ]
    }
}
```

### Health info category

* Health Check Sensor Failure
  * Server health checks cover various hardware component sensors.
  * A failed health sensor indicates a problem with the corresponding hardware component.
  * The following examples indicate fan, drive, and CPU failures, respectively.

```json
{
    "field_name": "System Board Fan1A",
    "comparison_result": "Fail",
    "expected": "Enabled-OK",
    "fetched": "Enabled-Critical"
}
```

```json
{
    "field_name": "Solid State Disk 0:1:1",
    "comparison_result": "Fail",
    "expected": "Enabled-OK",
    "fetched": "Enabled-Critical"
}
```

```json
{
    "field_name": "CPU.Socket.1",
    "comparison_result": "Fail",
    "expected": "Enabled-OK",
    "fetched": "Enabled-Critical"
}
```

* Health Check Lifecycle Log (LC Log) Failures
  * Dell server health checks fail for recent critical LC Log alarms.
  * The hardware validation plugin logs the alarm ID, name, and timestamp.
  * Recent critical LC Log alarms indicate a need for further investigation.
  * The following example shows a failure for a critical backplane voltage alarm.

```json
{
    "field_name": "LCLog_Critical_Alarms",
    "expected": "No Critical Errors",
    "fetched": "53539 2023-07-22T23:44:06-05:00 The system board BP1 PG voltage is outside of range.",
    "comparison_result": "Fail"
}
```

* Health Check Server Power Action Failures
  * Dell server health checks fail for a failed server power-up or a failed iDRAC reset.
  * A failed server control action indicates an underlying hardware issue.
  * The following example shows a failed power-on attempt.

```json
{
    "field_name": "Server Control Actions",
    "expected": "Success",
    "fetched": "Failed",
    "comparison_result": "Fail"
}
```

```json
"result_log": [
    "Server power up failed with: server OS is powered off after successful power on attempt"
]
```

* Health Check Power Supply Failure and Redundancy Considerations
  * Dell server health checks warn when one power supply is missing or failed.
  * The power supply `field_name` might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies, respectively.
  * A failure of one power supply doesn't trigger an HW validation device failure.

```json
{
    "field_name": "Power Supply 1",
    "expected": "Enabled-OK",
    "fetched": "UnavailableOffline-Critical",
    "comparison_result": "Warning"
}
```

```json
{
    "field_name": "System Board PS Redundancy",
    "expected": "Enabled-OK",
    "fetched": "Enabled-Critical",
    "comparison_result": "Warning"
}
```

### Boot info category

* Boot Device Check Considerations
  * The `boot_device_name` check is currently informational.
  * Mismatched boot device name shouldn't trigger a device failure.

```json
{
    "comparison_result": "Info",
    "expected": "NIC.PxeDevice.1-1",
    "fetched": "NIC.PxeDevice.1-1",
    "field_name": "boot_device_name"
}
```

* PXE Device Check Considerations
  * This check validates the PXE device settings.
  * Failed `pxe_device_1_name` or `pxe_device_1_state` checks indicate a problem with the PXE configuration.
  * Failed settings need to be fixed to enable system boot during deployment.

```json
{
    "field_name": "pxe_device_1_name",
    "expected": "NIC.Embedded.1-1-1",
    "fetched": "NIC.Embedded.1-2-1",
    "comparison_result": "Fail"
}
```

```json
{
    "field_name": "pxe_device_1_state",
    "expected": "Enabled",
    "fetched": "Disabled",
    "comparison_result": "Fail"
}
```

## Adding servers back into the Cluster after a repair

After the hardware is fixed, run the BMM replace action by following the instructions in [BMM actions](howto-baremetal-functions.md).
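
Once the replace action completes and the server rejoins the Cluster, the hardware validation records in LAW can be rechecked. The following query is a minimal, non-authoritative sketch: only the `HWVal_CL` table name comes from this article, and `TimeGenerated` is assumed to be the standard Log Analytics timestamp column.

```kusto
// Recheck the newest hardware validation records after the repair; a healthy
// server should no longer surface "Fail" results in the returned rows.
// Only HWVal_CL is documented in this article; TimeGenerated is an assumption.
HWVal_CL
| where TimeGenerated > ago(1d)
| order by TimeGenerated desc
| take 10
```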