---
title: Azure Operator Nexus troubleshooting hardware validation failure
description: Troubleshoot Hardware Validation Failure for Azure Operator Nexus.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 01/26/2024
author: vnikolin
ms.author: vanjanikolin
---

# Troubleshoot hardware validation failure in Nexus Cluster

This article describes how to troubleshoot a failed server hardware validation. Hardware validation runs as part of the cluster deploy action.

## Prerequisites

- Gather the following information:
  - Subscription ID
  - Cluster name and resource group
- The user needs access to the Cluster's Log Analytics Workspace (LAW)

## Locating hardware validation results

1. Navigate to the cluster resource group in the subscription
2. Expand the Log Analytics Workspace (LAW) resource for the cluster
3. Navigate to the Logs tab
4. Fetch hardware validation results with a query against the HWVal_CL table, as in the following example and the query sketch after the screenshot

:::image type="content" source="media/hardware-validation-cluster-law.png" alt-text="Screenshot of cluster LAW custom table query." lightbox="media/hardware-validation-cluster-law.png":::
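
If you prefer to start from a typed query rather than the screenshot, the following is a minimal sketch. It assumes only the HWVal_CL table name used in this article and the standard `TimeGenerated` column; any further filtering depends on the columns present in your workspace.

```kusto
// Minimal sketch: list recent hardware validation records.
// Assumes only the HWVal_CL custom table named in this article and the
// standard TimeGenerated column; adjust the time window as needed.
HWVal_CL
| where TimeGenerated > ago(7d)
| sort by TimeGenerated desc
```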

## Examining hardware validation results

The Hardware Validation result for a given server includes the following categories.

- system_info
- drive_info
- network_info
- health_info
- boot_info

Expanding `result_detail` for a given category shows detailed results.
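
To pull the per-field results for one category directly in a query, a sketch along the following lines can expand `result_detail` and keep only failures. The `system_info_s` column name is an assumption (custom log columns commonly carry a type suffix); verify the actual column names in your HWVal_CL table before relying on it.

```kusto
// Sketch: expand result_detail for the system_info category and keep
// only failed checks. The system_info_s column name is assumed; verify
// it against your HWVal_CL schema.
HWVal_CL
| where TimeGenerated > ago(7d)
| extend detail = parse_json(system_info_s).result_detail
| mv-expand detail
| where tostring(detail.comparison_result) == "Fail"
| project TimeGenerated,
    field_name = tostring(detail.field_name),
    expected = tostring(detail.expected),
    fetched = tostring(detail.fetched)
```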

## Troubleshooting specific failures

### System info category

* Memory/RAM-related failure (memory_capacity_GB)
  * Memory specs are defined in the SKU.
  * Memory below the threshold value indicates missing or failed DIMM(s). Failed DIMM(s) would also be reflected in the `health_info` category.

  ```json
  {
      "field_name": "memory_capacity_GB",
      "comparison_result": "Fail",
      "expected": "512",
      "fetched": "480"
  }
  ```

* CPU-Related Failure (cpu_sockets)
  * CPU specs are defined in the SKU.
  * A failed `cpu_sockets` check indicates a failed CPU or a CPU count mismatch.

  ```json
  {
      "field_name": "cpu_sockets",
      "comparison_result": "Fail",
      "expected": "2",
      "fetched": "1"
  }
  ```

* Model Check Failure (Model)
  * A failed `Model` check indicates that the wrong server is racked in the slot or that there's a cabling mismatch.

  ```json
  {
      "field_name": "Model",
      "comparison_result": "Fail",
      "expected": "R750",
      "fetched": "R650"
  }
  ```

### Drive info category

* Disk Check Failure
  * Drive specs are defined in the SKU.
  * Mismatched capacity values indicate incorrect drives or drives inserted into incorrect slots.
  * Missing capacity and type fetched values indicate drives that failed, are missing, or are inserted into incorrect slots.

  ```json
  {
      "field_name": "Disk_0_Capacity_GB",
      "comparison_result": "Fail",
      "expected": "893",
      "fetched": "3576"
  }
  ```

  ```json
  {
      "field_name": "Disk_0_Capacity_GB",
      "comparison_result": "Fail",
      "expected": "893",
      "fetched": ""
  }
  ```

  ```json
  {
      "field_name": "Disk_0_Type",
      "comparison_result": "Fail",
      "expected": "SSD",
      "fetched": ""
  }
  ```

### Network info category

* NIC Check Failure
  * Dell server NIC specs are defined in the SKU.
  * Mismatched link status indicates loose or faulty cabling, or crossed cables.
  * A mismatched model indicates that an incorrect NIC card is inserted into the slot.
  * Missing link/model fetched values indicate NICs that failed, are missing, or are inserted into incorrect slots.

  ```json
  {
      "field_name": "NIC.Slot.3-1-1_LinkStatus",
      "comparison_result": "Fail",
      "expected": "Up",
      "fetched": "Down"
  }
  ```

  ```json
  {
      "field_name": "NIC.Embedded.2-1-1_LinkStatus",
      "comparison_result": "Fail",
      "expected": "Down",
      "fetched": "Up"
  }
  ```

  ```json
  {
      "field_name": "NIC.Slot.3-1-1_Model",
      "comparison_result": "Fail",
      "expected": "ConnectX-6",
      "fetched": "BCM5720"
  }
  ```

  ```json
  {
      "field_name": "NIC.Slot.3-1-1_LinkStatus",
      "comparison_result": "Fail",
      "expected": "Up",
      "fetched": ""
  }
  ```

  ```json
  {
      "field_name": "NIC.Slot.3-1-1_Model",
      "comparison_result": "Fail",
      "expected": "ConnectX-6",
      "fetched": ""
  }
  ```

* NIC Check L2 Switch Information
  * HW Validation reports L2 switch information for each of the server interfaces.
  * The switch connection ID (switch interface MAC) and switch port connection ID (switch interface label) are informational.

  ```json
  {
      "field_name": "NIC.Slot.3-1-1_SwitchConnectionID",
      "expected": "unknown",
      "fetched": "c0:d6:82:23:0c:7d",
      "comparison_result": "Info"
  }
  ```

  ```json
  {
      "field_name": "NIC.Slot.3-1-1_SwitchPortConnectionID",
      "expected": "unknown",
      "fetched": "Ethernet10/1",
      "comparison_result": "Info"
  }
  ```

* Release 3.6 introduced cable checks for bonded interfaces.
  * Mismatched cabling is reported in the result_log.
  * The cable check validates that bonded NICs connect to switch ports with the same port ID. In the following example, PCI 3/1 and 3/2 connect to "Ethernet1/1" and "Ethernet1/3" respectively on the TOR switch, triggering a hardware validation failure.

  ```json
  {
      "network_info": {
          "network_info_result": "Fail",
          "result_detail": [
              {
                  "field_name": "NIC.Slot.3-1-1_SwitchPortConnectionID",
                  "fetched": "Ethernet1/1"
              },
              {
                  "field_name": "NIC.Slot.3-2-1_SwitchPortConnectionID",
                  "fetched": "Ethernet1/3"
              }
          ],
          "result_log": [
              "Cabling problem detected on PCI Slot 3"
          ]
      }
  }
  ```

### Health info category

* Health Check Sensor Failure
  * Server health checks cover various hardware component sensors.
  * A failed health sensor indicates a problem with the corresponding hardware component.
  * The following examples indicate fan, drive, and CPU failures, respectively.

  ```json
  {
      "field_name": "System Board Fan1A",
      "comparison_result": "Fail",
      "expected": "Enabled-OK",
      "fetched": "Enabled-Critical"
  }
  ```

  ```json
  {
      "field_name": "Solid State Disk 0:1:1",
      "comparison_result": "Fail",
      "expected": "Enabled-OK",
      "fetched": "Enabled-Critical"
  }
  ```

  ```json
  {
      "field_name": "CPU.Socket.1",
      "comparison_result": "Fail",
      "expected": "Enabled-OK",
      "fetched": "Enabled-Critical"
  }
  ```

* Health Check Lifecycle Log (LC Log) Failures
  * Dell server health checks fail for recent critical LC Log alarms.
  * The hardware validation plugin logs the alarm ID, name, and timestamp.
  * Recent critical LC Log alarms indicate a need for further investigation.
  * The following example shows a failure for a critical backplane voltage alarm.

  ```json
  {
      "field_name": "LCLog_Critical_Alarms",
      "expected": "No Critical Errors",
      "fetched": "53539 2023-07-22T23:44:06-05:00 The system board BP1 PG voltage is outside of range.",
      "comparison_result": "Fail"
  }
  ```

* Health Check Server Power Action Failures
  * Dell server health checks fail for a failed server power-up or a failed iDRAC reset.
  * A failed server control action indicates an underlying hardware issue.
  * The following example shows a failed power-on attempt.

  ```json
  {
      "field_name": "Server Control Actions",
      "expected": "Success",
      "fetched": "Failed",
      "comparison_result": "Fail"
  }
  ```

  ```json
  "result_log": [
      "Server power up failed with: server OS is powered off after successful power on attempt"
  ]
  ```

* Health Check Power Supply Failure and Redundancy Considerations
  * Dell server health checks warn when one power supply is missing or failed.
  * The power supply `field_name` might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies, respectively.
  * A failure of one power supply doesn't trigger a hardware validation device failure.

  ```json
  {
      "field_name": "Power Supply 1",
      "expected": "Enabled-OK",
      "fetched": "UnavailableOffline-Critical",
      "comparison_result": "Warning"
  }
  ```

  ```json
  {
      "field_name": "System Board PS Redundancy",
      "expected": "Enabled-OK",
      "fetched": "Enabled-Critical",
      "comparison_result": "Warning"
  }
  ```

### Boot info category

* Boot Device Check Considerations
  * The `boot_device_name` check is currently informational.
  * A mismatched boot device name shouldn't trigger a device failure.

  ```json
  {
      "comparison_result": "Info",
      "expected": "NIC.PxeDevice.1-1",
      "fetched": "NIC.PxeDevice.1-1",
      "field_name": "boot_device_name"
  }
  ```

* PXE Device Check Considerations
  * This check validates the PXE device settings.
  * Failed `pxe_device_1_name` or `pxe_device_1_state` checks indicate a problem with the PXE configuration.
  * Failed settings need to be fixed to enable system boot during deployment.

  ```json
  {
      "field_name": "pxe_device_1_name",
      "expected": "NIC.Embedded.1-1-1",
      "fetched": "NIC.Embedded.1-2-1",
      "comparison_result": "Fail"
  }
  ```

  ```json
  {
      "field_name": "pxe_device_1_state",
      "expected": "Enabled",
      "fetched": "Disabled",
      "comparison_result": "Fail"
  }
  ```

## Adding servers back into the Cluster after a repair

After the hardware is fixed, run the BMM replace action by following the instructions in [BMM actions](howto-baremetal-functions.md). You can then re-query the Log Analytics Workspace to review the latest hardware validation results, as in the sketch that follows.
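
As a minimal sketch for that follow-up check, the query below pulls the newest HWVal_CL records; it relies only on the table name used in this article and the standard `TimeGenerated` column, so any server-specific filtering depends on your table's actual schema.

```kusto
// Re-check the most recent hardware validation records after the repair.
// Only the HWVal_CL table name and the standard TimeGenerated column are
// assumed; add server-specific filters once you know your column names.
HWVal_CL
| where TimeGenerated > ago(1d)
| sort by TimeGenerated desc
| take 5
```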