| COMPX_RACK_SKU | Rack SKU for CompX Rack; repeat for each rack in compute-rack-definitions |
| COMPX_RACK_SN | Rack Serial Number for CompX Rack; repeat for each rack in compute-rack-definitions |
| COMPX_RACK_LOCATION | Rack physical location for CompX Rack; repeat for each rack in compute-rack-definitions |
| COMPX_SVRY_BMC_PASS | CompX Rack ServerY Baseboard Management Controller (BMC) password; repeat for each rack in compute-rack-definitions and for each server in rack |
| COMPX_SVRY_BMC_USER | CompX Rack ServerY BMC user; repeat for each rack in compute-rack-definitions and for each server in rack |
| COMPX_SVRY_BMC_MAC | CompX Rack ServerY BMC MAC address; repeat for each rack in compute-rack-definitions and for each server in rack |
| COMPX_SVRY_BOOT_MAC | CompX Rack ServerY boot Network Interface Card (NIC) MAC address; repeat for each rack in compute-rack-definitions and for each server in rack |
| COMPX_SVRY_SERVER_DETAILS | CompX Rack ServerY details; repeat for each rack in compute-rack-definitions and for each server in rack |
| COMPX_SVRY_SERVER_NAME | CompX Rack ServerY name; repeat for each rack in compute-rack-definitions and for each server in rack |
| MRG_NAME | Cluster managed resource group name |
The customer can assign a managed identity to a Cluster starting with the 2024-06-01-preview API version. Both system-assigned and user-assigned managed identities are supported.
Managed Identity can be assigned to the Cluster during creation or update operations by providing the following parameters:
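As an illustrative sketch (the exact property names should be confirmed against the 2024-06-01-preview schema; this follows the common Azure Resource Manager `identity` convention), a system-assigned identity typically appears in the Cluster resource payload as:

```json
{
  "identity": {
    "type": "SystemAssigned"
  }
}
```

For a user-assigned identity, the same block conventionally uses `"type": "UserAssigned"` together with a `userAssignedIdentities` map keyed by the identity's resource ID.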
You can find examples for an 8-Rack 2M16C SKU cluster using these two files:
> [!NOTE]
> To get the correct formatting, copy the raw code file. The values within the cluster.parameters.jsonc file are customer specific and may not be a complete list. Update the value fields for your specific environment.
1. Navigate to the [Azure portal](https://portal.azure.com/) in a web browser and sign in.
1. Search for 'Deploy a custom template' in the Azure portal search bar, and then select it from the available services.
1. Click Build your own template in the editor.
1. Click Load file. Locate your cluster.jsonc template file and upload it.
1. Click Save.
1. Click Edit parameters.
1. Click Load file. Locate your cluster.parameters.jsonc parameters file and upload it.
1. Click Save.
1. Select the correct Subscription.
1. Search for the Resource group to see if it already exists. If not, create a new Resource group.
1. Make sure all Instance Details are correct.
1. Click Review + create.
### Cluster validation
A successful Operator Nexus Cluster creation results in the creation of an Azure Kubernetes Service (AKS) cluster
inside your subscription. The cluster ID, cluster provisioning state, and
deployment state are returned as a result of a successful `cluster create`.
Cluster create logs can be viewed in the following locations:
## Deploy Cluster
The deploy Cluster action can be triggered after creating the Cluster.
The deploy Cluster action creates the bootstrap image and deploys the Cluster.
Deploy Cluster initiates a sequence of events that occur in the Cluster Manager.
1. Validation of the cluster/rack properties.
2. Generation of a bootable image for the ephemeral bootstrap cluster
(Validation of Infrastructure).
3. Interaction with the Intelligent Platform Management Interface (IPMI) of the targeted bootstrap machine.
4. Performing hardware validation checks.
5. Monitoring of the Cluster deployment process.
Deploy the on-premises Cluster:
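As a minimal sketch of that command (the variable names are placeholders, and your environment may require additional parameters):

```azurecli
az networkcloud cluster deploy \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG" \
  --subscription "$SUBSCRIPTION_ID"
```

Append `--no-wait` to run the command asynchronously.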
> See the section [Cluster Deploy Failed](#cluster-deploy-failed) for more detailed steps.
> Optionally, the command can run asynchronously using the `--no-wait` flag.
### Cluster Deployment with hardware validation
During a Cluster deploy process, one of the steps executed is hardware validation.
The hardware validation procedure runs various tests and checks against the machines
passed and/or are available to meet the thresholds necessary for deployment to continue.
> Additionally, the provided Service Principal in the Cluster object is used for authentication against the Log Analytics Workspace Data Collection API.
> This capability is only visible during a new deployment (Green Field); existing clusters won't have the logs available retroactively.
> [!NOTE]
> The RAID controller is reset during cluster deployment, wiping all data from the server's virtual disks. Any Baseboard Management Controller (BMC) virtual disk alerts can typically be ignored unless there are additional physical disk or RAID controller alerts.
By default, the hardware validation process writes the results to the configured Cluster `analyticsWorkspaceId`.
However, due to the nature of Log Analytics Workspace data collection and schema evaluation, there can be an ingestion delay of several minutes or more.
For this reason, the Cluster deployment proceeds even if there was a failure to write the results to the Log Analytics Workspace.
To help address this possible event, the results, for redundancy, are also logged.
In the provided Cluster object's Log Analytics Workspace, a new custom table with the Cluster's name as prefix and the suffix `*_CL` should appear.
In the _Logs_ section of the LAW resource, a query can be executed against the new `*_CL` Custom Log table.
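For example, a query along these lines can surface recent results (a sketch only; `mycluster_CL` is a hypothetical table name derived from the Cluster's name, and the available columns depend on the ingested schema):

```kusto
mycluster_CL
| where TimeGenerated > ago(1d)
| order by TimeGenerated desc
```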
#### Cluster Deployment skipping specific bare metal machines
The `--skip-validation-for-machines` parameter represents the names of
bare metal machines in the cluster that should be skipped during hardware validation.
Nodes skipped aren't validated and aren't added to the node pool.
Additionally, nodes skipped don't count against the total used by threshold calculations.
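The arithmetic can be illustrated with a small sketch (illustrative only, not the platform's actual implementation; it assumes a percentage threshold applied to the non-skipped machine count):

```python
import math

def machines_needed(total: int, skipped: int, threshold_pct: float) -> int:
    """How many machines must pass hardware validation when skipped
    machines are excluded from the threshold denominator."""
    considered = total - skipped  # skipped nodes are removed from the total
    return math.ceil(considered * threshold_pct / 100)

# Example: with 16 machines, 2 skipped, and an 80% threshold,
# the requirement is computed over the remaining 14 machines.
print(machines_needed(16, 2, 80))
```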
```azurecli
az networkcloud cluster show --resource-group "$CLUSTER_RG" \
  --name "$CLUSTER_NAME"
```
The Cluster deployment is in-progress when detailedStatus is set to `Deploying` and detailedStatusMessage shows the progress of deployment.
Some examples of deployment progress shown in detailedStatusMessage are `Hardware validation is in progress.` (if cluster is deployed with hardware validation), `Cluster is bootstrapping.`, `KCP initialization in progress.`, `Management plane deployment in progress.`, `Cluster extension deployment in progress.`, `waiting for "<rack-ids>" to be ready`, etc.
`articles/operator-nexus/troubleshoot-hardware-validation-failure.md`
Expanding `result_detail` for a given category shows detailed results.
### Drive info category
* Disk Checks Failure
* Drive specs are defined in the SKU. Mismatched capacity values indicate incorrect drives or drives inserted into incorrect slots. Missing capacity and type fetched values indicate drives that are failed, missing, or inserted into incorrect slots.
* To troubleshoot a server health failure, engage the vendor.
* Health Check LifeCycle (LC) Log Failures
* Dell server health checks fail for recent Critical LC Log Alarms. The hardware validation plugin logs the alarm ID, name, and timestamp. Recent critical LC Log alarms indicate the need for further investigation. The following example shows a failure for a critical backplane voltage alarm.
```json
{
```
* Virtual disk errors typically indicate a RAID cleanup false-positive condition and are logged due to the timing of RAID cleanup and system power-off before HWV. The following example shows an LC log critical error on virtual disk 238. If multiple errors are encountered blocking deployment, delete the cluster, wait two hours, then reattempt cluster deployment. If the failures aren't deployment blocking, wait two hours then run BMM replace.
* Virtual disk errors are allowlisted starting with release 3.13 and don't trigger a health check failure.
* If `Backplane Comm` critical errors are logged, perform flea drain. Engage vendor to troubleshoot any other LC log critical failures.
* Health Check Server Power Control Action Failures
* Dell server health checks fail for failed server power-up or failed iDRAC reset. A failed server control action indicates an underlying hardware issue. The following example shows a failed power-on attempt.
* To troubleshoot server power-on failure, attempt a flea drain. If the problem persists, engage the vendor.
* RAID Cleanup Failures
* RAID cleanup was added to HWV in release 3.13. As part of RAID cleanup, the RAID controller configuration is reset. Dell server health checks fail for a RAID controller reset failure. A failed RAID cleanup action indicates an underlying hardware issue. The following example shows a failed RAID controller reset.
```json
{
"field_name": "Server Control Actions",
"comparison_result": "Fail",
"expected": "Success",
"fetched": "Failed"
}
```
```json
"result_log": [
"RAID cleanup failed with: raid deletion failed after 2 attempts",
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD storage resetconfig:RAID.SL.3-1 #substitute with RAID controller from get command
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD jobqueue create RAID.SL.3-1 --realtime #substitute with RAID controller from get command
```
* To troubleshoot RAID cleanup failure, check for any errors logged. For Dell R650/660, ensure that only slots 0 and 1 contain physical drives. For Dell R750/760, ensure that only slots 0 through 3 contain physical drives. For any other models, confirm there are no extra drives inserted based on the SKU definition. All extra drives should be removed to align with the SKU. If the problem persists, engage the vendor.
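To illustrate the slot rules above, a small helper could flag extra drives (a hypothetical sketch, not a product tool; the model names and allowed slots mirror the guidance in the preceding bullet):

```python
# Hypothetical slot rules distilled from the guidance above.
EXPECTED_SLOTS = {
    "R650": {0, 1},
    "R660": {0, 1},
    "R750": {0, 1, 2, 3},
    "R760": {0, 1, 2, 3},
}

def extra_drive_slots(model: str, populated: set) -> set:
    """Return populated drive slots that fall outside the SKU-allowed set."""
    allowed = EXPECTED_SLOTS.get(model)
    if allowed is None:
        raise ValueError(f"unknown model: {model}")
    return populated - allowed

print(sorted(extra_drive_slots("R760", {0, 1, 2, 3, 4})))  # extra drive in slot 4
```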
* Health Check Power Supply Failure and Redundancy Considerations
* Dell server health checks warn when one power supply is missing or failed. Power supply "field_name" might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies respectively. A failure of one power supply doesn't trigger an HWV device failure.
* PXE Device Checks Considerations
* This check validates the PXE device settings.
* Starting with release 3.13, HWV attempts to auto-fix the BIOS boot configuration.
* Failed `pxe_device_1_name` or `pxe_device_1_state` checks indicate a problem with the PXE configuration.
* Failed settings need to be fixed to enable system boot during deployment.
* To troubleshoot, ping the iDRAC from a jumpbox with access to the BMC network. If the iDRAC pings, check that the passwords match.
`articles/operator-nexus/troubleshoot-reboot-reimage-replace.md`
Servers contain many physical components that can fail over time.
A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the tenant data isn't modified during replacement.
> [!IMPORTANT]
> Starting with release 3.13, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk or RAID controller alerts.
As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.
When you're performing a physical hot-swappable power supply repair, a replace action isn't required because the BMM host will continue to function normally after the repair.