
Commit 2aea2e9

HWV Updates for Release 3.13
1 parent eb6485a commit 2aea2e9

File tree

3 files changed

+63
-41
lines changed


articles/operator-nexus/howto-configure-cluster.md

Lines changed: 22 additions & 19 deletions
@@ -86,12 +86,12 @@ az networkcloud cluster create --name "$CLUSTER_NAME" --location "$LOCATION" \
 | COMPX_RACK_SKU | Rack SKU for CompX Rack; repeat for each rack in compute-rack-definitions |
 | COMPX_RACK_SN | Rack Serial Number for CompX Rack; repeat for each rack in compute-rack-definitions |
 | COMPX_RACK_LOCATION | Rack physical location for CompX Rack; repeat for each rack in compute-rack-definitions |
-| COMPX_SVRY_BMC_PASS | CompX Rack ServerY BMC password; repeat for each rack in compute-rack-definitions and for each server in rack |
+| COMPX_SVRY_BMC_PASS | CompX Rack ServerY Baseboard Management Controller (BMC) password; repeat for each rack in compute-rack-definitions and for each server in rack |
 | COMPX_SVRY_BMC_USER | CompX Rack ServerY BMC user; repeat for each rack in compute-rack-definitions and for each server in rack |
 | COMPX_SVRY_BMC_MAC | CompX Rack ServerY BMC MAC address; repeat for each rack in compute-rack-definitions and for each server in rack |
-| COMPX_SVRY_BOOT_MAC | CompX Rack ServerY boot NIC MAC address; repeat for each rack in compute-rack-definitions and for each server in rack |
+| COMPX_SVRY_BOOT_MAC | CompX Rack ServerY boot Network Interface Card (NIC) MAC address; repeat for each rack in compute-rack-definitions and for each server in rack |
 | COMPX_SVRY_SERVER_DETAILS | CompX Rack ServerY details; repeat for each rack in compute-rack-definitions and for each server in rack |
-| COMPX_SVRY_SERVER_NAME | CompX Rack ServerY name, repeat for each rack in compute-rack-definitions and for each server in rack |
+| COMPX_SVRY_SERVER_NAME | CompX Rack ServerY name; repeat for each rack in compute-rack-definitions and for each server in rack |
 | MRG_NAME | Cluster managed resource group name |
 | MRG_LOCATION | Cluster Azure region |
 | NF_ID | Reference to Network Fabric |
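
As an illustration of how the placeholders above are typically wired up, here's a minimal shell sketch with purely hypothetical values (the SKU name, serial number, and MAC addresses are invented; substitute your site's data before use):

```bash
# Hypothetical placeholder values for compute rack 1, server 1
# (every value below is invented for illustration)
COMP1_RACK_SKU="example-compute-rack-sku"
COMP1_RACK_SN="SN123456"
COMP1_RACK_LOCATION="DC1 Aisle 2 Rack 5"
COMP1_SVR1_BMC_MAC="AA:BB:CC:DD:EE:01"
COMP1_SVR1_BOOT_MAC="AA:BB:CC:DD:EE:02"

# These variables are then substituted into the compute-rack-definitions
# payload passed to `az networkcloud cluster create`
echo "Rack ${COMP1_RACK_SN} (${COMP1_RACK_SKU}) at ${COMP1_RACK_LOCATION}"
```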
@@ -101,8 +101,8 @@ az networkcloud cluster create --name "$CLUSTER_NAME" --location "$LOCATION" \
 | TENANT_ID | Subscription tenant ID |
 | SUBSCRIPTION_ID | Subscription ID |
 | KV_RESOURCE_ID | Key Vault ID |
-| CLUSTER_TYPE | Type of cluster, Single, or MultiRack |
-| CLUSTER_VERSION | NC Version of cluster |
+| CLUSTER_TYPE | Type of cluster, Single, or MultiRack |
+| CLUSTER_VERSION | Network Cloud (NC) Version of cluster |
 | TAG_KEY1 | Optional tag1 to pass to Cluster Create |
 | TAG_VALUE1 | Optional tag1 value to pass to Cluster Create |
 | TAG_KEY2 | Optional tag2 to pass to Cluster Create |
@@ -111,7 +111,7 @@ az networkcloud cluster create --name "$CLUSTER_NAME" --location "$LOCATION" \
 
 ## Cluster Identity
 
-Starting with the 2024-06-01-preview API version, a customer can assign managed identity to a Cluster. Both System-assigned and User-Assigned managed identities are supported.
+The customer can assign managed identity to a Cluster starting with the 2024-06-01-preview API version. Both System-assigned and User-Assigned managed identities are supported.
 
 Managed Identity can be assigned to the Cluster during creation or update operations by providing the following parameters:
 
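As a sketch of supplying those parameters on an update (flag names assume the `az networkcloud` CLI extension; confirm them with `az networkcloud cluster update --help` before relying on this):

```bash
# Enable a system-assigned managed identity on an existing Cluster
az networkcloud cluster update \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG" \
  --mi-system-assigned

# Or attach a user-assigned managed identity by its resource ID
az networkcloud cluster update \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG" \
  --mi-user-assigned "$UAI_RESOURCE_ID"
```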
@@ -131,23 +131,23 @@ You can find examples for an 8-Rack 2M16C SKU cluster using these two files:
 >[!NOTE]
 >To get the correct formatting, copy the raw code file. The values within the cluster.parameters.jsonc file are customer specific and may not be a complete list. Update the value fields for your specific environment.
 
-1. In a web browser, go to the [Azure portal](https://portal.azure.com/) and sign in.
-1. From the Azure portal search bar, search for 'Deploy a custom template' and then select it from the available services.
+1. Navigate to [Azure portal](https://portal.azure.com/) in a web browser and sign in.
+1. Search for 'Deploy a custom template' in the Azure portal search bar, and then select it from the available services.
 1. Click on Build your own template in the editor.
 1. Click on Load file. Locate your cluster.jsonc template file and upload it.
 1. Click Save.
 1. Click Edit parameters.
 1. Click Load file. Locate your cluster.parameters.jsonc parameters file and upload it.
 1. Click Save.
 1. Select the correct Subscription.
-1. Search for the Resource group to see if it already exists. If not, create a new Resource group.
+1. Search for the Resource group to see if it already exists. If not, create a new Resource group.
 1. Make sure all Instance Details are correct.
 1. Click Review + create.
 
 
 ### Cluster validation
 
-A successful Operator Nexus Cluster creation results in the creation of an AKS cluster
+A successful Operator Nexus Cluster creation results in the creation of an Azure Kubernetes Service (AKS) cluster
 inside your subscription. The cluster ID, cluster provisioning state, and
 deployment state are returned as a result of a successful `cluster create`.
 
@@ -170,16 +170,16 @@ Cluster create Logs can be viewed in the following locations:
 
 ## Deploy Cluster
 
-After creating the cluster, the deploy cluster action can be triggered.
+The deploy Cluster action can be triggered after creating the Cluster.
 The deploy Cluster action creates the bootstrap image and deploys the Cluster.
 
 Deploy Cluster initiates a sequence of events that occur in the Cluster Manager.
 
-1. Validation of the cluster/rack properties
+1. Validation of the cluster/rack properties.
 2. Generation of a bootable image for the ephemeral bootstrap cluster
 (Validation of Infrastructure).
-3. Interaction with the IPMI interface of the targeted bootstrap machine.
-4. Perform hardware validation checks
+3. Interaction with the Intelligent Platform Management Interface (IPMI) of the targeted bootstrap machine.
+4. Performing hardware validation checks.
 5. Monitoring of the Cluster deployment process.
 
 Deploy the on-premises Cluster:
@@ -198,7 +198,7 @@ az networkcloud cluster deploy \
 > See the section [Cluster Deploy Failed](#cluster-deploy-failed) for more detailed steps.
 > Optionally, the command can run asynchronously using the `--no-wait` flag.
 
-### Cluster Deploy with hardware validation
+### Cluster Deployment with hardware validation
 
 During a Cluster deploy process, one of the steps executed is hardware validation.
 The hardware validation procedure runs various tests and checks against the machines
@@ -211,6 +211,9 @@ passed and/or are available to meet the thresholds necessary for deployment to c
 > Additionally, the provided Service Principal in the Cluster object is used for authentication against the Log Analytics Workspace Data Collection API.
 > This capability is only visible during a new deployment (Green Field); existing clusters will not have the logs available retroactively.
 
+> [!NOTE]
+> The RAID controller is reset during cluster deployment, wiping all data from the server's virtual disks. Any Baseboard Management Controller (BMC) virtual disk alerts can typically be ignored unless there are additional physical disk and/or RAID controller alerts.
+
 By default, the hardware validation process writes the results to the configured Cluster `analyticsWorkspaceId`.
 However, due to the nature of Log Analytics Workspace data collection and schema evaluation, there can be ingestion delay that can take several minutes or more.
 For this reason, the Cluster deployment proceeds even if there was a failure to write the results to the Log Analytics Workspace.
@@ -219,9 +222,9 @@ To help address this possible event, the results, for redundancy, are also logge
 In the provided Cluster object's Log Analytics Workspace, a new custom table with the Cluster's name as prefix and the suffix `*_CL` should appear.
 In the _Logs_ section of the LAW resource, a query can be executed against the new `*_CL` Custom Log table.
 
-#### Cluster Deploy Action with skipping specific bare-metal-machine
+#### Cluster Deployment skipping a specific bare-metal-machine
 
-A parameter can be passed in to the deploy command that represents the names of
+The `--skip-validation-for-machines` parameter represents the names of
 bare metal machines in the cluster that should be skipped during hardware validation.
 Nodes skipped aren't validated and aren't added to the node pool.
 Additionally, nodes skipped don't count against the total used by threshold calculations.
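
A hedged sketch of a deploy call that skips two machines (the machine-name variables are placeholders, and the exact value syntax should be confirmed with `az networkcloud cluster deploy --help`):

```bash
# Deploy the Cluster while skipping hardware validation for known-bad machines;
# skipped nodes aren't validated, aren't added to the node pool, and don't
# count against validation thresholds
az networkcloud cluster deploy \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG" \
  --skip-validation-for-machines "$BAD_MACHINE_1" "$BAD_MACHINE_2"
```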
@@ -279,7 +282,7 @@ az networkcloud cluster show --resource-group "$CLUSTER_RG" \
 ```
 
 The Cluster deployment is in-progress when detailedStatus is set to `Deploying` and detailedStatusMessage shows the progress of deployment.
-Some examples of deployment progress shown in detailedStatusMessage are `Hardware validation is in progress.` (if cluster is deployed with hardware validation) ,`Cluster is bootstrapping.`, `KCP initialization in progress.`, `Management plane deployment in progress.`, `Cluster extension deployment in progress.`, `waiting for "<rack-ids>" to be ready`, etc.
+Some examples of deployment progress shown in detailedStatusMessage are `Hardware validation is in progress.` (if cluster is deployed with hardware validation), `Cluster is bootstrapping.`, `KCP initialization in progress.`, `Management plane deployment in progress.`, `Cluster extension deployment in progress.`, `waiting for "<rack-ids>" to be ready`, etc.
 
 :::image type="content" source="./media/nexus-deploy-kcp-status.png" lightbox="./media/nexus-deploy-kcp-status.png" alt-text="Screenshot of Azure portal showing cluster deploy progress kcp init.":::
 
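To pull those two fields out of the `cluster show` output for quick polling, a small `jq` sketch against a saved response works (the JSON below is a hypothetical fragment, not real output):

```bash
# Hypothetical fragment of `az networkcloud cluster show -o json`
cat > /tmp/cluster-status.json <<'EOF'
{
  "detailedStatus": "Deploying",
  "detailedStatusMessage": "KCP initialization in progress."
}
EOF

# Summarize progress in one line
jq -r '"\(.detailedStatus): \(.detailedStatusMessage)"' /tmp/cluster-status.json
```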
@@ -372,7 +375,7 @@ Note, `<APIVersion>` is the API version 2024-06-01-preview or newer.
 
 ## Delete a cluster
 
-When deleting a cluster, it deletes the resources in Azure and the cluster that resides in the on-premises environment.
+Deleting a cluster deletes the resources in Azure and the cluster that resides in the on-premises environment.
 
 >[!NOTE]
 >If there are any tenant resources that exist in the cluster, it will not be deleted until those resources are deleted.
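
A minimal sketch of the delete call, reusing the placeholder variables from the earlier commands:

```bash
# Delete the Cluster (removes both the Azure resources and the on-premises
# cluster); ensure tenant resources are removed first, per the note above
az networkcloud cluster delete \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG"
```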

articles/operator-nexus/troubleshoot-hardware-validation-failure.md

Lines changed: 38 additions & 22 deletions
@@ -197,7 +197,7 @@ Expanding `result_detail` for a given category shows detailed results.
 
 ### Drive info category
 
-* Disk Check Failure
+* Disk Checks Failure
     * Drive specs are defined in the SKU. Mismatched capacity values indicate incorrect drives or drives inserted into incorrect slots. Missing capacity and type fetched values indicate drives that are failed, missing, or inserted into incorrect slots.
 
     ```json
@@ -427,7 +427,7 @@ Expanding `result_detail` for a given category shows detailed results.
     * To troubleshoot a server health failure, engage vendor.
 
 * Health Check LifeCycle (LC) Log Failures
-    * Dell server health checks fail for recent Critical LC Log Alarms. The hardware validation plugin logs the alarm ID, name, and timestamp. Recent LC Log critical alarms indicate need for further investigation. The following example shows a failure for a critical backplane voltage alarm.
+    * Dell server health checks fail for recent Critical LC Log Alarms. The hardware validation plugin logs the alarm ID, name, and timestamp. Recent critical LC Log alarms indicate a need for further investigation. The following example shows a failure for a critical backplane voltage alarm.
 
     ```json
     {
@@ -439,6 +439,7 @@ Expanding `result_detail` for a given category shows detailed results.
     ```
 
     * Virtual disk errors typically indicate a RAID cleanup false positive condition and are logged due to the timing of RAID cleanup and system power off before HWV. The following example shows an LC log critical error on virtual disk 238. If multiple errors are encountered blocking deployment, delete the cluster, wait two hours, then reattempt cluster deployment. If the failures aren't deployment blocking, wait two hours then run BMM replace.
+    * Virtual disk errors are allowlisted starting with release 3.13 and don't trigger a health check failure.
 
     ```json
     {
@@ -461,7 +462,7 @@ Expanding `result_detail` for a given category shows detailed results.
 
     * If `Backplane Comm` critical errors are logged, perform flea drain. Engage vendor to troubleshoot any other LC log critical failures.
 
-* Health Check Server Power Action Failures
+* Health Check Server Power Control Action Failures
     * Dell server health checks fail for failed server power-up or failed iDRAC reset. A failed server control action indicates an underlying hardware issue. The following example shows a failed power on attempt.
 
     ```json
@@ -491,6 +492,38 @@ Expanding `result_detail` for a given category shows detailed results.
 
     * To troubleshoot server power-on failure, attempt a flea drain. If the problem persists, engage vendor.
 
+* RAID Cleanup Failures
+    * RAID cleanup was added to HWV in release 3.13. As part of RAID cleanup, the RAID controller configuration is reset. The Dell server health check fails for a RAID controller reset failure. A failed RAID cleanup action indicates an underlying hardware issue. The following example shows a failed RAID controller reset.
+
+    ```json
+    {
+        "field_name": "Server Control Actions",
+        "comparison_result": "Fail",
+        "expected": "Success",
+        "fetched": "Failed"
+    }
+    ```
+
+    ```json
+    "result_log": [
+        "RAID cleanup failed with: raid deletion failed after 2 attempts",
+    ]
+    ```
+
+    * To clear RAID in the BMC web UI:
+
+        `BMC` -> `Dashboard` -> `Storage` -> `Controllers` -> `Actions` -> `Reset Configuration`
+
+    * To clear RAID with racadm, check for RAID controllers, then clear the configuration:
+
+    ```bash
+    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD storage get controllers | grep "RAID"
+    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD storage resetconfig:RAID.SL.3-1 # substitute the RAID controller from the get command
+    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD jobqueue create RAID.SL.3-1 --realtime # substitute the RAID controller from the get command
+    ```
+
+    * To troubleshoot a RAID cleanup failure, check for any errors logged. For Dell R650/660, ensure that only slots 0 and 1 contain physical drives. For Dell R750/760, ensure that only slots 0 through 3 contain physical drives. For any other models, confirm there are no extra drives inserted based on the SKU definition. All extra drives should be removed to align with the SKU. If the problem persists, engage vendor.
+
 * Health Check Power Supply Failure and Redundancy Considerations
     * Dell server health checks warn when one power supply is missing or failed. Power supply "field_name" might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies respectively. A failure of one power supply doesn't trigger an HWV device failure.
 
@@ -539,8 +572,9 @@ Expanding `result_detail` for a given category shows detailed results.
     }
     ```
 
-* PXE Device Check Considerations
+* PXE Device Checks Considerations
     * This check validates the PXE device settings.
+    * Starting with release 3.13, HWV attempts to auto-fix the BIOS boot configuration.
     * Failed `pxe_device_1_name` or `pxe_device_1_state` checks indicate a problem with the PXE configuration.
     * Failed settings need to be fixed to enable system boot during deployment.
 
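To inspect the boot configuration that this check (and the 3.13 auto-fix) targets, a racadm query along these lines can be used on Dell servers; the attribute path is an assumption based on common iDRAC attribute registries, so confirm it for your model:

```bash
# Query the configured BIOS boot sequence via the iDRAC
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD get BIOS.BiosBootSettings.BootSeq
```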
@@ -599,24 +633,6 @@ Expanding `result_detail` for a given category shows detailed results.
 
     * To troubleshoot, ping the iDRAC from a jumpbox with access to the BMC network. If the iDRAC pings, check that passwords match.
 
-### Special considerations
-
-* Servers Failing Multiple Health and Network Checks
-    * Raid deletion is performed during cluster deploy and cluster delete actions for all releases inclusive of 3.12.
-    * If we observe servers getting powered off during hardware validation with multiple failed health and network checks, we need to reattempt cluster deployment.
-    * If issues persist, raid deletion needs to be performed manually on `control` nodes in the cluster.
-
-    * To clear raid in BMC webui:
-
-        `BMC` -> `Storage` -> `Virtual Disks` -> `Action` -> `Delete` -> `Apply Now`
-
-    * To clear raid with racadm:
-
-    ```bash
-    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD raid deletevd:Disk.Virtual.239:RAID.SL.3-1
-    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD jobqueue create RAID.SL.3-1 --realtime
-    ```
-
 ## Adding servers back into the Cluster after a repair
 
 After Hardware is fixed, run BMM Replace following instructions from the following page [BMM actions](howto-baremetal-functions.md).
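
A hedged sketch of that replace call (parameter names assume the `az networkcloud` CLI extension and all values are placeholders; verify with `az networkcloud baremetalmachine replace --help`):

```bash
# Replace a repaired bare metal machine, re-running hardware validation
az networkcloud baremetalmachine replace \
  --name "$BMM_NAME" \
  --resource-group "$MRG_NAME" \
  --bmc-credentials password="$BMC_PWD" username="$BMC_USR" \
  --bmc-mac-address "$BMC_MAC" \
  --boot-mac-address "$BOOT_MAC" \
  --machine-name "$BMM_NAME" \
  --serial-number "$SERIAL_NUMBER"
```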

articles/operator-nexus/troubleshoot-reboot-reimage-replace.md

Lines changed: 3 additions & 0 deletions
@@ -64,6 +64,9 @@ Servers contain many physical components that can fail over time. It's important
 
 A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the tenant data isn't modified during replacement.
 
+> [!IMPORTANT]
+> Starting with release 3.13, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk and/or RAID controller alerts.
+
 As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.
 
 When you're performing a physical hot swappable power supply repair, a replace action isn't required because the BMM host will continue to function normally after the repair.
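
The cordon-then-shut-down best practice above can be sketched as follows (flag names assume the `az networkcloud` CLI extension; confirm with `--help`):

```bash
# Remove the BMM from workload scheduling, evacuating existing workloads
az networkcloud baremetalmachine cordon \
  --evacuate "True" \
  --name "$BMM_NAME" \
  --resource-group "$MRG_NAME"

# Then power it off ahead of the physical repair
az networkcloud baremetalmachine power-off \
  --name "$BMM_NAME" \
  --resource-group "$MRG_NAME"
```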
