
Commit 2aea2e9

HWV Updates for Release 3.13
1 parent eb6485a commit 2aea2e9

File tree

3 files changed

+63
-41
lines changed


articles/operator-nexus/howto-configure-cluster.md

Lines changed: 22 additions & 19 deletions
@@ -86,12 +86,12 @@ az networkcloud cluster create --name "$CLUSTER_NAME" --location "$LOCATION" \
 | COMPX_RACK_SKU | Rack SKU for CompX Rack; repeat for each rack in compute-rack-definitions |
 | COMPX_RACK_SN | Rack Serial Number for CompX Rack; repeat for each rack in compute-rack-definitions |
 | COMPX_RACK_LOCATION | Rack physical location for CompX Rack; repeat for each rack in compute-rack-definitions |
-| COMPX_SVRY_BMC_PASS | CompX Rack ServerY BMC password; repeat for each rack in compute-rack-definitions and for each server in rack |
+| COMPX_SVRY_BMC_PASS | CompX Rack ServerY Baseboard Management Controller (BMC) password; repeat for each rack in compute-rack-definitions and for each server in rack |
 | COMPX_SVRY_BMC_USER | CompX Rack ServerY BMC user; repeat for each rack in compute-rack-definitions and for each server in rack |
 | COMPX_SVRY_BMC_MAC | CompX Rack ServerY BMC MAC address; repeat for each rack in compute-rack-definitions and for each server in rack |
-| COMPX_SVRY_BOOT_MAC | CompX Rack ServerY boot NIC MAC address; repeat for each rack in compute-rack-definitions and for each server in rack |
+| COMPX_SVRY_BOOT_MAC | CompX Rack ServerY boot Network Interface Card (NIC) MAC address; repeat for each rack in compute-rack-definitions and for each server in rack |
 | COMPX_SVRY_SERVER_DETAILS | CompX Rack ServerY details; repeat for each rack in compute-rack-definitions and for each server in rack |
-| COMPX_SVRY_SERVER_NAME | CompX Rack ServerY name, repeat for each rack in compute-rack-definitions and for each server in rack |
+| COMPX_SVRY_SERVER_NAME | CompX Rack ServerY name; repeat for each rack in compute-rack-definitions and for each server in rack |
 | MRG_NAME | Cluster managed resource group name |
 | MRG_LOCATION | Cluster Azure region |
 | NF_ID | Reference to Network Fabric |
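
As an illustration of how the placeholders above are typically wired up, here's a minimal shell sketch with purely hypothetical values (the SKU name, serial number, and MAC addresses are invented; substitute your site's data before use):

```bash
# Hypothetical placeholder values for compute rack 1, server 1
# (every value below is invented for illustration)
COMP1_RACK_SKU="example-compute-rack-sku"
COMP1_RACK_SN="SN123456"
COMP1_RACK_LOCATION="DC1 Aisle 2 Rack 5"
COMP1_SVR1_BMC_MAC="AA:BB:CC:DD:EE:01"
COMP1_SVR1_BOOT_MAC="AA:BB:CC:DD:EE:02"

# These variables are then substituted into the compute-rack-definitions
# payload passed to `az networkcloud cluster create`
echo "Rack ${COMP1_RACK_SN} (${COMP1_RACK_SKU}) at ${COMP1_RACK_LOCATION}"
```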
@@ -101,8 +101,8 @@ az networkcloud cluster create --name "$CLUSTER_NAME" --location "$LOCATION" \
 | TENANT_ID | Subscription tenant ID |
 | SUBSCRIPTION_ID | Subscription ID |
 | KV_RESOURCE_ID | Key Vault ID |
-| CLUSTER_TYPE | Type of cluster, Single, or MultiRack |
-| CLUSTER_VERSION | NC Version of cluster |
+| CLUSTER_TYPE | Type of cluster, Single, or MultiRack |
+| CLUSTER_VERSION | Network Cloud (NC) Version of cluster |
 | TAG_KEY1 | Optional tag1 to pass to Cluster Create |
 | TAG_VALUE1 | Optional tag1 value to pass to Cluster Create |
 | TAG_KEY2 | Optional tag2 to pass to Cluster Create |
@@ -111,7 +111,7 @@ az networkcloud cluster create --name "$CLUSTER_NAME" --location "$LOCATION" \
 
 ## Cluster Identity
 
-Starting with the 2024-06-01-preview API version, a customer can assign managed identity to a Cluster. Both System-assigned and User-Assigned managed identities are supported.
+The customer can assign managed identity to a Cluster starting with the 2024-06-01-preview API version. Both System-assigned and User-Assigned managed identities are supported.
 
 Managed Identity can be assigned to the Cluster during creation or update operations by providing the following parameters:
 
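As a sketch of supplying those parameters on an update (flag names assume the `az networkcloud` CLI extension; confirm them with `az networkcloud cluster update --help` before relying on this):

```bash
# Enable a system-assigned managed identity on an existing Cluster
az networkcloud cluster update \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG" \
  --mi-system-assigned

# Or attach a user-assigned managed identity by its resource ID
az networkcloud cluster update \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG" \
  --mi-user-assigned "$UAI_RESOURCE_ID"
```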
@@ -131,23 +131,23 @@ You can find examples for an 8-Rack 2M16C SKU cluster using these two files:
 >[!NOTE]
 >To get the correct formatting, copy the raw code file. The values within the cluster.parameters.jsonc file are customer specific and may not be a complete list. Update the value fields for your specific environment.
 
-1. In a web browser, go to the [Azure portal](https://portal.azure.com/) and sign in.
-1. From the Azure portal search bar, search for 'Deploy a custom template' and then select it from the available services.
+1. Navigate to [Azure portal](https://portal.azure.com/) in a web browser and sign in.
+1. Search for 'Deploy a custom template' in the Azure portal search bar, and then select it from the available services.
 1. Click on Build your own template in the editor.
 1. Click on Load file. Locate your cluster.jsonc template file and upload it.
 1. Click Save.
 1. Click Edit parameters.
 1. Click Load file. Locate your cluster.parameters.jsonc parameters file and upload it.
 1. Click Save.
 1. Select the correct Subscription.
-1. Search for the Resource group to see if it already exists. If not, create a new Resource group.
+1. Search for the Resource group to see if it already exists. If not, create a new Resource group.
 1. Make sure all Instance Details are correct.
 1. Click Review + create.
 
 
 ### Cluster validation
 
-A successful Operator Nexus Cluster creation results in the creation of an AKS cluster
+A successful Operator Nexus Cluster creation results in the creation of an Azure Kubernetes Service (AKS) cluster
 inside your subscription. The cluster ID, cluster provisioning state, and
 deployment state are returned as a result of a successful `cluster create`.
 
@@ -170,16 +170,16 @@ Cluster create Logs can be viewed in the following locations:
 
 ## Deploy Cluster
 
-After creating the cluster, the deploy cluster action can be triggered.
+The deploy Cluster action can be triggered after creating the Cluster.
 The deploy Cluster action creates the bootstrap image and deploys the Cluster.
 
 Deploy Cluster initiates a sequence of events that occur in the Cluster Manager.
 
-1. Validation of the cluster/rack properties
+1. Validation of the cluster/rack properties.
 2. Generation of a bootable image for the ephemeral bootstrap cluster
 (Validation of Infrastructure).
-3. Interaction with the IPMI interface of the targeted bootstrap machine.
-4. Perform hardware validation checks
+3. Interaction with the Intelligent Platform Management Interface (IPMI) of the targeted bootstrap machine.
+4. Performing hardware validation checks.
 5. Monitoring of the Cluster deployment process.
 
 Deploy the on-premises Cluster:
@@ -198,7 +198,7 @@ az networkcloud cluster deploy \
 > See the section [Cluster Deploy Failed](#cluster-deploy-failed) for more detailed steps.
 > Optionally, the command can run asynchronously using the `--no-wait` flag.
 
-### Cluster Deploy with hardware validation
+### Cluster Deployment with hardware validation
 
 During a Cluster deploy process, one of the steps executed is hardware validation.
 The hardware validation procedure runs various tests and checks against the machines
@@ -211,6 +211,9 @@ passed and/or are available to meet the thresholds necessary for deployment to c
 > Additionally, the provided Service Principal in the Cluster object is used for authentication against the Log Analytics Workspace Data Collection API.
 > This capability is only visible during a new deployment (Green Field); existing clusters will not have the logs available retroactively.
 
+> [!NOTE]
+> The RAID controller is reset during cluster deployment, wiping all data from the server's virtual disks. Any Baseboard Management Controller (BMC) virtual disk alerts can typically be ignored unless there are additional physical disk and/or RAID controller alerts.
+
 By default, the hardware validation process writes the results to the configured Cluster `analyticsWorkspaceId`.
 However, due to the nature of Log Analytics Workspace data collection and schema evaluation, there can be ingestion delay that can take several minutes or more.
 For this reason, the Cluster deployment proceeds even if there was a failure to write the results to the Log Analytics Workspace.
@@ -219,9 +222,9 @@ To help address this possible event, the results, for redundancy, are also logge
 In the provided Cluster object's Log Analytics Workspace, a new custom table with the Cluster's name as prefix and the suffix `*_CL` should appear.
 In the _Logs_ section of the LAW resource, a query can be executed against the new `*_CL` Custom Log table.
 
-#### Cluster Deploy Action with skipping specific bare-metal-machine
+#### Cluster Deployment skipping a specific bare-metal-machine
 
-A parameter can be passed in to the deploy command that represents the names of
+The `--skip-validation-for-machines` parameter represents the names of
 bare metal machines in the cluster that should be skipped during hardware validation.
 Nodes skipped aren't validated and aren't added to the node pool.
 Additionally, nodes skipped don't count against the total used by threshold calculations.
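
A hedged sketch of a deploy call that skips two machines (the machine-name variables are placeholders, and the exact value syntax should be confirmed with `az networkcloud cluster deploy --help`):

```bash
# Deploy the Cluster while skipping hardware validation for known-bad machines;
# skipped nodes aren't validated, aren't added to the node pool, and don't
# count against validation thresholds
az networkcloud cluster deploy \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG" \
  --skip-validation-for-machines "$BAD_MACHINE_1" "$BAD_MACHINE_2"
```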
@@ -279,7 +282,7 @@ az networkcloud cluster show --resource-group "$CLUSTER_RG" \
 ```
 
 The Cluster deployment is in-progress when detailedStatus is set to `Deploying` and detailedStatusMessage shows the progress of deployment.
-Some examples of deployment progress shown in detailedStatusMessage are `Hardware validation is in progress.` (if cluster is deployed with hardware validation) ,`Cluster is bootstrapping.`, `KCP initialization in progress.`, `Management plane deployment in progress.`, `Cluster extension deployment in progress.`, `waiting for "<rack-ids>" to be ready`, etc.
+Some examples of deployment progress shown in detailedStatusMessage are `Hardware validation is in progress.` (if cluster is deployed with hardware validation), `Cluster is bootstrapping.`, `KCP initialization in progress.`, `Management plane deployment in progress.`, `Cluster extension deployment in progress.`, `waiting for "<rack-ids>" to be ready`, etc.
 
 :::image type="content" source="./media/nexus-deploy-kcp-status.png" lightbox="./media/nexus-deploy-kcp-status.png" alt-text="Screenshot of Azure portal showing cluster deploy progress kcp init.":::
 
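To pull those two fields out of the `cluster show` output for quick polling, a small `jq` sketch against a saved response works (the JSON below is a hypothetical fragment, not real output):

```bash
# Hypothetical fragment of `az networkcloud cluster show -o json`
cat > /tmp/cluster-status.json <<'EOF'
{
  "detailedStatus": "Deploying",
  "detailedStatusMessage": "KCP initialization in progress."
}
EOF

# Summarize progress in one line
jq -r '"\(.detailedStatus): \(.detailedStatusMessage)"' /tmp/cluster-status.json
```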
@@ -372,7 +375,7 @@ Note, `<APIVersion>` is the API version 2024-06-01-preview or newer.
 
 ## Delete a cluster
 
-When deleting a cluster, it deletes the resources in Azure and the cluster that resides in the on-premises environment.
+Deleting a cluster deletes the resources in Azure and the cluster that resides in the on-premises environment.
 
 >[!NOTE]
 >If there are any tenant resources that exist in the cluster, it will not be deleted until those resources are deleted.
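
A minimal sketch of the delete call, reusing the placeholder variables from the earlier commands:

```bash
# Delete the Cluster (removes both the Azure resources and the on-premises
# cluster); ensure tenant resources are removed first, per the note above
az networkcloud cluster delete \
  --name "$CLUSTER_NAME" \
  --resource-group "$CLUSTER_RG"
```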

articles/operator-nexus/troubleshoot-hardware-validation-failure.md

Lines changed: 38 additions & 22 deletions
@@ -197,7 +197,7 @@ Expanding `result_detail` for a given category shows detailed results.
 
 ### Drive info category
 
-* Disk Check Failure
+* Disk Checks Failure
     * Drive specs are defined in the SKU. Mismatched capacity values indicate incorrect drives or drives inserted into incorrect slots. Missing capacity and type fetched values indicate drives that are failed, missing, or inserted into incorrect slots.
 
     ```json
@@ -427,7 +427,7 @@ Expanding `result_detail` for a given category shows detailed results.
     * To troubleshoot a server health failure, engage vendor.
 
 * Health Check LifeCycle (LC) Log Failures
-    * Dell server health checks fail for recent Critical LC Log Alarms. The hardware validation plugin logs the alarm ID, name, and timestamp. Recent LC Log critical alarms indicate need for further investigation. The following example shows a failure for a critical backplane voltage alarm.
+    * Dell server health checks fail for recent Critical LC Log Alarms. The hardware validation plugin logs the alarm ID, name, and timestamp. Recent critical LC Log alarms indicate a need for further investigation. The following example shows a failure for a critical backplane voltage alarm.
 
     ```json
     {
@@ -439,6 +439,7 @@ Expanding `result_detail` for a given category shows detailed results.
     ```
 
     * Virtual disk errors typically indicate a RAID cleanup false positive condition and are logged due to the timing of RAID cleanup and system power off before HWV. The following example shows an LC log critical error on virtual disk 238. If multiple errors are encountered blocking deployment, delete the cluster, wait two hours, then reattempt cluster deployment. If the failures aren't deployment blocking, wait two hours then run BMM replace.
+    * Virtual disk errors are allowlisted starting with release 3.13 and don't trigger a health check failure.
 
     ```json
     {
@@ -461,7 +462,7 @@ Expanding `result_detail` for a given category shows detailed results.
 
     * If `Backplane Comm` critical errors are logged, perform flea drain. Engage vendor to troubleshoot any other LC log critical failures.
 
-* Health Check Server Power Action Failures
+* Health Check Server Power Control Action Failures
     * Dell server health checks fail for failed server power-up or failed iDRAC reset. A failed server control action indicates an underlying hardware issue. The following example shows a failed power on attempt.
 
     ```json
@@ -491,6 +492,38 @@ Expanding `result_detail` for a given category shows detailed results.
 
     * To troubleshoot server power-on failure, attempt a flea drain. If the problem persists, engage vendor.
 
+* RAID Cleanup Failures
+    * RAID cleanup was added to HWV in release 3.13. As part of RAID cleanup, the RAID controller configuration is reset. The Dell server health check fails for a RAID controller reset failure. A failed RAID cleanup action indicates an underlying hardware issue. The following example shows a failed RAID controller reset.
+
+    ```json
+    {
+        "field_name": "Server Control Actions",
+        "comparison_result": "Fail",
+        "expected": "Success",
+        "fetched": "Failed"
+    }
+    ```
+
+    ```json
+    "result_log": [
+        "RAID cleanup failed with: raid deletion failed after 2 attempts",
+    ]
+    ```
+
+    * To clear RAID in the BMC web UI:
+
+        `BMC` -> `Dashboard` -> `Storage` -> `Controllers` -> `Actions` -> `Reset Configuration`
+
+    * To clear RAID with racadm, check for RAID controllers, then clear the configuration:
+
+    ```bash
+    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD storage get controllers | grep "RAID"
+    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD storage resetconfig:RAID.SL.3-1 # substitute the RAID controller from the get command
+    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD jobqueue create RAID.SL.3-1 --realtime # substitute the RAID controller from the get command
+    ```
+
+    * To troubleshoot a RAID cleanup failure, check for any errors logged. For Dell R650/660, ensure that only slots 0 and 1 contain physical drives. For Dell R750/760, ensure that only slots 0 through 3 contain physical drives. For any other models, confirm there are no extra drives inserted based on the SKU definition. All extra drives should be removed to align with the SKU. If the problem persists, engage vendor.
+
 * Health Check Power Supply Failure and Redundancy Considerations
     * Dell server health checks warn when one power supply is missing or failed. Power supply "field_name" might be displayed as 0/PS0/Power Supply 0 and 1/PS1/Power Supply 1 for the first and second power supplies respectively. A failure of one power supply doesn't trigger an HWV device failure.
 
@@ -539,8 +572,9 @@ Expanding `result_detail` for a given category shows detailed results.
     }
     ```
 
-* PXE Device Check Considerations
+* PXE Device Checks Considerations
     * This check validates the PXE device settings.
+    * Starting with release 3.13, HWV attempts to auto-fix the BIOS boot configuration.
     * Failed `pxe_device_1_name` or `pxe_device_1_state` checks indicate a problem with the PXE configuration.
     * Failed settings need to be fixed to enable system boot during deployment.
 
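To inspect the boot configuration that this check (and the 3.13 auto-fix) targets, a racadm query along these lines can be used on Dell servers; the attribute path is an assumption based on common iDRAC attribute registries, so confirm it for your model:

```bash
# Query the configured BIOS boot sequence via the iDRAC
racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD get BIOS.BiosBootSettings.BootSeq
```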
@@ -599,24 +633,6 @@ Expanding `result_detail` for a given category shows detailed results.
 
     * To troubleshoot, ping the iDRAC from a jumpbox with access to the BMC network. If the iDRAC pings, check that passwords match.
 
-### Special considerations
-
-* Servers Failing Multiple Health and Network Checks
-    * Raid deletion is performed during cluster deploy and cluster delete actions for all releases inclusive of 3.12.
-    * If we observe servers getting powered off during hardware validation with multiple failed health and network checks, we need to reattempt cluster deployment.
-    * If issues persist, raid deletion needs to be performed manually on `control` nodes in the cluster.
-
-    * To clear raid in BMC webui:
-
-        `BMC` -> `Storage` -> `Virtual Disks` -> `Action` -> `Delete` -> `Apply Now`
-
-    * To clear raid with racadm:
-
-    ```bash
-    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD raid deletevd:Disk.Virtual.239:RAID.SL.3-1
-    racadm --nocertwarn -r $IP -u $BMC_USR -p $BMC_PWD jobqueue create RAID.SL.3-1 --realtime
-    ```
-
 ## Adding servers back into the Cluster after a repair
 
 After Hardware is fixed, run BMM Replace following instructions from the following page [BMM actions](howto-baremetal-functions.md).
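
A hedged sketch of that replace call (parameter names assume the `az networkcloud` CLI extension and all values are placeholders; verify with `az networkcloud baremetalmachine replace --help`):

```bash
# Replace a repaired bare metal machine, re-running hardware validation
az networkcloud baremetalmachine replace \
  --name "$BMM_NAME" \
  --resource-group "$MRG_NAME" \
  --bmc-credentials password="$BMC_PWD" username="$BMC_USR" \
  --bmc-mac-address "$BMC_MAC" \
  --boot-mac-address "$BOOT_MAC" \
  --machine-name "$BMM_NAME" \
  --serial-number "$SERIAL_NUMBER"
```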

articles/operator-nexus/troubleshoot-reboot-reimage-replace.md

Lines changed: 3 additions & 0 deletions
@@ -64,6 +64,9 @@ Servers contain many physical components that can fail over time. It's important
 
 A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the tenant data isn't modified during replacement.
 
+> [!IMPORTANT]
+> Starting with release 3.13, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk and/or RAID controller alerts.
+
 As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.
 
 When you're performing a physical hot swappable power supply repair, a replace action isn't required because the BMM host will continue to function normally after the repair.
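
The cordon-then-shut-down best practice above can be sketched as follows (flag names assume the `az networkcloud` CLI extension; confirm with `--help`):

```bash
# Remove the BMM from workload scheduling, evacuating existing workloads
az networkcloud baremetalmachine cordon \
  --evacuate "True" \
  --name "$BMM_NAME" \
  --resource-group "$MRG_NAME"

# Then power it off ahead of the physical repair
az networkcloud baremetalmachine power-off \
  --name "$BMM_NAME" \
  --resource-group "$MRG_NAME"
```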
