|
| 1 | +--- |
| 2 | +title: Best practices for Bare Metal Machine operations |
| 3 | +description: Steps to take before executing any Bare Metal Machine replace or reimage action, highlighting essential prerequisites and common pitfalls to avoid. |
| 4 | +ms.date: 03/25/2025 |
| 5 | +ms.topic: how-to |
| 6 | +ms.service: azure-operator-nexus |
| 7 | +ms.custom: template-how-to, best-practices |
| 8 | +author: omarrivera |
| 9 | +ms.author: omarrivera |
| 10 | +ms.reviewer: bartpinto |
| 11 | +--- |
| 12 | + |
| 13 | +# Best practices for Bare Metal Machine operations |
| 14 | + |
| 15 | +This article provides best practices for Bare Metal Machine (BMM) lifecycle management operations. |
| 16 | +The aim is to highlight common pitfalls and essential prerequisites. |
| 17 | + |
| 18 | +## Read important disclaimers |
| 19 | + |
| 20 | +[!INCLUDE [caution-affect-cluster-integrity](./includes/baremetal-machines/caution-affect-cluster-integrity.md)] |
| 21 | + |
| 22 | +[!INCLUDE [important-donot-disrupt-kcpnodes](./includes/baremetal-machines/important-donot-disrupt-kcpnodes.md)] |
| 23 | + |
| 24 | +[!INCLUDE [prerequisites-azure-cli-bare-metal-machine-actions](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)] |
| 25 | + |
| 26 | +## Identify the best-fit corrective approach |
| 27 | + |
| 28 | +Troubleshooting technical problems requires a systematic approach. |
| 29 | +One effective method is to start with the least invasive solution and, if necessary, work your way up to more complex and potentially disruptive measures. |
| 30 | +These troubleshooting methods might not be effective in every scenario, and other factors might call for a different approach. |
| 31 | +For this reason, it's essential to understand the available options well so that you can determine the most appropriate corrective action when a Bare Metal Machine fails. |
| 32 | + |
| 33 | +### General advice while troubleshooting |
| 34 | + |
| 35 | +- Familiarize yourself with the relevant documentation, including troubleshooting guides and how-to articles. |
| 36 | + Always refer to the latest documentation to stay informed about best practices and updates. |
| 37 | +- Avoid repeated failed operations by first attempting to identify the root cause of the failure before retrying the operation. |
| 38 | + Perform retry attempts in incremental steps to isolate and address specific issues. |
| 39 | +- Wait for Az CLI commands to run to completion and validate the state of the Bare Metal Machine resource before executing other steps. |
| 40 | +- Verify that the firmware and software versions are up-to-date before a new greenfield deployment to prevent compatibility issues between hardware and software versions. |
| 41 | + For more information about firmware compatibility, see [Operator Nexus Platform Prerequisites](./howto-platform-prerequisites.md). |
| 42 | +- Check that the iDRAC credentials are correct and that the Bare Metal Machine is powered on. |
| 43 | + |
| 44 | +#### Look at general network connectivity health |
| 45 | + |
| 46 | +Ensure stable network connectivity to avoid interruptions during the process. |
| 47 | +Ignoring network stability could make operations fail to complete successfully and leave a Bare Metal Machine in an error or degraded state. |
| 48 | + |
| 49 | +A quick look at the Cluster resource's `clusterConnectionStatus` serves as one indicator of network connectivity health. |
| 50 | + |
| 51 | +```azurecli |
| 52 | +az networkcloud cluster show \ |
| 53 | + -g $CLUSTER_MRG \ |
| 54 | + -n $CLUSTER_NAME \ |
| 55 | + --subscription $SUBSCRIPTION \ |
| 56 | + --query "clusterConnectionStatus" \ |
| 57 | + -o table |
| 58 | + |
| 59 | +Result |
| 60 | +--------- |
| 61 | +Connected |
| 62 | +``` |
| 63 | + |
| 64 | +Take a deeper look at network health by checking the statuses, alerts, and metrics of the NetworkFabric resources. |
| 65 | +See related articles: |
| 66 | + - [How to monitor interface In and Out packet rate for network fabric devices] |
| 67 | + - [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric] |
| 68 | + |
| 69 | +Evaluate any Bare Metal Machine warnings or degraded conditions, which could indicate the need to resolve hardware, network, or server configuration problems. |
| 70 | +For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status]. |
| 71 | + |
| 72 | +#### Determine if firmware update jobs are running |
| 73 | + |
| 74 | +Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` or `reimage` operation. |
| 75 | +Interrupting an ongoing firmware upgrade can leave the Bare Metal Machine in an inconsistent state. |
| 76 | +To determine whether firmware upgrade jobs are running, view the job queue in the iDRAC GUI or run `racadm jobqueue view`. |
| 77 | + |
| 78 | +```azurecli |
| 79 | +az networkcloud baremetalmachine run-read-command \ |
| 80 | + -g $CLUSTER_MRG \ |
| 81 | + -n $BMM_NAME \ |
| 82 | + --subscription $SUBSCRIPTION \ |
| 83 | + --limit-time-seconds 60 \ |
| 84 | + --commands "[{command:'nc-toolbox nc-toolbox-runread racadm jobqueue view'}]" \ |
| 85 | + --output-directory . |
| 86 | +``` |
| 87 | + |
| 88 | +Here's an example output from the `racadm jobqueue view` command that shows a `Firmware Update` job in progress. |
| 89 | +``` |
| 90 | +[Job ID=JID_833540920066] |
| 91 | +Job Name=Firmware Update: iDRAC |
| 92 | +Status=Downloading |
| 93 | +Start Time= [Not Applicable] |
| 94 | +Expiration Time= [Not Applicable] |
| 95 | +Message= [RED001: Job in progress.] |
| 96 | +Percent Complete= [50%] |
| 97 | +``` |
| 98 | + |
| 99 | +Here's an example output from the `racadm jobqueue view` command that shows only completed jobs, which is the expected result when no firmware upgrade is running. |
| 100 | +``` |
| 101 | +-------------------------JOB QUEUE------------------------ |
| 102 | +[Job ID=JID_429400224349] |
| 103 | +Job Name=Configure: Import Server Configuration Profile |
| 104 | +Status=Completed |
| 105 | +Scheduled Start Time=[Not Applicable] |
| 106 | +Expiration Time=[Not Applicable] |
| 107 | +Actual Start Time=[Tue, 25 Mar 2025 17:00:22] |
| 108 | +Actual Completion Time=[Tue, 25 Mar 2025 17:00:32] |
| 109 | +Message=[SYS053: Successfully imported and applied Server Configuration Profile.] |
| 110 | +Percent Complete=[100] |
| 111 | +---------------------------------------------------------- |
| 112 | +[Job ID=JID_429400338344] |
| 113 | +Job Name=Export: Server Configuration Profile |
| 114 | +Status=Completed |
| 115 | +Scheduled Start Time=[Not Applicable] |
| 116 | +Expiration Time=[Not Applicable] |
| 117 | +Actual Start Time=[Tue, 25 Mar 2025 17:00:33] |
| 118 | +Actual Completion Time=[Tue, 25 Mar 2025 17:00:58] |
| 119 | +Message=[SYS043: Successfully exported Server Configuration Profile] |
| 120 | +Percent Complete=[100] |
| 121 | +``` |
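The check described in this section can be scripted once the `racadm jobqueue view` output is saved to a file. The following Python sketch is illustrative only: the function names are hypothetical, the field names are taken from the sample outputs above, and treating any status other than `Completed` or `Failed` as still active is an assumption, not documented iDRAC behavior.

```python
# Sketch: flag firmware update jobs that still look active in the text
# produced by `racadm jobqueue view`. Helper names are hypothetical.

def parse_jobs(text):
    """Split jobqueue output into one dict per job, keyed by field name."""
    jobs = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[Job ID="):
            # A new job record starts; keep its ID without the brackets.
            current = {"Job ID": line[len("[Job ID="):].rstrip("]")}
            jobs.append(current)
        elif current is not None and "=" in line:
            # "Key=[Value]" or "Key= [Value]"; separator lines have no "=".
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip().strip("[]").strip()
    return jobs


def active_firmware_jobs(text):
    """Return firmware update jobs whose status is not finished.

    Assumption: statuses starting with "Completed" or "Failed" are finished;
    anything else (for example "Downloading") is treated as still active.
    """
    finished_prefixes = ("completed", "failed")
    return [
        job for job in parse_jobs(text)
        if job.get("Job Name", "").lower().startswith("firmware update")
        and not job.get("Status", "").lower().startswith(finished_prefixes)
    ]


SAMPLE = """\
[Job ID=JID_833540920066]
Job Name=Firmware Update: iDRAC
Status=Downloading
Percent Complete= [50%]
----------------------------------------------------------
[Job ID=JID_429400338344]
Job Name=Export: Server Configuration Profile
Status=Completed
Percent Complete=[100]
"""

for job in active_firmware_jobs(SAMPLE):
    # prints: JID_833540920066 Downloading 50%
    print(job["Job ID"], job["Status"], job["Percent Complete"])
```

An empty result suggests that no firmware upgrade job is in progress. A non-empty result is a reason to wait before starting a `replace` or `reimage` operation.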
| 122 | + |
| 123 | +## Best practices for a Bare Metal Machine reimage |
| 124 | + |
| 125 | +The Bare Metal Machine (BMM) `reimage` action is explained in [Bare Metal Machine Lifecycle Management Commands], and scenario procedures are described in [Troubleshoot Azure Operator Nexus Server Problems]. |
| 126 | + |
| 127 | +[!INCLUDE [warning-do-not-run-multiple-actions](./includes/baremetal-machines/warning-do-not-run-multiple-actions.md)] |
| 128 | + |
| 129 | +You can restore the operating system runtime version on a Bare Metal Machine by executing the `reimage` operation. |
| 130 | +A Bare Metal Machine `reimage` can be both time-saving and reliable for resolving issues or restoring the operating system software to a known-good state. |
| 131 | +This process **redeploys** the runtime image on the target Bare Metal Machine and executes the steps to rejoin the cluster with the same identifiers. |
| 132 | +The `reimage` action is designed to interact with the operating system partition, leaving the local storage of virtual machines unchanged. |
| 133 | + |
| 134 | +> [!IMPORTANT] |
| 135 | +> Avoid manual or automated changes to the Bare Metal Machine's file system (also known as "break glass"). |
| 136 | +> After such changes, the `reimage` action is required to restore Microsoft support. Any changes made to the Bare Metal Machine are lost when the node is restored to its expected state. |
| 137 | + |
| 138 | +### Preconditions and validations before a Bare Metal Machine reimage |
| 139 | + |
| 140 | +Before initiating any `reimage` operation, ensure the following preconditions are met: |
| 141 | + |
| 142 | +- Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`. |
| 143 | +- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning]. |
| 144 | +- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation. |
| 145 | + For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status]. |
| 146 | +- If the Bare Metal Machine reports a failed state with the reason of hardware validation (seen in the Bare Metal Machine `Detailed Status` and `Detailed Status Message` fields), then the Bare Metal Machine needs a `replace` instead. |
| 147 | + See [Best practices for a Bare Metal Machine replace](#best-practices-for-a-bare-metal-machine-replace). |
| 148 | +- Validate that there are no running firmware upgrade jobs. |
| 149 | + Follow steps in section [Determine if Firmware Update Jobs are Running](#determine-if-firmware-update-jobs-are-running). |
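The preconditions above can be combined into a simple pre-flight check. The following Python sketch is a minimal illustration under stated assumptions: the input dictionary stands for the JSON returned by `az networkcloud baremetalmachine show -o json`, the keys `cordonStatus` and `detailedStatusMessage` and the value `Cordoned` are assumed serializations of the fields named in this article, and `reimage_blockers` is a hypothetical helper name.

```python
# Sketch: collect reasons not to start a reimage yet.
# Key names and values are assumptions about the resource JSON.

def reimage_blockers(bmm, firmware_job_active):
    """Return a list of blocking reasons; an empty list means none were found."""
    blockers = []

    # A cordoned machine isn't proof that workloads were evacuated,
    # but an uncordoned machine was certainly not drained.
    if bmm.get("cordonStatus") != "Cordoned":
        blockers.append("workloads not drained: cordon with evacuate=True first")

    # A hardware validation failure calls for replace, not reimage.
    message = (bmm.get("detailedStatusMessage") or "").lower()
    if "hardware validation" in message:
        blockers.append("hardware validation failure: use replace instead")

    # Interrupting a firmware upgrade can leave the machine inconsistent.
    if firmware_job_active:
        blockers.append("firmware upgrade job still running")

    return blockers


example = {
    "cordonStatus": "Uncordoned",
    "detailedStatusMessage": "Hardware validation failed for the machine.",
}
for reason in reimage_blockers(example, firmware_job_active=False):
    print(reason)
```

The check is deliberately conservative: it only reports reasons to stop and doesn't replace reviewing warnings and degraded conditions in the linked troubleshooting articles.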
| 150 | + |
| 151 | +## Best practices for a Bare Metal Machine replace |
| 152 | + |
| 153 | +The Bare Metal Machine `replace` action is explained in [Bare Metal Machine Lifecycle Management Commands], and scenario procedures are described in [Troubleshoot Azure Operator Nexus Server Problems]. |
| 154 | + |
| 155 | +[!INCLUDE [warning-do-not-run-multiple-actions](./includes/baremetal-machines/warning-do-not-run-multiple-actions.md)] |
| 156 | + |
| 157 | +Hardware failures are a normal occurrence over the life of a server. |
| 158 | +Component replacements might be necessary to restore functionality and ensure continued operation. |
| 159 | +The `replace` operation must be executed after any hardware maintenance/repair event. |
| 160 | +When more than one hardware component fails on the server, make the necessary repairs for **all** components before executing a Bare Metal Machine `replace` operation. |
| 161 | + |
| 162 | +> [!IMPORTANT] |
| 163 | +> With the `2024-07-01` GA API version, the RAID controller is reset during Bare Metal Machine `replace`, wiping all data from the server's virtual disks. |
| 164 | +> Baseboard Management Controller (BMC) virtual disk alerts triggered during Bare Metal Machine `replace` can be ignored unless there are additional physical disk or RAID controller alerts. |
| 165 | + |
| 166 | +### Preconditions and validations before a Bare Metal Machine replace |
| 167 | + |
| 168 | +Before initiating any `replace` operation, ensure the following preconditions are met: |
| 169 | + |
| 170 | +- Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`. |
| 171 | +- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning]. |
| 172 | +- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation. |
| 173 | + For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status]. |
| 174 | +- Validate that there are no running firmware upgrade jobs. |
| 175 | + Follow steps in section [Determine if Firmware Update Jobs are Running](#determine-if-firmware-update-jobs-are-running). |
| 176 | + |
| 177 | +### Resolve hardware validation issues |
| 178 | + |
| 179 | +When a Bare Metal Machine is marked with failed hardware validation, it might indicate that physical repairs are needed. |
| 180 | +It's crucial to identify and address these repairs before performing a Bare Metal Machine `replace`. |
| 181 | +A hardware validation process is invoked as part of the `replace` operation to ensure the physical host's integrity before deploying the OS image. |
| 182 | +The Bare Metal Machine can't provision successfully while it continues to have hardware validation failures. |
| 183 | +As a result, the Bare Metal Machine fails to complete the necessary setup steps to become operational and join the cluster. |
| 184 | +Ensure **all hardware validation issues** are cleared before the next `replace` action. |
| 185 | + |
| 186 | +To understand hardware validation results, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md). |
| 187 | + |
| 188 | +### Bare Metal Machine replace isn't required |
| 189 | + |
| 190 | +Some repairs don't require a Bare Metal Machine `replace` to be executed. |
| 191 | +For example, a `replace` operation isn't required when you're performing a physical hot-swappable power supply repair because the Bare Metal Machine host continues to function normally after the repair. |
| 192 | +However, if the Bare Metal Machine failed hardware validation, a Bare Metal Machine `replace` is required even after the hot-swappable repairs are done. |
| 193 | +Examine the Bare Metal Machine status messages to determine if hardware validation failures or other degraded conditions are present. For more information, see: |
| 194 | + - [Troubleshoot Degraded Status Errors on Bare Metal Machines] |
| 195 | + - [Troubleshoot Bare Metal Machine Warning Status] |
| 196 | + - [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md) |
| 197 | + |
| 198 | +Other repairs that might not require a `replace` include: |
| 199 | + |
| 200 | +- CPU |
| 201 | +- Dual In-Line Memory Module (DIMM) |
| 202 | +- Fan |
| 203 | +- Expansion board riser |
| 204 | +- Transceiver |
| 205 | +- Ethernet or fiber cable replacement |
| 206 | + |
| 207 | +### Bare Metal Machine replace is required |
| 208 | + |
| 209 | +After components such as the motherboard or a Network Interface Card (NIC) are replaced, the Bare Metal Machine MAC address changes. |
| 210 | +However, the iDRAC IP address and hostname remain the same. |
| 211 | +Because of the MAC address change, a Bare Metal Machine `replace` is required. |
| 212 | + |
| 213 | +A `replace` operation **is required** to bring the Bare Metal Machine back into service when you're performing the following physical repairs: |
| 214 | + |
| 215 | +- Backplane |
| 216 | +- System board |
| 217 | +- SSD disk |
| 218 | +- PERC/RAID adapter |
| 219 | +- Mellanox Network Interface Card (NIC) |
| 220 | +- Broadcom embedded NIC |
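The two repair lists in this article can be captured as a small lookup, for example in a runbook script. The following Python sketch is illustrative only: the component names mirror the lists in this article, the function name `replace_needed` is hypothetical, and the entries that the article hedges with "might" are treated here as not requiring a `replace`.

```python
# Sketch: decide whether a physical repair calls for a BMM replace.
# Component names mirror the lists in this article; names are lowercase.

REPLACE_NOT_REQUIRED = {
    "hot-swappable power supply", "cpu", "dimm", "fan",
    "expansion board riser", "transceiver", "ethernet or fiber cable",
}
REPLACE_REQUIRED = {
    "backplane", "system board", "ssd disk", "perc/raid adapter",
    "mellanox nic", "broadcom embedded nic",
}


def replace_needed(component, hardware_validation_failed=False):
    """Return True when a replace is needed after repairing `component`."""
    # A failed hardware validation always requires replace,
    # even if the repaired part is hot-swappable.
    if hardware_validation_failed:
        return True
    name = component.strip().lower()
    if name in REPLACE_REQUIRED:
        return True
    if name in REPLACE_NOT_REQUIRED:
        return False
    # Unknown parts are not guessed at; check the documentation instead.
    raise ValueError(f"unknown component: {component!r}")


print(replace_needed("System board"))                          # True
print(replace_needed("Fan"))                                   # False
print(replace_needed("Fan", hardware_validation_failed=True))  # True
```

Raising an error for unlisted components is a deliberate choice in this sketch, so that a repair outside both lists prompts a manual review rather than a silent default.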
| 221 | + |
| 222 | +### Check statuses after a Bare Metal Machine replace operation |
| 223 | + |
| 224 | +After the Bare Metal Machine `replace` operation completes successfully, ensure that the `provisioningState` is `Succeeded` and the `readyState` is `True`. |
| 225 | +Only then, proceed to execute the `uncordon` operation to have the Bare Metal Machine rejoin the workload schedulable node pool. |
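The status check before `uncordon` can be expressed as a single predicate. The following Python sketch assumes that the input dictionary is the JSON returned by `az networkcloud baremetalmachine show -o json` and that the provisioning status is serialized under the key `provisioningState`. Both are assumptions, so the key is a parameter, and `ready_to_uncordon` is a hypothetical helper name.

```python
# Sketch: gate the uncordon step on the two status fields named above.

def ready_to_uncordon(bmm, state_key="provisioningState"):
    """Return True only when provisioning succeeded and the machine is ready."""
    provisioned = bmm.get(state_key) == "Succeeded"
    # readyState may arrive as the string "True" or as a boolean.
    ready = str(bmm.get("readyState")).lower() == "true"
    return provisioned and ready


print(ready_to_uncordon({"provisioningState": "Succeeded", "readyState": "True"}))   # True
print(ready_to_uncordon({"provisioningState": "Succeeded", "readyState": "False"}))  # False
```

Missing fields evaluate to not ready, so an incomplete or failed query never results in an `uncordon`.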
| 226 | + |
| 227 | +## Request support |
| 228 | + |
| 229 | +If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade). |
| 230 | +For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/). |
| 231 | + |
| 232 | +## References |
| 233 | + |
| 234 | +- [Bare Metal Machine Lifecycle Management Commands] |
| 235 | +- [Run emergency bare metal actions outside of Azure using nexusctl] |
| 236 | +- [Troubleshoot Azure Operator Nexus Server Problems] |
| 237 | +- [Troubleshoot Bare Metal Machine Provisioning] |
| 238 | +- [Troubleshoot Bare Metal Machine Warning Status] |
| 239 | +- [Troubleshoot Degraded Status Errors on Bare Metal Machines] |
| 240 | +- [Troubleshoot Hardware Validation Failure] |
| 241 | + |
| 242 | +[Bare Metal Machine Lifecycle Management Commands]: ./howto-baremetal-functions.md |
| 243 | +[Run emergency bare metal actions outside of Azure using nexusctl]: ./howto-baremetal-nexusctl.md |
| 244 | +[Troubleshoot Azure Operator Nexus Server Problems]: ./troubleshoot-reboot-reimage-replace.md |
| 245 | +[Troubleshoot Bare Metal Machine Provisioning]: ./troubleshoot-bare-metal-machine-provisioning.md |
| 246 | +[Troubleshoot Bare Metal Machine Warning Status]: ./troubleshoot-bare-metal-machine-warning.md |
| 247 | +[Troubleshoot Degraded Status Errors on Bare Metal Machines]: ./troubleshoot-bare-metal-machine-degraded.md |
| 248 | +[Troubleshoot Hardware Validation Failure]: ./troubleshoot-hardware-validation-failure.md |
| 249 | +[How to monitor interface In and Out packet rate for network fabric devices]: ./howto-monitor-interface-packet-rate.md |
| 250 | +[How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric]: ./howto-configure-diagnostic-settings-monitor-configuration-differences.md |