| 1 | +--- |
| 2 | +title: Best Practices for BareMetal Machine Operations |
| 3 | +description: Steps to take before executing any BMM replace or reimage action. Highlights essential prerequisites and common pitfalls to avoid. |
| 4 | +ms.date: 03/25/2025 |
| 5 | +ms.topic: how-to |
| 6 | +ms.service: azure-operator-nexus |
| 7 | +ms.custom: template-how-to, best-practices |
| 8 | +author: omarrivera |
| 9 | +ms.author: omarrivera |
| 10 | +ms.reviewer: bartpinto |
| 11 | +--- |
| 12 | + |
| 13 | +# Best Practices for BareMetal Machine Operations |
| 14 | + |
| 15 | +This article provides best practices for BareMetal Machine (BMM) lifecycle management operations. |
| 16 | +The aim is to highlight common pitfalls and essential prerequisites. |
| 17 | + |
| 18 | +## Read Important Disclaimers |
| 19 | + |
| 20 | +[!INCLUDE [caution-affect-cluster-integrity](./includes/baremetal-machines/caution-affect-cluster-integrity.md)] |
| 21 | + |
| 22 | +[!INCLUDE [important-donot-disrupt-kcpnodes](./includes/baremetal-machines/important-donot-disrupt-kcpnodes.md)] |
| 23 | + |
| 24 | +[!INCLUDE [prerequisites-azcli-bmm-actions](./includes/baremetal-machines/prerequisites-azcli-bmm-actions.md)] |
| 25 | + |
| 26 | +## Identify the Best-fit Corrective Approach |
| 27 | + |
| 28 | +Troubleshooting technical problems requires a systematic approach. |
| 29 | +One effective method is to start with the least invasive solution and, if necessary, work your way up to more complex and drastic measures. |
| 30 | +Keep in mind that these troubleshooting methods aren't effective in every scenario, and other factors might call for a different approach. |
| 31 | +For this reason, it's essential to understand the available options well when troubleshooting BMM failures so that you can determine the most appropriate corrective action. |
| 32 | + |
| 33 | +### General Advice while Troubleshooting |
| 34 | + |
| 35 | +- Familiarize yourself with the relevant documentation, including troubleshooting guides and how-to articles. |
| 36 | + Always refer to the latest documentation to stay informed about best practices and updates. |
| 37 | +- Before retrying operations, attempt to identify the root cause of the failure to avoid repeating the same mistake. |
| 38 | + Perform retry attempts in incremental steps to isolate and address specific issues. |
| 39 | +- Wait for Az CLI commands to run to completion and validate the state of the BMM resource before executing other steps. |
| 40 | +- Keep an eye on system logs to detect any anomalies during the retry process. |
| 41 | +- Verify that the firmware and software versions are up-to-date to ensure compatibility between hardware and software. |
| 42 | +- Always back up critical data to prevent data loss during the recovery or replacement process. |
| 43 | +- Ensure stable network connectivity to avoid interruptions during the process. |
| 44 | + Validate that there are no active network stability issues with the network fabric. |
| 45 | + Ignoring network stability could make operations fail to complete successfully and leave a BMM in an unknown state. |
| 46 | + |
| 47 | +## Best Practices for BMM Reimage |
| 48 | + |
| 49 | +The BMM `reimage` action is explained in [BMM Lifecycle Management Commands], and scenario procedures are described in [Troubleshoot Azure Operator Nexus Server Problems]. |
| 50 | + |
| 51 | +[!INCLUDE [warning-donot-run-multiple-actions](./includes/baremetal-machines/warning-donot-run-multiple-actions.md)] |
| 52 | + |
| 53 | +You can restore the operating system runtime version on a BMM by executing the `reimage` operation. |
| 54 | +A BMM `reimage` can be both time-saving and reliable for resolving issues or restoring the operating system software to a known-good state. |
| 55 | +This process **redeploys** the runtime image on the target BMM and executes the steps to rejoin the cluster with the same identifiers. |
| 56 | +The `reimage` action doesn't affect the tenant workload files on the BMM under normal circumstances. |
| 57 | + |
| 58 | +> [!IMPORTANT] |
| 59 | +> Avoid making write or edit changes on the node through BMM access. |
| 60 | +> If such changes are made, a `reimage` action is required to restore Microsoft support, and any changes made to the BMM are lost when the node is restored to its expected state. |
| 61 | + |
| 62 | +### Preconditions and Validations Before BMM Reimage |
| 63 | + |
| 64 | +Before initiating any `reimage` operation, ensure the following preconditions are met: |
| 65 | + |
| 66 | +- Ensure the BMM has `poweredState` set to `On` and `readyState` set to `True`. |
| 67 | +- Make sure the BMM's workloads are drained using the [`cordon`](#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`. |
| 68 | +- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning]. |
| 69 | +- Evaluate any BMM warnings or degraded conditions, which could indicate the need to resolve hardware, network, or server configuration problems prior to a `reimage` operation. |
| 70 | + See the articles [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status] for more details. |
| 71 | +- Resolve any BMM hardware validation failures. |
| 72 | + Read the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md) to understand hardware validation results. |
| 73 | +- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `reimage` operation. |
| 74 | + While a firmware upgrade is running, the BMM has `provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state. |
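The checklist above can be sketched as a small pre-flight function. This is only an illustration under stated assumptions: the input is a hand-built dictionary, not the actual output shape of any Azure CLI command. The keys `poweredState`, `readyState`, and `provisioningStatus` follow the names used in this article, while `workloadsDrained` and `hardwareValidationFailed` are hypothetical flags that an operator would set after doing those checks manually. The sketch also assumes that a `Preparing` status means a firmware upgrade may still be running.

```python
from typing import Mapping


def reimage_blockers(bmm: Mapping[str, object]) -> list[str]:
    """Return reasons to hold off on a reimage; an empty list means no blocker was found."""
    blockers: list[str] = []
    # Keys named as in the checklist above.
    if bmm.get("poweredState") != "On":
        blockers.append("poweredState is not On")
    if str(bmm.get("readyState")) != "True":
        blockers.append("readyState is not True")
    if bmm.get("provisioningStatus") == "Preparing":
        blockers.append("provisioningStatus is Preparing (possible firmware upgrade)")
    # Hypothetical keys, set by the operator after manual verification.
    if not bmm.get("workloadsDrained", False):
        blockers.append("workloads are not drained (cordon with evacuate=True)")
    if bmm.get("hardwareValidationFailed", False):
        blockers.append("hardware validation failures are unresolved")
    return blockers


# Example snapshot; the values are made up for illustration.
snapshot = {
    "poweredState": "On",
    "readyState": "True",
    "provisioningStatus": "Succeeded",
    "workloadsDrained": False,
}
for reason in reimage_blockers(snapshot):
    print("Blocked:", reason)
```

Returning a list of reasons rather than a single boolean keeps every unmet precondition visible at once, which fits the advice to identify root causes before retrying.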
| 75 | + |
| 76 | +## Best Practices for BMM Replace |
| 77 | + |
| 78 | +The BMM `replace` action is explained in [BMM Lifecycle Management Commands], and scenario procedures are described in [Troubleshoot Azure Operator Nexus Server Problems]. |
| 79 | + |
| 80 | +[!INCLUDE [warning-donot-run-multiple-actions](./includes/baremetal-machines/warning-donot-run-multiple-actions.md)] |
| 81 | + |
| 82 | +Hardware failures are an expected occurrence over the natural lifecycle of a server. |
| 83 | +Component replacements may be necessary to restore functionality and ensure continued operation. |
| 84 | +In cases where one or more hardware components fail on the server, it's necessary to perform a BMM `replace` operation. |
| 85 | +The `replace` operation should be executed after any hardware maintenance event. If there are multiple maintenance events, run a separate `replace` operation for each one. |
| 86 | + |
| 87 | +> [!IMPORTANT] |
| 88 | +> With the `2024-07-01` GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. |
| 89 | +> Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk or RAID controller alerts. |
| 90 | + |
| 91 | +### Resolve Hardware Validation Issues |
| 92 | + |
| 93 | +When a BMM is marked with failed hardware validation, it indicates that physical repairs are needed. It is crucial to identify and address these repairs before performing a BMM `replace`. |
| 94 | +A hardware validation process is invoked, as part of the `replace` operation, to ensure the physical host's integrity prior to deploying the OS image. |
| 95 | +If hardware validation failures persist, the BMM doesn't provision successfully: it fails to complete the setup steps needed to become operational and doesn't join the cluster. |
| 96 | +Ensure **all hardware validation issues** are cleared prior to the next `replace` action. |
| 97 | + |
| 98 | +Read through the [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md) article to understand hardware validation results. |
| 99 | + |
| 100 | +### Preconditions and Validations Before BMM Replace |
| 101 | + |
| 102 | +Before initiating any `replace` operation, ensure the following preconditions are met: |
| 103 | + |
| 104 | +- Ensure the BMM has `poweredState` set to `On` and `readyState` set to `True`. |
| 105 | +- Make sure the BMM's workloads are drained using the [`cordon`](#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`. |
| 106 | +- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning]. |
| 107 | +- Evaluate any BMM warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems prior to a `replace` operation. |
| 108 | + See the articles [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status] for more details. |
| 109 | +- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` operation. |
| 110 | + While a firmware upgrade is running, the BMM has `provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state. |
| 111 | + |
| 112 | +### BMM Replace isn't Required |
| 113 | + |
| 114 | +A `replace` operation isn't required when you're performing a physical hot-swappable power supply repair, because the BMM host continues to function normally after the repair. |
| 115 | + |
| 116 | +### BMM Replace is Optional but Recommended |
| 117 | + |
| 118 | +While not strictly necessary to bring the BMM back into service, we recommend doing a `replace` operation when you're performing the following physical repairs: |
| 119 | + |
| 120 | +- CPU |
| 121 | +- Dual In-Line Memory Module (DIMM) |
| 122 | +- Fan |
| 123 | +- Expansion board riser |
| 124 | +- Transceiver |
| 125 | +- Ethernet or fiber cable replacement |
| 126 | + |
| 127 | +### BMM Replace is Required |
| 128 | + |
| 129 | +A `replace` operation **is required** to bring the BMM back into service when you're performing the following physical repairs: |
| 130 | + |
| 131 | +- Backplane |
| 132 | +- System board |
| 133 | +- SSD disk |
| 134 | +- PERC/RAID adapter |
| 135 | +- Mellanox Network Interface Card (NIC) |
| 136 | +- Broadcom embedded NIC |
| 137 | + |
| 138 | +After you replace components such as the motherboard or a Network Interface Card (NIC), the MAC address of the BMM changes; however, the iDRAC IP address and hostname remain the same. |
| 139 | +Because the MAC address changes, a BMM `replace` is required. |
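The three repair categories above (not required, optional but recommended, required) can be captured as a simple lookup. This is a minimal sketch for runbook tooling, not part of any Azure Operator Nexus API: the function name, the lowercase component labels, and the `"unknown"` fallback are all assumptions introduced for illustration, and the component lists simply mirror the ones in this article.

```python
# Component labels mirror the lists in this article, lowercased for matching.
REPLACE_NOT_REQUIRED = {"hot swappable power supply"}
REPLACE_RECOMMENDED = {
    "cpu",
    "dimm",
    "fan",
    "expansion board riser",
    "transceiver",
    "ethernet or fiber cable",
}
REPLACE_REQUIRED = {
    "backplane",
    "system board",
    "ssd disk",
    "perc/raid adapter",
    "mellanox nic",
    "broadcom embedded nic",
}


def replace_requirement(component: str) -> str:
    """Classify a repaired component as 'not required', 'recommended', 'required', or 'unknown'."""
    key = component.strip().lower()
    if key in REPLACE_REQUIRED:
        return "required"
    if key in REPLACE_RECOMMENDED:
        return "recommended"
    if key in REPLACE_NOT_REQUIRED:
        return "not required"
    # Components not listed in the article are left for the operator to judge.
    return "unknown"


for part in ("System board", "DIMM", "Hot swappable power supply"):
    print(part, "->", replace_requirement(part))
```

Returning `"unknown"` for unlisted components, instead of guessing, leaves the decision with the operator and the referenced troubleshooting articles.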
| 140 | + |
| 141 | +### After BMM Replace |
| 142 | + |
| 143 | +After the BMM `replace` operation completes successfully, ensure that the `provisioningStatus` is `Succeeded` and the `readyState` is `True`. |
| 144 | +Only then run the `uncordon` operation so that the BMM rejoins the pool of nodes available for workload scheduling. |
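The gate described above, checking `provisioningStatus` and `readyState` before running `uncordon`, can be sketched as a polling helper. This is an illustration under assumptions: `fetch` is a hypothetical callable that returns the current BMM properties as a dictionary (for example, by wrapping whatever query your tooling already uses), and the attempt count and delay are arbitrary defaults rather than recommended values.

```python
import time
from typing import Callable, Mapping


def ready_to_uncordon(bmm: Mapping[str, object]) -> bool:
    """True only when provisioningStatus is Succeeded and readyState is True."""
    return (
        bmm.get("provisioningStatus") == "Succeeded"
        and str(bmm.get("readyState")) == "True"
    )


def wait_until_ready(
    fetch: Callable[[], Mapping[str, object]],
    attempts: int = 10,
    delay_seconds: float = 30.0,
) -> bool:
    """Poll the BMM state until it is ready to uncordon or the attempts run out."""
    for attempt in range(attempts):
        if ready_to_uncordon(fetch()):
            return True
        # Sleep between polls, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

If `wait_until_ready` returns `False`, the safer reading of the guidance above is to investigate the BMM state rather than proceed with `uncordon`.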
| 145 | + |
| 146 | +## Request Support |
| 147 | + |
| 148 | +If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade). |
| 149 | +For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/). |
| 150 | + |
| 151 | +## References |
| 152 | + |
| 153 | +- [BMM Lifecycle Management Commands] |
| 154 | +- [Run emergency bare metal actions outside of Azure using nexusctl] |
| 155 | +- [Troubleshoot Azure Operator Nexus Server Problems] |
| 156 | +- [Troubleshoot Bare Metal Machine Provisioning] |
| 157 | +- [Troubleshoot Bare Metal Machine Warning Status] |
| 158 | +- [Troubleshoot Degraded Status Errors on Bare Metal Machines] |
| 159 | +- [Troubleshoot Hardware Validation Failure] |
| 160 | + |
| 161 | +[BMM Lifecycle Management Commands]: ./howto-baremetal-functions.md |
| 162 | +[Run emergency bare metal actions outside of Azure using nexusctl]: ./howto-baremetal-nexusctl.md |
| 163 | +[Troubleshoot Azure Operator Nexus Server Problems]: ./troubleshoot-reboot-reimage-replace.md |
| 164 | +[Troubleshoot Bare Metal Machine Provisioning]: ./troubleshoot-bare-metal-machine-provisioning.md |
| 165 | +[Troubleshoot Bare Metal Machine Warning Status]: ./troubleshoot-bare-metal-machine-warning.md |
| 166 | +[Troubleshoot Degraded Status Errors on Bare Metal Machines]: ./troubleshoot-bare-metal-machine-degraded.md |
| 167 | +[Troubleshoot Hardware Validation Failure]: ./troubleshoot-hardware-validation-failure.md |