Commit da0f997

adds bestpractices for bmm operations article
- updates existing guides to ensure clarity and consistency - utilizes include statements to communicate common warnings,important,caution messages
1 parent b008cb6 commit da0f997

13 files changed

+1094
-773
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -233,6 +233,8 @@
233233
- name: Cluster
234234
expanded: false
235235
items:
236+
- name: Best Practices for BareMetal Machine Operations
237+
href: howto-baremetal-best-practices.md
236238
- name: BareMetal Actions
237239
expanded: false
238240
items:
Lines changed: 167 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,167 @@
1+
---
2+
title: Best Practices for BareMetal Machine Operations
3+
description: Steps to take before executing any BMM replace or reimage action. Highlights essential prerequisites and common pitfalls to avoid.
4+
ms.date: 03/25/2025
5+
ms.topic: how-to
6+
ms.service: azure-operator-nexus
7+
ms.custom: template-how-to, best-practices
8+
author: omarrivera
9+
ms.author: omarrivera
10+
ms.reviewer: bartpinto
11+
---
12+
13+
# Best Practices for BareMetal Machine Operations
14+
15+
This article provides best practices for BareMetal Machine (BMM) lifecycle management operations.
16+
The aim is to highlight common pitfalls and essential prerequisites.
17+
18+
## Read Important Disclaimers
19+
20+
[!INCLUDE [caution-affect-cluster-integrity](./includes/baremetal-machines/caution-affect-cluster-integrity.md)]
21+
22+
[!INCLUDE [important-donot-disrupt-kcpnodes](./includes/baremetal-machines/important-donot-disrupt-kcpnodes.md)]
23+
24+
[!INCLUDE [prerequisites-azcli-bmm-actions](./includes/baremetal-machines/prerequisites-azcli-bmm-actions.md)]
25+
26+
## Identify the Best-fit Corrective Approach
27+
28+
Troubleshooting technical problems requires a systematic approach.
29+
One effective method is to start with the least invasive solution and, if necessary, work your way up to more complex and drastic measures.
30+
Keep in mind that these troubleshooting methods might not be effective in every scenario, and other factors might call for a different approach.
31+
For this reason, it's essential to understand the available options well when troubleshooting BMM failures, so that you can determine the most appropriate corrective action.
32+
33+
### General Advice while Troubleshooting
34+
35+
- Familiarize yourself with the relevant documentation, including troubleshooting guides and how-to articles.
36+
Always refer to the latest documentation to stay informed about best practices and updates.
37+
- Before retrying operations, attempt to identify the root cause of the failure to avoid repeating the same mistake.
38+
Perform retry attempts in incremental steps to isolate and address specific issues.
39+
- Wait for Az CLI commands to run to completion and validate the state of the BMM resource before executing other steps.
40+
- Keep an eye on system logs to detect any anomalies during the retry process.
41+
- Verify that firmware and software versions are up to date and that the hardware and software versions are compatible with each other.
42+
- Always back up critical data to prevent data loss during the recovery or replacement process.
43+
- Ensure stable network connectivity to avoid interruptions during the process.
44+
Validate that there are no active network stability issues with the network fabric.
45+
Network instability can cause operations to fail and leave a BMM in an unknown state.
46+
47+
## Best Practices for BMM Reimage
48+
49+
The BMM `reimage` action is explained in [BMM Lifecycle Management Commands], and scenario procedures are described in [Troubleshoot Azure Operator Nexus Server Problems].
50+
51+
[!INCLUDE [warning-donot-run-multiple-actions](./includes/baremetal-machines/warning-donot-run-multiple-actions.md)]
52+
53+
You can restore the operating system runtime version on a BMM by executing the `reimage` operation.
54+
A BMM `reimage` can be both time-saving and reliable for resolving issues or restoring the operating system software to a known-good state.
55+
This process **redeploys** the runtime image on the target BMM and executes the steps to rejoin the cluster with the same identifiers.
56+
The `reimage` action doesn't affect the tenant workload files on the BMM under normal circumstances.
57+
58+
> [!IMPORTANT]
59+
> Avoid performing write or edit actions on the node via BMM access.
60+
> If such changes are made, a `reimage` action is required to restore Microsoft support, and any changes made to the BMM are lost when the node is restored to its expected state.
61+
62+
### Preconditions and Validations Before BMM Reimage
63+
64+
Before initiating any `reimage` operation, ensure the following preconditions are met:
65+
66+
- Ensure the BMM has `poweredState` set to `On` and `readyState` set to `True`.
67+
- Make sure the BMM's workloads are drained using the [`cordon`](#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
68+
- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
69+
- Evaluate any BMM warnings or degraded conditions, which could indicate the need to resolve hardware, network, or server configuration problems prior to a `reimage` operation.
70+
See the articles [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status] for more details.
71+
- Resolve any BMM hardware validation failures.
72+
Read the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md) to understand hardware validation results.
73+
- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `reimage` operation.
74+
While a firmware upgrade is in progress, the BMM has `provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
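
The preconditions above can be sketched as a small check that runs before any `reimage` request is sent. This is a minimal illustration, not an official tool: the function name is hypothetical, the dictionary keys (`poweredState`, `readyState`, `provisioningStatus`) simply follow the wording of this article and may differ from the JSON your CLI or API version returns, and the `workloads_drained` flag is assumed to come from your own verification of the `cordon` step.

```python
# Hypothetical helper: field names follow this article's wording and are an
# assumption about the shape of the BMM resource, not a documented schema.
def unmet_reimage_preconditions(bmm: dict, workloads_drained: bool) -> list:
    """Return a list of human-readable reasons why a reimage should not start."""
    problems = []
    if bmm.get("poweredState") != "On":
        problems.append("poweredState is not On")
    # readyState may arrive as a boolean or as the string "True".
    if str(bmm.get("readyState")) != "True":
        problems.append("readyState is not True")
    # The article associates the Preparing state with a running firmware upgrade.
    if bmm.get("provisioningStatus") == "Preparing":
        problems.append("provisioningStatus is Preparing (firmware upgrade may be running)")
    if not workloads_drained:
        problems.append("workloads are not drained (cordon with evacuate set to True)")
    return problems


if __name__ == "__main__":
    sample = {"poweredState": "On", "readyState": True, "provisioningStatus": "Succeeded"}
    print(unmet_reimage_preconditions(sample, workloads_drained=True))
```

An empty list means none of the checked preconditions is violated. The hardware validation and warning or degraded-status checks still need to be reviewed separately, because they aren't modeled here.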
75+
76+
## Best Practices for BMM Replace
77+
78+
The BMM `replace` action is explained in [BMM Lifecycle Management Commands], and scenario procedures are described in [Troubleshoot Azure Operator Nexus Server Problems].
79+
80+
[!INCLUDE [warning-donot-run-multiple-actions](./includes/baremetal-machines/warning-donot-run-multiple-actions.md)]
81+
82+
Hardware failures are an expected occurrence over the natural lifecycle of a server.
83+
Component replacements may be necessary to restore functionality and ensure continued operation.
84+
In cases where one or more hardware components fail on the server, it's necessary to perform a BMM `replace` operation.
85+
The `replace` operation should be executed after any hardware maintenance event. Handle multiple maintenance events as separate `replace` operations.
86+
87+
> [!IMPORTANT]
88+
> With the `2024-07-01` GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks.
89+
> Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk or RAID controller alerts.
90+
91+
### Resolve Hardware Validation Issues
92+
93+
When a BMM is marked with failed hardware validation, it indicates that physical repairs are needed. It is crucial to identify and address these repairs before performing a BMM `replace`.
94+
A hardware validation process is invoked, as part of the `replace` operation, to ensure the physical host's integrity prior to deploying the OS image.
95+
If the BMM continues to have hardware validation failures, the BMM will not provision successfully, meaning it will fail to complete the necessary setup steps to become operational, and will not join the cluster.
96+
Ensure **all hardware validation issues** are cleared prior to the next `replace` action.
97+
98+
Read through the [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md) article to understand hardware validation results.
99+
100+
### Preconditions and Validations Before BMM Replace
101+
102+
Before initiating any `replace` operation, ensure the following preconditions are met:
103+
104+
- Ensure the BMM has `poweredState` set to `On` and `readyState` set to `True`.
105+
- Make sure the BMM's workloads are drained using the [`cordon`](#make-a-bmm-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
106+
- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
107+
- Evaluate any BMM warnings or degraded conditions, which could indicate the need to resolve hardware, network, or server configuration problems prior to a `replace` operation.
108+
See the articles [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status] for more details.
109+
- Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` operation.
110+
While a firmware upgrade is in progress, the BMM has `provisioningStatus` in the `Preparing` state. Interrupting an ongoing firmware upgrade can leave the BMM in an inconsistent state.
111+
112+
### BMM Replace isn't Required
113+
114+
A `replace` operation isn't required when you're repairing a hot-swappable power supply, because the BMM host continues to function normally after the repair.
115+
116+
### BMM Replace is Optional but Recommended
117+
118+
While not strictly necessary to bring the BMM back into service, we recommend doing a `replace` operation when you're performing the following physical repairs:
119+
120+
- CPU
121+
- Dual In-Line Memory Module (DIMM)
122+
- Fan
123+
- Expansion board riser
124+
- Transceiver
125+
- Ethernet or fiber cable replacement
126+
127+
### BMM Replace is Required
128+
129+
A `replace` operation **is required** to bring the BMM back into service when you're performing the following physical repairs:
130+
131+
- Backplane
132+
- System board
133+
- SSD disk
134+
- PERC/RAID adapter
135+
- Mellanox Network Interface Card (NIC)
136+
- Broadcom embedded NIC
137+
138+
After you replace components such as the motherboard or a Network Interface Card (NIC), the MAC address of the BMM changes; however, the iDRAC IP address and hostname remain the same.
139+
Motherboard changes result in MAC address changes, requiring a BMM `replace`.
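
The three repair categories above can be summarized as a lookup table. The sketch below is only an illustration of that classification: the enum, the function name, and the lowercase component keys are hypothetical and paraphrase the lists in this article. Unknown components raise an error instead of guessing, so that the operator goes back to the documentation.

```python
from enum import Enum


class ReplaceNeed(Enum):
    NOT_REQUIRED = "not required"
    RECOMMENDED = "optional but recommended"
    REQUIRED = "required"


# Keys are illustrative, lowercase paraphrases of the components listed above.
_NEED_BY_COMPONENT = {
    "hot swappable power supply": ReplaceNeed.NOT_REQUIRED,
    "cpu": ReplaceNeed.RECOMMENDED,
    "dimm": ReplaceNeed.RECOMMENDED,
    "fan": ReplaceNeed.RECOMMENDED,
    "expansion board riser": ReplaceNeed.RECOMMENDED,
    "transceiver": ReplaceNeed.RECOMMENDED,
    "ethernet or fiber cable": ReplaceNeed.RECOMMENDED,
    "backplane": ReplaceNeed.REQUIRED,
    "system board": ReplaceNeed.REQUIRED,
    "ssd disk": ReplaceNeed.REQUIRED,
    "perc/raid adapter": ReplaceNeed.REQUIRED,
    "mellanox nic": ReplaceNeed.REQUIRED,
    "broadcom embedded nic": ReplaceNeed.REQUIRED,
}


def replace_need(component: str) -> ReplaceNeed:
    """Classify a repaired component; raise ValueError if it isn't listed."""
    key = component.strip().lower()
    if key not in _NEED_BY_COMPONENT:
        raise ValueError(f"unknown component {component!r}; consult the lists in this article")
    return _NEED_BY_COMPONENT[key]


if __name__ == "__main__":
    for name in ("Fan", "System board", "hot swappable power supply"):
        print(name, "->", replace_need(name).value)
```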
140+
141+
### After BMM Replace
142+
143+
After the BMM `replace` operation completes successfully, ensure that the `provisioningStatus` is `Succeeded` and the `readyState` is `True`.
144+
Only then run the `uncordon` operation so that the BMM rejoins the pool of nodes on which workloads can be scheduled.
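
The gate described above can be expressed as a single predicate. As a hedged sketch: the function name is hypothetical, and the keys `provisioningStatus` and `readyState` mirror this article's wording rather than a confirmed response schema, so adjust them to match the BMM resource JSON in your environment.

```python
# Hypothetical gate: key names are assumptions taken from this article's text.
def ready_to_uncordon(bmm: dict) -> bool:
    """True only when the replace finished and the BMM reports ready."""
    succeeded = bmm.get("provisioningStatus") == "Succeeded"
    # Accept either a boolean True or the string "True".
    ready = str(bmm.get("readyState")) == "True"
    return succeeded and ready


if __name__ == "__main__":
    print(ready_to_uncordon({"provisioningStatus": "Succeeded", "readyState": "True"}))
    print(ready_to_uncordon({"provisioningStatus": "Preparing", "readyState": "False"}))
```

Requiring both conditions keeps a BMM that is still provisioning, or that provisioned but isn't ready, out of the schedulable pool.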
145+
146+
## Request Support
147+
148+
If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
149+
For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
150+
151+
## References
152+
153+
- [BMM Lifecycle Management Commands]
154+
- [Run emergency bare metal actions outside of Azure using nexusctl]
155+
- [Troubleshoot Azure Operator Nexus Server Problems]
156+
- [Troubleshoot Bare Metal Machine Provisioning]
157+
- [Troubleshoot Bare Metal Machine Warning Status]
158+
- [Troubleshoot Degraded Status Errors on Bare Metal Machines]
159+
- [Troubleshoot Hardware Validation Failure]
160+
161+
[BMM Lifecycle Management Commands]: ./howto-baremetal-functions.md
162+
[Run emergency bare metal actions outside of Azure using nexusctl]: ./howto-baremetal-nexusctl.md
163+
[Troubleshoot Azure Operator Nexus Server Problems]: ./troubleshoot-reboot-reimage-replace.md
164+
[Troubleshoot Bare Metal Machine Provisioning]: ./troubleshoot-bare-metal-machine-provisioning.md
165+
[Troubleshoot Bare Metal Machine Warning Status]: ./troubleshoot-bare-metal-machine-warning.md
166+
[Troubleshoot Degraded Status Errors on Bare Metal Machines]: ./troubleshoot-bare-metal-machine-degraded.md
167+
[Troubleshoot Hardware Validation Failure]: ./troubleshoot-hardware-validation-failure.md