Commit df97b67

Merge pull request #297228 from g0r1v3r4/add-bmm-replace-best-practice
Adds best practices for BMM operations article
2 parents 7b38794 + ca2bcd9 commit df97b67

13 files changed: 1225 additions and 821 deletions

articles/operator-nexus/TOC.yml

Lines changed: 4 additions & 2 deletions
@@ -239,14 +239,16 @@
239239
- name: Cluster
240240
expanded: false
241241
items:
242+
- name: Best Practices for Bare Metal Machine Operations
243+
href: howto-bare-metal-best-practices.md
242244
- name: BareMetal Actions
243245
expanded: false
244246
items:
245247
- name: BareMetal BMM Access Setup
246248
href: howto-baremetal-bmm-ssh.md
247249
- name: BareMetal BMC Access Setup
248250
href: howto-baremetal-bmc-ssh.md
249-
- name: BareMetal Functions
251+
- name: Bare Metal Machine Platform Commands
250252
href: howto-baremetal-functions.md
251253
- name: BareMetal Run-Read Execution
252254
href: howto-baremetal-run-read.md
@@ -356,7 +358,7 @@
356358
- name: Cluster or BMM
357359
expanded: false
358360
items:
359-
- name: Troubleshoot Bare Metal Machine
361+
- name: Troubleshoot Bare Metal Server Problems
360362
href: troubleshoot-reboot-reimage-replace.md
361363
- name: Troubleshoot Bare Metal Machine Provisioning
362364
href: troubleshoot-bare-metal-machine-provisioning.md
Lines changed: 250 additions & 0 deletions
@@ -0,0 +1,250 @@
1+
---
2+
title: Best practices for Bare Metal Machine operations
3+
description: Steps to take before executing any Bare Metal Machine replace or reimage action, including essential prerequisites and common pitfalls to avoid.
4+
ms.date: 03/25/2025
5+
ms.topic: how-to
6+
ms.service: azure-operator-nexus
7+
ms.custom: template-how-to, best-practices
8+
author: omarrivera
9+
ms.author: omarrivera
10+
ms.reviewer: bartpinto
11+
---
12+
13+
# Best practices for Bare Metal Machine operations
14+
15+
This article provides best practices for Bare Metal Machine (BMM) lifecycle management operations.
16+
The aim is to highlight common pitfalls and essential prerequisites.
17+
18+
## Read important disclaimers
19+
20+
[!INCLUDE [caution-affect-cluster-integrity](./includes/baremetal-machines/caution-affect-cluster-integrity.md)]
21+
22+
[!INCLUDE [important-donot-disrupt-kcpnodes](./includes/baremetal-machines/important-donot-disrupt-kcpnodes.md)]
23+
24+
[!INCLUDE [prerequisites-azure-cli-bare-metal-machine-actions](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]
25+
26+
## Identify the best-fit corrective approach
27+
28+
Troubleshooting technical problems requires a systematic approach.
29+
One effective method is to start with the least invasive solution and, if necessary, work your way up to more complex and potentially disruptive measures.
30+
Keep in mind that these troubleshooting methods aren't effective in every scenario, and other factors might call for a different approach.
31+
For this reason, it's essential to understand the available options well so that you can determine the most appropriate corrective action when troubleshooting Bare Metal Machine failures.
32+
33+
### General advice while troubleshooting
34+
35+
- Familiarize yourself with the relevant documentation, including troubleshooting guides and how-to articles.
36+
Always refer to the latest documentation to stay informed about best practices and updates.
37+
- Avoid repeated failed operations by first attempting to identify the root cause of the failure before retrying the operation.
38+
Perform retry attempts in incremental steps to isolate and address specific issues.
39+
- Wait for Azure CLI commands to run to completion and validate the state of the Bare Metal Machine resource before executing other steps.
40+
- Verify that the firmware and software versions are up-to-date before a new greenfield deployment to prevent compatibility issues between hardware and software versions.
41+
For more information about firmware compatibility, see [Operator Nexus Platform Prerequisites](./howto-platform-prerequisites.md).
42+
- Check that the iDRAC credentials are correct and that the Bare Metal Machine is powered on.
43+
44+
#### Look at general network connectivity health
45+
46+
Ensure stable network connectivity to avoid interruptions during the process.
47+
Ignoring network stability could make operations fail to complete successfully and leave a Bare Metal Machine in an error or degraded state.
48+
49+
A quick look at the Cluster resource's `clusterConnectionStatus` property serves as one indicator of network connectivity health.
50+
51+
```azurecli
52+
az networkcloud cluster show \
53+
-g $CLUSTER_MRG \
54+
-n $CLUSTER_NAME \
55+
--subscription $SUBSCRIPTION \
56+
--query "clusterConnectionStatus" \
57+
-o table
58+
59+
Result
60+
---------
61+
Connected
62+
```
63+
64+
Take a deeper look at the NetworkFabric resources by checking their statuses, alerts, and metrics.
65+
See related articles:
66+
- [How to monitor interface In and Out packet rate for network fabric devices]
67+
- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric]
68+
69+
Evaluate any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems.
70+
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
71+
72+
#### Determine if firmware update jobs are running
73+
74+
Validate that there are no running firmware upgrade jobs through the BMC before initiating a `replace` or `reimage` operation.
75+
Interrupting an ongoing firmware upgrade can leave the Bare Metal Machine in an inconsistent state.
76+
You can view the job queue in the iDRAC GUI or run `racadm jobqueue view` to determine whether firmware upgrade jobs are running.
77+
78+
```azurecli
79+
az networkcloud baremetalmachine run-read-command \
80+
-g $CLUSTER_MRG \
81+
-n $BMM_NAME \
82+
--subscription $SUBSCRIPTION \
83+
--limit-time-seconds 60 \
84+
--commands "[{command:'nc-toolbox nc-toolbox-runread racadm jobqueue view'}]" \
85+
--output-directory .
86+
```
87+
88+
Here's example output from the `racadm jobqueue view` command that shows a `Firmware Update` job in progress.
89+
```
90+
[Job ID=JID_833540920066]
91+
Job Name=Firmware Update: iDRAC
92+
Status=Downloading
93+
Start Time= [Not Applicable]
94+
Expiration Time= [Not Applicable]
95+
Message= [RED001: Job in progress.]
96+
Percent Complete= [50%]
97+
```
98+
99+
Here's example output from the `racadm jobqueue view` command showing jobs that completed successfully.
100+
```
101+
-------------------------JOB QUEUE------------------------
102+
[Job ID=JID_429400224349]
103+
Job Name=Configure: Import Server Configuration Profile
104+
Status=Completed
105+
Scheduled Start Time=[Not Applicable]
106+
Expiration Time=[Not Applicable]
107+
Actual Start Time=[Tue, 25 Mar 2025 17:00:22]
108+
Actual Completion Time=[Tue, 25 Mar 2025 17:00:32]
109+
Message=[SYS053: Successfully imported and applied Server Configuration Profile.]
110+
Percent Complete=[100]
111+
----------------------------------------------------------
112+
[Job ID=JID_429400338344]
113+
Job Name=Export: Server Configuration Profile
114+
Status=Completed
115+
Scheduled Start Time=[Not Applicable]
116+
Expiration Time=[Not Applicable]
117+
Actual Start Time=[Tue, 25 Mar 2025 17:00:33]
118+
Actual Completion Time=[Tue, 25 Mar 2025 17:00:58]
119+
Message=[SYS043: Successfully exported Server Configuration Profile]
120+
Percent Complete=[100]
121+
```
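The check for unfinished jobs can also be scripted. The following is a minimal sketch, not part of the platform tooling: it assumes you extracted the `run-read-command` output and saved the `racadm jobqueue view` text to a local file, and the function name `pending_jobs` is hypothetical. It lists every job whose status isn't `Completed`, which includes failed jobs as well as running ones, so review each reported entry before you proceed.

```bash
# Hypothetical helper: list jobs in saved `racadm jobqueue view` output
# whose Status is anything other than "Completed".
# Argument 1: path to a text file that contains the job queue output.
# Returns 0 when at least one such job is found, 1 otherwise.
pending_jobs() {
  awk '
    /^\[Job ID=/ { id = $0; gsub(/^\[Job ID=|\]$/, "", id) }
    /^Job Name=/ { name = substr($0, 10) }
    /^Status=/ {
      status = substr($0, 8)
      if (status != "Completed") {
        print id ": " name " (" status ")"
        found = 1
      }
    }
    END { exit(found ? 0 : 1) }
  ' "$1"
}
```

For example, `pending_jobs jobqueue.txt && echo "Wait before running replace or reimage"` prints the unfinished jobs and the reminder only when the queue isn't clear. The file name `jobqueue.txt` is a placeholder.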
122+
123+
## Best practices for a Bare Metal Machine reimage
124+
125+
The Bare Metal Machine (BMM) `reimage` action is explained in [Bare Metal Machine Lifecycle Management Commands], and scenario procedures are described in [Troubleshoot Azure Operator Nexus Server Problems].
126+
127+
[!INCLUDE [warning-do-not-run-multiple-actions](./includes/baremetal-machines/warning-do-not-run-multiple-actions.md)]
128+
129+
You can restore the operating system runtime version on a Bare Metal Machine by executing the `reimage` operation.
130+
A Bare Metal Machine `reimage` can be both time-saving and reliable for resolving issues or restoring the operating system software to a known-good state.
131+
This process **redeploys** the runtime image on the target Bare Metal Machine and executes the steps to rejoin the cluster with the same identifiers.
132+
The `reimage` action is designed to interact with the operating system partition, leaving the local storage of virtual machines unchanged.
133+
134+
> [!IMPORTANT]
135+
> Avoid manual or automated changes to the Bare Metal Machine's file system (also known as "break glass").
136+
> After such changes, the `reimage` action is required to restore Microsoft support, and any changes made to the Bare Metal Machine are lost when the node is restored to its expected state.
137+
138+
### Preconditions and validations before a Bare Metal Machine reimage
139+
140+
Before initiating any `reimage` operation, ensure the following preconditions are met:
141+
142+
- Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
143+
- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
144+
- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
145+
For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
146+
- If the Bare Metal Machine reports a failed state with the reason of hardware validation (seen in the Bare Metal Machine `Detailed Status` and `Detailed Status Message` fields), then the Bare Metal Machine needs a `replace` instead.
147+
See [Best practices for a Bare Metal Machine replace](#best-practices-for-a-bare-metal-machine-replace).
148+
- Validate that there are no running firmware upgrade jobs.
149+
Follow steps in section [Determine if Firmware Update Jobs are Running](#determine-if-firmware-update-jobs-are-running).
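The first precondition can be sketched as follows. This is a minimal example that assumes `BMM_NAME`, `CLUSTER_MRG`, and `SUBSCRIPTION` are set as in the earlier examples. The function names are hypothetical wrappers, and the `cordonStatus` query assumes that property is exposed on the Bare Metal Machine resource by your installed `networkcloud` extension version.

```bash
# Sketch: cordon a Bare Metal Machine and evacuate its workloads.
# Assumes BMM_NAME, CLUSTER_MRG, and SUBSCRIPTION are already set.
cordon_and_evacuate() {
  az networkcloud baremetalmachine cordon \
    --evacuate "True" \
    --name "$BMM_NAME" \
    --resource-group "$CLUSTER_MRG" \
    --subscription "$SUBSCRIPTION"
}

# Sketch: read back the cordon status once the cordon command completes.
cordon_status() {
  az networkcloud baremetalmachine show \
    --name "$BMM_NAME" \
    --resource-group "$CLUSTER_MRG" \
    --subscription "$SUBSCRIPTION" \
    --query "cordonStatus" \
    --output tsv
}
```

Run `cordon_and_evacuate`, wait for it to complete, and then run `cordon_status` to confirm the result before you start the `reimage`.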
150+
151+
## Best practices for a Bare Metal Machine replace
152+
153+
The Bare Metal Machine `replace` action is explained in [Bare Metal Machine Lifecycle Management Commands], and scenario procedures are described in [Troubleshoot Azure Operator Nexus Server Problems].
154+
155+
[!INCLUDE [warning-do-not-run-multiple-actions](./includes/baremetal-machines/warning-do-not-run-multiple-actions.md)]
156+
157+
Hardware failures are a normal occurrence over the life of a server.
158+
Component replacements might be necessary to restore functionality and ensure continued operation.
159+
The `replace` operation must be executed after any hardware maintenance/repair event.
160+
When multiple hardware components fail on the server, make the necessary repairs for **all** components before executing a Bare Metal Machine `replace` operation.
161+
162+
> [!IMPORTANT]
163+
> With the `2024-07-01` GA API version, the RAID controller is reset during Bare Metal Machine `replace`, wiping all data from the server's virtual disks.
164+
> Baseboard Management Controller (BMC) virtual disk alerts triggered during Bare Metal Machine `replace` can be ignored unless there are additional physical disk or RAID controller alerts.
165+
166+
### Preconditions and validations before a Bare Metal Machine replace
167+
168+
Before initiating any `replace` operation, ensure the following preconditions are met:
169+
170+
- Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
171+
- Perform the high-level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
172+
- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
173+
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
174+
- Validate that there are no running firmware upgrade jobs.
175+
Follow steps in section [Determine if Firmware Update Jobs are Running](#determine-if-firmware-update-jobs-are-running).
176+
177+
### Resolve hardware validation issues
178+
179+
When a Bare Metal Machine is marked with failed hardware validation, it might indicate that physical repairs are needed.
180+
It's crucial to identify and address these repairs before performing a Bare Metal Machine `replace`.
181+
A hardware validation process is invoked as part of the `replace` operation to ensure the physical host's integrity before deploying the OS image.
182+
The Bare Metal Machine can't provision successfully while it continues to have hardware validation failures.
183+
As a result, the Bare Metal Machine fails to complete the necessary setup steps to become operational and join the cluster.
184+
Ensure **all hardware validation issues** are cleared before the next `replace` action.
185+
186+
To understand hardware validation results, read through the article [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
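Hardware validation failures surface in the Bare Metal Machine's detailed status fields. The following minimal sketch reads them with the Azure CLI. It assumes `BMM_NAME`, `CLUSTER_MRG`, and `SUBSCRIPTION` are set as in the earlier examples, and that the fields are exposed as the `detailedStatus` and `detailedStatusMessage` properties. The function name is a hypothetical wrapper.

```bash
# Sketch: show the detailed status fields of a Bare Metal Machine.
# Assumes BMM_NAME, CLUSTER_MRG, and SUBSCRIPTION are already set.
bmm_detailed_status() {
  az networkcloud baremetalmachine show \
    --name "$BMM_NAME" \
    --resource-group "$CLUSTER_MRG" \
    --subscription "$SUBSCRIPTION" \
    --query "{detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage}" \
    --output table
}
```

If the message still reports a hardware validation failure after a repair, resolve the reported component before you attempt another `replace`.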
187+
188+
### Bare Metal Machine replace isn't required
189+
190+
Some repairs don't require a Bare Metal Machine `replace` to be executed.
191+
For example, a `replace` operation isn't required when you're performing a physical hot swappable power supply repair because the Bare Metal Machine host will continue to function normally after the repair.
192+
However, if the Bare Metal Machine failed hardware validation, the Bare Metal Machine `replace` is required even if the hot swappable repairs are done.
193+
Examine the Bare Metal Machine status messages to determine if hardware validation failures or other degraded conditions are present.
194+
- [Troubleshoot Degraded Status Errors on Bare Metal Machines]
195+
- [Troubleshoot Bare Metal Machine Warning Status]
196+
- [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md)
197+
198+
Other repairs of this type might include:
199+
200+
- CPU
201+
- Dual In-Line Memory Module (DIMM)
202+
- Fan
203+
- Expansion board riser
204+
- Transceiver
205+
- Ethernet or fiber cable replacement
206+
207+
### Bare Metal Machine replace is required
208+
209+
After components such as the motherboard or a Network Interface Card (NIC) are replaced, the Bare Metal Machine's MAC address changes.
210+
However, the iDRAC IP address and hostname remain the same.
211+
Motherboard changes result in MAC address changes, requiring a Bare Metal Machine `replace`.
212+
213+
A `replace` operation **is required** to bring the Bare Metal Machine back into service when you're performing the following physical repairs:
214+
215+
- Backplane
216+
- System board
217+
- SSD disk
218+
- PERC/RAID adapter
219+
- Mellanox Network Interface Card (NIC)
220+
- Broadcom embedded NIC
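Because a repair of this kind can change the machine's MAC addresses, the `replace` command takes the details of the repaired server as input. The following is a minimal sketch under stated assumptions: the variables `BMC_USERNAME`, `BMC_PASSWORD`, `BMC_MAC`, `BOOT_MAC`, `MACHINE_NAME`, and `SERIAL_NUMBER` are hypothetical placeholders that you set yourself, and the function name is a hypothetical wrapper. Confirm the available parameters with `az networkcloud baremetalmachine replace --help` for your installed extension version.

```bash
# Sketch: run a Bare Metal Machine replace after a physical repair.
# Assumes BMM_NAME, CLUSTER_MRG, and SUBSCRIPTION are already set.
# BMC_USERNAME, BMC_PASSWORD, BMC_MAC, BOOT_MAC, MACHINE_NAME, and
# SERIAL_NUMBER are placeholders for the repaired server's details.
replace_bmm() {
  az networkcloud baremetalmachine replace \
    --name "$BMM_NAME" \
    --resource-group "$CLUSTER_MRG" \
    --subscription "$SUBSCRIPTION" \
    --bmc-credentials password="$BMC_PASSWORD" username="$BMC_USERNAME" \
    --bmc-mac-address "$BMC_MAC" \
    --boot-mac-address "$BOOT_MAC" \
    --machine-name "$MACHINE_NAME" \
    --serial-number "$SERIAL_NUMBER"
}
```

Wrapping the command in a function keeps the credentials and hardware details in variables, so they can be reviewed before the operation starts.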
221+
222+
### Check statuses after a Bare Metal Machine replace operation
223+
224+
After the Bare Metal Machine `replace` operation completes successfully, ensure that the `provisioningState` is `Succeeded` and the `readyState` is `True`.
225+
Only then proceed to execute the `uncordon` operation so that the Bare Metal Machine rejoins the schedulable node pool for workloads.
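The status check and the `uncordon` step can be combined so that the machine is uncordoned only when both conditions hold. This is a minimal sketch that assumes `BMM_NAME`, `CLUSTER_MRG`, and `SUBSCRIPTION` are set as in the earlier examples, and that the provisioning status is exposed as the `provisioningState` property on the resource. The function names are hypothetical.

```bash
# Sketch: read a single property of the Bare Metal Machine resource.
# Argument 1: the property name to query.
bmm_property() {
  az networkcloud baremetalmachine show \
    --name "$BMM_NAME" \
    --resource-group "$CLUSTER_MRG" \
    --subscription "$SUBSCRIPTION" \
    --query "$1" \
    --output tsv
}

# Sketch: uncordon only when provisioning succeeded and the machine is ready.
uncordon_if_ready() {
  if [ "$(bmm_property provisioningState)" = "Succeeded" ] &&
     [ "$(bmm_property readyState)" = "True" ]; then
    az networkcloud baremetalmachine uncordon \
      --name "$BMM_NAME" \
      --resource-group "$CLUSTER_MRG" \
      --subscription "$SUBSCRIPTION"
  else
    echo "Bare Metal Machine is not ready; skipping uncordon." >&2
    return 1
  fi
}
```

If `uncordon_if_ready` returns a nonzero status, review the machine's detailed status before you retry.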
226+
227+
## Request support
228+
229+
If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
230+
For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
231+
232+
## References
233+
234+
- [Bare Metal Machine Lifecycle Management Commands]
235+
- [Run emergency bare metal actions outside of Azure using nexusctl]
236+
- [Troubleshoot Azure Operator Nexus Server Problems]
237+
- [Troubleshoot Bare Metal Machine Provisioning]
238+
- [Troubleshoot Bare Metal Machine Warning Status]
239+
- [Troubleshoot Degraded Status Errors on Bare Metal Machines]
240+
- [Troubleshoot Hardware Validation Failure]
241+
242+
[Bare Metal Machine Lifecycle Management Commands]: ./howto-baremetal-functions.md
243+
[Run emergency bare metal actions outside of Azure using nexusctl]: ./howto-baremetal-nexusctl.md
244+
[Troubleshoot Azure Operator Nexus Server Problems]: ./troubleshoot-reboot-reimage-replace.md
245+
[Troubleshoot Bare Metal Machine Provisioning]: ./troubleshoot-bare-metal-machine-provisioning.md
246+
[Troubleshoot Bare Metal Machine Warning Status]: ./troubleshoot-bare-metal-machine-warning.md
247+
[Troubleshoot Degraded Status Errors on Bare Metal Machines]: ./troubleshoot-bare-metal-machine-degraded.md
248+
[Troubleshoot Hardware Validation Failure]: ./troubleshoot-hardware-validation-failure.md
249+
[How to monitor interface In and Out packet rate for network fabric devices]: ./howto-monitor-interface-packet-rate.md
250+
[How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric]: ./howto-configure-diagnostic-settings-monitor-configuration-differences.md
