articles/operator-nexus/troubleshoot-reboot-reimage-replace.md: 19 additions, 19 deletions
@@ -11,7 +11,7 @@ ms.author: ekarandjeff

 # Troubleshoot Azure Operator Nexus Bare Metal Machine server problems

-This article describes how to troubleshoot server problems by using restart, reimage, and replace actions on Azure Operator Nexus bare metal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which may cause a brief disruption to specific BMMs.
+This article describes how to troubleshoot server problems by using Restart, Reimage, and Replace actions on Azure Operator Nexus Bare Metal Machines (BMMs). You might need to take these actions on your server for maintenance reasons, which may cause a brief disruption to specific BMMs.

 The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.

@@ -37,9 +37,9 @@ The time required to complete each of these actions is similar. Restarting is th

 When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:

-1. **Restart** - Least invasive method, best for temporary glitches or unresponsive VMs
-2. **Reimage** - Intermediate solution, restores OS to known-good state without affecting data
-3. **Replace** - Most significant action, required for hardware component failures
+- **Restart** - Least invasive method, best for temporary glitches or unresponsive VMs
+- **Reimage** - Intermediate solution, restores OS to known-good state without affecting data
+- **Replace** - Most significant action, required for hardware component failures

 ### Troubleshooting decision tree

@@ -63,9 +63,9 @@ The restart typically is the starting point for mitigating a problem.
 ### Restart workflow

 1. **Assess impact** - Determine if restarting the BMM will impact critical workloads.
-1. **Power off** - If needed, power off the BMM (optional).
-1. **Start or restart** - Either start a powered-off BMM or restart a running BMM.
-1. **Verify status** - Check if the BMM is back online and functioning properly.
+2. **Power off** - If needed, power off the BMM (optional).
+3. **Start or restart** - Either start a powered-off BMM or restart a running BMM.
+4. **Verify status** - Check if the BMM is back online and functioning properly.

 > [!NOTE]
 > The restart operation is the fastest recovery method but may not resolve issues related to OS corruption or hardware failures.
@@ -118,9 +118,9 @@ A reimage action is the best practice for lowest operational risk to ensure the
 ### Reimage workflow

 1. **Verify running workloads** - Before reimaging, check what workloads are running on the BMM.
-1. **Cordon and evacuate workloads** - Drain the BMM of workloads.
-1. **Perform reimage** - Execute the reimage operation.
-1. **Uncordon** - Make the BMM schedulable again after reimage completes.
+2. **Cordon and evacuate workloads** - Drain the BMM of workloads.
+3. **Perform reimage** - Execute the reimage operation.
+4. **Uncordon** - Make the BMM schedulable again after reimage completes.

 > [!WARNING]
 > Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
@@ -182,10 +182,10 @@ A hardware validation process is invoked to ensure the integrity of the physical
 ### Replace workflow

 1. **Cordon and evacuate** - Remove workloads from the BMM before physical repair.
-1. **Perform physical repairs** - Replace hardware components as needed.
-1. **Execute replace command** - Run the replace command with required parameters.
-1. **Uncordon** - Make the BMM schedulable again after replacement completes.
-1. **Verify status** - Check that the BMM is properly functioning.
+2. **Perform physical repairs** - Replace hardware components as needed.
+3. **Execute replace command** - Run the replace command with required parameters.
+4. **Uncordon** - Make the BMM schedulable again after replacement completes.
+5. **Verify status** - Check that the BMM is properly functioning.

 **The following Azure CLI command will `cordon` the specified bareMetalMachineName.**

@@ -256,11 +256,11 @@ Restarting, reimaging, and replacing are effective troubleshooting methods for a

 ### Best practices

-1. **Always follow the escalation path**: Start with restart, then reimage, then replace unless the issue clearly indicates otherwise.
-1. **Verify workloads before action**: Use the provided commands to identify running workloads before any disruptive action.
-1. **Cordon with evacuation**: When performing reimage or replace actions, always use `cordon` with `evacuate="True"` to safely move workloads.
-1. **Never run multiple operations simultaneously**: Ensure one operation completes before starting another to prevent server issues.
-1. **Verify resolution**: After performing any action, verify the BMM status and that the original issue is resolved.
+- **Always follow the escalation path**: Start with restart, then reimage, then replace unless the issue clearly indicates otherwise.
+- **Verify workloads before action**: Use the provided commands to identify running workloads before any disruptive action.
+- **Cordon with evacuation**: When performing reimage or replace actions, always use `cordon` with `evacuate="True"` to safely move workloads.
+- **Never run multiple operations simultaneously**: Ensure one operation completes before starting another to prevent server issues.
+- **Verify resolution**: After performing any action, verify the BMM status and that the original issue is resolved.

 More details about the BMM actions can be found in the [BMM actions](howto-baremetal-functions.md) article.
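
As a rough illustration of the reimage workflow and the "one operation at a time" practice touched by this change, the steps could be scripted as a strictly sequential cordon, reimage, uncordon sequence. This is a minimal sketch, not the article's own procedure: it assumes the `networkcloud` Azure CLI extension is installed, the `BMM_NAME`, `RESOURCE_GROUP`, and `SUBSCRIPTION` values are placeholders, and `DRY_RUN` plus the `run`, `bmm_action`, and `reimage_workflow` helpers are hypothetical names introduced only for this example.

```shell
#!/usr/bin/env bash
# Sketch: run cordon -> reimage -> uncordon for one BMM, one step at a time.
# Assumption: the `networkcloud` Azure CLI extension is installed.
# Assumption: the three values below are placeholders to be replaced.
set -euo pipefail

BMM_NAME="${BMM_NAME:-<bareMetalMachineName>}"
RESOURCE_GROUP="${RESOURCE_GROUP:-<resourceGroup>}"
SUBSCRIPTION="${SUBSCRIPTION:-<subscriptionID>}"

# Hypothetical switch for this sketch: 1 prints commands, 0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Invoke one `az networkcloud baremetalmachine <action>` against the BMM.
bmm_action() {
  local action="$1"
  shift
  run az networkcloud baremetalmachine "$action" \
    --name "$BMM_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --subscription "$SUBSCRIPTION" "$@"
}

# Each call returns before the next starts, and `set -e` stops the
# sequence on the first failure, so no two operations overlap.
reimage_workflow() {
  bmm_action cordon --evacuate "True"
  bmm_action reimage
  bmm_action uncordon
}
```

Calling `reimage_workflow` with the default `DRY_RUN=1` only prints the three commands for review; setting `DRY_RUN=0` would execute them. Checking running workloads beforehand and verifying BMM status afterward are left out of the sketch and still apply.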