Commit 97a4cb4

Enhanced TSG for clarity
1 parent cce31d8 commit 97a4cb4

1 file changed

+79
-14
lines changed

articles/operator-nexus/troubleshoot-reboot-reimage-replace.md

Lines changed: 79 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -13,6 +13,14 @@ ms.author: ekarandjeff
1313

1414
This article describes how to troubleshoot server problems by using restart, reimage, and replace actions on Azure Operator Nexus bare metal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which causes a brief disruption to specific BMMs.
1515

16+
## In this article
17+
- [Prerequisites](#prerequisites)
18+
- [Identify the corrective action](#identify-the-corrective-action)
19+
- [Troubleshoot with a restart action](#troubleshoot-with-a-restart-action)
20+
- [Troubleshoot with a reimage action](#troubleshoot-with-a-reimage-action)
21+
- [Troubleshoot with a replace action](#troubleshoot-with-a-replace-action)
22+
- [Summary](#summary)
23+
1624
The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.
1725

1826
> [!CAUTION]
@@ -35,48 +43,73 @@ The time required to complete each of these actions is similar. Restarting is th
3543
3644
## Identify the corrective action
3745

38-
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. Restarting or reimaging a BMM can be both efficient and effective for resolving issues or restoring the software to a known-good state. In cases where one or more hardware components fail on the server, it may be necessary to replace the BMM entirely. This article outlines the best practices for each of these three actions.
46+
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
3947

40-
Troubleshooting technical problems requires a systematic approach. One effective method is to start with the least invasive solution and work your way up to more complex and drastic measures, if necessary.
48+
1. **Restart** - Least invasive method, best for temporary glitches or unresponsive VMs
49+
2. **Reimage** - Intermediate solution, restores OS to known-good state without affecting data
50+
3. **Replace** - Most significant action, required for hardware component failures
4151

42-
The first step in troubleshooting is to try restarting the device or system. Restarting can help to clear up any temporary glitches or errors that might be causing the problem.
52+
### Troubleshooting decision tree
4353

44-
If restarting does not solve the problem, the next step is to try reimaging the device or system.
54+
Follow this escalation path when troubleshooting BMM issues:
4555

46-
If reimaging does not solve the problem, the final step is to replace the faulty hardware component. While replacement is a more significant measure, it may be required if the issue stems from a hardware defect.
56+
| Problem | First action | If problem persists | If still unresolved |
57+
|---------|-------------|-------------------|-------------------|
58+
| Unresponsive VMs or services | Restart | Reimage | Replace |
59+
| Software/OS corruption | Reimage | Replace | Contact support |
60+
| Known hardware failure | Replace | N/A | Contact support |
61+
| Security compromise | Reimage | Replace | Contact support |
4762

48-
Keep in mind that these troubleshooting methods might not always be effective, and other factors in play might require a different approach.
63+
Start with the least invasive solution (restart) and escalate to more drastic measures only if necessary. After each corrective action, validate that the issue is resolved.
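
As a rough illustration, the escalation table above can be expressed as a small lookup. This is a minimal sketch: the function name `escalation_path` and the category keywords are hypothetical and are not part of the Azure CLI.

```shell
# Hypothetical helper: print the escalation path for a problem category,
# mirroring the rows of the decision table. Category names are illustrative.
escalation_path() {
  case "$1" in
    unresponsive)
      echo "restart reimage replace" ;;
    os-corruption|security-compromise)
      echo "reimage replace contact-support" ;;
    hardware-failure)
      echo "replace contact-support" ;;
    *)
      # Unknown categories are rejected rather than guessed at.
      echo "unknown problem category: $1" >&2
      return 1 ;;
  esac
}
```

For example, `escalation_path os-corruption` prints the actions to try in order, starting with a reimage.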
4964

5065
## Troubleshoot with a restart action
5166

5267
Restarting a BMM restarts the server through a simple API call. This action can be useful when tenant virtual machines on the host aren't responsive or are otherwise stuck.
5368

5469
A restart is typically the starting point for mitigating a problem.
5570

56-
***The following Azure CLI command will `power-off` the specified bareMetalMachineName.***
71+
### Restart workflow
72+
73+
1. **Assess impact** - Determine if restarting the BMM will impact critical workloads
74+
2. **Power off** (optional) - Power off the BMM if needed
75+
3. **Start or restart** - Either start a powered-off BMM or restart a running BMM
76+
4. **Verify status** - Check if the BMM is back online and functioning properly
77+
78+
> [!NOTE]
79+
> The restart operation is the fastest recovery method but may not resolve issues related to OS corruption or hardware failures.
80+
81+
**The following Azure CLI command will `power-off` the specified bareMetalMachineName:**
5782
```
5883
az networkcloud baremetalmachine power-off \
5984
--name <bareMetalMachineName> \
6085
--resource-group "<resourceGroup>" \
6186
--subscription <subscriptionID>
6287
```
6388

64-
***The following Azure CLI command will `start` the specified bareMetalMachineName.***
89+
**The following Azure CLI command will `start` the specified bareMetalMachineName:**
6590
```
6691
az networkcloud baremetalmachine start \
6792
--name <bareMetalMachineName> \
6893
--resource-group "<resourceGroup>" \
6994
--subscription <subscriptionID>
7095
```
7196

72-
***The following Azure CLI command will `restart` the specified bareMetalMachineName.***
97+
**The following Azure CLI command will `restart` the specified bareMetalMachineName:**
7398
```
7499
az networkcloud baremetalmachine restart \
75100
--name <bareMetalMachineName> \
76101
--resource-group "<resourceGroup>" \
77102
--subscription <subscriptionID>
78103
```
79104

105+
**To verify the BMM status after restart:**
106+
```
107+
az networkcloud baremetalmachine show \
108+
--name <bareMetalMachineName> \
109+
--resource-group "<resourceGroup>" \
110+
--subscription <subscriptionID> \
111+
--query "provisioningState"
112+
```
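
Because a restart takes some time to complete, it can help to poll the status instead of checking once. The following is a minimal sketch that repeats the `show` command above until `provisioningState` reaches a terminal value. The function name `wait_for_bmm` and the `POLL_SECONDS` and `MAX_ATTEMPTS` variables are hypothetical, and the terminal values `Succeeded`, `Failed`, and `Canceled` are assumed from common Azure resource conventions.

```shell
# Sketch: poll provisioningState until it reaches a terminal value.
# POLL_SECONDS and MAX_ATTEMPTS are hypothetical tuning knobs.
wait_for_bmm() {
  local name="$1" rg="$2" sub="$3"
  local attempts="${MAX_ATTEMPTS:-30}" delay="${POLL_SECONDS:-20}"
  local state i=1
  while [ "$i" -le "$attempts" ]; do
    state="$(az networkcloud baremetalmachine show \
      --name "$name" \
      --resource-group "$rg" \
      --subscription "$sub" \
      --query "provisioningState" \
      --output tsv)"
    echo "attempt $i: provisioningState=$state"
    case "$state" in
      Succeeded) return 0 ;;          # operation finished successfully
      Failed|Canceled) return 1 ;;    # terminal failure, stop polling
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $name" >&2
  return 2
}
```

A call such as `wait_for_bmm <bareMetalMachineName> "<resourceGroup>" <subscriptionID>` returns 0 on success, 1 on a terminal failure, and 2 on timeout. A `Succeeded` provisioning state alone doesn't confirm that the original issue is resolved, so still check the affected workloads.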
80113

81114
## Troubleshoot with a reimage action
82115

@@ -86,14 +119,23 @@ The reimage action can be useful for troubleshooting problems by restoring the O
86119

87120
A reimage action is the recommended practice because it carries the lowest operational risk while ensuring the integrity of the BMM.
88121

89-
As a best practice, make sure the BMM's workloads are drained using the cordon command, with evacuate "True", before executing the reimage command.
122+
### Reimage workflow
123+
124+
1. **Verify running workloads** - Before reimaging, check what workloads are running on the BMM
125+
2. **Cordon and evacuate workloads** - Drain the BMM of workloads
126+
3. **Perform reimage** - Execute the reimage operation
127+
4. **Uncordon** - Make the BMM schedulable again after reimage completes
128+
129+
> [!WARNING]
130+
> Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
131+
> at the same time as a `reimage` will leave servers in a nonworking state. Make sure one operation has fully completed before starting another.
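
One way to keep the workflow steps strictly sequential is to wrap them in a single function that stops at the first failure. This is a minimal sketch: the function name `reimage_bmm` is hypothetical, and it assumes each `az` call is left at its default behavior of waiting for the operation to finish before returning.

```shell
# Sketch of the reimage workflow: cordon with evacuation, reimage, uncordon.
# Any failing step stops the sequence, so no two operations overlap.
reimage_bmm() {
  local name="$1" rg="$2" sub="$3"

  # Step 1: drain workloads from the BMM.
  az networkcloud baremetalmachine cordon --evacuate "True" \
    --name "$name" --resource-group "$rg" --subscription "$sub" || return 1

  # Step 2: reimage. On failure the BMM stays cordoned for investigation.
  az networkcloud baremetalmachine reimage \
    --name "$name" --resource-group "$rg" --subscription "$sub" || return 1

  # Step 3: make the BMM schedulable again.
  az networkcloud baremetalmachine uncordon \
    --name "$name" --resource-group "$rg" --subscription "$sub" || return 1
}
```

This guards only against overlap within one script run. It doesn't prevent another operator from starting a `replace` or `reimage` on the same BMM at the same time.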
90132
91133
**To identify if any workloads are currently running on a BMM, run the following command:**
92134

93135
**For Virtual Machines:**
94136
```azurecli
95-
az networkcloud baremetalmachine show -n <nodeName> /
96-
--resource-group <resourceGroup> /
137+
az networkcloud baremetalmachine show -n <nodeName> \
138+
--resource-group <resourceGroup> \
97139
--subscription <subscriptionID> | jq '.virtualMachinesAssociatedIds'
98140
```
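
To use this check as a gate in a script, the list can be reduced to a count. The sketch below is an assumption-laden variant of the command above: it uses the CLI's `--query` and `--output tsv` options instead of `jq`, assumes the output has one ID per line (and nothing when the list is empty), and uses a hypothetical function name `bmm_vm_count`.

```shell
# Sketch: count the VMs associated with a BMM.
# A nonzero count means workloads are still present on the machine.
bmm_vm_count() {
  local name="$1" rg="$2" sub="$3"
  local ids
  # Fail loudly if the query itself fails, so an error is never read as "0 VMs".
  ids="$(az networkcloud baremetalmachine show -n "$name" \
    --resource-group "$rg" \
    --subscription "$sub" \
    --query "virtualMachinesAssociatedIds[]" \
    --output tsv)" || return 1
  # Count non-empty lines; grep exits nonzero on a zero count, hence "|| true".
  printf '%s\n' "$ids" | grep -c . || true
}
```

A script could then refuse to continue unless `bmm_vm_count` prints `0`.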
99141

@@ -137,7 +179,13 @@ A hardware validation process is invoked to ensure the integrity of the physical
137179
> [!IMPORTANT]
138180
> Starting with the 2024-07-01 GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk or RAID controller alerts.
139181
140-
As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.
182+
### Replace workflow
183+
184+
1. **Cordon and evacuate** - Remove workloads from the BMM before physical repair
185+
2. **Perform physical repairs** - Replace hardware components as needed
186+
3. **Execute replace command** - Run the replace command with required parameters
187+
4. **Uncordon** - Make the BMM schedulable again after replacement completes
188+
5. **Verify status** - Check that the BMM is properly functioning
141189

142190
**The following Azure CLI command will `cordon` the specified bareMetalMachineName.**
143191
```
@@ -148,6 +196,8 @@ az networkcloud baremetalmachine cordon \
148196
--subscription <subscriptionID>
149197
```
150198

199+
### Hardware component replacement guide
200+
151201
When you're performing a physical repair of a hot-swappable power supply, a replace action is not required because the BMM host will continue to function normally after the repair.
152202

153203
When you're performing the following physical repairs, we recommend a replace action, though it is not necessary to bring the BMM back into service:
@@ -193,7 +243,22 @@ az networkcloud baremetalmachine uncordon \
193243

194244
## Summary
195245

196-
Restarting, reimaging, and replacing are effective troubleshooting methods that you can use to address technical problems. However, it's important to have a systematic approach and to consider other factors before you try any drastic measures.
246+
Restarting, reimaging, and replacing are effective troubleshooting methods for addressing Azure Operator Nexus server problems. Here's a quick reference guide:
247+
248+
| Action | When to use | Impact | Requirements |
249+
|--------|------------|--------|-------------|
250+
| **Restart** | Temporary glitches, unresponsive VMs | Brief downtime | None, fastest option |
251+
| **Reimage** | OS corruption, security concerns | Longer downtime, preserves data | Workload evacuation recommended |
252+
| **Replace** | Hardware component failures | Longest downtime; the RAID controller reset wipes data on the server's virtual disks | Hardware component replacement, specific parameters needed |
253+
254+
### Best practices
255+
256+
1. **Follow the escalation path**: Start with restart, then reimage, then replace, unless the issue clearly points to a specific action.
257+
2. **Verify workloads before action**: Use the provided commands to identify running workloads before any disruptive action.
258+
3. **Cordon with evacuation**: When performing reimage or replace actions, always use `cordon` with `--evacuate "True"` to safely move workloads.
259+
4. **Never run multiple operations simultaneously**: Ensure one operation completes before starting another to avoid leaving servers in a nonworking state.
260+
5. **Verify resolution**: After performing any action, verify the BMM status and that the original issue is resolved.
261+
197262
For more details about BMM actions, see the [BMM actions](howto-baremetal-functions.md) article.
198263

199264
If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
