Enhanced TSG for clarity

JAC0BSMITH · JAC0BSMITH · commit 97a4cb441fa6 · 2025-04-03T13:57:17.000-05:00
diff --git a/articles/operator-nexus/troubleshoot-reboot-reimage-replace.md b/articles/operator-nexus/troubleshoot-reboot-reimage-replace.md
@@ -13,6 +13,14 @@ ms.author: ekarandjeff
 
 This article describes how to troubleshoot server problems by using restart, reimage, and replace actions on Azure Operator Nexus bare metal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which causes a brief disruption to specific BMMs.
 
+## In this article
+- [Prerequisites](#prerequisites)
+- [Identify the corrective action](#identify-the-corrective-action)
+- [Troubleshoot with a restart action](#troubleshoot-with-a-restart-action)
+- [Troubleshoot with a reimage action](#troubleshoot-with-a-reimage-action)
+- [Troubleshoot with a replace action](#troubleshoot-with-a-replace-action)
+- [Summary](#summary)
+
 The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.
 
 > [!CAUTION]
@@ -35,48 +43,73 @@ The time required to complete each of these actions is similar. Restarting is th
 
 ## Identify the corrective action
 
-When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. Restarting or reimaging a BMM can be both efficient and effective for resolving issues or restoring the software to a known-good state. In cases where one or more hardware components fail on the server, it may be necessary to replace the BMM entirely. This article outlines the best practices for each of these three actions.
+When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
 
-Troubleshooting technical problems requires a systematic approach. One effective method is to start with the least invasive solution and work your way up to more complex and drastic measures, if necessary.
+1. **Restart** - Least invasive method, best for temporary glitches or unresponsive VMs
+2. **Reimage** - Intermediate solution, restores OS to known-good state without affecting data
+3. **Replace** - Most significant action, typically needed after hardware component failures
 
-The first step in troubleshooting is to try restarting the device or system. Restarting can help to clear up any temporary glitches or errors that might be causing the problem.
+### Troubleshooting decision tree
 
-If restarting does not solve the problem, the next step is to try reimaging the device or system.
+Follow this escalation path when troubleshooting BMM issues:
 
-If reimaging does not solve the problem, the final step is to replace the faulty hardware component. While replacement is a more significant measure, it may be required if the issue stems from a hardware defect.
+| Problem | First action | If problem persists | If still unresolved |
+|---------|-------------|-------------------|-------------------|
+| Unresponsive VMs or services | Restart | Reimage | Replace |
+| Software/OS corruption | Reimage | Replace | Contact support |
+| Known hardware failure | Replace | N/A | Contact support |
+| Security compromise | Reimage | Replace | Contact support |
 
-Keep in mind that these troubleshooting methods might not always be effective, and other factors in play might require a different approach.
+Start with the least invasive action (restart) and escalate to more disruptive actions only if the problem persists. After each corrective action, validate that the issue is resolved.
 
 ## Troubleshoot with a restart action
 
 Restarting a BMM is a process of restarting the server through a simple API call. This action can be useful for troubleshooting problems when tenant virtual machines on the host aren't responsive or are otherwise stuck.
 
 The restart typically is the starting point for mitigating a problem.
 
-***The following Azure CLI command will `power-off` the specified bareMetalMachineName.***
+### Restart workflow
+
+1. **Assess impact** - Determine if restarting the BMM will impact critical workloads
+2. **Power off** - Optionally, power off the BMM first
+3. **Start or restart** - Either start a powered-off BMM or restart a running BMM
+4. **Verify status** - Check if the BMM is back online and functioning properly
+
+> [!NOTE]
+> The restart operation is the fastest recovery method but may not resolve issues related to OS corruption or hardware failures.
+
+**The following Azure CLI command will `power-off` the specified bareMetalMachineName:**
 ```
 az networkcloud baremetalmachine power-off \
   --name <bareMetalMachineName>  \
   --resource-group "<resourceGroup>" \
   --subscription <subscriptionID>
 ```
 
-***The following Azure CLI command will `start` the specified bareMetalMachineName.***
+**The following Azure CLI command will `start` the specified bareMetalMachineName:**
 ```
 az networkcloud baremetalmachine start \
   --name <bareMetalMachineName>  \
   --resource-group "<resourceGroup>" \
   --subscription <subscriptionID>
 ```
 
-***The following Azure CLI command will `restart` the specified bareMetalMachineName.***
+**The following Azure CLI command will `restart` the specified bareMetalMachineName:**
 ```
 az networkcloud baremetalmachine restart \
   --name <bareMetalMachineName>  \
   --resource-group "<resourceGroup>" \
   --subscription <subscriptionID>
 ```
 
+**To verify the BMM status after restart:**
+```
+az networkcloud baremetalmachine show \
+  --name <bareMetalMachineName>  \
+  --resource-group "<resourceGroup>" \
+  --subscription <subscriptionID> \
+  --query "provisioningState"
+```
 
 ## Troubleshoot with a reimage action
 
@@ -86,14 +119,23 @@ The reimage action can be useful for troubleshooting problems by restoring the O
 
 A reimage action is the best practice for lowest operational risk to ensure the integrity of the BMM.
 
-As a best practice, make sure the BMM's workloads are drained using the cordon command, with evacuate "True", before executing the reimage command.
+### Reimage workflow
+
+1. **Verify running workloads** - Before reimaging, check what workloads are running on the BMM
+2. **Cordon and evacuate workloads** - Drain the BMM of workloads
+3. **Perform reimage** - Execute the reimage operation
+4. **Uncordon** - Make the BMM schedulable again after reimage completes
+
+> [!WARNING]
+> Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
+> at the same time as a `reimage`, will leave servers in a nonworking state. Make sure one operation has fully completed before starting another.
 
 **To identify if any workloads are currently running on a BMM, run the following command:**
 
 **For Virtual Machines:**
 ```azurecli
-az networkcloud baremetalmachine show -n <nodeName> /
---resource-group <resourceGroup> /
+az networkcloud baremetalmachine show -n <nodeName> \
+--resource-group <resourceGroup> \
 --subscription <subscriptionID> | jq '.virtualMachinesAssociatedIds'
 ```
 
@@ -137,7 +179,13 @@ A hardware validation process is invoked to ensure the integrity of the physical
 > [!IMPORTANT]
 > Starting with the 2024-07-01 GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk and/or RAID controllers alerts.
 
-As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.
+### Replace workflow
+
+1. **Cordon and evacuate** - Remove workloads from the BMM before physical repair
+2. **Perform physical repairs** - Replace hardware components as needed
+3. **Execute replace command** - Run the replace command with required parameters
+4. **Uncordon** - Make the BMM schedulable again after replacement completes
+5. **Verify status** - Check that the BMM is properly functioning
 
 **The following Azure CLI command will `cordon` the specified bareMetalMachineName.**
 ```
@@ -148,6 +196,8 @@ az networkcloud baremetalmachine cordon \
   --subscription <subscriptionID>
 ```
 
+### Hardware component replacement guide
+
 When you're performing a physical hot swappable power supply repair, a replace action is not required because the BMM host will continue to function normally after the repair.
 
 When you're performing the following physical repairs, we recommend a replace action, though it is not necessary to bring the BMM back into service:
@@ -193,7 +243,22 @@ az networkcloud baremetalmachine uncordon \
 
 ## Summary
 
-Restarting, reimaging, and replacing are effective troubleshooting methods that you can use to address technical problems. However, it's important to have a systematic approach and to consider other factors before you try any drastic measures.
+Restarting, reimaging, and replacing are effective troubleshooting methods for addressing Azure Operator Nexus server problems. Here's a quick reference guide:
+
+| Action | When to use | Impact | Requirements |
+|--------|------------|--------|-------------|
+| **Restart** | Temporary glitches, unresponsive VMs | Brief downtime | None, fastest option |
+| **Reimage** | OS corruption, security concerns | Longer downtime, preserves data | Workload evacuation recommended |
+| **Replace** | Hardware component failures | Longest downtime; wipes data on the server's virtual disks (RAID controller reset) | Hardware component replacement, specific parameters needed |
+
+### Best practices
+
+1. **Follow the escalation path**: Start with restart, then reimage, then replace, unless the issue clearly calls for a different first action (for example, a known hardware failure).
+2. **Verify workloads before action**: Use the provided commands to identify running workloads before any disruptive action.
+3. **Cordon with evacuation**: When performing reimage or replace actions, always use `cordon` with `--evacuate "True"` to safely move workloads.
+4. **Never run multiple operations simultaneously**: Ensure one operation completes before starting another to avoid leaving servers in a nonworking state.
+5. **Verify resolution**: After performing any action, verify the BMM status and that the original issue is resolved.
+
 More details about the BMM actions can be found in the [BMM actions](howto-baremetal-functions.md) article.
 
 If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
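
As a companion to the status check this change adds under "Troubleshoot with a restart action", the following is a minimal sketch of how that check could be wrapped in a retry loop rather than run by hand. The `poll_until` helper is hypothetical (it is not part of the Azure CLI or of the article), and the attempt count and delay shown in the usage comment are arbitrary assumptions. The sketch relies only on the `show` command and `--query "provisioningState"` already used in the diff, plus the standard Azure CLI `--output tsv` option so the value is printed without JSON quotes.

```shell
#!/bin/sh
# poll_until (hypothetical helper): run a command repeatedly until its
# output equals the expected value, or give up after a fixed number of tries.
#
# Usage: poll_until <expected> <max_attempts> <delay_seconds> <command...>
#
# The ( ... ) function body runs in a subshell, so the variables below do not
# leak into the caller and `exit` only ends the helper, not the calling script.
poll_until() (
  expected="$1"
  max_attempts="$2"
  delay="$3"
  shift 3

  attempt=1
  output=""
  while [ "$attempt" -le "$max_attempts" ]; do
    # Treat a failing command as "no value yet" rather than aborting.
    output="$("$@" 2>/dev/null)" || output=""
    if [ "$output" = "$expected" ]; then
      echo "attempt ${attempt}: ${output}"
      exit 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done

  echo "gave up after ${max_attempts} attempts (last value: ${output:-none})" >&2
  exit 1
)

# Example (assumed values: 30 attempts, 20 seconds apart):
#
#   poll_until "Succeeded" 30 20 \
#     az networkcloud baremetalmachine show \
#       --name "<bareMetalMachineName>" \
#       --resource-group "<resourceGroup>" \
#       --subscription "<subscriptionID>" \
#       --query "provisioningState" \
#       --output tsv
```

Passing the command as arguments keeps the helper independent of the Azure CLI, so the same loop can be exercised with a stand-in command before it is pointed at a real BMM.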