Commit 61018e0

CoPilot generated rewrite of article for clarity
1 parent e64834b commit 61018e0

1 file changed

+141
-80
lines changed

articles/operator-nexus/troubleshoot-reboot-reimage-replace.md

Lines changed: 141 additions & 80 deletions
@@ -4,155 +4,203 @@ description: Troubleshoot cluster bare metal machines with Restart, Reimage, Rep
44
ms.service: azure-operator-nexus
55
ms.custom: troubleshooting
66
ms.topic: troubleshooting
7-
ms.date: 04/24/2024
7+
ms.date: 04/03/2025
88
author: eak13
99
ms.author: ekarandjeff
1010
---
1111

12-
# Troubleshoot Bare Metal server problems
12+
# Troubleshoot Azure Operator Nexus server problems
1313

14-
This article describes how to troubleshoot server problems by using `restart`, `reimage`, and `replace` actions on Azure Operator Nexus BareMetal Machine (BMM).
15-
These operations are performed for maintenance on your servers and cause a disruption to the specific Bare Metal Machine.
14+
This article describes how to troubleshoot server problems by using restart, reimage, and replace actions on Azure Operator Nexus bare metal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which causes a brief disruption to specific BMMs.
1615

17-
[!INCLUDE [caution-affect-cluster-integrity](./includes/baremetal-machines/caution-affect-cluster-integrity.md)]
16+
## In this article
17+
- [Prerequisites](#prerequisites)
18+
- [Identify the corrective action](#identify-the-corrective-action)
19+
- [Troubleshoot with a restart action](#troubleshoot-with-a-restart-action)
20+
- [Troubleshoot with a reimage action](#troubleshoot-with-a-reimage-action)
21+
- [Troubleshoot with a replace action](#troubleshoot-with-a-replace-action)
22+
- [Summary](#summary)
1823

19-
[!INCLUDE [important-donot-disrupt-kcpnodes](./includes/baremetal-machines/important-donot-disrupt-kcpnodes.md)]
24+
The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.
2025

21-
[!INCLUDE [prerequisites-azure-cli-bare-metal-machine-actions](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]
26+
> [!CAUTION]
27+
> Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
2228
23-
## Follow best practice for Bare Metal Machine operations
29+
## Prerequisites
2430

25-
The various operations `restart`, `reimage`, and `replace` are effective troubleshooting methods that you can use to address technical problems.
26-
However, it's important to have a systematic approach and to consider other factors before you try any drastic measures.
31+
- Familiarize yourself with the capabilities referenced in this article by reviewing the [BMM actions](howto-baremetal-functions.md).
32+
- Gather the following information:
33+
- Name of the managed resource group for the BMM
34+
- Name of the BMM that requires a lifecycle management operation
35+
- Subscription ID
2736

28-
First, familiarize yourself with the operations by reading and following the advice on the recommended articles before proceeding with operations:
37+
> [!IMPORTANT]
38+
> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available.
39+
>
40+
> Restart, reimage, and replace are all considered disruptive actions.
41+
>
42+
> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes do not go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break the healthy quorum threshold of the Kubernetes Control Plane.
2943
30-
- [Best Practices for BareMetal Machine Operations](./howto-bare-metal-best-practices.md).
31-
- [Bare Metal Machine Lifecycle Management Operations](howto-baremetal-functions.md).
44+
## Identify the corrective action
3245

33-
## Troubleshoot with a restart operation
46+
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
3447

35-
A `restart` operation can be useful in troubleshooting problems where tenant virtual machines on the host aren't responsive or are otherwise stuck.
48+
1. **Restart** - Least invasive method, best for temporary glitches or unresponsive VMs
49+
2. **Reimage** - Intermediate solution, restores OS to known-good state without affecting data
50+
3. **Replace** - Most significant action, required for hardware component failures
3651

37-
One way approach this operation can be done by executing, in order, both a `power-off` followed by `start` operation.
38-
This approach will `restart` a Bare Metal Machine command by performing a graceful shutdown that power cycles the node.
52+
### Troubleshooting decision tree
3953

40-
The following Azure CLI command will `power-off` the specified bareMetalMachineName.
54+
Follow this escalation path when troubleshooting BMM issues:
4155

42-
```azurecli
56+
| Problem | First action | If problem persists | If still unresolved |
57+
|---------|-------------|-------------------|-------------------|
58+
| Unresponsive VMs or services | Restart | Reimage | Replace |
59+
| Software/OS corruption | Reimage | Replace | Contact support |
60+
| Known hardware failure | Replace | N/A | Contact support |
61+
| Security compromise | Reimage | Replace | Contact support |
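
The escalation table can also be read as a simple lookup. The sketch below is only an illustration of that reading: `escalation_path` and its category keywords (`unresponsive`, `os-corruption`, `hardware-failure`, `security`) are hypothetical names invented here, not Azure CLI values.

```shell
# Hypothetical helper: print the ordered escalation path for a problem category.
# The category keywords are illustrative labels that mirror the table rows.
escalation_path() {
  case "$1" in
    unresponsive)     echo "restart reimage replace" ;;
    os-corruption)    echo "reimage replace contact-support" ;;
    hardware-failure) echo "replace contact-support" ;;
    security)         echo "reimage replace contact-support" ;;
    *)
      echo "unknown problem category: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `escalation_path unresponsive` prints the three actions in the order the table suggests trying them.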
62+
63+
It's recommended to start with the least invasive solution (restart) and escalate to more complex measures only if necessary. Always validate that the issue is resolved after each corrective action.
64+
65+
## Troubleshoot with a restart action
66+
67+
Restarting a BMM reboots the server through a single API call. This action can be useful for troubleshooting problems when tenant virtual machines on the host aren't responsive or are otherwise stuck.
68+
69+
A restart is typically the starting point for mitigating a problem.
70+
71+
### Restart workflow
72+
73+
1. **Assess impact** - Determine if restarting the BMM will impact critical workloads
74+
2. **Power off** - Optionally, power off the BMM
75+
3. **Start or restart** - Either start a powered-off BMM or restart a running BMM
76+
4. **Verify status** - Check if the BMM is back online and functioning properly
77+
78+
> [!NOTE]
79+
> The restart operation is the fastest recovery method but may not resolve issues related to OS corruption or hardware failures.
80+
81+
**The following Azure CLI command will `power-off` the specified bareMetalMachineName:**
82+
```
4383
az networkcloud baremetalmachine power-off \
4484
--name <bareMetalMachineName> \
4585
--resource-group "<resourceGroup>" \
4686
--subscription <subscriptionID>
4787
```
4888

49-
The following Azure CLI command will `start` the specified bareMetalMachineName.
50-
51-
```azurecli
89+
**The following Azure CLI command will `start` the specified bareMetalMachineName:**
90+
```
5291
az networkcloud baremetalmachine start \
5392
--name <bareMetalMachineName> \
5493
--resource-group "<resourceGroup>" \
5594
--subscription <subscriptionID>
5695
```
5796

58-
Alternatively, you can let the `restart` command perform a server reboot.
59-
60-
The following Azure CLI command will `restart` the specified bareMetalMachineName.
61-
62-
```azurecli
97+
**The following Azure CLI command will `restart` the specified bareMetalMachineName:**
98+
```
6399
az networkcloud baremetalmachine restart \
64100
--name <bareMetalMachineName> \
65101
--resource-group "<resourceGroup>" \
66102
--subscription <subscriptionID>
67103
```
68104

69-
## Troubleshoot with a reimage operation
105+
**To verify the BMM status after restart:**
106+
```
107+
az networkcloud baremetalmachine show \
108+
--name <bareMetalMachineName> \
109+
--resource-group "<resourceGroup>" \
110+
--subscription <subscriptionID> \
111+
--query "provisioningState"
112+
```
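
Because a restart takes some time to complete, a single status query might run before the BMM has settled. The following is a minimal sketch of polling the same `show` query until it reports `Succeeded`. The function name `wait_for_bmm` and the `MAX_ATTEMPTS` and `SLEEP_SECONDS` variables are illustrative, and treating `Failed` as a terminal state is an assumption rather than something this article states.

```shell
# Sketch: poll provisioningState until it reports Succeeded, or give up.
# wait_for_bmm, MAX_ATTEMPTS, and SLEEP_SECONDS are illustrative names,
# not part of the Azure CLI.
wait_for_bmm() {
  local name="$1"
  local rg="$2"
  local sub="$3"
  local max_attempts="${MAX_ATTEMPTS:-30}"
  local sleep_seconds="${SLEEP_SECONDS:-60}"
  local attempt=1
  local state

  while [ "$attempt" -le "$max_attempts" ]; do
    # --output tsv prints the queried value as plain text without JSON quotes.
    state="$(az networkcloud baremetalmachine show \
      --name "$name" \
      --resource-group "$rg" \
      --subscription "$sub" \
      --query "provisioningState" \
      --output tsv)"
    echo "attempt ${attempt}: provisioningState=${state}"

    case "$state" in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;  # assumption: Failed is terminal, so stop polling
    esac

    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done

  echo "timed out waiting for ${name}" >&2
  return 1
}
```

A call such as `wait_for_bmm <bareMetalMachineName> <resourceGroup> <subscriptionID>` returns a nonzero exit code on failure or timeout, so it can gate later steps in a script.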
113+
114+
## Troubleshoot with a reimage action
70115

71-
The `reimage` command on a Bare Metal Machine is a process that **redeploys** the OS image on disk without affecting the tenant data.
72-
This operation executes the steps to rejoin the cluster with the same identifiers.
116+
Reimaging a BMM is a process that you use to redeploy the image on the OS disk, without affecting the tenant data. This action executes the steps to rejoin the cluster with the same identifiers.
73117

74-
The `reimage` operation can be useful for troubleshooting problems by restoring the OS to a known-good working state.
75-
Common causes that can be resolved through reimaging include recovery due to doubt of host integrity, suspected or confirmed security compromise, or "break glass" write activity.
118+
The reimage action can be useful for troubleshooting problems by restoring the OS to a known-good working state. Common causes that can be resolved through reimaging include recovery due to doubt of host integrity, suspected or confirmed security compromise, or "break glass" write activity.
76119

77-
A `reimage` operation is the best practice for lowest operational risk to ensure the Bare Metal Machine's integrity.
120+
A reimage action is the best practice for lowest operational risk to ensure the integrity of the BMM.
78121

79-
As a best practice, before executing the `reimage` command make sure the Bare Metal Machine's workloads are drained using the cordon command with `evacuate` parameter set to `True`.
122+
### Reimage workflow
80123

81-
[!INCLUDE [warning-do-not-run-multiple-actions](./includes/baremetal-machines/warning-do-not-run-multiple-actions.md)]
124+
1. **Verify running workloads** - Before reimaging, check what workloads are running on the BMM
125+
2. **Cordon and evacuate workloads** - Drain the BMM of workloads
126+
3. **Perform reimage** - Execute the reimage operation
127+
4. **Uncordon** - Make the BMM schedulable again after reimage completes
82128

83-
To identify if any workloads are currently running on a Bare Metal Machine, run the following command:
129+
> [!WARNING]
130+
> Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
131+
> at the same time as a `reimage` will leave servers in a nonworking state. Make sure one operation has fully completed before starting another.
84132
85-
For Virtual Machines:
133+
**To identify if any workloads are currently running on a BMM, run the following command:**
86134

135+
**For Virtual Machines:**
87136
```azurecli
88-
az networkcloud baremetalmachine show -n <nodeName> /
89-
--resource-group <resourceGroup> /
90-
--subscription <subscriptionID> | jq '.virtualMachinesAssociatedIds'
137+
az networkcloud baremetalmachine show -n <nodeName> \
138+
--resource-group <resourceGroup> \
139+
--subscription <subscriptionID> | jq '.virtualMachinesAssociatedIds'
91140
```
92141

93-
For Nexus Kubernetes cluster nodes: (Requires logging into the Nexus Kubernetes cluster)
142+
**For Nexus Kubernetes cluster nodes: (requires logging into the Nexus Kubernetes cluster)**
94143

95-
```shell
96-
kubectl get nodes <resourceName> -ojson | jq '.metadata.labels."topology.kubernetes.io/baremetalmachine"'
144+
```
145+
kubectl get nodes <resourceName> -ojson | jq '.metadata.labels."topology.kubernetes.io/baremetalmachine"'
97146
```
98147

99-
The following Azure CLI command will `cordon` the specified bareMetalMachineName.
100-
101-
```azurecli
148+
**The following Azure CLI command will `cordon` the specified bareMetalMachineName.**
149+
```
102150
az networkcloud baremetalmachine cordon \
103151
--evacuate "True" \
104152
--name <bareMetalMachineName> \
105153
--resource-group "<resourceGroup>" \
106154
--subscription <subscriptionID>
107155
```
108156

109-
The following Azure CLI command will `reimage` the specified bareMetalMachineName.
110-
111-
```azurecli
157+
**The following Azure CLI command will `reimage` the specified bareMetalMachineName.**
158+
```
112159
az networkcloud baremetalmachine reimage \
113160
--name <bareMetalMachineName> \
114161
--resource-group "<resourceGroup>" \
115162
--subscription <subscriptionID>
116163
```
117164

118-
The following Azure CLI command will `uncordon` the specified bareMetalMachineName.
119-
120-
```azurecli
165+
**The following Azure CLI command will `uncordon` the specified bareMetalMachineName.**
166+
```
121167
az networkcloud baremetalmachine uncordon \
122168
--name <bareMetalMachineName> \
123169
--resource-group "<resourceGroup>" \
124170
--subscription <subscriptionID>
125171
```
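
The cordon, reimage, and uncordon commands can be chained so that each step runs only after the previous one succeeds, which also avoids overlapping operations. This is a sketch under stated assumptions: `reimage_bmm` is an illustrative name, and each `az` call is assumed to block until its operation finishes (that is, `--no-wait` is not passed).

```shell
# Sketch: run cordon -> reimage -> uncordon strictly in order and stop at the
# first failure. reimage_bmm is an illustrative name, not an Azure CLI command.
reimage_bmm() {
  local name="$1"
  local rg="$2"
  local sub="$3"

  # Drain workloads before touching the OS image.
  az networkcloud baremetalmachine cordon \
    --evacuate "True" \
    --name "$name" \
    --resource-group "$rg" \
    --subscription "$sub" || return 1

  # If the reimage fails, return early and leave the BMM cordoned.
  az networkcloud baremetalmachine reimage \
    --name "$name" \
    --resource-group "$rg" \
    --subscription "$sub" || return 1

  # Make the BMM schedulable again only after a successful reimage.
  az networkcloud baremetalmachine uncordon \
    --name "$name" \
    --resource-group "$rg" \
    --subscription "$sub"
}
```

Leaving the BMM cordoned when the reimage step fails is a deliberate choice in this sketch, so that workloads are not scheduled onto a host in an unknown state.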
126172

127-
## Troubleshoot with a replace operation
173+
## Troubleshoot with a replace action
128174

129-
Servers contain many physical components that fail over time. It's important to understand which physical repairs require to perform a Bare Metal Machine `replace`.
130-
Like the `reimage` action, the tenant data isn't modified during a `replace`.
175+
Servers contain many physical components that can fail over time. It is important to understand which physical repairs require BMM replacement and when BMM replacement is recommended.
131176

132-
> [!IMPORTANT]
133-
> With the `2024-07-01` GA API version, the RAID controller is reset during Bare Metal Machine replace, wiping all data from the server's virtual disks.
134-
> Baseboard Management Controller (BMC) virtual disk alerts triggered during Bare Metal Machine replace can be ignored unless there are more physical disk and/or RAID controllers alerts.
135-
136-
### Resolve hardware validation issues
177+
A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the tenant data isn't modified during replacement.
137178

138-
A hardware validation process is invoked, as part of the `replace`, to ensure the integrity of the physical host in advance of deploying the OS image.
139-
As a best practice, first issue a `cordon` command to remove the Bare Metal Machine from workload scheduling and then shutdown/`power-off` the Bare Metal Machine in advance of physical repairs.
179+
> [!IMPORTANT]
180+
> Starting with the 2024-07-01 GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk and/or RAID controller alerts.
140181
141-
[!INCLUDE [warning-do-not-run-multiple-actions](./includes/baremetal-machines/warning-do-not-run-multiple-actions.md)]
182+
### Replace workflow
142183

143-
The following Azure CLI command will `cordon` the specified bareMetalMachineName.
184+
1. **Cordon and evacuate** - Remove workloads from the BMM before physical repair
185+
2. **Perform physical repairs** - Replace hardware components as needed
186+
3. **Execute replace command** - Run the replace command with required parameters
187+
4. **Uncordon** - Make the BMM schedulable again after replacement completes
188+
5. **Verify status** - Check that the BMM is properly functioning
144189

145-
```azurecli
190+
**The following Azure CLI command will `cordon` the specified bareMetalMachineName.**
191+
```
146192
az networkcloud baremetalmachine cordon \
147193
--evacuate "True" \
148194
--name <bareMetalMachineName> \
149195
--resource-group "<resourceGroup>" \
150196
--subscription <subscriptionID>
151197
```
152198

153-
A `replace` operation isn't required when you're performing a physical hot swappable power supply repair because the Bare Metal Machine host will continue to function normally after the repair.
199+
### Hardware component replacement guide
200+
201+
When you're performing a physical hot swappable power supply repair, a replace action is not required because the BMM host will continue to function normally after the repair.
154202

155-
Although it isn't strictly necessary to bring the Bare Metal Machine back into service, we recommend doing a `replace` operation when you're performing the following physical repairs:
203+
When you're performing the following physical repairs, we recommend a replace action, although it isn't strictly required to bring the BMM back into service:
156204

157205
- CPU
158206
- Dual In-Line Memory Module (DIMM)
@@ -161,7 +209,7 @@ Although it isn't strictly necessary to bring the Bare Metal Machine back into s
161209
- Transceiver
162210
- Ethernet or fiber cable replacement
163211

164-
A `replace` operation ***is required*** to bring the Bare Metal Machine back into service when you're performing the following physical repairs:
212+
When you're performing the following physical repairs, a replace action ***is required*** to bring the BMM back into service:
165213

166214
- Backplane
167215
- System board
@@ -170,11 +218,10 @@ A `replace` operation ***is required*** to bring the Bare Metal Machine back int
170218
- Mellanox Network Interface Card (NIC)
171219
- Broadcom embedded NIC
172220

173-
After physical repairs are completed, perform a `replace` operation.
174-
175-
The following Azure CLI command will `replace` the specified bareMetalMachineName.
176-
177-
```azurecli
221+
After physical repairs are completed, perform a replace action.
222+
223+
**The following Azure CLI command will `replace` the specified bareMetalMachineName.**
224+
```
178225
az networkcloud baremetalmachine replace \
179226
--name <bareMetalMachineName> \
180227
--resource-group "<resourceGroup>" \
@@ -186,19 +233,33 @@ az networkcloud baremetalmachine replace \
186233
--subscription <subscriptionID>
187234
```
188235

189-
Once the Bare Metal Machine `replace` operation completes successfully, validate that the Bare Metal Machine's `provisioningStatus` is `Succeeded` and its `readyState` is set to `True`.
190-
Then, proceed to execute the `uncordon` operation to have the Bare Metal Machine rejoin the workload schedulable node pool.
191-
192-
The following Azure CLI command will `uncordon` the specified bareMetalMachineName.
193-
194-
```azurecli
236+
**The following Azure CLI command will `uncordon` the specified bareMetalMachineName.**
237+
```
195238
az networkcloud baremetalmachine uncordon \
196239
--name <bareMetalMachineName> \
197240
--resource-group "<resourceGroup>" \
198241
--subscription <subscriptionID>
199242
```
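
The earlier wording of this article advised confirming that the BMM's provisioning state is `Succeeded` and its `readyState` is `True` before uncordoning after a replace. One way to express that gate is sketched below. The helper names `bmm_field` and `uncordon_if_ready` are hypothetical, and the assumption that `readyState` is returned as the literal text `True` should be checked against your API version.

```shell
# Hypothetical helper: read one top-level field of the BMM resource as plain text.
bmm_field() {
  az networkcloud baremetalmachine show \
    --name "$1" \
    --resource-group "$2" \
    --subscription "$3" \
    --query "$4" \
    --output tsv
}

# Sketch: uncordon only when the BMM reports a healthy state.
uncordon_if_ready() {
  local name="$1"
  local rg="$2"
  local sub="$3"
  local provisioning
  local ready

  provisioning="$(bmm_field "$name" "$rg" "$sub" provisioningState)"
  ready="$(bmm_field "$name" "$rg" "$sub" readyState)"

  # Assumption: readyState is reported as the text "True" when the BMM is ready.
  if [ "$provisioning" != "Succeeded" ] || [ "$ready" != "True" ]; then
    echo "not uncordoning: provisioningState=${provisioning} readyState=${ready}" >&2
    return 1
  fi

  az networkcloud baremetalmachine uncordon \
    --name "$name" \
    --resource-group "$rg" \
    --subscription "$sub"
}
```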
200243

201-
## Request Support
244+
## Summary
245+
246+
Restarting, reimaging, and replacing are effective troubleshooting methods for addressing Azure Operator Nexus server problems. Here's a quick reference guide:
247+
248+
| Action | When to use | Impact | Requirements |
249+
|--------|------------|--------|-------------|
250+
| **Restart** | Temporary glitches, unresponsive VMs | Brief downtime | None, fastest option |
251+
| **Reimage** | OS corruption, security concerns | Longer downtime, preserves data | Workload evacuation recommended |
252+
| **Replace** | Hardware component failures | Longest downtime, preserves data | Hardware component replacement, specific parameters needed |
253+
254+
### Best practices
255+
256+
1. **Follow the escalation path**: Start with restart, then reimage, then replace, unless the issue clearly indicates otherwise.
257+
2. **Verify workloads before action**: Use the provided commands to identify running workloads before any disruptive action.
258+
3. **Cordon with evacuation**: When performing reimage or replace actions, always use `cordon` with `--evacuate "True"` to safely move workloads.
259+
4. **Never run multiple operations simultaneously**: Ensure one operation completes before starting another to prevent server issues.
260+
5. **Verify resolution**: After performing any action, verify the BMM status and that the original issue is resolved.
261+
262+
More details about the BMM actions can be found in the [BMM actions](howto-baremetal-functions.md) article.
202263

203264
If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
204265
For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
