@@ -4,40 +4,73 @@ description: Troubleshoot cluster bare metal machines with Restart, Reimage, Rep
4
4
ms.service: azure-operator-nexus
5
5
ms.custom: troubleshooting
6
6
ms.topic: troubleshooting
7
-
ms.date: 04/24/2024
7
+
ms.date: 04/03/2025
8
8
author: eak13
9
9
ms.author: ekarandjeff
10
10
---
11
11
12
-
# Troubleshoot Bare Metal server problems
12
+
# Troubleshoot Azure Operator Nexus Bare Metal Machine server problems
13
13
14
-
This article describes how to troubleshoot server problems by using `restart`, `reimage`, and `replace` actions on Azure Operator Nexus BareMetal Machine (BMM).
15
-
These operations are performed for maintenance on your servers and cause a disruption to the specific Bare Metal Machine.
14
+
This article describes how to troubleshoot server problems by using Restart, Reimage, and Replace actions on Azure Operator Nexus Bare Metal Machines (BMMs). You might need to take these actions on your server for maintenance reasons, which may cause a brief disruption to specific BMMs.
The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.
> Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
## Follow best practice for Bare Metal Machine operations
23
+
- Familiarize yourself with the capabilities referenced in this article by reviewing the [BMM actions](howto-baremetal-functions.md).
24
+
- Gather the following information:
25
+
- Name of the managed resource group for the BMM
26
+
- Name of the BMM that requires a lifecycle management operation
27
+
- Subscription ID
24
28
25
-
The various operations `restart`, `reimage`, and `replace` are effective troubleshooting methods that you can use to address technical problems.
26
-
However, it's important to have a systematic approach and to consider other factors before you try any drastic measures.
29
+
> [!IMPORTANT]
30
+
> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available.
31
+
>
32
+
> Restart, reimage, and replace are all considered disruptive actions.
33
+
>
34
+
> This check maintains the integrity of the Nexus instance by ensuring that simultaneous disruptive actions don't take multiple KCP nodes down at once. If multiple nodes go down, the Kubernetes Control Plane falls below its healthy quorum threshold.
35
+
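The rejection rule in the note above can be modeled as a small predicate. This is a minimal sketch of the stated rule only, not the platform's implementation: the `KcpNode` type, its fields, and the function name are hypothetical, and it assumes "the full KCP is not available" means at least one KCP node is unavailable.

```python
from dataclasses import dataclass
from typing import List, Optional

# The three actions the article classifies as disruptive.
DISRUPTIVE_ACTIONS = {"restart", "reimage", "replace"}


@dataclass(frozen=True)
class KcpNode:
    """Hypothetical view of one Kubernetes Control Plane (KCP) node."""
    name: str
    available: bool = True
    running_action: Optional[str] = None  # action currently in progress, if any


def may_run_disruptive_action(target: str, action: str, kcp_nodes: List[KcpNode]) -> bool:
    """Return True if `action` against KCP node `target` would be accepted."""
    if action not in DISRUPTIVE_ACTIONS:
        return True  # the check only applies to disruptive actions
    for node in kcp_nodes:
        # Assumption: any unavailable node means the full KCP is not available.
        if not node.available:
            return False
        # Another disruptive action is already running against a different KCP node.
        if node.name != target and node.running_action in DISRUPTIVE_ACTIONS:
            return False
    return True
```

For example, a restart of `kcp-1` would be rejected in this model while `kcp-2` is still being reimaged, which keeps at most one control plane node in a disruptive state at a time.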
36
+
## Identify the corrective action
37
+
38
+
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
39
+
40
+
- **Restart** - Least invasive method, best for temporary glitches or unresponsive VMs
41
+
- **Reimage** - Intermediate solution, restores OS to known-good state without affecting data
42
+
- **Replace** - Most significant action, required for hardware component failures
43
+
44
+
### Troubleshooting decision tree
45
+
46
+
Follow this escalation path when troubleshooting BMM issues:
47
+
48
+
| Problem | First action | If problem persists | If still unresolved |
First, familiarize yourself with the operations by reading and following the advice on the recommended articles before proceeding with operations:
55
+
It's recommended to start with the least invasive solution (restart) and escalate to more complex measures only if necessary. Always validate that the issue is resolved after each corrective action.
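The escalation path described here (restart, then reimage, then replace, validating after each step) can be sketched as a lookup over an ordered list. The function name and the `"escalate-to-support"` placeholder for what happens after a failed replace are assumptions made for illustration; the article does not define them.

```python
from typing import Optional

# Least invasive first, as recommended in the article.
ESCALATION_ORDER = ["restart", "reimage", "replace"]


def next_action(last_action: Optional[str] = None, resolved: bool = False) -> Optional[str]:
    """Return the next corrective action to try, or None if the issue is resolved."""
    if resolved:
        return None  # validation passed, stop escalating
    if last_action is None:
        return ESCALATION_ORDER[0]
    index = ESCALATION_ORDER.index(last_action)  # raises ValueError for unknown actions
    if index + 1 < len(ESCALATION_ORDER):
        return ESCALATION_ORDER[index + 1]
    # Hypothetical placeholder: all three actions were tried without success.
    return "escalate-to-support"
```

Calling `next_action("restart")` yields `"reimage"`, while `next_action("restart", resolved=True)` yields `None`, reflecting the advice to validate after each action before moving on.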
29
56
30
-
- [Best Practices for BareMetal Machine Operations](./howto-bare-metal-best-practices.md).
31
-
- [Bare Metal Machine Lifecycle Management Operations](howto-baremetal-functions.md).
57
+
## Troubleshoot with a restart action
32
58
33
-
## Troubleshoot with a restart operation
59
+
Restarting a BMM reboots the server through a simple API call. This action can be useful for troubleshooting problems when Tenant Virtual Machines on the host aren't responsive or are otherwise stuck.
34
60
35
-
A `restart` operation can be useful in troubleshooting problems where tenant virtual machines on the host aren't responsive or are otherwise stuck.
61
+
A restart is typically the starting point for mitigating a problem.
36
62
37
-
One way approach this operation can be done by executing, in order, both a `power-off` followed by `start` operation.
38
-
This approach will `restart` a Bare Metal Machine command by performing a graceful shutdown that power cycles the node.
63
+
### Restart workflow
39
64
40
-
The following Azure CLI command will `power-off` the specified bareMetalMachineName.
65
+
1. **Assess impact** - Determine if restarting the BMM will impact critical workloads.
66
+
2. **Power off** - If needed, power off the BMM (optional).
67
+
3. **Start or restart** - Either start a powered-off BMM or restart a running BMM.
68
+
4. **Verify status** - Check if the BMM is back online and functioning properly.
69
+
70
+
> [!NOTE]
71
+
> The restart operation is the fastest recovery method but may not resolve issues related to OS corruption or hardware failures.
72
+
73
+
**The following Azure CLI command will `power-off` the specified bareMetalMachineName:**
41
74
42
75
```azurecli
43
76
az networkcloud baremetalmachine power-off \
@@ -46,7 +79,7 @@ az networkcloud baremetalmachine power-off \
46
79
--subscription <subscriptionID>
47
80
```
48
81
49
-
The following Azure CLI command will `start` the specified bareMetalMachineName.
82
+
**The following Azure CLI command will `start` the specified bareMetalMachineName:**
50
83
51
84
```azurecli
52
85
az networkcloud baremetalmachine start \
@@ -55,9 +88,7 @@ az networkcloud baremetalmachine start \
55
88
--subscription <subscriptionID>
56
89
```
57
90
58
-
Alternatively, you can let the `restart` command perform a server reboot.
59
-
60
-
The following Azure CLI command will `restart` the specified bareMetalMachineName.
91
+
**The following Azure CLI command will `restart` the specified bareMetalMachineName:**
61
92
62
93
```azurecli
63
94
az networkcloud baremetalmachine restart \
@@ -66,37 +97,52 @@ az networkcloud baremetalmachine restart \
66
97
--subscription <subscriptionID>
67
98
```
68
99
69
-
## Troubleshoot with a reimage operation
100
+
**To verify the BMM status after restart:**
101
+
102
+
```azurecli
103
+
az networkcloud baremetalmachine show \
104
+
--name <bareMetalMachineName> \
105
+
--resource-group "<resourceGroup>" \
106
+
--subscription <subscriptionID> \
107
+
--query "provisioningState"
108
+
```
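If the status check needs to run from a script, the `show` command above can be wrapped as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the CLI arguments are the ones shown above plus the global `--output json` option, and `Succeeded` is assumed to be the state that indicates a healthy result.

```python
import json
import subprocess
from typing import List


def build_show_command(name: str, resource_group: str, subscription_id: str) -> List[str]:
    """Build the argument list for the `show` command shown above."""
    return [
        "az", "networkcloud", "baremetalmachine", "show",
        "--name", name,
        "--resource-group", resource_group,
        "--subscription", subscription_id,
        "--query", "provisioningState",
        "--output", "json",  # global az option, added so the result parses as JSON
    ]


def is_succeeded(cli_stdout: str) -> bool:
    """Interpret the JSON string printed by the query, for example '"Succeeded"'."""
    return json.loads(cli_stdout) == "Succeeded"


def bmm_is_healthy(name: str, resource_group: str, subscription_id: str) -> bool:
    """Run the command (requires a logged-in Azure CLI) and check the state."""
    result = subprocess.run(
        build_show_command(name, resource_group, subscription_id),
        capture_output=True, text=True, check=True,
    )
    return is_succeeded(result.stdout)
```

Keeping the command builder and the output parser separate from the subprocess call makes both easy to check without access to a cluster.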
109
+
110
+
## Troubleshoot with a reimage action
70
111
71
-
The `reimage` command on a Bare Metal Machine is a process that **redeploys** the OS image on disk without affecting the tenant data.
72
-
This operation executes the steps to rejoin the cluster with the same identifiers.
112
+
Reimaging a BMM is a process that you use to redeploy the image on the OS disk, without affecting the tenant data. This action executes the steps to rejoin the cluster with the same identifiers.
73
113
74
-
The `reimage` operation can be useful for troubleshooting problems by restoring the OS to a known-good working state.
75
-
Common causes that can be resolved through reimaging include recovery due to doubt of host integrity, suspected or confirmed security compromise, or "break glass" write activity.
114
+
The reimage action can be useful for troubleshooting problems by restoring the OS to a known-good working state. Common causes that can be resolved through reimaging include recovery due to doubt of host integrity, suspected or confirmed security compromise, or "break glass" write activity.
76
115
77
-
A `reimage` operation is the best practice for lowest operational risk to ensure the Bare Metal Machine's integrity.
116
+
A reimage action is the best practice for lowest operational risk to ensure the integrity of the BMM.
78
117
79
-
As a best practice, before executing the `reimage` command make sure the Bare Metal Machine's workloads are drained using the cordon command with `evacuate` parameter set to `True`.
For Nexus Kubernetes cluster nodes: (Requires logging into the Nexus Kubernetes cluster)
139
+
**For Nexus Kubernetes cluster nodes: (requires logging into the Nexus Kubernetes cluster)**
94
140
95
-
```shell
96
-
kubectl get nodes <resourceName> -ojson |jq '.metadata.labels."topology.kubernetes.io/baremetalmachine"'
141
+
```
142
+
kubectl get nodes <resourceName> -ojson |jq '.metadata.labels."topology.kubernetes.io/baremetalmachine"'
97
143
```
98
144
99
-
The following Azure CLI command will `cordon` the specified bareMetalMachineName.
145
+
**The following Azure CLI command will `cordon` the specified bareMetalMachineName.**
100
146
101
147
```azurecli
102
148
az networkcloud baremetalmachine cordon \
@@ -106,7 +152,7 @@ az networkcloud baremetalmachine cordon \
106
152
--subscription <subscriptionID>
107
153
```
108
154
109
-
The following Azure CLI command will `reimage` the specified bareMetalMachineName.
155
+
**The following Azure CLI command will `reimage` the specified bareMetalMachineName.**
110
156
111
157
```azurecli
112
158
az networkcloud baremetalmachine reimage \
@@ -115,7 +161,7 @@ az networkcloud baremetalmachine reimage \
115
161
--subscription <subscriptionID>
116
162
```
117
163
118
-
The following Azure CLI command will `uncordon` the specified bareMetalMachineName.
164
+
**The following Azure CLI command will `uncordon` the specified bareMetalMachineName.**
119
165
120
166
```azurecli
121
167
az networkcloud baremetalmachine uncordon \
@@ -124,23 +170,24 @@ az networkcloud baremetalmachine uncordon \
124
170
--subscription <subscriptionID>
125
171
```
126
172
127
-
## Troubleshoot with a replace operation
173
+
## Troubleshoot with a replace action
128
174
129
-
Servers contain many physical components that fail over time. It's important to understand which physical repairs require to perform a Bare Metal Machine `replace`.
130
-
Like the `reimage` action, the tenant data isn't modified during a `replace`.
175
+
Servers contain many physical components that can fail over time. It is important to understand which physical repairs require BMM replacement and when BMM replacement is recommended.
131
176
132
-
> [!IMPORTANT]
133
-
> With the `2024-07-01` GA API version, the RAID controller is reset during Bare Metal Machine replace, wiping all data from the server's virtual disks.
134
-
> Baseboard Management Controller (BMC) virtual disk alerts triggered during Bare Metal Machine replace can be ignored unless there are more physical disk and/or RAID controllers alerts.
177
+
A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the Tenant data isn't modified during replacement.
135
178
136
-
### Resolve hardware validation issues
179
+
> [!IMPORTANT]
180
+
> Starting with the 2024-07-01 GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk and/or RAID controller alerts.
137
181
138
-
A hardware validation process is invoked, as part of the `replace`, to ensure the integrity of the physical host in advance of deploying the OS image.
139
-
As a best practice, first issue a `cordon` command to remove the Bare Metal Machine from workload scheduling and then shutdown/`power-off` the Bare Metal Machine in advance of physical repairs.
1. **Cordon and evacuate** - Remove workloads from the BMM before physical repair.
185
+
2. **Perform physical repairs** - Replace hardware components as needed.
186
+
3. **Execute replace command** - Run the replace command with required parameters.
187
+
4. **Uncordon** - Make the BMM schedulable again after replacement completes.
188
+
5. **Verify status** - Check that the BMM is properly functioning.
142
189
143
-
The following Azure CLI command will `cordon` the specified bareMetalMachineName.
190
+
**The following Azure CLI command will `cordon` the specified bareMetalMachineName.**
144
191
145
192
```azurecli
146
193
az networkcloud baremetalmachine cordon \
@@ -150,9 +197,11 @@ az networkcloud baremetalmachine cordon \
150
197
--subscription <subscriptionID>
151
198
```
152
199
153
-
A `replace` operation isn't required when you're performing a physical hot swappable power supply repair because the Bare Metal Machine host will continue to function normally after the repair.
200
+
### Hardware component replacement guide
201
+
202
+
When you're performing a physical hot-swappable power supply repair, a replace action is not required because the BMM host will continue to function normally after the repair.
154
203
155
-
Although it isn't strictly necessary to bring the Bare Metal Machine back into service, we recommend doing a `replace` operation when you're performing the following physical repairs:
204
+
When you're performing the following physical repairs, we recommend a replace action, though it is not necessary to bring the BMM back into service:
156
205
157
206
- CPU
158
207
- Dual In-Line Memory Module (DIMM)
@@ -161,7 +210,7 @@ Although it isn't strictly necessary to bring the Bare Metal Machine back into s
161
210
- Transceiver
162
211
- Ethernet or fiber cable replacement
163
212
164
-
A `replace` operation ***is required*** to bring the Bare Metal Machine back into service when you're performing the following physical repairs:
213
+
When you're performing the following physical repairs, a replace action ***is required*** to bring the BMM back into service:
165
214
166
215
- Backplane
167
216
- System board
@@ -170,9 +219,9 @@ A `replace` operation ***is required*** to bring the Bare Metal Machine back int
170
219
- Mellanox Network Interface Card (NIC)
171
220
- Broadcom embedded NIC
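The component guidance above can be captured as a lookup table, which is handy when repair tickets need to be triaged automatically. This is only a sketch: the enum and function names are hypothetical, and the table includes just the components visible in this diff (both lists are cut off between hunks), so it is intentionally incomplete.

```python
from enum import Enum


class ReplaceNeed(Enum):
    NOT_REQUIRED = "not required"
    RECOMMENDED = "recommended"
    REQUIRED = "required"


# Only the components that appear in the visible diff lines are listed here.
_REPLACE_NEED = {
    "hot-swappable power supply": ReplaceNeed.NOT_REQUIRED,
    "cpu": ReplaceNeed.RECOMMENDED,
    "dual in-line memory module (dimm)": ReplaceNeed.RECOMMENDED,
    "transceiver": ReplaceNeed.RECOMMENDED,
    "ethernet or fiber cable replacement": ReplaceNeed.RECOMMENDED,
    "backplane": ReplaceNeed.REQUIRED,
    "system board": ReplaceNeed.REQUIRED,
    "mellanox network interface card (nic)": ReplaceNeed.REQUIRED,
    "broadcom embedded nic": ReplaceNeed.REQUIRED,
}


def replace_need(component: str) -> ReplaceNeed:
    """Look up whether a replace action is needed after repairing `component`."""
    key = component.strip().lower()
    if key not in _REPLACE_NEED:
        # Unknown components are not guessed; consult the full article instead.
        raise KeyError(f"no guidance recorded for component: {component!r}")
    return _REPLACE_NEED[key]
```

Raising on unknown components, rather than defaulting to a category, avoids silently skipping a replace action that the full list might require.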
172
221
173
-
After physical repairs are completed, perform a `replace` operation.
174
-
175
-
The following Azure CLI command will `replace` the specified bareMetalMachineName.
222
+
After physical repairs are completed, perform a replace action.
223
+
224
+
**The following Azure CLI command will `replace` the specified bareMetalMachineName.**
176
225
177
226
```azurecli
178
227
az networkcloud baremetalmachine replace \
@@ -186,10 +235,7 @@ az networkcloud baremetalmachine replace \
186
235
--subscription <subscriptionID>
187
236
```
188
237
189
-
Once the Bare Metal Machine `replace` operation completes successfully, validate that the Bare Metal Machine's `provisioningStatus` is `Succeeded` and its `readyState` is set to `True`.
190
-
Then, proceed to execute the `uncordon` operation to have the Bare Metal Machine rejoin the workload schedulable node pool.
191
-
192
-
The following Azure CLI command will `uncordon` the specified bareMetalMachineName.
238
+
**The following Azure CLI command will `uncordon` the specified bareMetalMachineName.**
193
239
194
240
```azurecli
195
241
az networkcloud baremetalmachine uncordon \
@@ -198,7 +244,25 @@ az networkcloud baremetalmachine uncordon \
198
244
--subscription <subscriptionID>
199
245
```
200
246
201
-
## Request Support
247
+
## Summary
248
+
249
+
Restarting, reimaging, and replacing are effective troubleshooting methods for addressing Azure Operator Nexus server problems. Here's a quick reference guide: