articles/operator-nexus/troubleshoot-reboot-reimage-replace.md
Lines changed: 17 additions & 16 deletions
@@ -11,12 +11,12 @@ ms.author: ekarandjeff
 
 # Troubleshoot Azure Operator Nexus Bare Metal Machine server problems
 
-This article describes how to troubleshoot server problems by using Restart, Reimage, and Replace actions on Azure Operator Nexus Bare Metal Machines (BMMs). You might need to take these actions on your server for maintenance reasons, which may cause a brief disruption to specific BMMs.
+This article describes how to troubleshoot server problems by using Restart, Reimage, and Replace actions on Azure Operator Nexus Bare Metal Machines (BMMs). You might need to take these actions on your server for maintenance reasons, which might cause a brief disruption to specific BMMs.
 
 The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.
 
 > [!CAUTION]
-> Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
+> Don't perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.
 
 ## Prerequisites
 
@@ -27,17 +27,17 @@ The time required to complete each of these actions is similar. Restarting is th
 - Subscription ID
 
 > [!IMPORTANT]
-> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there is another disruptive action command already running against another KCP node or if the full KCP is not available.
+> Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if there's another disruptive action command already running against another KCP node or if the full KCP isn't available.
 >
 > Restart, reimage and replace are all considered disruptive actions.
 >
-> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes do not go down at once due to simultaneous disruptive actions. If multiple nodes go down, it will break the healthy quorum threshold of the Kubernetes Control Plane.
+> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it breaks the healthy quorum threshold of the Kubernetes Control Plane.
 
 ## Identify the corrective action
 
-When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
+When troubleshooting a BMM for failures and determining the most appropriate corrective action, it's essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
 
-- **Restart** - Least invasive method, best for temporary glitches or unresponsive VMs
+- **Restart** - Least invasive method, best for temporary glitches, or unresponsive Virtual Machines (VM)s
 - **Reimage** - Intermediate solution, restores OS to known-good state without affecting data
 - **Replace** - Most significant action, required for hardware component failures
 
@@ -52,23 +52,23 @@ Follow this escalation path when troubleshooting BMM issues:
 | Known hardware failure | Replace | N/A | Contact support |
 
-It's recommended to start with the least invasive solution (restart) and escalate to more complex measures only if necessary. Always validate that the issue is resolved after each corrective action.
+The recommended approach is to start with the least invasive solution (restart) and escalate to more complex measures only if necessary. Always validate that the issue is resolved after each corrective action.
 
 ## Troubleshoot with a restart action
 
-Restarting a BMM is a process of restarting the server through a simple API call. This action can be useful for troubleshooting problems when Tenant Virtual Machines on the host aren't responsive or are otherwise stuck.
+Restarting a BMM is a process of restarting the server through a simple API call. This action can be useful for troubleshooting problems when Tenant VMs on the host aren't responsive or are otherwise stuck.
 
 The restart typically is the starting point for mitigating a problem.
 
 ### Restart workflow
 
-1. **Assess impact** - Determine if restarting the BMM will impact critical workloads.
+1. **Assess impact** - Determine if restarting the BMM impacts critical workloads.
 2. **Power off** - If needed, power off the BMM (optional).
 3. **Start or restart** - Either start a powered-off BMM or restart a running BMM.
 4. **Verify status** - Check if the BMM is back online and functioning properly.
 
 > [!NOTE]
-> The restart operation is the fastest recovery method but may not resolve issues related to OS corruption or hardware failures.
+> The restart operation is the fastest recovery method but might not resolve issues related to OS corruption or hardware failures.
 
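The restart and verify steps of the workflow above could be scripted roughly as follows. This is a minimal sketch, not the article's own procedure: the helper names `bmm_action` and `restart_workflow`, the `AZ` dry-run switch, and the output field names in the `--query` expression (`powerState`, `readyState`, `detailedStatus`) are assumptions made for illustration. By default the script only prints the commands it would run.

```shell
# Dry run by default: commands are printed, not executed.
# Set AZ=az to run them against a real subscription.
AZ="${AZ:-echo az}"

# Hypothetical helper: run one BMM action.
# Args: action, BMM name, resource group, subscription ID, then optional extra flags.
bmm_action() {
  [ "$#" -ge 4 ] || { echo "usage: bmm_action ACTION NAME RG SUB [flags]" >&2; return 2; }
  bmm_act="$1"; bmm_name="$2"; bmm_rg="$3"; bmm_sub="$4"
  shift 4
  $AZ networkcloud baremetalmachine "$bmm_act" \
    --name "$bmm_name" \
    --resource-group "$bmm_rg" \
    --subscription "$bmm_sub" "$@"
}

# Hypothetical helper: restart a running BMM, then query its state.
# Args: BMM name, resource group, subscription ID.
restart_workflow() {
  bmm_action restart "$1" "$2" "$3" || return 1
  # The field names in this query are assumed; adjust them to the actual output.
  bmm_action show "$1" "$2" "$3" \
    --query "{power:powerState,ready:readyState,status:detailedStatus}" \
    --output table
}
```

Calling `restart_workflow <bareMetalMachineName> <resourceGroup> <subscriptionID>` in dry-run mode prints the two commands in order, which makes it easy to review them before setting `AZ=az`.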
 **The following Azure CLI command will `power-off` the specified bareMetalMachineName:**
 
@@ -124,7 +124,7 @@ A reimage action is the best practice for lowest operational risk to ensure the
 
 > [!WARNING]
 > Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
-> at the same time as a `reimage` will leave servers in a nonworking state. Make sure one operation has fully completed before starting another.
+> at the same time as a `reimage` leaves servers in a nonworking state. Make sure one operation fully completes before starting another.
 
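One way to respect the warning above when several BMMs need a reimage is to run the operations strictly in sequence. The sketch below assumes that the Azure CLI blocks until the long-running operation finishes when `--no-wait` is not passed. The function name `reimage_one_at_a_time` and the `AZ` dry-run switch are hypothetical, and the loop only serializes commands issued from this one shell, not those of other operators.

```shell
# Dry run by default: commands are printed, not executed.
# Set AZ=az to run them against a real subscription.
AZ="${AZ:-echo az}"

# Hypothetical helper: reimage BMMs one after another.
# Args: resource group, subscription ID, then one or more BMM names.
reimage_one_at_a_time() {
  [ "$#" -ge 3 ] || { echo "usage: reimage_one_at_a_time RG SUB NAME..." >&2; return 2; }
  rio_rg="$1"; rio_sub="$2"
  shift 2
  for rio_bmm in "$@"; do
    # No --no-wait here, so the next iteration starts only after this call returns.
    $AZ networkcloud baremetalmachine reimage \
      --name "$rio_bmm" \
      --resource-group "$rio_rg" \
      --subscription "$rio_sub" \
      || { echo "reimage of $rio_bmm failed; stopping" >&2; return 1; }
  done
}
```

Stopping at the first failure is a deliberate choice in this sketch: it avoids starting a new disruptive action while an earlier BMM may still be in an unknown state.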
 **To identify if any workloads are currently running on a BMM, run the following command:**
 
@@ -172,12 +172,12 @@ az networkcloud baremetalmachine uncordon \
 
 ## Troubleshoot with a replace action
 
-Servers contain many physical components that can fail over time. It is important to understand which physical repairs require BMM replacement and when BMM replacement is recommended.
+Servers contain many physical components that can fail over time. It's important to understand which physical repairs require BMM replacement and when BMM replacement is recommended.
 
 A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the Tenant data isn't modified during replacement.
 
 > [!IMPORTANT]
-> Starting with the 2024-07-01 GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk and/or RAID controllers alerts.
+> When run with default options, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are other physical disk and/or RAID controllers alerts. Starting with the 2025-07-01 preview version of the NetworkCloud API, and generally available with the 2025-09-01 GA version, use `replace` with `storage-policy="Preserve"` to retain virtual disk data.
 
 ### Replace workflow
 
@@ -199,9 +199,9 @@ az networkcloud baremetalmachine cordon \
 
 ### Hardware component replacement guide
 
-When you're performing a physical hot swappable power supply repair, a replace action is not required because the BMM host will continue to function normally after the repair.
+When you're performing a physical hot swappable power supply repair, a replace action isn't required because the BMM host will continue to function normally after the repair.
 
-When you're performing the following physical repairs, we recommend a replace action, though it is not necessary to bring the BMM back into service:
+When you're performing the following physical repairs, we recommend a replace action, though it isn't necessary to bring the BMM back into service:
 
 - CPU
 - Dual In-Line Memory Module (DIMM)
@@ -232,7 +232,8 @@ az networkcloud baremetalmachine replace \
   --boot-mac-address <PXE_MAC> \
   --machine-name <OS_HOSTNAME> \
   --serial-number <SERIAL_NUM> \
-  --subscription <subscriptionID>
+  --subscription <subscriptionID> \
+  --storage-policy <STORAGE_POLICY>
 ```
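Because `--storage-policy` changes whether virtual disk data survives a replace, it can help to make that choice explicit in any wrapper script. The following is a minimal sketch under stated assumptions: the function name `bmm_replace` and the `AZ` dry-run switch are hypothetical, and `Preserve` is the only policy value taken from the change above. All other flags are passed through unchanged.

```shell
# Dry run by default: commands are printed, not executed.
# Set AZ=az to run them against a real subscription.
AZ="${AZ:-echo az}"

# Hypothetical helper: run a replace, adding --storage-policy only when requested.
# Args: storage policy ("" keeps the default behavior, "Preserve" retains
# virtual disk data), then the remaining replace flags passed through as-is.
bmm_replace() {
  [ "$#" -ge 1 ] || { echo "usage: bmm_replace POLICY [replace flags]" >&2; return 2; }
  rp_policy="$1"
  shift
  if [ -n "$rp_policy" ]; then
    # Append the flag after the caller-supplied flags.
    set -- "$@" --storage-policy "$rp_policy"
  fi
  $AZ networkcloud baremetalmachine replace "$@"
}
```

For example, `bmm_replace Preserve --name <bareMetalMachineName> ...` prints the command with `--storage-policy Preserve` appended, while `bmm_replace "" --name <bareMetalMachineName> ...` leaves the flag out so the default applies.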
 
 **The following Azure CLI command will uncordon the specified bareMetalMachineName.**