articles/operator-nexus/troubleshoot-reboot-reimage-replace.md
Lines changed: 79 additions & 14 deletions
Original file line number
Diff line number
Diff line change
@@ -13,6 +13,14 @@ ms.author: ekarandjeff
13
13
14
14
This article describes how to troubleshoot server problems by using restart, reimage, and replace actions on Azure Operator Nexus bare metal machines (BMMs). You might need to take these actions on your server for maintenance reasons, which causes a brief disruption to specific BMMs.
15
15
16
+
## In this article
17
+
- [Prerequisites](#prerequisites)
18
+
- [Identify the corrective action](#identify-the-corrective-action)
19
+
- [Troubleshoot with a restart action](#troubleshoot-with-a-restart-action)
20
+
- [Troubleshoot with a reimage action](#troubleshoot-with-a-reimage-action)
21
+
- [Troubleshoot with a replace action](#troubleshoot-with-a-replace-action)
22
+
- [Summary](#summary)
23
+
16
24
The time required to complete each of these actions is similar. Restarting is the fastest, whereas replacing takes slightly longer. All three actions are simple and efficient methods for troubleshooting.
17
25
18
26
> [!CAUTION]
@@ -35,48 +43,73 @@ The time required to complete each of these actions is similar. Restarting is th
35
43
36
44
## Identify the corrective action
37
45
38
-
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. Restarting or reimaging a BMM can be both efficient and effective for resolving issues or restoring the software to a known-good state. In cases where one or more hardware components fail on the server, it may be necessary to replace the BMM entirely. This article outlines the best practices for each of these three actions.
46
+
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it is essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
39
47
40
-
Troubleshooting technical problems requires a systematic approach. One effective method is to start with the least invasive solution and work your way up to more complex and drastic measures, if necessary.
48
+
1. **Restart** - Least invasive method, best for temporary glitches or unresponsive VMs
49
+
2. **Reimage** - Intermediate solution, restores OS to known-good state without affecting data
50
+
3. **Replace** - Most significant action, required for hardware component failures
41
51
42
-
The first step in troubleshooting is to try restarting the device or system. Restarting can help to clear up any temporary glitches or errors that might be causing the problem.
52
+
### Troubleshooting decision tree
43
53
44
-
If restarting does not solve the problem, the next step is to try reimaging the device or system.
54
+
Follow this escalation path when troubleshooting BMM issues:
45
55
46
-
If reimaging does not solve the problem, the final step is to replace the faulty hardware component. While replacement is a more significant measure, it may be required if the issue stems from a hardware defect.
56
+
| Problem | First action | If problem persists | If still unresolved |
Keep in mind that these troubleshooting methods might not always be effective, and other factors in play might require a different approach.
63
+
Start with the least invasive solution (restart) and escalate to more complex measures only if necessary. Always validate that the issue is resolved after each corrective action.
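The escalation order described above can be written down as a small lookup. This is only a sketch: `next_action` is a hypothetical helper name, not part of the Azure CLI, and it encodes nothing beyond the restart, reimage, replace sequence.

```shell
# Hypothetical helper: print the next corrective action in the
# restart -> reimage -> replace escalation path.
# Pass an empty string when no action has been tried yet.
# Returns 1 (and prints nothing) when there is no further step.
next_action() {
  case "$1" in
    "")      echo "restart" ;;
    restart) echo "reimage" ;;
    reimage) echo "replace" ;;
    *)       return 1 ;;
  esac
}
```

For example, `next_action restart` prints `reimage`. Validate whether the issue is resolved before asking for the next step.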
49
64
50
65
## Troubleshoot with a restart action
51
66
52
67
Restarting a BMM restarts the server through a simple API call. This action can be useful for troubleshooting problems when tenant virtual machines on the host aren't responsive or are otherwise stuck.
53
68
54
69
A restart is typically the starting point for mitigating a problem.
55
70
56
-
***The following Azure CLI command will `power-off` the specified bareMetalMachineName.***
71
+
### Restart workflow
72
+
73
+
1. **Assess impact** - Determine if restarting the BMM will impact critical workloads
74
+
2. **Power off** - If needed, power off the BMM (optional)
75
+
3. **Start or restart** - Either start a powered-off BMM or restart a running BMM
76
+
4. **Verify status** - Check if the BMM is back online and functioning properly
77
+
78
+
> [!NOTE]
79
+
> The restart operation is the fastest recovery method but may not resolve issues related to OS corruption or hardware failures.
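The "start or restart" step of the workflow could be scripted roughly as follows. This is a sketch under assumptions: `start_or_restart_bmm` is a hypothetical wrapper name, and it assumes the BMM resource exposes a `powerState` field that reads `Off` for a powered-off machine. Confirm both against your environment before relying on it.

```shell
# Hypothetical wrapper: start a powered-off BMM, or restart a running one.
# Arguments: <bareMetalMachineName> <resourceGroup> <subscriptionID>
start_or_restart_bmm() {
  local name="$1" rg="$2" sub="$3" power_state

  # Assumption: powerState reports "Off" for a powered-off BMM.
  power_state="$(az networkcloud baremetalmachine show \
    --name "$name" --resource-group "$rg" --subscription "$sub" \
    --query "powerState" --output tsv)" || return 1

  if [ "$power_state" = "Off" ]; then
    az networkcloud baremetalmachine start \
      --name "$name" --resource-group "$rg" --subscription "$sub"
  else
    az networkcloud baremetalmachine restart \
      --name "$name" --resource-group "$rg" --subscription "$sub"
  fi
}
```

Assess the impact on running workloads first. The wrapper does not check what is scheduled on the machine.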
80
+
81
+
**The following Azure CLI command will `power-off` the specified bareMetalMachineName:**
57
82
```
az networkcloud baremetalmachine power-off \
  --name <bareMetalMachineName> \
  --resource-group "<resourceGroup>" \
  --subscription <subscriptionID>
```
63
88
64
-
***The following Azure CLI command will `start` the specified bareMetalMachineName.***
89
+
**The following Azure CLI command will `start` the specified bareMetalMachineName:**
65
90
```
az networkcloud baremetalmachine start \
  --name <bareMetalMachineName> \
  --resource-group "<resourceGroup>" \
  --subscription <subscriptionID>
```
71
96
72
-
***The following Azure CLI command will `restart` the specified bareMetalMachineName.***
97
+
**The following Azure CLI command will `restart` the specified bareMetalMachineName:**
73
98
```
az networkcloud baremetalmachine restart \
  --name <bareMetalMachineName> \
  --resource-group "<resourceGroup>" \
  --subscription <subscriptionID>
```
79
104
105
+
**To verify the BMM status after restart:**
106
+
```
az networkcloud baremetalmachine show \
  --name <bareMetalMachineName> \
  --resource-group "<resourceGroup>" \
  --subscription <subscriptionID> \
  --query "provisioningState"
```
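A single status read may catch the BMM mid-transition, so one option is to poll until the field reaches the value you expect. The following is a minimal sketch: `wait_for_bmm_field` is a hypothetical helper, and the attempt count and delay defaults are arbitrary placeholders rather than recommended values.

```shell
# Hypothetical helper: poll one field of the BMM resource until it equals
# the expected value.
# Arguments: <name> <resourceGroup> <subscriptionID> <field> <expected>
#            [attempts] [delaySeconds]
wait_for_bmm_field() {
  local name="$1" rg="$2" sub="$3" field="$4" expected="$5"
  local attempts="${6:-30}" delay="${7:-20}" value i
  i=1
  while [ "$i" -le "$attempts" ]; do
    value="$(az networkcloud baremetalmachine show \
      --name "$name" --resource-group "$rg" --subscription "$sub" \
      --query "$field" --output tsv 2>/dev/null)"
    if [ "$value" = "$expected" ]; then
      echo "$field is $expected"
      return 0
    fi
    # Progress goes to stderr so stdout carries only the final result.
    echo "attempt $i/$attempts: $field is '${value:-unknown}'" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A call such as `wait_for_bmm_field <bareMetalMachineName> <resourceGroup> <subscriptionID> provisioningState Succeeded` returns a nonzero status if the value is never observed, which makes it usable as a gate in a larger script.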
80
113
81
114
## Troubleshoot with a reimage action
82
115
@@ -86,14 +119,23 @@ The reimage action can be useful for troubleshooting problems by restoring the O
86
119
87
120
A reimage action is the best practice for ensuring the integrity of the BMM with the lowest operational risk.
88
121
89
-
As a best practice, make sure the BMM's workloads are drained using the cordon command, with evacuate "True", before executing the reimage command.
122
+
### Reimage workflow
123
+
124
+
1. **Verify running workloads** - Before reimaging, check what workloads are running on the BMM
125
+
2. **Cordon and evacuate workloads** - Drain the BMM of workloads
126
+
3. **Perform reimage** - Execute the reimage operation
127
+
4. **Uncordon** - Make the BMM schedulable again after reimage completes
128
+
129
+
> [!WARNING]
130
+
> Running more than one `baremetalmachine replace` or `reimage` command at the same time, or running a `replace`
131
+
> at the same time as a `reimage` will leave servers in a nonworking state. Make sure one operation has fully completed before starting another.
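One way to keep these operations strictly sequential is to run them from a single function that stops at the first failure. This is a sketch, not a supported procedure: `reimage_bmm` is a hypothetical wrapper name, and it assumes each `az` call is left to block until its operation completes (that is, `--no-wait` is not passed).

```shell
# Hypothetical wrapper: cordon (with evacuation), reimage, then uncordon
# a single BMM, one operation at a time.
# Arguments: <bareMetalMachineName> <resourceGroup> <subscriptionID>
reimage_bmm() {
  local name="$1" rg="$2" sub="$3"

  # Drain workloads before touching the OS image.
  az networkcloud baremetalmachine cordon \
    --evacuate "True" \
    --name "$name" --resource-group "$rg" --subscription "$sub" || return 1

  # Only reimage if the cordon succeeded.
  az networkcloud baremetalmachine reimage \
    --name "$name" --resource-group "$rg" --subscription "$sub" || return 1

  # Only uncordon once the reimage has finished successfully.
  az networkcloud baremetalmachine uncordon \
    --name "$name" --resource-group "$rg" --subscription "$sub"
}
```

The wrapper serializes operations for one BMM only. It does not stop someone else from starting a `replace` or `reimage` elsewhere, so coordinate with other operators separately.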
90
132
91
133
**To identify if any workloads are currently running on a BMM, run the following command:**
92
134
93
135
**For Virtual Machines:**
94
136
```azurecli
95
-
az networkcloud baremetalmachine show -n <nodeName> /
96
-
--resource-group <resourceGroup> /
137
+
az networkcloud baremetalmachine show -n <nodeName> \
@@ -137,7 +179,13 @@ A hardware validation process is invoked to ensure the integrity of the physical
137
179
> [!IMPORTANT]
138
180
> Starting with the 2024-07-01 GA API version, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are additional physical disk and/or RAID controller alerts.
139
181
140
-
As a best practice, first issue a `cordon` command to remove the bare metal machine from workload scheduling and then shut down the BMM in advance of physical repairs.
182
+
### Replace workflow
183
+
184
+
1. **Cordon and evacuate** - Remove workloads from the BMM before physical repair
185
+
2. **Perform physical repairs** - Replace hardware components as needed
186
+
3. **Execute replace command** - Run the replace command with required parameters
187
+
4. **Uncordon** - Make the BMM schedulable again after replacement completes
188
+
5. **Verify status** - Check that the BMM is properly functioning
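The final verification step might look like the sketch below. Both function names are hypothetical, and the field values are assumptions: it treats a BMM as back in service when `cordonStatus` reads `Uncordoned` and `readyState` reads `True`. Check these against the output of `az networkcloud baremetalmachine show` in your environment.

```shell
# Hypothetical helper: read a single field from the BMM resource.
# Arguments: <name> <resourceGroup> <subscriptionID> <field>
bmm_field() {
  az networkcloud baremetalmachine show \
    --name "$1" --resource-group "$2" --subscription "$3" \
    --query "$4" --output tsv
}

# Hypothetical check: succeed only if the BMM is uncordoned and ready.
# Assumption: cordonStatus is "Uncordoned" and readyState is "True"
# for a machine that is schedulable and healthy.
bmm_in_service() {
  local cordon ready
  cordon="$(bmm_field "$1" "$2" "$3" cordonStatus)" || return 1
  ready="$(bmm_field "$1" "$2" "$3" readyState)" || return 1
  [ "$cordon" = "Uncordoned" ] && [ "$ready" = "True" ]
}
```

The check returns a nonzero status in every other case, including when the `az` call itself fails.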
141
189
142
190
**The following Azure CLI command will `cordon` the specified bareMetalMachineName.**
143
191
```
@@ -148,6 +196,8 @@ az networkcloud baremetalmachine cordon \
148
196
--subscription <subscriptionID>
149
197
```
150
198
199
+
### Hardware component replacement guide
200
+
151
201
When you're performing a physical repair of a hot-swappable power supply, a replace action is not required because the BMM host will continue to function normally after the repair.
152
202
153
203
When you're performing the following physical repairs, we recommend a replace action, though it is not necessary to bring the BMM back into service:
@@ -193,7 +243,22 @@ az networkcloud baremetalmachine uncordon \
193
243
194
244
## Summary
195
245
196
-
Restarting, reimaging, and replacing are effective troubleshooting methods that you can use to address technical problems. However, it's important to have a systematic approach and to consider other factors before you try any drastic measures.
246
+
Restarting, reimaging, and replacing are effective troubleshooting methods for addressing Azure Operator Nexus server problems. Here's a quick reference guide: