articles/operator-nexus/howto-bare-metal-best-practices.md
Lines changed: 50 additions & 48 deletions
@@ -1,7 +1,7 @@
1
1
---
2
2
title: Best practices for Bare Metal Machine operations
3
3
description: Steps that should be taken before executing any Bare Metal Machine replace, or reimage actions. Highlight essential prerequisites and common pitfalls to avoid.
4
-
ms.date: 05/22/2025
4
+
ms.date: 08/12/2025
5
5
ms.topic: how-to
6
6
ms.service: azure-operator-nexus
7
7
ms.custom: template-how-to, best-practices
@@ -67,7 +67,7 @@ See related articles:
67
67
-[How to monitor interface In and Out packet rate for network fabric devices]
68
68
-[How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric].
69
69
70
-
Evaluate for any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems.
70
+
Evaluate for any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems.
71
71
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
72
72
73
73
#### Determine if firmware update jobs are running
@@ -88,7 +88,7 @@ az networkcloud baremetalmachine run-read-command \
88
88
--output-directory .
89
89
```
90
90
91
-
Here's an example output from the `racadm jobqueue view` command which shows `Firmware Update`.
91
+
Here's an example output from the `racadm jobqueue view` command that shows `Firmware Update`.
92
92
93
93
```
94
94
[Job ID=JID_833540920066]
@@ -125,58 +125,60 @@ Message=[SYS043: Successfully exported Server Configuration Profile]
125
125
Percent Complete=[100]
126
126
```
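As a rough aid for the check above, the following sketch scans a saved copy of the `racadm jobqueue view` output for `Firmware Update` entries. The function names `has_firmware_job` and `list_firmware_jobs`, and the file name `jobqueue.txt`, are hypothetical. A match only indicates that such a job is listed; review the job's `Percent Complete` value to decide whether it's still running.

```bash
# Hypothetical helper: succeed if the saved jobqueue output lists a Firmware Update job.
# $1 is the path to a text file containing `racadm jobqueue view` output.
has_firmware_job() {
  grep -q 'Firmware Update' "$1"
}

# Hypothetical helper: print each matching line with its line number for manual review.
list_firmware_jobs() {
  grep -n 'Firmware Update' "$1"
}
```

For example, `has_firmware_job jobqueue.txt && list_firmware_jobs jobqueue.txt` prints the matching lines only when at least one `Firmware Update` job is present.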
127
127
128
-
#### Monitor progress using `run-read-command`
128
+
#### Monitor status in Bare Metal Machine JSON properties
129
129
130
-
In version 2506.2 and above, you can monitor the progress of long running Bare Metal Machine actions using a `run-read-command`.
130
+
In version 2509.1 and above, you can view the status of any recent or in-progress actions in the `JSON View` of the corresponding Bare Metal Machine (Operator Nexus) resource. This information is visible in the `actionStates` field of the Bare Metal Machine JSON properties when using API version `2025-07-01-preview` or higher. The following information is available:
131
131
132
-
- Some long running actions such as `Replace` or `Reimage` are composed of multiple steps, for example, `Hardware Validation`, `Deprovisioning`, or `Provisioning`.
133
-
- The following `run-read-command` shows how to view the different steps in each action, and the progress or status of each step including any potential errors.
134
-
- This information is available on the BareMetalMachine kubernetes resource during or after the action is completed.
135
-
- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
132
+
- Start and end time of the action.
133
+
- Status of the action (`Succeeded`, `Failed`, or `InProgress`).
134
+
- Any extra context or error message associated with the status.
135
+
- The Correlation ID for the original operation, as shown in the Azure Activity log.
136
+
- An ordered list of steps and their status, such as `Hardware Validation`, `Deprovisioning`, `Provisioning`, and `Cloud Init` for a BMM Replace action.
136
137
137
-
Example `run-read-command` to view action progress on Bare Metal Machine `rack2compute08`:
138
+
The most recent occurrence of each action type is shown, including any currently in-progress action.
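The same field can probably also be read from the command line. The following is a minimal sketch, assuming the `actionStates` field sits under `properties` in the resource JSON and that the generic `az rest` command is used with API version `2025-07-01-preview`. The helper names `bmm_resource_url` and `show_action_states` are hypothetical and aren't part of the Azure CLI.

```bash
# Hypothetical helper: build the Azure Resource Manager URL for a Bare Metal Machine.
# Arguments: subscription ID, resource group, machine name, optional API version.
bmm_resource_url() {
  printf 'https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.NetworkCloud/bareMetalMachines/%s?api-version=%s\n' \
    "$1" "$2" "$3" "${4:-2025-07-01-preview}"
}

# Hypothetical helper: fetch the resource and print only the actionStates field.
# Assumes a signed-in Azure CLI; set AZ to another command (for example, echo) for a dry run.
# Assumption: actionStates is located under "properties" in the returned JSON.
show_action_states() {
  "${AZ:-az}" rest --method get \
    --url "$(bmm_resource_url "$1" "$2" "$3")" \
    --query "properties.actionStates"
}
```

For example, `show_action_states <subscriptionId> <resourceGroup> rack2compute08` would print the action states for that machine, if the field is populated.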
138
139
139
-
```azurecli
140
-
az networkcloud baremetalmachine run-read-command \
## Best practices for a Bare Metal Machine reimage
@@ -200,7 +202,7 @@ Before initiating any `reimage` operation, ensure the following preconditions ar
200
202
201
203
- Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
202
204
- Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
203
-
- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
205
+
- Evaluate any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
204
206
For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
205
207
- If the Bare Metal Machine reports a failed state with the reason of hardware validation (seen in the Bare Metal Machine `Detailed Status` and `Detailed Status Message` fields), then the Bare Metal Machine needs a `replace` instead.
206
208
See the [Best Practices for a Bare Metal Machine Replace](#best-practices-for-a-bare-metal-machine-replace).
@@ -228,7 +230,7 @@ Before initiating any `replace` operation, ensure the following preconditions ar
228
230
229
231
- Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
230
232
- Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
231
-
- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
233
+
- Evaluate any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
232
234
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
233
235
- Validate Bare Metal Machine is powered on.
234
236
- Validate that there are no running firmware upgrade jobs.
> In version 2509.1 and above, you can monitor recent or in-progress BMM actions in the Azure portal. For more information, see [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
@@ -82,7 +85,7 @@ Existing workloads continue to run on the Bare Metal Machine unless the workload
82
85
83
86
### Drain Bare Metal Machine workloads
84
87
85
-
The cordon command supports the `evacuate` parameter which its default value `False` means that the `cordon` command prevents scheduling new workloads.
88
+
The `cordon` command supports the `evacuate` parameter. Its default value, `False`, means that the `cordon` command prevents scheduling new workloads.
86
89
To drain workloads with the `cordon` command, the `evacuate` parameter must be set to `True`.
87
90
The workloads running on the Bare Metal Machine are `stopped` and the Bare Metal Machine is set to `pending` state.
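The drain described above can be sketched as a small wrapper around the `cordon` command. The function name `drain_bmm` and the `AZ` override variable are hypothetical conveniences for illustration; the underlying call is `az networkcloud baremetalmachine cordon` with `--evacuate "True"`.

```bash
# Cordon a Bare Metal Machine and evacuate (drain) its workloads.
# Arguments: machine name, resource group.
# Set AZ to another command (for example, echo) to preview the call without running it.
drain_bmm() {
  "${AZ:-az}" networkcloud baremetalmachine cordon \
    --evacuate "True" \
    --name "$1" \
    --resource-group "$2"
}
```

Running `AZ=echo drain_bmm <bareMetalMachineName> <resourceGroup>` prints the arguments that would be passed to the Azure CLI, which can be useful for review before draining a production machine.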
88
91
@@ -173,7 +176,7 @@ az networkcloud baremetalmachine replace \
173
176
174
177
If the `replace` action fails due to a hardware validation failure, the specific error or test failure is shown in the `replace` response, as shown in the following examples.
175
178
This information can also be found in the Activity Log for the Bare Metal Machine (Operator Nexus).
176
-
The error code and error message are included the JSON properties of the corresponding `BareMetalMachines_Replace` operation.
179
+
The error code and error message are also included in the JSON properties of the corresponding `BareMetalMachines_Replace` operation.
177
180
178
181
**Example 1: Hardware validation fails due to invalid Key Vault URI for Baseboard Management Controller (BMC) credentials**
articles/operator-nexus/troubleshoot-bare-metal-machine-warning.md
Lines changed: 8 additions & 5 deletions
@@ -4,15 +4,15 @@ description: Troubleshooting guide for Bare Metal Machines Warning status messag
4
4
ms.service: azure-operator-nexus
5
5
ms.custom: azure-operator-nexus
6
6
ms.topic: troubleshooting
7
-
ms.date: 04/17/2025
7
+
ms.date: 08/12/2025
8
8
author: robertstarling
9
9
ms.author: robstarling
10
10
ms.reviewer: ekarandjeff
11
11
---
12
12
13
13
# Troubleshoot _'Warning'_ detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine
14
14
15
-
This document provides basic troubleshooting information for Bare Metal Machine (BMM) resources which are reporting a _Warning_ message in the BMM detailed status message.
15
+
This document provides basic troubleshooting information for Bare Metal Machine (BMM) resources that are reporting a _Warning_ message in the BMM detailed status message.
16
16
17
17
## Symptoms
18
18
@@ -92,6 +92,8 @@ Review the `lastTransitionTime` and `message` fields for more information about
92
92
}
93
93
```
94
94
95
+
You can also check for any potentially related recent lifecycle actions (such as Restart or Power off actions) in the Azure portal. See [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties). If available, this information is also visible in the output of the previous `run-read-command` in the `actionStates` status field.
96
+
95
97
## `Warning: PXE port is unhealthy`
96
98
97
99
This message in the BMM _Detailed status message_ field indicates a problem with network connectivity on the Preboot Execution Environment (PXE) Ethernet port on the underlying compute host.
@@ -114,8 +116,8 @@ To troubleshoot this issue:
114
116
- review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
115
117
- this information should identify the specific root cause (port down or port flapping) and approximate time of the issue
116
118
- check the Ethernet cabling and Top Of Rack (TOR) switch for the affected PXE port
117
-
- check for any other BMMs which are also reporting unhealthy PXE status or other network-related problems
118
-
- check for any recent deployment or infrastructure changes which coincide with the time of failure.
119
+
- check for any other BMMs that are also reporting unhealthy PXE status or other network-related problems
120
+
- check for any recent deployment or infrastructure changes that coincide with the time of failure.
119
121
120
122
**Example `conditions` output for PXE warning**
121
123
@@ -143,10 +145,11 @@ This message can indicate an issue with the underlying compute host or baseboard
143
145
To troubleshoot this issue:
144
146
145
147
- review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
148
+
- review the `actionStates` status field of the kubernetes `bmm` object for any recently initiated lifecycle actions (such as a Restart or Power off) as described in the [Troubleshooting](#troubleshooting) section
146
149
- this information should identify the approximate time of the issue and any other available details
147
150
- check the power feed, power cables, and physical hardware for the specified BMM
148
151
- check whether any other BMMs are also reporting an unexpected power state Warning, which might indicate a broader issue with the underlying infrastructure
149
-
- check for any recent deployment or infrastructure changes which coincide with the time of failure
152
+
- check for any recent deployment or infrastructure changes that coincide with the time of failure
150
153
- review the power state and logs on the BMC for the affected host.
151
154
152
155
For more information about logging into the BMC, see [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
Copy file name to clipboardExpand all lines: articles/operator-nexus/troubleshoot-reboot-reimage-replace.md
+18-15Lines changed: 18 additions & 15 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -4,7 +4,7 @@ description: Troubleshoot cluster bare metal machines with Restart, Reimage, Rep
4
4
ms.service: azure-operator-nexus
5
5
ms.custom: troubleshooting
6
6
ms.topic: troubleshooting
7
-
ms.date: 04/03/2025
7
+
ms.date: 08/12/2025
8
8
author: eak13
9
9
ms.author: ekarandjeff
10
10
---
@@ -33,6 +33,9 @@ The time required to complete each of these actions is similar. Restarting is th
33
33
>
34
34
> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it breaks the healthy quorum threshold of the Kubernetes Control Plane.
35
35
36
+
> [!TIP]
37
+
> In version 2509.1 and above, you can monitor recent or in-progress BMM actions in the Azure portal. For more information, see [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
38
+
36
39
## Identify the corrective action
37
40
38
41
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it's essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
@@ -45,12 +48,12 @@ When troubleshooting a BMM for failures and determining the most appropriate cor
45
48
46
49
Follow this escalation path when troubleshooting BMM issues:
47
50
48
-
| Problem | First action | If problem persists | If still unresolved |
The recommended approach is to start with the least invasive solution (restart) and escalate to more complex measures only if necessary. Always validate that the issue is resolved after each corrective action.
56
59
@@ -177,7 +180,7 @@ Servers contain many physical components that can fail over time. It's important
177
180
A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the Tenant data isn't modified during replacement.
178
181
179
182
> [!IMPORTANT]
180
-
> When run with default options, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are other physical disk and/or RAID controllers alerts. Starting with the 2025-07-01preview version of the NetworkCloud API, and generally available with the 2025-09-01 GA version, use `replace` with `storage-policy="Preserve"` to retain virtual disk data.
183
+
> When run with default options, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are other physical disk or RAID controller alerts. Starting with the `2025-07-01-preview` version of the NetworkCloud API, and generally available with the `2025-09-01` GA version, use `replace` with `storage-policy="Preserve"` to retain virtual disk data.
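As an illustration of the storage policy option mentioned in the note, the following is a minimal sketch only. It assumes the Azure CLI exposes the setting as `--storage-policy`, and it omits other `replace` parameters (such as BMC credentials and MAC addresses) that might be needed in practice. The function name `replace_preserving_disks` and the `AZ` override variable are hypothetical.

```bash
# Sketch only: other `replace` parameters (such as BMC credentials and MAC addresses)
# are omitted here and might be needed in practice.
# Assumption: the CLI exposes the storage policy option as --storage-policy.
# Arguments: machine name, resource group.
replace_preserving_disks() {
  "${AZ:-az}" networkcloud baremetalmachine replace \
    --name "$1" \
    --resource-group "$2" \
    --storage-policy "Preserve"
}
```

Running `AZ=echo replace_preserving_disks <bareMetalMachineName> <resourceGroup>` prints the arguments without invoking the Azure CLI, so the call can be checked before any data-affecting action is taken.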
181
184
182
185
### Replace workflow
183
186
@@ -210,7 +213,7 @@ When you're performing the following physical repairs, we recommend a replace ac
210
213
- Transceiver
211
214
- Ethernet or fiber cable replacement
212
215
213
-
When you're performing the following physical repairs, a replace action ***is required*** to bring the BMM back into service:
216
+
When you're performing the following physical repairs, a replace action **_is required_** to bring the BMM back into service:
214
217
215
218
- Backplane
216
219
- System board
@@ -220,7 +223,7 @@ When you're performing the following physical repairs, a replace action ***is re
220
223
- Broadcom embedded NIC
221
224
222
225
After physical repairs are completed, perform a replace action.
223
-
226
+
224
227
**The following Azure CLI command will `replace` the specified bareMetalMachineName.**
225
228
226
229
```azurecli
@@ -249,11 +252,11 @@ az networkcloud baremetalmachine uncordon \
249
252
250
253
Restarting, reimaging, and replacing are effective troubleshooting methods for addressing Azure Operator Nexus server problems. Here's a quick reference guide: