Commit 02056db

Merge pull request #304144 from robertstarling/ado2205530-actionstate-json-properties
BMM ActionState is now visible in JSON properties
2 parents d5a0bfc + 68d96c3 commit 02056db

File tree

4 files changed

+82
-71
lines changed

articles/operator-nexus/howto-bare-metal-best-practices.md

Lines changed: 50 additions & 48 deletions
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,7 @@
11
---
22
title: Best practices for Bare Metal Machine operations
33
description: Steps that should be taken before executing any Bare Metal Machine replace, or reimage actions. Highlight essential prerequisites and common pitfalls to avoid.
4-
ms.date: 05/22/2025
4+
ms.date: 08/12/2025
55
ms.topic: how-to
66
ms.service: azure-operator-nexus
77
ms.custom: template-how-to, best-practices
@@ -67,7 +67,7 @@ See related articles:
6767
- [How to monitor interface In and Out packet rate for network fabric devices]
6868
- [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric].
6969

70-
Evaluate for any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems.
70+
Evaluate for any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems.
7171
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
7272

7373
#### Determine if firmware update jobs are running
@@ -88,7 +88,7 @@ az networkcloud baremetalmachine run-read-command \
8888
--output-directory .
8989
```
9090

91-
Here's an example output from the `racadm jobqueue view` command which shows `Firmware Update`.
91+
Here's an example output from the `racadm jobqueue view` command that shows `Firmware Update`.
9292

9393
```
9494
[Job ID=JID_833540920066]
@@ -125,58 +125,60 @@ Message=[SYS043: Successfully exported Server Configuration Profile]
125125
Percent Complete=[100]
126126
```
127127

128-
#### Monitor progress using `run-read-command`
128+
#### Monitor status in Bare Metal Machine JSON properties
129129

130-
In version 2506.2 and above, you can monitor the progress of long running Bare Metal Machine actions using a `run-read-command`.
130+
In version 2509.1 and above, you can view the status of any recent or in progress actions in the `JSON View` of the corresponding Bare Metal Machine (Operator Nexus) resource. This information is visible in the `actionStates` field of the Bare Metal Machine JSON properties, when using API Version `2025-07-01-preview` or higher. The following information is available.
131131

132-
- Some long running actions such as `Replace` or `Reimage` are composed of multiple steps, for example, `Hardware Validation`, `Deprovisioning`, or `Provisioning`.
133-
- The following `run-read-command` shows how to view the different steps in each action, and the progress or status of each step including any potential errors.
134-
- This information is available on the BareMetalMachine kubernetes resource during or after the action is completed.
135-
- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
132+
- Start and end time of the action.
133+
- Status of the action (`Succeeded`, `Failed`, or `InProgress`).
134+
- Any extra context or error message associated with the status.
135+
- The Correlation ID for the original operation, as shown in the Azure Activity log.
136+
- An ordered list of steps and their status - such as `Hardware Validation`, `Deprovisioning`, `Provisioning`, and `Cloud Init` for a BMM Replace action.
136137

137-
Example `run-read-command` to view action progress on Bare Metal Machine `rack2compute08`:
138+
The most recent occurrence of each action type is shown, including any currently in-progress action.
138139

139-
```azurecli
140-
az networkcloud baremetalmachine run-read-command \
141-
-g <ResourceGroup_Name> \
142-
-n <Control Node BMM Name> \
143-
--limit-time-seconds 60 \
144-
--commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute08,-o,json]}]" \
145-
--output-directory .
146-
```
147-
148-
Example output for a Replace action:
140+
Example `actionStates` output for a Bare Metal Machine Replace action:
149141

150142
```json
151-
[
152-
{
153-
"correlationId": "961a6154-4342-4831-9693-27314671e6a7",
154-
"endTime": "2025-05-15T21:20:44Z",
155-
"startTime": "2025-05-15T20:16:19Z",
156-
"status": "Completed",
157-
"stepStates": [
158-
{
159-
"endTime": "2025-05-15T20:25:51Z",
160-
"name": "Hardware Validation",
161-
"startTime": "2025-05-15T20:16:19Z",
162-
"status": "Completed"
163-
},
164-
{
165-
"endTime": "2025-05-15T20:26:21Z",
166-
"name": "Deprovisioning",
167-
"startTime": "2025-05-15T20:25:51Z",
168-
"status": "Completed"
169-
},
143+
{
144+
"properties": {
145+
"actionStates": [
170146
{
171-
"endTime": "2025-05-15T21:20:44Z",
172-
"name": "Provisioning",
173-
"startTime": "2025-05-15T20:26:21Z",
174-
"status": "Completed"
147+
"actionType": "Microsoft.NetworkCloud/bareMetalMachines/replace",
148+
"correlationId": "25d678cb-353c-41f4-8231-1135064ae582",
149+
"endTime": "2025-08-12T17:00:58Z",
150+
"startTime": "2025-08-12T15:32:12Z",
151+
"status": "Completed",
152+
"stepStates": [
153+
{
154+
"endTime": "2025-08-12T15:41:22Z",
155+
"startTime": "2025-08-12T15:32:12Z",
156+
"status": "Completed",
157+
"stepName": "Hardware Validation"
158+
},
159+
{
160+
"endTime": "2025-08-12T16:25:39Z",
161+
"startTime": "2025-08-12T15:41:22Z",
162+
"status": "Completed",
163+
"stepName": "Deprovisioning"
164+
},
165+
{
166+
"endTime": "2025-08-12T16:48:27Z",
167+
"startTime": "2025-08-12T16:25:39Z",
168+
"status": "Completed",
169+
"stepName": "Provisioning"
170+
},
171+
{
172+
"endTime": "2025-08-12T17:00:58Z",
173+
"startTime": "2025-08-12T16:48:27Z",
174+
"status": "Completed",
175+
"stepName": "Cloud Init"
176+
}
177+
]
175178
}
176-
],
177-
"type": "Microsoft.NetworkCloud/bareMetalMachines/replace"
179+
]
178180
}
179-
]
181+
}
180182
```
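
As a rough illustration of how the `actionStates` data above could be consumed, the following Python sketch summarizes each action and the duration of each step. It assumes the JSON has already been retrieved (for example, copied from the portal `JSON View`) and that the field names match the example output. The `summarize_action_states` helper and the abbreviated `SAMPLE` document are hypothetical and aren't part of any Azure SDK.

```python
import json
from datetime import datetime

# Abbreviated copy of the example output above (first two steps only).
SAMPLE = """
{
  "properties": {
    "actionStates": [
      {
        "actionType": "Microsoft.NetworkCloud/bareMetalMachines/replace",
        "correlationId": "25d678cb-353c-41f4-8231-1135064ae582",
        "endTime": "2025-08-12T17:00:58Z",
        "startTime": "2025-08-12T15:32:12Z",
        "status": "Completed",
        "stepStates": [
          {
            "endTime": "2025-08-12T15:41:22Z",
            "startTime": "2025-08-12T15:32:12Z",
            "status": "Completed",
            "stepName": "Hardware Validation"
          },
          {
            "endTime": "2025-08-12T16:25:39Z",
            "startTime": "2025-08-12T15:41:22Z",
            "status": "Completed",
            "stepName": "Deprovisioning"
          }
        ]
      }
    ]
  }
}
"""


def parse_time(value):
    """Parse the UTC timestamps used in the example (for instance 2025-08-12T15:32:12Z)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def summarize_action_states(resource):
    """Return one summary per action, with (step name, status, seconds) per step.

    Hypothetical helper: the duration is None when a step has no end time yet.
    """
    summaries = []
    for action in resource.get("properties", {}).get("actionStates", []):
        steps = []
        for step in action.get("stepStates", []):
            start, end = step.get("startTime"), step.get("endTime")
            seconds = None
            if start and end:
                seconds = int((parse_time(end) - parse_time(start)).total_seconds())
            steps.append((step.get("stepName"), step.get("status"), seconds))
        summaries.append(
            {
                # Keep only the final path segment, for example "replace".
                "action": action.get("actionType", "").rsplit("/", 1)[-1],
                "status": action.get("status"),
                "steps": steps,
            }
        )
    return summaries


if __name__ == "__main__":
    for summary in summarize_action_states(json.loads(SAMPLE)):
        print(summary["action"], summary["status"])
        for name, status, seconds in summary["steps"]:
            print(f"  {name}: {status} ({seconds} s)")
```

Reporting durations in whole seconds keeps the comparison between steps simple. A step without an end time is reported with a duration of `None` rather than a guessed value.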
181183

182184
## Best practices for a Bare Metal Machine reimage
@@ -200,7 +202,7 @@ Before initiating any `reimage` operation, ensure the following preconditions ar
200202

201203
- Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
202204
- Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
203-
- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
205+
- Evaluate any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
204206
For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
205207
- If the Bare Metal Machine reports a failed state with the reason of hardware validation (seen in the Bare Metal Machine `Detailed Status` and `Detailed Status Message` fields), then the Bare Metal Machine needs a `replace` instead.
206208
See the [Best Practices for a Bare Metal Machine Replace](#best-practices-for-a-bare-metal-machine-replace).
@@ -228,7 +230,7 @@ Before initiating any `replace` operation, ensure the following preconditions ar
228230

229231
- Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
230232
- Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
231-
- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
233+
- Evaluate any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
232234
For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
233235
- Validate Bare Metal Machine is powered on.
234236
- Validate that there are no running firmware upgrade jobs.

articles/operator-nexus/howto-baremetal-functions.md

Lines changed: 6 additions & 3 deletions
@@ -5,7 +5,7 @@ author: eak13
55
ms.author: ekarandjeff
66
ms.service: azure-operator-nexus
77
ms.topic: how-to
8-
ms.date: 04/02/2025
8+
ms.date: 08/12/2025
99
ms.custom: template-how-to, devx-track-azurecli
1010
---
1111

@@ -32,6 +32,9 @@ The Cordon action without the `evacuate` parameter isn't considered disruptive w
3232

3333
[!INCLUDE [important-donot-disrupt-kcpnodes](./includes/baremetal-machines/important-donot-disrupt-kcpnodes.md)]
3434

35+
> [!TIP]
36+
> In version 2509.1 and above, you can monitor recent or in-progress BMM actions in the Azure portal. For more information, see [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
37+
3538
[!INCLUDE [prerequisites-azure-cli-bare-metal-machine-actions](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]
3639

3740
## Power off a Bare Metal Machine
@@ -82,7 +85,7 @@ Existing workloads continue to run on the Bare Metal Machine unless the workload
8285

8386
### Drain Bare Metal Machine workloads
8487

85-
The cordon command supports the `evacuate` parameter which its default value `False` means that the `cordon` command prevents scheduling new workloads.
88+
The `cordon` command supports the `evacuate` parameter. Its default value, `False`, means that the `cordon` command only prevents scheduling new workloads.
8689
To drain workloads with the `cordon` command, the `evacuate` parameter must be set to `True`.
8790
The workloads running on the Bare Metal Machine are `stopped` and the Bare Metal Machine is set to `pending` state.
8891

@@ -173,7 +176,7 @@ az networkcloud baremetalmachine replace \
173176

174177
If the `replace` action fails due to a hardware validation failure, the specific error or test failure is shown in the `replace` response, as shown in the following examples.
175178
This information can also be found in the Activity Log for the Bare Metal Machine (Operator Nexus).
176-
The error code and error message are included the JSON properties of the corresponding `BareMetalMachines_Replace` operation.
179+
The error code and error message are also included in the JSON properties of the corresponding `BareMetalMachines_Replace` operation.
177180

178181
**Example 1: Hardware validation fails due to invalid Key Vault URI for Baseboard Management Controller (BMC) credentials**
179182

articles/operator-nexus/troubleshoot-bare-metal-machine-warning.md

Lines changed: 8 additions & 5 deletions
@@ -4,15 +4,15 @@ description: Troubleshooting guide for Bare Metal Machines Warning status messag
44
ms.service: azure-operator-nexus
55
ms.custom: azure-operator-nexus
66
ms.topic: troubleshooting
7-
ms.date: 04/17/2025
7+
ms.date: 08/12/2025
88
author: robertstarling
99
ms.author: robstarling
1010
ms.reviewer: ekarandjeff
1111
---
1212

1313
# Troubleshoot _'Warning'_ detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine
1414

15-
This document provides basic troubleshooting information for Bare Metal Machine (BMM) resources which are reporting a _Warning_ message in the BMM detailed status message.
15+
This document provides basic troubleshooting information for Bare Metal Machine (BMM) resources that are reporting a _Warning_ message in the BMM detailed status message.
1616

1717
## Symptoms
1818

@@ -92,6 +92,8 @@ Review the `lastTransitionTime` and `message` fields for more information about
9292
}
9393
```
9494

95+
You can also check for any potentially related recent lifecycle actions (such as Restart or Power off actions) in the Azure portal. See [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties). If available, this information is also visible in the output of the previous `run-read-command` in the `actionStates` status field.
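
One possible way to correlate a Warning with recent lifecycle actions is sketched below in Python. It assumes you already have the `actionStates` list and the `lastTransitionTime` of the condition as UTC timestamp strings. The `actions_near` helper and the 30-minute window are hypothetical choices for illustration. The sketch also assumes that an in-progress action has no `endTime`, which isn't confirmed here.

```python
from datetime import datetime, timedelta


def parse_time(value):
    """Parse a UTC timestamp such as 2025-08-12T15:32:12Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def actions_near(action_states, when, window_minutes=30):
    """Return the action types whose time range, widened by the window, contains `when`.

    Hypothetical helper: an action without an endTime is treated as still
    running, so it matches any time after its (widened) start.
    """
    margin = timedelta(minutes=window_minutes)
    moment = parse_time(when)
    matches = []
    for action in action_states:
        start = parse_time(action["startTime"])
        end_text = action.get("endTime")
        started_before = start - margin <= moment
        ended_after = not end_text or moment <= parse_time(end_text) + margin
        if started_before and ended_after:
            matches.append(action["actionType"])
    return matches


if __name__ == "__main__":
    states = [
        {
            "actionType": "Microsoft.NetworkCloud/bareMetalMachines/replace",
            "startTime": "2025-08-12T15:32:12Z",
            "endTime": "2025-08-12T17:00:58Z",
        }
    ]
    # Compare against the lastTransitionTime reported in the conditions output.
    print(actions_near(states, "2025-08-12T17:20:00Z"))
```

Widening the range on both sides allows for a Warning that is raised shortly before or after the action's recorded start and end times.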
96+
9597
## `Warning: PXE port is unhealthy`
9698

9799
This message in the BMM _Detailed status message_ field indicates a problem with network connectivity on the Preboot Execution Environment (PXE) Ethernet port on the underlying compute host.
@@ -114,8 +116,8 @@ To troubleshoot this issue:
114116
- review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
115117
- this information should identify the specific root cause (port down or port flapping) and approximate time of the issue
116118
- check the Ethernet cabling and Top Of Rack (TOR) switch for the affected PXE port
117-
- check for any other BMMs which are also reporting unhealthy PXE status or other network-related problems
118-
- check for any recent deployment or infrastructure changes which coincide with the time of failure.
119+
- check for any other BMMs that are also reporting unhealthy PXE status or other network-related problems
120+
- check for any recent deployment or infrastructure changes that coincide with the time of failure.
119121

120122
**Example `conditions` output for PXE warning**
121123

@@ -143,10 +145,11 @@ This message can indicate an issue with the underlying compute host or baseboard
143145
To troubleshoot this issue:
144146

145147
- review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
148+
- review the `actionStates` status field of the kubernetes `bmm` object for any recently initiated lifecycle actions (such as a Restart or Power off) as described in the [Troubleshooting](#troubleshooting) section
146149
- this information should identify the approximate time of the issue and any other available details
147150
- check the power feed, power cables, and physical hardware for the specified BMM
148151
- check whether any other BMMs are also reporting an unexpected power state Warning, which might indicate a broader issue with the underlying infrastructure
149-
- check for any recent deployment or infrastructure changes which coincide with the time of failure
152+
- check for any recent deployment or infrastructure changes that coincide with the time of failure
150153
- review the power state and logs on the BMC for the affected host.
151154

152155
For more information about logging into the BMC, see [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).

articles/operator-nexus/troubleshoot-reboot-reimage-replace.md

Lines changed: 18 additions & 15 deletions
@@ -4,7 +4,7 @@ description: Troubleshoot cluster bare metal machines with Restart, Reimage, Rep
44
ms.service: azure-operator-nexus
55
ms.custom: troubleshooting
66
ms.topic: troubleshooting
7-
ms.date: 04/03/2025
7+
ms.date: 08/12/2025
88
author: eak13
99
ms.author: ekarandjeff
1010
---
@@ -33,6 +33,9 @@ The time required to complete each of these actions is similar. Restarting is th
3333
>
3434
> This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it breaks the healthy quorum threshold of the Kubernetes Control Plane.
3535
36+
> [!TIP]
37+
> In version 2509.1 and above, you can monitor recent or in-progress BMM actions in the Azure portal. For more information, see [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
38+
3639
## Identify the corrective action
3740

3841
When troubleshooting a BMM for failures and determining the most appropriate corrective action, it's essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
@@ -45,12 +48,12 @@ When troubleshooting a BMM for failures and determining the most appropriate cor
4548

4649
Follow this escalation path when troubleshooting BMM issues:
4750

48-
| Problem | First action | If problem persists | If still unresolved |
49-
|---------|-------------|-------------------|-------------------|
50-
| Unresponsive VMs or services | Restart | Reimage | Replace |
51-
| Software/OS corruption | Reimage | Replace | Contact support |
52-
| Known hardware failure | Replace | N/A | Contact support |
53-
| Security compromise | Reimage | Replace | Contact support |
51+
| Problem | First action | If problem persists | If still unresolved |
52+
| ---------------------------- | ------------ | ------------------- | ------------------- |
53+
| Unresponsive VMs or services | Restart | Reimage | Replace |
54+
| Software/OS corruption | Reimage | Replace | Contact support |
55+
| Known hardware failure | Replace | N/A | Contact support |
56+
| Security compromise | Reimage | Replace | Contact support |
5457

5558
The recommended approach is to start with the least invasive solution (restart) and escalate to more complex measures only if necessary. Always validate that the issue is resolved after each corrective action.
5659

@@ -177,7 +180,7 @@ Servers contain many physical components that can fail over time. It's important
177180
A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the Tenant data isn't modified during replacement.
178181

179182
> [!IMPORTANT]
180-
> When run with default options, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are other physical disk and/or RAID controllers alerts. Starting with the 2025-07-01 preview version of the NetworkCloud API, and generally available with the 2025-09-01 GA version, use `replace` with `storage-policy="Preserve"` to retain virtual disk data.
183+
> When run with default options, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are other physical disk and/or RAID controllers alerts. Starting with the `2025-07-01-preview` version of the NetworkCloud API, and generally available with the `2025-09-01` GA version, use `replace` with `storage-policy="Preserve"` to retain virtual disk data.
181184
182185
### Replace workflow
183186

@@ -210,7 +213,7 @@ When you're performing the following physical repairs, we recommend a replace ac
210213
- Transceiver
211214
- Ethernet or fiber cable replacement
212215

213-
When you're performing the following physical repairs, a replace action ***is required*** to bring the BMM back into service:
216+
When you're performing the following physical repairs, a replace action **_is required_** to bring the BMM back into service:
214217

215218
- Backplane
216219
- System board
@@ -220,7 +223,7 @@ When you're performing the following physical repairs, a replace action ***is re
220223
- Broadcom embedded NIC
221224

222225
After physical repairs are completed, perform a replace action.
223-
226+
224227
**The following Azure CLI command will `replace` the specified bareMetalMachineName.**
225228

226229
```azurecli
@@ -249,11 +252,11 @@ az networkcloud baremetalmachine uncordon \
249252

250253
Restarting, reimaging, and replacing are effective troubleshooting methods for addressing Azure Operator Nexus server problems. Here's a quick reference guide:
251254

252-
| Action | When to use | Impact | Requirements |
253-
|--------|------------|--------|-------------|
254-
| **Restart** | Temporary glitches, unresponsive VMs | Brief downtime | None, fastest option |
255-
| **Reimage** | OS corruption, security concerns | Longer downtime, preserves data | Workload evacuation recommended |
256-
| **Replace** | Hardware component failures | Longest downtime, preserves data | Hardware component replacement, specific parameters needed |
255+
| Action | When to use | Impact | Requirements |
256+
| ----------- | ------------------------------------ | -------------------------------- | ---------------------------------------------------------- |
257+
| **Restart** | Temporary glitches, unresponsive VMs | Brief downtime | None, fastest option |
258+
| **Reimage** | OS corruption, security concerns | Longer downtime, preserves data | Workload evacuation recommended |
259+
| **Replace** | Hardware component failures | Longest downtime, preserves data | Hardware component replacement, specific parameters needed |
257260

258261
### Best practices
259262
