Merge pull request #304144 from robertstarling/ado2205530-actionstate-json-properties

JamesJBarnett · web-flow · commit 02056dbc42da · 2025-08-13T10:11:14.000-07:00
BMM ActionState is now visible in JSON properties
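
Reviewer note: the `actionStates` payload documented in this change can be post-processed with a few lines of script. The following is a minimal sketch, not part of the documented tooling. It assumes the resource JSON has already been copied from the portal `JSON View` into a string. The `SAMPLE` value is a trimmed copy of the example added in this PR, and the helper names `parse_utc` and `summarize_steps` are hypothetical.

```python
import json
from datetime import datetime

# Trimmed copy of the example actionStates payload from this PR (assumption:
# the real resource JSON has the same shape under "properties").
SAMPLE = """
{
  "properties": {
    "actionStates": [
      {
        "actionType": "Microsoft.NetworkCloud/bareMetalMachines/replace",
        "status": "Completed",
        "stepStates": [
          {"stepName": "Hardware Validation", "status": "Completed",
           "startTime": "2025-08-12T15:32:12Z", "endTime": "2025-08-12T15:41:22Z"},
          {"stepName": "Deprovisioning", "status": "Completed",
           "startTime": "2025-08-12T15:41:22Z", "endTime": "2025-08-12T16:25:39Z"}
        ]
      }
    ]
  }
}
"""


def parse_utc(timestamp):
    """Parse a UTC timestamp in the format used by the example payload."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")


def summarize_steps(resource):
    """Return (action, step name, status, duration in seconds) per step.

    Duration is None when a step has no start or end time yet,
    for example while the step is still running.
    """
    rows = []
    for action in resource.get("properties", {}).get("actionStates", []):
        # Keep only the last path segment of the action type, e.g. "replace".
        action_name = action["actionType"].rsplit("/", 1)[-1]
        for step in action.get("stepStates", []):
            start, end = step.get("startTime"), step.get("endTime")
            seconds = None
            if start and end:
                seconds = int((parse_utc(end) - parse_utc(start)).total_seconds())
            rows.append((action_name, step["stepName"], step["status"], seconds))
    return rows


if __name__ == "__main__":
    for row in summarize_steps(json.loads(SAMPLE)):
        print(row)
```

Computing durations per step makes it easier to see which phase of a long-running action took the most time. The field names used here (`actionType`, `stepStates`, `stepName`, `startTime`, `endTime`) are taken from the example in this PR only and might differ in other API versions.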
diff --git a/articles/operator-nexus/howto-bare-metal-best-practices.md b/articles/operator-nexus/howto-bare-metal-best-practices.md
@@ -1,7 +1,7 @@
 ---
 title: Best practices for Bare Metal Machine operations
 description: Steps that should be taken before executing any Bare Metal Machine replace, or reimage actions. Highlight essential prerequisites and common pitfalls to avoid.
-ms.date: 05/22/2025
+ms.date: 08/12/2025
 ms.topic: how-to
 ms.service: azure-operator-nexus
 ms.custom: template-how-to, best-practices
@@ -67,7 +67,7 @@ See related articles:
 - [How to monitor interface In and Out packet rate for network fabric devices]
 - [How to configure diagnostic settings and monitor configuration differences in Nexus Network Fabric].
 
-Evaluate for any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems.
+Evaluate for any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems.
 For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
 
 #### Determine if firmware update jobs are running
@@ -88,7 +88,7 @@ az networkcloud baremetalmachine run-read-command \
   --output-directory .
 ```
 
-Here's an example output from the `racadm jobqueue view` command which shows `Firmware Update`.
+Here's an example output from the `racadm jobqueue view` command that shows `Firmware Update`.
 
 ```
 [Job ID=JID_833540920066]
@@ -125,58 +125,60 @@ Message=[SYS043: Successfully exported Server Configuration Profile]
 Percent Complete=[100]
 ```
 
-#### Monitor progress using `run-read-command`
+#### Monitor status in Bare Metal Machine JSON properties
 
-In version 2506.2 and above, you can monitor the progress of long running Bare Metal Machine actions using a `run-read-command`.
+In version 2509.1 and above, you can view the status of any recent or in-progress actions in the `JSON View` of the corresponding Bare Metal Machine (Operator Nexus) resource. This information is visible in the `actionStates` field of the Bare Metal Machine JSON properties when using API version `2025-07-01-preview` or later. The following information is available:
 
-- Some long running actions such as `Replace` or `Reimage` are composed of multiple steps, for example, `Hardware Validation`, `Deprovisioning`, or `Provisioning`.
-- The following `run-read-command` shows how to view the different steps in each action, and the progress or status of each step including any potential errors.
-- This information is available on the BareMetalMachine kubernetes resource during or after the action is completed.
-- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
+- Start and end time of the action.
+- Status of the action (`Succeeded`, `Failed`, or `InProgress`).
+- Any extra context or error message associated with the status.
+- The Correlation ID for the original operation, as shown in the Azure Activity log.
+- An ordered list of steps and their status, such as `Hardware Validation`, `Deprovisioning`, `Provisioning`, and `Cloud Init` for a BMM Replace action.
 
-Example `run-read-command` to view action progress on Bare Metal Machine `rack2compute08`:
+The most recent occurrence of each action type is shown, including any currently in-progress action.
 
-```azurecli
-az networkcloud baremetalmachine run-read-command \
-  -g <ResourceGroup_Name> \
-  -n <Control Node BMM Name> \
-  --limit-time-seconds 60 \
-  --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute08,-o,json]}]" \
-  --output-directory .
-```
-
-Example output for a Replace action:
+Example `actionStates` output for a Bare Metal Machine Replace action:
 
 ```json
-[
-  {
-    "correlationId": "961a6154-4342-4831-9693-27314671e6a7",
-    "endTime": "2025-05-15T21:20:44Z",
-    "startTime": "2025-05-15T20:16:19Z",
-    "status": "Completed",
-    "stepStates": [
-      {
-        "endTime": "2025-05-15T20:25:51Z",
-        "name": "Hardware Validation",
-        "startTime": "2025-05-15T20:16:19Z",
-        "status": "Completed"
-      },
-      {
-        "endTime": "2025-05-15T20:26:21Z",
-        "name": "Deprovisioning",
-        "startTime": "2025-05-15T20:25:51Z",
-        "status": "Completed"
-      },
+{
+  "properties": {
+    "actionStates": [
       {
-        "endTime": "2025-05-15T21:20:44Z",
-        "name": "Provisioning",
-        "startTime": "2025-05-15T20:26:21Z",
-        "status": "Completed"
+        "actionType": "Microsoft.NetworkCloud/bareMetalMachines/replace",
+        "correlationId": "25d678cb-353c-41f4-8231-1135064ae582",
+        "endTime": "2025-08-12T17:00:58Z",
+        "startTime": "2025-08-12T15:32:12Z",
+        "status": "Completed",
+        "stepStates": [
+          {
+            "endTime": "2025-08-12T15:41:22Z",
+            "startTime": "2025-08-12T15:32:12Z",
+            "status": "Completed",
+            "stepName": "Hardware Validation"
+          },
+          {
+            "endTime": "2025-08-12T16:25:39Z",
+            "startTime": "2025-08-12T15:41:22Z",
+            "status": "Completed",
+            "stepName": "Deprovisioning"
+          },
+          {
+            "endTime": "2025-08-12T16:48:27Z",
+            "startTime": "2025-08-12T16:25:39Z",
+            "status": "Completed",
+            "stepName": "Provisioning"
+          },
+          {
+            "endTime": "2025-08-12T17:00:58Z",
+            "startTime": "2025-08-12T16:48:27Z",
+            "status": "Completed",
+            "stepName": "Cloud Init"
+          }
+        ]
       }
-    ],
-    "type": "Microsoft.NetworkCloud/bareMetalMachines/replace"
+    ]
   }
-]
+}
 ```
 
 ## Best practices for a Bare Metal Machine reimage
@@ -200,7 +202,7 @@ Before initiating any `reimage` operation, ensure the following preconditions ar
 
 - Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
 - Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
-- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
+- Evaluate any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems before a `reimage` operation.
   For more information, read [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
 - If the Bare Metal Machine reports a failed state with the reason of hardware validation (seen in the Bare Metal Machine `Detailed Status` and `Detailed Status Message` fields), then the Bare Metal Machine needs a `replace` instead.
   See the [Best Practices for a Bare Metal Machine Replace](#best-practices-for-a-bare-metal-machine-replace).
@@ -228,7 +230,7 @@ Before initiating any `replace` operation, ensure the following preconditions ar
 
 - Make sure the Bare Metal Machine's workloads are drained using the [`cordon`](./howto-baremetal-functions.md#make-a-bare-metal-machine-unschedulable-cordon) command with the parameter `evacuate` set to `True`.
 - Perform high level checks covered in the article [Troubleshoot Bare Metal Machine Provisioning].
-- Evaluate any Bare Metal Machine warnings or degraded conditions which could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
+- Evaluate any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems before a `replace` operation.
   For more information, see [Troubleshoot Degraded Status Errors on Bare Metal Machines] and [Troubleshoot Bare Metal Machine Warning Status].
 - Validate Bare Metal Machine is powered on.
 - Validate that there are no running firmware upgrade jobs.
diff --git a/articles/operator-nexus/howto-baremetal-functions.md b/articles/operator-nexus/howto-baremetal-functions.md
@@ -5,7 +5,7 @@ author: eak13
 ms.author: ekarandjeff
 ms.service: azure-operator-nexus
 ms.topic: how-to
-ms.date: 04/02/2025
+ms.date: 08/12/2025
 ms.custom: template-how-to, devx-track-azurecli
 ---
 
@@ -32,6 +32,9 @@ The Cordon action without the `evacuate` parameter isn't considered disruptive w
 
 [!INCLUDE [important-donot-disrupt-kcpnodes](./includes/baremetal-machines/important-donot-disrupt-kcpnodes.md)]
 
+> [!TIP]
+> In version 2509.1 and above, you can monitor recent or in-progress BMM actions in the Azure portal. For more information, see [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
+
 [!INCLUDE [prerequisites-azure-cli-bare-metal-machine-actions](./includes/baremetal-machines/prerequisites-azure-cli-bare-metal-machine-actions.md)]
 
 ## Power off a Bare Metal Machine
@@ -82,7 +85,7 @@ Existing workloads continue to run on the Bare Metal Machine unless the workload
 
 ### Drain Bare Metal Machine workloads
 
-The cordon command supports the `evacuate` parameter which its default value `False` means that the `cordon` command prevents scheduling new workloads.
+The `cordon` command supports the `evacuate` parameter, whose default value `False` means that the `cordon` command prevents scheduling new workloads.
 To drain workloads with the `cordon` command, the `evacuate` parameter must be set to `True`.
 The workloads running on the Bare Metal Machine are `stopped` and the Bare Metal Machine is set to `pending` state.
 
@@ -173,7 +176,7 @@ az networkcloud baremetalmachine replace \
 
 If the `replace` action fails due to a hardware validation failure, the specific error or test failure is shown in the `replace` response, as shown in the following examples.
 This information can also be found in the Activity Log for the Bare Metal Machine (Operator Nexus).
-The error code and error message are included the JSON properties of the corresponding `BareMetalMachines_Replace` operation.
+The error code and error message are also included in the JSON properties of the corresponding `BareMetalMachines_Replace` operation.
 
 **Example 1: Hardware validation fails due to invalid Key Vault URI for Baseboard Management Controller (BMC) credentials**
 
diff --git a/articles/operator-nexus/troubleshoot-bare-metal-machine-warning.md b/articles/operator-nexus/troubleshoot-bare-metal-machine-warning.md
@@ -4,15 +4,15 @@ description: Troubleshooting guide for Bare Metal Machines Warning status messag
 ms.service: azure-operator-nexus
 ms.custom: azure-operator-nexus
 ms.topic: troubleshooting
-ms.date: 04/17/2025
+ms.date: 08/12/2025
 author: robertstarling
 ms.author: robstarling
 ms.reviewer: ekarandjeff
 ---
 
 # Troubleshoot _'Warning'_ detailed status messages on an Azure Operator Nexus Cluster Bare Metal Machine
 
-This document provides basic troubleshooting information for Bare Metal Machine (BMM) resources which are reporting a _Warning_ message in the BMM detailed status message.
+This document provides basic troubleshooting information for Bare Metal Machine (BMM) resources that are reporting a _Warning_ message in the BMM detailed status message.
 
 ## Symptoms
 
@@ -92,6 +92,8 @@ Review the `lastTransitionTime` and `message` fields for more information about
 }
 ```
 
+You can also check for any potentially related recent lifecycle actions (such as Restart or Power off actions) in the Azure portal. See [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties). If available, this information is also visible in the output of the previous `run-read-command` in the `actionStates` status field.
+
 ## `Warning: PXE port is unhealthy`
 
 This message in the BMM _Detailed status message_ field indicates a problem with network connectivity on the Preboot Execution Environment (PXE) Ethernet port on the underlying compute host.
@@ -114,8 +116,8 @@ To troubleshoot this issue:
 - review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
 - this information should identify the specific root cause (port down or port flapping) and approximate time of the issue
 - check the Ethernet cabling and Top Of Rack (TOR) switch for the affected PXE port
-- check for any other BMMs which are also reporting unhealthy PXE status or other network-related problems
-- check for any recent deployment or infrastructure changes which coincide with the time of failure.
+- check for any other BMMs that are also reporting unhealthy PXE status or other network-related problems
+- check for any recent deployment or infrastructure changes that coincide with the time of failure.
 
 **Example `conditions` output for PXE warning**
 
@@ -143,10 +145,11 @@ This message can indicate an issue with the underlying compute host or baseboard
 To troubleshoot this issue:
 
 - review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
+- review the `actionStates` status field of the kubernetes `bmm` object for any recently initiated lifecycle actions (such as a Restart or Power off) as described in the [Troubleshooting](#troubleshooting) section
 - this information should identify the approximate time of the issue and any other available details
 - check the power feed, power cables, and physical hardware for the specified BMM
 - check whether any other BMMs are also reporting an unexpected power state Warning, which might indicate a broader issue with the underlying infrastructure
-- check for any recent deployment or infrastructure changes which coincide with the time of failure
+- check for any recent deployment or infrastructure changes that coincide with the time of failure
 - review the power state and logs on the BMC for the affected host.
 
 For more information about logging into the BMC, see [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
diff --git a/articles/operator-nexus/troubleshoot-reboot-reimage-replace.md b/articles/operator-nexus/troubleshoot-reboot-reimage-replace.md
@@ -4,7 +4,7 @@ description: Troubleshoot cluster bare metal machines with Restart, Reimage, Rep
 ms.service: azure-operator-nexus
 ms.custom: troubleshooting
 ms.topic: troubleshooting
-ms.date: 04/03/2025
+ms.date: 08/12/2025
 author: eak13
 ms.author: ekarandjeff
 ---
@@ -33,6 +33,9 @@ The time required to complete each of these actions is similar. Restarting is th
 >
 > This check is done to maintain the integrity of the Nexus instance and ensure multiple KCP nodes don't go down at once due to simultaneous disruptive actions. If multiple nodes go down, it breaks the healthy quorum threshold of the Kubernetes Control Plane.
 
+> [!TIP]
+> In version 2509.1 and above, you can monitor recent or in-progress BMM actions in the Azure portal. For more information, see [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
+
 ## Identify the corrective action
 
 When troubleshooting a BMM for failures and determining the most appropriate corrective action, it's essential to understand the available options. This article provides a systematic approach to troubleshoot Azure Operator Nexus server problems using these three methods:
@@ -45,12 +48,12 @@ When troubleshooting a BMM for failures and determining the most appropriate cor
 
 Follow this escalation path when troubleshooting BMM issues:
 
-| Problem | First action | If problem persists | If still unresolved |
-|---------|-------------|-------------------|-------------------|
-| Unresponsive VMs or services | Restart | Reimage | Replace |
-| Software/OS corruption | Reimage | Replace | Contact support |
-| Known hardware failure | Replace | N/A | Contact support |
-| Security compromise | Reimage | Replace | Contact support |
+| Problem                      | First action | If problem persists | If still unresolved |
+| ---------------------------- | ------------ | ------------------- | ------------------- |
+| Unresponsive VMs or services | Restart      | Reimage             | Replace             |
+| Software/OS corruption       | Reimage      | Replace             | Contact support     |
+| Known hardware failure       | Replace      | N/A                 | Contact support     |
+| Security compromise          | Reimage      | Replace             | Contact support     |
 
 The recommended approach is to start with the least invasive solution (restart) and escalate to more complex measures only if necessary. Always validate that the issue is resolved after each corrective action.
 
@@ -177,7 +180,7 @@ Servers contain many physical components that can fail over time. It's important
 A hardware validation process is invoked to ensure the integrity of the physical host in advance of deploying the OS image. Like the reimage action, the Tenant data isn't modified during replacement.
 
 > [!IMPORTANT]
-> When run with default options, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are other physical disk and/or RAID controllers alerts. Starting with the 2025-07-01 preview version of the NetworkCloud API, and generally available with the 2025-09-01 GA version, use `replace` with `storage-policy="Preserve"` to retain virtual disk data.
+> When run with default options, the RAID controller is reset during BMM replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during BMM replace can be ignored unless there are other physical disk or RAID controller alerts. Starting with the `2025-07-01-preview` version of the NetworkCloud API, and generally available with the `2025-09-01` GA version, use `replace` with `storage-policy="Preserve"` to retain virtual disk data.
 
 ### Replace workflow
 
@@ -210,7 +213,7 @@ When you're performing the following physical repairs, we recommend a replace ac
 - Transceiver
 - Ethernet or fiber cable replacement
 
-When you're performing the following physical repairs, a replace action ***is required*** to bring the BMM back into service:
+When you're performing the following physical repairs, a replace action **_is required_** to bring the BMM back into service:
 
 - Backplane
 - System board
@@ -220,7 +223,7 @@ When you're performing the following physical repairs, a replace action ***is re
 - Broadcom embedded NIC
 
 After physical repairs are completed, perform a replace action.
-  
+
 **The following Azure CLI command will `replace` the specified bareMetalMachineName.**
 
 ```azurecli
@@ -249,11 +252,11 @@ az networkcloud baremetalmachine uncordon \
 
 Restarting, reimaging, and replacing are effective troubleshooting methods for addressing Azure Operator Nexus server problems. Here's a quick reference guide:
 
-| Action | When to use | Impact | Requirements |
-|--------|------------|--------|-------------|
-| **Restart** | Temporary glitches, unresponsive VMs | Brief downtime | None, fastest option |
-| **Reimage** | OS corruption, security concerns | Longer downtime, preserves data | Workload evacuation recommended |
-| **Replace** | Hardware component failures | Longest downtime, preserves data | Hardware component replacement, specific parameters needed |
+| Action      | When to use                          | Impact                           | Requirements                                               |
+| ----------- | ------------------------------------ | -------------------------------- | ---------------------------------------------------------- |
+| **Restart** | Temporary glitches, unresponsive VMs | Brief downtime                   | None, fastest option                                       |
+| **Reimage** | OS corruption, security concerns     | Longer downtime, preserves data  | Workload evacuation recommended                            |
+| **Replace** | Hardware component failures          | Longest downtime, preserves data | Hardware component replacement, specific parameters needed |
 
 ### Best practices