| 1 | +--- |
| 2 | +title: Troubleshoot BMM Degraded issues in Azure Operator Nexus |
| 3 | +description: Troubleshooting guide for Bare Metal Machines in 'Degraded' status in Azure Operator Nexus. |
| 4 | +ms.service: azure-operator-nexus |
| 5 | +ms.custom: azure-operator-nexus |
| 6 | +ms.topic: troubleshooting |
| 7 | +ms.date: 02/03/2025 |
| 8 | +author: robertstarling |
| 9 | +ms.author: robstarling |
| 10 | +ms.reviewer: ekarandjeff |
| 11 | +--- |
| 12 | + |
| 13 | +# Troubleshoot _Degraded_ status errors on an Azure Operator Nexus Cluster Bare Metal Machine |
| 14 | + |
| 15 | +This document provides basic troubleshooting information for Bare Metal Machine (BMM) resources that report a _Degraded_ status in the BMM detailed status message. |
| 16 | + |
| 17 | +## Symptoms |
| 18 | + |
| 19 | +A Bare Metal Machine (BMM) in a _Degraded_ state exhibits the following symptoms: |
| 20 | + |
| 21 | +- The Detailed status message includes one or more _Degraded_ messages as shown in the following table. |
| 22 | +- For Compute nodes only, the BMM might be automatically cordoned if it's continuously degraded for 15 minutes or longer. |
| 23 | +- An automatically cordoned BMM remains cordoned for 2 hours after the underlying conditions resolve, and is then automatically uncordoned. |
| 24 | +- Control and Management nodes can be reported as _Degraded_, but aren't automatically cordoned. |
| 25 | + |
| 26 | +| Detailed status message | Cordon automatically? | Details and mitigation | |
| 27 | +| -------------------------------------------------------- | --------------------- | ----------------------------------------------------------------------------------------------------------------- | |
| 28 | +| `Degraded: port is not functioning as expected` | Yes | [Degraded: `port is not functioning as expected`](#degraded-port-is-not-functioning-as-expected) | |
| 29 | +| `Degraded: LACP status is down` | Yes | [Degraded: `LACP status is down`](#degraded-lacp-status-is-down) | |
| 30 | +| `Degraded: BMM power state doesn't match expected state` | No | [Degraded: `BMM power state doesn't match expected state`](#degraded-bmm-power-state-doesnt-match-expected-state) | |
| 31 | + |
| 32 | +_Degraded_ status messages and associated automatic cordoning behavior are present in Azure Operator Nexus version 2502.1 and higher. |
| 33 | + |
| 34 | +## Troubleshooting |
| 35 | + |
| 36 | +To check for any Bare Metal Machines (BMMs) which are currently degraded, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name> -o table`. This command shows the current status of all BMMs in the specified resource group. Any active _Degraded_ conditions are visible in the detailed status message. |
| 37 | + |
| 38 | +To see the current cordon status, include a `--query` parameter that specifies `cordonStatus`, as shown in the following example. This command can help identify any compute nodes that are still automatically cordoned because of recently resolved _Degraded_ conditions. |
| 39 | + |
| 40 | +```azurecli |
| 41 | +az networkcloud baremetalmachine list -g <ResourceGroup_Name> --output table --query "[].{name:name,powerState:powerState,provisioningState:provisioningState,readyState:readyState,cordonStatus:cordonStatus,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage}" |
| 42 | +``` |
| 43 | + |
| 44 | +**Example Azure CLI output** |
| 45 | + |
| 46 | +``` |
| 47 | +Name PowerState ProvisioningState ReadyState CordonStatus DetailedStatus DetailedStatusMessage |
| 48 | +-------------- ------------ ------------------- ------------ -------------- ---------------- ----------------------------------------------------------------------------------------------------------------- |
| 49 | +rack2management1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 50 | +rack3management1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 51 | +rack2management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 52 | +rack1management1 Off Succeeded False Uncordoned Available Available to participate in the cluster. |
| 53 | +rack3management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 54 | +rack1management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 55 | +rack3compute01 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 56 | +rack1compute05 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 57 | +rack1compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 58 | +rack1compute03 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 59 | +rack1compute08 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 60 | +rack2compute05 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 61 | +rack2compute03 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 62 | +rack1compute01 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: LACP status is down |
| 63 | +rack2compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 64 | +rack2compute01 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 65 | +rack1compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 66 | +rack3compute06 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 67 | +rack3compute05 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 68 | +rack3compute08 Off Succeeded False Uncordoned Error This machine has failed hardware validation |
| 69 | +rack2compute06 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 70 | +rack3compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 71 | +rack3compute03 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 72 | +rack3compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 73 | +rack1compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 74 | +rack3compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 75 | +rack2compute08 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port is not functioning as expected |
| 76 | +rack2compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 77 | +rack1compute06 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 78 | +rack2compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 79 | +``` |
| 80 | + |
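With many BMMs in a resource group, it can be easier to filter the list programmatically than to scan the table. The following Python sketch assumes the list command was run with `--output json` instead of `--output table`, and that each entry exposes the `name`, `cordonStatus`, and `detailedStatusMessage` fields used in the `--query` example. The `needs_attention` helper and the embedded sample data are illustrative only and aren't part of any Azure tooling.

```python
import json

# Sample data shaped like the JSON form of the list command output,
# trimmed to the fields used in the --query example (assumed layout).
SAMPLE = """
[
  {"name": "rack1compute01", "cordonStatus": "Cordoned",
   "detailedStatusMessage": "The OS is provisioned to the machine. Degraded: LACP status is down"},
  {"name": "rack2compute08", "cordonStatus": "Cordoned",
   "detailedStatusMessage": "The OS is provisioned to the machine. Degraded: port is not functioning as expected"},
  {"name": "rack2compute02", "cordonStatus": "Uncordoned",
   "detailedStatusMessage": "The OS is provisioned to the machine."}
]
"""


def needs_attention(machines):
    """Return (name, cordonStatus, message) for each degraded or cordoned BMM."""
    flagged = []
    for machine in machines:
        message = machine.get("detailedStatusMessage") or ""
        cordon = machine.get("cordonStatus") or ""
        # A BMM can stay cordoned after its Degraded message clears,
        # so check both fields rather than the message alone.
        if "Degraded:" in message or cordon == "Cordoned":
            flagged.append((machine.get("name"), cordon, message))
    return flagged


if __name__ == "__main__":
    for name, cordon, message in needs_attention(json.loads(SAMPLE)):
        print(f"{name}\t{cordon}\t{message}")
```

To use the sketch with real data, replace `SAMPLE` with the JSON text captured from the CLI.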
| 81 | +For more information, use an Azure CLI Bare Metal Machine `run-read-command` such as the following to inspect the `conditions` status of the corresponding Kubernetes BMM object. |
| 82 | + |
| 83 | +```azurecli |
| 84 | +az networkcloud baremetalmachine run-read-command -g <ResourceGroup_Name> -n rack2management2 --limit-time-seconds 60 --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute08,-o,json]}]" --output-directory . |
| 85 | +``` |
| 86 | + |
| 87 | +- Replace `<ResourceGroup_Name>` with the name of the resource group containing the BMM resources. |
| 88 | +- Replace `rack2management2` with the name of a BMM resource for a healthy Kubernetes control plane node, from which to execute the `kubectl get` command. |
| 89 | +- Replace `rack2compute08` with the name of the degraded or cordoned BMM to inspect. |
| 90 | +- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md). |
| 91 | + |
| 92 | +Review the `lastTransitionTime` and `message` fields for more information about the corresponding degraded condition, as shown in the following example output. |
| 93 | + |
| 94 | +**Example `conditions` output:** |
| 95 | + |
| 96 | +``` |
| 97 | + "conditions": [ |
| 98 | + { |
| 99 | + "lastTransitionTime": "2025-01-30T23:54:04Z", |
| 100 | + "status": "True", |
| 101 | + "type": "BmmInExpectedLACPState" |
| 102 | + }, |
| 103 | + { |
| 104 | + "lastTransitionTime": "2025-02-01T22:07:14Z", |
| 105 | + "message": "Error: Port status for interface 98_p1 is down", |
| 106 | + "reason": "Port status is down", |
| 107 | + "severity": "Error", |
| 108 | + "status": "False", |
| 109 | + "type": "BmmInExpectedPortState" |
| 110 | + }, |
| 111 | + { |
| 112 | + "lastTransitionTime": "2025-01-30T23:54:04Z", |
| 113 | + "status": "True", |
| 114 | + "type": "BmmInExpectedPowerState" |
| 115 | + } |
| 116 | + ], |
| 117 | +``` |
| 118 | + |
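As a rough illustration of how to read this output, the following Python sketch lists only the conditions whose `status` isn't `"True"`, along with their `lastTransitionTime` and `message`. It assumes the `conditions` array has already been extracted from the saved `kubectl get` output into a standalone JSON list. The `failing_conditions` helper is hypothetical, and the sample data mirrors the preceding example.

```python
import json

# The conditions array from the example above, as a standalone JSON list.
SAMPLE_CONDITIONS = """
[
  {"lastTransitionTime": "2025-01-30T23:54:04Z", "status": "True",
   "type": "BmmInExpectedLACPState"},
  {"lastTransitionTime": "2025-02-01T22:07:14Z",
   "message": "Error: Port status for interface 98_p1 is down",
   "reason": "Port status is down", "severity": "Error",
   "status": "False", "type": "BmmInExpectedPortState"},
  {"lastTransitionTime": "2025-01-30T23:54:04Z", "status": "True",
   "type": "BmmInExpectedPowerState"}
]
"""


def failing_conditions(conditions):
    """Return the conditions that aren't currently reporting status "True"."""
    return [c for c in conditions if c.get("status") != "True"]


if __name__ == "__main__":
    for cond in failing_conditions(json.loads(SAMPLE_CONDITIONS)):
        # Healthy conditions in the example carry no message, so default to "".
        print(cond.get("type"), cond.get("lastTransitionTime"), cond.get("message", ""))
```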
| 119 | +## Automatic Cordoning |
| 120 | + |
| 121 | +If an uncordoned BMM is in a _Degraded_ state for 15 minutes or more, the node might be automatically cordoned, depending on which degraded conditions are present. |
| 122 | + |
| 123 | +- The `cordonStatus` field in the BMM object shows the current state of the node. |
| 124 | +- Only BMMs used for Compute are automatically cordoned. Control and Management nodes aren't automatically cordoned. |
| 125 | +- An automatically cordoned node will remain cordoned for 2 hours after the underlying conditions are resolved, after which it will be automatically uncordoned. |
| 126 | +- To uncordon a BMM manually, use the `az networkcloud baremetalmachine uncordon` command or execute the _Uncordon_ action from the Azure portal. |
| 127 | +- Manually uncordoning a BMM that still has a degraded condition has no lasting effect. The _Uncordon_ request executes successfully, but the node is immediately cordoned again automatically, and remains cordoned until 2 hours after the underlying conditions are resolved. |
| 128 | + |
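The timing rules above can be turned into rough estimates. The following Python sketch assumes that the `lastTransitionTime` of a condition marks when the BMM became degraded (status `"False"`) or recovered (status `"True"`), and it applies the 15-minute and 2-hour windows described in this section. The helper names are hypothetical, and the results are estimates only, because the exact platform timers might differ.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the automatic cordoning rules described above.
CORDON_AFTER = timedelta(minutes=15)
UNCORDON_AFTER = timedelta(hours=2)


def parse_utc(timestamp):
    """Parse a timestamp such as 2025-02-01T22:07:14Z into an aware datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def earliest_auto_cordon(degraded_since):
    """Estimate the earliest time a continuously degraded Compute BMM is cordoned."""
    return parse_utc(degraded_since) + CORDON_AFTER


def expected_auto_uncordon(resolved_at):
    """Estimate when an automatically cordoned BMM is uncordoned after recovery."""
    return parse_utc(resolved_at) + UNCORDON_AFTER


if __name__ == "__main__":
    # The first timestamp is from the example conditions output; the second is hypothetical.
    print("Earliest automatic cordon:", earliest_auto_cordon("2025-02-01T22:07:14Z").isoformat())
    print("Expected automatic uncordon:", expected_auto_uncordon("2025-02-02T01:00:00Z").isoformat())
```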
| 129 | +To investigate whether a BMM is currently cordoned because of a recent _Degraded_ state: |
| 130 | + |
| 131 | +- Review the `lastTransitionTime` in the `conditions` for the Kubernetes `bmm` resource, as described in the [Troubleshooting](#troubleshooting) section, to identify any recently resolved _Degraded_ conditions. |
| 132 | +- Review the Activity Logs for the BMM resource in the Azure portal to check for any user-initiated cordon requests. |
| 133 | + |
| 134 | +## Degraded: `port is not functioning as expected` |
| 135 | + |
| 136 | +This message in the BMM _Detailed status message_ field indicates that the physical link is down on one or more of the Mellanox interfaces on the underlying compute host. This scenario can indicate a cabling, switch port configuration, or hardware failure. |
| 137 | + |
| 138 | +To troubleshoot this issue: |
| 139 | + |
| 140 | +- review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section |
| 141 | +- this information should identify the affected port and approximate time of the issue |
| 142 | +- check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port |
| 143 | +- check for any recent deployment or infrastructure changes which coincide with the time of failure. |
| 144 | + |
| 145 | +**Example `conditions` output for unexpected port state** |
| 146 | + |
| 147 | +``` |
| 148 | + "conditions": [ |
| 149 | + { |
| 150 | + "lastTransitionTime": "2025-02-01T22:07:14Z", |
| 151 | + "message": "Error: Port status for interface 98_p1 is down", |
| 152 | + "reason": "Port status is down", |
| 153 | + "severity": "Error", |
| 154 | + "status": "False", |
| 155 | + "type": "BmmInExpectedPortState" |
| 156 | + } |
| 157 | + ], |
| 158 | +``` |
| 159 | + |
| 160 | +## Degraded: `LACP status is down` |
| 161 | + |
| 162 | +This message in the BMM _Detailed status message_ field indicates a Link Aggregation Control Protocol (LACP) failure on the underlying compute host while the physical links are up. This scenario can indicate a cabling or Top Of Rack (TOR) switch configuration issue. |
| 163 | + |
| 164 | +To troubleshoot this issue: |
| 165 | + |
| 166 | +- review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section |
| 167 | +- this information should identify the affected port and approximate time of the issue |
| 168 | +- check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port |
| 169 | +- check whether any other BMMs are also reporting port or LACP issues, which might help to identify any potential mis-cabling or wider issue with the TOR switch or network configuration |
| 170 | +- check for any recent deployment or infrastructure changes which coincide with the time of failure |
| 171 | +- for more information about diagnosing and fixing LACP issues, see [Troubleshoot LACP Bonding](./troubleshoot-lacp-bonding.md). |
| 172 | + |
| 173 | +> [!WARNING] |
| 174 | +> As of version 2502.1, there's a known issue where `LACP status is down` can be incorrectly reported in addition to the `port is not functioning as expected` message during a port down scenario. This issue can happen when a BMM is restarted or reimaged while the physical port is down. This issue will be fixed in a future release. In the meantime, the `LACP status is down` warning can be safely ignored if the physical port is also down. |
| 175 | + |
| 176 | +**Example `conditions` output for unexpected LACP state** |
| 177 | + |
| 178 | +``` |
| 179 | + "conditions": [ |
| 180 | + { |
| 181 | + "lastTransitionTime": "2025-01-31T12:24:27Z", |
| 182 | + "message": "Error: LACP status for interface 4b_p0 is down, LACP status for interface 4b_p1 is down", |
| 183 | + "reason": "LACP status is down", |
| 184 | + "severity": "Error", |
| 185 | + "status": "False", |
| 186 | + "type": "BmmInExpectedLACPState" |
| 187 | +    } |
| 188 | + ], |
| 189 | +``` |
| 190 | + |
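The known issue described in the preceding warning can be expressed as a small check. The following Python sketch flags an LACP failure as likely spurious when the port condition on the same BMM is also failing. It uses the `BmmInExpectedLACPState` and `BmmInExpectedPortState` condition types shown in the example outputs. The `lacp_likely_spurious` helper is hypothetical and only reflects the guidance given for version 2502.1.

```python
def lacp_likely_spurious(conditions):
    """Return True when both the LACP and port conditions report status "False".

    Per the known issue above, an LACP failure reported alongside a port-down
    condition can be ignored; the port failure is the one to investigate.
    """
    failing = {c.get("type") for c in conditions if c.get("status") == "False"}
    return "BmmInExpectedLACPState" in failing and "BmmInExpectedPortState" in failing


if __name__ == "__main__":
    both_down = [
        {"type": "BmmInExpectedLACPState", "status": "False"},
        {"type": "BmmInExpectedPortState", "status": "False"},
    ]
    lacp_only = [
        {"type": "BmmInExpectedLACPState", "status": "False"},
        {"type": "BmmInExpectedPortState", "status": "True"},
    ]
    print("Both failing, LACP likely spurious:", lacp_likely_spurious(both_down))
    print("LACP only, LACP likely spurious:", lacp_likely_spurious(lacp_only))
```

When the check returns `False` for a BMM that reports `LACP status is down`, treat the LACP failure as genuine and follow the troubleshooting steps in this section.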
| 191 | +## Degraded: `BMM power state doesn't match expected state` |
| 192 | + |
| 193 | +This message in the BMM _Detailed status message_ field indicates that either: |
| 194 | + |
| 195 | +- the underlying host is powered off when it should be on, or |
| 196 | +- the underlying host is powered on when it should be off. |
| 197 | + |
| 198 | +This condition can happen temporarily during a normal Restart, Reimage, or similar BMM lifecycle event. However, a persistent 'unexpected power state' message can indicate an issue with the underlying compute host or baseboard management controller (BMC). |
| 199 | + |
| 200 | +To troubleshoot this issue: |
| 201 | + |
| 202 | +- review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section |
| 203 | +- this information should identify the approximate time of the issue and any other available details |
| 204 | +- check the power feed, power cables, and physical hardware for the specified BMM |
| 205 | +- check whether any other BMMs are also reporting an unexpected degraded state, which might indicate a broader issue with the underlying infrastructure |
| 206 | +- check for any recent deployment or infrastructure changes which coincide with the time of failure |
| 207 | +- review the power state and logs on the BMC for the affected host. |
| 208 | + |
| 209 | +For more information about logging into the BMC, see [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md). |
| 210 | + |
| 211 | +**Example `conditions` output for unexpected power state** |
| 212 | + |
| 213 | +``` |
| 214 | + "conditions": [ |
| 215 | + { |
| 216 | + "lastTransitionTime": "2025-02-03T22:35:55Z", |
| 217 | + "message": "BareMetalMachine expected to be powered on", |
| 218 | + "reason": "BmmPoweredOnExpected", |
| 219 | + "severity": "Error", |
| 220 | + "status": "False", |
| 221 | + "type": "BmmInExpectedPowerState" |
| 222 | +    } |
| 223 | + ], |
| 224 | +``` |