| 1 | +--- |
| 2 | +title: Troubleshoot BMM Degraded issues in Azure Operator Nexus |
| 3 | +description: Troubleshooting guide for Bare Metal Machines in *degraded* status in Azure Operator Nexus. |
| 4 | +ms.service: azure-operator-nexus |
| 5 | +ms.custom: troubleshooting |
| 6 | +ms.topic: troubleshooting |
| 7 | +ms.date: 02/03/2025 |
| 8 | +author: robertstarling |
| 9 | +ms.author: robstarling |
| 10 | +ms.reviewer: ekarandjeff |
| 11 | +--- |
| 12 | + |
| 13 | +# Troubleshoot _Degraded_ status errors on an Azure Operator Nexus cluster Bare Metal Machine |
| 14 | + |
| 15 | +This document provides troubleshooting information for Bare Metal Machine (BMM) resources that report a _Degraded_ status in the BMM detailed status message. |
| 16 | + |
| 17 | +## Symptoms |
| 18 | + |
| 19 | +Bare Metal Machines (BMMs) in a _Degraded_ state exhibit the following symptoms. |
| 20 | + |
| 21 | +- The Detailed status message includes one or more _Degraded_ messages as shown in the following table. |
| 22 | +- A Compute node BMM might be automatically cordoned if it's continuously degraded for 15 minutes or longer. |
| 23 | +- An automatically cordoned BMM remains cordoned for 2 hours after the underlying conditions resolve, after which it's automatically uncordoned. |
| 24 | +- Control and Management nodes can also be reported as _Degraded_, but aren't automatically cordoned. |
| 25 | + |
| 26 | +| Detailed status message | Cordon automatically? | |
| 27 | +| -------------------------------------------------------- | --------------------- | |
| 28 | +| `Degraded: port is not functioning as expected` | Yes | |
| 29 | +| `Degraded: LACP status is down` | Yes | |
| 30 | +| `Degraded: BMM power state doesn't match expected state` | No | |
| 31 | + |
| 32 | +The _Degraded_ status messages and the associated automatic cordoning behavior were introduced in Azure Operator Nexus version 4.1. |
| 33 | + |
| 34 | +## Troubleshooting |
| 35 | + |
| 36 | +To check for any Bare Metal Machines (BMMs) which are currently degraded, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name> -o table`. This command shows the current status of all BMMs in the specified resource group, including any current _Degraded_ conditions included in the detailed status message. |
| 37 | + |
| 38 | +To see the current cordon status, including any nodes that were automatically cordoned due to _Degraded_ conditions, add a `--query` parameter that includes the `cordonStatus` field in the output, as in the following example. |
| 39 | + |
| 40 | +```azurecli |
| 41 | +az networkcloud baremetalmachine list -g <ResourceGroup_Name> --output table --query "[].{name:name,powerState:powerState,provisioningState:provisioningState,readyState:readyState,cordonStatus:cordonStatus,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage}" |
| 42 | +``` |
| 43 | + |
| 44 | +**Example Azure CLI output** |
| 45 | + |
| 46 | +``` |
| 47 | +Name PowerState ProvisioningState ReadyState CordonStatus DetailedStatus DetailedStatusMessage |
| 48 | +-------------- ------------ ------------------- ------------ -------------- ---------------- ----------------------------------------------------------------------------------------------------------------- |
| 49 | +rack2management1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 50 | +rack3management1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 51 | +rack2management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 52 | +rack1management1 Off Succeeded False Uncordoned Available Available to participate in the cluster. |
| 53 | +rack3management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 54 | +rack1management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 55 | +rack3compute1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 56 | +rack1compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 57 | +rack1compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 58 | +rack1compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 59 | +rack1compute8 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 60 | +rack2compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 61 | +rack2compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 62 | +rack1compute1 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: LACP status is down |
| 63 | +rack2compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 64 | +rack2compute1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 65 | +rack1compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 66 | +rack3compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 67 | +rack3compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 68 | +rack3compute8 Off Succeeded False Uncordoned Error This machine has failed hardware validation |
| 69 | +rack2compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 70 | +rack3compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 71 | +rack3compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 72 | +rack3compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 73 | +rack1compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 74 | +rack3compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 75 | +rack2compute8 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port is not functioning as expected |
| 76 | +rack2compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 77 | +rack1compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 78 | +rack2compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine. |
| 79 | +``` |
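
On larger clusters the table output can be long. As a minimal sketch, the following shell helper keeps the two header lines and any row whose detailed status message contains `Degraded:`. The `filter_degraded` name is hypothetical, and the helper assumes the table layout shown in the preceding example output.

```shell
# Hypothetical helper: read `--output table` text on stdin and keep the
# two header lines plus every row that mentions "Degraded:".
filter_degraded() {
  awk 'NR <= 2 || /Degraded:/'
}

# Intended usage (requires the Azure CLI; not executed here):
#   az networkcloud baremetalmachine list -g <ResourceGroup_Name> --output table \
#     --query "[].{name:name,cordonStatus:cordonStatus,detailedStatusMessage:detailedStatusMessage}" \
#     | filter_degraded
```

Filtering on the client side keeps the `--query` expression unchanged, so the same column selection works for both the full and the filtered view.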
| 80 | + |
| 81 | +To get more information about the cause of a degraded condition, inspect the `conditions` status of the corresponding Kubernetes BMM object by using an Azure CLI Bare Metal Machine `run-read-command` such as the following. |
| 82 | + |
| 83 | +```azurecli |
| 84 | +az networkcloud baremetalmachine run-read-command -g <ResourceGroup_Name> -n rack2management2 --limit-time-seconds 60 --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute8,-o,json]}]" --output-directory . |
| 85 | +``` |
| 86 | + |
| 87 | +- Replace `<ResourceGroup_Name>` with the name of the resource group containing the BMM resources. |
| 88 | +- Replace `rack2management2` with the name of a BMM resource for a healthy Kubernetes control plane node, from which to execute the `kubectl get` command. |
| 89 | +- Replace `rack2compute8` with the name of the degraded or cordoned BMM to inspect. |
| 90 | +- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md). |
| 91 | + |
| 92 | +Review the `lastTransitionTime` and `message` fields for more information about the corresponding degraded condition, as shown in the following example output. |
| 93 | + |
| 94 | +**Example `conditions` output** |
| 95 | + |
| 96 | +``` |
| 97 | +  "conditions": [ |
| 98 | +    { |
| 99 | +      "lastTransitionTime": "2025-01-30T23:54:04Z", |
| 100 | +      "status": "True", |
| 101 | +      "type": "BmmInExpectedLACPState" |
| 102 | +    }, |
| 103 | +    { |
| 104 | +      "lastTransitionTime": "2025-02-01T22:07:14Z", |
| 105 | +      "message": "Error: Port status for interface 98_p1 is down", |
| 106 | +      "reason": "Port status is down", |
| 107 | +      "severity": "Error", |
| 108 | +      "status": "False", |
| 109 | +      "type": "BmmInExpectedPortState" |
| 110 | +    }, |
| 111 | +    { |
| 112 | +      "lastTransitionTime": "2025-01-30T23:54:04Z", |
| 113 | +      "status": "True", |
| 114 | +      "type": "BmmInExpectedPowerState" |
| 115 | +    } |
| 116 | +  ], |
| 117 | +``` |
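
When the BMM object reports several conditions, it can help to list only the failing ones. The following is a minimal sketch that assumes `jq` is installed, that the `kubectl get ... -o json` output was saved to a local file, and that the conditions sit at `.status.conditions` in the object (an assumption based on common Kubernetes conventions, not confirmed by this article). The `failing_conditions` name and the `bmm.json` file name are hypothetical.

```shell
# Hypothetical helper: read a BMM object as JSON on stdin and print one
# tab-separated line (type, lastTransitionTime, message) for each
# condition whose status is "False".
# Assumption: the conditions array is located at .status.conditions.
failing_conditions() {
  jq -r '.status.conditions[]
         | select(.status == "False")
         | [.type, .lastTransitionTime, (.message // "")]
         | @tsv'
}

# Intended usage, where bmm.json is a placeholder for the saved output:
#   failing_conditions < bmm.json
```

Conditions with a `"True"` status are skipped, so an empty result suggests that no degraded condition is currently active.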
| 118 | + |
| 119 | +## Automatic Cordoning |
| 120 | + |
| 121 | +If an uncordoned BMM is in a _Degraded_ state for 15 minutes or more, the node might be automatically cordoned, depending on which degraded condition is present. |
| 122 | + |
| 123 | +- The `cordonStatus` field in the BMM object shows the current cordoning status of the node. |
| 124 | +- Only BMMs used for Compute are automatically cordoned; Control and Management nodes aren't automatically cordoned. |
| 125 | +- An automatically cordoned node will remain cordoned for 2 hours after the underlying conditions are resolved, after which it will be automatically uncordoned. |
| 126 | +- To uncordon a BMM manually, use the `az networkcloud baremetalmachine uncordon` command or execute the 'Uncordon' action from the Azure portal. |
| 127 | +- Manually uncordoning a BMM which is still in an active degraded state has no effect. The `uncordon` request will execute successfully, but the node will immediately be automatically cordoned again (and will remain cordoned until 2 hours after the underlying conditions are resolved, as normal). |
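
Because a manual uncordon is reverted while an auto-cordoning condition is still active, a script can check the detailed status message first. The following sketch matches only the two messages that the table in the Symptoms section marks as cordoning automatically. The `blocks_manual_uncordon` name and the `<BMM_Name>` placeholder are hypothetical, and the check applies to Compute nodes only.

```shell
# Hypothetical helper: succeed (exit status 0) when the detailed status
# message contains a degraded condition that triggers automatic cordoning.
blocks_manual_uncordon() {
  case "$1" in
    *"Degraded: port is not functioning as expected"*) return 0 ;;
    *"Degraded: LACP status is down"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Intended usage (requires the Azure CLI; not executed here):
#   msg=$(az networkcloud baremetalmachine show -g <ResourceGroup_Name> -n <BMM_Name> \
#           --query detailedStatusMessage --output tsv)
#   if blocks_manual_uncordon "$msg"; then
#     echo "Still degraded: resolve the underlying condition before uncordoning."
#   else
#     az networkcloud baremetalmachine uncordon -g <ResourceGroup_Name> -n <BMM_Name>
#   fi
```

The power state message is deliberately not matched, because that condition doesn't cordon the node automatically.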
| 128 | + |
| 129 | +To investigate whether a node is currently cordoned because of a recent _Degraded_ state or for another reason: |
| 130 | + |
| 131 | +- Review the `lastTransitionTime` in the `conditions` for the Kubernetes `bmm` resource, as described in the [Troubleshooting](#troubleshooting) section, to identify any recently resolved _Degraded_ conditions. |
| 132 | +- Review the Activity Logs for the BMM resource in the Azure portal to check for any user-initiated cordon requests. |
| 133 | + |
| 134 | +### Degraded: port is not functioning as expected |
| 135 | + |
| 136 | +This message in the BMM _Detailed status message_ field indicates that the physical link is down on one or more of the Mellanox interfaces on the underlying compute host. This scenario can indicate a cabling, switch port configuration, or hardware failure. |
| 137 | + |
| 138 | +To troubleshoot this issue: |
| 139 | + |
| 140 | +- Review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section. |
| 141 | +- Use this information to identify the affected port and the approximate time of the issue. |
| 142 | +- Check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port. |
| 143 | +- Check for any recent deployment or infrastructure changes that coincide with the time of failure. |
| 144 | + |
| 145 | +**Example `conditions` output for unexpected port state** |
| 146 | + |
| 147 | +``` |
| 148 | +  "conditions": [ |
| 149 | +    { |
| 150 | +      "lastTransitionTime": "2025-02-01T22:07:14Z", |
| 151 | +      "message": "Error: Port status for interface 98_p1 is down", |
| 152 | +      "reason": "Port status is down", |
| 153 | +      "severity": "Error", |
| 154 | +      "status": "False", |
| 155 | +      "type": "BmmInExpectedPortState" |
| 156 | +    } |
| 157 | +  ], |
| 158 | +``` |
| 159 | + |
| 160 | +### Degraded: LACP status is down |
| 161 | + |
| 162 | +This message in the BMM _Detailed status message_ field indicates a Link Aggregation Control Protocol (LACP) failure on the underlying compute host while the physical links are up. This scenario can indicate a cabling or Top Of Rack (TOR) switch configuration issue. |
| 163 | + |
| 164 | +To troubleshoot this issue: |
| 165 | + |
| 166 | +- Review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section. |
| 167 | +- Use this information to identify the affected port and the approximate time of the issue. |
| 168 | +- Check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port. |
| 169 | +- Check whether any other BMMs are also reporting port or LACP issues, which might help to identify mis-cabling or a wider issue with the TOR switch or network configuration. |
| 170 | +- Check for any recent deployment or infrastructure changes that coincide with the time of failure. |
| 171 | +- For more information about diagnosing and fixing LACP issues, see [Troubleshoot LACP Bonding](./troubleshoot-lacp-bonding.md). |
| 172 | + |
| 173 | +> [!WARNING] |
| 174 | +> As of version 4.1, there's a known issue where 'LACP degraded' status can be incorrectly reported at the same time as the `port is not functioning as expected` condition. This scenario can happen when a BMM is restarted or reimaged while the physical port is down. This issue will be fixed in a future release. In the meantime, the LACP degraded status can be safely ignored if the physical port is also down. |
| 175 | + |
| 176 | +**Example `conditions` output for unexpected LACP state** |
| 177 | + |
| 178 | +``` |
| 179 | +  "conditions": [ |
| 180 | +    { |
| 181 | +      "lastTransitionTime": "2025-01-31T12:24:27Z", |
| 182 | +      "message": "Error: LACP status for interface 4b_p0 is down, LACP status for interface 4b_p1 is down", |
| 183 | +      "reason": "LACP status is down", |
| 184 | +      "severity": "Error", |
| 185 | +      "status": "False", |
| 186 | +      "type": "BmmInExpectedLACPState" |
| 187 | +    } |
| 188 | +  ], |
| 189 | +``` |
| 190 | + |
| 191 | +### Degraded: BMM power state doesn't match expected state |
| 192 | + |
| 193 | +This message in the BMM _Detailed status message_ field indicates that either: |
| 194 | + |
| 195 | +- the underlying host is powered off when it should be on, or |
| 196 | +- the underlying host is powered on when it should be off. |
| 197 | + |
| 198 | +This condition can happen temporarily during a normal Restart, Reimage, or similar BMM lifecycle event. However, a persistent 'unexpected power state' message can indicate an issue with the underlying compute host or baseboard management controller (BMC). |
| 199 | + |
| 200 | +To troubleshoot this issue: |
| 201 | + |
| 202 | +- Review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section. |
| 203 | +- Use this information to identify the approximate time of the issue and any other available details. |
| 204 | +- Check the power cabling and physical hardware for the specified BMM. |
| 205 | +- Check whether any other BMMs are also reporting an unexpected degraded state, which might indicate a broader issue with the underlying infrastructure. |
| 206 | +- Check for any recent deployment or infrastructure changes that coincide with the time of failure. |
| 207 | +- Review the power state and logs on the BMC for the affected host. |
| 208 | + |
| 209 | +For more information about logging in to the BMC, see [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md). |
| 210 | + |
| 211 | +**Example `conditions` output for unexpected power state** |
| 212 | + |
| 213 | +``` |
| 214 | +  "conditions": [ |
| 215 | +    { |
| 216 | +      "lastTransitionTime": "2025-02-03T22:35:55Z", |
| 217 | +      "message": "BareMetalMachine expected to be powered on", |
| 218 | +      "reason": "BmmPoweredOnExpected", |
| 219 | +      "severity": "Error", |
| 220 | +      "status": "False", |
| 221 | +      "type": "BmmInExpectedPowerState" |
| 222 | +    } |
| 223 | +  ], |
| 224 | +``` |