Commit 6ad5b98

markups from self-review
1 parent 20bd1ac commit 6ad5b98

1 file changed

+48
-48
lines changed

articles/operator-nexus/troubleshoot-bmm-degraded.md

Lines changed: 48 additions & 48 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,6 @@
11
---
22
title: Troubleshoot BMM Degraded issues in Azure Operator Nexus
3-
description: Troubleshooting guide for Bare Metal Machines in *degraded* status in Azure Operator Nexus.
3+
description: Troubleshooting guide for Bare Metal Machines in 'Degraded' status in Azure Operator Nexus.
44
ms.service: azure-operator-nexus
55
ms.custom: troubleshooting
66
ms.topic: troubleshooting
@@ -21,21 +21,21 @@ Bare Metal Machines (BMM) which are in _Degraded_ state exhibit the following sy
2121
- The Detailed status message includes one or more _Degraded_ messages as shown in the following table.
2222
- The BMM might be automatically cordoned, if the resource is continuously degraded for 15 minutes or longer (for Compute nodes only).
2323
- The BMM will then remain cordoned for 2 hours after the underlying conditions resolve, after which it will be automatically uncordoned.
24-
- Control and Management nodes can also be reported as _Degraded_, but aren't automatically cordoned.
24+
- Control and Management nodes can be reported as _Degraded_, but aren't automatically cordoned.
2525

26-
| Detailed status message | Cordon automatically? |
27-
| -------------------------------------------------------- | --------------------- |
28-
| `Degraded: port is not functioning as expected` | Yes |
29-
| `Degraded: LACP status is down` | Yes |
30-
| `Degraded: BMM power state doesn't match expected state` | No |
26+
| Detailed status message | Cordon automatically? | Details and mitigation |
27+
| -------------------------------------------------------- | --------------------- | ----------------------------------------------------------------------------------------------------------------- |
28+
| `Degraded: port is not functioning as expected` | Yes | [Degraded: `port is not functioning as expected`](#degraded-port-is-not-functioning-as-expected) |
29+
| `Degraded: LACP status is down` | Yes | [Degraded: `LACP status is down`](#degraded-lacp-status-is-down) |
30+
| `Degraded: BMM power state doesn't match expected state` | No | [Degraded: `BMM power state doesn't match expected state`](#degraded-bmm-power-state-doesnt-match-expected-state) |
3131

32-
The _Degraded_ status messages and associated automatic cordoning behavior was introduced in Azure Operator Nexus version 4.1.
32+
_Degraded_ status messages and associated automatic cordoning behavior are present in Azure Operator Nexus version 4.1 and higher.
3333

3434
## Troubleshooting
3535

36-
To check for any Bare Metal Machines (BMMs) which are currently degraded, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name> -o table`. This command shows the current status of all BMMs in the specified resource group, including any current _Degraded_ conditions included in the detailed status message.
36+
To check for any Bare Metal Machines (BMMs) which are currently degraded, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name> -o table`. This command shows the current status of all BMMs in the specified resource group. Any active _Degraded_ conditions are visible in the detailed status message.
3737

38-
To see the current Cordoning status, including any nodes which might be automatically cordoned due to _Degraded_ conditions, include a `--query` parameter which includes the `cordonStatus` field in the output, as seen in the following example.
38+
To see the current cordoning status, include a `--query` parameter which specifies the `cordonStatus` field, as seen in the following example. This command can help to identify any compute nodes which are still automatically cordoned due to recently resolved _Degraded_ conditions.
3939

4040
```azurecli
4141
az networkcloud baremetalmachine list -g <ResourceGroup_Name> --output table --query "[].{name:name,powerState:powerState,provisioningState:provisioningState,readyState:readyState,cordonStatus:cordonStatus,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage}"
@@ -52,41 +52,41 @@ rack2management2 On Succeeded True Uncordoned
5252
rack1management1 Off Succeeded False Uncordoned Available Available to participate in the cluster.
5353
rack3management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
5454
rack1management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
55-
rack3compute1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
56-
rack1compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
57-
rack1compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
58-
rack1compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
59-
rack1compute8 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
60-
rack2compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
61-
rack2compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
62-
rack1compute1 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: LACP status is down
63-
rack2compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
64-
rack2compute1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
65-
rack1compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
66-
rack3compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
67-
rack3compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
68-
rack3compute8 Off Succeeded False Uncordoned Error This machine has failed hardware validation
69-
rack2compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
70-
rack3compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
71-
rack3compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
72-
rack3compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
73-
rack1compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
74-
rack3compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
75-
rack2compute8 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port is not functioning as expected
76-
rack2compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
77-
rack1compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
78-
rack2compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
55+
rack3compute01 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
56+
rack1compute05 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
57+
rack1compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
58+
rack1compute03 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
59+
rack1compute08 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
60+
rack2compute05 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
61+
rack2compute03 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
62+
rack1compute01 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: LACP status is down
63+
rack2compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
64+
rack2compute01 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
65+
rack1compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
66+
rack3compute06 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
67+
rack3compute05 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
68+
rack3compute08 Off Succeeded False Uncordoned Error This machine has failed hardware validation
69+
rack2compute06 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
70+
rack3compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
71+
rack3compute03 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
72+
rack3compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
73+
rack1compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
74+
rack3compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
75+
rack2compute08 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port is not functioning as expected
76+
rack2compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
77+
rack1compute06 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
78+
rack2compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
7979
```
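
The table output above can also be filtered programmatically. The following Python is a minimal sketch, assuming the same `--query` expression was run with `--output json` instead of `--output table`, so that each entry carries the `name`, `cordonStatus`, and `detailedStatusMessage` keys. The helper name `find_attention_needed` and the sample data are hypothetical and only for illustration.

```python
import json


def find_attention_needed(bmm_list_json: str) -> list[dict]:
    """Return the BMMs that are cordoned or report a Degraded message."""
    flagged = []
    for bmm in json.loads(bmm_list_json):
        # detailedStatusMessage can be missing or null, so fall back to "".
        message = bmm.get("detailedStatusMessage") or ""
        cordoned = bmm.get("cordonStatus") == "Cordoned"
        degraded = "Degraded:" in message
        if cordoned or degraded:
            flagged.append(
                {
                    "name": bmm.get("name"),
                    "cordoned": cordoned,
                    "degraded": degraded,
                    "message": message,
                }
            )
    return flagged


# Hypothetical sample, shaped like the JSON form of the query output above.
sample = json.dumps(
    [
        {
            "name": "rack1compute01",
            "cordonStatus": "Cordoned",
            "detailedStatusMessage": "The OS is provisioned to the machine. "
            "Degraded: LACP status is down",
        },
        {
            "name": "rack1compute02",
            "cordonStatus": "Uncordoned",
            "detailedStatusMessage": "The OS is provisioned to the machine.",
        },
    ]
)

for entry in find_attention_needed(sample):
    print(entry["name"], entry["cordoned"], entry["message"])
```

A node that is cordoned without a _Degraded_ message is still flagged, because it might have been cordoned manually or might still be inside the 2 hour window after a degraded condition resolved.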
8080

81-
To get more information about the cause of a degraded condition, inspect the `conditions` status of the corresponding kubernetes BMM object, using an Azure CLI Bare Metal Machine `run-read-command` command such as the following.
81+
For more information about the cause of a degraded condition, use an Azure CLI Bare Metal Machine `run-read-command` command such as the following to inspect the `conditions` status of the corresponding Kubernetes BMM object.
8282

8383
```azurecli
84-
az networkcloud baremetalmachine run-read-command -g <ResourceGroup_Name> -n rack2management2 --limit-time-seconds 60 --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute8,-o,json]}]" --output-directory .
84+
az networkcloud baremetalmachine run-read-command -g <ResourceGroup_Name> -n rack2management2 --limit-time-seconds 60 --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute08,-o,json]}]" --output-directory .
8585
```
8686

8787
- Replace `<ResourceGroup_Name>` with the name of the resource group containing the BMM resources.
8888
- Replace `rack2management2` with the name of a BMM resource for a healthy Kubernetes control plane node, from which to execute the `kubectl get` command.
89-
- Replace `rack2compute8` with the name of the degraded or cordoned BMM to inspect.
89+
- Replace `rack2compute08` with the name of the degraded or cordoned BMM to inspect.
9090
- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
9191

9292
Review the `lastTransitionTime` and `message` fields for more information about the corresponding degraded condition, as shown in the following example output.
@@ -118,20 +118,20 @@ Review the `lastTransitionTime` and `message` fields for more information about
118118

119119
## Automatic Cordoning
120120

121-
If an uncordoned BMM is in a _Degraded_ state for 15 minutes or more, the node might be automatically cordoned, depending on which degraded condition is present.
121+
If an uncordoned BMM is in a _Degraded_ state for 15 minutes or more, the node might be automatically cordoned, depending on which degraded conditions are present.
122122

123-
- The `cordonStatus` field in the BMM object shows the current cordoning status of the node.
124-
- Only BMMs used for Compute are automatically cordoned; Control and Management nodes aren't automatically cordoned.
123+
- The `cordonStatus` field in the BMM object shows the current cordoning state of the node.
124+
- Only BMMs used for Compute are automatically cordoned. Control and Management nodes aren't automatically cordoned.
125125
- An automatically cordoned node will remain cordoned for 2 hours after the underlying conditions are resolved, after which it will be automatically uncordoned.
126-
- To uncordon a BMM manually, use the `az networkcloud baremetalmachine uncordon` command or execute the 'Uncordon' action from the Azure portal.
127-
- Manually uncordoning a BMM which is still in an active degraded state has no effect. The `uncordon` request will execute successfully, but the node will immediately be automatically cordoned again (and will remain cordoned until 2 hours after the underlying conditions are resolved, as normal).
126+
- To uncordon a BMM manually, use the `az networkcloud baremetalmachine uncordon` command or execute the _Uncordon_ action from the Azure portal.
127+
- Manually uncordoning a BMM which still has a degraded condition has no lasting effect. The _Uncordon_ request executes successfully, but the node is immediately cordoned again automatically and remains cordoned until 2 hours after the underlying conditions are resolved.
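
The timing rules in this list can be modeled in a few lines of Python. This is only a sketch of the documented behavior, not the platform's implementation. The function names and the `"compute"` role label are hypothetical. The 15 minute and 2 hour values and the two cordoning messages are taken from this article.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the article text.
DEGRADED_THRESHOLD = timedelta(minutes=15)
UNCORDON_DELAY = timedelta(hours=2)
AUTO_CORDON_MESSAGES = (
    "Degraded: port is not functioning as expected",
    "Degraded: LACP status is down",
)


def should_auto_cordon(role: str, message: str,
                       degraded_since: datetime, now: datetime) -> bool:
    """Model of the documented rule: Compute nodes only, cordoning
    conditions only, and degraded for at least 15 minutes."""
    if role != "compute":  # hypothetical role label
        return False
    if not any(m in message for m in AUTO_CORDON_MESSAGES):
        return False
    return now - degraded_since >= DEGRADED_THRESHOLD


def expected_uncordon_time(resolved_at: datetime) -> datetime:
    """Earliest time the node is expected to be uncordoned automatically."""
    return resolved_at + UNCORDON_DELAY


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
status = "The OS is provisioned to the machine. Degraded: LACP status is down"
print(should_auto_cordon("compute", status, now - timedelta(minutes=20), now))
print(expected_uncordon_time(now).isoformat())
```

The message check uses substring matching because the detailed status message can contain other text before the _Degraded_ entry, as in the example output in the Troubleshooting section.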
128128

129-
To investigate whether a currently cordoned node is due to a recent _Degraded_ state or other reason:
129+
To investigate whether a node is currently cordoned due to a recent _Degraded_ state:
130130

131131
- Review the `lastTransitionTime` in the `conditions` for the kubernetes `bmm` resource, as described in the [Troubleshooting](#troubleshooting) section, to identify any recently resolved _Degraded_ conditions.
132132
- Review the Activity Logs for the BMM resource in the Azure portal to check for any user initiated cordon requests.
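
The first check can be sketched in Python. The example assumes the `conditions` array from the `kubectl get ... -o json` output has been loaded into a list of dictionaries with `lastTransitionTime` and `message` fields and RFC 3339 timestamps ending in `Z`. The function name `recent_transitions`, the condition `type` values, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recent_transitions(conditions: list[dict], now: datetime,
                       window: timedelta = timedelta(hours=2)) -> list[dict]:
    """Return conditions whose lastTransitionTime falls inside the window,
    most recent first. The default window matches the 2 hour period during
    which an automatically cordoned node stays cordoned."""
    recent = []
    for cond in conditions:
        stamp = cond.get("lastTransitionTime")
        if not stamp:
            continue
        # Convert the trailing "Z" so datetime.fromisoformat accepts it.
        changed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if timedelta(0) <= now - changed <= window:
            recent.append(cond)
    recent.sort(key=lambda c: c["lastTransitionTime"], reverse=True)
    return recent


# Hypothetical sample data; real condition types and messages will differ.
sample_conditions = [
    {"type": "ExampleConditionA", "status": "True",
     "lastTransitionTime": "2024-01-01T11:15:00Z", "message": "example A"},
    {"type": "ExampleConditionB", "status": "True",
     "lastTransitionTime": "2023-12-30T08:00:00Z", "message": "example B"},
]

check_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
for cond in recent_transitions(sample_conditions, check_time):
    print(cond["type"], cond["lastTransitionTime"], cond["message"])
```

If no condition changed inside the window, the cordon is more likely to be the result of a user request, which the Activity Logs check can confirm.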
133133

134-
### Degraded: `port is not functioning as expected`
134+
## Degraded: `port is not functioning as expected`
135135

136136
This message in the BMM _Detailed status message_ field indicates that the physical link is down on one or more of the Mellanox interfaces on the underlying compute host. This scenario can indicate a cabling, switch port configuration, or hardware failure.
137137

@@ -157,7 +157,7 @@ To troubleshoot this issue:
157157
],
158158
```
159159

160-
### Degraded: LACP status is down
160+
## Degraded: `LACP status is down`
161161

162162
This message in the BMM _Detailed status message_ field indicates a Link Aggregation Control Protocol (LACP) failure on the underlying compute host, when the physical links are physically up. This scenario can indicate a cabling or Top Of Rack (TOR) switch configuration issue.
163163

@@ -171,7 +171,7 @@ To troubleshoot this issue:
171171
- for more information about diagnosing and fixing LACP issues, see [Troubleshoot LACP Bonding](./troubleshoot-lacp-bonding.md).
172172

173173
> [!WARNING]
174-
> As of version 4.1, there's a known issue where 'LACP degraded' status can be incorrectly reported at the same time as the `port is not functioning as expected` condition. This scenario can happen when a BMM is restarted or reimaged while the physical port is down. This issue will be fixed in a future release. In the meantime, the LACP degraded status can be safely ignored if the physical port is also down.
174+
> As of version 4.1, there's a known issue where `LACP status is down` can be incorrectly reported in addition to the `port is not functioning as expected` message during a port down scenario. This issue can happen when a BMM is restarted or reimaged while the physical port is down. This issue will be fixed in a future release. In the meantime, the `LACP status is down` warning can be safely ignored if the physical port is also down.
175175
176176
**Example `conditions` output for unexpected LACP state**
177177

@@ -188,7 +188,7 @@ To troubleshoot this issue:
188188
],
189189
```
190190

191-
### Degraded: BMM power state doesn't match expected state
191+
## Degraded: `BMM power state doesn't match expected state`
192192

193193
This message in the BMM _Detailed status message_ field indicates that either:
194194

@@ -201,7 +201,7 @@ To troubleshoot this issue:
201201

202202
- review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
203203
- this information should identify the approximate time of the issue and any other available details
204-
- check the power cabling and physical hardware for the specified BMM
204+
- check the power feed, power cables, and physical hardware for the specified BMM
205205
- check whether any other BMMs are also reporting an unexpected degraded state, which might indicate a broader issue with the underlying infrastructure
206206
- check for any recent deployment or infrastructure changes which coincide with the time of failure
207207
- review the power state and logs on the BMC for the affected host.
