Commit ab96680

BMM troubleshooting updates for 2503.1
1 parent f9be6f9 commit ab96680

3 files changed

+260
-77
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 3 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -229,7 +229,7 @@
229229
- name: How to upgrade OS of terminal server
230230
href: howto-upgrade-os-of-terminal-server.md
231231
- name: How to restrict serial port access and set timeout on terminal-server
232-
href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
232+
href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
233233
- name: Cluster
234234
expanded: false
235235
items:
@@ -282,7 +282,6 @@
282282
- name: Kubernetes cluster features
283283
href: howto-kubernetes-cluster-features.md
284284

285-
286285
- name: Nexus Virtual Machine
287286
expanded: false
288287
items:
@@ -359,6 +358,8 @@
359358
href: troubleshoot-hardware-validation-failure.md
360359
- name: Troubleshoot Degraded status
361360
href: troubleshoot-bare-metal-machine-degraded.md
361+
- name: Troubleshoot Warning status
362+
href: troubleshoot-bare-metal-machine-warning.md
362363
- name: Troubleshoot Control Plane Quorum
363364
href: troubleshoot-control-plane-quorum.md
364365
- name: Troubleshoot Accepted Cluster Resource

articles/operator-nexus/troubleshoot-bare-metal-machine-degraded.md

Lines changed: 96 additions & 75 deletions
Original file line number | Diff line number | Diff line change
@@ -4,7 +4,7 @@ description: Troubleshooting guide for Bare Metal Machines in 'Degraded' status
44
ms.service: azure-operator-nexus
55
ms.custom: azure-operator-nexus
66
ms.topic: troubleshooting
7-
ms.date: 02/03/2025
7+
ms.date: 03/03/2025
88
author: robertstarling
99
ms.author: robstarling
1010
ms.reviewer: ekarandjeff
@@ -19,15 +19,15 @@ This document provides basic troubleshooting information for Bare Metal Machine
1919
Bare Metal Machines (BMM) which are in _Degraded_ state exhibit the following symptoms.
2020

2121
- The Detailed status message includes one or more _Degraded_ messages as shown in the following table.
22-
- The BMM might be automatically cordoned, if the resource is continuously degraded for 15 minutes or longer (for Compute nodes only).
22+
- The BMM is automatically cordoned once the resource is continuously degraded for more than 15 minutes (for Compute nodes only).
2323
- The BMM will then remain cordoned for 2 hours after the underlying conditions resolve, after which it will be automatically uncordoned.
2424
- Control and Management nodes can be reported as _Degraded_, but aren't automatically cordoned.
2525

26-
| Detailed status message | Cordon automatically? | Details and mitigation |
27-
| -------------------------------------------------------- | --------------------- | ----------------------------------------------------------------------------------------------------------------- |
28-
| `Degraded: port is not functioning as expected` | Yes | [Degraded: `port is not functioning as expected`](#degraded-port-is-not-functioning-as-expected) |
29-
| `Degraded: LACP status is down` | Yes | [Degraded: `LACP status is down`](#degraded-lacp-status-is-down) |
30-
| `Degraded: BMM power state doesn't match expected state` | No | [Degraded: `BMM power state doesn't match expected state`](#degraded-bmm-power-state-doesnt-match-expected-state) |
26+
| Detailed status message | Details and mitigation |
27+
| ------------------------------- | ---------------------------------------------------------------- |
28+
| `Degraded: port down` | [`Degraded: port down`](#degraded-port-down) |
29+
| `Degraded: port flapping` | [`Degraded: port flapping`](#degraded-port-flapping) |
30+
| `Degraded: LACP status is down` | [`Degraded: LACP status is down`](#degraded-lacp-status-is-down) |
3131

3232
_Degraded_ status messages and associated automatic cordoning behavior are present in Azure Operator Nexus version 2502.1 and higher.
3333

@@ -59,7 +59,7 @@ rack1compute03 On Succeeded True Uncordoned
5959
rack1compute08 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
6060
rack2compute05 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
6161
rack2compute03 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
62-
rack1compute01 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: LACP status is down
62+
rack1compute01 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: port down
6363
rack2compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
6464
rack2compute01 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
6565
rack1compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
@@ -72,13 +72,29 @@ rack3compute03 On Succeeded True Uncordoned
7272
rack3compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
7373
rack1compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
7474
rack3compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
75-
rack2compute08 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port is not functioning as expected
75+
rack2compute08 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port flapping Degraded: port down
7676
rack2compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
7777
rack1compute06 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
7878
rack2compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
7979
```
8080

81-
For more information, use an Azure CLI Bare Metal Machine `run-read-command` command such as the following to inspect the `conditions` status of the corresponding kubernetes BMM object.
81+
Additional information about recent degraded conditions and automatic cordoning is available in the following fields on the `bmm` Kubernetes resource.
82+
83+
- `degradedStartTime` and `degradedEndTime` show the start and end time of the most recent _degraded_ state
84+
- `conditions` shows the status of any individual conditions which are contributing to a _degraded_ state
85+
- `cordonStatus` indicates whether the node is currently cordoned or uncordoned
86+
- `annotations` shows which conditions triggered the current cordon, if automatically cordoned.
87+
- `platform.afo-nc.microsoft.com/lacp-down-cordon`
88+
- `platform.afo-nc.microsoft.com/port-down-cordon`
89+
- `platform.afo-nc.microsoft.com/port-flap-cordon`
90+
- If the user manually cordoned the BMM, the following annotation is also present.
91+
- `platform.afo-nc.microsoft.com/cutomer-cordon`
92+
- The Activity Logs for the BMM resource in the Azure portal can also provide more information about any recent user-initiated cordon requests.
93+
94+
- The `annotations` metadata on the `bmm` kubernetes resource shows which condition triggered the cordon.
95+
- The `conditions` status on the `bmm` kubernetes object shows the current status and timestamp for any triggering conditions.
96+
97+
To view these `bmm` Kubernetes resource fields, use the Azure CLI `run-read-command` command, as shown in the following example.
8298

8399
```azurecli
84100
az networkcloud baremetalmachine run-read-command -g <ResourceGroup_Name> -n rack2management2 --limit-time-seconds 60 --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute08,-o,json]}]" --output-directory .
@@ -87,51 +103,66 @@ az networkcloud baremetalmachine run-read-command -g <ResourceGroup_Name> -n rac
87103
- Replace `<ResourceGroup_Name>` with the name of the resource group containing the BMM resources.
88104
- Replace `rack2management2` with the name of a BMM resource for a healthy Kubernetes control plane node, from which to execute the `kubectl get` command.
89105
- Replace `rack2compute08` with the name of the degraded or cordoned BMM to inspect.
90-
- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
91106

92-
Review the `lastTransitionTime` and `message` fields for more information about the corresponding degraded condition, as shown in the following example output.
107+
For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
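When several BMMs need to be inspected, the command above can be assembled programmatically. The following Python sketch only builds the argument list; it doesn't run anything. The function name `build_run_read_command` and its parameters are hypothetical, and the flags and `--commands` syntax are copied from the example above rather than from a separate API reference.

```python
from typing import List


def build_run_read_command(
    resource_group: str,
    control_plane_bmm: str,
    target_bmm: str,
    limit_seconds: int = 60,
) -> List[str]:
    """Build the argv for the run-read-command example (hypothetical helper)."""
    # Same shorthand syntax as the documented example; only the BMM name varies.
    commands = (
        "[{command:'kubectl get',"
        f"arguments:[-n,nc-system,bmm,{target_bmm},-o,json]}}]"
    )
    return [
        "az", "networkcloud", "baremetalmachine", "run-read-command",
        "-g", resource_group,
        "-n", control_plane_bmm,  # healthy control plane node that runs kubectl
        "--limit-time-seconds", str(limit_seconds),
        "--commands", commands,
        "--output-directory", ".",
    ]


argv = build_run_read_command("<ResourceGroup_Name>", "rack2management2", "rack2compute08")
print(" ".join(argv))
```

Passing the result to a process runner as a list (rather than a single shell string) avoids having to quote the `--commands` value again.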
108+
109+
**Example `run-read-command` output (`kubectl get bmm`):**
93110

94-
**Example `conditions` output:**
111+
This example shows an automatically cordoned BMM with two active _Degraded_ conditions.
95112

96113
```
97-
"conditions": [
98-
{
99-
"lastTransitionTime": "2025-01-30T23:54:04Z",
100-
"status": "True",
101-
"type": "BmmInExpectedLACPState"
102-
},
103-
{
104-
"lastTransitionTime": "2025-02-01T22:07:14Z",
105-
"message": "Error: Port status for interface 98_p1 is down",
106-
"reason": "Port status is down",
107-
"severity": "Error",
108-
"status": "False",
109-
"type": "BmmInExpectedPortState"
110-
},
111-
{
112-
"lastTransitionTime": "2025-01-30T23:54:04Z",
113-
"status": "True",
114-
"type": "BmmInExpectedPowerState"
114+
{
115+
"metadata": {
116+
"annotations": {
117+
"platform.afo-nc.microsoft.com/port-down-cordon": "true",
118+
"platform.afo-nc.microsoft.com/port-flap-cordon": "true"
115119
}
116-
],
120+
},
121+
"status": {
122+
"conditions": [
123+
{
124+
"lastTransitionTime": "2025-03-04T02:47:59Z",
125+
"status": "True",
126+
"type": "BmmInExpectedLACPState"
127+
},
128+
{
129+
"lastTransitionTime": "2025-03-04T03:27:00Z",
130+
"message": "Physical link(s) down: 4b_p1",
131+
"reason": "PortDown",
132+
"status": "False",
133+
"type": "BmmNetworkLinksUp"
134+
},
135+
{
136+
"lastTransitionTime": "2025-03-04T03:49:00Z",
137+
"message": "Port flapping in the last 15 mins: 4b_p1 (2 times)",
138+
"reason": "PortFlappingDetected",
139+
"status": "False",
140+
"type": "BmmNetworkLinksStable"
141+
}
142+
],
143+
"cordonStatus": "Cordoned",
144+
"degradedStartTime": "2025-03-04T03:27:00Z",
145+
"detailedStatus": "Provisioned",
146+
"detailedStatusMessage": "The OS is provisioned to the machine. Degraded: port flapping Degraded: port down"
147+
}
148+
}
117149
```
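The fields in the example output above can be condensed into a short summary for triage. The following Python sketch is one possible way to do that, not part of the product tooling. It assumes that a condition with `"status": "False"` is one that is currently failing, and that cordon annotations share the `platform.afo-nc.microsoft.com/` prefix and end in `-cordon`, as in the example. The function name `summarize_bmm` and the embedded sample are illustrative.

```python
import json

# Annotation prefix as seen in the example output (assumed to be stable).
ANNOTATION_PREFIX = "platform.afo-nc.microsoft.com/"

# Trimmed copy of the example output, written as strict JSON.
SAMPLE_BMM_JSON = """
{
  "metadata": {
    "annotations": {
      "platform.afo-nc.microsoft.com/port-down-cordon": "true",
      "platform.afo-nc.microsoft.com/port-flap-cordon": "true"
    }
  },
  "status": {
    "conditions": [
      {"lastTransitionTime": "2025-03-04T02:47:59Z", "status": "True",
       "type": "BmmInExpectedLACPState"},
      {"lastTransitionTime": "2025-03-04T03:27:00Z",
       "message": "Physical link(s) down: 4b_p1",
       "reason": "PortDown", "status": "False", "type": "BmmNetworkLinksUp"},
      {"lastTransitionTime": "2025-03-04T03:49:00Z",
       "message": "Port flapping in the last 15 mins: 4b_p1 (2 times)",
       "reason": "PortFlappingDetected", "status": "False",
       "type": "BmmNetworkLinksStable"}
    ],
    "cordonStatus": "Cordoned",
    "degradedStartTime": "2025-03-04T03:27:00Z"
  }
}
"""


def summarize_bmm(bmm: dict) -> dict:
    """Collect cordon state, cordon annotations, and failing conditions."""
    annotations = bmm.get("metadata", {}).get("annotations", {})
    status = bmm.get("status", {})

    # Keep only cordon-related annotations that are set to "true".
    cordon_annotations = sorted(
        key[len(ANNOTATION_PREFIX):]
        for key, value in annotations.items()
        if key.startswith(ANNOTATION_PREFIX)
        and key.endswith("-cordon")
        and value == "true"
    )

    # Assumption: status "False" marks a condition that is not being met.
    failing_conditions = [
        {
            "type": cond.get("type"),
            "reason": cond.get("reason"),
            "message": cond.get("message"),
            "since": cond.get("lastTransitionTime"),
        }
        for cond in status.get("conditions", [])
        if cond.get("status") == "False"
    ]

    return {
        "cordon_status": status.get("cordonStatus"),
        "degraded_since": status.get("degradedStartTime"),
        "cordon_annotations": cordon_annotations,
        "failing_conditions": failing_conditions,
    }


print(json.dumps(summarize_bmm(json.loads(SAMPLE_BMM_JSON)), indent=2))
```

In practice the input would be the JSON file written to the `--output-directory` by `run-read-command` instead of the embedded sample.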
118150

119151
## Automatic Cordoning
120152

121-
If an uncordoned BMM is in a _Degraded_ state for 15 minutes or more, the node might be automatically cordoned, depending on which degraded conditions are present.
153+
If an uncordoned Compute BMM remains in a _Degraded_ state for more than 15 minutes, the node is automatically cordoned.
122154

123-
- The `cordonStatus` field in the BMM object shows the current state of the node.
124-
- Only BMMs used for Compute are automatically cordoned. Control and Management nodes aren't automatically cordoned.
125155
- An automatically cordoned node will remain cordoned for 2 hours after the underlying conditions are resolved, after which it will be automatically uncordoned.
126156
- To uncordon a BMM manually, use the `az networkcloud baremetalmachine uncordon` command or execute the _Uncordon_ action from the Azure portal.
127-
- Manually uncordoning a BMM which still has a degraded condition has no effect. The _Uncordon_ request will execute successfully, but the node will immediately be automatically cordoned again until 2 hours after the underlying conditions are resolved.
157+
- Manually uncordoning a BMM which still has an active degraded condition isn't allowed. In this case, the _Uncordon_ request is rejected with an error message similar to the following.
128158

129-
To investigate whether a currently cordoned BMM is due to a recent _Degraded_ state:
159+
`action rejected: baremetalmachine 'rack1compute01' currently degraded since 2025-02-26 05:26:09 +0000 UTC`
130160

131-
- Review the `lastTransitionTime` in the `conditions` for the kubernetes `bmm` resource, as described in the [Troubleshooting](#troubleshooting) section, to identify any recently resolved _Degraded_ conditions.
132-
- Review the Activity Logs for the BMM resource in the Azure portal to check for any user initiated cordon requests.
161+
Note: only BMMs used for _Compute_ are automatically cordoned. Control and Management nodes aren't automatically cordoned.
133162

134-
## Degraded: `port is not functioning as expected`
163+
For more information about investigating the root cause of an automatic cordon, see [Troubleshooting](#troubleshooting).
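The timing rules in this section can be worked through with simple timestamp arithmetic. The following Python sketch estimates when an automatic cordon and a later automatic uncordon could occur. It assumes the 15-minute window starts at `degradedStartTime` and the 2-hour window starts at `degradedEndTime`; the document doesn't state which timestamps the platform actually uses, so treat the results as estimates. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the text of this section.
AUTO_CORDON_AFTER = timedelta(minutes=15)
AUTO_UNCORDON_AFTER = timedelta(hours=2)


def parse_timestamp(value: str) -> datetime:
    """Parse timestamps in the format shown in the example output."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def earliest_auto_cordon(degraded_start: str) -> datetime:
    """Earliest time a continuously degraded Compute BMM could be cordoned."""
    # Assumption: the 15-minute window is measured from degradedStartTime.
    return parse_timestamp(degraded_start) + AUTO_CORDON_AFTER


def earliest_auto_uncordon(degraded_end: str) -> datetime:
    """Estimated time of automatic uncordon after the conditions resolve."""
    # Assumption: the 2-hour window is measured from degradedEndTime.
    return parse_timestamp(degraded_end) + AUTO_UNCORDON_AFTER


print(earliest_auto_cordon("2025-03-04T03:27:00Z").isoformat())
print(earliest_auto_uncordon("2025-03-04T05:00:00Z").isoformat())
```

The second timestamp is a made-up resolution time used only to illustrate the calculation.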
164+
165+
## `Degraded: port down`
135166

136167
This message in the BMM _Detailed status message_ field indicates that the physical link is down on one or more of the Mellanox interfaces on the underlying compute host. This scenario can indicate a cabling, switch port configuration, or hardware failure.
137168

@@ -142,22 +173,21 @@ To troubleshoot this issue:
142173
- check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port
143174
- check for any recent deployment or infrastructure changes which coincide with the time of failure.
144175

145-
**Example `conditions` output for unexpected port state**
176+
**Example `conditions` output for port down**
146177

147178
```
148179
"conditions": [
149180
{
150-
"lastTransitionTime": "2025-02-01T22:07:14Z",
151-
"message": "Error: Port status for interface 98_p1 is down",
152-
"reason": "Port status is down",
153-
"severity": "Error",
181+
"lastTransitionTime": "2025-03-04T03:27:00Z",
182+
"message": "Physical link(s) down: 4b_p1",
183+
"reason": "PortDown",
154184
"status": "False",
155-
"type": "BmmInExpectedPortState"
156-
}
185+
"type": "BmmNetworkLinksUp"
186+
},
157187
],
158188
```
159189

160-
## Degraded: `LACP status is down`
190+
## `Degraded: LACP status is down`
161191

162192
This message in the BMM _Detailed status message_ field indicates a Link Aggregation Control Protocol (LACP) failure on the underlying compute host, when the physical links are physically up. This scenario can indicate a cabling or Top Of Rack (TOR) switch configuration issue.
163193

@@ -171,7 +201,7 @@ To troubleshoot this issue:
171201
- for more information about diagnosing and fixing LACP issues, see [Troubleshoot LACP Bonding](./troubleshoot-lacp-bonding.md).
172202

173203
> [!WARNING]
174-
> As of version 2502.1, there's a known issue where `LACP status is down` can be incorrectly reported in addition to the `port is not functioning as expected` message during a port down scenario. This issue can happen when a BMM is restarted or reimaged while the physical port is down. This issue will be fixed in a future release. In the meantime, the `LACP status is down` warning can be safely ignored if the physical port is also down.
204+
> In version 2502.1, there's a known issue where `LACP status is down` can be incorrectly reported in addition to a `port is not functioning as expected` message during a port down scenario. This issue can happen when a BMM is restarted or reimaged while the physical port is down. In this case, the LACP warning can be safely ignored if the physical port is also down. This issue is fixed in version 2503.1.
175205
176206
**Example `conditions` output for unexpected LACP state**
177207

@@ -188,37 +218,28 @@ To troubleshoot this issue:
188218
],
189219
```
190220

191-
## Degraded: `BMM power state doesn't match expected state`
192-
193-
This message in the BMM _Detailed status message_ field indicates that either:
194-
195-
- the underlying host is powered off when it should be on, or
196-
- the underlying host is powered on when it should be off.
221+
## `Degraded: port flapping`
197222

198-
This condition can happen temporarily during a normal Restart, Reimage, or similar BMM lifecycle event. However, a persistent 'unexpected power state' message can indicate an issue with the underlying compute host or baseboard management controller (BMC).
223+
This message in the BMM _Detailed status message_ field indicates that one or more of the Mellanox Ethernet ports is experiencing port flapping. Port flapping is defined as two or more changes in the physical link state within the previous 15 minutes. This behavior can indicate a cabling, switch, or hardware issue, or a network configuration issue.
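The port flapping definition above amounts to counting link state changes in a sliding window. The following Python sketch illustrates that rule only; it isn't the platform's detection logic. It assumes a list of timestamps at which the link state changed is available from some other source, and it treats the window boundaries as inclusive. The names `count_recent_changes` and `is_flapping` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Values taken from the definition above.
FLAP_WINDOW = timedelta(minutes=15)
FLAP_THRESHOLD = 2


def count_recent_changes(change_times: Iterable[datetime], now: datetime) -> int:
    """Count link state changes that fall inside the previous 15 minutes."""
    window_start = now - FLAP_WINDOW
    # Assumption: both ends of the window are inclusive.
    return sum(1 for changed_at in change_times if window_start <= changed_at <= now)


def is_flapping(change_times: Iterable[datetime], now: datetime) -> bool:
    """True when two or more changes occurred within the window."""
    return count_recent_changes(change_times, now) >= FLAP_THRESHOLD


now = datetime(2025, 3, 4, 3, 49, tzinfo=timezone.utc)
changes = [
    now - timedelta(minutes=20),  # outside the window, not counted
    now - timedelta(minutes=10),
    now - timedelta(minutes=2),
]
print(count_recent_changes(changes, now), is_flapping(changes, now))
```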
199224

200225
To troubleshoot this issue:
201226

202-
- review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
203-
- this information should identify the approximate time of the issue and any other available details
204-
- check the power feed, power cables, and physical hardware for the specified BMM
205-
- check whether any other BMMs are also reporting an unexpected degraded state, which might indicate a broader issue with the underlying infrastructure
206-
- check for any recent deployment or infrastructure changes which coincide with the time of failure
207-
- review the power state and logs on the BMC for the affected host.
208-
209-
For more information about logging into the BMC, see [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
227+
- identify the affected port and approximate time of the issue by reviewing the BMM `conditions`, as described in the [Troubleshooting](#troubleshooting) section
228+
- check the `degradedStartTime` timestamp on the `bmm` object (if different) for more context about the overall timeline
229+
- check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port
230+
- check for any other BMMs which are also reporting port flapping or link failures, for information about the scope of the issue or any common cause
231+
- check for any recent deployment or infrastructure changes which coincide with the time of failure.
210232

211-
**Example `conditions` output for unexpected power state**
233+
**Example `conditions` output for port flapping**
212234

213235
```
214-
"conditions": [
215-
{
216-
"lastTransitionTime": "2025-02-03T22:35:55Z",
217-
"message": "BareMetalMachine expected to be powered on",
218-
"reason": "BmmPoweredOnExpected",
219-
"severity": "Error",
220-
"status": "False",
221-
"type": "BmmInExpectedPowerState"
222-
},
223-
],
236+
"conditions": [
237+
{
238+
"lastTransitionTime": "2025-03-04T03:49:00Z",
239+
"message": "Port flapping in the last 15 mins: 4b_p1 (2 times)",
240+
"reason": "PortFlappingDetected",
241+
"status": "False",
242+
"type": "BmmNetworkLinksStable"
243+
},
244+
],
224245
```
