|`Degraded: port is not functioning as expected`| Yes |[Degraded: `port is not functioning as expected`](#degraded-port-is-not-functioning-as-expected)|
29
-
|`Degraded: LACP status is down`| Yes |[Degraded: `LACP status is down`](#degraded-lacp-status-is-down)|
30
-
|`Degraded: BMM power state doesn't match expected state`|No |[Degraded: `BMM power state doesn't match expected state`](#degraded-bmm-power-state-doesnt-match-expected-state)|
26
+
| Detailed status message |Details and mitigation |
|`Degraded: port down`|[`Degraded: port down`](#degraded-port-down)|
29
+
|`Degraded: port flapping`|[`Degraded: port flapping`](#degraded-port-flapping)|
30
+
|`Degraded: LACP status is down`|[`Degraded: LACP status is down`](#degraded-lacp-status-is-down)|
31
31
32
32
_Degraded_ status messages and associated automatic cordoning behavior are present in Azure Operator Nexus version 2502.1 and higher.
33
33
@@ -59,7 +59,7 @@ rack1compute03 On Succeeded True Uncordoned
59
59
rack1compute08 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
60
60
rack2compute05 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
61
61
rack2compute03 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
62
-
rack1compute01 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: LACP status is down
62
+
rack1compute01 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: port down
63
63
rack2compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
64
64
rack2compute01 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
65
65
rack1compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
@@ -72,13 +72,29 @@ rack3compute03 On Succeeded True Uncordoned
72
72
rack3compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
73
73
rack1compute07 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
74
74
rack3compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
75
-
rack2compute08 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port is not functioning as expected
75
+
rack2compute08 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port flapping Degraded: port down
76
76
rack2compute02 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
77
77
rack1compute06 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
78
78
rack2compute04 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
79
79
```
80
80
81
-
For more information, use an Azure CLI Bare Metal Machine `run-read-command` command such as the following to inspect the `conditions` status of the corresponding kubernetes BMM object.
81
+
Additional information about recent degraded conditions and automatic cordoning is available in the following fields on the `bmm` Kubernetes resource.
82
+
83
+
- `degradedStartTime` and `degradedEndTime` show the start and end time of the most recent _degraded_ state
84
+
- `conditions` shows the status of any individual conditions which are contributing to a _degraded_ state
85
+
- `cordonStatus` indicates whether the node is currently cordoned or uncordoned
86
+
- `annotations` shows which conditions triggered the current cordon, if automatically cordoned.
87
+
  - `platform.afo-nc.microsoft.com/lacp-down-cordon`
88
+
  - `platform.afo-nc.microsoft.com/port-down-cordon`
89
+
  - `platform.afo-nc.microsoft.com/port-flap-cordon`
90
+
- If the user manually cordoned the BMM, the following annotation is also present.
91
+
  - `platform.afo-nc.microsoft.com/cutomer-cordon`
92
+
- The Activity Logs for the BMM resource in the Azure portal can also provide more information about any recent user initiated cordon requests.
93
+
94
+
- The `annotations` metadata on the `bmm` Kubernetes resource shows which condition triggered the cordon.
95
+
- The `conditions` status on the `bmm` Kubernetes object shows the current status and timestamp for any triggering conditions.
96
+
97
+
To view these `bmm` Kubernetes resource fields, use an Azure CLI `run-read-command` command as shown in the following example.
- Replace `<ResourceGroup_Name>` with the name of the resource group containing the BMM resources.
88
104
- Replace `rack2management2` with the name of a BMM resource for a healthy Kubernetes control plane node, from which to execute the `kubectl get` command.
89
105
- Replace `rack2compute08` with the name of the degraded or cordoned BMM to inspect.
90
-
- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
91
106
92
-
Review the `lastTransitionTime` and `message` fields for more information about the corresponding degraded condition, as shown in the following example output.
107
+
For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
108
+
109
+
**Example `run-read-command` output (`kubectl get bmm`):**
93
110
94
-
**Example `conditions` output:**
111
+
This example shows an automatically cordoned BMM with two active _Degraded_ conditions.
95
112
96
113
```
97
-
"conditions": [
98
-
{
99
-
"lastTransitionTime": "2025-01-30T23:54:04Z",
100
-
"status": "True",
101
-
"type": "BmmInExpectedLACPState"
102
-
},
103
-
{
104
-
"lastTransitionTime": "2025-02-01T22:07:14Z",
105
-
"message": "Error: Port status for interface 98_p1 is down",
"message": "Port flapping in the last 15 mins: 4b_p1 (2 times)",
138
+
"reason": "PortFlappingDetected",
139
+
"status": "False",
140
+
"type": "BmmNetworkLinksStable"
141
+
}
142
+
],
143
+
"cordonStatus": "Cordoned",
144
+
"degradedStartTime": "2025-03-04T03:27:00Z",
145
+
"detailedStatus": "Provisioned",
146
+
"detailedStatusMessage": "The OS is provisioned to the machine. Degraded: port flapping Degraded: port down",
147
+
}
148
+
}
117
149
```
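When several BMMs need to be checked, it can help to extract only the failing conditions from the JSON. The following Python sketch is illustrative only. It assumes that the fields shown in the example sit under a top-level `status` key, and that a condition with `"status": "False"` is one that contributes to the _Degraded_ state. The helper name `failing_conditions` and the sample document are hypothetical.

```python
import json

# Hypothetical sample shaped like the example output above. The nesting under
# "status" is an assumption, because only part of the object is shown.
SAMPLE = """
{
  "status": {
    "conditions": [
      {
        "lastTransitionTime": "2025-03-04T03:27:00Z",
        "message": "Physical link(s) down: 4b_p1",
        "reason": "PortDown",
        "status": "False",
        "type": "BmmNetworkLinksUp"
      },
      {
        "lastTransitionTime": "2025-03-04T03:49:00Z",
        "message": "Port flapping in the last 15 mins: 4b_p1 (2 times)",
        "reason": "PortFlappingDetected",
        "status": "False",
        "type": "BmmNetworkLinksStable"
      }
    ],
    "cordonStatus": "Cordoned",
    "degradedStartTime": "2025-03-04T03:27:00Z"
  }
}
"""


def failing_conditions(bmm: dict) -> list:
    """Return the conditions whose status is "False" (assumed to mean failing)."""
    conditions = bmm.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("status") == "False"]


def summarize(bmm: dict) -> list:
    """Build one summary line per failing condition, preceded by the cordon state."""
    status = bmm.get("status", {})
    lines = [
        "cordonStatus={} degradedStartTime={}".format(
            status.get("cordonStatus"), status.get("degradedStartTime")
        )
    ]
    for cond in failing_conditions(bmm):
        lines.append(
            "{} {} {}: {}".format(
                cond.get("lastTransitionTime"),
                cond.get("type"),
                cond.get("reason"),
                cond.get("message"),
            )
        )
    return lines


if __name__ == "__main__":
    for line in summarize(json.loads(SAMPLE)):
        print(line)
```

In practice, the JSON would come from the `run-read-command` output rather than from an embedded string. Adjust the key path if the real object is nested differently.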
118
150
119
151
## Automatic Cordoning
120
152
121
-
If an uncordoned BMM is in a _Degraded_ state for 15 minutes or more, the node might be automatically cordoned, depending on which degraded conditions are present.
153
+
If an uncordoned Compute BMM remains in a _Degraded_ state for more than 15 minutes, the node is automatically cordoned.
122
154
123
-
- The `cordonStatus` field in the BMM object shows the current state of the node.
124
-
- Only BMMs used for Compute are automatically cordoned. Control and Management nodes aren't automatically cordoned.
125
155
- An automatically cordoned node will remain cordoned for 2 hours after the underlying conditions are resolved, after which it will be automatically uncordoned.
126
156
- To uncordon a BMM manually, use the `az networkcloud baremetalmachine uncordon` command or execute the _Uncordon_ action from the Azure portal.
127
-
- Manually uncordoning a BMM which still has a degraded condition has no effect. The _Uncordon_ request will execute successfully, but the node will immediately be automatically cordoned again until 2 hours after the underlying conditions are resolved.
157
+
- Manually uncordoning a BMM which still has an active degraded condition isn't allowed. In this case, the _Uncordon_ request is rejected with an error message similar to the following.
128
158
129
-
To investigate whether a currently cordoned BMM is due to a recent _Degraded_ state:
159
+
`action rejected: baremetalmachine 'rack1compute01' currently degraded since 2025-02-26 05:26:09 +0000 UTC`
130
160
131
-
- Review the `lastTransitionTime` in the `conditions` for the kubernetes `bmm` resource, as described in the [Troubleshooting](#troubleshooting) section, to identify any recently resolved _Degraded_ conditions.
132
-
- Review the Activity Logs for the BMM resource in the Azure portal to check for any user initiated cordon requests.
161
+
Note: only BMMs used for _Compute_ are automatically cordoned. Control and Management nodes aren't automatically cordoned.
133
162
134
-
## Degraded: `port is not functioning as expected`
163
+
For more information about investigating the root cause of an automatic cordon, see [Troubleshooting](#troubleshooting).
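The timing rules in this section can be modeled with a short Python sketch. This model is a simplified illustration, not the platform's implementation: it covers only the 15-minute and 2-hour thresholds described above, ignores manual cordon requests, and uses hypothetical names (`expected_auto_cordon`, `DEGRADED_GRACE`, `UNCORDON_DELAY`).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Thresholds taken from the text above; the constant names are illustrative.
DEGRADED_GRACE = timedelta(minutes=15)
UNCORDON_DELAY = timedelta(hours=2)


def expected_auto_cordon(
    now: datetime,
    degraded_start: Optional[datetime],
    degraded_end: Optional[datetime] = None,
    is_compute: bool = True,
) -> bool:
    """Estimate whether a BMM would currently be automatically cordoned."""
    # Control and Management nodes aren't automatically cordoned.
    if not is_compute or degraded_start is None:
        return False
    if degraded_end is None:
        # Still degraded: cordoned once the state has lasted more than 15 minutes.
        return now - degraded_start > DEGRADED_GRACE
    # Resolved: the node was cordoned only if the degraded state outlasted the
    # grace period, and it stays cordoned until 2 hours after resolution.
    was_cordoned = degraded_end - degraded_start > DEGRADED_GRACE
    return was_cordoned and now < degraded_end + UNCORDON_DELAY


if __name__ == "__main__":
    start = datetime(2025, 3, 4, 3, 27, tzinfo=timezone.utc)
    print(expected_auto_cordon(start + timedelta(minutes=20), start))
```

Comparing this estimate against `degradedStartTime`, `degradedEndTime`, and `cordonStatus` on the `bmm` resource can help distinguish an automatic cordon from a user-initiated one.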
164
+
165
+
## `Degraded: port down`
135
166
136
167
This message in the BMM _Detailed status message_ field indicates that the physical link is down on one or more of the Mellanox interfaces on the underlying compute host. This scenario can indicate a cabling, switch port configuration, or hardware failure.
137
168
@@ -142,22 +173,21 @@ To troubleshoot this issue:
142
173
- check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port
143
174
- check for any recent deployment or infrastructure changes which coincide with the time of failure.
144
175
145
-
**Example `conditions` output for unexpected port state**
176
+
**Example `conditions` output for port down**
146
177
147
178
```
148
179
"conditions": [
149
180
{
150
-
"lastTransitionTime": "2025-02-01T22:07:14Z",
151
-
"message": "Error: Port status for interface 98_p1 is down",
152
-
"reason": "Port status is down",
153
-
"severity": "Error",
181
+
"lastTransitionTime": "2025-03-04T03:27:00Z",
182
+
"message": "Physical link(s) down: 4b_p1",
183
+
"reason": "PortDown",
154
184
"status": "False",
155
-
"type": "BmmInExpectedPortState"
156
-
}
185
+
"type": "BmmNetworkLinksUp"
186
+
},
157
187
],
158
188
```
159
189
160
-
## Degraded: `LACP status is down`
190
+
## `Degraded: LACP status is down`
161
191
162
192
This message in the BMM _Detailed status message_ field indicates a Link Aggregation Control Protocol (LACP) failure on the underlying compute host while the physical links are up. This scenario can indicate a cabling or Top Of Rack (TOR) switch configuration issue.
163
193
@@ -171,7 +201,7 @@ To troubleshoot this issue:
171
201
- for more information about diagnosing and fixing LACP issues, see [Troubleshoot LACP Bonding](./troubleshoot-lacp-bonding.md).
172
202
173
203
> [!WARNING]
174
-
> As of version 2502.1, there's a known issue where `LACP status is down` can be incorrectly reported in addition to the`port is not functioning as expected` message during a port down scenario. This issue can happen when a BMM is restarted or reimaged while the physical port is down. This issue will be fixed in a future release. In the meantime, the `LACP status is down`warning can be safely ignored if the physical port is also down.
204
+
> In version 2502.1, there's a known issue where `LACP status is down` can be incorrectly reported in addition to a `port is not functioning as expected` message during a port down scenario. This issue can happen when a BMM is restarted or reimaged while the physical port is down. In this case, the LACP warning can be safely ignored if the physical port is also down. This issue is fixed in version 2503.1.
175
205
176
206
**Example `conditions` output for unexpected LACP state**
177
207
@@ -188,37 +218,28 @@ To troubleshoot this issue:
188
218
],
189
219
```
190
220
191
-
## Degraded: `BMM power state doesn't match expected state`
192
-
193
-
This message in the BMM _Detailed status message_ field indicates that either:
194
-
195
-
- the underlying host is powered off when it should be on, or
196
-
- the underlying host is powered on when it should be off.
221
+
## `Degraded: port flapping`
197
222
198
-
This condition can happen temporarily during a normal Restart, Reimage, or similar BMM lifecycle event. However, a persistent 'unexpected power state' message can indicate an issue with the underlying compute host or baseboard management controller (BMC).
223
+
This message in the BMM _Detailed status message_ field indicates that one or more of the Mellanox Ethernet ports is flapping. Port flapping is defined as two or more changes in the physical link state within the previous 15 minutes. This behavior can indicate a cabling, switch, or hardware issue, or a network configuration issue.
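To make the definition concrete, the following Python sketch counts link state changes inside a 15-minute window. It illustrates the stated rule only and isn't how the platform detects flapping. The sample format (a time-ordered list of `(timestamp, is_up)` pairs) and the names `count_transitions` and `is_flapping` are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the definition above; the constant names are illustrative.
WINDOW = timedelta(minutes=15)
FLAP_THRESHOLD = 2


def count_transitions(samples, now):
    """Count link state changes whose timestamp falls within the last 15 minutes.

    samples: list of (timestamp, is_up) tuples, sorted by timestamp.
    """
    changes = 0
    for (_, prev_up), (ts, cur_up) in zip(samples, samples[1:]):
        if cur_up != prev_up and now - WINDOW <= ts <= now:
            changes += 1
    return changes


def is_flapping(samples, now):
    """Return True if two or more state changes occurred within the window."""
    return count_transitions(samples, now) >= FLAP_THRESHOLD


if __name__ == "__main__":
    t0 = datetime(2025, 3, 4, 3, 40, tzinfo=timezone.utc)
    link_samples = [
        (t0, True),
        (t0 + timedelta(minutes=1), False),  # link goes down
        (t0 + timedelta(minutes=2), True),   # link comes back up
    ]
    print(is_flapping(link_samples, t0 + timedelta(minutes=5)))
```

A single down-then-up cycle already counts as two changes, which matches the `(2 times)` wording in the example condition message.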
199
224
200
225
To troubleshoot this issue:
201
226
202
-
- review the `conditions` status of the kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
203
-
- this information should identify the approximate time of the issue and any other available details
204
-
- check the power feed, power cables, and physical hardware for the specified BMM
205
-
- check whether any other BMMs are also reporting an unexpected degraded state, which might indicate a broader issue with the underlying infrastructure
206
-
- check for any recent deployment or infrastructure changes which coincide with the time of failure
207
-
- review the power state and logs on the BMC for the affected host.
208
-
209
-
For more information about logging into the BMC, see [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
227
+
- identify the affected port and approximate time of the issue by reviewing the BMM `conditions`, as described in the [Troubleshooting](#troubleshooting) section
228
+
- check the `degradedStartTime` timestamp on the `bmm` object, if it differs from the condition's `lastTransitionTime`, for more context about the overall timeline
229
+
- check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port
230
+
- check for any other BMMs which are also reporting port flapping or link failures, for information about the scope of the issue or any common cause
231
+
- check for any recent deployment or infrastructure changes which coincide with the time of failure.
210
232
211
-
**Example `conditions` output for unexpected power state**
233
+
**Example `conditions` output for port flapping**
212
234
213
235
```
214
-
"conditions": [
215
-
{
216
-
"lastTransitionTime": "2025-02-03T22:35:55Z",
217
-
"message": "BareMetalMachine expected to be powered on",
218
-
"reason": "BmmPoweredOnExpected",
219
-
"severity": "Error",
220
-
"status": "False",
221
-
"type": "BmmInExpectedPowerState"
222
-
},
223
-
],
236
+
"conditions": [
237
+
{
238
+
"lastTransitionTime": "2025-03-04T03:49:00Z",
239
+
"message": "Port flapping in the last 15 mins: 4b_p1 (2 times)",