Commit 20bd1ac

Add BMM degraded troubleshooting article
1 parent c058ec3 commit 20bd1ac

File tree

3 files changed

+254
-26
lines changed

articles/operator-nexus/TOC.yml

Lines changed: 28 additions & 26 deletions
@@ -141,30 +141,30 @@
141141
- name: On-Premises Operator Nexus Instance
142142
expanded: false
143143
items:
144-
- name: Before you start Operator Nexus platform deployment
145-
href: howto-platform-prerequisites.md
146-
- name: Network Fabric
147-
href: howto-configure-network-fabric.md
148-
- name: Cluster
149-
href: howto-configure-cluster.md
150-
- name: Cluster Template JSON Example
151-
href: cluster-jsonc-example.md
152-
- name: Cluster Parameters JSON Example
153-
href: cluster-parameters-jsonc-example.md
154-
- name: Instance Readiness Testing
155-
href: howto-run-instance-readiness-testing.md
156-
- name: Cluster Upgrades
157-
href: howto-cluster-runtime-upgrade.md
158-
- name: Cluster Upgrades With PauseRack Startegy
159-
href: howto-cluster-runtime-upgrade-with-pauserack-strategy.md
160-
- name: Network Fabric Upgrades
161-
href: howto-upgrade-nexus-fabric.md
162-
- name: Credential Rotation
163-
href: howto-credential-rotation.md
164-
- name: Credential Manager Key Vault
165-
href: how-to-credential-manager-key-vault.md
166-
- name: Updating ExpressRoute Gateway Authorization Key in Azure Operator Nexus
167-
href: howto-update-expressroute-authorization-key.md
144+
- name: Before you start Operator Nexus platform deployment
145+
href: howto-platform-prerequisites.md
146+
- name: Network Fabric
147+
href: howto-configure-network-fabric.md
148+
- name: Cluster
149+
href: howto-configure-cluster.md
150+
- name: Cluster Template JSON Example
151+
href: cluster-jsonc-example.md
152+
- name: Cluster Parameters JSON Example
153+
href: cluster-parameters-jsonc-example.md
154+
- name: Instance Readiness Testing
155+
href: howto-run-instance-readiness-testing.md
156+
- name: Cluster Upgrades
157+
href: howto-cluster-runtime-upgrade.md
158+
- name: Cluster Upgrades With PauseRack Startegy
159+
href: howto-cluster-runtime-upgrade-with-pauserack-strategy.md
160+
- name: Network Fabric Upgrades
161+
href: howto-upgrade-nexus-fabric.md
162+
- name: Credential Rotation
163+
href: howto-credential-rotation.md
164+
- name: Credential Manager Key Vault
165+
href: how-to-credential-manager-key-vault.md
166+
- name: Updating ExpressRoute Gateway Authorization Key in Azure Operator Nexus
167+
href: howto-update-expressroute-authorization-key.md
168168
- name: Network Fabric
169169
expanded: false
170170
items:
@@ -213,7 +213,7 @@
213213
- name: How to upgrade os of terminal server
214214
href: howto-upgrade-os-of-terminal-server.md
215215
- name: How to restrict serial port access and set timeout on terminal-server
216-
href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
216+
href: howto-restrict-serial-port-access-and-set-timeout-on-terminal-server.md
217217
- name: Cluster
218218
expanded: false
219219
items:
@@ -333,6 +333,8 @@
333333
href: troubleshoot-bare-metal-machine-provisioning.md
334334
- name: Troubleshoot Hardware Validation Failure
335335
href: troubleshoot-hardware-validation-failure.md
336+
- name: Troubleshoot Degraded status
337+
href: troubleshoot-bmm-degraded.md
336338
- name: Troubleshoot Control Plane Quorum
337339
href: troubleshoot-control-plane-quorum.md
338340
- name: Troubleshoot Accepted Cluster Resource
@@ -429,4 +431,4 @@
429431
expanded: false
430432
items:
431433
- name: 2404.2
432-
href: release-notes-2404.2.md
434+
href: release-notes-2404.2.md
Lines changed: 224 additions & 0 deletions
@@ -0,0 +1,224 @@
1+
---
2+
title: Troubleshoot BMM Degraded issues in Azure Operator Nexus
3+
description: Troubleshooting guide for Bare Metal Machines in *degraded* status in Azure Operator Nexus.
4+
ms.service: azure-operator-nexus
5+
ms.custom: troubleshooting
6+
ms.topic: troubleshooting
7+
ms.date: 02/03/2025
8+
author: robertstarling
9+
ms.author: robstarling
10+
ms.reviewer: ekarandjeff
11+
---
12+
13+
# Troubleshoot _Degraded_ status errors on an Azure Operator Nexus cluster Bare Metal Machine
14+
15+
This document provides troubleshooting information for Bare Metal Machine (BMM) resources that report a _Degraded_ status in the BMM detailed status message.
16+
17+
## Symptoms
18+
19+
BMMs in a _Degraded_ state exhibit the following symptoms.
20+
21+
- The Detailed status message includes one or more _Degraded_ messages as shown in the following table.
22+
- A Compute BMM might be automatically cordoned if it's continuously degraded for 15 minutes or longer.
23+
- The BMM will then remain cordoned for 2 hours after the underlying conditions resolve, after which it will be automatically uncordoned.
24+
- Control and Management nodes can also be reported as _Degraded_, but aren't automatically cordoned.
25+
26+
| Detailed status message | Cordon automatically? |
27+
| -------------------------------------------------------- | --------------------- |
28+
| `Degraded: port is not functioning as expected` | Yes |
29+
| `Degraded: LACP status is down` | Yes |
30+
| `Degraded: BMM power state doesn't match expected state` | No |
31+
32+
The _Degraded_ status messages and associated automatic cordoning behavior were introduced in Azure Operator Nexus version 4.1.
33+
34+
## Troubleshooting
35+
36+
To check for any Bare Metal Machines (BMMs) which are currently degraded, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name> -o table`. This command shows the current status of all BMMs in the specified resource group, including any current _Degraded_ conditions included in the detailed status message.
37+
38+
To see the current cordoning status, including any nodes that were automatically cordoned due to _Degraded_ conditions, add a `--query` parameter that selects the `cordonStatus` field, as seen in the following example.
39+
40+
```azurecli
41+
az networkcloud baremetalmachine list -g <ResourceGroup_Name> --output table --query "[].{name:name,powerState:powerState,provisioningState:provisioningState,readyState:readyState,cordonStatus:cordonStatus,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage}"
42+
```
43+
44+
**Example Azure CLI output**
45+
46+
```
47+
Name PowerState ProvisioningState ReadyState CordonStatus DetailedStatus DetailedStatusMessage
48+
-------------- ------------ ------------------- ------------ -------------- ---------------- -----------------------------------------------------------------------------------------------------------------
49+
rack2management1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
50+
rack3management1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
51+
rack2management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
52+
rack1management1 Off Succeeded False Uncordoned Available Available to participate in the cluster.
53+
rack3management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
54+
rack1management2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
55+
rack3compute1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
56+
rack1compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
57+
rack1compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
58+
rack1compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
59+
rack1compute8 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
60+
rack2compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
61+
rack2compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
62+
rack1compute1 On Succeeded False Cordoned Provisioned The OS is provisioned to the machine. Degraded: LACP status is down
63+
rack2compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
64+
rack2compute1 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
65+
rack1compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
66+
rack3compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
67+
rack3compute5 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
68+
rack3compute8 Off Succeeded False Uncordoned Error This machine has failed hardware validation
69+
rack2compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
70+
rack3compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
71+
rack3compute3 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
72+
rack3compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
73+
rack1compute7 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
74+
rack3compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
75+
rack2compute8 On Succeeded True Cordoned Provisioned The OS is provisioned to the machine. Degraded: port is not functioning as expected
76+
rack2compute2 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
77+
rack1compute6 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
78+
rack2compute4 On Succeeded True Uncordoned Provisioned The OS is provisioned to the machine.
79+
```
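When a cluster has many BMMs, it can help to filter the list down to the degraded ones. The following Python sketch is illustrative only. It assumes the same `--query` projection was saved with `--output json` instead of `--output table`, so each entry carries the `name`, `cordonStatus`, and `detailedStatusMessage` keys. The helper name `degraded_machines` is hypothetical, and splitting on the `Degraded:` prefix is an assumption based on the example messages in this article.

```python
import json


def degraded_machines(machines):
    """Return (name, cordonStatus, reasons) for each BMM reporting a Degraded message."""
    results = []
    for bmm in machines:
        message = bmm.get("detailedStatusMessage") or ""
        # Assumption: each degraded condition appears as "Degraded: <reason>".
        reasons = [part.strip() for part in message.split("Degraded:")[1:]]
        if reasons:
            results.append((bmm["name"], bmm.get("cordonStatus"), reasons))
    return results


# Rows modeled on the example output above, in the shape --output json would produce.
sample = json.loads("""
[
  {"name": "rack1compute1", "cordonStatus": "Cordoned",
   "detailedStatusMessage": "The OS is provisioned to the machine. Degraded: LACP status is down"},
  {"name": "rack2compute7", "cordonStatus": "Uncordoned",
   "detailedStatusMessage": "The OS is provisioned to the machine."},
  {"name": "rack2compute8", "cordonStatus": "Cordoned",
   "detailedStatusMessage": "The OS is provisioned to the machine. Degraded: port is not functioning as expected"}
]
""")

for name, cordon, reasons in degraded_machines(sample):
    print(f"{name} ({cordon}): {'; '.join(reasons)}")
```

The sketch only reads fields that the `--query` expression already returns, so it doesn't need any access to the cluster beyond the original `list` command.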
80+
81+
To get more information about the cause of a degraded condition, inspect the `conditions` status of the corresponding Kubernetes BMM object, using an Azure CLI Bare Metal Machine `run-read-command` command such as the following.
82+
83+
```azurecli
84+
az networkcloud baremetalmachine run-read-command -g <ResourceGroup_Name> -n rack2management2 --limit-time-seconds 60 --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute8,-o,json]}]" --output-directory .
85+
```
86+
87+
- Replace `<ResourceGroup_Name>` with the name of the resource group containing the BMM resources.
88+
- Replace `rack2management2` with the name of a BMM resource for a healthy Kubernetes control plane node, from which to execute the `kubectl get` command.
89+
- Replace `rack2compute8` with the name of the degraded or cordoned BMM to inspect.
90+
- For more information about the `run-read-command` feature, see [BareMetal Run-Read Execution](./howto-baremetal-run-read.md).
91+
92+
Review the `lastTransitionTime` and `message` fields for more information about the corresponding degraded condition, as shown in the following example output.
93+
94+
**Example `conditions` output:**
95+
96+
```
97+
"conditions": [
98+
{
99+
"lastTransitionTime": "2025-01-30T23:54:04Z",
100+
"status": "True",
101+
"type": "BmmInExpectedLACPState"
102+
},
103+
{
104+
"lastTransitionTime": "2025-02-01T22:07:14Z",
105+
"message": "Error: Port status for interface 98_p1 is down",
106+
"reason": "Port status is down",
107+
"severity": "Error",
108+
"status": "False",
109+
"type": "BmmInExpectedPortState"
110+
},
111+
{
112+
"lastTransitionTime": "2025-01-30T23:54:04Z",
113+
"status": "True",
114+
"type": "BmmInExpectedPowerState"
115+
}
116+
],
117+
```
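To pick out the failing conditions from a longer list, a short script can help. The following Python sketch is a minimal illustration, not part of the product tooling. It assumes the `conditions` array has already been extracted from the `kubectl get` JSON output and loaded as a list of dictionaries. The helper name `failing_conditions` is hypothetical.

```python
def failing_conditions(conditions):
    """Return conditions whose status is "False", most recent transition first."""
    failing = [c for c in conditions if c.get("status") == "False"]
    # The timestamps share one ISO 8601 UTC format, so string order matches time order.
    return sorted(failing, key=lambda c: c["lastTransitionTime"], reverse=True)


# The example conditions shown above.
conditions = [
    {"lastTransitionTime": "2025-01-30T23:54:04Z", "status": "True",
     "type": "BmmInExpectedLACPState"},
    {"lastTransitionTime": "2025-02-01T22:07:14Z",
     "message": "Error: Port status for interface 98_p1 is down",
     "reason": "Port status is down", "severity": "Error",
     "status": "False", "type": "BmmInExpectedPortState"},
    {"lastTransitionTime": "2025-01-30T23:54:04Z", "status": "True",
     "type": "BmmInExpectedPowerState"},
]

for cond in failing_conditions(conditions):
    print(cond["lastTransitionTime"], cond["type"], cond.get("message", ""))
```

A condition with `status` set to `"True"` means the BMM is in the expected state for that check, so only the `"False"` entries need attention.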
118+
119+
## Automatic Cordoning
120+
121+
If an uncordoned BMM is in a _Degraded_ state for 15 minutes or more, the node might be automatically cordoned, depending on which degraded condition is present.
122+
123+
- The `cordonStatus` field in the BMM object shows the current cordoning status of the node.
124+
- Only BMMs used for Compute are automatically cordoned; Control and Management nodes aren't automatically cordoned.
125+
- An automatically cordoned node will remain cordoned for 2 hours after the underlying conditions are resolved, after which it will be automatically uncordoned.
126+
- To uncordon a BMM manually, use the `az networkcloud baremetalmachine uncordon` command or execute the 'Uncordon' action from the Azure portal.
127+
- Manually uncordoning a BMM which is still in an active degraded state has no lasting effect. The `uncordon` request executes successfully, but the node is immediately cordoned again automatically (and remains cordoned until 2 hours after the underlying conditions are resolved, as normal).
128+
129+
To investigate whether a node is currently cordoned because of a recent _Degraded_ state or for another reason:
130+
131+
- Review the `lastTransitionTime` in the `conditions` for the Kubernetes `bmm` resource, as described in the [Troubleshooting](#troubleshooting) section, to identify any recently resolved _Degraded_ conditions.
132+
- Review the Activity Logs for the BMM resource in the Azure portal to check for any user initiated cordon requests.
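
The 15-minute and 2-hour windows described in this section can be estimated from the `lastTransitionTime` values. The following Python sketch is a rough aid under stated assumptions: it treats the `lastTransitionTime` of a condition that moved to `"False"` as the start of the degraded period, and the `lastTransitionTime` of a condition that returned to `"True"` as the time the issue resolved. The platform's internal timers might not match these timestamps exactly, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CORDON_AFTER = timedelta(minutes=15)  # continuous degraded time before automatic cordon
UNCORDON_AFTER = timedelta(hours=2)   # time a node stays cordoned after conditions resolve


def parse_time(value):
    """Parse a timestamp such as 2025-02-01T22:07:14Z into an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def eligible_for_auto_cordon(degraded_since, now):
    """True if the node has been degraded for at least 15 minutes as of `now`."""
    return now - parse_time(degraded_since) >= CORDON_AFTER


def expected_uncordon_time(resolved_at):
    """Estimate when an automatically cordoned node is uncordoned again."""
    return parse_time(resolved_at) + UNCORDON_AFTER


print(expected_uncordon_time("2025-01-30T23:54:04Z").isoformat())
```

If a node is still cordoned well after the estimated uncordon time, check the Activity Logs for a user-initiated cordon request.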
133+
134+
### Degraded: `port is not functioning as expected`
135+
136+
This message in the BMM _Detailed status message_ field indicates that the physical link is down on one or more of the Mellanox interfaces on the underlying compute host. This scenario can indicate a cabling, switch port configuration, or hardware failure.
137+
138+
To troubleshoot this issue:
139+
140+
- review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
141+
- this information should identify the affected port and approximate time of the issue
142+
- check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port
143+
- check for any recent deployment or infrastructure changes which coincide with the time of failure.
144+
145+
**Example `conditions` output for unexpected port state**
146+
147+
```
148+
"conditions": [
149+
{
150+
"lastTransitionTime": "2025-02-01T22:07:14Z",
151+
"message": "Error: Port status for interface 98_p1 is down",
152+
"reason": "Port status is down",
153+
"severity": "Error",
154+
"status": "False",
155+
"type": "BmmInExpectedPortState"
156+
}
157+
],
158+
```
159+
160+
### Degraded: LACP status is down
161+
162+
This message in the BMM _Detailed status message_ field indicates a Link Aggregation Control Protocol (LACP) failure on the underlying compute host while the physical links are up. This scenario can indicate a cabling or Top Of Rack (TOR) switch configuration issue.
163+
164+
To troubleshoot this issue:
165+
166+
- review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
167+
- this information should identify the affected port and approximate time of the issue
168+
- check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port
169+
- check whether any other BMMs are also reporting port or LACP issues, which might help to identify any potential mis-cabling or wider issue with the TOR switch or network configuration
170+
- check for any recent deployment or infrastructure changes which coincide with the time of failure
171+
- for more information about diagnosing and fixing LACP issues, see [Troubleshoot LACP Bonding](./troubleshoot-lacp-bonding.md).
172+
173+
> [!WARNING]
174+
> As of version 4.1, there's a known issue where 'LACP degraded' status can be incorrectly reported at the same time as the `port is not functioning as expected` condition. This scenario can happen when a BMM is restarted or reimaged while the physical port is down. This issue will be fixed in a future release. In the meantime, the LACP degraded status can be safely ignored if the physical port is also down.
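
The known issue described in the preceding warning can be accounted for when reviewing `conditions` programmatically. The following Python sketch is illustrative only. It assumes the `conditions` array is already loaded as a list of dictionaries and uses the condition type names shown in the example outputs in this article. The helper name `effective_degraded_types` is hypothetical.

```python
def effective_degraded_types(conditions):
    """Return the failing condition types, ignoring LACP when the port is also down."""
    failing = {c["type"] for c in conditions if c.get("status") == "False"}
    # Known issue as of version 4.1: LACP can be reported as down alongside a
    # physical port failure, in which case the LACP status can be ignored.
    if "BmmInExpectedPortState" in failing:
        failing.discard("BmmInExpectedLACPState")
    return sorted(failing)


both_down = [
    {"status": "False", "type": "BmmInExpectedPortState"},
    {"status": "False", "type": "BmmInExpectedLACPState"},
    {"status": "True", "type": "BmmInExpectedPowerState"},
]
print(effective_degraded_types(both_down))
```

When only the LACP condition is failing, the sketch still reports it, because in that case the physical links are up and the LACP failure needs investigation.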
175+
176+
**Example `conditions` output for unexpected LACP state**
177+
178+
```
179+
"conditions": [
180+
{
181+
"lastTransitionTime": "2025-01-31T12:24:27Z",
182+
"message": "Error: LACP status for interface 4b_p0 is down, LACP status for interface 4b_p1 is down",
183+
"reason": "LACP status is down",
184+
"severity": "Error",
185+
"status": "False",
186+
"type": "BmmInExpectedLACPState"
187+
}
188+
],
189+
```
190+
191+
### Degraded: BMM power state doesn't match expected state
192+
193+
This message in the BMM _Detailed status message_ field indicates that either:
194+
195+
- the underlying host is powered off when it should be on, or
196+
- the underlying host is powered on when it should be off.
197+
198+
This condition can happen temporarily during a normal Restart, Reimage, or similar BMM lifecycle event. However, a persistent 'unexpected power state' message can indicate an issue with the underlying compute host or baseboard management controller (BMC).
199+
200+
To troubleshoot this issue:
201+
202+
- review the `conditions` status of the Kubernetes `bmm` object, as described in the [Troubleshooting](#troubleshooting) section
203+
- this information should identify the approximate time of the issue and any other available details
204+
- check the power cabling and physical hardware for the specified BMM
205+
- check whether any other BMMs are also reporting an unexpected degraded state, which might indicate a broader issue with the underlying infrastructure
206+
- check for any recent deployment or infrastructure changes which coincide with the time of failure
207+
- review the power state and logs on the BMC for the affected host.
208+
209+
For more information about logging into the BMC, see [Troubleshoot Hardware Validation Failure](./troubleshoot-hardware-validation-failure.md).
210+
211+
**Example `conditions` output for unexpected power state**
212+
213+
```
214+
"conditions": [
215+
{
216+
"lastTransitionTime": "2025-02-03T22:35:55Z",
217+
"message": "BareMetalMachine expected to be powered on",
218+
"reason": "BmmPoweredOnExpected",
219+
"severity": "Error",
220+
"status": "False",
221+
"type": "BmmInExpectedPowerState"
222+
}
223+
],
224+
```

articles/operator-nexus/troubleshoot-lacp-bonding.md

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,8 @@ ms.date: 11/15/2024
1313

1414
On physical host startup, the two Mellanox cards are bonded to a pair of Arista switches by the Link Aggregation Control Protocol (LACP). If LACP isn't properly negotiated between the server's cards and the switches, it can cause strange packet loss or load-balancing behavior. These errors might not be noticeable until a tenant workload attempts to pass traffic. They occur because of the hashing/load-balancing nature of LACP.
1515

16+
If there's an issue with LACP bonding on a given host, an appropriate 'degraded' message is included in the 'Detailed status message' for the corresponding Bare Metal Machine (BMM) resource. For more information, see the [Troubleshoot Degraded status](./troubleshoot-bmm-degraded.md) guide.
17+
1618
## Diagnosis
1719

1820
If LACP isn't negotiated correctly, traffic loss can occur. But traffic can pass for some flows too. This behavior can manifest itself as a virtual machine that can't get on the network, or even as object attribute memory (OAM) or storage outages.
