Commit c891dab

Merge pull request #300192 from robertstarling/robstarling/2505_ado2095777_nic_failed_degraded_message
ADO 2184888: add NIC Failed degraded message
2 parents 9f7061b + 674b2be commit c891dab

1 file changed: +32 -3 lines changed


articles/operator-nexus/troubleshoot-bare-metal-machine-degraded.md

Lines changed: 32 additions & 3 deletions
@@ -4,7 +4,7 @@ description: Troubleshooting guide for Bare Metal Machines in 'Degraded' status
 ms.service: azure-operator-nexus
 ms.custom: azure-operator-nexus
 ms.topic: troubleshooting
-ms.date: 03/03/2025
+ms.date: 05/21/2025
 author: robertstarling
 ms.author: robstarling
 ms.reviewer: ekarandjeff
@@ -25,9 +25,10 @@ Bare Metal Machines (BMM) which are in _Degraded_ state exhibit the following sy

 | Detailed status message | Details and mitigation |
 | ------------------------------- | ---------------------------------------------------------------- |
+| `Degraded: NIC failed` | [`Degraded: NIC failed`](#degraded-nic-failed) |
 | `Degraded: port down` | [`Degraded: port down`](#degraded-port-down) |
-| `Degraded: port flapping` | [`Degraded: port flapping`](#degraded-port-flapping) |
 | `Degraded: LACP status is down` | [`Degraded: LACP status is down`](#degraded-lacp-status-is-down) |
+| `Degraded: port flapping` | [`Degraded: port flapping`](#degraded-port-flapping) |

 _Degraded_ status messages and associated automatic cordoning behavior are present in Azure Operator Nexus version 2502.1 and higher.

@@ -131,7 +132,7 @@ This example shows an automatically cordoned BMM with two active _Degraded_ cond
 "cordonStatus": "Cordoned",
 "degradedStartTime": "2025-03-04T03:27:00Z",
 "detailedStatus": "Provisioned",
-"detailedStatusMessage": "The OS is provisioned to the machine. Degraded: port flapping Degraded: port down",
+"detailedStatusMessage": "The OS is provisioned to the machine. Degraded: port flapping Degraded: port down"
 }
 }
 ```
@@ -150,6 +151,34 @@ Note: only BMMs used for _Compute_ are automatically cordoned. Control and Manag

 For more information about investigating the root cause of an automatic cordon, see [Troubleshooting](#troubleshooting).

+## `Degraded: NIC failed`
+
+This message indicates that one of the expected Mellanox Network Interface Cards (NICs) on the underlying compute host has failed or is missing.
+It typically points to a hardware failure on the NIC, or to a card that isn't correctly seated in the host.
+
+To troubleshoot this issue:
+
+- Check the Ethernet link status indicators on the underlying compute host to identify the nonoperational NIC.
+- Check that the NIC is correctly installed and seated.
+- Sign in to the Baseboard Management Controller (BMC) to check the hardware status of the NIC.
+- Review detailed hardware logs by generating a Dell Technical Support Report (TSR), as described in the Dell Knowledge Base article [Export a SupportAssist Collection Using an iDRAC](https://www.dell.com/support/kbdoc/en-us/000126308/export-a-supportassist-collection-via-idrac9).
+- Review the most recent time of failure reported by the Bare Metal Machine `conditions`, as described in the [Troubleshooting](#troubleshooting) section.
+- Power cycle the host by executing a "Restart" action on the Bare Metal Machine resource, then check whether the condition clears.
+
+**Example `conditions` output for NIC failed**
+
+```json
+"conditions": [
+  {
+    "lastTransitionTime": "2025-05-21T16:49:29Z",
+    "message": "Expected 2 devices in oam-bond, found 1: 98_pf0vf0_vf",
+    "reason": "OamDevicesUnhealthy",
+    "status": "False",
+    "type": "BmmNicsHealthy"
+  }
+]
+```
+
 ## `Degraded: port down`

 This message in the BMM _Detailed status message_ field indicates that the physical link is down on one or more of the Mellanox interfaces on the underlying compute host.
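
Reviewer note: the `conditions` example added for the NIC failed section lends itself to a quick scripted check. The following is a minimal sketch, not part of the committed article, that filters a parsed `conditions` list for entries whose `status` is `"False"` and orders them by `lastTransitionTime`, most recent first. The field names come from the example in this change. The helper names `parse_time` and `unhealthy_conditions` are hypothetical, and treating `"False"` as the failure marker is an assumption based on that single example.

```python
import json
from datetime import datetime

# Sample data copied from the NIC failed example in this change,
# wrapped in braces so that it parses as a standalone JSON document.
SAMPLE = """
{
  "conditions": [
    {
      "lastTransitionTime": "2025-05-21T16:49:29Z",
      "message": "Expected 2 devices in oam-bond, found 1: 98_pf0vf0_vf",
      "reason": "OamDevicesUnhealthy",
      "status": "False",
      "type": "BmmNicsHealthy"
    }
  ]
}
"""


def parse_time(value):
    """Parse a UTC timestamp such as 2025-05-21T16:49:29Z."""
    # Replace the trailing "Z" so that datetime.fromisoformat accepts it
    # on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def unhealthy_conditions(conditions):
    """Return conditions with status "False", most recent transition first.

    Assumption: status "False" on a health condition marks a failure,
    as in the BmmNicsHealthy example.
    """
    failed = [c for c in conditions if c.get("status") == "False"]
    return sorted(
        failed,
        key=lambda c: parse_time(c["lastTransitionTime"]),
        reverse=True,
    )


if __name__ == "__main__":
    for cond in unhealthy_conditions(json.loads(SAMPLE)["conditions"]):
        print(
            f'{cond["lastTransitionTime"]}  {cond["type"]}  '
            f'{cond["reason"]}: {cond["message"]}'
        )
```

Sorting by transition time puts the most recent failure first, which matches the troubleshooting step that asks for the most recent time of failure.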
