Commit cb4fc98

Create troubleshoot-lacp-bonding.md
Create troubleshoot-lacp-bonding
1 parent a4c42b4 commit cb4fc98

1 file changed

+54
-0
lines changed

1+
---
2+
title: "Azure Operator Nexus: Networking"
3+
description: Checking LACP Bonding on Physical Hosts.
4+
author: keithritchie73
5+
ms.author: keithritchie
6+
ms.service: azure-operator-nexus
7+
ms.custom: azure-operator-nexus
8+
ms.topic: troubleshooting
9+
ms.date: 11/15/2024
10+
---
11+
12+
# Checking LACP Bonding on Physical Hosts
13+
14+
On physical host startup, the two Mellanox cards are LACP bonded to a pair of Arista switches. If LACP isn't properly negotiated between the server's cards and the switches, it can cause unexpected packet loss or load-balancing behavior. Because LACP distributes traffic across links by hashing, these errors might not be noticeable until a tenant workload attempts to pass traffic.
15+
16+
## Diagnosis
17+
18+
If LACP isn't negotiated correctly, traffic loss can occur, although some flows can still pass. This behavior can manifest as a VM that can't get on the network, or even as OAM or storage outages.
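To illustrate why only some flows are affected, here is a toy sketch. It assumes a two-link bond in which link 1 failed negotiation, and it picks a link from the low bit of the source and destination ports. The function names `pick_link` and `flow_result` are hypothetical, and the real hash used by the bond and the switches depends on their configured policy, so treat this as an illustration of the effect only.

```bash
# Toy model only: choose a member link (0 or 1) from two port numbers.
# Real bonds and switches hash on configurable header fields.
pick_link() {
  echo $(( ($1 ^ $2) % 2 ))
}

# Assumption for this sketch: link 1 is the member with the failed negotiation.
flow_result() {
  if [ "$(pick_link "$1" "$2")" -eq 1 ]; then
    echo "lost"
  else
    echo "ok"
  fi
}

# Four flows to the same destination port land on different links.
for src_port in 40000 40001 40002 40003; do
  printf 'flow %s -> 443: %s\n' "$src_port" "$(flow_result "$src_port" 443)"
done
```

In this model, half of the flows land on the broken link and are lost while the rest pass, which matches the intermittent symptoms described here.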
19+
20+
## Checking LACP Bonding
21+
22+
To check the LACP bonding status on a physical host, run the following command. For control plane hosts, use the file `8a_pf_bond`, because those hosts have only one Mellanox card. For worker hosts, use either `4b_pf_bond` or `98_pf_bond`, depending on which of the two cards you want to check.
23+
24+
```bash
25+
cat /proc/net/bonding/8a_pf_bond
26+
```
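If you aren't sure which bond files exist on a given host, a small loop can print whichever of the three files named in this article are present. This is a minimal sketch: the function name `show_pf_bonds` is hypothetical, and the optional directory argument exists only so that the function can be tried against saved copies of the files.

```bash
# Print every bond status file named in this article that exists on the host.
# Optional argument: directory to search (defaults to /proc/net/bonding).
show_pf_bonds() {
  local bond_dir="${1:-/proc/net/bonding}"
  local bond
  for bond in 8a_pf_bond 4b_pf_bond 98_pf_bond; do
    if [ -r "${bond_dir}/${bond}" ]; then
      printf '===== %s =====\n' "$bond"
      cat "${bond_dir}/${bond}"
    fi
  done
}

show_pf_bonds
```

On a control plane host, only the `8a_pf_bond` section is expected to appear. On a worker host, the `4b_pf_bond` and `98_pf_bond` sections are expected.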
27+
28+
### Interpreting the results
29+
30+
Key validations to check in the /proc/net/bonding/ output are:
31+
32+
At the bond level (the top part of the output):
33+
34+
1. `MII Status: up` - Confirms that the bond as a whole is up.
35+
2. `LACP active: on` - Confirms that LACP is active.
36+
3. `Aggregator ID: 1` - The bond-level aggregator ID should match the aggregator ID of both slaves. Check each slave port for its aggregator ID.
37+
4. `System MAC address: 42:56:86:9c:81:89` - Confirms that a system MAC address is defined. If the bond isn't negotiated, this value is undefined or all zeros, for example `00:00:00:00:00:00`.
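The bond-level checks can be pulled out with `awk` so that they are easier to compare at a glance. This is a sketch that assumes the field labels appear exactly as written in this article. The function names `bond_level_fields` and `aggregator_ids` and the `BOND_FILE` variable are hypothetical.

```bash
# Print the bond-level fields, stopping at the first per-port section.
bond_level_fields() {
  awk '
    /^Slave Interface:/    { exit }
    /^MII Status:/         { print "mii_status=" $3 }
    /^LACP active:/        { print "lacp_active=" $3 }
    /^System MAC address:/ { print "system_mac=" $4 }
    /Aggregator ID:/       { print "aggregator_id=" $3 }
  ' "$1"
}

# Print "<owner> <aggregator id>" for the bond and for each port,
# so that mismatched aggregator IDs stand out.
aggregator_ids() {
  awk '
    /^Slave Interface:/ { port = $3 }
    /Aggregator ID:/    { owner = (port == "") ? "bond" : port; print owner, $3 }
  ' "$1"
}

# Assumption: BOND_FILE may be set to pick a different bond file.
bond_file="${BOND_FILE:-/proc/net/bonding/8a_pf_bond}"
if [ -r "$bond_file" ]; then
  bond_level_fields "$bond_file"
  aggregator_ids "$bond_file"
fi
```

If any line printed by `aggregator_ids` shows a different ID from the `bond` line, that port didn't join the active aggregator.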
38+
39+
For each slave port:
40+
41+
1. `MII Status: up` - Confirms that the interface is up.
42+
2. `Aggregator ID: 1` - Both slaves should have the same aggregator ID.
43+
3. `details partner lacp pdu: port state 61` - The value is a bitmask that represents the LACP negotiation state that the partner (the switch port) reports for that link. Values of 61 and 63 generally indicate a healthy negotiation. See: <https://movingpackets.net/2017/10/17/decoding-lacp-port-state/>
44+
>[!NOTE]
45+
> This article contains references to the term *slave*, a term that Microsoft no longer uses. When the term is removed from the software, we’ll remove it from this article.
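
The port state bitmask can be decoded with a short shell function. This sketch uses the bit order of the LACP state octet defined in IEEE 802.1AX, from the lowest bit to the highest: activity, timeout, aggregation, synchronization, collecting, distributing, defaulted, expired. The function names `decode_lacp_port_state` and `partner_port_states` and the `BOND_FILE` variable are hypothetical. The parsing assumes that each `port state` line follows its `details partner lacp pdu` heading.

```bash
# Decode a decimal LACP port state value into the names of the bits that are set.
decode_lacp_port_state() {
  local state="$1" flags="" bit=1 name
  for name in activity short_timeout aggregation synchronization \
              collecting distributing defaulted expired; do
    if [ $(( state & bit )) -ne 0 ]; then
      flags="${flags}${flags:+ }${name}"
    fi
    bit=$(( bit * 2 ))
  done
  printf '%s\n' "$flags"
}

# Print "<port> <partner port state>" for each port in a bond status file.
partner_port_states() {
  awk '
    /^Slave Interface:/          { port = $3 }
    /^details actor lacp pdu:/   { in_partner = 0 }
    /^details partner lacp pdu:/ { in_partner = 1; next }
    in_partner && /port state:/  { print port, $3; in_partner = 0 }
  ' "$1"
}

# Assumption: BOND_FILE may be set to pick a different bond file.
bond_file="${BOND_FILE:-/proc/net/bonding/8a_pf_bond}"
if [ -r "$bond_file" ]; then
  partner_port_states "$bond_file" | while read -r port state; do
    printf '%s: %s (%s)\n' "$port" "$state" "$(decode_lacp_port_state "$state")"
  done
fi
```

A value of 61 decodes to activity, aggregation, synchronization, collecting, and distributing. A value of 63 sets the same bits plus the short timeout bit. A state with the `defaulted` or `expired` bit set, or one that lacks `synchronization`, points to a negotiation problem on that port.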
46+
47+
### Fixing the issue
48+
49+
The most common causes for these LACP issues are host/switch miswiring or mismatched LACP/MLAG configuration on the Arista switches. Investigate the situation by tracing out and repairing any wiring issues. If the wiring is correct, then determine if the switch LACP/MLAG configuration is incorrect.
50+
51+
## Further information
52+
53+
If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
54+
For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
