|
| 1 | +--- |
| 2 | +title: "Azure Operator Nexus: Networking" |
| 3 | +description: Checking LACP Bonding on Physical Hosts. |
| 4 | +author: keithritchie73 |
| 5 | +ms.author: keithritchie |
| 6 | +ms.service: azure-operator-nexus |
| 7 | +ms.custom: azure-operator-nexus |
| 8 | +ms.topic: troubleshooting |
| 9 | +ms.date: 11/15/2024 |
| 10 | +--- |
| 11 | + |
| 12 | +# Checking LACP Bonding on Physical Hosts |
| 13 | + |
| 14 | +On physical host startup, the two Mellanox cards are LACP bonded to a pair of Arista switches. If LACP isn't properly negotiated between the server's cards and the switches, the result can be unexpected packet loss or load balancing behavior. Because LACP hashes and load balances flows across the bonded links, these errors might not be noticeable until a tenant workload attempts to pass traffic. |
| 15 | + |
| 16 | +## Diagnosis |
| 17 | + |
| 18 | +If LACP isn't negotiated correctly, some flows can lose traffic while others still pass. This behavior can manifest as a VM that can't get on the network, or even as OAM or storage outages. |
| 19 | + |
| 20 | +## Checking LACP Bonding |
| 21 | + |
| 22 | +To check the LACP bonding status on a physical host, run the following command. For control plane hosts, use the file `8a_pf_bond`, because those hosts have only one Mellanox card. For worker hosts, use either `4b_pf_bond` or `98_pf_bond` to check the host's two cards. |
| 23 | + |
| 24 | +```bash |
| 25 | +cat /proc/net/bonding/8a_pf_bond |
| 26 | +``` |
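The full output is long, so it can help to print only the lines that matter for the checks in this article. The following is a minimal sketch, not part of the product tooling: the function name `show_bond_summary` and the `BOND_DIR` override are hypothetical, and the field names assume the usual Linux bonding driver output.

```bash
# Hypothetical helper: print only the key LACP fields for each bond file.
# BOND_DIR is an assumed override for testing; it defaults to /proc/net/bonding.
show_bond_summary() {
  dir=${BOND_DIR:-/proc/net/bonding}
  for bond in "$@"; do
    echo "== $bond =="
    # Keep the section markers so actor and partner port states stay distinguishable.
    grep -E 'Slave Interface|MII Status|LACP active|Aggregator ID|System MAC address|details (actor|partner) lacp pdu|port state' "$dir/$bond"
  done
}
```

On a worker host, this could be called as `show_bond_summary 4b_pf_bond 98_pf_bond`; on a control plane host, as `show_bond_summary 8a_pf_bond`.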
| 27 | + |
| 28 | +### Interpreting the results |
| 29 | + |
| 30 | +Check the following key values in the `/proc/net/bonding/` output. |
| 31 | + |
| 32 | +At the bond level (the top part of the output): |
| 33 | + |
| 34 | +1. `MII Status: up` - The entire bond is up. |
| 35 | +2. `LACP active: on` - LACP is active. |
| 36 | +3. `Aggregator ID: 1` - The top-level aggregator ID should match the aggregator ID of both slaves. See each slave port for its aggregator ID. |
| 37 | +4. `System MAC address: 42:56:86:9c:81:89` - A system MAC address is defined. If the bond isn't negotiated, this value is undefined or all zeros, for example `00:00:00:00:00:00`. |
| 38 | + |
| 39 | +For each slave port: |
| 40 | + |
| 41 | +1. `MII Status: up` - The interface is up. |
| 42 | +2. `Aggregator ID: 1` - Both slaves should have the same aggregator ID. |
| 43 | +3. `details partner lacp pdu:` `port state: 61` - The value is a bit mask that represents the LACP negotiation state on that port. Generally, 61 and 63 are the expected values. See: <https://movingpackets.net/2017/10/17/decoding-lacp-port-state/> |
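
To see why 61 and 63 are the expected values, the port state bit mask can be decoded into its individual flags. The sketch below assumes the standard LACP state bit order (bit 0 Activity, bit 1 Timeout, bit 2 Aggregation, bit 3 Synchronization, bit 4 Collecting, bit 5 Distributing, bit 6 Defaulted, bit 7 Expired). The function name `decode_lacp_state` is hypothetical.

```bash
# Hypothetical helper: decode an LACP "port state" value into flag names.
# Flags are listed in bit order, starting at bit 0.
decode_lacp_state() {
  state=$1
  bit=0
  out=""
  for name in Activity Timeout Aggregation Synchronization Collecting Distributing Defaulted Expired; do
    # Append the flag name when the corresponding bit is set.
    if [ $(( (state >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out }$name"
    fi
    bit=$((bit + 1))
  done
  echo "$out"
}

decode_lacp_state 61   # Activity Aggregation Synchronization Collecting Distributing
decode_lacp_state 63   # Activity Timeout Aggregation Synchronization Collecting Distributing
```

Both values have the Synchronization, Collecting, and Distributing bits set. The only difference is the Timeout bit, which is set in 63 and clear in 61.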
| 44 | +>[!NOTE] |
| 45 | +> This article contains references to the term *slave*, a term that Microsoft no longer uses. When the term is removed from the software, we’ll remove it from this article. |
| 46 | +
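The bond-level and per-port checks listed in this section can be sketched as a small script. This is an illustration under stated assumptions, not a supported tool: the function name `check_bond` is hypothetical, the parsing assumes the usual Linux bonding driver layout, and only an all-zero system MAC address is detected. The function reads the bonding output on standard input and prints one line per problem; no output means every check passed.

```bash
# Hypothetical helper: flag the problems described above in bonding output.
check_bond() {
  awk '
    # A new slave section starts; later fields belong to this port.
    /^Slave Interface:/ { slave = $3; in_partner = 0; next }
    /^MII Status:/ {
      who = (slave == "") ? "bond" : slave
      if ($3 != "up") print who ": MII Status is " $3
    }
    /^LACP active:/ { if ($3 != "on") print "bond: LACP active is " $3 }
    /^System MAC address:/ {
      if ($4 == "00:00:00:00:00:00") print "bond: System MAC address is all zeros"
    }
    /Aggregator ID:/ {
      if (slave == "") bond_agg = $3
      else if ($3 != bond_agg)
        print slave ": Aggregator ID " $3 " differs from bond Aggregator ID " bond_agg
    }
    # Track which LACP PDU details block we are in.
    /details actor lacp pdu/   { in_partner = 0; next }
    /details partner lacp pdu/ { in_partner = 1; next }
    in_partner && /port state:/ {
      if ($3 != 61 && $3 != 63) print slave ": partner port state is " $3
      in_partner = 0
    }
  '
}
```

For example, `check_bond < /proc/net/bonding/8a_pf_bond` checks the bond on a control plane host.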
|
| 47 | +### Fixing the issue |
| 48 | + |
| 49 | +The most common causes of these LACP issues are host/switch miswiring or a mismatched LACP/MLAG configuration on the Arista switches. Start by tracing the cabling and repairing any wiring issues. If the wiring is correct, check whether the switch LACP/MLAG configuration is incorrect. |
| 50 | + |
| 51 | +## Further information |
| 52 | + |
| 53 | +If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade). |
| 54 | +For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/). |