
Commit b68c13e

Merge pull request #291719 from bearzz23/troubleshoot-packet-loss
TSG for packet loss
2 parents f63b75a + 01124ff

File tree

2 files changed: +102 -0 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -330,6 +330,8 @@
       href: troubleshoot-memory-limits.md
     - name: Troubleshoot LACP Bonding
       href: troubleshoot-lacp-bonding.md
+    - name: Troubleshoot NAKS Cluster Node Packet Loss
+      href: troubleshoot-packet-loss.md
     - name: Tenant Workload
       expanded: false
       items:
articles/operator-nexus/troubleshoot-packet-loss.md

Lines changed: 100 additions & 0 deletions

@@ -0,0 +1,100 @@
---
title: Troubleshoot packet loss between NAKS worker nodes for Azure Operator Nexus
description: Troubleshoot packet loss between NAKS worker nodes, and learn how to debug the issue.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 12/10/2024
ms.author: yinongdai
author: bearzz23
---
# Troubleshoot packet loss between NAKS worker nodes for Azure Operator Nexus

This guide provides detailed steps for troubleshooting packet loss between NAKS worker nodes.

## Prerequisites

* Command-line access to the Nexus Kubernetes cluster.
* The necessary permissions to make changes to Nexus Kubernetes cluster objects.

## Symptoms

Network diagnostic tools, such as iperf, report a high percentage of lost packets during data transfer tests. Detailed logs from networking tools show an abnormal number of dropped or lost packets.

Sample output:

```console
iperf3 -c <server-ip> -u -b 100M -l 1500
Connecting to host <server-ip>, port 5201
[  5] local <client-ip> port 33326 connected to <server-ip> port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  11.9 MBytes  99.9 Mbits/sec  8326
[  5]   1.00-2.00   sec  11.9 MBytes   100 Mbits/sec  8334
[  5]   2.00-3.00   sec  11.8 MBytes  98.7 Mbits/sec  8242
[  5]   3.00-4.00   sec  12.1 MBytes   101 Mbits/sec  8424
[  5]   4.00-5.00   sec  11.9 MBytes   100 Mbits/sec  8334
[  5]   5.00-6.00   sec  11.9 MBytes   100 Mbits/sec  8333
[  5]   6.00-7.00   sec  11.9 MBytes   100 Mbits/sec  8333
[  5]   7.00-8.00   sec  11.9 MBytes   100 Mbits/sec  8334
[  5]   8.00-9.00   sec  11.9 MBytes   100 Mbits/sec  8333
[  5]   9.00-10.00  sec  11.9 MBytes   100 Mbits/sec  8333
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec   119 MBytes   100 Mbits/sec  0.000 ms  0/83326 (0%)  sender
[  5]   0.00-10.00  sec   119 MBytes  99.6 Mbits/sec  0.005 ms  291/83326 (0.35%)  receiver
iperf Done.
```

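The client command above assumes an iperf3 server is already listening on the destination worker node or pod. A minimal sketch of starting one; the port shown is iperf3's default and is included only for clarity:

```console
# On the destination node or pod: listen for incoming iperf3 tests on the default port
iperf3 -s -p 5201
```
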
## Troubleshooting steps

The following troubleshooting steps can be used to diagnose the cluster.

### Gather information

To assist with the troubleshooting process, gather and provide the following cluster information (an example of retrieving some of these values follows the list):

* Subscription ID: the unique identifier of your Azure subscription.
* Tenant ID: the unique identifier of your Microsoft Entra tenant.
* Undercloud Name: the name of the undercloud resource associated with your deployment.
* Undercloud Resource Group: the resource group containing the undercloud resource.
* NAKS Cluster Name: the name of the NAKS cluster experiencing issues.
* NAKS Cluster Resource Group: the resource group containing the NAKS cluster.
* Inter-Switch Devices (ISD) connected to NAKS: details of the ISDs connected to the NAKS cluster.
* Source and Destination IPs: the source and destination IP addresses where packet drops are observed.

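Some of these values can be read directly from the Azure CLI. The following is a minimal sketch rather than an authoritative procedure: the resource names are placeholders, and the `networkcloud` CLI extension is assumed to be installed for the NAKS cluster query.

```console
# Show the subscription ID and tenant ID of the currently selected subscription
az account show --query "{subscriptionId:id, tenantId:tenantId}" --output table

# Confirm the NAKS cluster resource (assumes the networkcloud CLI extension is installed)
az networkcloud kubernetescluster show \
  --name <naks-cluster-name> \
  --resource-group <naks-cluster-resource-group> \
  --output table
```
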
### Verify provisioning status of the Network Fabric

In the Azure portal, verify that the Network Fabric (NF) resource is provisioned: the Provisioning State should be 'Succeeded' and the Configuration State should be 'Provisioned'.

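The same states can also be checked from the command line. This is a sketch that assumes the `managednetworkfabric` Azure CLI extension is installed; the resource names are placeholders:

```console
# Show the Network Fabric resource, including its provisioning and configuration state
az networkfabric fabric show \
  --resource-name <network-fabric-name> \
  --resource-group <network-fabric-resource-group>
```
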
### View iperf-client pod events

Use kubectl to inspect events from the iperf-client pod for more detailed information, which can help identify the root cause of the issue.

```console
kubectl get events --namespace default | grep iperf-client
```

Sample output:

```console
NAMESPACE   LAST SEEN   TYPE      REASON    OBJECT                             MESSAGE
default     5m39s       Warning   BackOff   pod/iperf-client-8f7974984-xr67p   Back-off restarting failed container iperf-client in pod iperf-client-8f7974984-xr67p_default(masked-id)
```

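If the events show a back-off such as the one above, the pod description and the logs from the previous container run usually point to why it keeps failing. For example, using the pod name from the sample output:

```console
# Show container states, restart counts, and recent events for the failing pod
kubectl describe pod iperf-client-8f7974984-xr67p --namespace default

# Show logs from the previous (failed) run of the container
kubectl logs iperf-client-8f7974984-xr67p --namespace default --previous
```
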
### Validate L3 ISD configuration

Confirm that the L3 ISD (Layer 3 Isolation Domain) configuration on the devices is correct.

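The isolation domain can also be inspected from the command line. A minimal sketch, assuming the `managednetworkfabric` Azure CLI extension is installed and using placeholder resource names:

```console
# Show the Layer 3 isolation domain, including its administrative and configuration state
az networkfabric l3domain show \
  --resource-name <l3-isolation-domain-name> \
  --resource-group <network-fabric-resource-group>
```
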
## Potential solutions

If the iperf-client pod is constantly restarting and other resource statuses appear healthy, the following remedies can be attempted:

### Adjust network buffer settings

Modify the network buffer settings to improve performance by adjusting the following parameters:

* `net.core.rmem_max`: increase the maximum receive buffer size.
* `net.core.wmem_max`: increase the maximum send buffer size.

Commands:

```console
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
```

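Note that `sysctl -w` changes don't survive a reboot. A minimal sketch of persisting them, assuming root access on the worker node (the drop-in file name is illustrative):

```console
# Persist the buffer sizes across reboots (file name is illustrative)
cat <<'EOF' > /etc/sysctl.d/90-net-buffers.conf
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
EOF

# Reload settings from all sysctl configuration files
sysctl --system
```
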
### Optimize iperf tool usage

Use iperf tool options to optimize buffer usage and run parallel streams:

* `-P`: the number of parallel client streams.
* `-w`: the window size (socket buffer size).

Example:

```console
iperf3 -c <destination-ip> -u -b 100M -l 1500 -P 4 -w 256k
```

If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).

For more information about support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
