---
title: Troubleshoot packet loss between NAKS worker nodes for Azure Operator Nexus
description: Troubleshoot packet loss between NAKS worker nodes, and learn how to debug failure codes.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 10/31/2024
ms.author: yinongdai
author: yinongdai
---

# Troubleshoot packet loss between NAKS worker nodes for Azure Operator Nexus

This guide provides detailed steps for troubleshooting packet loss between NAKS worker nodes.

## Prerequisites

* Command-line access to the Nexus Kubernetes Cluster
* Necessary permissions to make changes to Nexus Kubernetes Cluster objects
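
A quick way to confirm both prerequisites, assuming your kubeconfig already targets the affected cluster:

```console
# List the worker nodes to verify command-line access and node status
kubectl get nodes -o wide
```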

## Symptoms

Network diagnostic tools, such as iperf, report a high percentage of lost packets during data transfer tests. Detailed logs from networking tools show an abnormal number of dropped or lost packets.
Sample output:
```console
iperf3 -c <server-ip> -u -b 100M -l 1500
Connecting to host <server-ip>, port 5201
[  5] local <client-ip> port 33326 connected to <server-ip> port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  11.9 MBytes  99.9 Mbits/sec  8326
[  5]   1.00-2.00   sec  11.9 MBytes   100 Mbits/sec  8334
[  5]   2.00-3.00   sec  11.8 MBytes  98.7 Mbits/sec  8242
[  5]   3.00-4.00   sec  12.1 MBytes   101 Mbits/sec  8424
[  5]   4.00-5.00   sec  11.9 MBytes   100 Mbits/sec  8334
[  5]   5.00-6.00   sec  11.9 MBytes   100 Mbits/sec  8333
[  5]   6.00-7.00   sec  11.9 MBytes   100 Mbits/sec  8333
[  5]   7.00-8.00   sec  11.9 MBytes   100 Mbits/sec  8334
[  5]   8.00-9.00   sec  11.9 MBytes   100 Mbits/sec  8333
[  5]   9.00-10.00  sec  11.9 MBytes   100 Mbits/sec  8333
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec   119 MBytes   100 Mbits/sec  0.000 ms  0/83326 (0%)  sender
[  5]   0.00-10.00  sec   119 MBytes  99.6 Mbits/sec  0.005 ms  291/83326 (0.35%)  receiver
iperf Done.
```
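The client output above assumes an iperf3 server is already listening on the destination worker node; if one isn't running, start it there first:

```console
# On the destination node: listen for incoming iperf3 tests (default port 5201)
iperf3 -s
```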

## Troubleshooting Steps

Use the following steps to diagnose the cluster.

### Gather Information

To assist with the troubleshooting process, gather and provide the following cluster information (see the sketch after this list for one way to retrieve the Azure identifiers):

* Subscription ID: the unique identifier of your Azure subscription.
* Tenant ID: the unique identifier of your Microsoft Entra (formerly Azure Active Directory) tenant.
* Undercloud Name: the name of the undercloud resource associated with your deployment.
* Undercloud Resource Group: the resource group containing the undercloud resource.
* NAKS Cluster Name: the name of the NAKS cluster experiencing issues.
* NAKS Cluster Resource Group: the resource group containing the NAKS cluster.
* Isolation Domains (ISDs) connected to NAKS: the details of the ISDs that are connected to the NAKS cluster.
* Source and Destination IPs: the source and destination IP addresses where packet drops are being observed.
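
For example, the subscription and tenant IDs for the current Azure CLI context can be read back with a one-liner:

```console
# Show the subscription ID and tenant ID of the active Azure CLI context
az account show --query "{subscriptionId:id, tenantId:tenantId}" --output table
```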

### Verify Provisioning Status of the Network Fabric

Verify in the Azure portal that the Network Fabric (NF) is in the provisioned state; the Provisioning State should be 'Succeeded' and the Configuration State 'Provisioned'.
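
The same check can be scripted. As a sketch, assuming the Azure CLI `managednetworkfabric` extension is installed and `<fabric-name>` and `<fabric-rg>` are placeholders for your fabric resource:

```console
# Read back the fabric's provisioning and configuration state
az networkfabric fabric show \
  --resource-name <fabric-name> \
  --resource-group <fabric-rg> \
  --query "{provisioningState:provisioningState, configurationState:configurationState}"
```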

### View Pod Events

Use kubectl to inspect events from the iperf-client pod for more detailed information; these events can help identify the root cause of the issue with the iperf-client pod.
```console
kubectl get events --namespace default | grep iperf-client
```

Sample output:

```console
NAMESPACE   LAST SEEN   TYPE      REASON    OBJECT                             MESSAGE
default     5m39s       Warning   BackOff   pod/iperf-client-8f7974984-xr67p   Back-off restarting failed container iperf-client in pod iperf-client-8f7974984-xr67p_default(masked-id)
```
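
If the events show a crashing container, as in the sample output above, the container logs (including those from the previous restart) usually carry more detail; the pod name below is taken from that sample output:

```console
# Logs from the running container, then from the previous (failed) instance
kubectl logs iperf-client-8f7974984-xr67p --namespace default
kubectl logs iperf-client-8f7974984-xr67p --namespace default --previous
```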

### Validate L3 ISD Configuration

Confirm that the L3 ISD (Layer 3 Isolation Domain) configuration on the devices is correct.
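
One way to review the configuration from the CLI (again assuming the `managednetworkfabric` extension; resource names are placeholders):

```console
# Inspect the L3 isolation domain, including its administrative and configuration state
az networkfabric l3domain show \
  --resource-name <l3-isolation-domain-name> \
  --resource-group <fabric-rg>
```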

### Adjust Network Buffer Settings

Modify the network buffer settings to improve performance by adjusting the following parameters:

* net.core.rmem_max: Increase the maximum receive buffer size.
* net.core.wmem_max: Increase the maximum send buffer size.

Commands:
```console
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
```
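
Settings applied with `sysctl -w` don't survive a reboot. If the larger buffers resolve the loss, one way to persist them (assuming root access and a standard `/etc/sysctl.d` layout on the node) is a drop-in file:

```console
# Persist the buffer sizes across reboots via a sysctl drop-in file
cat <<'EOF' | sudo tee /etc/sysctl.d/90-net-buffers.conf
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
EOF
sudo sysctl --system
```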

### Optimize iperf Tool Usage

Use iperf tool options to optimize buffer usage and run parallel streams:

* -P: Number of parallel client streams.
* -w: Window size (socket buffer size).

Example:

```console
iperf3 -c <destination-ip> -u -b 100M -l 1500 -P 4 -w 256k
```
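
For comparison, the same options can be used for a TCP run, where the `-w` window size applies directly (drop the UDP-specific `-u`, `-b`, and `-l` flags):

```console
iperf3 -c <destination-ip> -P 4 -w 256k
```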

If you still have questions, [contact support](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade).
For more information about Support plans, see [Azure Support plans](https://azure.microsoft.com/support/plans/response/).
