
Commit c7284e0

Diagnose and solve UDP packet drops in AKS
1 parent af93dae commit c7284e0

2 files changed (+134 -0 lines changed)

articles/aks/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -554,6 +554,8 @@
 href: api-server-vnet-integration.md
 - name: Private Endpoint
 href: private-clusters.md#use-a-private-endpoint-connection
+- name: Diagnose and solve UDP packet drops
+href: troubleshoot-udp-packet-drops.md
 - name: Storage
 items:
 - name: CSI storage drivers
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
---
title: Diagnose and solve UDP packet drops in Azure Kubernetes Service (AKS)
description: Learn how to diagnose and solve UDP packet drops in Azure Kubernetes Service (AKS).
ms.topic: how-to
ms.date: 05/08/2024
author: schaffererin
ms.author: schaffererin
ms.service: azure-kubernetes-service
---

# Diagnose and solve UDP packet drops in Azure Kubernetes Service (AKS)

This article describes how to diagnose and solve UDP packet drops in Azure Kubernetes Service (AKS). It focuses on drops caused by a receive buffer that's too small, which can overflow when network traffic is high.

UDP, or *User Datagram Protocol*, is a connectionless transport protocol that applications running in AKS clusters can use. Because UDP doesn't establish a connection before sending data, it provides no guarantee of delivery, reliability, or ordering: UDP packets can be lost, duplicated, or arrive out of order at the destination.

## Prerequisites

* An AKS cluster with at least one node pool and one pod running a UDP-based application.
* Azure CLI installed and configured. For more information, see [Install the Azure CLI](/cli/azure/install-azure-cli).
* kubectl installed and configured to connect to your AKS cluster. For more information, see [Install kubectl](https://kubernetes.io/docs/tasks/tools/).
* A client machine that can send UDP packets to, and receive UDP packets from, your AKS cluster.

## Issue: UDP traffic has a high packet drop rate

One possible cause of UDP packet loss is that the UDP buffer size is too small to handle the incoming traffic. The UDP buffer size determines how much data the kernel can hold for a socket before the application reads it. When the buffer is full, the kernel drops any further packets that arrive for that socket. This setting is managed at the virtual machine (VM) level for your nodes. The default value is *212992 bytes*, or about *0.2 MB*.

Two VM-level variables apply to the UDP receive buffer size:

* `net.core.rmem_max = 212992 bytes`: The maximum receive buffer size, per socket, that an application can request (for example, with the `SO_RCVBUF` socket option).
* `net.core.rmem_default = 212992 bytes`: The default receive buffer size, per socket, for sockets that don't request a specific size.

To allow the buffer to grow to serve more traffic, you need to increase the maximum read buffer size based on your application's requirements. Raising `net.core.rmem_max` only raises the ceiling: a socket gets a larger buffer only if the application requests one, or if `net.core.rmem_default` is also raised.

> [!IMPORTANT]
> This article focuses on Ubuntu Linux kernel buffer sizes. To see other configurations for Linux and Windows, see [Customize node configuration for AKS node pools](./custom-node-configuration.md).

## Diagnose the issue

1. Get a list of your nodes using the `kubectl get nodes` command, and pick a node whose buffer settings you want to check.

    ```bash
    kubectl get nodes
    ```

2. Start a debug pod on the node you selected using the `kubectl debug` command. Replace `<node-name>` with the name of the node you want to debug.

    ```bash
    kubectl debug node/<node-name> -it --image=ubuntu --share-processes -- bash
    ```

3. Get the current read (`net.core.rmem_max`, `net.core.rmem_default`) and write (`net.core.wmem_max`, `net.core.wmem_default`) buffer settings using the following `sysctl` command:

    ```bash
    sysctl net.core.rmem_max net.core.rmem_default net.core.wmem_max net.core.wmem_default
    ```

    If `sysctl` isn't available in the debug container image, you can read the same values directly from `/proc/sys/net/core/`, for example with `cat /proc/sys/net/core/rmem_max`.

4. Simulate realistic network traffic on your pods to check whether the buffer is too small for your application and is dropping packets.

5. While the traffic is running, check the UDP socket table using the following `cat` command:

    ```bash
    cat /proc/net/udp
    ```

    This file shows statistics for the UDP sockets that are currently open. The `rx_queue` column shows, in hexadecimal, how many bytes of receive-buffer memory each socket is currently using for queued data, and the last column, `drops`, counts packets dropped for each socket. The file doesn't show historical data.

6. Check the SNMP counters using the following `cat` command:

    ```bash
    cat /proc/net/snmp
    ```

    This file shows cumulative UDP statistics since the node booted, including how many packets were dropped because a receive buffer was full, under the `RcvbufErrors` column.
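
The check in step 5 can be scripted so that only the interesting sockets are shown. The following is a minimal sketch, not part of any AKS or kernel tooling: `udp_socket_stats` is a hypothetical shell function that decodes `/proc/net/udp` and prints only sockets that have queued data or recorded drops. It assumes the standard Linux column layout (hexadecimal `tx_queue:rx_queue` in the fifth column and a decimal `drops` counter in the last column) and uses only POSIX `awk`, because `gawk`-only functions such as `strtonum` might not be available in the debug container.

```bash
# Hypothetical helper (not an AKS or kernel tool): print UDP sockets that
# currently have data queued in their receive buffer or have dropped packets.
# Usage: udp_socket_stats [FILE]   # FILE defaults to /proc/net/udp
udp_socket_stats() {
  awk '
    # Convert a hexadecimal string to a decimal number (POSIX awk has no strtonum).
    function hex2dec(h,    i, v) {
      v = 0
      h = toupper(h)
      for (i = 1; i <= length(h); i++)
        v = v * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
      return v
    }
    NR == 1 { next }                 # skip the header row
    {
      split($2, addr, ":")           # local_address is HEXIP:HEXPORT
      split($5, queues, ":")         # tx_queue:rx_queue, both hexadecimal
      rx = hex2dec(queues[2])        # receive-buffer bytes currently in use
      drops = $NF + 0                # per-socket drop counter (decimal)
      if (rx > 0 || drops > 0)
        printf "port=%d rx_queue=%d drops=%d\n", hex2dec(addr[2]), rx, drops
    }
  ' "${1:-/proc/net/udp}"
}
```

Run `udp_socket_stats` in the debug pod while traffic is flowing, or `udp_socket_stats /proc/net/udp6` for IPv6 sockets, which use the same layout. Repeated runs that show `rx_queue` close to the buffer size, or a growing `drops` count for your application's port, point to the same problem as a rising `RcvbufErrors` counter.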

If the `rx_queue` value for a socket approaches its buffer size, or the `RcvbufErrors` count keeps increasing while traffic is running, consider increasing your buffer size.
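
Because `RcvbufErrors` is cumulative, a high value might reflect drops from earlier in the node's life. To see whether it's still increasing, you can sample it twice and compare. The following is a minimal sketch with hypothetical function names; it assumes the standard `/proc/net/snmp` layout, in which a `Udp:` row of column names is followed by a `Udp:` row of values. These rows count IPv4 traffic only; IPv6 UDP counters are reported separately in `/proc/net/snmp6`.

```bash
# Hypothetical helpers (not AKS tools) for watching the IPv4 UDP RcvbufErrors
# counter in /proc/net/snmp.
# Usage: udp_rcvbuf_errors [FILE]   # FILE defaults to /proc/net/snmp
udp_rcvbuf_errors() {
  awk '
    $1 == "Udp:" {
      if (col == 0) {                # first "Udp:" row holds the column names
        for (i = 2; i <= NF; i++)
          if ($i == "RcvbufErrors") col = i
      } else {                       # second "Udp:" row holds the values
        print $col
        exit
      }
    }
  ' "${1:-/proc/net/snmp}"
}

# Usage: udp_rcvbuf_delta SECONDS [FILE]
# Prints how many RcvbufErrors were added during the sampling interval.
udp_rcvbuf_delta() {
  before=$(udp_rcvbuf_errors "$2")
  sleep "$1"
  after=$(udp_rcvbuf_errors "$2")
  echo $((after - before))
}
```

For example, running `udp_rcvbuf_delta 60` in the debug pod during a load test prints how many packets were dropped for lack of receive-buffer space during that minute. A non-zero result that grows with traffic suggests that the buffer is too small for the bursts your application sees.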

> [!WARNING]
> If your application consistently runs at or beyond the buffer limits, simply increasing the buffer size might not be the best solution. In such cases, analyze and optimize how your application processes UDP requests. Increasing the buffer size mainly helps when bursts of traffic cause the buffer to run out of space, because a larger buffer gives the application extra time to drain the burst before the kernel starts dropping packets.
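
To reason about whether a larger buffer would cover your bursts, you can estimate how much data accumulates while packets arrive faster than the application reads them: required bytes ≈ (arrival rate - drain rate) × burst duration. The following hypothetical helper is a rough sketch, not an official sizing formula. It treats both rates as bytes of payload per second and ignores the kernel's per-packet bookkeeping overhead, which is also charged against the buffer, so treat the result as a lower bound.

```bash
# Hypothetical sizing helper: estimate the receive buffer (in bytes) needed to
# absorb a burst without drops. Ignores per-packet kernel overhead, so the real
# requirement is higher than the number printed.
# Usage: udp_burst_buffer_bytes ARRIVAL_BYTES_PER_SEC DRAIN_BYTES_PER_SEC BURST_MS
udp_burst_buffer_bytes() {
  arrival=$1
  drain=$2
  burst_ms=$3
  if [ "$arrival" -le "$drain" ]; then
    echo 0            # the application keeps up, so the queue never grows
    return 0
  fi
  echo $(( (arrival - drain) * burst_ms / 1000 ))
}
```

For example, `udp_burst_buffer_bytes 50000000 30000000 200` (50 MB/s arriving, 30 MB/s drained, for a 200 ms burst) prints `4000000`, about 4 MB, which is well above the 212992-byte default. If the arrival rate exceeds the drain rate for long periods rather than in short bursts, no buffer size is enough, which is the situation the warning describes.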

## Mitigate the issue

> [!IMPORTANT]
> Before you proceed, make sure you understand the impact of changing the buffer size. The buffer size sets how much memory the kernel allows each socket to use for queued data. More sockets and larger buffers can increase the memory consumed by socket buffers under load, leaving less memory available for other workloads on the nodes. If not configured properly, this can lead to resource starvation.
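
One way to gauge this impact is a worst-case estimate: the number of sockets that might fill their buffers at the same time, multiplied by the per-socket limit. The following hypothetical helper does that arithmetic. It's an upper-bound sketch: memory is consumed only while data is actually queued. However, on Linux, a buffer size that an application requests with `SO_RCVBUF` is doubled by the kernel to allow for bookkeeping overhead, so use twice the requested size as the per-socket limit for such sockets.

```bash
# Hypothetical helper: worst-case memory (in MiB) if SOCKETS sockets each fill a
# receive buffer of BYTES_PER_SOCKET bytes at the same time.
# Usage: udp_worst_case_mib SOCKETS BYTES_PER_SOCKET
udp_worst_case_mib() {
  echo $(( $1 * $2 / 1048576 ))   # 1 MiB = 1048576 bytes; result rounds down
}
```

For example, `udp_worst_case_mib 100 8000000` prints `762`, so 100 sockets that each have an 8,000,000-byte buffer could hold up to roughly 762 MiB of queued data on a single node. Compare that figure with the memory available on your node SKU.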

You can set buffer size values at the node pool level when you create a node pool. You can't add this setting to an existing node pool. The steps in this section show you how to configure the Linux OS settings and apply them to all nodes in a new node pool.

1. Create a `linuxosconfig.json` file with the following content. You can modify the values based on your application's requirements and node SKU. The minimum value is *212992 bytes*, and the maximum value is *134217728 bytes*.

    ```json
    {
        "sysctls": {
            "netCoreRmemMax": 8000000
        }
    }
    ```

2. From the directory that contains the `linuxosconfig.json` file, create a new node pool with the buffer size configuration using the [`az aks nodepool add`][az-aks-nodepool-add] command.

    ```azurecli-interactive
    az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME --name $NODE_POOL_NAME --linux-os-config ./linuxosconfig.json
    ```

    This command sets the maximum socket receive buffer size to about 8 MB (8,000,000 bytes) on each node in the new node pool. You can adjust the value in the `linuxosconfig.json` file to meet your application's requirements. If your application doesn't request a buffer size itself, consider also setting `netCoreRmemDefault` in the same file so that new sockets start with a larger buffer.
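
If you generate this file from a script, for example in a deployment pipeline, you can check the value against the allowed range before creating the node pool. The following is a minimal sketch with a hypothetical function name; it writes the same JSON structure shown in step 1 and rejects values outside the 212992 to 134217728 byte range stated above.

```bash
# Hypothetical helper: validate a netCoreRmemMax value and write linuxosconfig.json.
# Usage: write_linuxosconfig BYTES [OUTPUT_FILE]   # OUTPUT_FILE defaults to ./linuxosconfig.json
write_linuxosconfig() {
  rmem_max=$1
  out=${2:-linuxosconfig.json}
  case "$rmem_max" in
    ''|*[!0-9]*)
      echo "error: netCoreRmemMax must be a whole number of bytes" >&2
      return 1 ;;
  esac
  if [ "$rmem_max" -lt 212992 ] || [ "$rmem_max" -gt 134217728 ]; then
    echo "error: netCoreRmemMax must be between 212992 and 134217728 bytes" >&2
    return 1
  fi
  # Same structure as the linuxosconfig.json example in step 1.
  printf '{\n    "sysctls": {\n        "netCoreRmemMax": %s\n    }\n}\n' "$rmem_max" > "$out"
}
```

For example, `write_linuxosconfig 8000000` recreates the file from step 1, while `write_linuxosconfig 100` prints an error and returns a non-zero status without writing anything.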

## Validate the changes

After you apply the new values, you can access a node in the new node pool to confirm that the new values are set.

1. Get a list of your nodes using the `kubectl get nodes` command, and pick a node in the new node pool whose buffer settings you want to check.

    ```bash
    kubectl get nodes
    ```

2. Start a debug pod on the node you selected using the `kubectl debug` command. Replace `<node-name>` with the name of the node you want to debug.

    ```bash
    kubectl debug node/<node-name> -it --image=ubuntu --share-processes -- bash
    ```

3. Get the values of the `net.core.rmem_max` and `net.core.wmem_max` variables using the following `sysctl` command:

    ```bash
    sysctl net.core.rmem_max net.core.wmem_max
    ```
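
To compare the live value with what you configured, you can also read it directly from `/proc/sys/net/core/rmem_max`, which works even when `sysctl` isn't installed in the debug container. The following sketch uses a hypothetical `check_rmem_max` function; its optional second argument exists only so that it can be pointed at a different directory for testing.

```bash
# Hypothetical helper: compare the node's net.core.rmem_max with an expected value.
# Usage: check_rmem_max EXPECTED_BYTES [SYSCTL_DIR]   # SYSCTL_DIR defaults to /proc/sys/net/core
check_rmem_max() {
  expected=$1
  dir=${2:-/proc/sys/net/core}
  actual=$(cat "$dir/rmem_max") || return 1
  if [ "$actual" = "$expected" ]; then
    echo "OK: net.core.rmem_max = $actual"
  else
    echo "MISMATCH: net.core.rmem_max = $actual, expected $expected"
    return 1
  fi
}
```

For example, on a node in a node pool created with the `linuxosconfig.json` file from this article, `check_rmem_max 8000000` should print `OK: net.core.rmem_max = 8000000`.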

## Next steps

In this article, you learned how to diagnose and solve UDP packet drops in Azure Kubernetes Service (AKS). For more information on how to troubleshoot issues in AKS, see the [Azure Kubernetes Service troubleshooting documentation](/troubleshoot/azure/azure-kubernetes/welcome-azure-kubernetes).

<!-- LINKS -->
[az-aks-nodepool-add]: /cli/azure/aks/nodepool#az-aks-nodepool-add
