ms.custom: sap:Create, Upgrade, Scale and Delete operations (cluster or nodepool)
---
# Troubleshoot API server and etcd problems in Azure Kubernetes Services
### Step 2: Identify and chart the average latency of API server requests per user agent

**1.a.** Use the API server resource intensive listing detector in the Azure portal

> **New:** Azure Kubernetes Service now provides a built-in analyzer to help you identify agents that make resource-intensive LIST calls, which are a leading cause of API server and etcd performance issues.

**How to access the detector:**

1. Open your AKS cluster in the Azure portal.
2. Go to **Diagnose and solve problems**.
3. Select **Cluster and Control Plane Availability and Performance**.
4. Select **API server resource intensive listing detector**.

This detector analyzes recent API server activity and highlights agents or workloads that generate large or frequent LIST calls. It provides a summary of the potential impacts, such as request timeouts, increased 408/503 errors, node instability, health probe failures, and OOM kills in the API server or etcd.
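To cross-check these symptoms outside the portal, you can also inspect the API server's raw metrics. The following is only a rough sketch: `apiserver_request_total` is an upstream Kubernetes metric that carries a `code` label, and the exact grep pattern is an assumption that you might need to adjust for your cluster:

```
# Rough cross-check: API server request counters that recorded HTTP 5xx status codes.
# A sustained rise in 5xx responses (for example, 503) is one symptom of an overloaded API server.
kubectl get --raw /metrics | grep 'apiserver_request_total' | grep 'code="5'
```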
#### How to interpret the detector output

- **Summary:** Indicates whether resource-intensive LIST calls were detected and describes possible impacts on your cluster.
- **Analysis window:** Shows the 30-minute window that was analyzed, with peak memory and CPU usage.
- **Read types:** Explains whether LIST calls were served from the API server cache (preferred) or required fetching from etcd (most impactful). See the example after this list.
- **Charts and tables:** Identify which agents, namespaces, or workloads are generating the most resource-intensive LIST calls.

> Only successful LIST calls are counted. Failed or throttled calls are excluded.
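To make the read-type distinction concrete, here's an illustrative pair of raw LIST requests. The caching behavior follows the upstream Kubernetes semantics of the `resourceVersion` parameter; the exact resource path is only an example:

```
# Served from the API server's watch cache: resourceVersion=0 asks for "any" available version.
kubectl get --raw '/api/v1/pods?resourceVersion=0'

# Requires a quorum read from etcd: omitting resourceVersion asks for the most recent data.
kubectl get --raw '/api/v1/pods'
```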
The analyzer also provides actionable recommendations directly in the Azure portal, tailored to the detected patterns, to help you remediate and optimize your cluster.

> **Note:**
> The API server resource intensive listing detector is available to all users who have access to the AKS resource in the Azure portal. No special permissions or prerequisites are required.
>
> After you identify the offending agents and apply the preceding recommendations, you can also use [API Priority and Fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) (a sketch follows this note) or refer to [Cause 3: An offending client makes excessive LIST or PUT calls](https://review.learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd?branch=pr-en-us-9260&tabs=resource-specific#cause-3-an-offending-client-makes-excessive-list-or-put-calls) to throttle or isolate problematic clients.
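As an illustration of that approach, the following is a minimal API Priority and Fairness sketch that confines a noisy client's LIST calls to a small, queued priority level. It uses the upstream `flowcontrol.apiserver.k8s.io/v1` API (Kubernetes 1.29 and later); the object names and the `heavy-lister` service account are hypothetical placeholders, not values that the detector produces:

```
# Hypothetical example: cap a single offending service account's LIST traffic.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: restrict-heavy-list
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 5        # small slice of API server concurrency
    limitResponse:
      type: Queue
      queuing:
        queues: 1
        queueLengthLimit: 10
        handSize: 1
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: restrict-heavy-list
spec:
  priorityLevelConfiguration:
    name: restrict-heavy-list
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: heavy-lister             # hypothetical offending client
        namespace: default
    resourceRules:
    - verbs: ["list"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]
```

After you apply a configuration like this, the `apiserver_flowcontrol_*` series in `kubectl get --raw /metrics` show whether the client's requests are being queued or rejected under the new flow schema.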
**1.b.** Additionally, you can run the following query to identify the average latency of API server requests per user agent, plotted on a time chart:
### [Resource-specific](#tab/resource-specific)
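As a rough sketch of such a query, assuming that control plane audit logs are sent to a Log Analytics workspace in resource-specific mode (the `AKSAudit` table) and that latency can be computed from `RequestReceivedTime` and `StageReceivedTime` (assumptions to adapt to your schema):

```
// Average API server request latency per user agent over the last hour, in 5-minute bins.
AKSAudit
| where TimeGenerated between (now(-1h) .. now())
| where Stage == "ResponseComplete"
| extend LatencyMs = datetime_diff("millisecond", StageReceivedTime, RequestReceivedTime)
| summarize AvgLatencyMs = avg(LatencyMs) by UserAgent, bin(TimeGenerated, 5m)
| render timechart
```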
```
kubectl get --raw /metrics | grep "restrict-bad-client"
```
## Cause 4: A custom webhook might cause a deadlock in API server pods
A custom webhook, such as Kyverno, might be causing a deadlock within API server pods.
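As a first investigative step, you can list the admission webhooks that are registered on the cluster and review their timeout, failure policy, and scope. This is a generic Kubernetes check rather than an AKS-specific command; replace `<webhook-name>` with a configuration from the first command's output:

```
# List all registered admission webhook configurations on the cluster.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations

# Inspect timeoutSeconds, failurePolicy, and namespaceSelector for one configuration;
# a webhook that intercepts traffic it depends on (for example, its own pods) can deadlock requests.
kubectl get mutatingwebhookconfigurations <webhook-name> -o yaml
```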