Commit 5a5043a

committed
Incorporated newly-added content into guides
1 parent 8e4ecc5 commit 5a5043a

File tree

2 files changed: +101 -17 lines changed

articles/aks/best-practices-performance-scale-large.md

Lines changed: 31 additions & 1 deletion
@@ -3,7 +3,7 @@ title: Performance and scaling best practices for large workloads in Azure Kuber
titleSuffix: Azure Kubernetes Service
description: Learn the best practices for performance and scaling for large workloads in Azure Kubernetes Service (AKS).
ms.topic: conceptual
-ms.date: 11/01/2023
+ms.date: 11/02/2023
---

# Best practices for performance and scaling for large workloads in Azure Kubernetes Service (AKS)
@@ -23,6 +23,36 @@ In this article, you learn about:

> [!NOTE]
> This article focuses on best practices for **large workloads**. For best practices for **small to medium workloads**, see [Performance and scaling best practices for small to medium workloads in Azure Kubernetes Service (AKS)](./best-practices-performance-scale.md).

## AKS control plane

In AKS, a *cluster* consists of a set of nodes (physical or virtual machines) that run Kubernetes agents and are managed by the control plane. Kubernetes has a multi-dimensional scale envelope, with each resource type representing a dimension. Not all resources are alike. For example, *watches* are commonly set on secrets, which results in list calls to the kube-apiserver that add cost and a disproportionately higher load on the control plane compared to resources without watches.
The control plane manages all resource scaling, so the more you scale the cluster within a given dimension, the less you can scale within other dimensions. For example, running hundreds of thousands of pods in an AKS cluster impacts how much pod churn the control plane can support.

The size of the envelope is proportional to the size of the Kubernetes control plane. AKS supports two control plane tiers as part of the Base SKU: the Free tier and the Standard tier. For more information, see [Free and Standard pricing tiers for AKS cluster management](./free-standard-pricing-tiers.md).

> [!IMPORTANT]
> We highly recommend using the Standard tier for production or at-scale workloads. AKS automatically scales up the Kubernetes control plane to support the following scale limits:
>
> * 5,000 nodes per AKS cluster
> * 200,000 pods per AKS cluster (with Azure CNI Overlay)

In most cases, crossing a scale limit threshold results in degraded performance, but doesn't cause the cluster to immediately fail over. To manage load on the Kubernetes control plane, consider scaling in batches of up to 10-20% of the current scale. For example, for a 5,000-node cluster, scale in increments of 500-1,000 nodes. While AKS does autoscale your control plane, it doesn't happen instantaneously. You can leverage API Priority and Fairness (APF) to throttle specific clients and request types to protect the control plane during high churn and load.
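
As a rough sketch of the APF option, the following manifests route LIST/WATCH traffic from one client into a low-concurrency priority level so it can't starve the control plane. The names (`low-churn`, `deprioritize-heavy-lister`, the `heavy-lister` service account) are hypothetical, and the exact `flowcontrol.apiserver.k8s.io` API version depends on your Kubernetes release:

```yaml
# Hypothetical example: cap the concurrency available to one noisy client.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: low-churn                   # hypothetical name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 5     # small share of apiserver concurrency
    limitResponse:
      type: Reject                  # shed excess requests instead of queuing
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: deprioritize-heavy-lister   # hypothetical name
spec:
  priorityLevelConfiguration:
    name: low-churn
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: heavy-lister      # hypothetical client identity
            namespace: default
      resourceRules:
        - verbs: ["list", "watch"]
          apiGroups: [""]
          resources: ["pods"]
          namespaces: ["*"]
```

Matching requests are then throttled at the apiserver rather than competing with higher-priority traffic such as node heartbeats.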
## Kube client

LIST requests can be expensive. When working with lists that might have more than a few thousand small objects or more than a few hundred large objects, you should consider the following guidelines:
* **Consider the number of objects (CRs) you expect to eventually exist** when defining a new resource type (CRD).
* **The load on etcd and the API server primarily depends on the number of objects that exist, not the number of objects returned.** Even if you use a field selector to filter the list and retrieve only a small number of results, these guidelines still apply. The only exception is retrieval of a single object by `metadata.name`.
* **Avoid repeated LIST calls if possible** if your code needs to maintain an updated list of objects in memory. Instead, consider using the Informer classes provided in most Kubernetes libraries. Informers automatically combine LIST and WATCH functionality to efficiently maintain an in-memory collection.
* **Consider whether you need strong consistency** if Informers don't meet your needs. Do you need to see the most recent data, up to the exact moment in time you issued the query? If not, set `ResourceVersion=0`, which causes the API server cache to serve your request instead of etcd.
* **If you can't use Informers or the API server cache, read large lists in chunks.**
* **Avoid listing more often than needed.** If you can't use Informers, consider how often your application lists the resources. After you read the last object in a large list, don't immediately re-query the same list; wait a while instead.
* **Consider the number of running instances of your client application.** There's a big difference between a single controller listing objects and pods on each node doing the same thing. If you plan to have multiple instances of your client application periodically listing large numbers of objects, your solution won't scale to large clusters.
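
The chunked-read guideline above can be sketched as follows. This is a self-contained illustration of the Kubernetes `limit`/`continue` pagination semantics; `list_pods_page` is a hypothetical stand-in for a paged LIST endpoint, not a real client call:

```python
from typing import Iterator, List, Optional, Tuple

def list_pods_page(all_pods: List[str], limit: int,
                   cont: Optional[int]) -> Tuple[List[str], Optional[int]]:
    """Hypothetical paged LIST endpoint: returns up to `limit` items plus a
    continue token, mimicking the API server's limit/continue semantics."""
    start = cont or 0
    end = min(start + limit, len(all_pods))
    next_cont = end if end < len(all_pods) else None
    return all_pods[start:end], next_cont

def list_in_chunks(all_pods: List[str], chunk_size: int = 500) -> Iterator[str]:
    """Read a large list in chunks instead of one expensive LIST call."""
    cont: Optional[int] = None
    while True:
        page, cont = list_pods_page(all_pods, chunk_size, cont)
        yield from page
        if cont is None:
            break

pods = [f"pod-{i}" for i in range(1200)]
print(sum(1 for _ in list_in_chunks(pods, chunk_size=500)))  # 1200
```

Each page bounds the response size the API server must materialize; the client still sees every object, just across several cheap requests.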
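
The Informer guidance above boils down to: LIST once to seed a local cache, then apply WATCH events incrementally. A minimal sketch of that pattern follows; the event tuples and object shapes are hypothetical, not the real client-go or kubernetes-client API:

```python
from typing import Dict, Iterable, Tuple

class TinyInformer:
    """Toy informer: one LIST seeds the cache, WATCH events keep it current,
    so the client never needs to repeat expensive LIST calls."""

    def __init__(self) -> None:
        self.cache: Dict[str, dict] = {}

    def seed(self, objects: Iterable[dict]) -> None:
        """Initial LIST: populate the cache once."""
        for obj in objects:
            self.cache[obj["name"]] = obj

    def handle_event(self, event: Tuple[str, dict]) -> None:
        """WATCH stream: apply ADDED/MODIFIED/DELETED incrementally."""
        kind, obj = event
        if kind == "DELETED":
            self.cache.pop(obj["name"], None)
        else:  # ADDED or MODIFIED
            self.cache[obj["name"]] = obj

informer = TinyInformer()
informer.seed([{"name": "pod-a", "phase": "Running"}])
informer.handle_event(("ADDED", {"name": "pod-b", "phase": "Pending"}))
informer.handle_event(("MODIFIED", {"name": "pod-b", "phase": "Running"}))
informer.handle_event(("DELETED", {"name": "pod-a", "phase": "Running"}))
print(sorted(informer.cache))  # ['pod-b']
```

Real informers also handle reconnects and resyncs, but the cost profile is the same: one LIST up front, then cheap incremental updates.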
## Throttling

The load on a cloud application can vary over time based on factors such as the number of active users or the types of actions that users perform. If the processing requirements of the system exceed the capacity of the available resources, the system can become overloaded and suffer from poor performance and failures.
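
Client-side self-throttling complements the server-side limits this section describes. A minimal token-bucket sketch is below; this is a generic pattern, not an AKS or Kubernetes API, and the `rate` and `capacity` values are illustrative:

```python
import time

class TokenBucket:
    """Client-side rate limiter: requests spend tokens, which refill at a
    fixed rate; bursts up to `capacity` are allowed, then excess is rejected."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=2)
results = [bucket.allow() for _ in range(4)]
print(results)  # [True, True, False, False]
```

The first two calls consume the burst allowance; subsequent calls are rejected until tokens refill, smoothing the load the client places on the service.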
