Commit ed6fe8e

Final updates for first publish docs
1 parent 5035bfc commit ed6fe8e

2 files changed, +18 -11 lines changed

articles/aks/best-practices-performance-scale-large.md

Lines changed: 16 additions & 9 deletions
@@ -9,41 +9,45 @@ ms.date: 11/03/2023
99
# Best practices for performance and scaling for large workloads in Azure Kubernetes Service (AKS)
1010

1111
> [!NOTE]
12-
> This article focuses on best practices for **large workloads**. For best practices for **small to medium workloads**, see [Performance and scaling best practices for small to medium workloads in Azure Kubernetes Service (AKS)](./best-practices-performance-scale.md).
12+
> This article focuses on general best practices for **large workloads**. For best practices specific to **small to medium workloads**, see [Performance and scaling best practices for small to medium workloads in Azure Kubernetes Service (AKS)](./best-practices-performance-scale.md).
1313
1414
As you deploy and maintain clusters in AKS, you can use the following best practices to help you optimize performance and scaling.
1515

16-
Keep in mind that *large* is a relative term. Kubernetes is a multi-dimensional scale envelope, and the scale envelope for your workload depends on the resources you use. For example, a cluster with 100 nodes and thousands of pods or CRDs might be considered large. A 1,000 node cluster with 1,000 pods and various other resources might be considered small from the control plane perspective. The best signal for scale of a Kubernetes control plane is API server HTTP request success rate and latency, as that's a proxy for the amount of load on the control plane.
16+
Keep in mind that *large* is a relative term. Kubernetes has a multi-dimensional scale envelope, and the scale envelope for your workload depends on the resources you use. For example, a cluster with 100 nodes and thousands of pods or CRDs might be considered large. A 1,000 node cluster with 1,000 pods and various other resources might be considered small from the control plane perspective. The best signal for scale of a Kubernetes control plane is API server HTTP request success rate and latency, as that's a proxy for the amount of load on the control plane.
1717

1818
In this article, you learn about:
1919

2020
> [!div class="checklist"]
2121
>
2222
> * AKS and Kubernetes control plane scalability.
23-
> * Kube Client best practices, including backoff, watches, and pagination.
23+
> * Kubernetes Client best practices, including backoff, watches, and pagination.
2424
> * Azure API and platform throttling limits.
2525
> * Feature limitations.
2626
> * Networking and node pool scaling best practices.
2727
28-
## AKS control plane
28+
## AKS and Kubernetes control plane scalability
2929

30-
In AKS, a *cluster* consists of a set of nodes (physical or virtual machines (VMs)) that run Kubernetes agents and are managed by the control plane. Kubernetes has a multi-dimensional scale envelope with each resource type representing a dimension. Not all resources are alike. For example, *watches* are commonly set on secrets, which result in list calls to the kube-apiserver that add cost and a disproportionately higher load on the control plane compared to resources without watches.
30+
In AKS, a *cluster* consists of a set of nodes (physical or virtual machines (VMs)) that run Kubernetes agents and are managed by the Kubernetes control plane hosted by AKS. While AKS optimizes the Kubernetes control plane and its components for scalability and performance, it's still bound by the upstream project limits.
3131

32-
The control plane manages all resource scaling, so the more you scale the cluster within a given dimension, the less you can scale within other dimensions. For example, running hundreds of thousands of pods in an AKS cluster impacts how much pod churn rate (pod mutations per second) the control plane can support.
32+
Kubernetes has a multi-dimensional scale envelope with each resource type representing a dimension. Not all resources are alike. For example, *watches* are commonly set on secrets, which result in list calls to the kube-apiserver that add cost and a disproportionately higher load on the control plane compared to resources without watches.
33+
34+
The control plane manages all the resource scaling in the cluster, so the more you scale the cluster within a given dimension, the less you can scale within other dimensions. For example, running hundreds of thousands of pods in an AKS cluster impacts how much pod churn rate (pod mutations per second) the control plane can support.
3335

3436
The size of the envelope is proportional to the size of the Kubernetes control plane. AKS supports two control plane tiers as part of the Base SKU: the Free tier and the Standard tier. For more information, see [Free and Standard pricing tiers for AKS cluster management][free-standard-tier].
3537

3638
> [!IMPORTANT]
3739
> We highly recommend using the Standard tier for production or at-scale workloads. AKS automatically scales up the Kubernetes control plane to support the following scale limits:
3840
>
39-
> * 5,000 nodes per AKS cluster
41+
> * Up to 5,000 nodes per AKS cluster
4042
> * 200,000 pods per AKS cluster (with Azure CNI Overlay)
4143
4244
In most cases, crossing the scale limit threshold results in degraded performance, but doesn't cause the cluster to immediately fail over. To manage load on the Kubernetes control plane, consider scaling in batches of up to 10-20% of the current scale. For example, for a 5,000 node cluster, scale in increments of 500-1,000 nodes. While AKS does autoscale your control plane, it doesn't happen instantaneously.
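
The batching guidance above can be sketched as a small planning helper. This is only an illustration under stated assumptions: `batch_targets` and its `percent` parameter are hypothetical names, not part of AKS or any Azure tooling, and the function only computes intermediate node counts. It doesn't call any API or perform scaling.

```python
def batch_targets(current: int, target: int, percent: int = 10) -> list[int]:
    """Plan intermediate node counts for a scale-up.

    Each step adds at most `percent` percent of the node count at that step,
    following the 10-20% batching guidance. Hypothetical helper for planning only.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if current < 1 or target < current:
        raise ValueError("expected 1 <= current <= target")

    steps = []
    while current < target:
        # Integer arithmetic avoids float rounding; always grow by at least one node.
        increment = max(1, current * percent // 100)
        current = min(target, current + increment)
        steps.append(current)
    return steps


if __name__ == "__main__":
    # Scale from 1,000 to 1,300 nodes in batches of up to 10% of the current size.
    print(batch_targets(1000, 1300, percent=10))
```

Because AKS doesn't scale the control plane instantaneously, one option is to wait between the planned steps until API server latency and request success rate look healthy before applying the next one.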
4345

4446
You can leverage API Priority and Fairness (APF) to throttle specific clients and request types to protect the control plane during high churn and load.
4547

46-
## Kube Client
48+
## Kubernetes clients
49+
50+
Kubernetes clients are the application clients, such as operators or monitoring agents, deployed in the Kubernetes cluster that need to communicate with the kube-apiserver to perform read or mutate operations. It's important to optimize the behavior of these clients to minimize the load they add to the kube-apiserver and Kubernetes control plane.
4751

4852
AKS doesn't expose control plane and API server metrics via Prometheus or through platform metrics. However, you can analyze API server traffic and client behavior through Kube Audit logs. For more information, see [Troubleshoot the Kubernetes control plane](/troubleshoot/azure/azure-kubernetes-troubleshoot-apiserver-etcd#troubleshooting-checklist).
4953

@@ -78,9 +82,11 @@ Always upgrade your Kubernetes clusters to the latest version. Newer versions co
7882

7983
## Feature limitations
8084

81-
As you scale your AKS clusters to larger scale points, keep the following feature limitation in mind:
85+
As you scale your AKS clusters to larger scale points, keep the following feature limitations in mind:
8286

8387
* AKS supports up to a 1,000 node scale in an AKS cluster by default. While AKS doesn't prevent you from scaling further, doing so might result in degraded performance. If you want to scale beyond 1,000 nodes, you can request a limit increase. For more information, see [Best practices for creating and running AKS clusters at scale][run-aks-at-scale].
88+
* [Azure Network Policy Manager (Azure NPM)][azure-npm] only supports up to 250 nodes.
89+
* You can't use the Stop and Start feature with clusters that have more than 100 nodes. For more information, see [Stop and start an AKS cluster](./start-stop-cluster.md).
8490

8591
## Networking
8692

@@ -90,6 +96,7 @@ As you scale your AKS clusters to larger scale points, keep the following networ
9096
* Use Azure CNI Overlay to scale up to 200,000 pods and 5,000 nodes per cluster. For more information, see [Configure Azure CNI Overlay networking in AKS][azure-cni-overlay].
9197
* If your application needs direct pod-to-pod communication across clusters, use Azure CNI with dynamic IP allocation and scale up to 50,000 application pods per cluster with one routable IP per pod. For more information, see [Configure Azure CNI networking for dynamic IP allocation in AKS][azure-cni-dynamic-ip].
9298
* When using internal Kubernetes services behind an internal load balancer, we recommend creating an internal load balancer or service below a 750 node scale for optimal scaling performance and load balancer elasticity.
99+
* Azure NPM only supports up to 250 nodes. If you want to enforce network policies for larger clusters, consider using [Azure CNI powered by Cilium](./azure-cni-powered-by-cilium.md), which combines the robust control plane of Azure CNI with the Cilium data plane to provide high performance networking and security.
93100

94101
## Node pool scaling
95102

articles/aks/best-practices-performance-scale.md

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ ms.date: 11/03/2023
99
# Best practices for performance and scaling for small to medium workloads in Azure Kubernetes Service (AKS)
1010

1111
> [!NOTE]
12-
> This article focuses on best practices for **small to medium workloads**. For best practices for **large workloads**, see [Performance and scaling best practices for large workloads in Azure Kubernetes Service (AKS)](./best-practices-performance-scale-large.md).
12+
> This article focuses on general best practices for **small to medium workloads**. For best practices specific to **large workloads**, see [Performance and scaling best practices for large workloads in Azure Kubernetes Service (AKS)](./best-practices-performance-scale-large.md).
1313
1414
As you deploy and maintain clusters in AKS, you can use the following best practices to help you optimize performance and scaling.
1515

@@ -151,7 +151,7 @@ The Ubuntu 2204 image is fully supported by Microsoft, Canonical, and the Ubuntu
151151
152152
Application performance is closely tied to the VM SKUs you use in your workloads. Larger and more powerful VMs generally provide better performance. For *mission critical or product workloads*, we recommend using VMs with at least an 8-core CPU. VMs with newer hardware generations, like v4 and v5, can also help improve performance. Keep in mind that create and scale latency might vary depending on the VM SKUs you use.
153153
154-
### Node pools
154+
### Use dedicated system node pools
155155
156156
For scaling performance and reliability, we recommend using a dedicated system node pool. With this configuration, the dedicated system node pool reserves space for critical system resources such as system OS daemons. Your application workload can then run in a user node pool to increase the availability of allocatable resources for your application. This configuration also helps mitigate the risk of resource competition between the system and application.
157157
