Commit f272336
Merge pull request #259118 from schaffererin/aks-cluster-autoscaler-best-practices
Cluster autoscaler best practices doc
2 parents bb51b0c + b5fe37b

8 files changed: +203 -176 lines changed

articles/aks/TOC.yml

Lines changed: 5 additions & 1 deletion

@@ -346,7 +346,11 @@
     - name: Proximity placement groups
       href: reduce-latency-ppg.md
     - name: Cluster Autoscaler
-      href: cluster-autoscaler.md
+      items:
+      - name: Cluster Autoscaler overview
+        href: cluster-autoscaler-overview.md
+      - name: Use the Cluster Autoscaler on AKS
+        href: cluster-autoscaler.md
     - name: Node autoprovision
       href: node-autoprovision.md
     - name: Availability Zones

articles/aks/availability-zones.md

Lines changed: 3 additions & 0 deletions

@@ -68,6 +68,9 @@ AKS clusters deployed using availability zones can distribute nodes across multi
 If a single zone becomes unavailable, your applications continue to run on clusters configured to spread across multiple zones.
 
+> [!NOTE]
+> When implementing **availability zones with the [cluster autoscaler](./cluster-autoscaler-overview.md)**, we recommend using a single node pool for each zone. You can set the `--balance-similar-node-groups` parameter to `True` to maintain a balanced distribution of nodes across zones for your workloads during scale up operations. When this approach isn't implemented, scale down operations can disrupt the balance of nodes across zones.
+
 ## Create an AKS cluster across availability zones
 
 When you create a cluster using the [az aks create][az-aks-create] command, the `--zones` parameter specifies the availability zones to deploy agent nodes into. The availability zones that the managed control plane components are deployed into are **not** controlled by this parameter. They are automatically spread across all availability zones (if present) in the region during cluster deployment.
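To illustrate the note added above — a minimal sketch, assuming a cluster named `myAKSCluster` in resource group `myResourceGroup` (both placeholder names) and a region with zone 1 available:

```bash
# One autoscaler-enabled node pool pinned to a single zone (repeat for zones 2 and 3).
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name npzone1 \
  --zones 1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5

# Ask the autoscaler to keep similar node groups balanced during scale up.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --cluster-autoscaler-profile balance-similar-node-groups=true
```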

articles/aks/best-practices-performance-scale-large.md

Lines changed: 1 addition & 1 deletion

@@ -76,7 +76,7 @@ Keeping the above considerations in mind, customers are typically able to deploy
 Always upgrade your Kubernetes clusters to the latest version. Newer versions contain many improvements that address performance and throttling issues. If you're using an upgraded version of Kubernetes and still see throttling due to the actual load or the number of clients in the subscription, you can try the following options:
 
 * **Analyze errors using AKS Diagnose and Solve Problems**: You can use [AKS Diagnose and Solve Problems](./aks-diagnostics.md) to analyze errors, identify the root cause, and get resolution recommendations.
-* **Increase the Cluster Autoscaler scan interval**: If the diagnostic reports show that [Cluster Autoscaler throttling has been detected](/troubleshoot/azure/azure-kubernetes/429-too-many-requests-errors#analyze-and-identify-errors-by-using-aks-diagnose-and-solve-problems), you can [increase the scan interval](./cluster-autoscaler.md#change-the-cluster-autoscaler-settings) to reduce the number of calls to Virtual Machine Scale Sets from the Cluster Autoscaler.
+* **Increase the Cluster Autoscaler scan interval**: If the diagnostic reports show that [Cluster Autoscaler throttling has been detected](/troubleshoot/azure/azure-kubernetes/429-too-many-requests-errors#analyze-and-identify-errors-by-using-aks-diagnose-and-solve-problems), you can [increase the scan interval](./cluster-autoscaler.md#update-the-cluster-autoscaler-settings) to reduce the number of calls to Virtual Machine Scale Sets from the Cluster Autoscaler.
 * **Reconfigure third-party applications to make fewer calls**: If you filter by *user agents* in the ***View request rate and throttle details*** diagnostic and see that [a third-party application, such as a monitoring application, makes a large number of GET requests](/troubleshoot/azure/azure-kubernetes/429-too-many-requests-errors#analyze-and-identify-errors-by-using-aks-diagnose-and-solve-problems), you can change the settings of these applications to reduce the frequency of the GET calls. Make sure the application clients use exponential backoff when calling Azure APIs.
 * **Split your clusters into different subscriptions or regions**: If you have a large number of clusters and node pools that use Virtual Machine Scale Sets, you can split them into different subscriptions or regions within the same subscription. Most Azure API limits are shared at the subscription-region level, so you can move or scale your clusters to different subscriptions or regions to get unblocked on Azure API throttling. This option is especially helpful if you expect your clusters to have high activity. There are no generic guidelines for these limits. If you want specific guidance, you can create a support ticket.
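As a sketch of the scan-interval adjustment recommended in the list above (resource names are placeholders; the 60-second value is illustrative, not a recommendation):

```bash
# Increase the autoscaler scan interval to reduce calls to Virtual Machine Scale Sets.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --cluster-autoscaler-profile scan-interval=60s
```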

articles/aks/best-practices-performance-scale.md

Lines changed: 2 additions & 2 deletions

@@ -57,7 +57,7 @@ Implementing [vertical pod autoscaling](./vertical-pod-autoscaler.md) is useful
 Implementing cluster autoscaling is useful if your existing nodes lack sufficient capacity, as it helps with scaling up and provisioning new nodes.
 
-When considering cluster autoscaling, the decision of when to remove a node involves a tradeoff between optimizing resource utilization and ensuring resource availability. Eliminating underutilized nodes enhances cluster utilization but might result in new workloads having to wait for resources to be provisioned before they can be deployed. It's important to find a balance between these two factors that aligns with your cluster and workload requirements and [configure the cluster autoscaler profile settings accordingly](./cluster-autoscaler.md#change-the-cluster-autoscaler-settings).
+When considering cluster autoscaling, the decision of when to remove a node involves a tradeoff between optimizing resource utilization and ensuring resource availability. Eliminating underutilized nodes enhances cluster utilization but might result in new workloads having to wait for resources to be provisioned before they can be deployed. It's important to find a balance between these two factors that aligns with your cluster and workload requirements and [configure the cluster autoscaler profile settings accordingly](./cluster-autoscaler.md#update-the-cluster-autoscaler-settings).
 
 The Cluster Autoscaler profile settings apply universally to all autoscaler-enabled node pools in your cluster. This means that any scaling actions occurring in one autoscaler-enabled node pool might impact the autoscaling behavior in another node pool. It's important to apply consistent and synchronized profile settings across all relevant node pools to ensure that the autoscaler behaves as expected.
 
@@ -234,7 +234,7 @@ The following table provides a breakdown of suggested use cases for OS disks sup
 #### IOPS and throughput
 
-Input/output operations per second (IOPS) refers to the number of read and write operations that a disk can perform in a second. Throughout refers to the amount of data that can be transferred in a given time period.
+Input/output operations per second (IOPS) refers to the number of read and write operations that a disk can perform in a second. Throughput refers to the amount of data that can be transferred in a given time period.
 
 OS disks are responsible for storing the operating system and its associated files, and the VMs are responsible for running the applications. When selecting a VM, ensure the size and performance of the OS disk and VM SKU don't have a large discrepancy. A discrepancy in size or performance can cause performance issues and resource contention. For example, if the OS disk is significantly smaller than the VMs, it can limit the amount of space available for application data and cause the system to run out of disk space. If the OS disk has lower performance than the VMs, it can become a bottleneck and limit the overall performance of the system. Make sure the size and performance are balanced to ensure optimal performance in Kubernetes.

articles/aks/cluster-autoscaler-overview.md

Lines changed: 109 additions & 0 deletions

@@ -0,0 +1,109 @@
---
title: Cluster autoscaling in Azure Kubernetes Service (AKS) overview
titleSuffix: Azure Kubernetes Service
description: Learn about cluster autoscaling in Azure Kubernetes Service (AKS) using the cluster autoscaler.
ms.topic: conceptual
ms.date: 01/05/2024
---

# Cluster autoscaling in Azure Kubernetes Service (AKS) overview

To keep up with application demands in Azure Kubernetes Service (AKS), you might need to adjust the number of nodes that run your workloads. The cluster autoscaler component watches for pods in your cluster that can't be scheduled because of resource constraints. When the cluster autoscaler detects issues, it scales up the number of nodes in the node pool to meet the application demand. It also regularly checks nodes for a lack of running pods and scales down the number of nodes as needed.

This article helps you understand how the cluster autoscaler works in AKS. It also provides guidance, best practices, and considerations for configuring the cluster autoscaler for your AKS workloads. If you want to enable, disable, or update the cluster autoscaler for your AKS workloads, see [Use the cluster autoscaler in AKS](./cluster-autoscaler.md).
## About the cluster autoscaler

Clusters often need a way to scale automatically to adjust to changing application demands, such as between workdays and evenings or weekends. AKS clusters can scale in the following ways:

* The **cluster autoscaler** periodically checks for pods that can't be scheduled on nodes because of resource constraints. The cluster then automatically increases the number of nodes. Manual scaling is disabled when you use the cluster autoscaler. For more information, see [How does scale up work?](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-up-work).
* The **[Horizontal Pod Autoscaler][horizontal-pod-autoscaler]** uses the Metrics Server in a Kubernetes cluster to monitor the resource demand of pods. If an application needs more resources, the number of pods is automatically increased to meet the demand.
* The **[Vertical Pod Autoscaler][vertical-pod-autoscaler]** automatically sets resource requests and limits on containers per workload based on past usage to ensure pods are scheduled onto nodes that have the required CPU and memory resources.

:::image type="content" source="media/cluster-autoscaler/cluster-autoscaler.png" alt-text="Screenshot of how the cluster autoscaler and horizontal pod autoscaler often work together to support the required application demands.":::

It's common practice to enable the cluster autoscaler for nodes and either the Vertical Pod Autoscaler or Horizontal Pod Autoscaler for pods. When you enable the cluster autoscaler, it applies the specified scaling rules when the node pool size is lower than the minimum or greater than the maximum. The cluster autoscaler doesn't take action until a new node is needed in the node pool or until a node can be safely deleted from the current node pool. For more information, see [How does scale down work?](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work).
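As a minimal sketch of enabling the cluster autoscaler on an existing node pool (all resource names are placeholders; the counts are illustrative):

```bash
# Let the autoscaler size nodepool1 between 1 and 5 nodes; manual scaling is disabled.
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
```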
## Best practices and considerations

* When implementing **availability zones with the cluster autoscaler**, we recommend using a single node pool for each zone. You can set the `--balance-similar-node-groups` parameter to `True` to maintain a balanced distribution of nodes across zones for your workloads during scale up operations. When this approach isn't implemented, scale down operations can disrupt the balance of nodes across zones.
* For **clusters with more than 400 nodes**, we recommend using Azure CNI or Azure CNI Overlay.
* To **effectively run workloads concurrently on both Spot and Fixed node pools**, consider using [*priority expanders*](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-expanders). This approach allows you to schedule pods based on the priority of the node pool.
* Exercise caution when **assigning CPU/Memory requests on pods**. The cluster autoscaler scales up based on pending pods rather than CPU/Memory pressure on nodes.
* For **clusters concurrently hosting both long-running workloads, like web apps, and short/bursty job workloads**, we recommend separating them into distinct node pools with [Affinity Rules](./operator-best-practices-advanced-scheduler.md#node-affinity)/[expanders](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-expanders) or using [PriorityClass](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass) to help prevent unnecessary node drain or scale down operations (see the sketch after this list).
* We **don't recommend making direct changes to nodes in autoscaled node pools**. All nodes in the same node group should have uniform capacity, labels, and system pods running on them.
* Nodes don't scale up if pods have a PriorityClass value below -10. Priority -10 is reserved for [overprovisioning pods](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler). For more information, see [Using the cluster autoscaler with Pod Priority and Preemption](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-cluster-autoscaler-work-with-pod-priority-and-preemption).
* **Don't combine other node autoscaling mechanisms**, such as Virtual Machine Scale Set autoscalers, with the cluster autoscaler.
* The cluster autoscaler **might be unable to scale down if pods can't move**, such as in the following situations:
  * A pod is directly created and isn't backed by a controller object, such as a Deployment or ReplicaSet.
  * A pod disruption budget (PDB) is too restrictive and doesn't allow the number of pods to fall below a certain threshold.
  * A pod uses node selectors or anti-affinity that can't be honored if scheduled on a different node.

  For more information, see [What types of pods can prevent the cluster autoscaler from removing a node?](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node).
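As a sketch of the PriorityClass approach mentioned in the list above (the class name, value, and description are illustrative; any value above -10 keeps nodes eligible for scale up):

```bash
# A PriorityClass for short/bursty jobs; long-running web apps would get a higher value
# so that job pods are preempted first instead of draining nodes hosting the web apps.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: bursty-jobs
value: 100
globalDefault: false
description: "Priority for short-lived, preemptible job workloads."
EOF
```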
## Cluster autoscaler profile

The [cluster autoscaler profile](./cluster-autoscaler.md#cluster-autoscaler-profile-settings) is a set of parameters that control the behavior of the cluster autoscaler. You can configure the cluster autoscaler profile when you create a cluster or update an existing cluster.
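For example, a hedged sketch of setting a single profile parameter on an existing cluster (placeholder names; `expander=least-waste` is one illustrative key/value pair, and multiple pairs can be passed space-separated):

```bash
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --cluster-autoscaler-profile expander=least-waste
```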
### Optimizing the cluster autoscaler profile

You should fine-tune the cluster autoscaler profile settings according to your specific workload scenarios while also considering tradeoffs between performance and cost. This section provides examples that demonstrate those tradeoffs.

Keep in mind that the cluster autoscaler profile settings are cluster-wide and apply to all autoscale-enabled node pools. Any scaling actions that take place in one node pool can affect the autoscaling behavior of other node pools, which can lead to unexpected results. Make sure you apply consistent and synchronized profile configurations across all relevant node pools to ensure you get your desired results.
#### Example 1: Optimizing for performance

For clusters that handle substantial and bursty workloads with a primary focus on performance, we recommend increasing the `scan-interval` and decreasing the `scale-down-utilization-threshold`. These settings help batch multiple scaling operations into a single call, optimizing scaling time and the utilization of compute read/write quotas. They also help mitigate the risk of swift scale down operations on underutilized nodes, enhancing pod scheduling efficiency.

For clusters with daemonset pods, we recommend setting `ignore-daemonset-utilization` to `true`, which effectively ignores node utilization by daemonset pods and minimizes unnecessary scale down operations.
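A sketch of Example 1 (resource names are placeholders, the values are illustrative rather than prescriptive, and `ignore-daemonset-utilization` may require a recent Azure CLI version):

```bash
# Batch scaling decisions (longer scan interval) and scale down less eagerly.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --cluster-autoscaler-profile scan-interval=30s scale-down-utilization-threshold=0.3 ignore-daemonset-utilization=true
```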
#### Example 2: Optimizing for cost

If you want a cost-optimized profile, we recommend setting the following parameter configurations (a sketch follows the list):

* Reduce `scale-down-unneeded-time`, which is the amount of time a node should be unneeded before it's eligible for scale down.
* Reduce `scale-down-delay-after-add`, which is the amount of time to wait after a node is added before considering it for scale down.
* Increase `scale-down-utilization-threshold`, which is the utilization threshold for removing nodes.
* Increase `max-empty-bulk-delete`, which is the maximum number of nodes that can be deleted in a single call.
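A sketch of Example 2 under the same placeholder names, with illustrative values only:

```bash
# Reclaim unneeded nodes sooner and in larger batches.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --cluster-autoscaler-profile scale-down-unneeded-time=3m scale-down-delay-after-add=3m scale-down-utilization-threshold=0.7 max-empty-bulk-delete=25
```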
## Common issues and mitigation recommendations

### Not triggering scale up operations

| Common causes | Mitigation recommendations |
|--------------|--------------|
| PersistentVolume node affinity conflicts, which can arise when using the cluster autoscaler with multiple availability zones or when a pod's or persistent volume's zone differs from the node's zone. | Use one node pool per availability zone and enable `--balance-similar-node-groups`. You can also set the [`volumeBindingMode` field to `WaitForFirstConsumer`](./azure-disk-csi.md#create-a-custom-storage-class) in the storage class to prevent the volume from being bound to a node until a pod using the volume is created (see the sketch after this table). |
| Taints and Tolerations/Node affinity conflicts | Assess the taints assigned to your nodes and review the tolerations defined in your pods. If necessary, make adjustments to the [taints and tolerations](./operator-best-practices-advanced-scheduler.md#provide-dedicated-nodes-using-taints-and-tolerations) to ensure that your pods can be efficiently scheduled on your nodes. |
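A sketch of the `WaitForFirstConsumer` mitigation (assumes the Azure Disk CSI driver; the class name is a placeholder):

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-wffc
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
# Delay volume binding until a pod is scheduled, so the disk is created in that pod's zone.
volumeBindingMode: WaitForFirstConsumer
EOF
```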
### Scale up operation failures

| Common causes | Mitigation recommendations |
|--------------|--------------|
| IP address exhaustion in the subnet | Add another subnet in the same virtual network and add another node pool into the new subnet. |
| Core quota exhaustion | Approved core quota has been exhausted. [Request a quota increase](../quotas/quickstart-increase-quota-portal.md). The cluster autoscaler enters an [exponential backoff state](#node-pool-in-backoff) within the specific node group when it experiences multiple failed scale up attempts. |
| Max size of node pool | Increase the max nodes on the node pool or create a new node pool. |
| Requests/Calls exceeding the rate limit | See [429 Too Many Requests errors](/troubleshoot/azure/azure-kubernetes/429-too-many-requests-errors). |
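For the node pool max size case, a sketch of raising the limits on an autoscaler-enabled pool (placeholder names and counts):

```bash
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --update-cluster-autoscaler \
  --min-count 1 \
  --max-count 10
```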
### Scale down operation failures

| Common causes | Mitigation recommendations |
|--------------|--------------|
| Pod preventing node drain/Unable to evict pod |• View [what types of pods can prevent scale down](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node). <br> • For pods using local storage, such as hostPath and emptyDir, set the cluster autoscaler profile flag `skip-nodes-with-local-storage` to `false`. <br> • In the pod specification, set the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation to `true` (see the sketch after this table). <br> • Check your [PDB](https://kubernetes.io/docs/tasks/run-application/configure-pdb/), as it might be restrictive. |
| Min size of node pool | Reduce the minimum size of the node pool. |
| Requests/Calls exceeding the rate limit | See [429 Too Many Requests errors](/troubleshoot/azure/azure-kubernetes/429-too-many-requests-errors). |
| Write operations locked | Don't make any changes to the [fully managed AKS resource group](./cluster-configuration.md#fully-managed-resource-group-preview) (see [AKS support policies](./support-policies.md)). Remove or reset any [resource locks](../azure-resource-manager/management/lock-resources.md) you previously applied to the resource group. |
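A sketch of the `safe-to-evict` mitigation (the pod name is a placeholder; in practice you'd set the annotation in the workload's pod template so it survives pod recreation):

```bash
kubectl annotate pod mypod cluster-autoscaler.kubernetes.io/safe-to-evict=true
```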
### Other issues

| Common causes | Mitigation recommendations |
|--------------|--------------|
| PriorityConfigMapNotMatchedGroup | Make sure that you add all the node groups requiring autoscaling to the [expander configuration file](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/readme.md#configuration). |
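A hedged sketch of the expander configuration, following the format in the upstream readme linked above (the node pool name patterns and priority values are placeholders; higher numbers mean higher priority):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*spot.*
    50:
      - .*fixed.*
EOF
```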
### Node pool in backoff

Node pool in backoff was introduced in version 0.6.2 and causes the cluster autoscaler to back off from scaling a node pool after a failure.

Depending on how long the scaling operations have been experiencing failures, it may take up to 30 minutes before the next attempt. You can reset the node pool's backoff state by disabling and then re-enabling autoscaling.
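A sketch of that reset (placeholder names; the `--min-count`/`--max-count` values must be restated when re-enabling):

```bash
az aks nodepool update --resource-group myResourceGroup --cluster-name myAKSCluster \
  --name nodepool1 --disable-cluster-autoscaler

az aks nodepool update --resource-group myResourceGroup --cluster-name myAKSCluster \
  --name nodepool1 --enable-cluster-autoscaler --min-count 1 --max-count 5
```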
<!-- LINKS --->
[vertical-pod-autoscaler]: vertical-pod-autoscaler.md
[horizontal-pod-autoscaler]: concepts-scale.md#horizontal-pod-autoscaler
