Commit d0662c7

Merge pull request #263586 from kevinkrp93/5k-_ccp_scaling
5k ccp scaling
2 parents 81ea29c + b057758 commit d0662c7

6 files changed: +22 -76 lines changed

.openpublishing.redirection.azure-kubernetes-service.json

Lines changed: 5 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -461,6 +461,11 @@
461461
"source_path_from_root": "/articles/aks/command-invoke.md",
462462
"redirect_url": "/azure/aks/access-private-cluster",
463463
"redirect_document_id": false
464+
},
465+
{
466+
"source_path_from_root": "/articles/aks/operator-best-practices-run-at-scale.md",
467+
"redirect_url": "/azure/aks/best-practices-performance-scale-large.md",
468+
"redirect_document_id": false
464469
}
465470
]
466471
}

articles/aks/TOC.yml

Lines changed: 0 additions & 2 deletions
@@ -190,8 +190,6 @@
190190
href: operator-best-practices-cluster-isolation.md
191191
- name: Basic scheduler features
192192
href: operator-best-practices-scheduler.md
193-
- name: Run AKS clusters at scale
194-
href: operator-best-practices-run-at-scale.md
195193
- name: Advanced scheduler features
196194
href: operator-best-practices-advanced-scheduler.md
197195
- name: Networking

articles/aks/best-practices-app-cluster-reliability.md

Lines changed: 2 additions & 2 deletions
@@ -493,7 +493,7 @@ For more information, see [Configure Azure CNI networking for dynamic allocation
493493
>
494494
> Use v5 VM SKUs for improved performance during and after updates, less overall impact, and a more reliable connection for your applications.
495495

496-
For node pools in AKS, use v5 SKU VMs with ephemeral OS disks to provide sufficient compute resources for kube-system pods. For more information, see [Best practices for creating and running AKS clusters at scale](./operator-best-practices-run-at-scale.md) and [Best practices for performance and scaling for large workloads in AKS](./best-practices-performance-scale-large.md).
496+
For node pools in AKS, use v5 SKU VMs with ephemeral OS disks to provide sufficient compute resources for kube-system pods. For more information, see [Best practices for performance and scaling large workloads in AKS](./best-practices-performance-scale-large.md).
497497

498498
### Do *not* use B series VMs
499499

@@ -562,5 +562,5 @@ For more information, see [Secure your AKS clusters with Azure Policy](./use-azu
562562
This article focused on best practices for deployment and cluster reliability for Azure Kubernetes Service (AKS) clusters. For more best practices, see the following articles:
563563

564564
* [High availability and disaster recovery overview for AKS](./ha-dr-overview.md)
565-
* [Run AKS clusters at scale](./operator-best-practices-run-at-scale.md)
565+
* [Run AKS clusters at scale](./best-practices-performance-scale-large.md)
566566
* [Baseline architecture for an AKS cluster](/azure/architecture/reference-architectures/containers/aks/baseline-aks)

articles/aks/best-practices-performance-scale-large.md

Lines changed: 14 additions & 3 deletions
@@ -84,7 +84,13 @@ Always upgrade your Kubernetes clusters to the latest version. Newer versions co
8484

8585
As you scale your AKS clusters to larger scale points, keep the following feature limitations in mind:
8686

87-
* AKS supports up to a 1,000 node scale in an AKS cluster by default. While AKS doesn't prevent you from scaling further, doing so might result in degraded performance. If you want to scale beyond 1,000 nodes, you can request a limit increase. For more information, see [Best practices for creating and running AKS clusters at scale][run-aks-at-scale].
87+
* AKS supports scaling up to 5,000 nodes by default for all Standard Tier / LTS clusters. AKS scales your cluster's control plane at runtime based on cluster size and API server resource utilization. If you cannot scale up to the supported limit, enable [control plane metrics (Preview)](./monitor-control-plane-metrics.md) with the [Azure Monitor managed service for Prometheus](../azure-monitor/essentials/prometheus-metrics-overview.md) to monitor the control plane. To help troubleshoot scaling performance or reliability issues, see the following resources:
88+
* [AKS at scale troubleshooting guide](/troubleshoot/azure/azure-kubernetes/aks-at-scale-troubleshoot-guide)
89+
* [Troubleshoot the Kubernetes control plane](/troubleshoot/azure/azure-kubernetes/troubleshoot-apiserver-etcd)
90+
91+
> [!NOTE]
92+
> During the operation to scale the control plane, you might encounter elevated API server latency or timeouts for up to 15 minutes. If you continue to have problems scaling to the supported limit, open a [support ticket](https://portal.azure.com/#create/Microsoft.Support/Parameters/%7B%0D%0A%09%22subId%22%3A+%22%22%2C%0D%0A%09%22pesId%22%3A+%225a3a423f-8667-9095-1770-0a554a934512%22%2C%0D%0A%09%22supportTopicId%22%3A+%2280ea0df7-5108-8e37-2b0e-9737517f0b96%22%2C%0D%0A%09%22contextInfo%22%3A+%22AksLabelDeprecationMarch22%22%2C%0D%0A%09%22caller%22%3A+%22Microsoft_Azure_ContainerService+%2B+AksLabelDeprecationMarch22%22%2C%0D%0A%09%22severity%22%3A+%223%22%0D%0A%7D).
93+
8894
* [Azure Network Policy Manager (Azure npm)][azure-npm] only supports up to 250 nodes.
8995
* You can't use the Stop and Start feature with clusters that have more than 100 nodes. For more information, see [Stop and start an AKS cluster](./start-stop-cluster.md).
9096

@@ -107,8 +113,11 @@ As you scale your AKS clusters to larger scale points, keep the following node p
107113
* When running at-scale AKS clusters, use the cluster autoscaler whenever possible to ensure dynamic scaling of node pools based on the demand for compute resources. For more information, see [Automatically scale an AKS cluster to meet application demands][cluster-autoscaler].
108114
* If you're scaling beyond 1,000 nodes and are *not* using the cluster autoscaler, we recommend scaling in batches of 500-700 nodes at a time. The scaling operations should have a two-minute to five-minute wait time between scale up operations to prevent Azure API throttling. For more information, see [API management: Caching and throttling policies][throttling-policies].
109115

110-
> [!NOTE]
111-
> You can't use [Azure Network Policy Manager (Azure NPM)][azure-npm] with clusters that have more than 500 nodes.
116+
## Cluster upgrade considerations and best practices
117+
118+
* When a cluster reaches the 5,000 node limit, cluster upgrades are blocked. This limit prevents an upgrade because there isn't available node capacity to perform rolling updates within the max surge property limit. If you have a cluster at this limit, we recommend [scaling down the cluster](./concepts-scale.md) under 3,000 nodes before attempting a cluster upgrade. This will provide extra capacity for node churn and minimize load on the control plane.
119+
* When upgrading clusters with more than 500 nodes, it is recommended to use a [max surge configuration](./upgrade-aks-cluster.md#set-max-surge-value) of 10-20% of the node pool's capacity. AKS configures upgrades with a default value of 10% for max surge. You can customize the max surge settings per node pool to enable a trade-off between upgrade speed and workload disruption. When you increase the max surge settings, the upgrade process completes faster, but you might experience disruptions during the upgrade process. For more information, see [Customize node surge upgrade][max surge].
120+
* For more cluster upgrade information, see [Upgrade an AKS cluster][cluster upgrades].
112121

113122
<!-- LINKS - Internal --->
114123
[run-aks-at-scale]: ./operator-best-practices-run-at-scale.md
@@ -118,6 +127,8 @@ As you scale your AKS clusters to larger scale points, keep the following node p
118127
[pricing-tiers]: ./free-standard-pricing-tiers.md
119128
[cluster-autoscaler]: cluster-autoscaler.md
120129
[azure-npm]: ../virtual-network/kubernetes-network-policies.md
130+
[cluster upgrades]: upgrade-cluster.md
131+
[max surge]: upgrade-aks-cluster.md#customize-node-surge-upgrade
121132

122133
<!-- LINKS - External -->
123134
[throttling-policies]: https://azure.microsoft.com/blog/api-management-advanced-caching-and-throttling-policies/
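
The node pool guidance in this file's diff (scale in batches of 500-700 nodes, waiting two to five minutes between scale-up operations when not using the cluster autoscaler) can be sketched as a small planning helper. This is an illustrative sketch only and not part of the commit: the function names `plan_scale_batches` and `run_scale_plan`, the default batch size of 600, and the 180-second wait are assumptions chosen to fall inside the recommended ranges, and the `apply_scale` callback is a hypothetical placeholder for whatever tool actually performs the scale operation.

```python
import time
from typing import Callable, List


def plan_scale_batches(current: int, target: int, batch_size: int = 600) -> List[int]:
    """Return the intermediate total node counts for a batched scale-up.

    batch_size defaults to 600, an assumed value inside the 500-700 range
    recommended in the article text.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if target < current:
        raise ValueError("this sketch only plans scale-up operations")
    steps: List[int] = []
    count = current
    while count < target:
        # Never overshoot the target on the final batch.
        count = min(count + batch_size, target)
        steps.append(count)
    return steps


def run_scale_plan(
    steps: List[int],
    apply_scale: Callable[[int], None],
    wait_seconds: int = 180,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Apply each step, pausing between steps to reduce API throttling risk.

    apply_scale is a hypothetical callback supplied by the caller; the
    180-second default is an assumed value inside the two-to-five-minute range.
    """
    for index, node_count in enumerate(steps):
        apply_scale(node_count)
        # No wait is needed after the final scale operation.
        if index < len(steps) - 1:
            sleep(wait_seconds)


if __name__ == "__main__":
    # Plan a scale-up from 1,000 to 2,500 nodes.
    print(plan_scale_batches(1000, 2500))  # prints [1600, 2200, 2500]
```

The plan works on a total node count. Because the limits table in this commit caps a single Virtual Machine Scale Sets node pool at 1,000 nodes, a real rollout would likely need to spread each batch across several node pools, which this sketch does not model.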

articles/aks/operator-best-practices-run-at-scale.md

Lines changed: 0 additions & 68 deletions
This file was deleted.

includes/container-service-limits.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ ms.custom: include file
1212
| Resource | Limit |
1313
|--|:-|
1414
| Maximum clusters per subscription | 5000 <br />Note: spread clusters across different regions to account for Azure API throttling limits |
15-
| Maximum nodes per cluster with Virtual Machine Scale Sets and [Standard Load Balancer SKU][standard-load-balancer] | 5000 across all [node pools][node-pool] (default limit: 1000) <br />Note: Running more than a 1000 nodes per cluster requires increasing the default node limit quota. [Contact support][Contact Support] for assistance. |
15+
| Maximum nodes per cluster with Virtual Machine Scale Sets and [Standard Load Balancer SKU][standard-load-balancer] | 5000 across all [node-pools][node-pool] <br />Note: If you are unable to scale up to 5000 nodes per cluster, see [Best Practices for Large Clusters](../articles/aks/best-practices-performance-scale-large.md). |
1616
| Maximum nodes per node pool (Virtual Machine Scale Sets node pools) | 1000 |
1717
| Maximum node pools per cluster | 100 |
1818
| Maximum pods per node: with [Kubenet][Kubenet] networking plug-in<sup>1</sup> | Maximum: 250 <br /> Azure CLI default: 110 <br /> Azure Resource Manager template default: 110 <br /> Azure portal deployment default: 30 |
