Merge pull request #263586 from kevinkrp93/5k-_ccp_scaling

AnnaMHuff · web-flow · commit d0662c78d9fa · 2024-03-19T11:52:17.000-06:00
5k  ccp scaling
diff --git a/.openpublishing.redirection.azure-kubernetes-service.json b/.openpublishing.redirection.azure-kubernetes-service.json
@@ -461,6 +461,11 @@
             "source_path_from_root": "/articles/aks/command-invoke.md",
             "redirect_url": "/azure/aks/access-private-cluster",
             "redirect_document_id": false
+        },
+        {
+            "source_path_from_root": "/articles/aks/operator-best-practices-run-at-scale.md",
+            "redirect_url": "/azure/aks/best-practices-performance-scale-large.md",
+            "redirect_document_id": false
         }
     ]
 }
diff --git a/articles/aks/TOC.yml b/articles/aks/TOC.yml
@@ -190,8 +190,6 @@
           href: operator-best-practices-cluster-isolation.md
         - name: Basic scheduler features
           href: operator-best-practices-scheduler.md
-        - name: Run AKS clusters at scale
-          href: operator-best-practices-run-at-scale.md
         - name: Advanced scheduler features
           href: operator-best-practices-advanced-scheduler.md
         - name: Networking
diff --git a/articles/aks/best-practices-app-cluster-reliability.md b/articles/aks/best-practices-app-cluster-reliability.md
@@ -493,7 +493,7 @@ For more information, see [Configure Azure CNI networking for dynamic allocation
 >
 > Use v5 VM SKUs for improved performance during and after updates, less overall impact, and a more reliable connection for your applications.
 
-For node pools in AKS, use v5 SKU VMs with ephemeral OS disks to provide sufficient compute resources for kube-system pods. For more information, see [Best practices for creating and running AKS clusters at scale](./operator-best-practices-run-at-scale.md) and [Best practices for performance and scaling for large workloads in AKS](./best-practices-performance-scale-large.md).
+For node pools in AKS, use v5 SKU VMs with ephemeral OS disks to provide sufficient compute resources for kube-system pods. For more information, see [Best practices for performance and scaling large workloads in AKS](./best-practices-performance-scale-large.md).
 
 ### Do *not* use B series VMs
 
@@ -562,5 +562,5 @@ For more information, see [Secure your AKS clusters with Azure Policy](./use-azu
 This article focused on best practices for deployment and cluster reliability for Azure Kubernetes Service (AKS) clusters. For more best practices, see the following articles:
 
 * [High availability and disaster recovery overview for AKS](./ha-dr-overview.md)
-* [Run AKS clusters at scale](./operator-best-practices-run-at-scale.md)
+* [Run AKS clusters at scale](./best-practices-performance-scale-large.md)
 * [Baseline architecture for an AKS cluster](/azure/architecture/reference-architectures/containers/aks/baseline-aks)
diff --git a/articles/aks/best-practices-performance-scale-large.md b/articles/aks/best-practices-performance-scale-large.md
@@ -84,7 +84,13 @@ Always upgrade your Kubernetes clusters to the latest version. Newer versions co
 
 As you scale your AKS clusters to larger scale points, keep the following feature limitations in mind:
 
-* AKS supports up to a 1,000 node scale in an AKS cluster by default. While AKS doesn't prevent you from scaling further, doing so might result in degraded performance. If you want to scale beyond 1,000 nodes, you can request a limit increase. For more information, see [Best practices for creating and running AKS clusters at scale][run-aks-at-scale].
+* AKS supports scaling up to 5,000 nodes by default for all Standard Tier / LTS clusters. AKS scales your cluster's control plane at runtime based on cluster size and API server resource utilization. If you cannot scale up to the supported limit, enable [control plane metrics (Preview)](./monitor-control-plane-metrics.md) with the [Azure Monitor managed service for Prometheus](../azure-monitor/essentials/prometheus-metrics-overview.md) to monitor the control plane. To help troubleshoot scaling performance or reliability issues, see the following resources:
+  * [AKS at scale troubleshooting guide](/troubleshoot/azure/azure-kubernetes/aks-at-scale-troubleshoot-guide) 
+  * [Troubleshoot the Kubernetes control plane](/troubleshoot/azure/azure-kubernetes/troubleshoot-apiserver-etcd)
+
+> [!NOTE]
+> During the operation to scale the control plane, you might encounter elevated API server latency or timeouts for up to 15 minutes. If you continue to have problems scaling to the supported limit, open a [support ticket](https://portal.azure.com/#create/Microsoft.Support/Parameters/%7B%0D%0A%09%22subId%22%3A+%22%22%2C%0D%0A%09%22pesId%22%3A+%225a3a423f-8667-9095-1770-0a554a934512%22%2C%0D%0A%09%22supportTopicId%22%3A+%2280ea0df7-5108-8e37-2b0e-9737517f0b96%22%2C%0D%0A%09%22contextInfo%22%3A+%22AksLabelDeprecationMarch22%22%2C%0D%0A%09%22caller%22%3A+%22Microsoft_Azure_ContainerService+%2B+AksLabelDeprecationMarch22%22%2C%0D%0A%09%22severity%22%3A+%223%22%0D%0A%7D).
+
 * [Azure Network Policy Manager (Azure npm)][azure-npm] only supports up to 250 nodes.
 * You can't use the Stop and Start feature with clusters that have more than 100 nodes. For more information, see [Stop and start an AKS cluster](./start-stop-cluster.md).
 
@@ -107,8 +113,11 @@ As you scale your AKS clusters to larger scale points, keep the following node p
 * When running at-scale AKS clusters, use the cluster autoscaler whenever possible to ensure dynamic scaling of node pools based on the demand for compute resources. For more information, see [Automatically scale an AKS cluster to meet application demands][cluster-autoscaler].
 * If you're scaling beyond 1,000 nodes and are *not* using the cluster autoscaler, we recommend scaling in batches of 500-700 nodes at a time. The scaling operations should have a two-minute to five-minute wait time between scale up operations to prevent Azure API throttling. For more information, see [API management: Caching and throttling policies][throttling-policies].
 
-> [!NOTE]
-> You can't use [Azure Network Policy Manager (Azure NPM)][azure-npm] with clusters that have more than 500 nodes.
+## Cluster upgrade considerations and best practices
+
+* When a cluster reaches the 5,000 node limit, cluster upgrades are blocked. This limit prevents an upgrade because there isn't available node capacity to perform rolling updates within the max surge property limit. If you have a cluster at this limit, we recommend [scaling down the cluster](./concepts-scale.md) under 3,000 nodes before attempting a cluster upgrade. This will provide extra capacity for node churn and minimize load on the control plane.
+* When upgrading clusters with more than 500 nodes, it is recommended to use a [max surge configuration](./upgrade-aks-cluster.md#set-max-surge-value) of 10-20% of the node pool's capacity. AKS configures upgrades with a default value of 10% for max surge. You can customize the max surge settings per node pool to enable a trade-off between upgrade speed and workload disruption. When you increase the max surge settings, the upgrade process completes faster, but you might experience disruptions during the upgrade process. For more information, see [Customize node surge upgrade][max surge].
+* For more cluster upgrade information, see [Upgrade an AKS cluster][cluster upgrades].
 
 <!-- LINKS - Internal --->
 [run-aks-at-scale]: ./operator-best-practices-run-at-scale.md
@@ -118,6 +127,8 @@ As you scale your AKS clusters to larger scale points, keep the following node p
 [pricing-tiers]: ./free-standard-pricing-tiers.md
 [cluster-autoscaler]: cluster-autoscaler.md
 [azure-npm]: ../virtual-network/kubernetes-network-policies.md
+[cluster upgrades]: upgrade-cluster.md
+[max surge]: upgrade-aks-cluster.md#customize-node-surge-upgrade
 
 <!-- LINKS - External -->
 [throttling-policies]: https://azure.microsoft.com/blog/api-management-advanced-caching-and-throttling-policies/
diff --git a/articles/aks/operator-best-practices-run-at-scale.md b/articles/aks/operator-best-practices-run-at-scale.md
diff --git a/includes/container-service-limits.md b/includes/container-service-limits.md
@@ -12,7 +12,7 @@ ms.custom: include file
 | Resource | Limit |
 |--|:-|
 | Maximum clusters per subscription | 5000 <br />Note: spread clusters across different regions to account for Azure API throttling limits |
-| Maximum nodes per cluster with Virtual Machine Scale Sets and [Standard Load Balancer SKU][standard-load-balancer] | 5000 across all [node pools][node-pool] (default limit: 1000)  <br />Note: Running more than a 1000 nodes per cluster requires increasing the default node limit quota. [Contact support][Contact Support] for assistance. |
+| Maximum nodes per cluster with Virtual Machine Scale Sets and [Standard Load Balancer SKU][standard-load-balancer] | 5000 across all [node-pools][node-pool] <br />Note: If you are unable to scale up to 5000 nodes per cluster, see [Best Practices for Large Clusters](../articles/aks/best-practices-performance-scale-large.md). |
 | Maximum nodes per node pool (Virtual Machine Scale Sets node pools) | 1000 |
 | Maximum node pools per cluster | 100 |
 | Maximum pods per node: with [Kubenet][Kubenet] networking plug-in<sup>1</sup> | Maximum: 250 <br /> Azure CLI default: 110 <br /> Azure Resource Manager template default: 110 <br /> Azure portal deployment default: 30 |
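
The scaling and upgrade guidance in `best-practices-performance-scale-large.md` (batches of 500-700 nodes when scaling beyond 1,000 nodes without the cluster autoscaler, and a max surge of 10-20% for large node pools) comes down to simple arithmetic. The following is a minimal Python sketch of that arithmetic, not part of this change. The function names `plan_scale_batches` and `surge_nodes` are hypothetical, and rounding fractional surge nodes up is an assumption to verify against the AKS upgrade documentation.

```python
def plan_scale_batches(current: int, target: int, batch_size: int = 500) -> list[int]:
    """Return the node count to request at each scale-up step.

    Hypothetical helper: the article recommends batches of 500-700 nodes,
    with a two-minute to five-minute wait between scale-up operations.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if target < current:
        raise ValueError("target must be >= current (scale-up only)")
    steps = []
    count = current
    while count < target:
        # Never overshoot the final target on the last batch.
        count = min(count + batch_size, target)
        steps.append(count)
    return steps


def surge_nodes(pool_size: int, max_surge_percent: int) -> int:
    """Return the number of surge nodes for a percentage max surge.

    Assumption: fractional nodes are rounded up (integer ceiling).
    """
    return (pool_size * max_surge_percent + 99) // 100


if __name__ == "__main__":
    # Scale from 1,000 to 2,600 nodes in batches of 600.
    print(plan_scale_batches(1000, 2600, batch_size=600))
    # Surge nodes for a 500-node pool at the default 10% max surge.
    print(surge_nodes(500, 10))
```

Each value returned by `plan_scale_batches` would be one scale request, with a pause of two to five minutes before the next one to reduce the chance of Azure API throttling.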
