articles/aks/best-practices-app-cluster-reliability.md
Lines changed: 15 additions & 28 deletions
@@ -3,7 +3,7 @@ title: Deployment and cluster reliability best practices for Azure Kubernetes Se
 titleSuffix: Azure Kubernetes Service
 description: Learn the best practices for deployment and cluster reliability for Azure Kubernetes Service (AKS) workloads.
 ms.topic: conceptual
-ms.date: 02/21/2024
+ms.date: 02/22/2024
 ---
 
 # Deployment and cluster reliability best practices for Azure Kubernetes Service (AKS)
@@ -15,7 +15,7 @@ The best practices in this article are organized into the following categories:
 | Category | Best practices |
 | -------- | -------------- |
 |[Deployment level best practices](#deployment-level-best-practices)| • [Pod Disruption Budgets (PDBs)](#pod-disruption-budgets-pdbs) <br/> • [Pod CPU and memory limits](#pod-cpu-and-memory-limits) <br/> • [Pre-stop hooks](#pre-stop-hooks) <br/> • [maxUnavailable](#maxunavailable) <br/> • [Pod anti-affinity](#pod-anti-affinity) <br/> • [Readiness and liveness probes](#readiness-and-liveness-probes) <br/> • [Multi-replica applications](#multi-replica-applications)|
-|[Cluster and node pool level best practices](#cluster-and-node-pool-level-best-practices)| • [Availability zones](#availability-zones) <br/> • [Cluster autoscaling](#cluster-autoscaling) <br/> • [Scale-down mode](#scale-down-mode) <br/> • [Standard Load Balancer](#standard-load-balancer) <br/> • [System node pools](#system-node-pools) <br/> • [Accelerated Networking](#accelerated-networking) <br/> • [Image versions](#image-versions) <br/> • [Azure CNI for dynamic IP allocation](#azure-cni-for-dynamic-ip-allocation) <br/> • [v5 SKU VMs](#v5-sku-vms) <br/> • [Do *not* use B series VMs](#do-not-use-b-series-vms) <br/> • [Premium Disks](#premium-disks) <br/> • [Container Insights](#container-insights) <br/> • [Azure Policy](#azure-policy)|
+|[Cluster and node pool level best practices](#cluster-and-node-pool-level-best-practices)| • [Availability zones](#availability-zones) <br/> • [Cluster autoscaling](#cluster-autoscaling) <br/> • [Standard Load Balancer](#standard-load-balancer) <br/> • [System node pools](#system-node-pools) <br/> • [Accelerated Networking](#accelerated-networking) <br/> • [Image versions](#image-versions) <br/> • [Azure CNI for dynamic IP allocation](#azure-cni-for-dynamic-ip-allocation) <br/> • [v5 SKU VMs](#v5-sku-vms) <br/> • [Do *not* use B series VMs](#do-not-use-b-series-vms) <br/> • [Premium Disks](#premium-disks) <br/> • [Container Insights](#container-insights) <br/> • [Azure Policy](#azure-policy)|
 
 ## Deployment level best practices
 
@@ -109,11 +109,11 @@ For more information, see [Assign CPU Resources to Containers and Pods](https://
 
 > **Best practice guidance**
 >
-> Use pre-stop hooks to ensure graceful termination during SIGTERM.
+> Use pre-stop hooks to ensure graceful termination of a container.
 
-A `PreStop` hook is called immediately before a container is terminated due to an API request or management event, such as a liveness probe failure. The pod's termination grace period countdown begins before the `PreStop` hook is executed, so the container eventually terminates within the termination grace period.
+A `PreStop` hook is called immediately before a container is terminated due to an API request or management event, such as preemption, resource contention, or a liveness/startup probe failure. A call to the `PreStop` hook fails if the container is already in a terminated or completed state, and the hook must complete before the TERM signal to stop the container is sent. The pod's termination grace period countdown begins before the `PreStop` hook is executed, so the container eventually terminates within the termination grace period.
 
-The following example pod definition file shows how to use a `PreStop` hook to ensure graceful termination during SIGTERM:
+The following example pod definition file shows how to use a `PreStop` hook to ensure graceful termination of a container:
 
 ```yaml
 apiVersion: v1
@@ -219,7 +219,11 @@ For more information, see [Affinity and anti-affinity in Kubernetes](https://kub
 > [!TIP]
 > Use pod anti-affinity across availability zones to ensure that pods are spread across availability zones for zone-down scenarios.
 >
-> When you deploy your application across multiple availability zones, you can use pod anti-affinity to ensure that pods are spread across availability zones. This practice helps ensure that your application remains available in the event of a zone-down scenario. For more information, see [Best practices for multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and [Overview of availability zones for AKS clusters](./availability-zones.md#overview-of-availability-zones-for-aks-clusters).
+> You can think of availability zones as backups for your application. If one zone goes down, your application can continue to run in another zone. You use affinity and anti-affinity rules to schedule specific pods on specific nodes. For example, if you have a memory- or CPU-intensive pod, you might want to schedule it on a larger VM SKU to give the pod the capacity it needs to run.
+>
+> When you deploy your application across multiple availability zones, you can use pod anti-affinity to ensure that pods are spread across availability zones. This practice helps ensure that your application remains available in the event of a zone-down scenario.
+>
+> For more information, see [Best practices for multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and [Overview of availability zones for AKS clusters](./availability-zones.md#overview-of-availability-zones-for-aks-clusters).
 
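
The zone spreading described in the tip above could be sketched as follows. This is a minimal illustration under stated assumptions, not the article's own example: the Deployment name, the `app` label, the replica count, and the image are hypothetical placeholders. The `podAntiAffinity` field and the `topology.kubernetes.io/zone` node label are standard Kubernetes.

```yaml
# Hypothetical Deployment: the name, labels, replica count, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-spread-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zone-spread-demo
  template:
    metadata:
      labels:
        app: zone-spread-demo
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: the scheduler tries to avoid placing two replicas in the same zone.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: zone-spread-demo
              # Well-known node label that identifies the availability zone.
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: web
        image: nginx:1.25
```

The preferred (soft) rule is used here so that pods can still be scheduled when a zone is unavailable. Using `requiredDuringSchedulingIgnoredDuringExecution` instead would make the spread a hard constraint, which can leave replicas pending when no eligible zone remains.
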
 ### Readiness and liveness probes
 
@@ -332,26 +336,6 @@ You can also enable the cluster autoscaler on an existing node pool and configur
 
 For more information, see [Use the cluster autoscaler in AKS](./cluster-autoscaler.md).
 
-### Scale-down mode
-
-> **Best practice guidance**
->
-> Use scale-down mode to control the delete and deallocate behavior of nodes in your AKS cluster upon scaling down.
-
-By default, scale up operations performed manually or by the cluster autoscaler require the allocation and provisioning of new nodes, and scale down operations delete nodes. Scale-down mode allows you to decide whether you want to delete or deallocate the nodes in your AKS clusters upon scaling down.
-
-You can use the `--scale-down-mode` parameter to set the scale-down mode to `Deallocate` or `Delete`, as shown in the following examples:
-
-```azurecli-interactive
-# Set the scale-down mode to Deallocate
-az aks nodepool add --node-count 20 --scale-down-mode Deallocate --node-osdisk-type Managed --max-pods 10 --name nodepool2 --cluster-name myAKSCluster --resource-group myResourceGroup
-For more information, see [Use scale-down mode to delete or deallocate nodes in AKS](./scale-down-mode.md).
-
 ### Standard Load Balancer
 
 > **Best practice guidance**
@@ -360,6 +344,9 @@ For more information, see [Use scale-down mode to delete or deallocate nodes in
 
 In Azure, the [Standard Load Balancer](../load-balancer/skus.md) SKU is designed to be equipped for load balancing network layer traffic when high performance and low latency are needed. The Standard Load Balancer routes traffic within and across regions and to availability zones for high resiliency. The Standard SKU is the recommended and default SKU to use when creating an AKS cluster.
 
+> [!IMPORTANT]
+> On September 30, 2025, Basic Load Balancer will be retired. For more information, see the [official announcement](https://azure.microsoft.com/updates/azure-basic-load-balancer-will-be-retired-on-30-september-2025-upgrade-to-standard-load-balancer/). We recommend that you use the Standard Load Balancer for new deployments and upgrade existing deployments to the Standard Load Balancer. For more information, see [Upgrading from Basic Load Balancer](../load-balancer/load-balancer-basic-upgrade-guidance.md).
+
 The following example shows a `LoadBalancer` service manifest that uses the Standard Load Balancer:
 
 ```yaml
@@ -437,7 +424,7 @@ AKS provides multiple auto-upgrade channels for node OS image upgrades. You can
 >
 > Use the standard tier for production workloads for greater cluster reliability and resources, support for up to 5,000 nodes in a cluster, and Uptime SLA enabled by default.
 
-The standard tier for Azure Kubernetes Service (AKS) provides a financially backed 99.9% uptime service-level agreement (SLA) for your production workloads. The standard tier also provides greater cluster reliability and resources, support for up to 5,000 nodes in a cluster, and Uptime SLA enabled by default. For more information, see [Standard pricing tier for AKS cluster management](./free-standard-pricing-tiers.md).
+The standard tier for Azure Kubernetes Service (AKS) provides a financially backed 99.9% uptime [service-level agreement (SLA)](https://www.azure.cn/en-us/support/sla/kubernetes-service/) for your production workloads. The standard tier also provides greater cluster reliability and resources, support for up to 5,000 nodes in a cluster, and Uptime SLA enabled by default. For more information, see [Standard pricing tier for AKS cluster management](./free-standard-pricing-tiers.md).
 
 ### Azure CNI for dynamic IP allocation
 
@@ -461,7 +448,7 @@ For more information, see [Configure Azure CNI networking for dynamic allocation
 >
 > Use v5 VM SKUs for improved performance during and after updates, less overall impact, and a more reliable connection for your applications.
 
-For system node pools in AKS, use v5 SKU VMs or an equivalent core/memory VM SKU with ephemeral OS disks to provide sufficient compute resources for kube-system pods. For more information, see [Best practices for creating and running AKS clusters at scale](./operator-best-practices-run-at-scale.md) and [Best practices for performance and scaling for large workloads in AKS](./best-practices-performance-scale-large.md).
+For node pools in AKS, use v5 SKU VMs with ephemeral OS disks to provide sufficient compute resources for kube-system pods. For more information, see [Best practices for creating and running AKS clusters at scale](./operator-best-practices-run-at-scale.md) and [Best practices for performance and scaling for large workloads in AKS](./best-practices-performance-scale-large.md).