File: articles/reliability/reliability-aks.md
Lines changed: 26 additions & 14 deletions
@@ -6,13 +6,13 @@ ms.author: schaffererin
6
6
ms.topic: reliability-article
7
7
ms.custom: subject-reliability, references_regions #Required - use references_regions if specific regions are mentioned.
8
8
ms.service: azure-kubernetes-service
9
-
ms.date: 02/07/2025
9
+
ms.date: 02/19/2025
10
10
#Customer intent: As an engineer responsible for business continuity, I want to understand the details of how AKS works from a reliability perspective so that I can plan disaster recovery strategies in alignment with the exact processes that Azure services follow during different kinds of situations.
11
11
---
12
12
13
13
# Reliability in Azure Kubernetes Service (AKS)
14
14
15
-
This article describes how you can make your Azure Kubernetes Service (AKS) workloads more resilient. It covers topics such as cluster configuration best practices for zonal/regional resiliency and recommended Kubernetes configurations for high availability.
15
+
This article describes reliability support in [Azure Kubernetes Service (AKS)](/azure/aks/what-is-aks). It covers both intra-regional resiliency with [availability zones](#availability-zone-support) and information on [multi-region deployments](#multi-region-support).
16
16
17
17
## AKS cluster architecture
18
18
@@ -30,25 +30,27 @@ For recommendations on how to deploy production workloads in AKS, see the follow
30
30
- [High availability and disaster recovery overview for Azure Kubernetes Service (AKS)](/azure/aks/ha-dr-overview)
31
31
- [Zone resiliency considerations for Azure Kubernetes Service (AKS)](/azure/aks/aks-zone-resiliency)
32
32
33
-
## Redundancy
33
+
## Reliability architecture
34
34
35
-
To achieve redundancy in AKS, we recommend **deploying at least two replicas of your application** using [Azure Kubernetes Fleet Manager](/azure/kubernetes-fleet/overview) or an [active-active high availability solution](/azure/aks/active-active-solution). If deploying multiple clusters in multiple regions, make sure you [configure load balancing](/azure/aks/best-practices-app-cluster-reliability#standard-load-balancer) across pods.
35
+
To achieve redundancy in AKS, consider **deploying at least two replicas of your application** using [Azure Kubernetes Fleet Manager (Fleet)](/azure/kubernetes-fleet/overview) or an [active-active high availability solution](/azure/aks/active-active-solution).
36
+
37
+
If you're deploying multiple clusters in multiple regions, consider [configuring load balancing](/azure/aks/best-practices-app-cluster-reliability#standard-load-balancer) across pods. Select the method that best fits the deployment requirements and expectations of your workloads. To review load balancing options, see [Load balancing for high availability and disaster recovery in Azure Kubernetes Service (AKS)](/azure/aks/ha-dr-overview#load-balancing).
36
38
37
39
## Transient faults
38
40
39
-
To protect against transient faults, we recommend the following:
41
+
To protect against transient faults, consider using the following:
40
42
41
43
- **[Pod Disruption Budgets (PDBs)](/azure/aks/best-practices-app-cluster-reliability#pod-disruption-budgets-pdbs)**: Ensures that a minimum number of pods remain available during voluntary disruptions.
42
44
- **[`maxUnavailable`](/azure/aks/best-practices-app-cluster-reliability#maxunavailable)**: Defines the maximum number of pods that can be unavailable during a rolling update of a deployment.
43
-
- **[Pod topology spread constraints](/azure/aks/best-practices-app-cluster-reliability#pod-topology-spread-constraints)**: Ensures that pods are spread across different nodes or zones to improve availability and reliability.
45
+
- **[Pod topology spread constraints](/azure/aks/best-practices-app-cluster-reliability#pod-topology-spread-constraints)**: Ensures that pods are spread across different nodes or zones, which improves availability and reliability by avoiding a single point of failure.
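
The three settings in this list can be sketched together as Kubernetes manifests. The following Python snippet is a minimal illustration, not a recommended production configuration. The app label `web`, the replica count, the `minAvailable` value, and the placeholder container image are all assumptions made for the example. It builds a PodDisruptionBudget and a Deployment as plain dictionaries and prints them as JSON, a format that `kubectl apply -f -` accepts.

```python
import json


def reliability_manifests(app: str, replicas: int = 3) -> list[dict]:
    """Build a PDB and a Deployment for a hypothetical app label."""
    selector = {"matchLabels": {"app": app}}

    pdb = {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        # Assumed value: keep at least two pods up during voluntary disruptions.
        "spec": {"minAvailable": 2, "selector": selector},
    }

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app},
        "spec": {
            "replicas": replicas,
            "selector": selector,
            "strategy": {
                "type": "RollingUpdate",
                # At most one pod may be unavailable during a rolling update.
                "rollingUpdate": {"maxUnavailable": 1},
            },
            "template": {
                "metadata": {"labels": {"app": app}},
                "spec": {
                    # Spread pods evenly across availability zones.
                    "topologySpreadConstraints": [
                        {
                            "maxSkew": 1,
                            "topologyKey": "topology.kubernetes.io/zone",
                            "whenUnsatisfiable": "DoNotSchedule",
                            "labelSelector": selector,
                        }
                    ],
                    # Placeholder image, chosen only for illustration.
                    "containers": [{"name": app, "image": "nginx"}],
                },
            },
        },
    }
    return [pdb, deployment]


if __name__ == "__main__":
    # Wrap both objects in a List so the output is a single JSON document.
    doc = {"apiVersion": "v1", "kind": "List", "items": reliability_manifests("web")}
    print(json.dumps(doc, indent=2))
```

With three replicas, a `minAvailable` of 2 and a `maxUnavailable` of 1 are consistent with each other: a rolling update or a node drain can take down one pod at a time without violating the budget.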
44
46
45
47
<!-- Add information about AKS reaction to unforeseen downtime of nodes (e.g. if the underlying host becomes unresponsive) -->
46
48
47
49
## Availability zone support
48
50
49
51
[!INCLUDE [AZ support description](includes/reliability-availability-zone-description-include.md)]
50
52
51
-
You can configure AKS to be *zone redundant*, which means your resources are spread across multiple availability zones. Zone redundancy helps you achieve resiliency and reliability for your production workloads. You can help ensure this by using the following best practices:
53
+
You can configure AKS to be *zone redundant*, which means your resources are spread across multiple availability zones. Zone redundancy helps you achieve resiliency and reliability for your production workloads. You can help ensure this by implementing the following:
52
54
53
55
- **Run multiple replicas** to make the most of the zone redundancy. For example, if you have a three-zone cluster, run at least three replicas of your application.
54
56
- **Enable automatic scaling** through the [cluster autoscaler](/azure/aks/cluster-autoscaler) or [node autoprovisioning (NAP)](/azure/aks/node-autoprovision) to help ensure that your application can handle traffic spikes.
@@ -72,12 +74,14 @@ When using availability zones in AKS, consider the following:
72
74
- You can only define availability zones during creation of the cluster or node pool.
73
75
- It's not possible to update an existing non-availability zone cluster to use availability zones after creating the cluster.
74
76
- Clusters with availability zones enabled require using Azure Standard Load Balancer for distribution across zones. You can only define this load balancer type at cluster create time. For more information and the limitations of the standard load balancer, see [Azure load balancer standard SKU limitations](/azure/aks/load-balancer-standard#limitations).
75
-
- When implementing **availability zones with the [cluster autoscaler](/azure/aks/cluster-autoscaler-overview)**, we recommend using a single node pool for each zone. You can set the `--balance-similar-node-groups` parameter to `true` to maintain a balanced distribution of nodes across zones for your workloads during scale up operations. When this approach isn't implemented, scale down operations can disrupt the balance of nodes across zones. This configuration doesn't guarantee that similar node groups will have the same number of nodes:
76
-
- Currently, balancing happens during scale up operations only. The cluster autoscaler scales down underutilized nodes regardless of the relative sizes of the node groups.
77
-
- The cluster autoscaler adds nodes based on pending pods and the requests of the pods to calculate the number of nodes to add.
78
-
- The cluster autoscaler only balances between node groups that can support the same set of pending pods.
79
77
- You can use Azure zone-redundant storage (ZRS) disks to replicate your storage across three availability zones in the region you select. A ZRS disk lets you recover from availability zone failure without data loss. For more information, see [ZRS for managed disks](/azure/virtual-machines/disks-redundancy#zone-redundant-storage-for-managed-disks).
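
As one way to request ZRS disks from a cluster, a StorageClass can point the Azure Disk CSI driver at a ZRS SKU. The Python sketch below builds such a StorageClass as a dictionary and prints it as JSON for `kubectl apply -f -`. The class name `managed-csi-zrs`, the choice of `Premium_ZRS`, and the reclaim and binding settings are assumptions for the example, so adjust them to your workload.

```python
import json


def zrs_storage_class(name: str = "managed-csi-zrs", sku: str = "Premium_ZRS") -> dict:
    """Build a StorageClass that provisions zone-redundant managed disks."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Azure Disk CSI driver.
        "provisioner": "disk.csi.azure.com",
        # ZRS SKU; StandardSSD_ZRS is the other zone-redundant option.
        "parameters": {"skuName": sku},
        "reclaimPolicy": "Delete",
        # Delay binding until a pod that uses the volume is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


if __name__ == "__main__":
    print(json.dumps(zrs_storage_class(), indent=2))
```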
80
78
79
+
### Considerations for the cluster autoscaler
80
+
81
+
When implementing **availability zones with the [cluster autoscaler](/azure/aks/cluster-autoscaler-overview)**, consider using a single node pool for each zone. You can set the `--balance-similar-node-groups` parameter to `true` to maintain a balanced distribution of nodes across zones for your workloads during scale up operations. When this approach isn't implemented, scale down operations can disrupt the balance of nodes across zones. This configuration doesn't guarantee that similar node groups will have the same number of nodes.
82
+
83
+
Currently, balancing happens during scale up operations only. The cluster autoscaler scales down underutilized nodes regardless of the relative sizes of the node groups. The cluster autoscaler adds nodes based on pending pods and the requests of the pods to calculate the number of nodes to add. The cluster autoscaler only balances between node groups that can support the same set of pending pods.
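
The asymmetry between scale up and scale down can be illustrated with a toy model. The Python sketch below is not the cluster autoscaler's algorithm. It only mimics the two behaviors described above, under the assumption of one node pool per zone with hypothetical names: scale up adds each new node to the smallest group, and scale down removes idle nodes without looking at relative group sizes.

```python
def scale_up_balanced(groups: dict[str, int], new_nodes: int) -> dict[str, int]:
    """Add nodes one at a time to whichever node group is currently smallest."""
    result = dict(groups)
    for _ in range(new_nodes):
        # Ties are broken by name so the outcome is deterministic.
        smallest = min(result, key=lambda name: (result[name], name))
        result[smallest] += 1
    return result


def scale_down_underutilized(groups: dict[str, int], idle: dict[str, int]) -> dict[str, int]:
    """Remove idle nodes from each group, ignoring relative group sizes."""
    return {
        name: count - min(idle.get(name, 0), count)
        for name, count in groups.items()
    }


if __name__ == "__main__":
    pools = {"zone1": 2, "zone2": 2, "zone3": 2}

    # Four extra nodes can't be split evenly across three groups.
    pools = scale_up_balanced(pools, 4)
    print(pools)  # {'zone1': 4, 'zone2': 3, 'zone3': 3}

    # Two idle nodes in zone3 are removed, which skews the distribution.
    pools = scale_down_underutilized(pools, {"zone3": 2})
    print(pools)  # {'zone1': 4, 'zone2': 3, 'zone3': 1}
```

The first result shows why balancing doesn't guarantee equal group sizes, and the second shows how a scale down operation can leave one zone with noticeably fewer nodes than the others.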
84
+
81
85
### Cost
82
86
83
87
Availability zones are free to use. You only pay for the virtual machines (VMs) and other resources that you deploy in the availability zones.
@@ -88,18 +92,24 @@ Availability zones are free to use. You only pay for the virtual machines (VMs)
88
92
89
93
### Capacity planning and management
90
94
91
-
We recommend that you use the following best practices for capacity planning and management:
95
+
When planning for capacity in an AKS cluster that uses availability zones, consider using the following:
- [Single instance VM node pools](/azure/aks/virtual-machines-node-pools)
95
-
- [Go multi-region with Azure Kubernetes Fleet Manager](/azure/kubernetes-fleet-overview)
99
+
- [Go multi-region with Azure Kubernetes Fleet Manager (Fleet)](/azure/kubernetes-fleet/overview)
96
100
97
101
### Traffic routing between zones
98
102
103
+
[Configure AZ-aware networking](/azure/aks/aks-zone-resiliency#configure-az-aware-networking) to route traffic between zones.
104
+
99
105
### Data replication between zones
100
106
107
+
[Make your storage disk decision](/azure/aks/aks-zone-resiliency#make-your-storage-disk-decision) based on your workload requirements.
108
+
101
109
### Zone-down experience
102
110
111
+
AKS clusters are resilient to zone failures. If a zone fails, the cluster continues to run in the remaining zones. The cluster's control plane and nodes are spread across the zones, and the Azure platform automatically handles the distribution of nodes. For more information, see [ADD LINK](ADD_LINK).
112
+
103
113
### Failback
104
114
105
115
### Testing for zone failures
@@ -111,7 +121,9 @@ You can test for resiliency to failures using the following methods:
111
121
112
122
## Multi-region support
113
123
114
-
To provide multi-region support for your AKS workloads, you can use Azure Kubernetes Fleet Manager. For more information, see the [Azure Kubernetes Fleet Manager documentation](/azure/kubernetes-fleet-overview).
124
+
To provide multi-region support for your AKS workloads, you can use **Azure Kubernetes Fleet Manager (Fleet)**. Fleet enables you to manage a set of AKS clusters as a single unit, and those clusters can be distributed across multiple Azure regions. With Fleet, you can automate some aspects of cluster management such as cluster and node image upgrades, and you can use its traffic distribution capabilities to spread traffic across the clusters and automatically fail over if a region is unavailable.
125
+
126
+
For more information, see the [Azure Kubernetes Fleet Manager (Fleet) documentation](/azure/kubernetes-fleet/overview).