Commit 170f050

committed
Updates to doc
1 parent 20b9975 commit 170f050

1 file changed

+26
-14
lines changed

articles/reliability/reliability-aks.md

Lines changed: 26 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -6,13 +6,13 @@ ms.author: schaffererin
66
ms.topic: reliability-article
77
ms.custom: subject-reliability, references_regions #Required - use references_regions if specific regions are mentioned.
88
ms.service: azure-kubernetes-service
9-
ms.date: 02/07/2025
9+
ms.date: 02/19/2025
1010
#Customer intent: As an engineer responsible for business continuity, I want to understand the details of how AKS works from a reliability perspective so that I can plan disaster recovery strategies in alignment with the exact processes that Azure services follow during different kinds of situations.
1111
---
1212

1313
# Reliability in Azure Kubernetes Service (AKS)
1414

15-
This article describes how you can make your Azure Kubernetes Service (AKS) workloads more resilient. It covers topics such as cluster configuration best practices for zonal/regional resiliency and recommended Kubernetes configurations for high availability.
15+
This article describes reliability support in [Azure Kubernetes Service (AKS)](../app-service/overview.md). It covers both intra-regional resiliency with [availability zones](#availability-zone-support) and information on [multi-region deployments](#multi-region-support).
1616

1717
## AKS cluster architecture
1818

@@ -30,25 +30,27 @@ For recommendations on how to deploy production workloads in AKS, see the follow
3030
- [High availability and disaster recovery overview for Azure Kubernetes Service (AKS)](/azure/aks/ha-dr-overview)
3131
- [Zone resiliency considerations for Azure Kubernetes Service (AKS)](/azure/aks/aks-zone-resiliency)
3232

33-
## Redundancy
33+
## Reliability architecture
3434

35-
To achieve redundancy in AKS, we recommend **deploying at least two replicas of your application** using [Azure Kubernetes Fleet Manager](/azure/kubernetes-fleet/overview) or an [active-active high availability solution](/azure/aks/active-active-solution). If deploying multiple clusters in multiple regions, make sure you [configure load balancing](/azure/aks/best-practices-app-cluster-reliability#standard-load-balancer) across pods.
35+
To achieve redundancy in AKS, consider **deploying at least two replicas of your application** using [Azure Kubernetes Fleet Manager (Fleet)](/azure/kubernetes-fleet/overview) or an [active-active high availability solution](/azure/aks/active-active-solution).
36+
37+
If you're deploying multiple clusters in multiple regions, consider [configuring load balancing](/azure/aks/best-practices-app-cluster-reliability#standard-load-balancer) across pods. Consider your deployment requirements and expectations when selecting the appropriate method for your workloads. To review load balancing options, see [Load balancing for high availability and disaster recovery in Azure Kubernetes Service (AKS)](/azure/aks/ha-dr-overview#load-balancing).
3638

3739
## Transient faults
3840

39-
To protect against transient faults, we recommend the following:
41+
To protect against transient faults, consider using the following:
4042

4143
- **[Pod Disruption Budgets (PDBs)](/azure/aks/best-practices-app-cluster-reliability#pod-disruption-budgets-pdbs)**: Ensures that a minimum number of pods remain available during voluntary disruptions.
4244
- **[`maxUnavailable`](/azure/aks/best-practices-app-cluster-reliability#maxunavailable)**: Defines the maximum number of pods that can be unavailable during a rolling update of a deployment.
43-
- **[Pod topology spread constraints](/azure/aks/best-practices-app-cluster-reliability#pod-topology-spread-constraints)**: Ensures that pods are spread across different nodes or zones to improve availability and reliability.
45+
- **[Pod topology spread constraints](/azure/aks/best-practices-app-cluster-reliability#pod-topology-spread-constraints)**: Ensures that pods are spread across different nodes or zones to improve availability and reliability by removing the dependency on a single point of failure.
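The three settings above can be combined in one workload definition. The following Python sketch builds a Deployment and a matching PodDisruptionBudget as plain dictionaries and prints them as JSON, which `kubectl apply -f -` accepts. The app name `my-app`, the image reference, the replica count, and the `minAvailable` value are illustrative assumptions, not values from this article. Adjust them to your workload.

```python
import json

# Hypothetical label and names, used only for illustration.
APP_LABELS = {"app": "my-app"}


def build_deployment(replicas: int = 3) -> dict:
    """Deployment with a rolling-update limit and a zone spread constraint."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "my-app"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": APP_LABELS},
            "strategy": {
                "type": "RollingUpdate",
                # At most one pod can be unavailable during a rolling update.
                "rollingUpdate": {"maxUnavailable": 1},
            },
            "template": {
                "metadata": {"labels": APP_LABELS},
                "spec": {
                    "topologySpreadConstraints": [
                        {
                            # Keep per-zone pod counts within one of each other.
                            "maxSkew": 1,
                            "topologyKey": "topology.kubernetes.io/zone",
                            "whenUnsatisfiable": "DoNotSchedule",
                            "labelSelector": {"matchLabels": APP_LABELS},
                        }
                    ],
                    "containers": [
                        # Placeholder image reference (assumption).
                        {"name": "my-app", "image": "example.azurecr.io/my-app:1.0"}
                    ],
                },
            },
        },
    }


def build_pdb(min_available: int = 2) -> dict:
    """PodDisruptionBudget that keeps a minimum number of pods available
    during voluntary disruptions such as node drains."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": "my-app-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": APP_LABELS},
        },
    }


if __name__ == "__main__":
    manifest = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [build_deployment(), build_pdb()],
    }
    print(json.dumps(manifest, indent=2))
```

If you save the script under a name such as `manifests.py` (hypothetical), you could apply the output with `python manifests.py | kubectl apply -f -`. Keep `minAvailable` lower than the replica count so that node drains can still proceed.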
4446

4547
<!-- Add information about AKS reaction to unforeseen downtime of nodes (e.g. if the underlying host becomes unresponsive) -->
4648

4749
## Availability zone support
4850

4951
[!INCLUDE [AZ support description](includes/reliability-availability-zone-description-include.md)]
5052

51-
You can configure AKS to be *zone redundant*, which means your resources are spread across multiple availability zones. Zone redundancy helps you achieve resiliency and reliability for your production workloads. You can help ensure this by using the following best practices:
53+
You can configure AKS to be *zone redundant*, which means your resources are spread across multiple availability zones. Zone redundancy helps you achieve resiliency and reliability for your production workloads. You can help ensure this by implementing the following:
5254

5355
- **Run multiple replicas** to make the most of the zone redundancy. For example, if you have a three-zone cluster, run at least three replicas of your application.
5456
- **Enable automatic scaling** through the [cluster autoscaler](/azure/aks/cluster-autoscaler) or [node autoprovisioning (NAP)](/azure/aks/node-autoprovision) to help ensure that your application can handle traffic spikes.
@@ -72,12 +74,14 @@ When using availability zones in AKS, consider the following:
7274
- You can only define availability zones during creation of the cluster or node pool.
7375
- It's not possible to update an existing non-availability zone cluster to use availability zones after creating the cluster.
7476
- Clusters with availability zones enabled require using Azure Standard Load Balancer for distribution across zones. You can only define this load balancer type at cluster create time. For more information and the limitations of the standard load balancer, see [Azure load balancer standard SKU limitations](/azure/aks/load-balancer-standard#limitaitons).
75-
- When implementing **availability zones with the [cluster autoscaler](/azure/aks/cluster-autoscaler-overview)**, we recommend using a single node pool for each zone. You can set the `--balance-similar-node-groups` parameter to `true` to maintain a balanced distribution of nodes across zones for your workloads during scale up operations. When this approach isn't implemented, scale down operations can disrupt the balance of nodes across zones. This configuration doesn't guarantee that similar node groups will have the same number of nodes:
76-
- Currently, balancing happens during scale up operations only. The cluster autoscaler scales down underutilized nodes regardless of the relative sizes of the node groups.
77-
- The cluster autoscaler adds nodes based on pending pods and the requests of the pods to calculate the number of nodes to add.
78-
- The cluster autoscaler only balances between node groups that can support the same set of pending pods.
7977
- You can use Azure zone-redundant storage (ZRS) disks to replicate your storage across three availability zones in the region you select. A ZRS disk lets you recover from availability zone failure without data loss. For more information, see [ZRS for managed disks](/azure/virtual-machines/disks-redundancy#zone-redundant-storage-for-managed-disks).
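As an illustration of the ZRS disk option, the following Python sketch builds a Kubernetes StorageClass that requests a zone-redundant managed disk and prints it as JSON. It assumes the Azure Disk CSI driver (`disk.csi.azure.com`) and its `skuName` parameter. The class name `managed-csi-zrs` and the choice of `Premium_ZRS` are illustrative. Confirm supported SKUs and regions in the linked ZRS documentation before you use it.

```python
import json


def build_zrs_storage_class(name: str = "managed-csi-zrs",
                            sku: str = "Premium_ZRS") -> dict:
    """StorageClass for zone-redundant managed disks (illustrative values)."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Assumes the Azure Disk CSI driver is enabled on the cluster.
        "provisioner": "disk.csi.azure.com",
        "parameters": {"skuName": sku},
        "reclaimPolicy": "Delete",
        # Delay volume binding until a pod is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


if __name__ == "__main__":
    print(json.dumps(build_zrs_storage_class(), indent=2))
```

A PersistentVolumeClaim can then reference the class by name through `storageClassName`.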
8078

79+
### Considerations for the cluster autoscaler
80+
81+
When implementing **availability zones with the [cluster autoscaler](/azure/aks/cluster-autoscaler-overview)**, consider using a single node pool for each zone. You can set the `--balance-similar-node-groups` parameter to `true` to maintain a balanced distribution of nodes across zones for your workloads during scale up operations. When this approach isn't implemented, scale down operations can disrupt the balance of nodes across zones. This configuration doesn't guarantee that similar node groups will have the same number of nodes.
82+
83+
Currently, balancing happens during scale up operations only. The cluster autoscaler scales down underutilized nodes regardless of the relative sizes of the node groups. The cluster autoscaler adds nodes based on pending pods and the requests of the pods to calculate the number of nodes to add. The cluster autoscaler only balances between node groups that can support the same set of pending pods.
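To make the difference between scale-up and scale-down behavior concrete, the following Python sketch models it in a simplified way. It isn't the cluster autoscaler's actual algorithm. The function names, the zone names, and the "add each new node to the smallest group" rule are assumptions chosen to illustrate how balancing applies on scale up but not on scale down.

```python
def scale_up_balanced(group_sizes: dict[str, int], nodes_to_add: int) -> dict[str, int]:
    """Add nodes one at a time to the currently smallest node group.

    Simplified stand-in for balancing similar node groups on scale up.
    """
    sizes = dict(group_sizes)
    for _ in range(nodes_to_add):
        # Ties are broken by group name so the result is deterministic.
        smallest = min(sizes, key=lambda name: (sizes[name], name))
        sizes[smallest] += 1
    return sizes


def scale_down_underutilized(group_sizes: dict[str, int],
                             underutilized: dict[str, int]) -> dict[str, int]:
    """Remove underutilized nodes regardless of the resulting balance."""
    sizes = dict(group_sizes)
    for name, count in underutilized.items():
        sizes[name] = max(0, sizes[name] - count)
    return sizes


if __name__ == "__main__":
    # One node pool per zone (hypothetical names and sizes).
    pools = {"zone1": 3, "zone2": 1, "zone3": 2}
    after_up = scale_up_balanced(pools, nodes_to_add=3)
    print(after_up)  # {'zone1': 3, 'zone2': 3, 'zone3': 3}

    # Two idle nodes in zone1 are removed, which leaves the zones unbalanced.
    after_down = scale_down_underutilized(after_up, {"zone1": 2})
    print(after_down)  # {'zone1': 1, 'zone2': 3, 'zone3': 3}
```

In this model, scale up evens out the pools, while scale down leaves `zone1` with fewer nodes than the other zones.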
84+
8185
### Cost
8286

8387
Availability zones are free to use. You only pay for the virtual machines (VMs) and other resources that you deploy in the availability zones.
@@ -88,18 +92,24 @@ Availability zones are free to use. You only pay for the virtual machines (VMs)
8892

8993
### Capacity planning and management
9094

91-
We recommend that you use the following best practices for capacity planning and management:
95+
When planning for capacity in an AKS cluster that uses availability zones, consider using the following:
9296

9397
- [Node autoprovisioning (NAP)](/azure/aks/node-autoprovision)
9498
- [Single instance VM node pools](/azure/aks/virtual-machines-node-pools)
95-
- [Go multi-region with Azure Kubernetes Fleet Manager](/azure/kubernetes-fleet-overview)
99+
- [Go multi-region with Azure Kubernetes Fleet Manager (Fleet)](/azure/kubernetes-fleet-overview)
96100

97101
### Traffic routing between zones
98102

103+
[Configure AZ-aware networking](/azure/aks/aks-zone-resiliency#configure-az-aware-networking) to route traffic between zones.
104+
99105
### Data replication between zones
100106

107+
[Make your storage disk decision](/azure/aks/aks-zone-resiliency#make-your-storage-disk-decision) based on your workload requirements.
108+
101109
### Zone-down experience
102110

111+
AKS clusters are resilient to zone failures. If a zone fails, the cluster continues to run in the remaining zones. The cluster's control plane and nodes are spread across the zones, and the Azure platform automatically handles the distribution of nodes. For more information, see [ADD LINK](ADD_LINK).
112+
103113
### Failback
104114

105115
### Testing for zone failures
@@ -111,7 +121,9 @@ You can test for resiliency to failures using the following methods:
111121

112122
## Multi-region support
113123

114-
To provide multi-region support for your AKS workloads, you can use Azure Kubernetes Fleet Manager. For more information, see the [Azure Kubernetes Fleet Manager documentation](/azure/kubernetes-fleet-overview).
124+
To provide multi-region support for your AKS workloads, you can use **Azure Kubernetes Fleet Manager (Fleet)**. Fleet enables you to manage a set of AKS clusters as a single unit, and those clusters can be distributed across multiple Azure regions. With Fleet, you can automate some aspects of cluster management such as cluster and node image upgrades, and you can use its traffic distribution capabilities to spread traffic across the clusters and automatically fail over if a region is unavailable.
125+
126+
For more information, see the [Azure Kubernetes Fleet Manager (Fleet) documentation](/azure/kubernetes-fleet-overview).
115127

116128
## Backups
117129
