Commit 52bdcec

Merge branch 'main' of https://github.com/MicrosoftDocs/azure-docs-pr into nat-freshness

2 parents 35c1034 + 74e91b5

9 files changed

+498
-179
lines changed

.openpublishing.redirection.json

Lines changed: 5 additions & 0 deletions
```diff
@@ -14407,6 +14407,11 @@
     "source_path_from_root": "/articles/automation/dsc-linux-powershell.md",
     "redirect_url": "/azure/automation/automation-dsc-overview"
   },
+  {
+    "source_path_from_root": "/articles/aks/operator-best-practices-multi-region.md",
+    "redirect_url": "/azure/aks/ha-dr-overview",
+    "redirect_document_id": false
+  },
   {
     "source_path_from_root": "/articles/virtual-machines/extensions/dsc-linux.md",
     "redirect_url": "/azure/virtual-machines/extensions/dsc-overview"
```

articles/aks/TOC.yml

Lines changed: 14 additions & 2 deletions
```diff
@@ -158,6 +158,20 @@
     href: best-practices.md
   - name: Baseline architecture for an AKS cluster
     href: /azure/architecture/reference-architectures/containers/aks/secure-baseline-aks?toc=/azure/aks/toc.json&bc=/azure/aks/breadcrumb/toc.json
+  - name: High availability disaster recovery
+    items:
+    - name: Overview
+      href: ha-dr-overview.md
+      required: yes
+      limit: 1
+    - name: Solutions
+      items:
+      - name: Active-active
+        href: active-active-solution.md
+      - name: Active-passive
+        href: active-passive-solution.md
+      - name: Passive-cold
+        href: passive-cold-solution.md
   - name: Security
     items:
     - name: Authentication and authorization
@@ -180,8 +194,6 @@
     href: operator-best-practices-network.md
   - name: Storage
     href: operator-best-practices-storage.md
-  - name: Business continuity (BC) and disaster recovery (DR)
-    href: operator-best-practices-multi-region.md
   - name: Performance and scaling
     items:
     - name: For small to medium workloads
```
Lines changed: 84 additions & 0 deletions
---
title: Recommended active-active high availability solution overview for Azure Kubernetes Service (AKS)
description: Learn about the recommended active-active high availability solution for Azure Kubernetes Service (AKS).
author: schaffererin
ms.author: schaffererin
ms.service: azure-kubernetes-service
ms.topic: concept-article
ms.date: 01/30/2024
---

# Recommended active-active high availability solution overview for Azure Kubernetes Service (AKS)

When you create an application in Azure Kubernetes Service (AKS) and choose an Azure region during resource creation, it's a single-region app. If a disaster makes that region unavailable, your application becomes unavailable too. Creating an identical deployment in a secondary Azure region makes your application less susceptible to a single-region disaster, which helps ensure business continuity, and any data replication across the regions lets you recover your last application state.
While multiple patterns can provide recoverability for an AKS solution, this guide outlines the recommended active-active high availability solution for AKS. In this solution, we deploy two independent and identical AKS clusters into two paired Azure regions, with both clusters actively serving traffic.

> [!NOTE]
> The following use case can be considered standard practice within AKS. It has been reviewed internally and vetted in conjunction with our Microsoft partners.

## Active-active high availability solution overview

This solution relies on two identical AKS clusters configured to actively serve traffic. You place a global traffic manager, such as [Azure Front Door](../frontdoor/front-door-overview.md), in front of the two clusters to distribute traffic across them. You must configure the clusters consistently so that each one hosts an instance of every application the solution requires to function.
Availability zones are another way to ensure high availability and fault tolerance for your AKS cluster within a single region. Availability zones let you distribute your cluster nodes across multiple isolated locations within an Azure region. That way, if one zone goes down due to a power outage, hardware failure, or network issue, your cluster can continue to run and serve your applications. Availability zones also improve the performance and scalability of your cluster by reducing latency and contention among nodes. To set up availability zones for your AKS cluster, specify the zone numbers when creating or updating your node pools. For more information, see [What are Azure availability zones?](../reliability/availability-zones-overview.md)

> [!NOTE]
> Many regions support availability zones. Consider using regions with availability zones to provide more resiliency and availability for your workloads. For more information, see [Recover from a region-wide service disruption](/azure/architecture/resiliency/recovery-loss-azure-region).
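The two-region, zone-spread setup described above can be sketched with the Azure CLI. This is a minimal illustration only, not part of the reference implementation; the resource group, cluster names, node counts, and regions are hypothetical:

```azurecli
# Create the primary cluster with nodes spread across three availability zones
# (names and regions are placeholders)
az aks create \
    --resource-group myResourceGroup \
    --name aks-primary \
    --location eastus2 \
    --node-count 3 \
    --zones 1 2 3

# Create an identical secondary cluster in the paired region
az aks create \
    --resource-group myResourceGroup \
    --name aks-secondary \
    --location centralus \
    --node-count 3 \
    --zones 1 2 3
```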
## Scenarios and configurations

This solution is best implemented when hosting stateless applications, or when any supporting technologies, such as horizontal scaling, are also deployed across both regions. In scenarios where the hosted application relies on resources, such as databases, that are active in only one region, we recommend implementing an [active-passive solution](./active-passive-solution.md) instead for potential cost savings, because active-passive accepts more downtime than active-active.

## Components

The active-active high availability solution uses many Azure services. This section covers only the components unique to this multi-cluster architecture. For more information on the remaining components, see the [AKS baseline architecture](/azure/architecture/reference-architectures/containers/aks/baseline-aks?toc=%2Fazure%2Faks%2Ftoc.json&bc=%2Fazure%2Faks%2Fbreadcrumb%2Ftoc.json).

**Multiple clusters and regions**: You deploy multiple AKS clusters, each in a separate Azure region. During normal operations, your Azure Front Door configuration routes network traffic between all regions. If one region becomes unavailable, traffic routes to the region with the fastest load time for the user.

**Hub-spoke network per region**: A regional hub-spoke network pair is deployed for each regional AKS instance. [Azure Firewall Manager](../firewall-manager/overview.md) policies manage the firewall policies across all regions.

**Regional key store**: You provision [Azure Key Vault](../key-vault/general/overview.md) in each region to store sensitive values and keys specific to the AKS instance and the supporting services found in that region.

**Azure Front Door**: [Azure Front Door](../frontdoor/front-door-overview.md) load balances and routes traffic to a regional [Azure Application Gateway](../application-gateway/overview.md) instance, which sits in front of each AKS cluster. Azure Front Door allows for *layer 7* global routing.

**Log Analytics**: Regional [Log Analytics](../azure-monitor/logs/log-analytics-overview.md) instances store regional networking metrics and diagnostic logs. A shared instance stores metrics and diagnostic logs for all AKS instances.

**Container Registry**: The container images for the workload are stored in a managed container registry. With this solution, a single [Azure Container Registry](../container-registry/container-registry-intro.md) instance serves all Kubernetes clusters in the solution. Geo-replication for Azure Container Registry lets you replicate images to the selected Azure regions and provides continued access to images even if a region experiences an outage.
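Geo-replication as described above can be enabled with one CLI call per additional region. A minimal sketch; the registry name and region are placeholders:

```azurecli
# Replicate an existing registry to the secondary region (names are examples)
az acr replication create \
    --registry myContainerRegistry \
    --location centralus
```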
## Failover process

If a service or service component becomes unavailable in one region, traffic should be routed to a region where that service is available. A multi-region architecture includes many different failure points. In this section, we cover the potential failure points.

### Application Pods (Regional)

A Kubernetes deployment object creates multiple replicas of a pod (*ReplicaSet*). If one replica is unavailable, traffic is routed to the remaining replicas. The Kubernetes *ReplicaSet* attempts to keep the specified number of replicas up and running; if one instance goes down, a new instance is created. [Liveness probes](../container-instances/container-instances-liveness-probe.md) can check the state of the application or process running in the pod. If the probe finds the pod unresponsive, the failing container is restarted, returning the pod to a healthy state.

For more information, see [Kubernetes ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/).
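The regional resiliency described above can be sketched as a Deployment manifest. This is an illustrative example rather than part of the reference implementation; the workload name, image, and probe endpoint are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: store-front          # hypothetical workload name
spec:
  replicas: 3                # the ReplicaSet keeps three pods running
  selector:
    matchLabels:
      app: store-front
  template:
    metadata:
      labels:
        app: store-front
    spec:
      containers:
      - name: store-front
        image: myregistry.azurecr.io/store-front:v1   # placeholder image
        ports:
        - containerPort: 8080
        livenessProbe:       # restart the container if the app stops responding
          httpGet:
            path: /healthz   # placeholder health endpoint
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 15
```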
### Application Pods (Global)

When an entire region becomes unavailable, the pods in that region's cluster are no longer available to serve requests. In this case, the Azure Front Door instance routes all traffic to the remaining healthy regions. The Kubernetes clusters and pods in these regions continue to serve requests. To compensate for increased traffic and requests to the remaining cluster, keep in mind the following guidance:

- Make sure network and compute resources are right sized to absorb any sudden increase in traffic due to region failover. For example, when using Azure Container Networking Interface (CNI), make sure you have a subnet that can support all pod IPs under a spiked traffic load.
- Use the [Horizontal Pod Autoscaler](./concepts-scale.md#horizontal-pod-autoscaler) to increase the pod replica count to compensate for the increased regional demand.
- Use the AKS [Cluster Autoscaler](./cluster-autoscaler.md) to increase the Kubernetes instance node counts to compensate for the increased regional demand.
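The pod-level autoscaling guidance above can be sketched as a HorizontalPodAutoscaler manifest, assuming a Deployment named `store-front` (a placeholder); the replica bounds and CPU threshold are examples, not recommendations:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: store-front-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: store-front        # placeholder Deployment name
  minReplicas: 3
  maxReplicas: 20            # headroom to absorb traffic failed over from another region
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```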
### Kubernetes node pools (Regional)

Occasionally, localized failure can occur to compute resources, such as power becoming unavailable in a single rack of Azure servers. To protect your AKS nodes from becoming a single regional point of failure, use [Azure Availability Zones](./availability-zones.md). Availability zones ensure that AKS nodes in each availability zone are physically separated from those defined in other availability zones.

### Kubernetes node pools (Global)

In a complete regional failure, Azure Front Door routes traffic to the remaining healthy regions. Again, make sure to compensate for increased traffic and requests to the remaining cluster.

## Failover testing strategy

While there are no mechanisms currently available within AKS to take down an entire region of deployment for testing purposes, [Azure Chaos Studio](../chaos-studio/chaos-studio-overview.md) offers the ability to create a chaos experiment on your cluster.

## Next steps

If you're considering a different solution, see the following articles:

- [Active-passive disaster recovery solution overview for Azure Kubernetes Service (AKS)](./active-passive-solution.md)
- [Passive-cold solution overview for Azure Kubernetes Service (AKS)](./passive-cold-solution.md)
Lines changed: 83 additions & 0 deletions
---
title: Recommended active-passive disaster recovery solution overview for Azure Kubernetes Service (AKS)
description: Learn about an active-passive disaster recovery solution for Azure Kubernetes Service (AKS).
author: schaffererin
ms.author: schaffererin
ms.service: azure-kubernetes-service
ms.topic: concept-article
ms.date: 01/30/2024
---

# Active-passive disaster recovery solution overview for Azure Kubernetes Service (AKS)

When you create an application in Azure Kubernetes Service (AKS) and choose an Azure region during resource creation, it's a single-region app. If a disaster makes the region unavailable, your application becomes unavailable too. Creating an identical deployment in a secondary Azure region makes your application less susceptible to a single-region disaster, which helps ensure business continuity, and any data replication across the regions lets you recover your last application state.

This guide outlines an active-passive disaster recovery solution for AKS. In this solution, we deploy two independent and identical AKS clusters into two paired Azure regions, with only one cluster actively serving traffic.

> [!NOTE]
> The following practice has been reviewed internally and vetted in conjunction with our Microsoft partners.

## Active-passive solution overview

In this disaster recovery approach, two independent AKS clusters are deployed in two Azure regions, but only one of the clusters actively serves traffic at any one time. The secondary (passive) cluster contains the same configuration and application data as the primary cluster but doesn't accept any traffic unless Azure Front Door directs traffic to it.
## Scenarios and configurations

This solution is best implemented when hosting applications that rely on resources, such as databases, which actively serve traffic from a single region. In scenarios where you need to host stateless applications deployed across both regions with technologies such as horizontal scaling, we recommend considering an [active-active solution](./active-active-solution.md) instead, because active-passive involves added failover downtime.

## Components

The active-passive disaster recovery solution uses many Azure services. This example architecture involves the following components:

**Multiple clusters and regions**: You deploy multiple AKS clusters, each in a separate Azure region. During normal operations, network traffic is routed to the primary AKS cluster set in the Azure Front Door configuration.

**Configured cluster prioritization**: You set a priority level from 1 (highest) to 5 (lowest) for each cluster. You can set multiple clusters to the same priority level and specify the weight for each cluster. If the primary cluster becomes unavailable, traffic automatically routes to the next region selected in Azure Front Door. All traffic must go through Azure Front Door for this system to work.

**Azure Front Door**: [Azure Front Door](../frontdoor/front-door-overview.md) load balances and routes traffic to the [Azure Application Gateway](../application-gateway/overview.md) instance in the primary region (the cluster must be marked with priority 1). In the event of a region failure, the service redirects traffic to the next cluster in the priority list.

For more information, see [Priority-based traffic-routing](../frontdoor/routing-methods.md#priority-based-traffic-routing).
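Priority-based routing can be sketched with the Azure Front Door Standard/Premium CLI. This is a hypothetical illustration under assumed names; the profile, origin group, and host names are placeholders, and each host would point at the regional Application Gateway:

```azurecli
# Primary origin: receives all traffic while healthy (priority 1)
az afd origin create \
    --resource-group myResourceGroup \
    --profile-name myFrontDoorProfile \
    --origin-group-name aks-origins \
    --origin-name aks-primary \
    --host-name primary.contoso.example \
    --priority 1 \
    --weight 1000 \
    --enabled-state Enabled \
    --http-port 80 \
    --https-port 443

# Secondary origin: receives traffic only if the primary is unhealthy (priority 2)
az afd origin create \
    --resource-group myResourceGroup \
    --profile-name myFrontDoorProfile \
    --origin-group-name aks-origins \
    --origin-name aks-secondary \
    --host-name secondary.contoso.example \
    --priority 2 \
    --weight 1000 \
    --enabled-state Enabled \
    --http-port 80 \
    --https-port 443
```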
**Hub-spoke pair**: A hub-spoke pair is deployed for each regional AKS instance. [Azure Firewall Manager](../firewall-manager/overview.md) policies manage the firewall rules across each region.

**Key Vault**: You provision an [Azure Key Vault](../key-vault/general/overview.md) in each region to store secrets and keys.

**Log Analytics**: Regional [Log Analytics](../azure-monitor/logs/log-analytics-overview.md) instances store regional networking metrics and diagnostic logs. A shared instance stores metrics and diagnostic logs for all AKS instances.

**Container Registry**: The container images for the workload are stored in a managed container registry. With this solution, a single [Azure Container Registry](../container-registry/container-registry-intro.md) instance serves all Kubernetes clusters in the solution. Geo-replication for Azure Container Registry lets you replicate images to the selected Azure regions and provides continued access to images even if a region experiences an outage.

## Failover process

If a service or service component becomes unavailable in one region, traffic should be routed to a region where that service is available. A multi-region architecture includes many different failure points. In this section, we cover the potential failure points.

### Application Pods (Regional)

A Kubernetes deployment object creates multiple replicas of a pod (*ReplicaSet*). If one replica is unavailable, traffic is routed to the remaining replicas. The Kubernetes *ReplicaSet* attempts to keep the specified number of replicas up and running; if one instance goes down, a new instance is created. [Liveness probes](../container-instances/container-instances-liveness-probe.md) can check the state of the application or process running in the pod. If the probe finds the pod unresponsive, the failing container is restarted, returning the pod to a healthy state.

For more information, see [Kubernetes ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/).

### Application Pods (Global)

When an entire region becomes unavailable, the pods in that region's cluster are no longer available to serve requests. In this case, the Azure Front Door instance routes all traffic to the remaining healthy regions. The Kubernetes clusters and pods in these regions continue to serve requests. To compensate for increased traffic and requests to the remaining cluster, keep in mind the following guidance:

- Make sure network and compute resources are right sized to absorb any sudden increase in traffic due to region failover. For example, when using Azure Container Networking Interface (CNI), make sure you have a subnet that can support all pod IPs under a spiked traffic load.
- Use the [Horizontal Pod Autoscaler](./concepts-scale.md#horizontal-pod-autoscaler) to increase the pod replica count to compensate for the increased regional demand.
- Use the AKS [Cluster Autoscaler](./cluster-autoscaler.md) to increase the Kubernetes instance node counts to compensate for the increased regional demand.

### Kubernetes node pools (Regional)

Occasionally, localized failure can occur to compute resources, such as power becoming unavailable in a single rack of Azure servers. To protect your AKS nodes from becoming a single regional point of failure, use [Azure Availability Zones](./availability-zones.md). Availability zones ensure that AKS nodes in each availability zone are physically separated from those defined in other availability zones.

### Kubernetes node pools (Global)

In a complete regional failure, Azure Front Door routes traffic to the remaining healthy regions. Again, make sure to compensate for increased traffic and requests to the remaining cluster.

## Failover testing strategy

While there are no mechanisms currently available within AKS to take down an entire region of deployment for testing purposes, [Azure Chaos Studio](../chaos-studio/chaos-studio-overview.md) offers the ability to create a chaos experiment on your cluster.

## Next steps

If you're considering a different solution, see the following articles:

- [Active-active high availability solution overview for Azure Kubernetes Service (AKS)](./active-active-solution.md)
- [Passive-cold solution overview for Azure Kubernetes Service (AKS)](./passive-cold-solution.md)
