articles/reliability/reliability-aks.md
Lines changed: 21 additions & 21 deletions
@@ -18,17 +18,17 @@ Resiliency is a shared responsibility between you and Microsoft and so this arti
 
 ## Production deployment recommendations
 
-For recommendations on how to deploy reliable production workloads in AKS, see the following articles:
+For recommendations about how to deploy reliable production workloads in AKS, see the following articles:
 
 - [Deployment and cluster reliability best practices for AKS](/azure/aks/best-practices-app-cluster-reliability)
-- [High availability and disaster recovery overview for AKS](/azure/aks/ha-dr-overview)
+- [High availability (HA) and disaster recovery (DR) overview for AKS](/azure/aks/ha-dr-overview)
 - [Zone resiliency considerations for AKS](/azure/aks/aks-zone-resiliency)
 
 ## Reliability architecture overview
 
 When you create an AKS cluster, the Azure platform automatically creates and configures:
 
-- A [control plane](/azure/aks/core-aks-concepts#control-plane) with the API server, etcd, the scheduler, and other pods that are required to manage your workload.
+- A [control plane](/azure/aks/core-aks-concepts#control-plane) that has the API server, etcd, the scheduler, and other pods that are required to manage your workload.
 
 - A [system node pool](/azure/aks/use-system-pools) to your subscription that hosts your add-ons and other pods that run in the *kube-system* namespace.
 
@@ -48,26 +48,26 @@ Resiliency is a shared responsibility between you and Microsoft. As a compute se
-When you use AKS, transient faults can occur due to various reasons, including application crashes, pod scaling and balancing operations, node patching, and temporary infrastructure failures such as hardware or networking problems.
+When you use AKS, transient faults can occur because of various reasons, including application crashes, pod scaling and balancing operations, node patching, and temporary infrastructure failures such as hardware or networking problems.
 
 It's impossible to eliminate all transient faults, so clients that access your AKS-hosted applications should be prepared to retry failed requests and follow other transient fault handling recommendations. You can minimize the likelihood of transient faults and avoid or mitigate the downtime they might cause by following Kubernetes and Azure best practices in your deployment.
 
 - Set pod disruption budgets (PDBs) in your pod YAML to specify how many pods you need to have in a `Ready` state at a given time. When you set PDBs, AKS ensures a minimum availability of replicas when it runs operations to cordon and drain the nodes. If a PDB can't be satisfied during processes like upgrades, the pod continues to function and the operation might fail. For more information, see [PDBs](/azure/aks/best-practices-app-cluster-reliability#pod-disruption-budgets-pdbs).
 
 - Use `maxUnavailable` to define the maximum number of replicas that can become unavailable at a given time. For example, when you perform a rolling restart, AKS ensures that no more than the `maxUnavailable` number of pods are churned at a given time. For more information, see [maxUnavailable](/azure/aks/best-practices-app-cluster-reliability#maxunavailable).
 
-- Follow deployment best practices. Pod replicas can also fail due to application problems. For more information, see [Deployment-level best practices](/azure/aks/best-practices-app-cluster-reliability#deployment-level-best-practices) for AKS cluster reliability.
+- Follow deployment best practices. Pod replicas can also fail because of application problems. For more information, see [Deployment-level best practices](/azure/aks/best-practices-app-cluster-reliability#deployment-level-best-practices) for AKS cluster reliability.
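
The PDB and `maxUnavailable` settings described in the list above can be sketched as Kubernetes manifests built in Python. This is a minimal illustration and not part of the article: the resource name `web-pdb`, the label `app: web`, and the replica counts are assumptions chosen for the example. The output is JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def rolling_update_strategy(max_unavailable):
    """Return a Deployment `spec.strategy` fragment that caps how many
    replicas can be unavailable at once during a rolling update."""
    return {
        "type": "RollingUpdate",
        "rollingUpdate": {"maxUnavailable": max_unavailable},
    }


def pod_disruption_budget(name, app_label, min_available):
    """Return a policy/v1 PodDisruptionBudget manifest as a dict.

    `name` and `app_label` are hypothetical values for illustration.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Keep at least this many matching pods Ready during
            # voluntary disruptions such as node drains.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(pod_disruption_budget("web-pdb", "web", 2), indent=2))
    print(json.dumps(rolling_update_strategy(1), indent=2))
```

With three replicas and `minAvailable` set to 2, a node drain can evict only one matching pod at a time. A budget that can never be satisfied can block upgrades, which is the failure case the first list item mentions.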
 
 > [!NOTE]
-> If you want AKS to validate your deployments for adherence to best practices and provide blocking or warning notifications, you can use deployment safeguards (preview). Deployment safeguards are a managed offering that helps enforce product best practices before your code gets deployed to the cluster. For more information, see [Use deployment safeguards to enforce best practices in AKS (preview)](/azure/aks/deployment-safeguards).
+> If you want AKS to validate your deployments for adherence to best practices and provide blocking or warning notifications, you can use [deployment safeguards (preview)](/azure/aks/deployment-safeguards). Deployment safeguards are a managed offering that helps enforce product best practices before your code gets deployed to the cluster.
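
On the client side, the recommendation earlier in this section to retry failed requests could look roughly like the following sketch. It is not taken from the article: `TransientError`, `call_with_retries`, and the default attempt count and delays are hypothetical names and values, and a real client would map its own timeout or HTTP 5xx failures onto the retryable case.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `operation` and retry on TransientError with exponential
    backoff plus jitter. Re-raises after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Delay doubles on each attempt: 0.5s, 1s, 2s, ...
            delay = base_delay * (2 ** (attempt - 1))
            # Jitter spreads out retries from many clients.
            sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. The jitter is one common way to avoid many clients retrying at the same instant after a node is drained or a pod restarts.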
 
 ## Availability zone support
 
 [!INCLUDE [AZ support description](includes/reliability-availability-zone-description-include.md)]
 
 When you deploy an AKS cluster into a region that supports availability zones, different components require different types of configuration.
 
-The AKS control plane is zone resilient by default. If a zone fails, the control plane doesn't require any configuration or management to achieve resiliency. However, control plane resiliency isn't sufficient for your cluster to remain operational when a zone fails. For the system node pool and any user node pools that you deploy, you must enable availability zone support to ensure that your workloads are resilient to availability zone failures.
+The AKS control plane is zone resilient by default. If a zone fails, the control plane doesn't require any configuration or management to achieve resiliency. However, control plane resiliency isn't sufficient for your cluster to remain operational when a zone fails. For the system node pool and any user node pools that you deploy, you must enable availability zone support to help ensure that your workloads are resilient to availability zone failures.
 
 ### Region support
 
@@ -83,11 +83,11 @@ To enhance the reliability and resiliency of AKS production workloads in a regio
 
 - Enable automatic scaling.
 
-Kubernetes node pools provide manual and automatic scaling options. By using manual scaling, you can add or delete nodes as needed, and pending pods wait until you scale up a node pool. AKS-managed scaling uses the [cluster autoscaler](/azure/aks/cluster-autoscaler) or [node autoprovisioning (NAP)](/azure/aks/node-autoprovision). AKS scales the node pool up or down based on the pod's requirements within your subscription's SKU quota and capacity. This method ensures that your pods are scheduled on available nodes in the availability zones.
+Kubernetes node pools provide manual and automatic scaling options. By using manual scaling, you can add or delete nodes as needed, and pending pods wait until you scale up a node pool. AKS-managed scaling uses the [cluster autoscaler](/azure/aks/cluster-autoscaler) or [node autoprovisioning (NAP)](/azure/aks/node-autoprovision). AKS scales the node pool up or down based on the pod's requirements within your subscription's SKU quota and capacity. This method helps ensure that your pods are scheduled on available nodes in the availability zones.
 
 - Set pod topology constraints.
 
-Use pod topology spread constraints to control how pods are spread across different nodes or zones. Constraints help you achieve high availability and resiliency and efficient resource utilization. If you prefer to spread pods strictly across zones, you can set constraints to force a pod into a pending state to maintain the balance of pods across zones. For more information, see [Pod topology spread constraints](/azure/aks/best-practices-app-cluster-reliability#pod-topology-spread-constraints).
+Use pod topology spread constraints to control how pods are spread across different nodes or zones. Constraints help you achieve HA, resiliency, and efficient resource usage. If you prefer to spread pods strictly across zones, you can set constraints to force a pod into a pending state to maintain the balance of pods across zones. For more information, see [Pod topology spread constraints](/azure/aks/best-practices-app-cluster-reliability#pod-topology-spread-constraints).
 
 - Configure zone-resilient networking.
 
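
The strict zone spreading described in the hunk above can be sketched as a pod spec fragment built in Python. This is an illustration and not part of the article: the helper name `zone_spread_constraint` and the label `app: web` are assumptions. The field names (`maxSkew`, `topologyKey`, `whenUnsatisfiable`, `labelSelector`) and the well-known zone label are standard Kubernetes.

```python
import json


def zone_spread_constraint(app_label, max_skew=1, strict=True):
    """Return one entry for a pod spec's `topologySpreadConstraints` list.

    With strict=True, a pod that would unbalance the zones stays Pending
    (DoNotSchedule). With strict=False, the scheduler treats the
    constraint as a preference (ScheduleAnyway).
    """
    return {
        "maxSkew": max_skew,
        # Well-known node label that carries the availability zone.
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule" if strict else "ScheduleAnyway",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


if __name__ == "__main__":
    pod_spec_fragment = {"topologySpreadConstraints": [zone_spread_constraint("web")]}
    print(json.dumps(pod_spec_fragment, indent=2))
```

The strict form trades schedulability for balance: during a zone outage, pods can remain in the Pending state instead of crowding into the remaining zones, so the softer form might suit workloads that prefer capacity over even spread.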
@@ -105,23 +105,23 @@ There's no extra charge to enable availability zone support in AKS. You pay for
 
 - You can create a new AKS cluster that uses availability zones and [configure availability zone support](/azure/aks/availability-zones).
 
-- You can't enable availability zone support after you create a cluster. Instead, you need to create a new cluster with availability zone support enabled and delete the old one.
+- You can't enable availability zone support after you create a cluster. Instead, you need to create a new cluster with availability zone support enabled and delete the existing cluster.
 
-- You can't disable availability zone support after you create a cluster. Instead, you need to create a new cluster with availability zone support disabled and delete the old one.
+- You can't disable availability zone support after you create a cluster. Instead, you need to create a new cluster with availability zone support disabled and delete the existing cluster.
 
 ### Traffic routing between zones
 
 When you deploy an AKS cluster that uses availability zones, it's important to ensure that your networking components are also zone resilient. Depending on the load balancers and other networking components that you use, you might need to explicitly configure components to route traffic to the correct nodes in the correct zones and to respond to zone outages. For more information, see [Zone resiliency considerations for AKS](/azure/aks/aks-zone-resiliency).
 
 ### Data replication between zones
 
-If you run a stateless workload, you should use managed Azure services, such as [Azure databases](https://azure.microsoft.com/products/category/databases/), [Azure Cache for Redis](/azure/azure-cache-for-redis/cache-overview), or [Azure Storage](https://azure.microsoft.com/products/category/storage/) to store the application data. By using these services, you can ensure that your traffic can be moved across nodes and zones without risking data loss or affecting the user experience. You can use Kubernetes [Deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [Services](https://kubernetes.io/docs/concepts/services-networking/service/), and [Health Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) to manage stateless pods and ensure even distribution across zones.
+If you run a stateless workload, you should use managed Azure services, such as [Azure databases](https://azure.microsoft.com/products/category/databases/), [Azure Cache for Redis](/azure/azure-cache-for-redis/cache-overview), or [Azure Storage](https://azure.microsoft.com/products/category/storage/) to store the application data. You can use these services to help ensure that your traffic can be moved across nodes and zones without risking data loss or affecting the user experience. You can use Kubernetes [deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [services](https://kubernetes.io/docs/concepts/services-networking/service/), and [health probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) to manage stateless pods and ensure even distribution across zones.
 
-If you need to store state within your cluster by using Azure disks, use Azure zone-redundant storage to ensure that your data is replicated across multiple availability zones. For more information, see [Choose the right disk type based on application needs](/azure/aks/aks-zone-resiliency#make-your-storage-disk-decision).
+If you need to store state within your cluster by using Azure disks, use Azure zone-redundant storage to help ensure that your data is replicated across multiple availability zones. For more information, see [Choose the right disk type based on application needs](/azure/aks/aks-zone-resiliency#make-your-storage-disk-decision).
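
As a rough illustration of the zone-redundant disk guidance above, the following sketch builds a Kubernetes StorageClass manifest in Python. It is not part of the article and rests on assumptions: the class name `managed-zrs` is hypothetical, and the provisioner `disk.csi.azure.com` and the `skuName` value `Premium_ZRS` reflect the Azure Disk CSI driver as commonly documented, so verify them against the linked disk guidance before you rely on them.

```python
import json


def zrs_storage_class(name="managed-zrs", sku_name="Premium_ZRS"):
    """Return a storage.k8s.io/v1 StorageClass manifest as a dict.

    Assumes the Azure Disk CSI driver; `name` is a hypothetical value.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "disk.csi.azure.com",
        # Zone-redundant SKU so the disk is replicated across zones.
        "parameters": {"skuName": sku_name},
        "reclaimPolicy": "Delete",
        # Delay provisioning until a pod that uses the claim is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


if __name__ == "__main__":
    print(json.dumps(zrs_storage_class(), indent=2))
```

A PersistentVolumeClaim would then reference the class through `storageClassName`. Disk SKU availability varies by region, which is one reason to treat the SKU name here as a placeholder.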
 
 ### Zone-down experience
 
-- **Detection and response:** When a zone outage occurs, the control plane automatically fails over. If your node pools use availability zones and follow [zone resiliency best practices](#considerations), you can expect AKS to bring up nodes and replicas in the zones that are up and running. AKS does this task automatically when you use managed solutions like cluster autoscaler or NAP. Without autoscaling, nodes and replicas remain in the *Pending* state and wait for manual intervention to scale up the node pool.
+- **Detection and response:** When a zone outage occurs, the control plane automatically fails over. If your node pools use availability zones and follow [zone resiliency best practices](#considerations), you can expect AKS to bring up nodes and replicas in the zones that are operational. AKS does this task automatically when you use managed solutions like cluster autoscaler or NAP. Without autoscaling, nodes and replicas remain in the *Pending* state and wait for manual intervention to scale up the node pool.
 
 AKS also attempts to rebalance the pods across the healthy zones. If you choose to manually scale your node pool in a zone-down scenario, your pods might remain in the *Pending* state when there are no nodes available in the healthy zones. Scaling out in the remaining zones is also subject to the availability of quota and capacity for the VM SKU that you use.
 
@@ -143,7 +143,7 @@ When the availability zone recovers, failback behavior depends on the component:
 
 - **Control plane:** AKS automatically restores control plane operations across all availability zones. No manual intervention is required.
 
-- **Node pools and nodes:** Immediately after failback, nodes remain in the previously healthy zones and aren't restored into the recovered zone. However, the next time you perform a node-scaling operation, such as when you scale out your node pool, the node pool can create nodes in the recovered zone.
+- **Node pools and nodes:** Immediately after failback, nodes remain in the previously healthy zones and aren't restored in the recovered zone. However, the next time you perform a node-scaling operation, such as when you scale out your node pool, the node pool can create nodes in the recovered zone.
 
 - **Pods:** Immediately after failback, pods continue to run on the nodes that they currently run on. When new pods are created or existing pods are re-created, they're eligible to use nodes in the recovered zone.
 
@@ -168,17 +168,17 @@ If you need to deploy your Kubernetes workload to multiple Azure regions, you ha
 
 - Manage a set of AKS clusters as a single unit, and those clusters can be distributed across multiple Azure regions.
 
-- Automate certain aspects of cluster management, such as cluster and node image upgrades.
+- Automate specific aspects of cluster management, such as cluster and node image upgrades.
 
 - Use traffic distribution capabilities to spread traffic across the clusters and automatically fail over if a region is unavailable.
 
-- Orchestrate failover with a manual active-active or active-passive deployment model if your workload requires more nuanced control over the different components of interregional failovers. For more information, see [High availability and disaster recovery overview for AKS](/azure/aks/ha-dr-overview).
+- Orchestrate failover by using a manual active-active or active-passive deployment model if your workload requires more nuanced control over the different components of interregional failovers. For more information, see [HA and DR overview for AKS](/azure/aks/ha-dr-overview).
 
 ## Backups
 
-By using Azure Backup, you can use a backup extension to back up AKS cluster resources and persistent volumes that attach to the cluster. The Backup vault communicates with the AKS cluster through the extension to perform backup and restore operations.
+Azure Backup has an extension that you can use to back up AKS cluster resources and persistent volumes that attach to the cluster. The Backup vault communicates with the AKS cluster through the extension to perform backup and restore operations.
 
-If your AKS cluster is in a [region with a pair](./regions-paired.md), you can configure backups to be stored in geo-redundant storage. You can restore geo-redundant backups into the paired region.
+If your AKS cluster is in a [paired region](./regions-paired.md), you can configure backups to be stored in geo-redundant storage. You can restore geo-redundant backups into the paired region.
 
 For more information, see the following articles:
 
@@ -193,8 +193,8 @@ Strive to use stateless clusters that minimize the need for backup. Store data i
 
 The service-level agreement (SLA) for AKS describes the expected availability of the service and the conditions that must be met to achieve that availability expectation. For more information, see [SLAs for online services](https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services).
 
-AKS offers three pricing tiers for cluster management: **Free**, **Standard**, and **Premium**. For more information, see [Free, Standard, and Premium pricing tiers for AKS cluster management](/azure/aks/free-standard-pricing-tiers). The Free tier enables you to use AKS to test your workloads. The Standard and Premium tiers are designed for production workloads. When you deploy an AKS cluster that has availability zones enabled, the uptime percentage defined in the SLA increases. However, the SLA applies only if you deploy a cluster in the Standard or Premium pricing tier.
+AKS provides three [pricing tiers for cluster management](/azure/aks/free-standard-pricing-tiers): **Free**, **Standard**, and **Premium**. The Free tier enables you to use AKS to test your workloads. The Standard and Premium tiers are designed for production workloads. When you deploy an AKS cluster that has availability zones enabled, the uptime percentage defined in the SLA increases. However, the SLA applies only if you deploy a cluster in the Standard or Premium pricing tier.