---
title: Deployment and cluster reliability best practices for Azure Kubernetes Service (AKS)
titleSuffix: Azure Kubernetes Service
description: Learn the best practices for deployment and cluster reliability for Azure Kubernetes Service (AKS) workloads.
ms.topic: conceptual
ms.date: 01/31/2024
---

# Deployment and cluster reliability best practices for Azure Kubernetes Service (AKS)

## Deployment level best practices

### Pod Disruption Budgets (PDBs)

> **Best practice guidance**
>
> Use Pod Disruption Budgets (PDBs) to ensure that a minimum number of pods remain available during *voluntary disruptions*, such as upgrade operations or accidental pod deletions.

[Pod Disruption Budgets (PDBs)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) allow you to define how deployments or replica sets respond during voluntary disruptions, such as upgrade operations or accidental pod deletions. Using PDBs, you can define a minimum or maximum unavailable resource count.

For example, let's say you need to perform a cluster upgrade and already have a PDB defined. Before performing the cluster upgrade, the Kubernetes scheduler ensures that the minimum number of pods defined in the PDB are available. If the upgrade would cause the number of available pods to fall below the minimum defined in the PDB, the scheduler schedules extra pods on other nodes before allowing the upgrade to proceed.

In the following example PDB definition file, the `minAvailable` field sets the minimum number of pods that must remain available during voluntary disruptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mypdb
spec:
  minAvailable: 3 # Minimum number of pods that must remain available
  selector:
    matchLabels:
      app: myapp
```

For more information, see [Plan for availability using PDBs](./operator-best-practices-scheduler.md#plan-for-availability-using-pod-disruption-budgets) and [Specifying a Disruption Budget for your Application](https://kubernetes.io/docs/tasks/run-application/configure-pdb/).

### Pod CPU and memory limits

> **Best practice guidance**
>
> Set pod CPU and memory limits for all pods to ensure that pods don't consume all resources on a node and to provide protection during service threats, such as DDoS attacks.

Pod CPU and memory limits define the maximum amount of CPU and memory a pod can use. When a container exceeds its memory limit, it's terminated and marked for restart. When it exceeds its CPU limit, its CPU usage is throttled. For more information, see [CPU resource units in Kubernetes](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu) and [Memory resource units in Kubernetes](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).

Setting CPU and memory limits helps you maintain node health and minimizes impact to other pods on the node. Avoid setting a pod limit higher than your nodes can support. Each AKS node reserves a set amount of CPU and memory for the core Kubernetes components. If you set a pod limit higher than the node can support, your application might try to consume too many resources and negatively impact other pods on the node. Cluster administrators can set resource quotas on a namespace to require that all pods in that namespace set resource requests and limits. For more information, see [Enforce resource quotas in AKS](./operator-best-practices-scheduler.md#enforce-resource-quotas).

In the following example pod definition file, the `resources` section sets the CPU and memory limits for the pod:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
```

For more information, see [Assign CPU Resources to Containers and Pods](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/) and [Assign Memory Resources to Containers and Pods](https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/).

### Pod anti-affinity

> **Best practice guidance**
>
> Use pod anti-affinity to ensure that pods are spread across nodes for node-down scenarios.

You can use the `nodeSelector` field in your pod specification to specify the node labels you want the target node to have. Kubernetes only schedules the pod onto nodes that have the specified labels. Anti-affinity expands the types of constraints you can define and gives you more control over the selection logic. Anti-affinity allows you to constrain pods against labels on other pods. For more information, see [Affinity and anti-affinity in Kubernetes](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity).
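
A minimal sketch of a required anti-affinity rule, assuming the hypothetical `app: myapp` label from the earlier PDB example, that schedules each replica on a different node:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: myapp
      topologyKey: kubernetes.io/hostname # One replica per node
```

With the `required` variant, a replica stays pending if no conforming node is available; use `preferredDuringSchedulingIgnoredDuringExecution` for best-effort spreading.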

### Pod anti-affinity across availability zones

> **Best practice guidance**
>
> Use pod anti-affinity across availability zones to ensure that pods are spread across availability zones for zone-down scenarios.

When you deploy your application across multiple availability zones, you can use pod anti-affinity to ensure that pods are spread across availability zones. This practice helps ensure that your application remains available in the event of a zone-down scenario. For more information, see [Best practices for multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and [Overview of availability zones for AKS clusters](./availability-zones.md#overview-of-availability-zones-for-aks-clusters).
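
Under the same assumptions as the previous sketch, changing the `topologyKey` to the standard zone label spreads replicas across zones instead of nodes; the `preferred` variant lets pods still schedule if a zone is down:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: myapp
        topologyKey: topology.kubernetes.io/zone # Spread replicas across zones
```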

### Readiness and liveness probes

> **Best practice guidance**
>
> Configure readiness and liveness probes to improve resiliency for high load and lower container restarts.

#### Readiness probes

In Kubernetes, the kubelet uses readiness probes to know when a container is ready to start accepting traffic. A pod is considered *ready* when all of its containers are ready. When a pod is *not ready*, it's removed from service load balancers. For more information, see [Readiness Probes in Kubernetes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes).

For containerized applications that serve traffic, you should verify that your container is ready to handle incoming requests. [Azure Container Instances](../container-instances/container-instances-overview.md) supports readiness probes so that your container can't be accessed under certain conditions.

The following example YAML snippet shows a readiness probe configuration:

```yaml
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
```

For more information, see [Configure readiness probes](../container-instances/container-instances-readiness-probe.md).

#### Liveness probes

In Kubernetes, the kubelet uses liveness probes to know when to restart a container. If a container fails its liveness probe, the container is restarted. For more information, see [Liveness Probes in Kubernetes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).

Containerized applications can run for extended periods of time and end up in a broken state that can only be repaired by restarting the container. [Azure Container Instances](../container-instances/container-instances-overview.md) supports liveness probes so that your container can be restarted under certain conditions.

The following example YAML snippet shows a liveness probe configuration:

```yaml
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
```

For more information, see [Configure liveness probes](../container-instances/container-instances-liveness-probe.md).

### Pre-stop hooks

> **Best practice guidance**
>
> Use pre-stop hooks to ensure graceful termination before a container receives SIGTERM.

A `PreStop` hook is called immediately before a container is terminated due to an API request or management event, such as a liveness probe failure. The pod's termination grace period countdown begins before the `PreStop` hook is executed, so the container eventually terminates within the termination grace period. For more information, see [Container lifecycle hooks](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks) and [Termination of Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination).
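
As an illustrative sketch (the image and sleep duration are assumptions), the following pod spec uses a `preStop` hook to give in-flight requests time to drain before the kubelet sends SIGTERM:

```yaml
spec:
  terminationGracePeriodSeconds: 60 # Budget for the preStop hook plus application shutdown
  containers:
  - name: myapp
    image: mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"] # Drain in-flight requests before SIGTERM
```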

### Multi-replica applications

> **Best practice guidance**
>
> Deploy at least two replicas of your application to ensure high availability and resiliency in node-down scenarios.

Running multiple replicas ensures that your application can tolerate the loss of a single pod or node. When you create an application in AKS and choose an Azure region during resource creation, it's a single-region app. In the event of a disaster that causes the region to become unavailable, your application also becomes unavailable. If you create an identical deployment in a secondary Azure region, your application becomes less susceptible to a single-region disaster, and any data replication across the regions lets you recover your last application state.
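
A minimal sketch, reusing the hypothetical `myapp` workload from the earlier examples, of a Deployment that maintains two replicas:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2 # At least two replicas for node-down resiliency
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine
```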

For more information, see [Recommended active-active high availability solution overview for AKS](./active-active-solution.md) and [Running Multiple Instances of your Application](https://kubernetes.io/docs/tutorials/kubernetes-basics/scale/scale-intro/).

## Cluster level best practices

### Availability zones

Use at least two availability zones to ensure resiliency in zone-down scenarios. For more information, see [Best practices for multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/).

### Premium Disks

Use Premium disks to achieve 99.9% availability on a single VM. For more information, see [Use Premium SSD v2 disks on AKS](https://learn.microsoft.com/en-us/azure/aks/use-premium-v2-disks).

### Application dependencies

Identify application dependencies, such as databases, that aren't availability zone resilient. A zone-redundant cluster can't keep your application available if a dependency it relies on goes down with a zone.

### Auto-scale imbalance

To keep load balanced across zones when autoscaling, use one node pool in each availability zone, and spread pods across zones as shown in the sketch after this section. For more information, see [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) and [Horizontal Pod Autoscaling](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/).
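
A minimal sketch of such a constraint, assuming the hypothetical `app: myapp` label used in the earlier examples:

```yaml
topologySpreadConstraints:
- maxSkew: 1 # Allow at most one pod of imbalance between zones
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: myapp
```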

### Image versions

Don't use the `latest` tag for container images. Pin images to an explicit version so that deployments are reproducible and rollbacks are predictable. For more information, see [Images in Kubernetes](https://kubernetes.io/docs/concepts/containers/images/).
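
For example, the following snippet reuses the image from the earlier pod example with a pinned tag:

```yaml
containers:
- name: mypod
  image: mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine # Explicit tag instead of :latest
```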

### Standard tier for production

Use the Standard tier for production workloads. For more information, see [Free and Standard pricing tiers for AKS cluster management](https://learn.microsoft.com/en-us/azure/aks/free-standard-pricing-tiers).

### maxUnavailable

Set `maxUnavailable` to cap how many pods can be unavailable during a rolling upgrade, which guarantees a minimum number of pods stay available, as shown in the sketch after this section. For more information, see [Max Unavailable](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable).
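
A minimal sketch of a Deployment update strategy; the values shown are assumptions, not prescriptions:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1 # At most one pod below the desired count during the rollout
    maxSurge: 1 # At most one pod above the desired count during the rollout
```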

### Accelerated Networking

Use Accelerated Networking, which provides lower latency, reduced jitter, and decreased CPU utilization on your VMs. For more information, see [Accelerated Networking overview](https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview?tabs=redhat).

### Standard Load Balancer

Use the Standard Load Balancer, which supports multiple availability zones and HTTP probes, and works across multiple data centers. For more information, see [Use a Standard Load Balancer in AKS](https://learn.microsoft.com/en-us/azure/aks/load-balancer-standard).

### Dynamic IP for Azure CNI

If you use Azure CNI, enable dynamic IP allocation to prevent IP exhaustion in your AKS clusters. For more information, see [Configure Azure CNI for dynamic IP allocation](https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni-dynamic-ip-allocation).

### Container insights

Enable Container insights, and use Prometheus or other tools to track cluster performance. For more information, see [Enable monitoring for Kubernetes clusters](https://learn.microsoft.com/en-us/azure/azure-monitor/containers/kubernetes-monitoring-enable?tabs=cli).

### Scale-down mode

Use scale-down mode to control whether nodes are deleted or deallocated when the cluster scales down. For more information, see [Use Scale-down Mode in AKS](https://learn.microsoft.com/en-us/azure/aks/scale-down-mode).

### Azure policies

Use Azure Policy to ensure compliance of your cluster. For more information, see [Secure your AKS clusters with Azure Policy](https://learn.microsoft.com/en-us/azure/aks/use-azure-policy).

### System node pools

#### Do not use taints

Don't add taints to system node pools. For more information, see [Manage system node pools in AKS](https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli).

#### Autoscaler for system node pools

Use the cluster autoscaler for system node pools. For more information, see [Manage system node pools in AKS](https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli) and [Simplified application autoscaling with KEDA](https://learn.microsoft.com/en-us/azure/aks/keda-about).

#### At least two nodes in system node pools

Run at least two nodes in system node pools to ensure resiliency in node-down scenarios. For more information, see [Manage system node pools in AKS](https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli).

### Container images

Only use allowed images. For more information, see [Best practices for container image management in AKS](https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-container-image-management) and [Use Image Integrity to validate signed images](https://learn.microsoft.com/en-us/azure/aks/image-integrity?tabs=azure-cli).

### Image pulls

Don't allow unauthenticated image pulls. For more information, see [Artifact streaming on AKS](https://learn.microsoft.com/en-us/azure/aks/artifact-streaming).

### v5 SKU VMs

Use v4 and v5 VM SKUs, which offer better reliability and reduce the impact of updates. For more information, see [Performance and scaling best practices for AKS](https://learn.microsoft.com/en-us/azure/aks/best-practices-performance-scale), [Performance and scaling best practices for large workloads](https://learn.microsoft.com/en-us/azure/aks/best-practices-performance-scale-large), and [Best practices for running AKS at scale](https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-run-at-scale).

#### Do not use B series VMs

Don't use B series VMs. They're low performance and don't work well with AKS.
