Commit cecc5bf

Incorporated feedback
1 parent 4b2d110 commit cecc5bf

1 file changed: +67 −25 lines


articles/aks/best-practices-app-cluster-reliability.md

Lines changed: 67 additions & 25 deletions
@@ -3,7 +3,7 @@ title: Deployment and cluster reliability best practices for Azure Kubernetes Se
 titleSuffix: Azure Kubernetes Service
 description: Learn the best practices for deployment and cluster reliability for Azure Kubernetes Service (AKS) workloads.
 ms.topic: conceptual
-ms.date: 02/22/2024
+ms.date: 03/11/2024
 ---
 
 # Deployment and cluster reliability best practices for Azure Kubernetes Service (AKS)
@@ -14,7 +14,7 @@ The best practices in this article are organized into the following categories:
 
 | Category | Best practices |
 | -------- | -------------- |
-| [Deployment level best practices](#deployment-level-best-practices) | [Pod Disruption Budgets (PDBs)](#pod-disruption-budgets-pdbs) <br/> • [Pod CPU and memory limits](#pod-cpu-and-memory-limits) <br/> • [Pre-stop hooks](#pre-stop-hooks) <br/> • [maxUnavailable](#maxunavailable) <br/> • [Pod anti-affinity](#pod-anti-affinity) <br/> • [Readiness and liveness probes](#readiness-and-liveness-probes) <br/> • [Multi-replica applications](#multi-replica-applications) |
+| [Deployment level best practices](#deployment-level-best-practices) | [Pod Disruption Budgets (PDBs)](#pod-disruption-budgets-pdbs) <br/> • [Pod CPU and memory limits](#pod-cpu-and-memory-limits) <br/> • [Pre-stop hooks](#pre-stop-hooks) <br/> • [maxUnavailable](#maxunavailable) <br/> • [Pod anti-affinity](#pod-anti-affinity) <br/> • [Readiness, liveness, and startup probes](#readiness-liveness-and-startup-probes) <br/> • [Multi-replica applications](#multi-replica-applications) |
 | [Cluster and node pool level best practices](#cluster-and-node-pool-level-best-practices) | [Availability zones](#availability-zones) <br/> • [Cluster autoscaling](#cluster-autoscaling) <br/> • [Standard Load Balancer](#standard-load-balancer) <br/> • [System node pools](#system-node-pools) <br/> • [Accelerated Networking](#accelerated-networking) <br/> • [Image versions](#image-versions) <br/> • [Azure CNI for dynamic IP allocation](#azure-cni-for-dynamic-ip-allocation) <br/> • [v5 SKU VMs](#v5-sku-vms) <br/> • [Do *not* use B series VMs](#do-not-use-b-series-vms) <br/> • [Premium Disks](#premium-disks) <br/> • [Container Insights](#container-insights) <br/> • [Azure Policy](#azure-policy) |
 
 ## Deployment level best practices
@@ -30,19 +30,19 @@ The following deployment level best practices help ensure high availability and
 >
 > Use Pod Disruption Budgets (PDBs) to ensure that a minimum number of pods remain available during *voluntary disruptions*, such as upgrade operations or accidental pod deletions.
 
-[Pod Disruption Budgets (PDBs)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) allow you to define how deployments or replica sets respond during voluntary disruptions, such as upgrade operations or accidental pod deletions. Using PDBs, you can define a minimum or maximum unavailable resource count.
+[Pod Disruption Budgets (PDBs)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) allow you to define how deployments or replica sets respond during voluntary disruptions, such as upgrade operations or accidental pod deletions. Using PDBs, you can define a minimum or maximum unavailable resource count. PDBs only affect the Eviction API for voluntary disruptions.
 
 For example, let's say you need to perform a cluster upgrade and already have a PDB defined. Before performing the cluster upgrade, the Kubernetes scheduler ensures that the minimum number of pods defined in the PDB are available. If the upgrade would cause the number of available pods to fall below the minimum defined in the PDBs, the scheduler schedules extra pods on other nodes before allowing the upgrade to proceed. If you don't set a PDB, the scheduler doesn't have any constraints on the number of pods that can be unavailable during the upgrade, which can lead to a lack of resources and potential cluster outages.
 
-In the following example PDB definition file, the `minAvailable` field sets the minimum number of pods that must remain available during voluntary disruptions:
+In the following example PDB definition file, the `minAvailable` field sets the minimum number of pods that must remain available during voluntary disruptions. The value can be an absolute number (for example, *3*) or a percentage of the desired number of pods (for example, *10%*).
 
 ```yaml
 apiVersion: policy/v1
 kind: PodDisruptionBudget
 metadata:
   name: mypdb
 spec:
-  minAvailable: 3 # Minimum number of pods that must remain available
+  minAvailable: 3 # Minimum number of pods that must remain available during voluntary disruptions
   selector:
     matchLabels:
       app: myapp
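As the paragraph above notes, a PDB can also cap unavailability instead of guaranteeing a minimum. A minimal sketch of the `maxUnavailable` variant (the `myapp` label is reused from the example above; the name and 10% figure are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mypdb-max
spec:
  maxUnavailable: "10%" # At most 10% of matching pods may be evicted at once
  selector:
    matchLabels:
      app: myapp
```

A PDB must set either `minAvailable` or `maxUnavailable`, not both.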
@@ -109,7 +109,7 @@ For more information, see [Assign CPU Resources to Containers and Pods](https://
 
 > **Best practice guidance**
 >
-> Use pre-stop hooks to ensure graceful termination of a container.
+> When applicable, use pre-stop hooks to ensure graceful termination of a container.
 
 A `PreStop` hook is called immediately before a container is terminated due to an API request or management event, such as preemption, resource contention, or a liveness/startup probe failure. A call to the `PreStop` hook fails if the container is already in a terminated or completed state, and the hook must complete before the TERM signal to stop the container is sent. The pod's termination grace period countdown begins before the `PreStop` hook is executed, so the container eventually terminates within the termination grace period.
 
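As a sketch of how a `PreStop` hook is wired into a container spec (the image, sleep duration, and grace period below are illustrative, not from this article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prestop-demo
spec:
  containers:
  - name: web
    image: nginx:1.25 # illustrative image
    lifecycle:
      preStop:
        exec:
          # Give the endpoint controller time to remove the pod from
          # service endpoints before the TERM signal is sent.
          command: ["/bin/sh", "-c", "sleep 10"]
  terminationGracePeriodSeconds: 30 # must exceed the hook's runtime
```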
@@ -139,11 +139,11 @@ For more information, see [Container lifecycle hooks](https://kubernetes.io/docs
 
 > **Best practice guidance**
 >
-> Define the maximum number of pods that can be unavailable during a rolling upgrade using the `maxUnavailable` field in your deployment to ensure that a minimum number of pods remain available during the upgrade.
+> Define the maximum number of pods that can be unavailable during a rolling update using the `maxUnavailable` field in your deployment to ensure that a minimum number of pods remain available during the update.
 
-The `maxUnavailable` field specifies the maximum number of pods that can be unavailable during the upgrade process. The value can be an absolute number (for example, *five*) or a percentage of the desired number of pods (for example, *10%*).
+The `maxUnavailable` field specifies the maximum number of pods that can be unavailable during the update process. The value can be an absolute number (for example, *3*) or a percentage of the desired number of pods (for example, *10%*). `maxUnavailable` pertains to the Delete API, which is used during rolling updates.
 
-The following example deployment manifest uses the `maxAvailable` field to set the maximum number of pods that can be unavailable during the upgrade process:
+The following example deployment manifest uses the `maxUnavailable` field to set the maximum number of pods that can be unavailable during the update process:
 
 ```yaml
 apiVersion: apps/v1
@@ -199,8 +199,9 @@ spec:
       - key: topology.kubernetes.io/zone
         operator: In
         values:
-        - antarctica-east1
-        - antarctica-west1
+        - "1" # Azure Availability Zone 1
+        - "2" # Azure Availability Zone 2
+        - "3" # Azure Availability Zone 3
     preferredDuringSchedulingIgnoredDuringExecution:
     - weight: 1
       preference:
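For the pod anti-affinity case itself, a minimal sketch of spreading replicas of one app across zones (the `myapp` label is illustrative; `topologyKey` uses the same zone label as the example above):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: myapp
      # Don't co-locate two replicas in the same availability zone.
      topologyKey: topology.kubernetes.io/zone
```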
@@ -225,18 +226,16 @@ For more information, see [Affinity and anti-affinity in Kubernetes](https://kub
 >
 > For more information, see [Best practices for multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and [Overview of availability zones for AKS clusters](./availability-zones.md#overview-of-availability-zones-for-aks-clusters).
 
-### Readiness and liveness probes
+### Readiness, liveness, and startup probes
 
 > **Best practice guidance**
 >
-> Configure readiness and liveness probes to improve resiliency for high load and lower container restarts.
+> Configure readiness, liveness, and startup probes when applicable to improve resiliency under high load and reduce container restarts.
 
 #### Readiness probes
 
 In Kubernetes, the kubelet uses readiness probes to know when a container is ready to start accepting traffic. A pod is considered *ready* when all of its containers are ready. When a pod is *not ready*, it's removed from service load balancers. For more information, see [Readiness Probes in Kubernetes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes).
 
-For containerized applications that serve traffic, you should verify that your container is ready to handle incoming requests. [Azure Container Instances](../container-instances/container-instances-overview.md) supports readiness probes to include configurations so that your container can't be accessed under certain conditions.
-
 The following example pod definition file shows a readiness probe configuration:
 
 ```yaml
@@ -255,19 +254,58 @@ For more information, see [Configure readiness probes](../container-instances/co
 
 In Kubernetes, the kubelet uses liveness probes to know when to restart a container. If a container fails its liveness probe, the container is restarted. For more information, see [Liveness Probes in Kubernetes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
 
-Containerized applications can run for extended periods of time, resulting in broken states in need of repair by restarting the container. [Azure Container Instances](../container-instances/container-instances-overview.md) supports liveness probes to include configurations so that your container can be restarted under certain conditions.
-
 The following example pod definition file shows a liveness probe configuration:
 
 ```yaml
+livenessProbe:
+  exec:
+    command:
+    - cat
+    - /tmp/healthy
+```
+
+Another kind of liveness probe uses an HTTP GET request. The following example pod definition file shows an HTTP GET request liveness probe configuration:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  labels:
+    test: liveness
+  name: liveness-http
+spec:
+  containers:
+  - name: liveness
+    image: registry.k8s.io/liveness
+    args:
+    - /server
 livenessProbe:
-  exec:
-    command:
-    - cat
-    - /tmp/healthy
+  httpGet:
+    path: /healthz
+    port: 8080
+    httpHeaders:
+    - name: Custom-Header
+      value: Awesome
+  initialDelaySeconds: 3
+  periodSeconds: 3
 ```
 
-For more information, see [Configure liveness probes](../container-instances/container-instances-liveness-probe.md).
+For more information, see [Configure liveness probes](../container-instances/container-instances-liveness-probe.md) and [Define a liveness HTTP request](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request).
+
+#### Startup probes
+
+In Kubernetes, the kubelet uses startup probes to know when a container application has started. When you configure a startup probe, readiness and liveness probes don't start until the startup probe succeeds, ensuring the readiness and liveness probes don't interfere with application startup. For more information, see [Startup Probes in Kubernetes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes).
+
+The following example pod definition file shows a startup probe configuration:
+
+```yaml
+startupProbe:
+  httpGet:
+    path: /healthz
+    port: 8080
+  failureThreshold: 30
+  periodSeconds: 10
+```
 
 ### Multi-replica applications
 
@@ -386,7 +424,7 @@ Use the autoscaler on node pools to configure the minimum and maximum scale limi
 
 For more information, see [Use the cluster autoscaler on node pools](./cluster-autoscaler.md#use-the-cluster-autoscaler-on-node-pools).
 
-#### At least two nodes per system node pool
+#### At least three nodes per system node pool
 
 > **Best practice guidance**
 >
@@ -414,17 +452,21 @@ For more information, see [Accelerated Networking overview](../virtual-network/a
 >
 > Images shouldn't use the `latest` tag.
 
+#### Container image tags
+
 Using the `latest` tag for [container images](https://kubernetes.io/docs/concepts/containers/images/) can lead to unpredictable behavior and makes it difficult to track which version of the image is running in your cluster. You can minimize these risks by integrating and running scan and remediation tools in your containers at build and runtime. For more information, see [Best practices for container image management in AKS](./operator-best-practices-container-image-management.md).
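To avoid the `latest` tag, pin images to an explicit version. A minimal sketch (the registry, repository, and tag below are illustrative, not from this article):

```yaml
containers:
- name: web
  # Pin to an explicit, immutable version instead of :latest
  # so the running version is predictable and traceable.
  image: myregistry.azurecr.io/myapp:1.4.2
```

Pinning by digest (`@sha256:...`) is stricter still, since a tag can be re-pushed to point at a different image.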
 
+#### Node image upgrades
+
 AKS provides multiple auto-upgrade channels for node OS image upgrades. You can use these channels to control the timing of upgrades. We recommend joining these auto-upgrade channels to ensure that your nodes are running the latest security patches and updates. For more information, see [Auto-upgrade node OS images in AKS](./auto-upgrade-node-os-image.md).
 
 ### Standard tier for production workloads
 
 > **Best practice guidance**
 >
-> Use the standard tier for product workloads for greater cluster reliability and resources, support for up to 5,000 nodes in a cluster, and Uptime SLA enabled by default.
+> Use the Standard tier for production workloads for greater cluster reliability and resources, support for up to 5,000 nodes in a cluster, and Uptime SLA enabled by default. If you need long-term support (LTS), consider using the Premium tier.
 
-The standard tier for Azure Kubernetes Service (AKS) provides a financially backed 99.9% uptime [service-level agreement (SLA)](https://www.azure.cn/en-us/support/sla/kubernetes-service/) for your production workloads. The standard tier also provides greater cluster reliability and resources, support for up to 5,000 nodes in a cluster, and Uptime SLA enabled by default. For more information, see [Standard pricing tier for AKS cluster management](./free-standard-pricing-tiers.md).
+The Standard tier for Azure Kubernetes Service (AKS) provides a financially backed 99.9% uptime [service-level agreement (SLA)](https://www.azure.cn/en-us/support/sla/kubernetes-service/) for your production workloads. The Standard tier also provides greater cluster reliability and resources, support for up to 5,000 nodes in a cluster, and Uptime SLA enabled by default. For more information, see [Pricing tiers for AKS cluster management](./free-standard-pricing-tiers.md).
 
 ### Azure CNI for dynamic IP allocation
 