
Commit cdcd100

Merge pull request #258532 from Nickomang/aks-drain-timeout
Added configurable drain timeout
2 parents 7ca2b32 + db87a56 commit cdcd100

2 files changed (+22 −1 lines changed)

articles/aks/upgrade-aks-cluster.md

Lines changed: 19 additions & 0 deletions
@@ -132,6 +132,7 @@ During the cluster upgrade process, AKS performs the following operations:
* Add a new buffer node (or as many nodes as configured in [max surge](#customize-node-surge-upgrade)) to the cluster that runs the specified Kubernetes version.
* [Cordon and drain][kubernetes-drain] one of the old nodes to minimize disruption to running applications. If you're using max surge, it [cordons and drains][kubernetes-drain] as many nodes at the same time as the number of buffer nodes specified.
* For long-running pods, you can configure the node drain timeout, which allows a custom wait time for pod eviction and graceful termination per node. If not specified, the default is 30 minutes.
* When the old node is fully drained, it's reimaged to receive the new version and becomes the buffer node for the following node to be upgraded.
* This process repeats until all nodes in the cluster have been upgraded.
* At the end of the process, the last buffer node is deleted, maintaining the existing agent node count and zone balance.
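The process above runs when you upgrade the cluster's Kubernetes version. As a minimal sketch (resource names and version are placeholders, not from this commit):

```azurecli-interactive
# Check which Kubernetes versions the cluster can upgrade to
az aks get-upgrades --resource-group MyResourceGroup --name MyManagedCluster --output table

# Trigger the upgrade to one of the versions returned above
az aks upgrade --resource-group MyResourceGroup --name MyManagedCluster --kubernetes-version 1.28.3
```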
@@ -229,6 +230,24 @@ AKS accepts both integer values and a percentage value for max surge. An integer
az aks nodepool update -n mynodepool -g MyResourceGroup --cluster-name MyManagedCluster --max-surge 5
```
#### Set node drain timeout value
A long-running workload on a pod can result in one of the following cases:
- Your pod takes a long time to come up, such as when restoring a database.
- Your pod relies on graceful termination and takes a long time to shut down.
In these scenarios, you can configure a node drain timeout that AKS respects during the upgrade workflow. If you want fast upgrades and are confident that your pods start and terminate quickly, set a low drain timeout. A higher drain timeout increases how long you wait before discovering an issue. If no node drain timeout value is specified, the default is 30 minutes.
To set a node drain timeout for new or existing node pools, use the [`az aks nodepool add`][az-aks-nodepool-add] or [`az aks nodepool update`][az-aks-nodepool-update] command:
```azurecli-interactive
# Set drain timeout for a new node pool
az aks nodepool add -n mynodepool -g MyResourceGroup --cluster-name MyManagedCluster --drain-timeout 100
# Update drain timeout for an existing node pool
az aks nodepool update -n mynodepool -g MyResourceGroup --cluster-name MyManagedCluster --drain-timeout 45
```
## View upgrade events

* View upgrade events using the `kubectl get events` command.
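As a minimal sketch, you can list events while the upgrade runs to observe cordon and drain activity on nodes:

```azurecli-interactive
# List recent events in the current namespace
kubectl get events

# Or watch continuously during the upgrade
kubectl get events --watch
```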

articles/aks/upgrade-cluster.md

Lines changed: 3 additions & 1 deletion
@@ -43,11 +43,12 @@ Persistent volume claims (PVCs) backed by Azure locally redundant storage (LRS)
## Optimize upgrades to improve performance and minimize disruptions
The combination of [Planned Maintenance Window][planned-maintenance], [Max Surge](./upgrade-aks-cluster.md#customize-node-surge-upgrade), [Pod Disruption Budget][pdb-spec], and [node drain timeout][drain-timeout] can significantly increase the likelihood of node upgrades completing successfully by the end of the maintenance window while also minimizing disruptions.
* [Planned Maintenance Window][planned-maintenance] enables service teams to schedule auto-upgrade during a pre-defined window, typically a low-traffic period, to minimize workload impact. We recommend a window duration of at least *four hours*.
* [Max Surge](./upgrade-aks-cluster.md#customize-node-surge-upgrade) on the node pool allows requesting extra quota during the upgrade process and limits the number of nodes selected for upgrade simultaneously. A higher max surge results in a faster upgrade process. We don't recommend setting it at 100%, as it upgrades all nodes simultaneously, which can cause disruptions to running applications. We recommend a max surge quota of *33%* for production node pools.
* [Pod Disruption Budget][pdb-spec] is set for service applications and limits the number of pods that can be down during voluntary disruptions, such as AKS-controlled node upgrades. It can be configured as `minAvailable` replicas, indicating the minimum number of application pods that need to be active, or `maxUnavailable` replicas, indicating the maximum number of application pods that can be terminated, ensuring high availability for the application. Refer to the guidance provided for configuring [Pod Disruption Budgets (PDBs)][pdb-concepts]. PDB values should be validated to determine the settings that work best for your specific service.
* [Node drain timeout][drain-timeout] on the node pool configures how long to wait for pod eviction and graceful termination per node during upgrades, which typically applies to long-running workloads. While a node drain timeout (in minutes) is in effect, AKS continues to honor pod disruption budgets. If not specified, the default is 30 minutes.
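Taken together, the settings above can be sketched as follows (resource, pool, and label names are illustrative, not from this commit):

```azurecli-interactive
# Schedule auto-upgrade into a low-traffic window (at least four hours recommended)
az aks maintenanceconfiguration add -g MyResourceGroup --cluster-name MyManagedCluster -n default --weekday Sunday --start-hour 1

# Limit how many nodes upgrade at once (33% recommended for production node pools)
az aks nodepool update -n mynodepool -g MyResourceGroup --cluster-name MyManagedCluster --max-surge 33%

# Keep a minimum number of application replicas available during drains
kubectl create poddisruptionbudget myapp-pdb --selector=app=myapp --min-available=2

# Give long-running pods extra time to terminate gracefully during drains
az aks nodepool update -n mynodepool -g MyResourceGroup --cluster-name MyManagedCluster --drain-timeout 60
```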
## Next steps

@@ -63,5 +64,6 @@ This article listed different upgrade options for AKS clusters. To learn more ab
<!-- LINKS - internal -->
[aks-tutorial-prepare-app]: ./tutorial-kubernetes-prepare-app.md
[nodepool-upgrade]: manage-node-pools.md#upgrade-a-single-node-pool
[drain-timeout]: ./upgrade-aks-cluster.md#set-node-drain-timeout-value
[planned-maintenance]: planned-maintenance.md
[specific-nodepool]: node-image-upgrade.md#upgrade-a-specific-node-pool
