Commit 1f3cb04
Author: Mayank Kumar
Commit message: v1.24 blog post: Maxunavailable for StatefulSet
1 parent: 7f3f987

2 files changed: +149, -5 lines

New blog post (new file, 149 additions):

---
layout: blog
title: 'Kubernetes 1.24: Maximum Unavailable Replicas for StatefulSet'
date: 2022-05-27
slug: maxunavailable-for-statefulset
---

**Author:** Mayank Kumar (Salesforce)

Kubernetes [StatefulSets](/docs/concepts/workloads/controllers/statefulset/), since their introduction in
1.5 and becoming stable in 1.9, have been widely used to run stateful applications. They provide stable pod identity, persistent
per pod storage and ordered graceful deployment, scaling and rolling updates. You can think of a StatefulSet as the atomic building
block for running complex stateful applications. As the use of Kubernetes has grown, so has the number of scenarios requiring
StatefulSets. Many of these scenarios require faster rolling updates than the currently supported one-pod-at-a-time updates, in the
case where you're using the `OrderedReady` Pod management policy for a StatefulSet.

Here are some examples:

- I am using a StatefulSet to orchestrate a multi-instance, cache based application where the size of the cache is large. The cache
  starts cold and requires a significant amount of time before the container can start. There could be more initial startup tasks
  that are required. A RollingUpdate on this StatefulSet would take a lot of time before the application is fully updated. If the
  StatefulSet supported updating more than one pod at a time, it would result in a much faster update.

- My stateful application is composed of leaders and followers or one writer and multiple readers. I have multiple readers or
  followers and my application can tolerate multiple pods going down at the same time. I want to update this application more than
  one pod at a time so that I get the new updates rolled out quickly, especially if the number of instances of my application is
  large. Note that my application still requires unique identity per pod.

In order to support such scenarios, Kubernetes 1.24 includes a new alpha feature to help. Before you can use the new feature you must
enable the `MaxUnavailableStatefulSet` feature flag. Once you enable that, you can specify a new field called `maxUnavailable`, part
of the `spec` for a StatefulSet. For example:

```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  podManagementPolicy: OrderedReady # you must set OrderedReady
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: k8s.gcr.io/nginx-slim:0.8
        imagePullPolicy: IfNotPresent
        name: nginx
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2 # this is the new alpha field, whose default value is 1
      partition: 0
    type: RollingUpdate
```
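
The post assumes the feature gate is already on. As one hedged example of how to do that (the kind cluster and file name below are my
own illustration, not part of the post), you can enable `MaxUnavailableStatefulSet` cluster-wide on a local
[kind](https://kind.sigs.k8s.io/) cluster through its cluster configuration:

```
# kind-maxunavailable.yaml (hypothetical file name)
# Any other way of passing --feature-gates=MaxUnavailableStatefulSet=true to the
# kube-apiserver and kube-controller-manager should work equally well.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  MaxUnavailableStatefulSet: true
nodes:
- role: control-plane
- role: worker
```

Create the cluster with `kind create cluster --config kind-maxunavailable.yaml`, then apply the StatefulSet manifest above with
`kubectl apply -f`.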

If you enable the new feature and you don't specify a value for `maxUnavailable` in a StatefulSet, Kubernetes applies a default
`maxUnavailable: 1`. This matches the behavior you would see if you don't enable the new feature.
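
A quick aside: `maxUnavailable` can also be a percentage of the desired replicas instead of an absolute number. As a sketch (the
`40%` value is my own illustration, not something used in the walkthrough below), the `updateStrategy` portion of the spec above
could read:

```
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 40% # with 5 replicas, 40% works out to 2 Pods at a time
      partition: 0
    type: RollingUpdate
```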

I'll run through a scenario based on that example manifest to demonstrate how this feature works. I will deploy a StatefulSet that
has 5 replicas, with `maxUnavailable` set to 2 and `partition` set to 0.

I can trigger a rolling update by changing the image to `k8s.gcr.io/nginx-slim:0.9`.
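
One way to make that change is with `kubectl set image` (a sketch, assuming the StatefulSet and container names from the manifest
above; editing the manifest and re-applying it works just as well):

```
kubectl set image statefulset/web nginx=k8s.gcr.io/nginx-slim:0.9
```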

Once I initiate the rolling update, I can watch the pods update 2 at a time, as the current value of `maxUnavailable` is 2. The value
of `maxUnavailable` can be an absolute number (for example, 2) or a percentage of desired Pods (for example, 10%); the absolute
number is calculated from the percentage by rounding down. The output below shows a span of time and is not complete.

```
kubectl get pods --watch
```

```
NAME    READY   STATUS              RESTARTS   AGE
web-0   1/1     Running             0          85s
web-1   1/1     Running             0          2m6s
web-2   1/1     Running             0          106s
web-3   1/1     Running             0          2m47s
web-4   1/1     Running             0          2m27s
web-4   1/1     Terminating         0          5m43s   ----> start terminating 4
web-3   1/1     Terminating         0          6m3s    ----> start terminating 3
web-3   0/1     Terminating         0          6m7s
web-3   0/1     Pending             0          0s
web-3   0/1     Pending             0          0s
web-4   0/1     Terminating         0          5m48s
web-4   0/1     Terminating         0          5m48s
web-3   0/1     ContainerCreating   0          2s
web-3   1/1     Running             0          2s
web-4   0/1     Pending             0          0s
web-4   0/1     Pending             0          0s
web-4   0/1     ContainerCreating   0          0s
web-4   1/1     Running             0          1s
web-2   1/1     Terminating         0          5m46s   ----> start terminating 2 (only after both 4 and 3 are running)
web-1   1/1     Terminating         0          6m6s    ----> start terminating 1
web-2   0/1     Terminating         0          5m47s
web-1   0/1     Terminating         0          6m7s
web-1   0/1     Pending             0          0s
web-1   0/1     Pending             0          0s
web-1   0/1     ContainerCreating   0          1s
web-1   1/1     Running             0          2s
web-2   0/1     Pending             0          0s
web-2   0/1     Pending             0          0s
web-2   0/1     ContainerCreating   0          0s
web-2   1/1     Running             0          1s
web-0   1/1     Terminating         0          6m6s    ----> start terminating 0 (only after 2 and 1 are running)
web-0   0/1     Terminating         0          6m7s
web-0   0/1     Pending             0          0s
web-0   0/1     Pending             0          0s
web-0   0/1     ContainerCreating   0          0s
web-0   1/1     Running             0          1s
```

Note that as soon as the rolling update starts, both 4 and 3 (the two highest ordinal pods) start terminating at the same time. Pods
with ordinal 4 and 3 may become ready at their own pace. As soon as both pods 4 and 3 are ready, pods 2 and 1 start terminating at the
same time. When pods 2 and 1 are both running and ready, pod 0 starts terminating.
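
If you'd rather not read the raw watch output, the same rollout can be followed with `kubectl rollout status` (shown here only as a
command sketch; I'm not reproducing its output):

```
kubectl rollout status statefulset/web
```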

In Kubernetes, updates to StatefulSets follow a strict ordering when updating Pods. In this example, the update starts at replica 4, then
replica 3, then replica 2, and so on, one pod at a time. When going one pod at a time, it's not possible for 3 to be running and ready
before 4. When `maxUnavailable` is more than 1 (in the example scenario I set `maxUnavailable` to 2), it is possible that replica 3 becomes
ready and running before replica 4 is ready, and that is OK. If you're a developer and you set `maxUnavailable` to more than 1, you should
know that this outcome is possible and you must ensure that your application is able to handle such ordering issues, if any occur. When you
set `maxUnavailable` greater than 1, the ordering is still guaranteed between successive batches of pods being updated. That guarantee means
that the pods in the second update batch (replicas 2 and 1) cannot start updating until the pods from the first batch (replicas 4 and 3) are
ready.
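
One way to see those batches form is to watch the `controller-revision-hash` label that the StatefulSet controller sets on each Pod;
Pods move to the new revision hash in groups of at most `maxUnavailable`. A sketch, assuming the `app: nginx` label from the example
manifest:

```
kubectl get pods -l app=nginx -L controller-revision-hash --watch
```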

Although Kubernetes refers to these as _replicas_, your stateful application may have a different view and each pod of the StatefulSet may
be holding completely different data than other pods. The important thing here is that updates to StatefulSets happen in batches, and you can
now have a batch size larger than 1 (as an alpha feature).

Also note that the behavior above is for `podManagementPolicy: OrderedReady`. If you defined a StatefulSet with
`podManagementPolicy: Parallel`, not only are `maxUnavailable` replicas terminated at the same time, but `maxUnavailable` replicas also
start in the `ContainerCreating` phase at the same time. This is called bursting.
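
For reference, that variant differs from the example manifest in a single field (a sketch of just that field; note that
`podManagementPolicy` can only be set when the StatefulSet is created, not changed on an existing one):

```
spec:
  podManagementPolicy: Parallel # instead of OrderedReady; allows the bursting behavior described above
```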

So, now you may have a lot of questions, such as:

- What is the behavior when you set `podManagementPolicy: Parallel`?
- What is the behavior when `partition` is set to a value other than `0`? (See the sketch just below for a starting point.)

It might be better to try and see it for yourself.
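
As a starting point for the `partition` question (a sketch; the value `3` is my own illustration): Pods with an ordinal greater than
or equal to the partition are updated during a rolling update, while the rest keep the old revision.

```
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2
      partition: 3 # only web-3 and web-4 are updated; web-0, web-1 and web-2 keep the old revision
    type: RollingUpdate
```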

This is an alpha feature, and the Kubernetes contributors are looking for feedback on it. Did this help you achieve your stateful
scenarios? Did you find a bug, or do you think the behavior as implemented is not intuitive or can break applications or catch them
by surprise? Please [open an issue](https://github.com/kubernetes/kubernetes/issues) to let us know.

Keep an eye on this space for more blog posts dissecting the behavior of this feature in the coming months.

## Further reading and next steps {#next-steps}

- [Maximum unavailable Pods](/docs/concepts/workloads/controllers/statefulset/#maximum-unavailable-pods)
- [KEP for MaxUnavailable for StatefulSet](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/961-maxunavailable-for-statefulset)
- [Implementation](https://github.com/kubernetes/kubernetes/pull/82162/files)
- [Enhancement Tracking Issue](https://github.com/kubernetes/enhancements/issues/961)

content/en/docs/concepts/workloads/controllers/statefulset.md

Lines changed: 0 additions & 5 deletions

@@ -323,11 +323,6 @@ After reverting the template, you must also delete any Pods that StatefulSet had
 already attempted to run with the bad configuration.
 StatefulSet will then begin to recreate the Pods using the reverted template.

-#### MaxUnavailable
-The maximum number of pods that can be unavailable during the update.
-Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
-Absolute number is calculated from percentage by rounding up. This can not be 0.
-Defaults to 1. This field is alpha-level and is only honored by servers that enable the

 ## PersistentVolumeClaim retention