---
layout: blog
title: 'Kubernetes 1.24: Maximum Unavailable Replicas for StatefulSet'
date: 2022-05-27
slug: maxunavailable-for-statefulset
---

**Author:** Mayank Kumar (Salesforce)

Kubernetes [StatefulSets](/docs/concepts/workloads/controllers/statefulset/), introduced in 1.5 and stable since 1.9, have been
widely used to run stateful applications. They provide stable pod identity, persistent per-pod storage, and ordered, graceful
deployment, scaling and rolling updates. You can think of a StatefulSet as the atomic building block for running complex stateful
applications. As the use of Kubernetes has grown, so has the number of scenarios requiring StatefulSets. Many of these scenarios
require faster rolling updates than the one-pod-at-a-time updates that StatefulSets currently perform when you use the
`OrderedReady` Pod management policy.

Here are some examples:

- I am using a StatefulSet to orchestrate a multi-instance, cache-based application where the size of the cache is large. The cache
  starts cold and the container needs a significant amount of time to start, and there may be more initial startup tasks required.
  A rolling update of this StatefulSet would take a long time before the application is fully updated. If the StatefulSet supported
  updating more than one pod at a time, the update would finish much faster.

- My stateful application is composed of leaders and followers, or of one writer and multiple readers. I have multiple readers or
  followers, and my application can tolerate multiple pods going down at the same time. I want to update this application more than
  one pod at a time so that I get the new updates rolled out quickly, especially if the number of instances of my application is
  large. Note that my application still requires a unique identity per pod.

In order to support such scenarios, Kubernetes 1.24 includes a new alpha feature to help. Before you can use the new feature you must
enable the `MaxUnavailableStatefulSet` feature gate. Once you enable that, you can specify a new field called `maxUnavailable`, part
of the `spec` for a StatefulSet. For example:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  podManagementPolicy: OrderedReady  # you must set OrderedReady
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: k8s.gcr.io/nginx-slim:0.8
        imagePullPolicy: IfNotPresent
        name: nginx
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2  # this is the new alpha field, whose default value is 1
      partition: 0
    type: RollingUpdate
```
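How you enable an alpha feature gate depends on how your cluster is deployed. For a quick local experiment, one option is a
[kind](https://kind.sigs.k8s.io/) cluster; the configuration below is a sketch that assumes you are using kind and are happy to
create a fresh test cluster:

```yaml
# kind-config.yaml (illustrative): create a local test cluster with the
# MaxUnavailableStatefulSet feature gate enabled on all components.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  MaxUnavailableStatefulSet: true
```

With a config like this, `kind create cluster --config kind-config.yaml` gives you a cluster where the new field is usable. On other
clusters you would instead pass `--feature-gates=MaxUnavailableStatefulSet=true` to the kube-apiserver and kube-controller-manager.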
If you enable the new feature and you don't specify a value for `maxUnavailable` in a StatefulSet, Kubernetes applies a default
`maxUnavailable: 1`. This matches the behavior you would see if you don't enable the new feature.

I'll run through a scenario based on that example manifest to demonstrate how this feature works. I will deploy a StatefulSet that
has 5 replicas, with `maxUnavailable` set to 2 and `partition` set to 0.
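To follow along, you could save the manifest above to a file and apply it; the filename here is just an assumption:

```shell
# Create the StatefulSet from the manifest shown earlier.
kubectl apply -f web.yaml

# Read back the effective maxUnavailable (prints 2 for this manifest;
# it would print the default of 1 if the field were omitted).
kubectl get statefulset web -o jsonpath='{.spec.updateStrategy.rollingUpdate.maxUnavailable}'
```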
I can trigger a rolling update by changing the image to `k8s.gcr.io/nginx-slim:0.9`. Once I initiate the rolling update, I can
watch the pods update 2 at a time, because the current value of `maxUnavailable` is 2. The output below shows a span of time and is
not complete. `maxUnavailable` can be an absolute number (for example, 2) or a percentage of desired Pods (for example, 10%); the
absolute number is calculated from the percentage by rounding down, so `maxUnavailable: 50%` on 5 replicas allows at most 2 Pods to
be unavailable.
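One way to change the image is with `kubectl set image` (the container name `nginx` matches the manifest above):

```shell
# Update the Pod template image, which starts the rolling update.
kubectl set image statefulset/web nginx=k8s.gcr.io/nginx-slim:0.9
```

In another terminal, I can watch the Pods: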
```shell
kubectl get pods --watch
```
```
NAME    READY   STATUS              RESTARTS   AGE
web-0   1/1     Running             0          85s
web-1   1/1     Running             0          2m6s
web-2   1/1     Running             0          106s
web-3   1/1     Running             0          2m47s
web-4   1/1     Running             0          2m27s
web-4   1/1     Terminating         0          5m43s   ----> start terminating 4
web-3   1/1     Terminating         0          6m3s    ----> start terminating 3
web-3   0/1     Terminating         0          6m7s
web-3   0/1     Pending             0          0s
web-3   0/1     Pending             0          0s
web-4   0/1     Terminating         0          5m48s
web-4   0/1     Terminating         0          5m48s
web-3   0/1     ContainerCreating   0          2s
web-3   1/1     Running             0          2s
web-4   0/1     Pending             0          0s
web-4   0/1     Pending             0          0s
web-4   0/1     ContainerCreating   0          0s
web-4   1/1     Running             0          1s
web-2   1/1     Terminating         0          5m46s   ----> start terminating 2 (only after both 4 and 3 are running)
web-1   1/1     Terminating         0          6m6s    ----> start terminating 1
web-2   0/1     Terminating         0          5m47s
web-1   0/1     Terminating         0          6m7s
web-1   0/1     Pending             0          0s
web-1   0/1     Pending             0          0s
web-1   0/1     ContainerCreating   0          1s
web-1   1/1     Running             0          2s
web-2   0/1     Pending             0          0s
web-2   0/1     Pending             0          0s
web-2   0/1     ContainerCreating   0          0s
web-2   1/1     Running             0          1s
web-0   1/1     Terminating         0          6m6s    ----> start terminating 0 (only after 2 and 1 are running)
web-0   0/1     Terminating         0          6m7s
web-0   0/1     Pending             0          0s
web-0   0/1     Pending             0          0s
web-0   0/1     ContainerCreating   0          0s
web-0   1/1     Running             0          1s
```
Note that as soon as the rolling update starts, both 4 and 3 (the two highest-ordinal pods) start terminating at the same time. Pods
with ordinal 4 and 3 may become ready at their own pace. As soon as both pods 4 and 3 are ready, pods 2 and 1 start terminating at
the same time. When pods 2 and 1 are both running and ready, pod 0 starts terminating.

In Kubernetes, updates to StatefulSets follow a strict ordering when updating Pods. In this example, the update starts at replica 4,
then replica 3, then replica 2, and so on, one pod at a time. When going one pod at a time, it's not possible for 3 to be running and
ready before 4. When `maxUnavailable` is more than 1 (in the example scenario I set `maxUnavailable` to 2), it is possible that
replica 3 becomes ready and running before replica 4 is ready, and that is ok. If you're a developer and you set `maxUnavailable` to
more than 1, you should know that this outcome is possible and you must ensure that your application can handle any such ordering
issues that occur. When you set `maxUnavailable` greater than 1, the ordering is guaranteed between each batch of pods being updated.
That guarantee means that pods in the second update batch (replicas 2 and 1) cannot start updating until the pods from the first
batch (replicas 4 and 3) are ready.

Although Kubernetes refers to these as _replicas_, your stateful application may have a different view, and each pod of the
StatefulSet may be holding completely different data than the other pods. The important thing here is that updates to StatefulSets
happen in batches, and you can now have a batch size larger than 1 (as an alpha feature).

Also note that the above behavior is with `podManagementPolicy: OrderedReady`. If you define a StatefulSet with
`podManagementPolicy: Parallel`, not only are `maxUnavailable` replicas terminated at the same time; `maxUnavailable` replicas also
start in the `ContainerCreating` phase at the same time. This is called bursting.
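For contrast, here is a minimal sketch of the fields you would change to observe that bursting behavior, assuming the same
StatefulSet as above:

```yaml
# Only the fields relevant to the rollout behavior are shown.
spec:
  podManagementPolicy: Parallel   # terminate AND create up to maxUnavailable pods at once
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
      partition: 0
```

Note that `podManagementPolicy` is immutable, so to try this you would recreate the StatefulSet rather than patch it in place.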
So, now you may have a lot of questions:

- What is the behavior when you set `podManagementPolicy: Parallel`?
- What is the behavior when you set `partition` to a value other than `0`? (One way to experiment with this is sketched below.)
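If you want to try the `partition` question for yourself, one way to move it on a live StatefulSet is a patch like the following
(a sketch; the field paths match the manifest above, and the value `3` is just an example):

```shell
# With partition set to 3, only pods with an ordinal >= 3 (web-3 and web-4)
# are updated during a rolling update; web-0 through web-2 are left alone.
kubectl patch statefulset web -p \
  '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":3,"maxUnavailable":2}}}}'
```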
It might be better to try it and see for yourself. This is an alpha feature, and the Kubernetes contributors are looking for
feedback on it. Did this help you achieve your stateful scenarios? Did you find a bug, or do you think the behavior as implemented
is not intuitive, or that it can break applications or catch them by surprise? Please
[open an issue](https://github.com/kubernetes/kubernetes/issues) to let us know.

## Further reading and next steps {#next-steps}

- [Maximum unavailable Pods](/docs/concepts/workloads/controllers/statefulset/#maximum-unavailable-pods)
- [KEP for MaxUnavailable for StatefulSet](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/961-maxunavailable-for-statefulset)
- [Implementation](https://github.com/kubernetes/kubernetes/pull/82162/files)
- [Enhancement Tracking Issue](https://github.com/kubernetes/enhancements/issues/961)