---
layout: blog
title: "Kubernetes 1.27: StatefulSet Start Ordinal Simplifies Migration"
date: 2023-04-28
slug: statefulset-start-ordinal
---

**Author**: Peter Schuurman (Google)

Kubernetes v1.26 introduced a new, alpha-level feature for
[StatefulSets](/docs/concepts/workloads/controllers/statefulset/) that controls
the ordinal numbering of Pod replicas. As of Kubernetes v1.27, this feature is
now beta. Ordinals can start from arbitrary non-negative numbers. This blog
post will discuss how this feature can be used.

## Background

StatefulSet ordinals provide sequential identities for Pod replicas. When using
[`OrderedReady` Pod management](/docs/tutorials/stateful-application/basic-stateful-set/#orderedready-pod-management),
Pods are created from ordinal index `0` up to `N-1`.

With Kubernetes today, orchestrating a StatefulSet migration across clusters is
challenging. Backup and restore solutions exist, but these require the
application to be scaled down to zero replicas prior to migration. In today's
fully connected world, even planned application downtime may not allow you to
meet your business goals. You could use
[Cascading Delete](/docs/tutorials/stateful-application/basic-stateful-set/#cascading-delete)
or
[On Delete](/docs/tutorials/stateful-application/basic-stateful-set/#on-delete)
to migrate individual Pods, but this is error-prone and tedious to manage. You
lose the self-healing benefit of the StatefulSet controller when your Pods fail
or are evicted.

Kubernetes v1.26 enables a StatefulSet to be responsible for a range of ordinals
within {0..N-1} (the ordinals 0, 1, ... up to N-1). With it, you can scale down
a range {0..k-1} in a source cluster, and scale up the complementary range
{k..N-1} in a destination cluster, while maintaining application availability.
This enables you to retain *at most one* semantics (meaning there is at most one
Pod with a given identity running in a StatefulSet) and
[Rolling Update](/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update)
behavior when orchestrating a migration across clusters.
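
As a concrete illustration, with N=6 replicas and k=5, the source cluster keeps
ordinals 0 through 4 while the destination cluster takes over ordinal 5. The
patches below are a hypothetical sketch (the StatefulSet name `my-app` is a
placeholder); the demo later in this post walks through the same idea with a
real workload.

```
# In the source cluster: keep ordinals 0-4
kubectl patch sts my-app -p '{"spec": {"replicas": 5}}'

# In the destination cluster: take over ordinal 5
kubectl patch sts my-app -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
```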

## Why would I want to use this feature?

Say you're running your StatefulSet in one cluster, and need to migrate it out
to a different cluster. There are many reasons why you would need to do this:
 * **Scalability**: Your StatefulSet has scaled too large for your cluster, and
   has started to disrupt the quality of service for other workloads in your
   cluster.
 * **Isolation**: You're running a StatefulSet in a cluster that is accessed
   by multiple users, and namespace isolation isn't sufficient.
 * **Cluster Configuration**: You want to move your StatefulSet to a different
   cluster to use some environment that is not available on your current
   cluster.
 * **Control Plane Upgrades**: You want to move your StatefulSet to a cluster
   running an upgraded control plane, and can't handle the risk or downtime of
   in-place control plane upgrades.

## How do I use it?

Enable the `StatefulSetStartOrdinal` feature gate on a cluster, and create a
StatefulSet with a customized `.spec.ordinals.start`.
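
Here is a minimal sketch of what that looks like in a StatefulSet manifest. The
name `my-app` and the container image are placeholders, not part of the demo
below; the only feature-specific field is `.spec.ordinals.start`.

```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: my-app
  replicas: 2
  ordinals:
    start: 5       # Pods are named my-app-5 and my-app-6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9   # placeholder image
```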

## Try it out

In this demo, I'll use the new mechanism to migrate a
StatefulSet from one Kubernetes cluster to another. The
[redis-cluster](https://github.com/bitnami/charts/tree/main/bitnami/redis-cluster)
Bitnami Helm chart will be used to install Redis.

Tools Required:
 * [yq](https://github.com/mikefarah/yq)
 * [helm](https://helm.sh/docs/helm/helm_install/)

### Pre-requisites {#demo-pre-requisites}

To do this, I need two Kubernetes clusters that can both access common
networking and storage; I've named my clusters `source` and `destination`.
Specifically, I need:

* The `StatefulSetStartOrdinal` feature gate enabled on both clusters.
* Client configuration for `kubectl` that lets me access both clusters as an
  administrator.
* The same `StorageClass` installed on both clusters, and set as the default
  StorageClass for both clusters. This `StorageClass` should provision
  underlying storage that is accessible from either or both clusters.
* A flat network topology that allows Pods to send and receive packets to and
  from Pods in either cluster. If you are creating clusters on a cloud
  provider, this configuration may be called private cloud or private network.

1. Create a demo namespace on both clusters:

   ```
   kubectl create ns kep-3335
   ```

2. Deploy a Redis cluster with six replicas in the source cluster:

   ```
   helm repo add bitnami https://charts.bitnami.com/bitnami
   helm install redis --namespace kep-3335 \
     bitnami/redis-cluster \
     --set persistence.size=1Gi \
     --set cluster.nodes=6
   ```

3. Check the replication status in the source cluster:

   ```
   kubectl exec -it redis-redis-cluster-0 -- /bin/bash -c \
     "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
   ```

   ```
   2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 myself,master - 0 1669764411000 3 connected 10923-16383
   7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669764410000 3 connected
   961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669764411000 1 connected
   7136e37d8864db983f334b85d2b094be47c830e5 10.104.0.15:6379@16379 slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669764412595 2 connected
   a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669764411592 1 connected 0-5460
   2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669764410000 2 connected 5461-10922
   ```

4. Deploy a Redis cluster with zero replicas in the destination cluster:

   ```
   helm install redis --namespace kep-3335 \
     bitnami/redis-cluster \
     --set persistence.size=1Gi \
     --set cluster.nodes=0 \
     --set redis.extraEnvVars\[0\].name=REDIS_NODES,redis.extraEnvVars\[0\].value="redis-redis-cluster-headless.kep-3335.svc.cluster.local" \
     --set existingSecret=redis-redis-cluster
   ```

5. Scale down the `redis-redis-cluster` StatefulSet in the source cluster by 1,
   to remove the replica `redis-redis-cluster-5`:

   ```
   kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 5}}'
   ```

6. Migrate dependencies from the source cluster to the destination cluster:

   The following commands copy resources from `source` to `destination`. Details
   that are not relevant in the `destination` cluster are removed (e.g. `uid`,
   `resourceVersion`, `status`).

   **Steps for the source cluster**

   Note: If using a `StorageClass` with `reclaimPolicy: Delete` configured, you
   should patch the PVs in `source` with `reclaimPolicy: Retain` prior to
   deletion to retain the underlying storage used in `destination`. See
   [Change the Reclaim Policy of a PersistentVolume](/docs/tasks/administer-cluster/change-pv-reclaim-policy/)
   for more details.
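
   For example, a sketch of that patch (`<pv-name>` is a placeholder; substitute
   the actual name of the PV bound to `redis-data-redis-redis-cluster-5`):

   ```
   kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
   ```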

   ```
   kubectl get pvc redis-data-redis-redis-cluster-5 -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .status)' > /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
   kubectl get pv $(yq '.spec.volumeName' /tmp/pvc-redis-data-redis-redis-cluster-5.yaml) -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .spec.claimRef, .status)' > /tmp/pv-redis-data-redis-redis-cluster-5.yaml
   kubectl get secret redis-redis-cluster -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion)' > /tmp/secret-redis-redis-cluster.yaml
   ```

   **Steps for the destination cluster**

   Note: For the PV/PVC, this procedure only works if the underlying storage
   system that your PVs use can support being copied into `destination`. Storage
   that is associated with a specific node or topology may not be supported.
   Additionally, some storage systems may store additional metadata about
   volumes outside of a PV object, and may require a more specialized
   sequence to import a volume.

   ```
   kubectl create -f /tmp/pv-redis-data-redis-redis-cluster-5.yaml
   kubectl create -f /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
   kubectl create -f /tmp/secret-redis-redis-cluster.yaml
   ```

7. Scale up the `redis-redis-cluster` StatefulSet in the destination cluster by
   1, with a start ordinal of 5:

   ```
   kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
   ```

8. Check the replication status in the destination cluster:

   ```
   kubectl exec -it redis-redis-cluster-5 -- /bin/bash -c \
     "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
   ```

   I should see that the new replica (labeled `myself`) has joined the Redis
   cluster (the IP address belongs to a different CIDR block than the
   replicas in the source cluster).

   ```
   2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669766684000 2 connected 5461-10922
   7136e37d8864db983f334b85d2b094be47c830e5 10.108.0.22:6379@16379 myself,slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669766685609 2 connected
   2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 master - 0 1669766684000 3 connected 10923-16383
   961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669766683600 1 connected
   a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669766685000 1 connected 0-5460
   7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669766686613 3 connected
   ```

9. Repeat steps #5 to #7 for the remainder of the replicas, until the
   Redis StatefulSet in the source cluster is scaled to 0, and the Redis
   StatefulSet in the destination cluster is healthy with 6 total replicas.
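
   For instance, the next round (migrating `redis-redis-cluster-4`) would look
   roughly like the following sketch, with the PV/PVC copy from step #6 repeated
   for `redis-data-redis-redis-cluster-4`:

   ```
   # Source cluster: scale down to keep ordinals 0-3
   kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 4}}'

   # Destination cluster: take over ordinals 4-5
   kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 4}, "replicas": 2}}'
   ```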

## What's Next?

This feature provides a building block for a StatefulSet to be split up across
clusters, but does not prescribe how the StatefulSet should be migrated.
Migration requires coordination of StatefulSet replicas, along with
orchestration of the storage and network layer. This depends on the storage
and connectivity requirements of the application installed by the StatefulSet.
Additionally, many StatefulSets are managed by
[operators](/docs/concepts/extend-kubernetes/operator/), which adds another
layer of complexity to migration.

If you're interested in building enhancements to make these processes easier,
get involved with
[SIG Multicluster](https://github.com/kubernetes/community/blob/master/sig-multicluster)
to contribute!