---
layout: blog
title: "Kubernetes 1.27: StatefulSet Start Ordinal Simplifies Migration"
date: 2023-04-28
slug: statefulset-start-ordinal
---

**Author**: Peter Schuurman (Google)

Kubernetes v1.26 introduced a new, alpha-level feature for
[StatefulSets](/docs/concepts/workloads/controllers/statefulset/) that controls
the ordinal numbering of Pod replicas. As of Kubernetes v1.27, this feature is
now beta. Ordinals can start from arbitrary non-negative numbers. This blog
post will discuss how this feature can be used.

## Background

StatefulSet ordinals provide sequential identities for Pod replicas. When using
[`OrderedReady` Pod management](/docs/tutorials/stateful-application/basic-stateful-set/#orderedready-pod-management),
Pods are created from ordinal index `0` up to `N-1`.

With Kubernetes today, orchestrating a StatefulSet migration across clusters is
challenging. Backup and restore solutions exist, but these require the
application to be scaled down to zero replicas prior to migration. In today's
fully connected world, even planned application downtime may not allow you to
meet your business goals. You could use
[Cascading Delete](/docs/tutorials/stateful-application/basic-stateful-set/#cascading-delete)
or
[On Delete](/docs/tutorials/stateful-application/basic-stateful-set/#on-delete)
to migrate individual Pods, but this is error-prone and tedious to manage. You
lose the self-healing benefit of the StatefulSet controller when your Pods fail
or are evicted.

Kubernetes v1.26 enables a StatefulSet to be responsible for a range of ordinals
within {0..N-1} (the ordinals 0, 1, ... up to N-1). With it, you can scale down
a range {0..k-1} in a source cluster, and scale up the complementary range
{k..N-1} in a destination cluster, while maintaining application availability.
This enables you to retain *at most one* semantics (meaning there is at most one
Pod with a given identity running in a StatefulSet) and
[Rolling Update](/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update)
behavior when orchestrating a migration across clusters.
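
As a concrete illustration, with N=6 replicas and k=5, the source cluster keeps
ordinals 0 through 4 while the destination cluster takes over ordinal 5. The
patches below are a hypothetical sketch (the StatefulSet name `my-app` is a
placeholder); the demo later in this post walks through the same idea with a
real workload.

```
# In the source cluster: keep ordinals 0-4
kubectl patch sts my-app -p '{"spec": {"replicas": 5}}'

# In the destination cluster: take over ordinal 5
kubectl patch sts my-app -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
```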

## Why would I want to use this feature?

Say you're running your StatefulSet in one cluster, and need to migrate it out
to a different cluster. There are many reasons why you would need to do this:
 * **Scalability**: Your StatefulSet has scaled too large for your cluster, and
   has started to disrupt the quality of service for other workloads in your
   cluster.
 * **Isolation**: You're running a StatefulSet in a cluster that is accessed
   by multiple users, and namespace isolation isn't sufficient.
 * **Cluster Configuration**: You want to move your StatefulSet to a different
   cluster to use some environment that is not available on your current
   cluster.
 * **Control Plane Upgrades**: You want to move your StatefulSet to a cluster
   running an upgraded control plane, and can't handle the risk or downtime of
   in-place control plane upgrades.

## How do I use it?

Enable the `StatefulSetStartOrdinal` feature gate on a cluster, and create a
StatefulSet with a customized `.spec.ordinals.start`.
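
Here is a minimal sketch of what that looks like in a StatefulSet manifest. The
name `my-app` and the container image are placeholders, not part of the demo
below; the only feature-specific field is `.spec.ordinals.start`.

```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: my-app
  replicas: 2
  ordinals:
    start: 5       # Pods are named my-app-5 and my-app-6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9   # placeholder image
```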

## Try it out

In this demo, I'll use the new mechanism to migrate a
StatefulSet from one Kubernetes cluster to another. The
[redis-cluster](https://github.com/bitnami/charts/tree/main/bitnami/redis-cluster)
Bitnami Helm chart will be used to install Redis.

Tools Required:
 * [yq](https://github.com/mikefarah/yq)
 * [helm](https://helm.sh/docs/helm/helm_install/)

### Pre-requisites {#demo-pre-requisites}

To do this, I need two Kubernetes clusters that can both access common
networking and storage; I've named my clusters `source` and `destination`.
Specifically, I need:

* The `StatefulSetStartOrdinal` feature gate enabled on both clusters.
* Client configuration for `kubectl` that lets me access both clusters as an
  administrator.
* The same `StorageClass` installed on both clusters, and set as the default
  StorageClass for both clusters. This `StorageClass` should provision
  underlying storage that is accessible from either or both clusters.
* A flat network topology that allows Pods to send and receive packets to and
  from Pods in either cluster. If you are creating clusters on a cloud
  provider, this configuration may be called private cloud or private network.

1. Create a demo namespace on both clusters:

   ```
   kubectl create ns kep-3335
   ```

2. Deploy a Redis cluster with six replicas in the source cluster:

   ```
   helm repo add bitnami https://charts.bitnami.com/bitnami
   helm install redis --namespace kep-3335 \
     bitnami/redis-cluster \
     --set persistence.size=1Gi \
     --set cluster.nodes=6
   ```

3. Check the replication status in the source cluster:

   ```
   kubectl exec -it redis-redis-cluster-0 -- /bin/bash -c \
     "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
   ```

   ```
   2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 myself,master - 0 1669764411000 3 connected 10923-16383
   7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669764410000 3 connected
   961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669764411000 1 connected
   7136e37d8864db983f334b85d2b094be47c830e5 10.104.0.15:6379@16379 slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669764412595 2 connected
   a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669764411592 1 connected 0-5460
   2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669764410000 2 connected 5461-10922
   ```

4. Deploy a Redis cluster with zero replicas in the destination cluster:

   ```
   helm install redis --namespace kep-3335 \
     bitnami/redis-cluster \
     --set persistence.size=1Gi \
     --set cluster.nodes=0 \
     --set redis.extraEnvVars\[0\].name=REDIS_NODES,redis.extraEnvVars\[0\].value="redis-redis-cluster-headless.kep-3335.svc.cluster.local" \
     --set existingSecret=redis-redis-cluster
   ```

5. Scale down the `redis-redis-cluster` StatefulSet in the source cluster by 1,
   to remove the replica `redis-redis-cluster-5`:

   ```
   kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 5}}'
   ```

6. Migrate dependencies from the source cluster to the destination cluster:

   The following commands copy resources from `source` to `destination`. Details
   that are not relevant in the `destination` cluster are removed (e.g. `uid`,
   `resourceVersion`, `status`).

   **Steps for the source cluster**

   Note: If using a `StorageClass` with `reclaimPolicy: Delete` configured, you
   should patch the PVs in `source` with `reclaimPolicy: Retain` prior to
   deletion to retain the underlying storage used in `destination`. See
   [Change the Reclaim Policy of a PersistentVolume](/docs/tasks/administer-cluster/change-pv-reclaim-policy/)
   for more details.
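
   For example, a sketch of that patch (`<pv-name>` is a placeholder; substitute
   the actual name of the PV bound to `redis-data-redis-redis-cluster-5`):

   ```
   kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
   ```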

   ```
   kubectl get pvc redis-data-redis-redis-cluster-5 -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .status)' > /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
   kubectl get pv $(yq '.spec.volumeName' /tmp/pvc-redis-data-redis-redis-cluster-5.yaml) -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .spec.claimRef, .status)' > /tmp/pv-redis-data-redis-redis-cluster-5.yaml
   kubectl get secret redis-redis-cluster -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion)' > /tmp/secret-redis-redis-cluster.yaml
   ```

   **Steps for the destination cluster**

   Note: For the PV/PVC, this procedure only works if the underlying storage
   system that your PVs use can support being copied into `destination`. Storage
   that is associated with a specific node or topology may not be supported.
   Additionally, some storage systems may store additional metadata about
   volumes outside of a PV object, and may require a more specialized
   sequence to import a volume.

   ```
   kubectl create -f /tmp/pv-redis-data-redis-redis-cluster-5.yaml
   kubectl create -f /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
   kubectl create -f /tmp/secret-redis-redis-cluster.yaml
   ```

7. Scale up the `redis-redis-cluster` StatefulSet in the destination cluster by
   1, with a start ordinal of 5:

   ```
   kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
   ```

8. Check the replication status in the destination cluster:

   ```
   kubectl exec -it redis-redis-cluster-5 -- /bin/bash -c \
     "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
   ```

   I should see that the new replica (labeled `myself`) has joined the Redis
   cluster (the IP address belongs to a different CIDR block than the
   replicas in the source cluster).

   ```
   2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669766684000 2 connected 5461-10922
   7136e37d8864db983f334b85d2b094be47c830e5 10.108.0.22:6379@16379 myself,slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669766685609 2 connected
   2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 master - 0 1669766684000 3 connected 10923-16383
   961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669766683600 1 connected
   a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669766685000 1 connected 0-5460
   7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669766686613 3 connected
   ```

9. Repeat steps #5 to #7 for the remainder of the replicas, until the
   Redis StatefulSet in the source cluster is scaled to 0, and the Redis
   StatefulSet in the destination cluster is healthy with 6 total replicas.
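
   For instance, the next round (migrating `redis-redis-cluster-4`) would look
   roughly like the following sketch, with the PV/PVC copy from step #6 repeated
   for `redis-data-redis-redis-cluster-4`:

   ```
   # Source cluster: scale down to keep ordinals 0-3
   kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 4}}'

   # Destination cluster: take over ordinals 4-5
   kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 4}, "replicas": 2}}'
   ```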

## What's Next?

This feature provides a building block for a StatefulSet to be split up across
clusters, but does not prescribe how the StatefulSet should be migrated.
Migration requires coordination of StatefulSet replicas, along with
orchestration of the storage and network layer. This depends on the storage
and connectivity requirements of the application installed by the StatefulSet.
Additionally, many StatefulSets are managed by
[operators](/docs/concepts/extend-kubernetes/operator/), which adds another
layer of complexity to migration.

If you're interested in building enhancements to make these processes easier,
get involved with
[SIG Multicluster](https://github.com/kubernetes/community/blob/master/sig-multicluster)
to contribute!