@@ -437,9 +437,9 @@ This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- * `pkg/apis/apps/validation/validation_test.go` - Tests that the .spec.ordinals.start value is properly validated.
- * `pkg/controller/statefulset/stateful_set_control_test.go` - Tests that a StatefulSet slice can be created from specified starting ordinal.
- * `pkg/registry/apps/statefulset/strategy_test.go` - Tests the create/update strategy of a StatefulSet with start ordinals. Also validates enablement/disablement of the feature.
+ - k8s.io/kubernetes/pkg/apis/apps/validation: 2023-02-05: 90.5%
+ - k8s.io/kubernetes/pkg/controller/statefulset: 2023-02-05: 85.7%
+ - k8s.io/kubernetes/pkg/registry/apps/statefulset: 2023-02-05: 65.2%

##### E2E tests

@@ -451,18 +451,15 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
https://storage.googleapis.com/k8s-triage/index.html
-->

- `Feature:StatefulSetStartOrdinal` in `k8s.io/kubernetes/test/e2e/apps/`.
-
- * Adding `ordinals.start`: Validate that setting `ordinals.start` to `k` causes StatefulSet ordinals to be scaled (pods `[0, k-1]` are terminated, pods `[N, N+k-1]` are created)
- * Increasing `ordinals.start`: Validate that increasing `ordinals.start` from `m` to `n` causes StatefulSet ordinals to be scaled (pods `[m, n-1]` are terminated, pods `[m+N, n+N-1]` are created)
- * Removing `ordinals.start`: Validate that setting `ordinals.start` causes StatefulSet ordinals to be scaled (pods `[N-1, N+k-1]` are terminated, pods `[0, k-1]` are created)
- * Decreasing `ordinals.start`: Validate that decreasing `ordinals.start` from `m` to `n` causes StatefulSet ordinals to be scaled (pods `[m+N, n+N-1]` are terminated, pods `[m, n-1]` are created)
+ - k8s.io/kubernetes/test/e2e/apps/statefulset
+ - [Testgrid](https://testgrid.k8s.io/google-gce#gce-cos-master-default)
+ - [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?test=TestStatefulSetStartOrdinal)

#### Integration tests

- `StatefulSetStartOrdinal` in `k8s.io/kubernetes/test/integration/statefulset`.
-
- * Pod Restart Tests: Validate that StatefulSet RollingUpdate behavior is preserved, with an replica ordinal offset starting at `ordinals.start`
+ - k8s.io/kubernetes/test/integration/statefulset.TestStatefulSetStartOrdinal
+ - [Testgrid](https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-integration)
+ - [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?test=TestStatefulSetStartOrdinal)

### Graduation Criteria

@@ -531,12 +528,13 @@ in back-to-back releases.
#### Alpha

* Feature functionality implemented but hidden behind a feature gate
- * Add unit, e2e and functional tests to automated k8s test.
+ * Add unit and integration tests

#### Beta

* Validate with user workloads
* Enable feature gate for e2e pipelines
+ * Add e2e tests

### Upgrade / Downgrade Strategy

@@ -719,6 +717,10 @@ a divergence between these fields during steady state operations, this can
indicate that the number of replicas being created by the StatefulSet does not
match the expected number of replicas.

+ On a large scale (across a large number of StatefulSets), the distribution of the
+ ratio of these two metrics should not change when enabling this feature. If this
+ ratio changes significantly after enabling this feature, it could indicate a problem.
+
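
The ratio comparison described above can be sketched in Go. This is only an illustration under stated assumptions: the `replicaCounts` struct and the `ratios` and `mean` helpers are hypothetical, the sample values are made up, and the sketch assumes the two metrics have already been read per StatefulSet from whatever monitoring system is in use.

```go
package main

import "fmt"

// replicaCounts is a hypothetical holder for one StatefulSet's samples of the
// two metrics discussed in the text.
type replicaCounts struct {
	Name   string
	Status float64 // sample of kube_statefulset_status_replicas
	Spec   float64 // sample of kube_statefulset_replicas
}

// ratios returns Status/Spec per StatefulSet. StatefulSets scaled to zero are
// skipped to avoid dividing by zero.
func ratios(samples []replicaCounts) map[string]float64 {
	out := make(map[string]float64, len(samples))
	for _, s := range samples {
		if s.Spec == 0 {
			continue
		}
		out[s.Name] = s.Status / s.Spec
	}
	return out
}

// mean returns the average ratio, or 0 when there are no ratios.
func mean(r map[string]float64) float64 {
	if len(r) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range r {
		sum += v
	}
	return sum / float64(len(r))
}

func main() {
	// Illustrative samples taken before and after enabling the feature.
	before := []replicaCounts{{"web", 3, 3}, {"db", 5, 5}}
	after := []replicaCounts{{"web", 3, 3}, {"db", 2, 5}}
	fmt.Printf("mean ratio before=%.2f after=%.2f\n",
		mean(ratios(before)), mean(ratios(after)))
}
```

A mean is used here only for brevity; comparing percentiles of the per-StatefulSet ratios would describe the distribution more faithfully.
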
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
@@ -727,6 +729,17 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

+ Manual upgrade->downgrade->upgrade scenario (to be validated):
+
+ - Create a cluster on a version that doesn't use this feature (e.g., 1.26)
+ - Upgrade the cluster to a version that uses this feature (e.g., 1.27)
+ - Install a StatefulSet that uses the `.spec.ordinals.start` field (e.g., `2`)
+ - Validate the StatefulSet creates the correct pods
+ - Downgrade the cluster to the prior version that doesn't use this feature
+ - Validate the StatefulSet follows the documented rollback scenario and pods are re-created so that the start ordinal is `0`
+ - Upgrade the cluster to the newer version that uses this feature
+ - Validate the StatefulSet pods are modified to start at `.spec.ordinals.start`
+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
@@ -755,8 +768,10 @@ logs or events for this purpose.

An operator can check the `.spec.ordinals.start` field on the StatefulSet to
determine if this StatefulSet has a non-default start ordinal defined. The
- operator can also check if the `statefulset_ordinals_start` metric is set. A
- non-zero value indicates it is in use.
+ operator can also check if the `kube_statefulset_ordinals_start` metric is set.
+ If `.spec.ordinals` is set on the StatefulSet, this metric will be populated.
+ This metric can be counted across StatefulSets in a Kubernetes cluster, to
+ identify the number of StatefulSets using this feature.

###### How can someone using this feature know that it is working for their instance?

@@ -790,8 +805,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

- The `statefulset_reconcile_delay` metric (time between StatefulSet reconciliation
- loops) should not significantly increase when using this feature.
+ This feature does not state an SLO.

For checking correctness, the `kube_statefulset_status_replicas` metric can be
compared against the `kube_statefulset_replicas` metric to check the expected
@@ -816,6 +830,9 @@ Pick one more of these and delete the rest.
- Metric name: `kube_statefulset_status_replicas`
- [Optional] Aggregation method: `gauge`
- Components exposing the metric: `pkg/controller/statefulset`
+ - Metric name: `kube_statefulset_ordinals_start`
+ - [Optional] Aggregation method: `gauge`
+ - Components exposing the metric: `pkg/controller/statefulset`

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -978,12 +995,40 @@ For each of them, fill in the following information by copying the below templat
- Testing: Are there any tests for failure mode? If not, describe why.
-->

- No other failure modes are known.
+ - Rollback: On feature rollback a user workload may be disrupted due to replica ordinal
+ changes. See
+ [Rollout, upgrade and rollback planning](#rollout-upgrade-and-rollback-planning)
+ for context.
+ - Detection: This issue can affect any workloads that are using a non-zero
+ `.spec.ordinals.start` field prior to rollback. StatefulSets that are using
+ this field can be identified through the
+ `kube_statefulset_ordinals_start` metric.
+ - Mitigations: To mitigate, pods can be orphaned from their StatefulSet by using
+ `--cascade=orphan` to prevent the StatefulSet from deleting replica pods
+ until the application operator has a chance to react to the feature rollback.
+ - Testing: Unit tests exist to validate that the storage specification is preserved
+ on rollback. This means that if the feature is re-enabled after rollback,
+ the `.spec.ordinals.start` field will be preserved on the StatefulSet.
+
+ <!-- TODO: Add Diagnostics details after adding logging to StatefulSet controller -->

###### What steps should be taken if SLOs are not being met to determine the problem?

- If the StatefulSet SLOs are not met, the kube-controller-manager should be
- restarted or examined/debugged.
+ The StatefulSet should be validated to check that the correct number of replicas is
+ running, and that the replica ordinal numbering matches what is specified in the
+ `.spec.ordinals.start` field. This can be done by looking at the running pods
+ and checking that they are numbered from `.spec.ordinals.start` to
+ `.spec.ordinals.start` + `.spec.replicas` - 1.
+
+ If this is not the case, it could indicate that the StatefulSet controller
+ is stuck reconciling. The StatefulSet controller creates new pod ordinals before
+ it deletes lower pod ordinals, so the controller may be stuck reconciling higher
+ order pods. This can happen if a higher order pod cannot be scheduled, so any
+ pending or terminating pods in the selector can be inspected to determine why
+ the StatefulSet is not reconciling to the expected `.spec.replicas` in status.
+
+ If further problems are experienced, the feature can be rolled back. Note the caveats
+ around [Rollback](#rollout-upgrade-and-rollback-planning) prior to doing so.

## Implementation History

@@ -998,9 +1043,9 @@ Major milestones might include:
- when the KEP was retired or superseded
-->

- - 1.26, KEP created.
- - 1.26, alpha implementation.
- - 1.27, beta implementation.
+ - 2022-06-02: KEP created.
+ - 2022-10-06: Alpha implementation.
+ - 2023-02-09: Beta implementation.

## Drawbacks
