Commit f76c230

Update StatefulSetStartOrdinal KEP to include more details on Beta FAQ
1 parent 3a1a7d6 commit f76c230

2 files changed: +69 -23 lines changed

keps/sig-apps/3335-statefulset-slice/README.md

Lines changed: 68 additions & 23 deletions
@@ -437,9 +437,9 @@ This can inform certain test coverage improvements that we want to do before
 extending the production code to implement this enhancement.
 -->

-* `pkg/apis/apps/validation/validation_test.go` - Tests that the .spec.ordinals.start value is properly validated.
-* `pkg/controller/statefulset/stateful_set_control_test.go` - Tests that a StatefulSet slice can be created from specified starting ordinal.
-* `pkg/registry/apps/statefulset/strategy_test.go` - Tests the create/update strategy of a StatefulSet with start ordinals. Also validates enablement/disablement of the feature.
+- k8s.io/kubernetes/pkg/apis/apps/validation: 2023-02-05: 90.5%
+- k8s.io/kubernetes/pkg/controller/statefulset: 2023-02-05: 85.7%
+- k8s.io/kubernetes/pkg/registry/apps/statefulset: 2023-02-05: 65.2%

 ##### E2E tests

@@ -451,18 +451,15 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
 https://storage.googleapis.com/k8s-triage/index.html
 -->

-`Feature:StatefulSetStartOrdinal` in `k8s.io/kubernetes/test/e2e/apps/`.
-
-* Adding `ordinals.start`: Validate that setting `ordinals.start` to `k` causes StatefulSet ordinals to be scaled (pods `[0, k-1]` are terminated, pods `[N, N+k-1]` are created)
-* Increasing `ordinals.start`: Validate that increasing `ordinals.start` from `m` to `n` causes StatefulSet ordinals to be scaled (pods `[m, n-1]` are terminated, pods `[m+N, n+N-1]` are created)
-* Removing `ordinals.start`: Validate that setting `ordinals.start` causes StatefulSet ordinals to be scaled (pods `[N-1, N+k-1]` are terminated, pods `[0, k-1]` are created)
-* Decreasing `ordinals.start`: Validate that decreasing `ordinals.start` from `m` to `n` causes StatefulSet ordinals to be scaled (pods `[m+N, n+N-1]` are terminated, pods `[m, n-1]` are created)
+- k8s.io/kubernetes/test/e2e/apps/statefulset
+- [Testgrid](https://testgrid.k8s.io/google-gce#gce-cos-master-default)
+- [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?test=TestStatefulSetStartOrdinal)

 #### Integration tests

-`StatefulSetStartOrdinal` in `k8s.io/kubernetes/test/integration/statefulset`.
-
-* Pod Restart Tests: Validate that StatefulSet RollingUpdate behavior is preserved, with an replica ordinal offset starting at `ordinals.start`
+- k8s.io/kubernetes/test/integration/statefulset.TestStatefulSetStartOrdinal
+- [Testgrid](https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-integration)
+- [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?test=TestStatefulSetStartOrdinal)

 ### Graduation Criteria

@@ -531,12 +528,13 @@ in back-to-back releases.
 #### Alpha

 * Feature functionality implemented but hidden behind a feature gate
-* Add unit, e2e and functional tests to automated k8s test.
+* Add unit and integration tests

 #### Beta

 * Validate with user workloads
 * Enable feature gate for e2e pipelines
+* Add e2e tests

 ### Upgrade / Downgrade Strategy

@@ -719,6 +717,10 @@ a divergence between these fields during steady state operations, this can
 indicate that the number of replicas being created by the StatefulSet do not
 match the expected number of replicas.

+On a large scale (across a large number of StatefulSets) the distribution of the
+ratio of these two metrics should not change when enabling this feature. If this
+ratio changes significantly after enabling this feature, it could indicate a problem.
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

 <!--
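The ratio check added in the hunk above could be sketched roughly as follows. This is an illustrative sketch only: it assumes the per-StatefulSet values of `kube_statefulset_replicas` (desired) and `kube_statefulset_status_replicas` (observed) have already been collected into dictionaries keyed by `(namespace, name)`. The helper names `replica_ratios` and `median_shift` are hypothetical and not part of the KEP or of any Kubernetes tooling.

```python
from statistics import median


def replica_ratios(desired, observed):
    """Return observed/desired replica ratios per StatefulSet.

    Both arguments map a hypothetical (namespace, name) key to a replica count.
    """
    ratios = {}
    for key, want in desired.items():
        if want <= 0:
            # Skip StatefulSets scaled to zero to avoid dividing by zero.
            continue
        ratios[key] = observed.get(key, 0) / want
    return ratios


def median_shift(before, after):
    """Difference between the median ratio after and before enabling the feature."""
    return median(after.values()) - median(before.values())
```

A median is only one possible summary of the distribution; a clearly negative shift after enabling the feature gate would be a reason to look more closely, not proof of a problem.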
@@ -727,6 +729,17 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->

+Manual upgrade->downgrade->upgrade scenario (to be validated):
+
+- Create a cluster on a version that doesn't use this feature (e.g. 1.26)
+- Upgrade the cluster to a version that uses this feature (e.g. 1.27)
+- Install a StatefulSet that uses the `.spec.ordinals.start` field (e.g. `2`)
+- Validate the StatefulSet creates the correct pods
+- Downgrade the cluster to the prior version that doesn't use this feature
+- Validate the StatefulSet follows the documented rollback scenario and pods are re-created so the start ordinal is `0`
+- Upgrade the cluster to the newer version that uses this feature
+- Validate the StatefulSet pods are modified to start at `.spec.ordinals.start`
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 <!--
@@ -755,8 +768,10 @@ logs or events for this purpose.

 An operator can check the `.spec.ordinals.start` field on the StatefulSet to
 determine if this StatefulSet has a non-default start ordinal defined. The
-operator can also check if the `statefulset_ordinals_start` metric is set. A
-non-zero value indicates it is in use.
+operator can also check if the `kube_statefulset_ordinals_start` metric is set.
+If `.spec.ordinals` is set on the StatefulSet, this metric will be populated.
+This metric can be counted across StatefulSets in a Kubernetes cluster, to
+identify the number of StatefulSets using this feature.

 ###### How can someone using this feature know that it is working for their instance?
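Counting the StatefulSets that expose `kube_statefulset_ordinals_start` could look roughly like the sketch below. It assumes the metric is scraped in Prometheus text exposition format and carries `namespace` and `statefulset` labels; those label names are an assumption based on common kube-state-metrics conventions, not something this KEP states. The function name `statefulsets_using_ordinals` is hypothetical.

```python
import re

# Matches one sample line of the metric, e.g.
#   kube_statefulset_ordinals_start{namespace="default",statefulset="web"} 2
SAMPLE = re.compile(
    r'^kube_statefulset_ordinals_start\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def statefulsets_using_ordinals(exposition_text):
    """Map (namespace, statefulset) -> start ordinal for every populated sample."""
    found = {}
    for line in exposition_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics are ignored
        labels = dict(LABEL.findall(match.group("labels")))
        # Label names below are assumed, see the lead-in.
        key = (labels.get("namespace"), labels.get("statefulset"))
        found[key] = float(match.group("value"))
    return found
```

Per the text above, the metric is populated whenever `.spec.ordinals` is set, so the sketch counts a sample with value `0` as using the feature; `len(statefulsets_using_ordinals(text))` gives the cluster-wide count.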

@@ -790,8 +805,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->

-The `statefulset_reconcile_delay` metric (time between StatefulSet reconciliation
-loops) should not significantly increase when using this feature.
+This feature does not define an SLO.

 For checking correctness, the `kube_statefulset_status_replicas` metric can be
 compared against the `kube_statefulset_replicas` metric to check the expected
@@ -816,6 +830,9 @@ Pick one more of these and delete the rest.
 - Metric name: `kube_statefulset_status_replicas`
 - [Optional] Aggregation method: `gauge`
 - Components exposing the metric: `pkg/controller/statefulset`
+- Metric name: `kube_statefulset_ordinals_start`
+- [Optional] Aggregation method: `gauge`
+- Components exposing the metric: `pkg/controller/statefulset`

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -978,12 +995,40 @@ For each of them, fill in the following information by copying the below templat
 - Testing: Are there any tests for failure mode? If not, describe why.
 -->

-No other failure modes are known.
+- Rollback: On feature rollback a user workload may be disrupted due to replica ordinal
+changes. See
+[Rollout, upgrade and rollback planning](#rollout-upgrade-and-rollback-planning)
+for context.
+- Detection: This issue can affect any workloads that are using a non-zero
+`.spec.ordinals.start` field prior to rollback. StatefulSets that are using
+this field can be identified through the
+`kube_statefulset_ordinals_start` metric.
+- Mitigations: To mitigate, pods can be orphaned from their StatefulSet by deleting
+the StatefulSet with `--cascade=orphan`, which prevents the StatefulSet from deleting
+replica pods until the application operator has a chance to react to the feature rollback.
+- Testing: Unit tests exist to validate that the storage specification is preserved
+on rollback. This means that if the feature is re-enabled after rollback,
+the `.spec.ordinals.start` field will be preserved on the StatefulSet.
+
+<!-- TODO: Add Diagnostics details after adding logging to StatefulSet controller -->

 ###### What steps should be taken if SLOs are not being met to determine the problem?

-If the StatefulSet SLOs are not met, the kube-controller-manager should be
-restarted or examined/debugged.
+The StatefulSet should be validated to check that the correct number of replicas is
+running, and that the replica ordinal numbering matches what is specified in the
+`.spec.ordinals.start` field. This can be done by looking at the running pods,
+and seeing if they are numbered from `.spec.ordinals.start` to
+`.spec.ordinals.start` + `.spec.replicas` - 1.
+
+If this is not the case, it could indicate that the StatefulSet controller
+is stuck reconciling. The StatefulSet controller creates new pod ordinals before
+it deletes lower pod ordinals, so the controller may be stuck reconciling higher
+order pods. This can happen if a higher order pod cannot be scheduled, so any
+pending or terminating pods in the selector can be inspected to determine why
+the StatefulSet is not reconciling to the expected `.spec.replicas` in status.
+
+If further problems are experienced, the feature can be rolled back. Note the caveats
+around [Rollback](#rollout-upgrade-and-rollback-planning) prior to doing so.

 ## Implementation History
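The ordinal check described in the hunk above can be illustrated with a short sketch. It relies on the standard StatefulSet pod naming scheme `<statefulset-name>-<ordinal>`. The helper names `expected_pod_names` and `diff_pods` are hypothetical, and the list of running pod names is assumed to have been collected separately (for example from a pod listing).

```python
def expected_pod_names(name, start, replicas):
    """Pod names for ordinals start .. start + replicas - 1 (inclusive)."""
    return [f"{name}-{ordinal}" for ordinal in range(start, start + replicas)]


def diff_pods(name, start, replicas, running):
    """Return (missing, unexpected) pod names relative to the expected set."""
    expected = set(expected_pod_names(name, start, replicas))
    actual = set(running)
    missing = sorted(expected - actual)      # expected but not running
    unexpected = sorted(actual - expected)   # running but outside the range
    return missing, unexpected
```

For a StatefulSet `web` with `.spec.ordinals.start: 2` and `.spec.replicas: 3`, the expected pods are `web-2`, `web-3` and `web-4`. A non-empty `missing` list for the higher ordinals combined with leftover lower ordinals in `unexpected` would be consistent with the stuck-reconcile case described above.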

@@ -998,9 +1043,9 @@ Major milestones might include:
 - when the KEP was retired or superseded
 -->

-- 1.26, KEP created.
-- 1.26, alpha implementation.
-- 1.27, beta implementation.
+- 2022-06-02: KEP created.
+- 2022-10-06: Alpha implementation.
+- 2023-02-09: Beta implementation.

 ## Drawbacks

keps/sig-apps/3335-statefulset-slice/kep.yaml

Lines changed: 1 addition & 0 deletions
@@ -38,3 +38,4 @@ disable-supported: true

 # The following PRR answers are required at beta release
 metrics:
+- kube_statefulset_ordinals_start
