3131 - [ Upgrade/downgrade & feature enabled/disable tests] ( #upgradedowngrade--feature-enableddisable-tests )
3232 - [ Graduation Criteria] ( #graduation-criteria )
3333 - [ Alpha release] ( #alpha-release )
34+ - [ Beta release] ( #beta-release )
3435 - [ Upgrade / Downgrade Strategy] ( #upgrade--downgrade-strategy )
3536 - [ Version Skew Strategy] ( #version-skew-strategy )
3637- [ Production Readiness Review Questionnaire] ( #production-readiness-review-questionnaire )
@@ -322,16 +323,16 @@ to implement this enhancement.
322323
323324##### Integration tests
324325
325- - ` test/integration/statefulset ` : ` 2022-06-15 ` : These do not appear to be
326+ - ` test/integration/statefulset ` : ` 2022-09-21 ` : These do not appear to be
326327 running in a job visible to the triage dashboard, see for example a search
327328 for the previously existing [ TestStatefulSetStatusWithPodFail] ( https://storage.googleapis.com/k8s-triage/index.html?test=TestStatefulSetStatusWithPodFail ) .
328329
329330Added ` TestAutodeleteOwnerRefs ` to ` k8s.io/kubernetes/test/integration/statefulset ` .
330331
331332##### E2E tests
332333
333- - ` ci-kuberentes-e2e- gci-gce-statefulset` : ` 2022-06-15 ` : ` 3/646 Failures`
334- - Note that as this is behind the ` StatefulSetAutoDeletePVC ` feature gate,
334+ - ` [gci-gce-statefulset](https://testgrid.k8s.io/google-gce# gci-gce-statefulset) ` : ` 2022-09-21 ` : ` 0 Failures`
335+ - Note that as this KEP is behind the ` StatefulSetAutoDeletePVC ` feature gate,
335336 tests for this KEP are not being run.
336337
337338Added ` Feature:StatefulSetAutoDeletePVC ` tests to ` k8s.io/kubernetes/test/e2e/apps/ ` .
@@ -351,8 +352,12 @@ mechanism to run upgrade/downgrade tests.
351352### Graduation Criteria
352353
353354#### Alpha release
354- - Complete adding the items in the 'Changes required' section.
355- - Add unit, functional, upgrade and downgrade tests to automated k8s test.
355+ - (Done) Complete adding the items in the 'Changes required' section.
356+ - (Done) Add unit, functional, upgrade and downgrade tests to automated k8s test.
357+
358+ #### Beta release
359+ - Validate with customer workloads
360+ - Enable feature gate for e2e pipelines
356361
357362### Upgrade / Downgrade Strategy
358363
@@ -427,11 +432,10 @@ are not involved so there is no version skew between nodes and the control plane
427432 happens during a stateful set scale down or delete.
428433
429434* ** What specific metrics should inform a rollback?**
430- The operator can monitor the ` statefulset_pvcs_owned_by_* ` metrics to see if
431- there are possible pending deletions. If consistent behavior is required, the
432- operator can wait for those metrics to stablize. For example,
433- ` statefulset_pvcs_owned_by_pod ` going to zero indicates all scale down
434- deletions are complete.
435+ The operator can monitor ` kube_persistentvolume_* ` metrics from
436+ kube-state-metrics to watch for large numbers of undeleted
437+ PersistentVolumes. If consistent behavior is required, the operator can wait
438+ for those metrics to stabilize.
435439
436440* ** Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
437441 Yes. The race condition wasn't exposed, but we confirmed the PVCs were updated correctly.
@@ -443,35 +447,32 @@ fields of API types, flags, etc.?**
443447
444448### Monitoring Requirements
445449
450+ Metrics are provided by ` kube-state-metrics ` unless otherwise noted.
451+
446452* ** How can an operator determine if the feature is in use by workloads?**
447- ` statefulset_when_deleted_policy ` or ` statefulset_when_scaled_policy ` will
448- have nonzero counts for the ` delete ` policy fields.
453+ `kube_statefulset_persistent_volume_claim_retention_policy` will have nonzero
454+ counts for the ` delete ` policy fields.
449455
450456* ** What are the SLIs (Service Level Indicators) an operator can use to determine
451457the health of the service?**
452- - Metric name: ` statefulset_reconcile_delay `
453- - [ Optional] Aggregation method: ` quantile `
454- - Components exposing the metric: ` pke/controller/statefulset `
455- - Metric name: ` statefulset_unhealthy_pods `
456- - [ Optional] Aggregation method: ` sum `
457- - Components exposing the metric: ` pke/controller/statefulset `
458+ - Metric name: ` kube_statefulset_status_replicas_current ` should be near
459+ ` kube_statefulset_status_replicas_ready ` .
460+ - [ Optional] Aggregation method: ` gauge `
461+ - Components exposing the metric: ` kube-state-metrics `
458462
459463* ** What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
460464
461- The reconcile delay (time between statefulset reconcilliation loops) should be
462- low. For example, the 99%ile should be at most minutes.
463-
464- This can be combined with the unhealthy pod count, although as unhealthy pods
465- are usually an application error rather than a problem with the stateful set
466- controller, this will be more a decision for the operator to decide on a
465+ `kube_statefulset_status_replicas_ready /
466+ kube_statefulset_status_replicas_current` should be near 1.0, although as
467+ unhealthy replicas are often an application error rather than a problem with
468+ the stateful set controller, this will need to be tuned by an operator on a
467469 per-cluster basis.
468470
469471* ** Are there any missing metrics that would be useful to have to improve observability
470472of this feature?**
471473
472- The stateful set controller has not had any metrics in the past despite it
473- being a core Kubernetes feature for some time. Hence which metrics are useful
474- in practice is an open question in spite of the stability of the feature.
474+ kube-state-metrics has filled the gap left by the traditional lack of
475+ metrics from core Kubernetes controllers.
475476
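The SLO above compares ready replicas to current replicas per StatefulSet. As a rough illustration only, the following sketch computes that ratio from already-scraped gauge values and lists the StatefulSets that fall below a threshold. The function names, the sample values, and the `0.95` threshold are assumptions made for this example; they are not part of the KEP or of kube-state-metrics.

```python
from typing import Dict, List


def readiness_ratios(ready: Dict[str, float],
                     current: Dict[str, float]) -> Dict[str, float]:
    """Return ready/current per StatefulSet, skipping sets with no current replicas."""
    ratios: Dict[str, float] = {}
    for name, cur in current.items():
        if cur <= 0:
            # A StatefulSet scaled to zero has no meaningful ratio.
            continue
        ratios[name] = ready.get(name, 0.0) / cur
    return ratios


def below_slo(ratios: Dict[str, float], threshold: float = 0.95) -> List[str]:
    """Names of StatefulSets whose ratio is under the (assumed) threshold."""
    return sorted(name for name, ratio in ratios.items() if ratio < threshold)


# Hypothetical gauge values keyed by StatefulSet name.
ready_replicas = {"web": 3, "db": 1}
current_replicas = {"web": 3, "db": 2, "idle": 0}

ratios = readiness_ratios(ready_replicas, current_replicas)
unhealthy = below_slo(ratios)
```

As the SLO text notes, a low ratio often reflects an application error rather than a controller problem, so an operator would tune the threshold per cluster.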
476477### Dependencies
477478
@@ -534,8 +535,10 @@ control plane returns.
534535
535536* ** What are other known failure modes?**
536537 - PVCs from a stateful set not being deleted as expected.
537- - Detection: This can be deteted by lower than expected counts of the
538- ` statefulset_pvcs_owned_by_* ` metrics and by an operator listing and examining PVCs.
538+ - Detection: This can be detected by higher than expected counts of
539+ ` kube_persistentvolumeclaim_status_phase{phase=Bound} ` , lower than
540+ expected counts of ` kube_persistentvolume_status_phase{phase=Released} ` ,
541+ and by an operator listing and examining PVCs.
539542 - Mitigations: We expect this to happen only if there are other,
540543 operator-installed, controllers that are also managing owner refs on
541544 PVCs. Any such PVCs can be deleted manually. The conflicting controllers
@@ -558,6 +561,7 @@ stateful set controller lives) should be examined and/or restarted.
558561
559562 - 1.21, KEP created.
560563 - 1.23, alpha implementation.
564+ - 1.26, graduation to beta.
561565
562566## Drawbacks
563567The StatefulSet field update is required.