- [Upgrade/downgrade & feature enabled/disable tests](#upgradedowngrade--feature-enableddisable-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha release](#alpha-release)
+ - [Beta release](#beta-release)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -322,16 +323,16 @@ to implement this enhancement.

##### Integration tests

- - `test/integration/statefulset`: `2022-06-15`: These do not appear to be
+ - `test/integration/statefulset`: `2022-09-21`: These do not appear to be
running in a job visible to the triage dashboard, see for example a search
for the previously existing [TestStatefulSetStatusWithPodFail](https://storage.googleapis.com/k8s-triage/index.html?test=TestStatefulSetStatusWithPodFail).

Added `TestAutodeleteOwnerRefs` to `k8s.io/kubernetes/test/integration/statefulset`.

##### E2E tests

- - `ci-kuberentes-e2e-gci-gce-statefulset`: `2022-06-15`: `3/646 Failures`
- - Note that as this is behind the `StatefulSetAutoDeletePVC` feature gate,
+ - `[gci-gce-statefulset](https://testgrid.k8s.io/google-gce#gci-gce-statefulset)`: `2022-09-21`: `0 Failures`
+ - Note that as this KEP is behind the `StatefulSetAutoDeletePVC` feature gate,
tests for this KEP are not being run.

Added `Feature:StatefulSetAutoDeletePVC` tests to `k8s.io/kubernetes/test/e2e/apps/`.
@@ -351,8 +352,12 @@ mechanism to run upgrade/downgrade tests.
### Graduation Criteria

#### Alpha release
- - Complete adding the items in the 'Changes required' section.
- - Add unit, functional, upgrade and downgrade tests to automated k8s test.
+ - (Done) Complete adding the items in the 'Changes required' section.
+ - (Done) Add unit, functional, upgrade and downgrade tests to automated k8s test.
+
+ #### Beta release
+ - Validate with customer workloads
+ - Enable feature gate for e2e pipelines

### Upgrade / Downgrade Strategy

@@ -427,11 +432,10 @@ are not involved so there is no version skew between nodes and the control plane
happens during a stateful set scale down or delete.

* **What specific metrics should inform a rollback?**
- The operator can monitor the `statefulset_pvcs_owned_by_*` metrics to see if
- there are possible pending deletions. If consistent behavior is required, the
- operator can wait for those metrics to stablize. For example,
- `statefulset_pvcs_owned_by_pod` going to zero indicates all scale down
- deletions are complete.
+ The operator can monitor `kube_persistentvolume_*` metrics from
+ kube-state-metrics to watch for large numbers of undeleted
+ PersistentVolumes. If consistent behavior is required, the operator can wait
+ for those metrics to stabilize.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
Yes. The race condition wasn't exposed, but we confirmed the PVCs were updated correctly.
@@ -443,35 +447,32 @@ fields of API types, flags, etc.?**

### Monitoring Requirements

+ Metrics are provided by `kube-state-metrics` unless otherwise noted.
+
* **How can an operator determine if the feature is in use by workloads?**
- `statefulset_when_deleted_policy` or `statefulset_when_scaled_policy` will
- have nonzero counts for the `delete` policy fields.
+ `kube_statefulset_persistent_volume_claim_retention_policy` will have nonzero
+ counts for the `delete` policy fields.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- - Metric name: `statefulset_reconcile_delay`
- - [Optional] Aggregation method: `quantile`
- - Components exposing the metric: `pke/controller/statefulset`
- - Metric name: `statefulset_unhealthy_pods`
- - [Optional] Aggregation method: `sum`
- - Components exposing the metric: `pke/controller/statefulset`
+ - Metric name: `kube_statefulset_status_replicas_current` should be near
+ `kube_statefulset_status_replicas_ready`.
+ - [Optional] Aggregation method: `gauge`
+ - Components exposing the metric: `kube-state-metrics`

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

- The reconcile delay (time between statefulset reconcilliation loops) should be
- low. For example, the 99%ile should be at most minutes.
-
- This can be combined with the unhealthy pod count, although as unhealthy pods
- are usually an application error rather than a problem with the stateful set
- controller, this will be more a decision for the operator to decide on a
+ `kube_statefulset_status_replicas_ready /
+ kube_statefulset_status_replicas_current` should be near 1.0, although as
+ unhealthy replicas are often an application error rather than a problem with
+ the stateful set controller, this will need to be tuned by an operator on a
per-cluster basis.
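
As a rough illustration of the ready/current ratio described above, the following sketch flags StatefulSets whose ratio falls under a threshold. It is not part of the KEP or of kube-state-metrics: the function names, the `(namespace, statefulset)` key, and the `0.95` default are assumptions made for the example, and the inputs are plain dictionaries of already-scraped gauge values rather than a real metrics client.

```python
from typing import Dict, List, Tuple

# (namespace, statefulset): an illustrative key, not a kube-state-metrics type.
Key = Tuple[str, str]


def ready_ratio(ready: float, current: float) -> float:
    """Return ready/current, treating a StatefulSet with no current replicas as healthy."""
    if current <= 0:
        return 1.0
    return ready / current


def below_slo(
    ready: Dict[Key, float],
    current: Dict[Key, float],
    threshold: float = 0.95,  # hypothetical default; tune per cluster
) -> List[Key]:
    """List StatefulSets whose ready/current ratio is under the threshold."""
    flagged = []
    for key, cur in sorted(current.items()):
        # A StatefulSet missing from `ready` is counted as having zero ready replicas.
        if ready_ratio(ready.get(key, 0.0), cur) < threshold:
            flagged.append(key)
    return flagged


if __name__ == "__main__":
    current = {("default", "web"): 3.0, ("default", "db"): 5.0}
    ready = {("default", "web"): 3.0, ("default", "db"): 4.0}
    print(below_slo(ready, current))
```

Because unhealthy replicas are often an application problem rather than a controller problem, the threshold is left as a parameter for the operator to set per cluster.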

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**

- The stateful set controller has not had any metrics in the past despite it
- being a core Kubernetes feature for some time. Hence which metrics are useful
- in practice is an open question in spite of the stability of the feature.
+ kube-state-metrics has filled the gap left by the traditional lack of metrics
+ from core Kubernetes controllers.

### Dependencies

@@ -534,8 +535,10 @@ control plane returns.

* **What are other known failure modes?**
- PVCs from a stateful set not being deleted as expected.
- - Detection: This can be deteted by lower than expected counts of the
- `statefulset_pvcs_owned_by_*` metrics and by an operator listing and examining PVCs.
+ - Detection: This can be detected by higher than expected counts of
+ `kube_persistentvolumeclaim_status_phase{phase="Bound"}`, lower than
+ expected counts of `kube_persistentvolume_status_phase{phase="Released"}`,
+ and by an operator listing and examining PVCs.
- Mitigations: We expect this to happen only if there are other,
operator-installed, controllers that are also managing owner refs on
PVCs. Any such PVCs can be deleted manually. The conflicting controllers
@@ -558,6 +561,7 @@ stateful set controller lives) should be examined and/or restarted.

- 1.21, KEP created.
- 1.23, alpha implementation.
+ - 1.26, graduation to beta.

## Drawbacks
The StatefulSet field update is required.