- [Graduation Criteria](#graduation-criteria)
- [Alpha release](#alpha-release)
- [Beta release](#beta-release)
+ - [GA release](#ga-release)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+ - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
+ - [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
+ - [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement)
+ - [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back)
+ - [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+ - [How can a rollout fail? Can it impact already running workloads?](#how-can-a-rollout-fail-can-it-impact-already-running-workloads)
+ - [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback)
+ - [Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested)
+ - [Is the rollout accompanied by any deprecations and/or removals of features, APIs,](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis)
- [Monitoring Requirements](#monitoring-requirements)
+ - [How can an operator determine if the feature is in use by workloads?](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads)
+ - [What are the SLIs (Service Level Indicators) an operator can use to determine](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine)
+ - [What are the reasonable SLOs (Service Level Objectives) for the above SLIs?](#what-are-the-reasonable-slos-service-level-objectives-for-the-above-slis)
+ - [Are there any missing metrics that would be useful to have to improve observability](#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability)
- [Dependencies](#dependencies)
+ - [Does this feature depend on any specific services running in the cluster?](#does-this-feature-depend-on-any-specific-services-running-in-the-cluster)
- [Scalability](#scalability)
+ - [Will enabling / using this feature result in any new API calls?](#will-enabling--using-this-feature-result-in-any-new-api-calls)
+ - [Will enabling / using this feature result in introducing new API types?](#will-enabling--using-this-feature-result-in-introducing-new-api-types)
+ - [Will enabling / using this feature result in any new calls to the cloud](#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud)
+ - [Will enabling / using this feature result in increasing size or count of](#will-enabling--using-this-feature-result-in-increasing-size-or-count-of)
+ - [Will enabling / using this feature result in increasing time taken by any](#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any)
+ - [Will enabling / using this feature result in non-negligible increase of](#will-enabling--using-this-feature-result-in-non-negligible-increase-of)
- [Troubleshooting](#troubleshooting)
+ - [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable)
+ - [What are other known failure modes?](#what-are-other-known-failure-modes)
+ - [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)

Items marked with (R) are required *prior to targeting to a milestone / release*.

- - [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+ - [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [X] (R) KEP approvers have approved the KEP status as `implementable`
- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- - [ ] (R) Graduation criteria is in place
- - [ ] (R) Production readiness review completed
- - [ ] (R) Production readiness review approved
- - [ ] "Implementation History" section is up-to-date for milestone
+ - [X] e2e Tests for all Beta API Operations (endpoints)
+ - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+ - [X] (R) Graduation criteria is in place
+ - [X] (R) Production readiness review completed
+ - [X] (R) Production readiness review approved
+ - [X] "Implementation History" section is up-to-date for milestone
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

@@ -356,8 +382,11 @@ mechanism to run upgrade/downgrade tests.
- (Done) Add unit, functional, upgrade and downgrade tests to automated k8s test.

#### Beta release
+ - (Done) Enable feature gate for e2e pipelines
+
+ #### GA release
- Validate with customer workloads
- - Enable feature gate for e2e pipelines
+

### Upgrade / Downgrade Strategy

@@ -378,21 +407,21 @@ are not involved so there is no version skew between nodes and the control plane

### Feature Enablement and Rollback

- * **How can this feature be enabled / disabled in a live cluster?**
+ ##### How can this feature be enabled / disabled in a live cluster?
- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: StatefulSetAutoDeletePVC
  - Components depending on the feature gate
    - kube-controller-manager, which orchestrates the volume deletion.
    - kube-apiserver, to manage the new policy field in the StatefulSet
      resource (eg dropDisabledFields).
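
As a rough illustration of the `dropDisabledFields` idea mentioned above, the following Python sketch clears the gated policy field from an incoming StatefulSet spec when the feature gate is off. The function name, the dict layout, the JSON-style field name and the rule of preserving a field that the stored object already uses are all assumptions made for illustration; this is not the kube-apiserver code, which is written in Go.

```python
from copy import deepcopy
from typing import Optional

# Assumed JSON-style name of the gated field, for illustration only.
POLICY_FIELD = "persistentVolumeClaimRetentionPolicy"


def drop_disabled_fields(new_spec: dict, old_spec: Optional[dict], gate_enabled: bool) -> dict:
    """Return a copy of new_spec with the gated field cleared when appropriate.

    Assumption: the field is kept if the gate is on, or if the stored (old)
    object already sets it, so that existing objects are not silently changed.
    """
    result = deepcopy(new_spec)
    already_in_use = old_spec is not None and old_spec.get(POLICY_FIELD) is not None
    if not gate_enabled and not already_in_use:
        # Gate off and field not previously persisted: ignore the new field.
        result.pop(POLICY_FIELD, None)
    return result
```

With the gate disabled and no stored object, the returned spec simply lacks the policy field, which matches the "field is ignored" behavior described for rollback; the input spec is never mutated.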

- * **Does enabling the feature change any default behavior?**
+ ##### Does enabling the feature change any default behavior?
No. What happens during StatefulSet deletion differs from current behavior
only when the user explicitly specifies the
`PersistentVolumeClaimRetentionPolicy`. Hence there is no change in
user-visible behavior by default.

- * **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?**
+ ##### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Disabling the feature gate will cause the new field to be ignored. If the feature
gate is re-enabled, the new behavior will start working.

@@ -405,7 +434,7 @@ are not involved so there is no version skew between nodes and the control plane
be discovered during feature testing. In any case the mitigation will be to
manually delete any PVCs.

- * **What happens if we reenable the feature if it was previously rolled back?**
+ ##### What happens if we reenable the feature if it was previously rolled back?
In the simple case of reenabling the feature without concurrent StatefulSet
deletion or scale-down, nothing needs to be done when the deletion policy has
`whenScaled` set to `Delete`. When the policy has `whenDeleted` set to `Delete`, the
@@ -414,14 +443,14 @@ are not involved so there is no version skew between nodes and the control plane
As above, if there is a concurrent scale-down or StatefulSet deletion, more
care needs to be taken. This will be detailed further during feature testing.
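
To make the interaction between the policy fields and the feature gate concrete, here is a minimal Python sketch of the decision described above. The `RetentionPolicy` class, the event names `"scale-down"` and `"set-deleted"`, and the function name are hypothetical; the `Retain` default is an assumption about the API, and the sketch deliberately ignores the concurrent scale-down and deletion races that the text warns about.

```python
from dataclasses import dataclass

RETAIN = "Retain"   # assumed default policy value
DELETE = "Delete"


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical mirror of the policy's whenDeleted / whenScaled fields."""
    when_deleted: str = RETAIN
    when_scaled: str = RETAIN


def should_delete_pvc(policy: RetentionPolicy, event: str, gate_enabled: bool) -> bool:
    """Decide whether a PVC would be removed for the given StatefulSet event."""
    if not gate_enabled:
        # With the gate off the field is ignored, so PVCs are always kept.
        return False
    if event == "scale-down":
        return policy.when_scaled == DELETE
    if event == "set-deleted":
        return policy.when_deleted == DELETE
    raise ValueError(f"unknown event: {event!r}")
```

Re-enabling the gate corresponds to flipping `gate_enabled` back to `True`: the same stored policy then starts producing deletions again without any change to the object.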

- * **Are there any tests for feature enablement/disablement?**
+ ##### Are there any tests for feature enablement/disablement?
Feature enablement and disablement tests will be added, including for
StatefulSet behavior during transitions in conjunction with scale-down or
deletion.

### Rollout, Upgrade and Rollback Planning

- * **How can a rollout fail? Can it impact already running workloads?**
+ ##### How can a rollout fail? Can it impact already running workloads?
If there is a control plane update which disables the feature while a stateful
set is in the process of being deleted or scaled down, it is undefined which
PVCs will be deleted. Before the update, PVCs will be marked for deletion;
@@ -431,52 +460,52 @@ are not involved so there is no version skew between nodes and the control plane
an operator that there is an essential race condition when a cluster update
happens during a stateful set scale down or delete.

- * **What specific metrics should inform a rollback?**
+ ##### What specific metrics should inform a rollback?
The operator can monitor `kube_persistent_volume_*` metrics from
kube-state-metrics to watch for large numbers of undeleted
PersistentVolumes. If consistent behavior is required, the operator can wait
for those metrics to stabilize.

- * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+ ##### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Yes. The race condition wasn't exposed, but we confirmed the PVCs were updated correctly.

- * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
- fields of API types, flags, etc.?**
+ ##### Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+ fields of API types, flags, etc.?
Enabling the feature also enables the `PersistentVolumeClaimRetentionPolicy`
API field.

### Monitoring Requirements

Metrics are provided by `kube-state-metrics` unless otherwise noted.

- * **How can an operator determine if the feature is in use by workloads?**
+ ##### How can an operator determine if the feature is in use by workloads?
`kube_statefulset_persistent_volume_claim_retention_policy` will have nonzero
counts for the `delete` policy fields.

- * **What are the SLIs (Service Level Indicators) an operator can use to determine
- the health of the service?**
+ ##### What are the SLIs (Service Level Indicators) an operator can use to determine
+ the health of the service?
- Metric name: `kube_statefulset_status_replicas_current` should be near
  `kube_statefulset_status_replicas_ready`.
- [Optional] Aggregation method: `gauge`
- Components exposing the metric: `kube-state-metrics`

- * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+ ##### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

`kube_statefulset_status_replicas_ready /
kube_statefulset_status_replicas_current` should be near 1.0, although as
unhealthy replicas are often an application error rather than a problem with
the stateful set controller, this will need to be tuned by an operator on a
per-cluster basis.
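
The ready-to-current ratio above can be sketched as a small check. This is only an illustration: the function names are hypothetical, the `0.95` default threshold is an arbitrary placeholder for the per-cluster tuning the text calls for, and treating a StatefulSet with zero current replicas as healthy is an assumption.

```python
def replica_health_ratio(ready: int, current: int) -> float:
    """Ratio of ready replicas to current replicas for one StatefulSet.

    Assumption: a set with no current replicas is treated as healthy (1.0)
    to avoid dividing by zero.
    """
    if current == 0:
        return 1.0
    return ready / current


def meets_slo(ready: int, current: int, threshold: float = 0.95) -> bool:
    """True when the ratio is at or above the operator-chosen threshold."""
    return replica_health_ratio(ready, current) >= threshold
```

An operator would feed in the two gauge values scraped for a given StatefulSet and choose `threshold` based on how often application-level failures, rather than controller problems, keep replicas from becoming ready.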

- * **Are there any missing metrics that would be useful to have to improve observability
- of this feature?**
+ ##### Are there any missing metrics that would be useful to have to improve observability
+ of this feature?

kube-state-metrics has filled a gap in the traditional lack of metrics from
core Kubernetes controllers.

### Dependencies

- * **Does this feature depend on any specific services running in the cluster?**
+ ##### Does this feature depend on any specific services running in the cluster?

No, outside of depending on the scheduler, the garbage collector and volume
management (provisioning, attaching, etc) as does almost anything in
@@ -486,7 +515,7 @@ of this feature?**

### Scalability

- * **Will enabling / using this feature result in any new API calls?**
+ ##### Will enabling / using this feature result in any new API calls?

Yes and no. This feature will result in additional resource deletion calls, which will
scale like the number of pods in the stateful set (ie, one PVC per pod and possibly one
@@ -499,41 +528,44 @@ of this feature?**
there shouldn't be much overall increase beyond the second-order effect of
this feature allowing more automation.

- * **Will enabling / using this feature result in introducing new API types?**
+ ##### Will enabling / using this feature result in introducing new API types?
No.

- * **Will enabling / using this feature result in any new calls to the cloud
- provider?**
+ ##### Will enabling / using this feature result in any new calls to the cloud
+ provider?
PVC deletion may cause PV deletion, depending on reclaim policy, which will result in
cloud provider calls through the volume API. However, as noted above, these calls would
have been happening anyway, manually.

- * **Will enabling / using this feature result in increasing size or count of
- the existing API objects?**
+ ##### Will enabling / using this feature result in increasing size or count of
+ the existing API objects?
- PVC, new ownerRef.
- StatefulSet, new field

- * **Will enabling / using this feature result in increasing time taken by any
- operations covered by existing SLIs/SLOs?**
+ ##### Will enabling / using this feature result in increasing time taken by any
+ operations covered by existing SLIs/SLOs?
No. (There are currently no StatefulSet SLOs?)

Note that scale-up may be slower when volumes were deleted by scale-down. This
is by design of the feature.

- * **Will enabling / using this feature result in non-negligible increase of
- resource usage (CPU, RAM, disk, IO, ...) in any components?**
+ ##### Will enabling / using this feature result in non-negligible increase of
+ resource usage (CPU, RAM, disk, IO, ...) in any components?
+ No.
+
+ ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.

### Troubleshooting

- * **How does this feature react if the API server and/or etcd is unavailable?**
+ ##### How does this feature react if the API server and/or etcd is unavailable?

PVC deletion will be paused. If the control plane becomes unavailable in the middle
of a stateful set being deleted or scaled down, there may be deleted Pods whose
PVCs have not yet been deleted. Deletion will continue normally after the
control plane returns.

- * **What are other known failure modes?**
+ ##### What are other known failure modes?
- PVCs from a stateful set not being deleted as expected.
  - Detection: This can be detected by higher than expected counts of
    `kube_persistentvolumeclaim_status_phase{phase=Bound}`, lower than
@@ -548,7 +580,7 @@ control plane returns.
`StatefulSet` controller, but Kubernetes does not test against external
custom controllers.

- * **What steps should be taken if SLOs are not being met to determine the problem?**
+ ##### What steps should be taken if SLOs are not being met to determine the problem?

Stateful set SLOs are new with this feature and are in process of being
evaluated. If they are not being met, the kube-controller-manager (where the