Commit 1a918e1

Merge pull request #3847 from mattcary/update-127
KEP-1847: Update for 1.27 template
2 parents ad7c79d + 7aa1cbb commit 1a918e1

1 file changed

+68
-36
lines changed
  • keps/sig-apps/1847-autoremove-statefulset-pvcs

keps/sig-apps/1847-autoremove-statefulset-pvcs/README.md

Lines changed: 68 additions & 36 deletions
@@ -32,15 +32,39 @@
3232
- [Graduation Criteria](#graduation-criteria)
3333
- [Alpha release](#alpha-release)
3434
- [Beta release](#beta-release)
35+
- [GA release](#ga-release)
3536
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
3637
- [Version Skew Strategy](#version-skew-strategy)
3738
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
3839
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
40+
- [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
41+
- [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
42+
- [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement)
43+
- [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back)
44+
- [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement)
3945
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
46+
- [How can a rollout fail? Can it impact already running workloads?](#how-can-a-rollout-fail-can-it-impact-already-running-workloads)
47+
- [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback)
48+
- [Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested)
49+
- [Is the rollout accompanied by any deprecations and/or removals of features, APIs,](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis)
4050
- [Monitoring Requirements](#monitoring-requirements)
51+
- [How can an operator determine if the feature is in use by workloads?](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads)
52+
- [What are the SLIs (Service Level Indicators) an operator can use to determine](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine)
53+
- [What are the reasonable SLOs (Service Level Objectives) for the above SLIs?](#what-are-the-reasonable-slos-service-level-objectives-for-the-above-slis)
54+
- [Are there any missing metrics that would be useful to have to improve observability](#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability)
4155
- [Dependencies](#dependencies)
56+
- [Does this feature depend on any specific services running in the cluster?](#does-this-feature-depend-on-any-specific-services-running-in-the-cluster)
4257
- [Scalability](#scalability)
58+
- [Will enabling / using this feature result in any new API calls?](#will-enabling--using-this-feature-result-in-any-new-api-calls)
59+
- [Will enabling / using this feature result in introducing new API types?](#will-enabling--using-this-feature-result-in-introducing-new-api-types)
60+
- [Will enabling / using this feature result in any new calls to the cloud](#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud)
61+
- [Will enabling / using this feature result in increasing size or count of](#will-enabling--using-this-feature-result-in-increasing-size-or-count-of)
62+
- [Will enabling / using this feature result in increasing time taken by any](#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any)
63+
- [Will enabling / using this feature result in non-negligible increase of](#will-enabling--using-this-feature-result-in-non-negligible-increase-of)
4364
- [Troubleshooting](#troubleshooting)
65+
- [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable)
66+
- [What are other known failure modes?](#what-are-other-known-failure-modes)
67+
- [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem)
4468
- [Implementation History](#implementation-history)
4569
- [Drawbacks](#drawbacks)
4670
- [Alternatives](#alternatives)
@@ -50,14 +74,16 @@
5074

5175
Items marked with (R) are required *prior to targeting to a milestone / release*.
5276

53-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
77+
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
5478
- [X] (R) KEP approvers have approved the KEP status as `implementable`
5579
- [X] (R) Design details are appropriately documented
5680
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
57-
- [ ] (R) Graduation criteria is in place
58-
- [ ] (R) Production readiness review completed
59-
- [ ] (R) Production readiness review approved
60-
- [ ] "Implementation History" section is up-to-date for milestone
81+
- [X] e2e Tests for all Beta API Operations (endpoints)
82+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
83+
- [X] (R) Graduation criteria is in place
84+
- [X] (R) Production readiness review completed
85+
- [X] (R) Production readiness review approved
86+
- [X] "Implementation History" section is up-to-date for milestone
6187
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
6288
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
6389

@@ -356,8 +382,11 @@ mechanism to run upgrade/downgrade tests.
356382
- (Done) Add unit, functional, upgrade and downgrade tests to automated k8s test.
357383

358384
#### Beta release
385+
- (Done) Enable feature gate for e2e pipelines
386+
387+
#### GA release
359388
- Validate with customer workloads
360-
- Enable feature gate for e2e pipelines
389+
361390

362391
### Upgrade / Downgrade Strategy
363392

@@ -378,21 +407,21 @@ are not involved so there is no version skew between nodes and the control plane
378407

379408
### Feature Enablement and Rollback
380409

381-
* **How can this feature be enabled / disabled in a live cluster?**
410+
##### How can this feature be enabled / disabled in a live cluster?
382411
- [x] Feature gate (also fill in values in `kep.yaml`)
383412
- Feature gate name: StatefulSetAutoDeletePVC
384413
- Components depending on the feature gate
385414
- kube-controller-manager, which orchestrates the volume deletion.
386415
- kube-apiserver, to manage the new policy field in the StatefulSet
387416
resource (eg dropDisabledFields).
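
As a rough illustration of the kube-apiserver side of the gate, the sketch below models the drop-disabled-fields pattern in Python: when the gate is off, the new policy field is cleared on write unless the stored object already sets it. This is a simplified model under stated assumptions, not the Kubernetes implementation. The function name `drop_disabled_fields`, the dictionary layout, and the preserve-if-already-set rule are illustrative assumptions.

```python
from typing import Optional

# Field key used for illustration; the real object is a typed API struct.
POLICY_FIELD = "persistentVolumeClaimRetentionPolicy"


def drop_disabled_fields(new_spec: dict, old_spec: Optional[dict], gate_enabled: bool) -> dict:
    """Return a copy of new_spec with the policy field dropped when the gate is off.

    Assumption: the field is preserved if the stored (old) object already
    sets it, so an update with the gate disabled does not strip existing data.
    """
    result = dict(new_spec)
    if gate_enabled:
        return result
    old_uses_field = old_spec is not None and old_spec.get(POLICY_FIELD) is not None
    if not old_uses_field:
        result.pop(POLICY_FIELD, None)
    return result
```

With the gate disabled, a create request (no old object) loses the field, while an update to an object that already carries the field keeps it.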
388417

389-
* **Does enabling the feature change any default behavior?**
418+
##### Does enabling the feature change any default behavior?
390419
No. What happens during StatefulSet deletion differs from current behavior
391420
only when the user explicitly specifies the
392421
`PersistentVolumeClaimDeletePolicy`. Hence there is no user-visible
393422
behavior change by default.
394423

395-
* **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?**
424+
##### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
396425
Yes. Disabling the feature gate will cause the new field to be ignored. If the feature
397426
gate is re-enabled, the new behavior will start working.
398427

@@ -405,7 +434,7 @@ are not involved so there is no version skew between nodes and the control plane
405434
be discovered during feature testing. In any case the mitigation will be to
406435
manually delete any PVCs.
407436

408-
* **What happens if we reenable the feature if it was previously rolled back?**
437+
##### What happens if we reenable the feature if it was previously rolled back?
409438
In the simple case of reenabling the feature without concurrent StatefulSet
410439
deletion or scale-down, nothing needs to be done when the deletion policy has
411440
`whenScaled` set to `Delete`. When the policy has `whenDeleted` set to `Delete`, the
@@ -414,14 +443,14 @@ are not involved so there is no version skew between nodes and the control plane
414443
As above, if there is a concurrent scale-down or StatefulSet deletion, more
415444
care needs to be taken. This will be detailed further during feature testing.
416445

417-
* **Are there any tests for feature enablement/disablement?**
446+
##### Are there any tests for feature enablement/disablement?
418447
Feature enablement and disablement tests will be added, including for
419448
StatefulSet behavior during transitions in conjunction with scale-down or
420449
deletion.
421450

422451
### Rollout, Upgrade and Rollback Planning
423452

424-
* **How can a rollout fail? Can it impact already running workloads?**
453+
##### How can a rollout fail? Can it impact already running workloads?
425454
If there is a control plane update which disables the feature while a stateful
426455
set is in the process of being deleted or scaled down, it is undefined which
427456
PVCs will be deleted. Before the update, PVCs will be marked for deletion;
@@ -431,52 +460,52 @@ are not involved so there is no version skew between nodes and the control plane
431460
an operator that there is an essential race condition when a cluster update
432461
happens during a stateful set scale down or delete.
433462

434-
* **What specific metrics should inform a rollback?**
463+
##### What specific metrics should inform a rollback?
435464
The operator can monitor `kube_persistent_volume_*` metrics from
436465
kube-state-metrics to watch for large numbers of undeleted
437466
PersistentVolumes. If consistent behavior is required, the operator can wait
438467
for those metrics to stabilize.
439468

440-
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
469+
##### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
441470
Yes. The race condition wasn't exposed, but we confirmed the PVCs were updated correctly.
442471

443-
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
444-
fields of API types, flags, etc.?**
472+
##### Is the rollout accompanied by any deprecations and/or removals of features, APIs,
473+
fields of API types, flags, etc.?
445474
Enabling the feature also enables the `PersistentVolumeClaimRetentionPolicy`
446475
API field.
447476

448477
### Monitoring Requirements
449478

450479
Metrics are provided by `kube-state-metrics` unless otherwise noted.
451480

452-
* **How can an operator determine if the feature is in use by workloads?**
481+
##### How can an operator determine if the feature is in use by workloads?
453482
`kube_statefulset_persistent_volume_claim_retention_policy` will have nonzero
454483
counts for the `delete` policy fields.
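
The same usage check can be approximated directly from StatefulSet objects. The sketch below is a minimal Python example that counts how many StatefulSets set `whenDeleted` or `whenScaled` to `Delete`. It assumes each StatefulSet is already available as a plain dictionary (for example, parsed from an API listing). The helper names `count_delete_policies` and `feature_in_use` are hypothetical, and this is not how kube-state-metrics computes its metric.

```python
def count_delete_policies(statefulsets):
    """Count StatefulSets whose retention policy fields are set to Delete."""
    counts = {"whenDeleted": 0, "whenScaled": 0}
    for sts in statefulsets:
        # A missing policy is treated as "not Delete" for both fields.
        policy = sts.get("spec", {}).get("persistentVolumeClaimRetentionPolicy") or {}
        for field in counts:
            if policy.get(field) == "Delete":
                counts[field] += 1
    return counts


def feature_in_use(statefulsets):
    """True if any StatefulSet opts in to automatic PVC deletion."""
    return any(count_delete_policies(statefulsets).values())
```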
455484

456-
* **What are the SLIs (Service Level Indicators) an operator can use to determine
457-
the health of the service?**
485+
##### What are the SLIs (Service Level Indicators) an operator can use to determine
486+
the health of the service?
458487
- Metric name: `kube_statefulset_status_replicas_current` should be near
459488
`kube_statefulset_stats_replicas_ready`.
460489
- [Optional] Aggregation method: `gauge`
461490
- Components exposing the metric: `kube-state-metrics`
462491

463-
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
492+
##### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
464493

465494
`kube_statefulset_stats_replicas_ready /
466495
kube_statefulset_stats_replicas_current` should be near 1.0, although as
467496
unhealthy replicas are often an application error rather than a problem with
468497
the stateful set controller, this will need to be tuned by an operator on a
469498
per-cluster basis.
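
As an illustration of the ratio described above, the following Python sketch computes ready over current replicas and compares it with an operator-chosen threshold. The threshold value `0.95` and the function names are assumptions made for the example. The text deliberately leaves the actual target to per-cluster tuning.

```python
def ready_ratio(ready: int, current: int) -> float:
    """Ratio of ready replicas to current replicas for one StatefulSet."""
    if current == 0:
        # No current replicas: treat as healthy rather than dividing by zero.
        return 1.0
    return ready / current


def meets_slo(ready: int, current: int, threshold: float = 0.95) -> bool:
    """Check the ratio against a hypothetical, operator-tuned threshold."""
    return ready_ratio(ready, current) >= threshold
```

For example, `meets_slo(3, 4)` is false under the default threshold, since the ratio is 0.75.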
470499

471-
* **Are there any missing metrics that would be useful to have to improve observability
472-
of this feature?**
500+
##### Are there any missing metrics that would be useful to have to improve observability
501+
of this feature?
473502

474503
kube-state-metrics has filled a gap left by the traditional lack of metrics from
475504
core Kubernetes controllers.
476505

477506
### Dependencies
478507

479-
* **Does this feature depend on any specific services running in the cluster?**
508+
##### Does this feature depend on any specific services running in the cluster?
480509

481510
No, outside of depending on the scheduler, the garbage collector and volume
482511
management (provisioning, attaching, etc) as does almost anything in
@@ -486,7 +515,7 @@ of this feature?**
486515

487516
### Scalability
488517

489-
* **Will enabling / using this feature result in any new API calls?**
518+
##### Will enabling / using this feature result in any new API calls?
490519

491520
Yes and no. This feature will result in additional resource deletion calls, which will
492521
scale with the number of pods in the stateful set (i.e., one PVC per pod and possibly one
@@ -499,41 +528,44 @@ of this feature?**
499528
there shouldn't be much overall increase beyond the second-order effect of
500529
this feature allowing more automation.
501530

502-
* **Will enabling / using this feature result in introducing new API types?**
531+
##### Will enabling / using this feature result in introducing new API types?
503532
No.
504533

505-
* **Will enabling / using this feature result in any new calls to the cloud
506-
provider?**
534+
##### Will enabling / using this feature result in any new calls to the cloud
535+
provider?
507536
PVC deletion may cause PV deletion, depending on reclaim policy, which will result in
508537
cloud provider calls through the volume API. However, as noted above, these calls would
509538
have been happening anyway, manually.
510539

511-
* **Will enabling / using this feature result in increasing size or count of
512-
the existing API objects?**
540+
##### Will enabling / using this feature result in increasing size or count of
541+
the existing API objects?
513542
- PVC, new ownerRef.
514543
- StatefulSet, new field
515544

516-
* **Will enabling / using this feature result in increasing time taken by any
517-
operations covered by existing SLIs/SLOs?**
545+
##### Will enabling / using this feature result in increasing time taken by any
546+
operations covered by existing SLIs/SLOs?
518547
No. (There are currently no StatefulSet SLOs?)
519548

520549
Note that scale-up may be slower when volumes were deleted by scale-down. This
521550
is by design of the feature.
522551

523-
* **Will enabling / using this feature result in non-negligible increase of
524-
resource usage (CPU, RAM, disk, IO, ...) in any components?**
552+
##### Will enabling / using this feature result in non-negligible increase of
553+
resource usage (CPU, RAM, disk, IO, ...) in any components?
554+
No.
555+
556+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
525557
No.
526558

527559
### Troubleshooting
528560

529-
* **How does this feature react if the API server and/or etcd is unavailable?**
561+
##### How does this feature react if the API server and/or etcd is unavailable?
530562

531563
PVC deletion will be paused. If the control plane becomes unavailable in the middle
532564
of a stateful set being deleted or scaled down, there may be deleted Pods whose
533565
PVCs have not yet been deleted. Deletion will continue normally after the
534566
control plane returns.
535567

536-
* **What are other known failure modes?**
568+
##### What are other known failure modes?
537569
- PVCs from a stateful set not being deleted as expected.
538570
- Detection: This can be detected by higher than expected counts of
539571
`kube_persistentvolumeclaim_status_phase{phase=Bound}`, lower than
@@ -548,7 +580,7 @@ control plane returns.
548580
`StatefulSet` controller, but Kubernetes does not test against external
549581
custom controller.
550582

551-
* **What steps should be taken if SLOs are not being met to determine the problem?**
583+
##### What steps should be taken if SLOs are not being met to determine the problem?
552584

553585
Stateful set SLOs are new with this feature and are in process of being
554586
evaluated. If they are not being met, the kube-controller-manager (where the
