Commit 942ae79

Merge pull request kubernetes#3535 from mattcary/ss-126
1847: Update for kube-state-metrics and 1.26 beta graduation
2 parents f84bea8 + 74555e9 commit 942ae79

2 files changed: +35 -31 lines changed

keps/sig-apps/1847-autoremove-statefulset-pvcs/README.md

Lines changed: 33 additions & 29 deletions
Original file line number | Diff line number | Diff line change
@@ -31,6 +31,7 @@
3131
- [Upgrade/downgrade & feature enabled/disable tests](#upgradedowngrade--feature-enableddisable-tests)
3232
- [Graduation Criteria](#graduation-criteria)
3333
- [Alpha release](#alpha-release)
34+
- [Beta release](#beta-release)
3435
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
3536
- [Version Skew Strategy](#version-skew-strategy)
3637
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -322,16 +323,16 @@ to implement this enhancement.
322323

323324
##### Integration tests
324325

325-
- `test/integration/statefulset`: `2022-06-15`: These do not appear to be
326+
- `test/integration/statefulset`: `2022-09-21`: These do not appear to be
326327
running in a job visible to the triage dashboard, see for example a search
327328
for the previously existing [TestStatefulSetStatusWithPodFail](https://storage.googleapis.com/k8s-triage/index.html?test=TestStatefulSetStatusWithPodFail).
328329

329330
Added `TestAutodeleteOwnerRefs` to `k8s.io/kubernetes/test/integration/statefulset`.
330331

331332
##### E2E tests
332333

333-
- `ci-kuberentes-e2e-gci-gce-statefulset`: `2022-06-15`: `3/646 Failures`
334-
- Note that as this is behind the `StatefulSetAutoDeletePVC` feature gate,
334+
- `[gci-gce-statefulset](https://testgrid.k8s.io/google-gce#gci-gce-statefulset)`: `2022-09-21`: `0 Failures`
335+
- Note that as this KEP is behind the `StatefulSetAutoDeletePVC` feature gate,
335336
tests for this KEP are not being run.
336337

337338
Added `Feature:StatefulSetAutoDeletePVC` tests to `k8s.io/kubernetes/test/e2e/apps/`.
@@ -351,8 +352,12 @@ mechanism to run upgrade/downgrade tests.
351352
### Graduation Criteria
352353

353354
#### Alpha release
354-
- Complete adding the items in the 'Changes required' section.
355-
- Add unit, functional, upgrade and downgrade tests to automated k8s test.
355+
- (Done) Complete adding the items in the 'Changes required' section.
356+
- (Done) Add unit, functional, upgrade and downgrade tests to automated k8s test.
357+
358+
#### Beta release
359+
- Validate with customer workloads
360+
- Enable feature gate for e2e pipelines
356361

357362
### Upgrade / Downgrade Strategy
358363

@@ -427,11 +432,10 @@ are not involved so there is no version skew between nodes and the control plane
427432
happens during a stateful set scale down or delete.
428433

429434
* **What specific metrics should inform a rollback?**
430-
The operator can monitor the `statefulset_pvcs_owned_by_*` metrics to see if
431-
there are possible pending deletions. If consistent behavior is required, the
432-
operator can wait for those metrics to stablize. For example,
433-
`statefulset_pvcs_owned_by_pod` going to zero indicates all scale down
434-
deletions are complete.
435+
The operator can monitor `kube_persistent_volume_*` metrics from
436+
kube-state-metrics to watch for large numbers of undeleted
437+
PersistentVolumes. If consistent behavior is required, the operator can wait
438+
for those metrics to stabilize.
435439

436440
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
437441
Yes. The race condition wasn't exposed, but we confirmed the PVCs were updated correctly.
@@ -443,35 +447,32 @@ fields of API types, flags, etc.?**
443447

444448
### Monitoring Requirements
445449

450+
Metrics are provided by `kube-state-metrics` unless otherwise noted.
451+
446452
* **How can an operator determine if the feature is in use by workloads?**
447-
`statefulset_when_deleted_policy` or `statefulset_when_scaled_policy` will
448-
have nonzero counts for the `delete` policy fields.
453+
`kube_statefulset_persistent_volume_claim_retention_policy` will have nonzero
454+
counts for the `delete` policy fields.
449455

450456
* **What are the SLIs (Service Level Indicators) an operator can use to determine
451457
the health of the service?**
452-
- Metric name: `statefulset_reconcile_delay`
453-
- [Optional] Aggregation method: `quantile`
454-
- Components exposing the metric: `pke/controller/statefulset`
455-
- Metric name: `statefulset_unhealthy_pods`
456-
- [Optional] Aggregation method: `sum`
457-
- Components exposing the metric: `pke/controller/statefulset`
458+
- Metric name: `kube_statefulset_status_replicas_current` should be near
459+
`kube_statefulset_status_replicas_ready`.
460+
- [Optional] Aggregation method: `gauge`
461+
- Components exposing the metric: `kube-state-metrics`
458462

459463
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
460464

461-
The reconcile delay (time between statefulset reconcilliation loops) should be
462-
low. For example, the 99%ile should be at most minutes.
463-
464-
This can be combined with the unhealthy pod count, although as unhealthy pods
465-
are usually an application error rather than a problem with the stateful set
466-
controller, this will be more a decision for the operator to decide on a
465+
`kube_statefulset_status_replicas_ready /
466+
kube_statefulset_status_replicas_current` should be near 1.0, although as
467+
unhealthy replicas are often an application error rather than a problem with
468+
the stateful set controller, this will need to be tuned by an operator on a
467469
per-cluster basis.
468470

469471
* **Are there any missing metrics that would be useful to have to improve observability
470472
of this feature?**
471473

472-
The stateful set controller has not had any metrics in the past despite it
473-
being a core Kubernetes feature for some time. Hence which metrics are useful
474-
in practice is an open question in spite of the stability of the feature.
474+
kube-state-metrics has filled the gap left by the traditional lack of metrics from
475+
core Kubernetes controllers.
475476

476477
### Dependencies
477478

@@ -534,8 +535,10 @@ control plane returns.
534535

535536
* **What are other known failure modes?**
536537
- PVCs from a stateful set not being deleted as expected.
537-
- Detection: This can be deteted by lower than expected counts of the
538-
`statefulset_pvcs_owned_by_*` metrics and by an operator listing and examining PVCs.
538+
- Detection: This can be detected by higher than expected counts of
539+
`kube_persistentvolumeclaim_status_phase{phase=Bound}`, lower than
540+
expected counts of `kube_persistentvolume_status_phase{phase=Released}`,
541+
and by an operator listing and examining PVCs.
539542
- Mitigations: We expect this to happen only if there are other,
540543
operator-installed, controllers that are also managing owner refs on
541544
PVCs. Any such PVCs can be deleted manually. The conflicting controllers
@@ -558,6 +561,7 @@ stateful set controller lives) should be examined and/or restarted.
558561

559562
- 1.21, KEP created.
560563
- 1.23, alpha implementation.
564+
- 1.26, graduation to beta.
561565

562566
## Drawbacks
563567
The StatefulSet field update is required.
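
The SLO proposed in the Monitoring Requirements changes above (ready replicas divided by current replicas should be near 1.0) can be illustrated with a small sketch. This is not part of the commit. It assumes the two gauge values, `kube_statefulset_status_replicas_ready` and `kube_statefulset_status_replicas_current` from kube-state-metrics, have already been read for one StatefulSet. The function names and the `0.95` default threshold are hypothetical; the KEP leaves the threshold to be tuned per cluster.

```python
from typing import Optional


def ready_ratio(ready: float, current: float) -> Optional[float]:
    """Return ready / current, or None when there are no current replicas."""
    if current <= 0:
        # A StatefulSet scaled to zero has no meaningful ratio.
        return None
    return ready / current


def meets_slo(ready: float, current: float, threshold: float = 0.95) -> bool:
    """Report whether the ready/current ratio is at or above the threshold.

    The threshold is an assumed, operator-chosen value. A StatefulSet with
    zero current replicas is treated as healthy because nothing can be
    unready.
    """
    ratio = ready_ratio(ready, current)
    return True if ratio is None else ratio >= threshold
```

For example, `meets_slo(3, 4)` is false under the assumed default because the ratio is 0.75, while `meets_slo(10, 10)` is true. Since unready replicas are often an application problem rather than a controller problem, an operator would likely evaluate this over a time window before alerting.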

keps/sig-apps/1847-autoremove-statefulset-pvcs/kep.yaml

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -20,11 +20,11 @@ approvers:
2020

2121
stage: beta
2222

23-
latest-milestone: "v1.25"
23+
latest-milestone: "v1.26"
2424

2525
milestone:
2626
alpha: "v1.23"
27-
beta: "v1.25"
27+
beta: "v1.26"
2828
stable: "v1.27"
2929

3030
feature-gates:
