Commit 851990a

Merge pull request kubernetes#2824 from ravisantoshgudimetla/add-minReadySeconds-beta
Promote STS minReadySeconds to beta
2 parents d3b2045 + 70800c1 commit 851990a

3 files changed

+37
-10
lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 2599
 alpha:
   approver: "@ehashman"
+beta:
+  approver: "@ehashman"

keps/sig-apps/2599-minreadyseconds-for-statefulsets/README.md

Lines changed: 33 additions & 8 deletions
@@ -403,16 +403,23 @@ This section must be completed when targeting beta to a release.
 Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?
 -->
+It shouldn't impact already running workloads. This is an opt-in feature since
+users need to explicitly set the minReadySeconds parameter in the StatefulSet spec, i.e., the `.spec.minReadySeconds` field.
+If the feature is disabled, the field is preserved if it was already set in the persisted StatefulSet object; otherwise it is silently dropped.

 ###### What specific metrics should inform a rollback?

 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
+We have a metric called `kube_statefulset_status_replicas_available`
+which we added recently to track the number of available replicas. The cluster-admin could use
+this metric to track problems. If the value equals the number of `Ready` replicas immediately, without waiting for `minReadySeconds`, or if it is `0`, this can be considered a feature failure.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
-
+Manually tested. No issues were found when we enabled the feature gate -> disabled it ->
+re-enabled the feature gate. We still need to test the upgrade -> downgrade -> upgrade scenario.
 <!--
 Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
@@ -424,7 +431,7 @@ are missing a bunch of machinery and tooling and can't do that now.
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
-
+None
 ### Monitoring Requirements

 <!--
@@ -438,19 +445,21 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
+By checking the `kube_statefulset_status_replicas_available` metric. If all the `Ready` replicas are accounted for in `kube_statefulset_status_replicas_available` after waiting for `minReadySeconds`, we can consider the feature to be in use by workloads.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 <!--
 Pick one more of these and delete the rest.
 -->

-- [ ] Metrics
-  - Metric name:
+- [x] Metrics
+  - Metric name: `kube_statefulset_status_replicas_available`
   - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+  - Components exposing the metric: kube-controller-manager via kube-state-metrics. [PR which adds the metric](https://github.com/kubernetes/kube-state-metrics/pull/1532)
+
+The `kube_statefulset_status_replicas_available` metric gives the number of available replicas.
+Comparing it with the `kube_statefulset_status_replicas_ready` metric should give us an understanding of the health of the feature. There should be times when `kube_statefulset_status_replicas_available` lags behind `kube_statefulset_status_replicas_ready` for a duration of `minReadySeconds`. This lag defines the correctness of the functionality.

 ###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

@@ -463,6 +472,7 @@ high level (needs more precise definitions) those may be things like:
 job creation time) for cron job <= 10%
 - 99,9% of /health requests per day finish with 200 code
 -->
+All `Available` pods should have been `Ready` for at least the time specified in `.spec.minReadySeconds` 99% of the time.

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -493,6 +503,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its outage on the feature:
 - Impact of its degraded performance or high-error rates on the feature:
 -->
+None. It is part of the StatefulSet controller.

 ### Scalability

@@ -589,6 +600,8 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+The controller won't be able to make progress; all currently queued resources are re-queued. This feature does not change the current behavior of the controller in this regard.
+
 ###### What are other known failure modes?

 <!--
@@ -603,11 +616,23 @@ For each of them, fill in the following information by copying the below templat
 Not required until feature graduated to beta.
 - Testing: Are there any tests for failure mode? If not, describe why.
 -->
+- `minReadySeconds` not respected and all the pods are shown as `Available` immediately
+  - Detection: Looking at the `kube_statefulset_status_replicas_available` metric
+  - Mitigations: Disable the `StatefulSetMinReadySeconds` feature gate
+  - Diagnostics: Controller-manager logs when started at log level 4 and above
+  - Testing: Yes, e2e tests are already in place
+- `minReadySeconds` not respected and none of the pods are shown as `Available` after `minReadySeconds`
+  - Detection: Looking at the `kube_statefulset_status_replicas_available` metric. None of the pods will be shown as available
+  - Mitigations: Disable the `StatefulSetMinReadySeconds` feature gate
+  - Diagnostics: Controller-manager logs when started at log level 4 and above
+  - Testing: Yes, e2e tests are already in place

 ###### What steps should be taken if SLOs are not being met to determine the problem?

 ## Implementation History
-
+- 2021-04-29: Initial KEP merged
+- 2021-06-15: Initial implementation PR merged
+- 2021-07-14: Graduation of the feature to beta proposed
 <!--
 Major milestones in the lifecycle of a KEP should be tracked in this section.
 Major milestones might include:
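
The README changes above rely on the distinction between `Ready` and `Available` replicas. The following is a minimal Python sketch of that distinction, not the StatefulSet controller's actual code: `PodStatus`, `is_available`, and `count_replicas` are hypothetical names introduced only for illustration, and the sketch assumes a pod counts as available once it has been ready for at least `minReadySeconds`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


@dataclass
class PodStatus:
    """Hypothetical, simplified pod view: only the field this sketch needs."""
    ready_since: Optional[datetime]  # None means the pod is not Ready


def is_available(pod: PodStatus, min_ready_seconds: int, now: datetime) -> bool:
    """Assumption: available = Ready for at least min_ready_seconds."""
    if pod.ready_since is None:
        return False
    return now - pod.ready_since >= timedelta(seconds=min_ready_seconds)


def count_replicas(
    pods: Iterable[PodStatus], min_ready_seconds: int, now: datetime
) -> Tuple[int, int]:
    """Return (ready, available), mirroring the two metrics being compared."""
    pods = list(pods)
    ready = sum(1 for p in pods if p.ready_since is not None)
    available = sum(1 for p in pods if is_available(p, min_ready_seconds, now))
    return ready, available
```

Under these assumptions, an available count that trails the ready count for up to `minReadySeconds` is the expected behavior. An available count that matches the ready count immediately, or that never catches up, corresponds to the two failure modes listed in the README.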

keps/sig-apps/2599-minreadyseconds-for-statefulsets/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -19,12 +19,12 @@ see-also:



 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.22"
+latest-milestone: "v1.23"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
