keps/sig-apps/2599-minreadyseconds-for-statefulsets/README.md
Lines changed: 33 additions & 8 deletions
@@ -403,16 +403,23 @@ This section must be completed when targeting beta to a release.
403
403
Try to be as paranoid as possible - e.g., what if some components will restart
404
404
mid-rollout?
405
405
-->
406
+
It shouldn't impact already running workloads. This is an opt-in feature since
407
+
users need to explicitly set the `minReadySeconds` parameter in the StatefulSet spec, i.e. the `.spec.minReadySeconds` field.
408
+
If the feature is disabled, the field is preserved if it was already set in the persisted StatefulSet object; otherwise it is silently dropped.
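The preserve-or-drop rule described above can be sketched in Go. This is a minimal illustration, not the actual Kubernetes code: `StatefulSetSpec` is a simplified stand-in type, and `dropDisabledMinReadySeconds` and its `gateEnabled` parameter are hypothetical names chosen for this example.

```go
package main

// StatefulSetSpec is a simplified, hypothetical stand-in for the real API
// type; only the field relevant to this sketch is included.
type StatefulSetSpec struct {
	MinReadySeconds int32
}

// dropDisabledMinReadySeconds clears MinReadySeconds on the incoming spec when
// the feature gate is off, unless the persisted (old) spec already uses the
// field. oldSpec is nil when the object is being created.
func dropDisabledMinReadySeconds(gateEnabled bool, newSpec, oldSpec *StatefulSetSpec) {
	if gateEnabled {
		return // feature enabled: keep whatever the user set
	}
	if oldSpec != nil && oldSpec.MinReadySeconds != 0 {
		return // field already in use in the persisted object: preserve it
	}
	newSpec.MinReadySeconds = 0 // silently drop the field
}
```

Under these assumptions, creating a StatefulSet with the gate disabled zeroes the field, while updating an object that already had it set leaves the new value untouched.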
406
409
407
410
###### What specific metrics should inform a rollback?
408
411
409
412
<!--
410
413
What signals should users be paying attention to when the feature is young
411
414
that might indicate a serious problem?
412
415
-->
416
+
We have a metric called `kube_statefulset_status_replicas_available`
417
+
which we added recently to track the number of available replicas. A cluster admin can use
418
+
this metric to track problems. If the value immediately equals the number of `Ready` replicas, or if it is `0`, it can be considered a feature failure.
413
419
414
420
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
415
-
421
+
Manually tested. No issues were found when we enabled the feature gate -> disabled it ->
422
+
re-enabled the feature gate. We still need to test the upgrade -> downgrade -> upgrade scenario.
416
423
<!--
417
424
Describe manual testing that was done and the outcomes.
418
425
Longer term, we may want to require automated upgrade/rollback tests, but we
@@ -424,7 +431,7 @@ are missing a bunch of machinery and tooling and can't do that now.
424
431
<!--
425
432
Even if applying deprecation policies, they may still surprise some users.
426
433
-->
427
-
434
+
None
428
435
### Monitoring Requirements
429
436
430
437
<!--
@@ -438,19 +445,21 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
438
445
checking if there are objects with field X set) may be a last resort. Avoid
439
446
logs or events for this purpose.
440
447
-->
448
+
By checking the `kube_statefulset_status_replicas_available` metric. If all the `Ready` replicas are accounted for in `kube_statefulset_status_replicas_available` after waiting for `minReadySeconds`, we can consider the feature to be in use by workloads.
441
449
442
450
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Components exposing the metric: kube-controller-manager via kube-state-metrics. [PR which adds the metric](https://github.com/kubernetes/kube-state-metrics/pull/1532)
460
+
461
+
The `kube_statefulset_status_replicas_available` metric gives the number of available replicas. Since the
462
+
`kube_statefulset_status_replicas_available` metric tracks available replicas, comparing it with the `kube_statefulset_status_replicas_ready` metric should give us an understanding of the health of the feature. At times, `kube_statefulset_status_replicas_available` should lag behind `kube_statefulset_status_replicas_ready` for a duration of `minReadySeconds`. This lag indicates that the feature is working correctly.
454
463
455
464
###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
456
465
@@ -463,6 +472,7 @@ high level (needs more precise definitions) those may be things like:
463
472
job creation time) for cron job <= 10%
464
473
- 99,9% of /health requests per day finish with 200 code
465
474
-->
475
+
99% of the time, every pod reported as `Available` should have been `Ready` for at least the time specified in `.spec.minReadySeconds`.
466
476
467
477
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
468
478
@@ -493,6 +503,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
493
503
- Impact of its outage on the feature:
494
504
- Impact of its degraded performance or high-error rates on the feature:
495
505
-->
506
+
None. It is part of the StatefulSet controller.
496
507
497
508
### Scalability
498
509
@@ -589,6 +600,8 @@ details). For now, we leave it here.
589
600
590
601
###### How does this feature react if the API server and/or etcd is unavailable?
591
602
603
+
The controller won't be able to make progress; all currently queued resources are re-queued. This feature does not change the current behavior of the controller in this regard.
604
+
592
605
###### What are other known failure modes?
593
606
594
607
<!--
@@ -603,11 +616,23 @@ For each of them, fill in the following information by copying the below templat
603
616
Not required until feature graduated to beta.
604
617
- Testing: Are there any tests for failure mode? If not, describe why.
605
618
-->
619
+
- `minReadySeconds` is not respected and all the pods are shown as `Available` immediately
620
+
- Detection: Looking at `kube_statefulset_status_replicas_available` metric
621
+
- Mitigations: Disable the `StatefulSetMinReadySeconds` feature gate
622
+
- Diagnostics: Controller-manager logs, when started at log level 4 or above
623
+
- Testing: Yes, e2e tests are already in place
624
+
- `minReadySeconds` is not respected and none of the pods are shown as `Available` after `minReadySeconds`
625
+
- Detection: Looking at `kube_statefulset_status_replicas_available`. None of the pods will be shown available
626
+
- Mitigations: Disable the `StatefulSetMinReadySeconds` feature gate
627
+
- Diagnostics: Controller-manager logs, when started at log level 4 or above
628
+
- Testing: Yes, e2e tests are already in place
606
629
607
630
###### What steps should be taken if SLOs are not being met to determine the problem?
608
631
609
632
## Implementation History
610
-
633
+
- 2021-04-29: Initial KEP merged
634
+
- 2021-06-15: Initial implementation PR merged
635
+
- 2021-07-14: Graduation of the feature to Beta proposed
611
636
<!--
612
637
Major milestones in the lifecycle of a KEP should be tracked in this section.