Commit a900416

committed
Update test links and metrics for beta release
Signed-off-by: Heba Elayoty <[email protected]>
1 parent ae9096a commit a900416

File tree

2 files changed

+32
-28
lines changed

keps/sig-apps/961-maxunavailable-for-statefulset/README.md

Lines changed: 27 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -445,7 +445,9 @@ New proposed implementation: https://github.com/kubernetes/kubernetes/pull/13090
445445

446446
#### Metrics
447447

448-
We'll add a new metric named `statefulset_unavailability_violation`, it tracks how many violations are detected while processing StatefulSets with maxUnavailable > 1, (counter goes up if processed StatefulSet has spec.replicas - status.readyReplicas > maxUnavailable)
448+
We'll add two new metrics:
449+
- **statefulset_max_unavailable**: tracks the current `.spec.updateStrategy.rollingUpdate.maxUnavailable` value. This gauge reflects the configured maximum number of pods that can be unavailable during rolling updates, providing visibility into the availability constraints.
450+
- **statefulset_unavailable_replicas**: tracks the current number of unavailable pods in a StatefulSet. This gauge reflects the real-time count of pods that are either missing or unavailable (i.e., not ready for `.spec.minReadySeconds`).
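A minimal Go sketch of how the two gauges could be compared, assuming unavailability is computed as desired replicas minus available replicas. The helper names `unavailableReplicas` and `exceedsMaxUnavailable` are hypothetical and are not taken from the StatefulSet controller.

```go
package main

import "fmt"

// unavailableReplicas is a hypothetical helper: it returns how many of the
// desired pods are missing or not yet available. It never returns a
// negative number, even if more pods are available than desired.
func unavailableReplicas(desired, available int32) int32 {
	if available >= desired {
		return 0
	}
	return desired - available
}

// exceedsMaxUnavailable reports whether the unavailable count (the value a
// statefulset_unavailable_replicas gauge would carry) is above the
// configured budget (the value of statefulset_max_unavailable).
func exceedsMaxUnavailable(desired, available, maxUnavailable int32) bool {
	return unavailableReplicas(desired, available) > maxUnavailable
}

func main() {
	// 6 desired replicas, maxUnavailable = 3.
	fmt.Println(unavailableReplicas(6, 4))      // 2 pods unavailable
	fmt.Println(exceedsMaxUnavailable(6, 4, 3)) // within budget
	fmt.Println(exceedsMaxUnavailable(6, 2, 3)) // 4 unavailable, over budget
}
```

Exposing the two values as separate gauges leaves the comparison to the operator's alerting rules, rather than encoding it in a single violation counter.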
449451

450452
### Test Plan
451453

@@ -545,6 +547,7 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
545547
- test that rolling updates are working correctly for both PodManagementPolicy types when the MaxUnavailable is used.
546548
- include a test that fails currently but passes when https://github.com/kubernetes/kubernetes/issues/112307 is fixed, with a
547549
StatefulSet setting `minReadySeconds` and `updateStrategy.rollingUpdate.maxUnavailable` and checking for a correct rollout, especially when scaling down during a rollout.
550+
- https://github.com/kubernetes/kubernetes/pull/133717
548551

549552
## Graduation Criteria
550553

@@ -566,7 +569,7 @@ Clearly define what graduation means by either linking to the [API doc
566569
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
567570
or by redefining what graduation means.
568571
569-
In general we try to use the same stages (alpha, beta, GA), regardless of how the
572+
In general, we try to use the same stages (alpha, beta, GA), regardless of how the
570573
functionality is accessed.
571574
572575
[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
@@ -617,11 +620,11 @@ in back-to-back releases.
617620
#### Beta
618621

619622
- Enabled by default with default value of 1 with upgrade/downgrade tested at least manually.
620-
- Added `statefulset_unavailability_violation` metric in-tree
621-
- It is necessary to update the firstUnhealthyPod calculation to correctly call processCondemned. New tests should cover this and take into consideration that the controller should first wait for the predecessor condemned pods to become available before deleting them and delete the pod with the highest ordinal number
623+
- Added `statefulset_max_unavailable` and `statefulset_unavailable_replicas` metrics in-tree.
624+
- It is necessary to update the `firstUnhealthyPod` calculation to correctly call processCondemned. New tests should cover this and take into consideration that the controller should first wait for the predecessor condemned pods to become available before deleting them and delete the pod with the highest ordinal number
622625
- minReadySeconds and maxUnavailable bugs https://github.com/kubernetes/kubernetes/issues/123911, https://github.com/kubernetes/kubernetes/issues/112307, https://github.com/kubernetes/kubernetes/issues/119234 and https://github.com/kubernetes/kubernetes/issues/123918 should be fixed before promotion of maxUnavailable.
623626
- Additional unit/e2e/integration tests listed in the test plan should be added covering the newly found bugs.
624-
- Users should be warned that maxUnavailable works differently for each podManagementPolicy (E.g for OrderedReady it is not applied until the StatefulSet had a chance to fully scale up). This can result in slower rollouts. For parallel this can skip ordering. This should be both mentioned in the API doc and website as a requirements for beta graduation.
627+
- Users should be warned that maxUnavailable works differently for each podManagementPolicy (e.g. for `OrderedReady` it is not applied until the StatefulSet has had a chance to fully scale up). This can result in slower rollouts. For `Parallel`, this can skip ordering. This should be mentioned in both the API doc and the website as a requirement for beta graduation.
625628

626629
#### GA
627630

@@ -743,7 +746,7 @@ mid-rollout?
743746
Be sure to consider highly-available clusters, where, for example,
744747
feature flags will be enabled on some API servers and not others during the
745748
rollout. Similarly, consider large clusters and how enablement/disablement
746-
will rollout across nodes.
749+
will roll out across nodes.
747750
-->
748751

749752
The rollout or rollback of the `maxUnavailable` feature for StatefulSets primarily affects how updates are managed, aiming to minimize disruptions. However, several scenarios could lead to potential issues:
@@ -789,28 +792,28 @@ Multiple violations of maxUnavailable might indicate issues with feature behavio
789792
A manual test was performed, as follows:
790793

791794
1. Create a cluster in 1.33.
792-
2. Upgrade to 1.34.
795+
2. Upgrade to 1.35.
793796
3. Create StatefulSet A with spec.updateStrategy.rollingUpdate.maxUnavailable set to 3, with 6 replicas
794797
4. Verify a rollout and check if only 3 pods are unavailable at a time ([currently with a bug if podManagementPolicy is set to Parallel](https://github.com/kubernetes/kubernetes/issues/112307))
795798
5. Downgrade to 1.33.
796799
6. Verify that the rollout only has 1 pod unavailable at a time, similar to setting maxUnavailable to 1
797800
7. Create another StatefulSet B not setting maxUnavailable (leaving it nil)
798-
8. Upgrade to 1.34.
801+
8. Upgrade to 1.35.
799802
9. Verify that the rollout has default behavior of only having one pod unavailable at a time
800803
Verify that the `maxUnavailable` can be set again to StatefulSet A and test the rollout behavior
801804

802805
TODO:
803806
A manual test will be performed, as follows:
804807

805808
1. Create a cluster in 1.33.
806-
2. Upgrade to 1.34.
809+
2. Upgrade to 1.35.
807810
3. Create StatefulSet A with spec.updateStrategy.rollingUpdate.maxUnavailable set to 3, with 6 replicas
808811
4. Verify a rollout and check if only 3 pods are unavailable at a time
809812
5. Check if rollout is also fine with podManagementPolicy set to Parallel
810813
6. Downgrade to 1.33.
811814
7. Verify that the rollout only has 1 pod unavailable at a time, similar to setting maxUnavailable to 1 (MaxUnavailableStatefulSet feature gate disabled by default).
812815
8. Create another StatefulSet B not setting maxUnavailable (leaving it nil)
813-
9. Upgrade to 1.34.
816+
9. Upgrade to 1.35.
814817
10. Verify that the rollout has default behavior of only having one pod unavailable at a time
815818
Verify that the `maxUnavailable` can be set again to StatefulSet A and test the rollout behavior
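The expected outcomes of the manual test steps can be summarized in a short Go sketch. It assumes that a disabled feature gate or a nil `maxUnavailable` both fall back to one unavailable pod at a time, as the steps above describe. The function `effectiveMaxUnavailable` is a hypothetical illustration, not the controller's code.

```go
package main

import "fmt"

// effectiveMaxUnavailable is a hypothetical helper that mirrors the
// behavior the manual test expects: the configured value applies only when
// the feature gate is enabled and the field is set; otherwise the rollout
// keeps at most one pod unavailable at a time.
func effectiveMaxUnavailable(gateEnabled bool, configured *int32) int32 {
	if !gateEnabled || configured == nil {
		return 1
	}
	return *configured
}

func main() {
	three := int32(3)

	// StatefulSet A after the upgrade: gate enabled, maxUnavailable set to 3.
	fmt.Println(effectiveMaxUnavailable(true, &three))

	// StatefulSet A after the downgrade: gate disabled, field ignored.
	fmt.Println(effectiveMaxUnavailable(false, &three))

	// StatefulSet B: gate enabled, maxUnavailable left nil.
	fmt.Println(effectiveMaxUnavailable(true, nil))
}
```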
816819

@@ -822,8 +825,8 @@ No
822825

823826
###### How can an operator determine if the feature is in use by workloads?
824827

825-
If their StatefulSet rollingUpdate section has the field maxUnavailable specified with
826-
a value different than 1. While in alpha and beta, the feature-gate needs to be enabled.
828+
If their StatefulSet rollingUpdate section has the field `maxUnavailable` specified with
829+
a value different from 1. While in alpha and beta, the feature-gate needs to be enabled.
827830

828831
The command below should show the maxUnavailable value:
829832

@@ -839,7 +842,7 @@ kubectl get statefulsets -o yaml | grep maxUnavailable
839842
- Condition name:
840843
- Other field: .spec.updateStrategy.rollingUpdate.maxUnavailable
841844
- [X] Other (treat as last resort)
842-
- Details: Users can view the `statefulset_unavailability_violation` metric to see if there have been instances
845+
- Details: Users can view the `statefulset_unavailable_replicas` or `statefulset_max_unavailable` metrics to see if there have been instances
843846
where the feature is not working as intended.
844847

845848
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -861,7 +864,7 @@ question.
861864

862865
Startup latency of schedulable stateful pods should follow the [existing latency SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#steady-state-slisslos).
863866

864-
The total number of `statefulset_unavailability_violation` increments across all StatefulSets must not exceed 5 over a 28-day rolling window.
867+
`statefulset_unavailable_replicas` > `statefulset_max_unavailable` must not exceed the limit.
865868

866869
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
867870

@@ -883,13 +886,12 @@ Pick one more of these and delete the rest.
883886
- Metric name: `workqueue_work_duration_seconds`
884887
- Scope: Observes the time taken to process StatefulSet operations from the work queue.
885888
- Components exposing the metric: `kube-controller-manager`
886-
- Metric name: `workqueue_retries_total`
887-
888-
- Scope: Counts the total number of retries for StatefulSet update operations within the work queue. This metric provides insight into the stability and reliability of the StatefulSet update process, indicating potential issues when high.
889-
- Components Exposing the Metric: `kube-controller-manager`
889+
- Metric name: `workqueue_retries_total`
890+
- Scope: Counts the total number of retries for StatefulSet update operations within the work queue. This metric provides insight into the stability and reliability of the StatefulSet update process, indicating potential issues when high.
891+
- Components Exposing the Metric: `kube-controller-manager`
890892

891893
- Metric name: `statefulset_unavailability_violation`
892-
- Scope: Counts the number of times maxUnavailable has been violated (i.e spec.replicas - availableReplicas > maxUnavailable).
894+
- Scope: Counts the number of times maxUnavailable has been violated (i.e. `.spec.replicas` - availableReplicas > maxUnavailable).
893895
- Components Exposing the Metric: `kube-controller-manager`
894896

895897
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -938,7 +940,7 @@ No.
938940
###### How does this feature react if the API server and/or etcd is unavailable?
939941

940942
The RollingUpdate will fail or will not be able to proceed if etcd or API server is unavailable and
941-
hence this feature will also be not be able to be used.
943+
hence this feature will also not be able to be used.
942944

943945
###### What are other known failure modes?
944946

@@ -957,7 +959,7 @@ For each of them, fill in the following information by copying the below templat
957959

958960
- Incorrect Handling of minReadySeconds During StatefulSet Updates with Parallel Pod Management
959961
- Detection:
960-
- Monitor the `statefulset_unavailability_violation` metric of the StatefulSet during rolling updates. A large value of this metric could indicate the issue.
962+
- Monitor the `statefulset_unavailable_replicas` and `statefulset_max_unavailable` metrics of the StatefulSet during rolling updates. A `statefulset_unavailable_replicas` value above `statefulset_max_unavailable` could indicate the issue.
961963
- Review StatefulSet events or controller logs for rapid succession of pod updates without adherence to minReadySeconds, which could confirm that the delay is not being respected.
962964
- Mitigations:
963965
- Temporarily adjust the podManagementPolicy to OrderedReady as a workaround to ensure minReadySeconds is respected during updates, though this may slow down the rollout process.
@@ -975,10 +977,10 @@ For each of them, fill in the following information by copying the below templat
975977

976978
- 2019-01-01: KEP created.
977979
- 2019-08-30: PR Implemented with tests covered.
978-
- <<[UNRESOLVED bugs found in alpha and blockers to promotion @knelasevero @atiratree @bersalazar @leomichalski]>>
979-
Open PRs: https://github.com/kubernetes/kubernetes/pull/130909, https://github.com/kubernetes/kubernetes/pull/130951
980-
<<[/UNRESOLVED]>>
981-
- 2025-XX-XX: Bump to Beta.
980+
- bugs found in alpha and blockers to promotion @knelasevero @atiratree @bersalazar @leomichalski
981+
- 2025-07-07: https://github.com/kubernetes/kubernetes/pull/130909
982+
- 2025-09-01: https://github.com/kubernetes/kubernetes/pull/130951
983+
- 2025-09-30: Bump to Beta.
982984

983985
## Drawbacks
984986

keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@ authors:
55
- "@kerthcet"
66
- "@knelasevero"
77
- "@edwinhr716"
8+
- "@helayoty"
89
owning-sig: sig-apps
910
participating-sigs: []
1011
status: implementable
@@ -27,12 +28,12 @@ stage: beta
2728
# The most recent milestone for which work toward delivery of this KEP has been
2829
# done. This can be the current (upcoming) milestone, if it is being actively
2930
# worked on.
30-
latest-milestone: "v1.34"
31+
latest-milestone: "v1.35"
3132

3233
# The milestone at which this feature was, or is targeted to be, at each stage.
3334
milestone:
3435
alpha: "v1.24"
35-
beta: "v1.34"
36+
beta: "v1.35"
3637
stable: TBD
3738

3839
# The following PRR answers are required at alpha release
@@ -46,4 +47,5 @@ disable-supported: true
4647

4748
# The following PRR answers are required at beta release
4849
metrics:
49-
- statefulset_unavailability_violation
50+
- statefulset_max_unavailable
51+
- statefulset_unavailable_replicas
