Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [x] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
##### e2e tests

<!--
We expect no non-infra related flakes in the last month as a GA graduation criterion.
-->

Existing e2e tests ensure the autoscaling behavior uses the default tolerance when no
configurable tolerance is specified.

The new [e2e autoscaling tests] covering this feature are:

- [Test with large configurable tolerance](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/test/e2e/autoscaling/horizontal_pod_autoscaling_behavior.go#L509): [SIG autoscaling](https://testgrid.k8s.io/sig-autoscaling-hpa#gci-gce-autoscaling-hpa-cpu-alpha-beta-pull&include-filter-by-regex=HPAConfigurableTolerance.*large%20configurable%20tolerance), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=HPAConfigurableTolerance.*large%20configurable%20tolerance)

Before the graduation to beta, we will add an integration test verifying the autoscaling
behavior when smaller- and larger-than-default tolerances are set on an HPA.
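For context, a minimal sketch of the kind of HPA manifest these tests exercise. Names and
values are illustrative, and the `tolerance` field requires the `HPAConfigurableTolerance`
feature gate:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app      # illustrative target
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
  behavior:
    scaleDown:
      tolerance: 0.05      # overrides the default 10% tolerance for scale-downs
```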
###### Are there any tests for feature enablement/disablement?

[Unit tests have been added](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/pkg/apis/autoscaling/validation/validation_test.go#L1648) to verify that HPAs with and without the new fields are
properly validated, whether or not the feature gate is enabled.
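As an illustration of what these tests cover (assuming, per this KEP, that `tolerance`
must be non-negative), a manifest fragment like the following should be rejected when
the feature gate is enabled:

```yaml
# Hypothetical fragment: rejected by validation because tolerance is negative.
behavior:
  scaleUp:
    tolerance: -0.1   # invalid: tolerance must be greater than or equal to zero
```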
### Rollout, Upgrade and Rollback Planning
###### How can a rollout or rollback fail? Can it impact already running workloads?

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

This feature does not introduce new failure modes: during a rollout or rollback, some
API servers will allow or disallow setting the new `tolerance` field. The new
field may be ignored until the controller manager is fully updated.
###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

A high `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
metric can indicate a problem related to this feature.
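For example, an operator might alert on sustained growth in this metric's upper
percentiles. A sketch, assuming the prometheus-operator `PrometheusRule` CRD is
available; the threshold and durations are placeholders, not recommendations:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-metric-computation-latency   # illustrative name
spec:
  groups:
  - name: hpa-controller
    rules:
    - alert: HPAMetricComputationSlow
      # 99th-percentile metric-computation latency over the last 10 minutes.
      expr: |
        histogram_quantile(0.99,
          sum by (le) (rate(horizontal_pod_autoscaler_controller_metric_computation_duration_seconds_bucket[10m])))
          > 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: HPA metric computation latency is unusually high; consider rolling back.
```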
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

The upgrade→downgrade→upgrade testing was done manually using a 1.33 cluster with the following steps:

4. Simulate the upgrade by re-enabling the feature for the API server and control plane. Follow the procedure described
   in step 1, and observe that the HPA description mentions `ScalingLimited: False`, demonstrating that the feature
   is working again.
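For reference, the condition checked in step 4 appears in the HPA status. An illustrative
excerpt of what the status should contain at that point (the exact reason and message may
vary with the HPA's configuration):

```yaml
# Illustrative excerpt of the HPA status observed in step 4.
status:
  conditions:
  - type: ScalingLimited
    status: "False"
    reason: DesiredWithinRange          # typical reason when scaling is not capped
    message: the desired count is within the acceptable range
```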
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

No.
### Monitoring Requirements
###### How can someone using this feature know that it is working for their instance?

The HPA controller bases its scaling decisions on the ratio between the current and
desired metric values. Users can get both values from the HPA status (e.g. via
`kubectl describe hpa`) and use them to verify that scaling events are triggered when
their ratio is out of tolerance.
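For example (with illustrative numbers): given the default tolerance of 0.1 and a target
average CPU utilization of 50%, a current utilization of 53% yields a ratio of 1.06;
since |1.06 − 1| = 0.06 ≤ 0.1, no scaling occurs. A current utilization of 65% yields a
ratio of 1.3, which is out of tolerance and triggers a scale-up.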
The [controller-manager logs have been updated](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/pkg/controller/podautoscaler/horizontal.go#L846)
to help users understand the behavior of the autoscaler. The data added to the
logs includes the tolerance used for each scaling decision.
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

<!--
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

Although the absolute value of the `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
metric depends on the cluster's HPA configuration, it should be unaffected by this feature. This metric should
not vary by more than 5%.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

This KEP is not expected to have any impact on SLIs/SLOs as it doesn't introduce
a new HPA behavior, but merely allows users to easily change the value of a
parameter that's otherwise difficult to update.

The standard HPA metric `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds` can
be used to verify the HPA controller's health.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

<!--
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- Impact of its degraded performance or high-error rates on the feature:
-->

No, this feature does not depend on any specific service.
### Scalability

### Troubleshooting

<!--
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->

###### How does this feature react if the API server and/or etcd is unavailable?

API server or etcd issues do not impact this feature.
###### What are other known failure modes?

<!--
For each of them, fill in the following information by copying the below template:
- Testing: Are there any tests for failure mode? If not, describe why.
-->

We do not expect any new failure mode. While setting `tolerance` below the 10% default can
cause HPAs to scale up and down as frequently as every 30s, and higher values might stop
scaling altogether if the metric remains within the tolerance band, the feature is still
working as intended in both cases. To make HPAs respond faster, decrease the tolerance
value; to make them respond more slowly, increase it.
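To make this tuning guidance concrete, a sketch of a `behavior` stanza with per-direction
tolerances (the values are illustrative, not recommendations):

```yaml
behavior:
  scaleUp:
    tolerance: 0.05   # react quickly to small load increases
  scaleDown:
    tolerance: 0.3    # scale down only after a large, sustained drop in load
```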
###### What steps should be taken if SLOs are not being met to determine the problem?

If possible, increase the log level of the kube-controller-manager and check the controller logs:

1. Search for "Proposing desired replicas", verify that the tolerance is set as expected,
   and check (using `kubectl describe hpa`) whether the ratio between the _current_ and _desired_
   metric values is in tolerance.
2. Look for warnings and errors which might point to where the problem lies.
## Implementation History

<!--
Major milestones might include:
-->