Commit 8eded88

Merge pull request kubernetes#5094 from jm-franc/prr-hpa-tolerance
KEP-4951: Fill PRR questionnaire
2 parents 54d0e74 + c3eadc0 commit 8eded88

3 files changed

+53
-24
lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 4951
2+
alpha:
3+
approver: "@soltysh"

keps/sig-autoscaling/4951-configurable-hpa-tolerance/README.md

Lines changed: 46 additions & 20 deletions
@@ -102,9 +102,9 @@ checklist items _must_ be updated for the enhancement to be released.
102102

103103
Items marked with (R) are required *prior to targeting to a milestone / release*.
104104

105-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
106-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
107-
- [ ] (R) Design details are appropriately documented
105+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
106+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
107+
- [x] (R) Design details are appropriately documented
108108
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
109109
- [ ] e2e Tests for all Beta API Operations (endpoints)
110110
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
@@ -283,7 +283,7 @@ when drafting this test plan.
283283
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
284284
-->
285285

286-
[ ] I/we understand the owners of the involved components may require updates to
286+
[x] I/we understand the owners of the involved components may require updates to
287287
existing tests to make this code solid enough prior to committing the changes necessary
288288
to implement this enhancement.
289289

@@ -335,7 +335,7 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
335335
https://storage.googleapis.com/k8s-triage/index.html
336336
-->
337337

338-
- <test>: <link to test coverage>
338+
N/A, the feature is tested using unit tests and e2e tests.
339339

340340
##### e2e tests
341341

@@ -491,7 +491,8 @@ well as the [existing list] of feature gates.
491491

492492
- [x] Feature gate (also fill in values in `kep.yaml`)
493493
- Feature gate name: HPAConfigurableTolerance
494-
- Components depending on the feature gate: `kube-controller-manager`
494+
- Components depending on the feature gate: `kube-controller-manager` and
495+
`kube-apiserver`.
495496

496497
###### Does enabling the feature change any default behavior?
497498

@@ -517,7 +518,8 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
517518

518519
The feature can be disabled by restarting the `kube-controller-manager` with the feature gate set to `false`.
519520

520-
Any `tolerance` values set on existing HPAs will be ignored by the `kube-controller-manager` when the feature gate is off.
521+
Any `tolerance` values set on existing HPAs will be ignored by the
522+
`kube-controller-manager` and `kube-apiserver` when the feature gate is off.
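As a rough illustration of what "ignored when the feature gate is off" could mean for the controller, the sketch below falls back to the hard-coded 10% tolerance whenever the gate is disabled or no value is set. The `scalingRules` type, the `effectiveTolerance` helper, and the plain `float64` field are hypothetical stand-ins for this example only; they are not the actual Kubernetes types or code.

```go
package main

import "fmt"

// defaultTolerance mirrors the hard-coded 10% tolerance mentioned in the KEP.
const defaultTolerance = 0.10

// scalingRules is a hypothetical, simplified stand-in for the HPA scaling
// rules that carry the optional `tolerance` field.
type scalingRules struct {
	Tolerance *float64
}

// effectiveTolerance returns the tolerance to use for a scaling decision.
// Assumption: when the gate is off, any per-HPA value is ignored and the
// default applies.
func effectiveTolerance(rules *scalingRules, gateEnabled bool) float64 {
	if !gateEnabled || rules == nil || rules.Tolerance == nil {
		return defaultTolerance
	}
	return *rules.Tolerance
}

func main() {
	custom := 0.05
	rules := &scalingRules{Tolerance: &custom}

	fmt.Println(effectiveTolerance(rules, true))  // custom value is honored
	fmt.Println(effectiveTolerance(rules, false)) // custom value is ignored
}
```

Keeping the stored value untouched and only changing which value is read means re-enabling the gate would restore the configured behavior without editing the HPA.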
521523

522524
###### What happens if we reenable the feature if it was previously rolled back?
523525

@@ -538,6 +540,9 @@ You can take a look at one potential example of such test in:
538540
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
539541
-->
540542

543+
We will add a unit test verifying that HPAs with and without the new fields are
544+
properly validated, both with the feature gate enabled and with it disabled.
545+
541546
### Rollout, Upgrade and Rollback Planning
542547

543548
<!--
@@ -594,6 +599,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
594599
logs or events for this purpose.
595600
-->
596601

602+
The presence of the new `tolerance` HPA field indicates that the feature is
603+
used.
604+
597605
###### How can someone using this feature know that it is working for their instance?
598606

599607
<!--
@@ -605,13 +613,18 @@ and operation of this feature.
605613
Recall that end users cannot usually observe component logs or access metrics.
606614
-->
607615

608-
- [ ] Events
609-
- Event Reason:
610-
- [ ] API .status
611-
- Condition name:
612-
- Other field:
613-
- [ ] Other (treat as last resort)
614-
- Details:
616+
- [x] Events
617+
- Event Reason: `SuccessfulRescale`
618+
619+
The tolerance is applied to the ratio between the _current_ and _desired_ metric
620+
values. Users can get both values using
621+
[`kubectl describe`](https://github.com/kubernetes/kubernetes/blob/1b7a0591871772fbbc0fda430b3b73bc24c0e738/staging/src/k8s.io/kubectl/pkg/describe/describe.go#L4109)
622+
and use them to verify that scaling events are triggered when their ratio is out
623+
of tolerance.
624+
625+
We will update the controller-manager logs to help users understand the behavior
626+
of the autoscaler. The data added to the logs will include the tolerance used
627+
for each scaling decision.
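The manual check described above (compare the ratio of the current and desired metric values against the tolerance) can be sketched as follows. This is a minimal illustration, assuming the ratio is `current / desired` and that a rescale is skipped when it lies within `tolerance` of 1.0; `withinTolerance` is a hypothetical helper, not a function from the HPA controller.

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether the ratio current/desired is close enough
// to 1.0 that no rescale would be expected for the given tolerance.
func withinTolerance(current, desired, tolerance float64) bool {
	ratio := current / desired
	return math.Abs(1.0-ratio) <= tolerance
}

func main() {
	// Values as a user might read them from `kubectl describe`.
	fmt.Println(withinTolerance(105, 100, 0.10)) // true: ratio 1.05, no rescale expected
	fmt.Println(withinTolerance(120, 100, 0.10)) // false: ratio 1.20, rescale expected
	fmt.Println(withinTolerance(105, 100, 0.01)) // false: a tighter tolerance reacts sooner
}
```

A `SuccessfulRescale` event while the first call would return true, or no event while it returns false for a sustained period, would be a hint that the configured tolerance is not the one being applied.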
615628

616629
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
617630

@@ -630,18 +643,21 @@ These goals will help you determine what you need to measure (SLIs) in the next
630643
question.
631644
-->
632645

646+
N/A.
647+
633648
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
634649

635650
<!--
636651
Pick one more of these and delete the rest.
637652
-->
638653

639-
- [ ] Metrics
640-
- Metric name:
641-
- [Optional] Aggregation method:
642-
- Components exposing the metric:
643-
- [ ] Other (treat as last resort)
644-
- Details:
654+
This KEP is not expected to have any impact on SLIs/SLOs as it doesn't introduce
655+
a new HPA behavior, but merely allows users to easily change the value of a
656+
parameter that's otherwise difficult to update.
657+
658+
Standard HPA metrics (e.g.
659+
`horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`) can
660+
be used to verify the HPA controller health.
645661

646662
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
647663

@@ -650,6 +666,12 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
650666
implementation difficulties, etc.).
651667
-->
652668

669+
Users may want to see a signal that autoscaling isn't happening because of the
670+
tolerance, but this is not directly related to this KEP (this problem already
671+
exists today with the hard-coded 10% tolerance), and taking this KEP as an
672+
opportunity to improve the situation is difficult (see
673+
[this thread](https://github.com/kubernetes/enhancements/pull/4954#discussion_r1857098884)).
674+
653675
### Dependencies
654676

655677
<!--
@@ -775,6 +797,8 @@ Are there any tests that were run/should be run to understand performance charac
775797
and validate the declared limits?
776798
-->
777799

800+
No.
801+
778802
### Troubleshooting
779803

780804
<!--
@@ -820,6 +844,8 @@ Major milestones might include:
820844
- when the KEP was retired or superseded
821845
-->
822846

847+
2025-01-21: KEP PR merged.
848+
823849
## Drawbacks
824850

825851
<!--

keps/sig-autoscaling/4951-configurable-hpa-tolerance/kep.yaml

Lines changed: 4 additions & 4 deletions
@@ -4,13 +4,13 @@ authors:
44
- "@pr00se"
55
- "@jm-franc"
66
owning-sig: sig-autoscaling
7-
status: provisional
7+
status: implementable
88
creation-date: 2024-11-05
99
reviewers:
1010
- "@gjtempleton"
1111
- "@raywainman"
1212
approvers:
13-
- TBD
13+
- "@gjtempleton"
1414

1515
see-also:
1616
- "/keps/sig-autoscaling/853-configurable-hpa-scale-velocity"
@@ -40,5 +40,5 @@ feature-gates:
4040
disable-supported: true
4141

4242
# The following PRR answers are required at beta release
43-
#metrics:
44-
# - my_feature_metric
43+
metrics:
44+
- horizontal_pod_autoscaler_controller_metric_computation_duration_seconds
