
Commit 6a75f94

Merge pull request #5358 from jm-franc/beta-graduation
KEP-4951: setting beta graduation target to v1.34.
2 parents 313590c + a1c1801 commit 6a75f94

File tree

3 files changed: +159 -20 lines changed
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 4951
 alpha:
   approver: "@soltysh"
+beta:
+  approver: "@soltysh"

keps/sig-autoscaling/4951-configurable-hpa-tolerance/README.md

Lines changed: 154 additions & 17 deletions
@@ -67,6 +67,7 @@ tags, and then generate with `hack/update-toc.sh`.
   - [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
   - [Alpha](#alpha)
+  - [Beta](#beta)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
   - [Upgrade](#upgrade)
   - [Downgrade](#downgrade)
@@ -106,16 +107,16 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [x] e2e Tests for all Beta API Operations (endpoints)
   - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
   - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 [kubernetes.io]: https://kubernetes.io/
 [kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -352,12 +353,15 @@ https://storage.googleapis.com/k8s-triage/index.html
 We expect no non-infra related flakes in the last month as a GA graduation criteria.
 -->

-We will add the follow [e2e autoscaling tests]:
+Existing e2e tests ensure the autoscaling behavior uses the default tolerance when no
+configurable tolerance is specified.

-- For both scale up and scale down:
-  - Workload does not scale because the metric ratio is in tolerance.
-  - Workload scales successfully because the metric ratio is out of tolerance.
-- Autoscaling uses the default when no tolerances are set.
+The new [e2e autoscaling tests] covering this feature are:
+
+- [Test with large configurable tolerance](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/test/e2e/autoscaling/horizontal_pod_autoscaling_behavior.go#L509): [SIG autoscaling](https://testgrid.k8s.io/sig-autoscaling-hpa#gci-gce-autoscaling-hpa-cpu-alpha-beta-pull&include-filter-by-regex=HPAConfigurableTolerance.*large%20configurable%20tolerance), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=HPAConfigurableTolerance.*large%20configurable%20tolerance)
+
+Before the graduation to beta, we will add an integration test verifying the autoscaling
+behavior when smaller- and larger-than-default tolerances are set on an HPA.

 [e2e autoscaling tests]: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/autoscaling

@@ -430,6 +434,12 @@ in back-to-back releases.
 - Feature implemented behind a `HPAConfigurableTolerance` feature flag
 - Initial e2e tests completed and enabled

+#### Beta
+
+- All tests described in the [`e2e tests` section](#e2e-tests) are implemented
+  and linked in this KEP.
+- We have monitored for negative user feedback and addressed relevant concerns.
+
 ### Upgrade / Downgrade Strategy

 #### Upgrade
@@ -543,7 +553,7 @@ You can take a look at one potential example of such test in:
 https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
 -->

-We will add a unit test verifying that HPAs with and without the new fields are
+[Unit tests have been added](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/pkg/apis/autoscaling/validation/validation_test.go#L1648) to verify that HPAs with and without the new fields are
 properly validated, both when the feature gate is enabled or not.

 ### Rollout, Upgrade and Rollback Planning
@@ -564,13 +574,20 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->

+This feature does not introduce new failure modes: during rollout/rollback, some
+API servers will allow or disallow setting the new `tolerance` field. The new
+field may be ignored until the controller manager is fully updated.
+
 ###### What specific metrics should inform a rollback?

 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->

+A high `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
+metric can indicate a problem related to this feature.
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

 <!--
@@ -579,12 +596,105 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->

+The upgrade→downgrade→upgrade testing was done manually on a 1.33 cluster with the following steps:
+
+1. Start the cluster with the `HPAConfigurableTolerance` feature gate enabled:
+
+   ```sh
+   kind create cluster --name configurable-tolerance --image kindest/node:v1.33.0 --config config.yaml
+   ```
+
+   with the following `config.yaml` file content:
+
+   ```yaml
+   kind: Cluster
+   apiVersion: kind.x-k8s.io/v1alpha4
+   featureGates:
+     "HPAConfigurableTolerance": true
+   nodes:
+   - role: control-plane
+   - role: worker
+   ```
+
+   Install metrics-server:
+
+   ```sh
+   kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.2/components.yaml
+   kubectl patch -n kube-system deployment metrics-server --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
+   ```
+
+   Create a deployment whose Pods consume about 50% of their CPU request, together with an HPA
+   with a very large tolerance:
+
+   ```sh
+   kubectl apply -f configurable-tolerance-test.yaml
+   ```
+
+   with the following `configurable-tolerance-test.yaml` file content:
+
+   ```yaml
+   apiVersion: apps/v1
+   kind: Deployment
+   metadata:
+     name: cpu-stress-deployment
+     labels:
+       app: cpu-stressor
+   spec:
+     replicas: 1
+     selector:
+       matchLabels:
+         app: cpu-stressor
+     template:
+       metadata:
+         labels:
+           app: cpu-stressor
+       spec:
+         containers:
+         - name: cpu-stressor
+           image: alpine:latest
+           command: ["/bin/sh"]
+           args: # Load: 1% of one CPU (10 milliCPU)
+           - "-c"
+           - "apk add --no-cache stress-ng && stress-ng --cpu 1 --cpu-load 1 --cpu-method=crc16 --timeout 3600s"
+           resources:
+             requests:
+               cpu: "20m"
+   ---
+   apiVersion: autoscaling/v2
+   kind: HorizontalPodAutoscaler
+   metadata:
+     name: cpu-stress-hpa
+   spec:
+     scaleTargetRef:
+       apiVersion: apps/v1
+       kind: Deployment
+       name: cpu-stress-deployment
+     minReplicas: 1
+     maxReplicas: 5
+     metrics:
+     - type: Resource
+       resource:
+         name: cpu
+         target:
+           type: Utilization
+           averageUtilization: 10
+     behavior:
+       scaleUp:
+         tolerance: 20 # 2000%
+   ```
+
+   Check that, after 5 minutes, `kubectl describe hpa cpu-stress-hpa` displays `ScalingLimited: False`
+   (i.e. the HPA doesn't recommend scaling up because of the large tolerance).
+
+2. Simulate a downgrade by disabling the feature gate for the API server and controller manager
+   (update the `config.yaml` file to set it to false). Follow the procedure described in step 1,
+   and observe that this time `kubectl describe hpa cpu-stress-hpa` displays `ScalingLimited: True`.
+
+3. Simulate an upgrade by re-enabling the feature gate for the API server and controller manager.
+   Follow the procedure described in step 1, and observe that the HPA description mentions
+   `ScalingLimited: False`, demonstrating that the feature is working again.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->

+No.
+
 ### Monitoring Requirements

 <!--
@@ -625,9 +735,9 @@ values. Users can get both values using
 and use them to verify that scaling events are triggered when their ratio is out
 of tolerance.

-We will update the controller-manager logs to help users understand the behavior
-of the autoscaler. The data added to the logs will include the tolerance used
-for each scaling decision.
+The [controller-manager logs have been updated](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/pkg/controller/podautoscaler/horizontal.go#L846)
+to help users understand the behavior of the autoscaler. The data added to the
+logs includes the tolerance used for each scaling decision.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -646,7 +756,9 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->

-N/A.
+Although the absolute value of the `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
+metric depends on the HPA configuration, it should be unaffected by this feature. This metric should not vary
+by more than 5%.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

@@ -658,8 +770,7 @@ This KEP is not expected to have any impact on SLIs/SLOs as it doesn't introduce
 a new HPA behavior, but merely allows users to easily change the value of a
 parameter that's otherwise difficult to update.

-Standard HPA metrics (e.g.
-`horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`) can
+The standard HPA metric `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds` can
 be used to verify the HPA controller health.

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -698,6 +809,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its degraded performance or high-error rates on the feature:
 -->

+No, this feature does not depend on any specific service.
+
 ### Scalability

 <!--
@@ -817,6 +930,8 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+API server or etcd issues do not specifically impact this feature beyond their
+effect on the HPA controller as a whole.
+
 ###### What are other known failure modes?

 <!--
@@ -832,8 +947,20 @@ For each of them, fill in the following information by copying the below template:
 - Testing: Are there any tests for failure mode? If not, describe why.
 -->

+We do not expect any new failure modes. (While setting `tolerance` below the 10% default can cause
+HPAs to scale up and down as frequently as every 30s, and larger values can stop scaling altogether
+if the metric remains within the tolerance band, in both cases the feature is working as intended.
+To make HPAs respond faster, decrease the tolerance value; to make them respond slower, increase it.)
+
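To make the trade-off described in the added text concrete, here is a minimal sketch of the tolerance check the HPA applies. This is illustrative only: `desired_replicas` is a hypothetical helper mirroring the documented scaling rule (skip scaling while the metric ratio is within the tolerance of 1.0, otherwise scale proportionally), not the controller's actual code.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Sketch of the HPA scaling rule with a configurable tolerance.

    Scaling is skipped while the metric ratio stays within `tolerance`
    of 1.0; otherwise replicas scale proportionally to the ratio.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling
    return math.ceil(current_replicas * ratio)

# Default 10% tolerance ignores a 5% overshoot:
assert desired_replicas(4, 105, 100) == 4
# A tolerance below the default reacts to the same overshoot:
assert desired_replicas(4, 105, 100, tolerance=0.01) == 5
# A very large tolerance (e.g. 20, i.e. 2000%) suppresses scaling entirely:
assert desired_replicas(4, 500, 100, tolerance=20.0) == 4
```

This matches the behavior above: lowering the tolerance makes the HPA react to smaller metric deviations, while raising it widens the dead band in which no scaling occurs.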
 ###### What steps should be taken if SLOs are not being met to determine the problem?

+If possible, increase the log level for kube-controller-manager and check the controller logs:
+1. Search for "Proposing desired replicas", verify that the tolerance is set as expected,
+   and check (using `kubectl describe hpa`) whether the ratio between the _current_ and
+   _desired_ metric values is within tolerance.
+2. Look for warnings and errors which might point to where the problem lies.
+
 ## Implementation History

 <!--
@@ -848,13 +975,18 @@ Major milestones might include:
 -->

 2025-01-21: KEP PR merged.
+2025-03-24: [Implementation PR](https://github.com/kubernetes/kubernetes/pull/130797) merged.
+2025-05-15: Kubernetes v1.33 released (includes this feature).
+2025-05-16: This KEP updated for beta graduation.

 ## Drawbacks

 <!--
 Why should this KEP _not_ be implemented?
 -->

+No major drawbacks have been identified.
+
 ## Alternatives

 <!--
@@ -863,10 +995,15 @@ not need to be as detailed as the proposal, but should include enough
 information to express the idea and why it was not acceptable.
 -->

+On non-managed Kubernetes clusters, users can instead update the cluster-wide
+`--horizontal-pod-autoscaler-tolerance` parameter.
+
 ## Infrastructure Needed (Optional)

 <!--
 Use this section if you need things from the project/SIG. Examples include a
 new subproject, repos requested, or GitHub details. Listing these here allows a
 SIG to get the process for these resources started right away.
 -->
+
+N/A.

keps/sig-autoscaling/4951-configurable-hpa-tolerance/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -17,17 +17,17 @@ see-also:
 replaces:

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.33"
+latest-milestone: "v1.34"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.33"
-  beta: TBD
+  beta: "v1.34"
   stable: TBD

 # The following PRR answers are required at alpha release
