
Commit 6a75f94

Merge pull request #5358 from jm-franc/beta-graduation
KEP-4951: setting beta graduation target to v1.34.
2 parents 313590c + a1c1801 commit 6a75f94

File tree

3 files changed: +159 -20 lines changed
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 4951
 alpha:
   approver: "@soltysh"
+beta:
+  approver: "@soltysh"

keps/sig-autoscaling/4951-configurable-hpa-tolerance/README.md

Lines changed: 154 additions & 17 deletions
@@ -67,6 +67,7 @@ tags, and then generate with `hack/update-toc.sh`.
   - [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
   - [Alpha](#alpha)
+  - [Beta](#beta)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
   - [Upgrade](#upgrade)
   - [Downgrade](#downgrade)
@@ -106,16 +107,16 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [x] e2e Tests for all Beta API Operations (endpoints)
   - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
   - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 [kubernetes.io]: https://kubernetes.io/
 [kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -352,12 +353,15 @@ https://storage.googleapis.com/k8s-triage/index.html
 We expect no non-infra related flakes in the last month as a GA graduation criteria.
 -->

-We will add the follow [e2e autoscaling tests]:
+Existing e2e tests ensure the autoscaling behavior uses the default tolerance when no
+configurable tolerance is specified.

-- For both scale up and scale down:
-  - Workload does not scale because the metric ratio is in tolerance.
-  - Workload scales successfully because the metric ratio is out of tolerance.
-- Autoscaling uses the default when no tolerances are set.
+The new [e2e autoscaling tests] covering this feature are:
+
+- [Test with large configurable tolerance](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/test/e2e/autoscaling/horizontal_pod_autoscaling_behavior.go#L509): [SIG autoscaling](https://testgrid.k8s.io/sig-autoscaling-hpa#gci-gce-autoscaling-hpa-cpu-alpha-beta-pull&include-filter-by-regex=HPAConfigurableTolerance.*large%20configurable%20tolerance), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=HPAConfigurableTolerance.*large%20configurable%20tolerance)
+
+Before the graduation to beta, we will add an integration test verifying the autoscaling
+behavior when smaller- and larger-than-default tolerances are set on an HPA.

 [e2e autoscaling tests]: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/autoscaling

@@ -430,6 +434,12 @@ in back-to-back releases.
 - Feature implemented behind a `HPAConfigurableTolerance` feature flag
 - Initial e2e tests completed and enabled

+#### Beta
+
+- All tests described in the [`e2e tests` section](#e2e-tests) are implemented
+  and linked in this KEP.
+- We have monitored for negative user feedback and addressed relevant concerns.
+
 ### Upgrade / Downgrade Strategy

 #### Upgrade
@@ -543,7 +553,7 @@ You can take a look at one potential example of such test in:
 https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
 -->

-We will add a unit test verifying that HPAs with and without the new fields are
+[Unit tests have been added](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/pkg/apis/autoscaling/validation/validation_test.go#L1648) to verify that HPAs with and without the new fields are
 properly validated, both when the feature gate is enabled or not.

 ### Rollout, Upgrade and Rollback Planning
@@ -564,13 +574,20 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->

+This feature does not introduce new failure modes: during rollout/rollback, some
+API servers will allow or disallow setting the new `tolerance` field. The new
+field may be ignored until the controller manager is fully updated.
+
 ###### What specific metrics should inform a rollback?

 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->

+A high `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
+metric can indicate a problem related to this feature.
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

 <!--
@@ -579,12 +596,105 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->

+The upgrade→downgrade→upgrade testing was done manually on a 1.33 cluster with the following steps:
+
+1. Start the cluster with the `HPAConfigurableTolerance` feature gate enabled:
+
+   ```sh
+   kind create cluster --name configurable-tolerance --image kindest/node:v1.33.0 --config config.yaml
+   ```
+
+   with the following `config.yaml` file content:
+
+   ```yaml
+   kind: Cluster
+   apiVersion: kind.x-k8s.io/v1alpha4
+   featureGates:
+     "HPAConfigurableTolerance": true
+   nodes:
+   - role: control-plane
+   - role: worker
+   ```
+
+   Install metrics-server:
+
+   ```sh
+   kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.2/components.yaml
+   kubectl patch -n kube-system deployment metrics-server --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
+   ```
+
+   Create a deployment whose Pods consume about 50% of their CPU request, together with an HPA
+   with a very large tolerance:
+
+   ```sh
+   kubectl apply -f configurable-tolerance-test.yaml
+   ```
+
+   with the following `configurable-tolerance-test.yaml` file content:
+
+   ```yaml
+   apiVersion: apps/v1
+   kind: Deployment
+   metadata:
+     name: cpu-stress-deployment
+     labels:
+       app: cpu-stressor
+   spec:
+     replicas: 1
+     selector:
+       matchLabels:
+         app: cpu-stressor
+     template:
+       metadata:
+         labels:
+           app: cpu-stressor
+       spec:
+         containers:
+         - name: cpu-stressor
+           image: alpine:latest
+           command: ["/bin/sh"]
+           args: # Load: 1% of one CPU (10 milliCPU)
+           - "-c"
+           - "apk add --no-cache stress-ng && stress-ng --cpu 1 --cpu-load 1 --cpu-method=crc16 --timeout 3600s"
+           resources:
+             requests:
+               cpu: "20m"
+   ---
+   apiVersion: autoscaling/v2
+   kind: HorizontalPodAutoscaler
+   metadata:
+     name: cpu-stress-hpa
+   spec:
+     scaleTargetRef:
+       apiVersion: apps/v1
+       kind: Deployment
+       name: cpu-stress-deployment
+     minReplicas: 1
+     maxReplicas: 5
+     metrics:
+     - type: Resource
+       resource:
+         name: cpu
+         target:
+           type: Utilization
+           averageUtilization: 10
+     behavior:
+       scaleUp:
+         tolerance: 20 # 2000%
+   ```
+
+   Check that, after 5 minutes, `kubectl describe hpa cpu-stress-hpa` displays `ScalingLimited: False`
+   (i.e. the HPA doesn't recommend scaling up because of the large tolerance).
+
+2. Simulate a downgrade by disabling the feature gate for the API server and controller manager
+   (update the `config.yaml` file to set it to false). Follow the procedure described in step 1,
+   and observe that this time `kubectl describe hpa cpu-stress-hpa` displays `ScalingLimited: True`.
+
+3. Simulate an upgrade by re-enabling the feature gate for the API server and controller manager.
+   Follow the procedure described in step 1, and observe that the HPA description mentions
+   `ScalingLimited: False`, demonstrating that the feature is working again.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->

+No.
+
 ### Monitoring Requirements

 <!--
@@ -625,9 +735,9 @@ values. Users can get both values using
 and use them to verify that scaling events are triggered when their ratio is out
 of tolerance.

-We will update the controller-manager logs to help users understand the behavior
-of the autoscaler. The data added to the logs will include the tolerance used
-for each scaling decision.
+The [controller-manager logs have been updated](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/pkg/controller/podautoscaler/horizontal.go#L846)
+to help users understand the behavior of the autoscaler. The data added to the
+logs includes the tolerance used for each scaling decision.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -646,7 +756,9 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->

-N/A.
+Although the absolute value of the `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
+metric depends on the HPA configuration, it should be unaffected by this feature. This metric should not vary
+by more than 5%.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

@@ -658,8 +770,7 @@ This KEP is not expected to have any impact on SLIs/SLOs as it doesn't introduce
 a new HPA behavior, but merely allows users to easily change the value of a
 parameter that's otherwise difficult to update.

-Standard HPA metrics (e.g.
-`horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`) can
+The standard HPA metric `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds` can
 be used to verify the HPA controller health.

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -698,6 +809,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its degraded performance or high-error rates on the feature:
 -->

+No, this feature does not depend on any specific service.
+
 ### Scalability

 <!--
@@ -817,6 +930,8 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+API server or etcd issues do not specifically impact this feature beyond their
+effect on the HPA controller as a whole.
+
 ###### What are other known failure modes?

 <!--
@@ -832,8 +947,20 @@ For each of them, fill in the following information by copying the below template:
 - Testing: Are there any tests for failure mode? If not, describe why.
 -->

+We do not expect any new failure modes. (While setting `tolerance` below the 10% default can cause
+HPAs to scale up and down as frequently as every 30s, and larger values can stop scaling altogether
+if the metric remains within the tolerance band, in both cases the feature is working as intended.
+To make HPAs respond faster, decrease the tolerance value; to make them respond slower, increase it.)
+
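To make the trade-off described in the added text concrete, here is a minimal sketch of the tolerance check the HPA applies. This is illustrative only: `desired_replicas` is a hypothetical helper mirroring the documented scaling rule (skip scaling while the metric ratio is within the tolerance of 1.0, otherwise scale proportionally), not the controller's actual code.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Sketch of the HPA scaling rule with a configurable tolerance.

    Scaling is skipped while the metric ratio stays within `tolerance`
    of 1.0; otherwise replicas scale proportionally to the ratio.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling
    return math.ceil(current_replicas * ratio)

# Default 10% tolerance ignores a 5% overshoot:
assert desired_replicas(4, 105, 100) == 4
# A tolerance below the default reacts to the same overshoot:
assert desired_replicas(4, 105, 100, tolerance=0.01) == 5
# A very large tolerance (e.g. 20, i.e. 2000%) suppresses scaling entirely:
assert desired_replicas(4, 500, 100, tolerance=20.0) == 4
```

This matches the behavior above: lowering the tolerance makes the HPA react to smaller metric deviations, while raising it widens the dead band in which no scaling occurs.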
 ###### What steps should be taken if SLOs are not being met to determine the problem?

+If possible, increase the log level for kube-controller-manager and check the controller logs:
+1. Search for "Proposing desired replicas", verify that the tolerance is set as expected,
+   and check (using `kubectl describe hpa`) whether the ratio between the _current_ and
+   _desired_ metric values is within tolerance.
+2. Look for warnings and errors which might point to where the problem lies.
+
 ## Implementation History

 <!--
@@ -848,13 +975,18 @@ Major milestones might include:
 -->

 2025-01-21: KEP PR merged.
+2025-03-24: [Implementation PR](https://github.com/kubernetes/kubernetes/pull/130797) merged.
+2025-05-15: Kubernetes v1.33 released (includes this feature).
+2025-05-16: This KEP updated for beta graduation.

 ## Drawbacks

 <!--
 Why should this KEP _not_ be implemented?
 -->

+No major drawbacks have been identified.
+
 ## Alternatives

 <!--
@@ -863,10 +995,15 @@ not need to be as detailed as the proposal, but should include enough
 information to express the idea and why it was not acceptable.
 -->

+On non-managed Kubernetes clusters, users can instead update the cluster-wide
+`--horizontal-pod-autoscaler-tolerance` parameter.
+
 ## Infrastructure Needed (Optional)

 <!--
 Use this section if you need things from the project/SIG. Examples include a
 new subproject, repos requested, or GitHub details. Listing these here allows a
 SIG to get the process for these resources started right away.
 -->
+
+N/A.

keps/sig-autoscaling/4951-configurable-hpa-tolerance/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -17,17 +17,17 @@ see-also:
 replaces:

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.33"
+latest-milestone: "v1.34"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.33"
-  beta: TBD
+  beta: "v1.34"
   stable: TBD

 # The following PRR answers are required at alpha release
