Commit b3fadb9

Merge pull request #4406 from sanposhiho/ga-graduate
KEP-1610: graduate ContainerResource to stable
2 parents dfded09 + 07db9a1 commit b3fadb9

3 files changed

+38
-21
lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
11
kep-number: 1610
22
beta:
33
approver: "@johnbelamaric"
4+
stable:
5+
approver: "@johnbelamaric"

keps/sig-autoscaling/1610-container-resource-autoscaling/README.md

Lines changed: 32 additions & 17 deletions
@@ -733,7 +733,7 @@ You can take a look at one potential example of such test in:
733733
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
734734
-->
735735

736-
No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))
736+
No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/123189))
737737

738738
### Rollout, Upgrade and Rollback Planning
739739

@@ -767,12 +767,14 @@ What signals should users be paying attention to when the feature is young
767767
that might indicate a serious problem?
768768
-->
769769

770-
- The container resource metric takes much longer time compared to other metrics.
771-
which can be monitored via the 1st metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
772-
- Increase the overall performance of HPA controller
773-
which can be monitored via the 2nd metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
774-
- Many error occurrence on the container resource metrics
775-
which can be monitored via the 3rd metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
770+
- `reconciliation_duration_seconds`: The time(seconds) that the HPA controller takes to reconcile once.
771+
- You should roll back if you see an increase in the overall reconciliation time of the HPA controller.
772+
- `metric_computation_duration_seconds{metric_type=ContainerResource}`: The time(seconds) that the HPA controller takes to calculate one metric.
773+
- You should roll back if you see that the container resource metric takes much longer to compute than other metrics.
774+
- `reconciliations_total{error=internal}`: Number of internal errors in reconciliation of HPA controller.
775+
- You should roll back if you see many errors occurring during reconciliation.
776+
- `metric_computation_total{error=internal,metric_type=ContainerResource}`: Number of internal errors in the calculation of `type: ContainerResource`.
777+
- You should roll back if you see many errors occurring for the container resource metrics.
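
The rollback signals listed above could be watched with Prometheus alerting rules. The fragment below is only a sketch under stated assumptions: it assumes the metrics are exposed by kube-controller-manager under the `horizontal_pod_autoscaler_controller_` prefix, that the duration metric is a histogram (so a `_bucket` series exists), and that the label values match the ones named in the list. The group name, alert names, rate windows, and thresholds are hypothetical and should be tuned per cluster.

```yaml
# Sketch of Prometheus alerting rules for the rollback signals.
# Assumptions: metric prefix, histogram type, and all thresholds below.
groups:
- name: hpa-container-resource-rollout   # hypothetical group name
  rules:
  # Internal errors while computing ContainerResource metrics.
  - alert: HPAContainerResourceInternalErrors   # hypothetical alert name
    expr: |
      sum(rate(horizontal_pod_autoscaler_controller_metric_computation_total{error="internal",metric_type="ContainerResource"}[5m])) > 0
    for: 10m
  # 99th percentile computation latency for ContainerResource metrics.
  - alert: HPAContainerResourceSlowComputation  # hypothetical alert name
    expr: |
      histogram_quantile(0.99,
        sum by (le) (rate(horizontal_pod_autoscaler_controller_metric_computation_duration_seconds_bucket{metric_type="ContainerResource"}[5m]))
      ) > 1   # illustrative threshold in seconds
    for: 10m
```

Comparing the same latency expression with other `metric_type` values gives the "compared to other metrics" view that the list describes.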
776778

777779
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
778780

@@ -782,7 +784,6 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
782784
are missing a bunch of machinery and tooling and can't do that now.
783785
-->
784786

785-
Not yet.
786787
But, as described in [Are there any tests for feature enablement/disablement?](#Are-there-any-tests-for-feature-enablement/disablement?), the tests to confirm the behavior on switching the feature gate will be added. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))
787788

788789
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
@@ -826,7 +827,6 @@ Recall that end users cannot usually observe component logs or access metrics.
826827

827828
- [x] Events
828829
- `SuccessfulRescale` event with `memory/cpu/etc resource utilization (percentage of request) above/below target`
829-
- Note that we cannot know if this reason is due to the `Resource` metric or `ContainerResource` in the current implementation. We'll change this reason for `ContainerResource` to `memory/cpu/etc container resource utilization (percentage of request) above/below target` so that we can distinguish.
830830
- [x] API .status
831831
- When something wrong with the container metrics, `ScalingActive` condition will be false with `FailedGetContainerResourceMetric` reason.
832832

@@ -861,14 +861,11 @@ Pick one more of these and delete the rest.
861861
- Details:
862862
-->
863863

864-
HPA controller have no metrics in it now.
865-
The following metrics will be implemented by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/115639))
866-
1. How long does each metric type take to compute the ideal replica num.
867-
- so that users can confirm the container resource metric doesn't take long time compared to other metrics.
868-
2. How long does the HPA controller take to complete reconcile one HPA object.
869-
- so that users can confirm the container resource metric doesn't increse the whole time of scaling.
870-
3. Provide the metric to show error occurrence for each metric.
871-
- so that users can confirm no much error occurrence on the container resource metric.
864+
- [x] Metrics
865+
- `metric_computation_duration_seconds`: The time(seconds) that the HPA controller takes to calculate one metric.
866+
- `metric_computation_total`: Number of metric computations.
867+
- `reconciliations_total`: Number of reconciliations of the HPA controller.
868+
- `reconciliation_duration_seconds`: The time(seconds) that the HPA controller takes to reconcile once.
872869

873870
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
874871

@@ -996,6 +993,20 @@ This through this both in small and large cases, again with respect to the
996993

997994
No.
998995

996+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
997+
998+
<!--
999+
Focus not just on happy cases, but primarily on more pathological cases
1000+
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
1001+
If any of the resources can be exhausted, how this is mitigated with the existing limits
1002+
(e.g. pods per node) or new limits added by this KEP?
1003+
1004+
Are there any tests that were run/should be run to understand performance characteristics better
1005+
and validate the declared limits?
1006+
-->
1007+
1008+
No.
1009+
9991010
### Troubleshooting
10001011

10011012
<!--
@@ -1037,6 +1048,10 @@ For each of them, fill in the following information by copying the below templat
10371048

10381049
###### What steps should be taken if SLOs are not being met to determine the problem?
10391050

1051+
Check `metric_computation_duration_seconds` or `reconciliation_duration_seconds` to see which metric encountered the latency issue.
1052+
If the latency problem is specific to `type: ContainerResource`,
1053+
you can opt out of this feature by removing the `type: ContainerResource` metric from the affected HPA(s).
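
For reference, a `type: ContainerResource` metric in an `autoscaling/v2` HPA looks roughly like the sketch below. The HPA name, target Deployment, container name, replica bounds, and utilization target are hypothetical placeholders. Opting out means removing the marked entry from `spec.metrics` (or replacing it with another metric type).

```yaml
# Minimal sketch of an HPA that uses a ContainerResource metric.
# All names and numbers here are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa            # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # hypothetical
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # Remove this entry to opt out of container resource autoscaling.
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application   # hypothetical container name
      target:
        type: Utilization
        averageUtilization: 60
```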
1054+
10401055
## Implementation History
10411056

10421057
* 2020-04-03 Initial KEP merged

keps/sig-autoscaling/1610-container-resource-autoscaling/kep.yaml

Lines changed: 4 additions & 4 deletions
@@ -12,16 +12,16 @@ approvers:
1212
- "@josephburnett"
1313
- "@gjtempleton"
1414
creation-date: 2020-02-18
15-
last-updated: 2023-02-02
15+
last-updated: 2024-01-15
1616
status: implementable
1717

18-
latest-milestone: "1.27"
19-
stage: "beta"
18+
latest-milestone: "1.30"
19+
stage: "stable"
2020

2121
milestone:
2222
alpha: "v1.20"
2323
beta: "v1.27"
24-
stable: "v1.29"
24+
stable: "v1.30"
2525

2626
feature-gates:
2727
- name: HPAContainerMetrics
