Commit 07db9a1

address reviews
1 parent b3184e4 commit 07db9a1

1 file changed, +18 -17 lines changed

keps/sig-autoscaling/1610-container-resource-autoscaling/README.md

Lines changed: 18 additions & 17 deletions
@@ -733,7 +733,7 @@ You can take a look at one potential example of such test in:
 https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
 -->

-No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))
+No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/123189))

 ### Rollout, Upgrade and Rollback Planning

@@ -767,12 +767,14 @@ What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->

-- The container resource metric takes much longer time compared to other metrics.
-which can be monitored via the 1st metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
-- Increase the overall performance of HPA controller
-which can be monitored via the 2nd metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
-- Many error occurrence on the container resource metrics
-which can be monitored via the 3rd metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
+- `reconciliation_duration_seconds`: The time(seconds) that the HPA controller takes to reconcile once.
+- You should rollback if you see an increase in the overall performance of HPA controller
+- `metric_computation_duration_seconds{metric_type=ContainerResource}`: The time(seconds) that the HPA controller takes to calculate one metric.
+- You should rollback if you see the container resource metric takes much longer time compared to other metrics.
+- `reconciliations_total{error=internal}`: Number of internal errors in reconciliation of HPA controller.
+- You should rollback if you see many error occurrence on the reconciliation.
+- `metric_computation_total{error=internal,{metric_type=ContainerResource}`: Number of internal errors in the calculation of `type: ContainerResource`.
+- You should rollback if you see many error occurrence on the container resource metrics

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
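The rollback guidance in this hunk names the metrics to watch but gives no thresholds. The following is a minimal sketch of how an operator might turn the two `ContainerResource`-specific signals into a check. It assumes the metric values have already been scraped and aggregated per `metric_type` label. The `Sample` type, the function name, and both threshold constants are hypothetical and are not part of the KEP or of the HPA controller.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Pre-aggregated values for one metric_type label (hypothetical shape)."""
    metric_type: str      # e.g. "Resource" or "ContainerResource"
    mean_duration: float  # mean of metric_computation_duration_seconds, in seconds
    total: int            # metric_computation_total summed over all error labels
    internal_errors: int  # metric_computation_total with error=internal


# Assumed thresholds: the KEP only says "much longer" and "many errors".
SLOWDOWN_FACTOR = 3.0
ERROR_RATIO = 0.05


def rollback_signals(samples):
    """Return the names of the ContainerResource rollback signals that fire."""
    target = next((s for s in samples if s.metric_type == "ContainerResource"), None)
    if target is None:
        return []

    signals = []

    # Signal 1: ContainerResource computation is much slower than other metric types.
    others = [s.mean_duration for s in samples if s.metric_type != "ContainerResource"]
    if others:
        baseline = sum(others) / len(others)
        if baseline > 0 and target.mean_duration > SLOWDOWN_FACTOR * baseline:
            signals.append("slow_container_resource_metric")

    # Signal 2: a large share of ContainerResource computations end in internal errors.
    if target.total > 0 and target.internal_errors / target.total > ERROR_RATIO:
        signals.append("container_resource_metric_errors")

    return signals
```

Comparing against the mean of the other metric types is only one possible baseline. A cluster that uses a single metric type would need a historical baseline instead.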

@@ -782,7 +784,6 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->

-Not yet.
 But, as described in [Are there any tests for feature enablement/disablement?](#Are-there-any-tests-for-feature-enablement/disablement?), the tests to confirm the behavior on switching the feature gate will be added. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
@@ -826,7 +827,6 @@ Recall that end users cannot usually observe component logs or access metrics.

 - [x] Events
 - `SuccessfulRescale` event with `memory/cpu/etc resource utilization (percentage of request) above/below target`
-- Note that we cannot know if this reason is due to the `Resource` metric or `ContainerResource` in the current implementation. We'll change this reason for `ContainerResource` to `memory/cpu/etc container resource utilization (percentage of request) above/below target` so that we can distinguish.
 - [x] API .status
 - When something wrong with the container metrics, `ScalingActive` condition will be false with `FailedGetContainerResourceMetric` reason.

@@ -861,14 +861,11 @@ Pick one more of these and delete the rest.
 - Details:
 -->

-HPA controller have no metrics in it now.
-The following metrics will be implemented by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/115639))
-1. How long does each metric type take to compute the ideal replica num.
-- so that users can confirm the container resource metric doesn't take long time compared to other metrics.
-2. How long does the HPA controller take to complete reconcile one HPA object.
-- so that users can confirm the container resource metric doesn't increse the whole time of scaling.
-3. Provide the metric to show error occurrence for each metric.
-- so that users can confirm no much error occurrence on the container resource metric.
+- [x] Metrics
+- `metric_computation_duration_seconds`: The time(seconds) that the HPA controller takes to calculate one metric.
+- `metric_computation_total`: Number of metric computations.
+- `reconciliations_total`: Number of reconciliation of HPA controller.
+- `reconciliation_duration_seconds`: The time(seconds) that the HPA controller takes to reconcile once.

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -1051,6 +1048,10 @@ For each of them, fill in the following information by copying the below templat

 ###### What steps should be taken if SLOs are not being met to determine the problem?

+Check `metric_computation_duration_seconds` or `reconciliation_duration_seconds` to see which metric encountered the latency issue.
+And, if it is a latency problem only specific in `type: ContainerResource`,
+you can opt-out this feature by removing the `type: ContainerResource` metric from HPA(s).
+
 ## Implementation History

 * 2020-04-03 Initial KEP merged
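The troubleshooting text added in this hunk suggests opting out by removing the `type: ContainerResource` metric from affected HPAs. As a rough illustration, the sketch below does that on an `autoscaling/v2` HorizontalPodAutoscaler manifest that has already been loaded into a Python dictionary. The function name is hypothetical, and refusing to leave an HPA with no metrics at all is an assumption made here, not something the KEP prescribes.

```python
import copy


def drop_container_resource_metrics(hpa: dict) -> dict:
    """Return a copy of an HPA manifest without `type: ContainerResource` metrics."""
    updated = copy.deepcopy(hpa)  # leave the caller's manifest untouched
    metrics = updated.get("spec", {}).get("metrics") or []
    kept = [m for m in metrics if m.get("type") != "ContainerResource"]

    if metrics and not kept:
        # Assumption: fail instead of silently emptying spec.metrics, so the
        # operator chooses a replacement metric explicitly.
        raise ValueError("HPA would be left without any metrics")

    if metrics:
        updated["spec"]["metrics"] = kept
    return updated
```

The returned manifest can then be reviewed and applied with whatever tooling the cluster already uses. The sketch does not talk to the API server.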
