No, but tests confirming the behavior when switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/123189))
### Rollout, Upgrade and Rollback Planning
###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
- `reconciliation_duration_seconds`: The time (in seconds) the HPA controller takes for one reconciliation.
  - You should roll back if you see the overall reconciliation time of the HPA controller increase.
- `metric_computation_duration_seconds{metric_type=ContainerResource}`: The time (in seconds) the HPA controller takes to calculate one metric.
  - You should roll back if you see the container resource metric take much longer to compute than other metric types.
- `reconciliations_total{error=internal}`: The number of internal errors during reconciliation in the HPA controller.
  - You should roll back if you see many errors in reconciliation.
- `metric_computation_total{error=internal, metric_type=ContainerResource}`: The number of internal errors in the calculation of `type: ContainerResource` metrics.
  - You should roll back if you see many errors in the container resource metric calculation.
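These signals could be wired into alerting. Below is a minimal sketch of Prometheus alerting rules; it assumes the duration metric is exported as a Prometheus histogram, and the metric names, labels, and thresholds are placeholders to adapt to the metrics actually exported in your cluster:

```yaml
groups:
- name: hpa-container-resource-rollback-signals
  rules:
  - alert: HPAContainerResourceComputationSlow
    # p90 time to compute a ContainerResource metric over the last 10 minutes;
    # the 5s threshold is an illustrative placeholder.
    expr: |
      histogram_quantile(0.90, sum by (le) (
        rate(metric_computation_duration_seconds_bucket{metric_type="ContainerResource"}[10m])
      )) > 5
    for: 15m
  - alert: HPAContainerResourceComputationErrors
    # A sustained stream of internal errors while computing ContainerResource metrics.
    expr: |
      sum(rate(metric_computation_total{error="internal", metric_type="ContainerResource"}[10m])) > 0
    for: 15m
```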
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
<!--
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
Not yet. But, as described in [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement), tests confirming the behavior when switching the feature gate will be added. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
###### How can someone using this feature know that it is working for their instance?

<!--
Recall that end users cannot usually observe component logs or access metrics.
-->
- [x] Events
  - `SuccessfulRescale` event with `memory/cpu/etc resource utilization (percentage of request) above/below target`
- [x] API .status
  - When something is wrong with the container metrics, the `ScalingActive` condition will be false with the `FailedGetContainerResourceMetric` reason, as in the sketch below.
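A rough illustration of how this could surface in the HPA's `.status` (the field values here are illustrative, not verbatim from the implementation):

```yaml
status:
  conditions:
  - type: ScalingActive
    status: "False"
    reason: FailedGetContainerResourceMetric
    # The message below is illustrative; the controller's actual wording may differ.
    message: "the HPA was unable to compute the replica count: failed to get container resource metric"
```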
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
  - Details:
-->
- [x] Metrics
  - `metric_computation_duration_seconds`: The time (in seconds) the HPA controller takes to calculate one metric.
  - `metric_computation_total`: The number of metric computations.
  - `reconciliations_total`: The number of reconciliations of the HPA controller.
  - `reconciliation_duration_seconds`: The time (in seconds) the HPA controller takes for one reconciliation.
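As a sketch of how an operator might use these — assuming the duration metrics are exported as Prometheus histograms with a `metric_type` label, matching the rollback signals above; the exact exported names may differ — per-type computation latency can be precomputed with a recording rule:

```yaml
groups:
- name: hpa-metric-computation
  rules:
  # p90 metric-computation latency per metric type, so ContainerResource can be
  # compared against the other metric types at a glance.
  - record: hpa:metric_computation_duration_seconds:p90
    expr: |
      histogram_quantile(0.90, sum by (le, metric_type) (
        rate(metric_computation_duration_seconds_bucket[5m])
      ))
```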
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

<!--
Think through this both in small and large cases, again with respect to the
supported limits.
-->
No.
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
<!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?

Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->

No.
### Troubleshooting
###### What steps should be taken if SLOs are not being met to determine the problem?
Check `metric_computation_duration_seconds` or `reconciliation_duration_seconds` to see which metric is encountering the latency issue. If the latency problem is specific to `type: ContainerResource`, you can opt out of this feature by removing the `type: ContainerResource` metric from your HPA(s), as in the sketch below.
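For illustration, here is a minimal HPA manifest using such a metric (all names are hypothetical placeholders). Deleting the `ContainerResource` entry, or replacing it with a plain `Resource` metric, opts that HPA out of the feature:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app      # hypothetical workload
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # Removing this entry (or switching it to `type: Resource`) disables
  # per-container resource scaling for this HPA.
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app       # hypothetical container name
      target:
        type: Utilization
        averageUtilization: 60
```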