Commit f0bc845

Updates motivation, graduation criteria, scalability and monitoring
1 parent 163a90b commit f0bc845

1 file changed

+94
-9
lines changed
  • keps/sig-autoscaling/2702-graduate-hpa-api-to-GA

keps/sig-autoscaling/2702-graduate-hpa-api-to-GA/README.md

Lines changed: 94 additions & 9 deletions
@@ -1,5 +1,39 @@
11
# Graduate v2beta2 Autoscaling API to GA
22

3+
**Note:** When your KEP is complete, all of these comment blocks should be removed.
4+
5+
To get started with this template:
6+
7+
- [ x ] **Pick a hosting SIG.**
8+
Make sure that the problem space is something the SIG is interested in taking
9+
up. KEPs should not be checked in without a sponsoring SIG.
10+
- [ x ] **Create an issue in kubernetes/enhancements**
11+
When filing an enhancement tracking issue, please make sure to complete all
12+
fields in that template. One of the fields asks for a link to the KEP. You
13+
can leave that blank until this KEP is filed, and then go back to the
14+
enhancement and add the link.
15+
- [ ] **Make a copy of this template directory.**
16+
Copy this template into the owning SIG's directory and name it
17+
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
18+
leading-zero padding) assigned to your enhancement above.
19+
- [ ] **Fill out as much of the kep.yaml file as you can.**
20+
At minimum, you should fill in the "Title", "Authors", "Owning-sig",
21+
"Status", and date-related fields.
22+
- [ x ] **Fill out this file as best you can.**
23+
At minimum, you should fill in the "Summary" and "Motivation" sections.
24+
These should be easy if you've preflighted the idea of the KEP with the
25+
appropriate SIG(s).
26+
- [ ] **Create a PR for this KEP.**
27+
Assign it to people in the SIG who are sponsoring this process.
28+
- [ ] **Merge early and iterate.**
29+
Avoid getting hung up on specific details and instead aim to get the goals of
30+
the KEP clarified and merged quickly. The best way to do this is to just
31+
start with the high-level sections and fill out details incrementally in
32+
subsequent PRs.
33+
34+
Just because a KEP is merged does not mean it is complete or approved. Any KEP
35+
marked as `provisional` is a working document and subject to change. You can
36+
denote sections that are under active debate as follows:
337
## Table of Contents
438

539
<!-- toc -->
@@ -30,7 +64,11 @@
3064
This document outlines required steps to graduate autoscaling v2beta2 API to GA.
3165

3266
## Motivation
33-
67+
The HPA v2 series APIs were first introduced in November 2016 (5 years ago).
68+
The primary feature of the v2 series is adding support for multiple and custom metrics. The structure was improved
69+
slightly in the v2beta2 API which became available in May 2018 and has remained largely unchanged since then.
70+
The v2beta2 API has been used extensively and informally treated as stable.
71+
The motivation for this KEP is to push it over the line to make it formally so.
3472

3573
### Goals
3674

@@ -84,9 +122,8 @@ The following code changes must be made for graduating to GA
84122

85123
* Move API objects to `v2` and support conversion internally
86124

87-
The following code changes must be made to take <TBD> GA
125+
* Add behavior and container target E2E tests.
88126

89-
* TBD
90127

91128
### Version Skew Strategy
92129

@@ -143,6 +180,12 @@ feature, can it break the existing applications?).
143180
144181
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
145182
-->
183+
The feature can be enabled by adding `autoscaling/v2` to the `--runtime-config` flag:
184+
https://github.com/kubernetes/kubernetes/blob/ea0764452222146c47ec826977f49d7001b0ea8c/staging/src/k8s.io/apiserver/pkg/server/options/api_enablement.go#L45
185+
186+
Adding `api/all` will also include `autoscaling/v2`.
187+
188+
The feature can be disabled by removing the `--runtime-config` entry.
146189

147190
###### What happens if we reenable the feature if it was previously rolled back?
148191

@@ -199,6 +242,18 @@ Even if applying deprecation policies, they may still surprise some users.
199242
<!--
200243
This section must be completed when targeting beta to a release.
201244
-->
245+
The HPA requires the `metrics.k8s.io` APIs to be available in the cluster to operate. This API is served by the
246+
Metrics Server. An operator can verify the Metrics Server is available to provide resource metrics to the HPA by running
247+
the command `kubectl get apiservices` and looking for the status of `v1beta1.metrics.k8s.io` (version subject to change).
248+
Operators should take care to make sure Metrics Server is up and running to maintain resource autoscaling.
249+
250+
The v2 HPA requires the `custom.metrics.k8s.io` and `external.metrics.k8s.io` APIs as well to retrieve custom and
251+
external metrics. There is no default implementation of these APIs and cluster operators must install an "adapter" for
252+
their metrics backend (e.g. [Prometheus](https://github.com/kubernetes-sigs/prometheus-adapter)).
253+
254+
An operator can verify the adapter is working properly by running the same `kubectl get apiservices` command and looking for the
255+
`v1beta1.custom.metrics.k8s.io` and `v1beta1.external.metrics.k8s.io` APIs (usually served by the same adapter).
256+
Care should be taken to ensure the adapter and the specific metrics backend are available to maintain custom metric autoscaling.
202257

203258
###### How can an operator determine if the feature is in use by workloads?
204259

@@ -207,6 +262,7 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
207262
checking if there are objects with field X set) may be a last resort. Avoid
208263
logs or events for this purpose.
209264
-->
265+
All HPA objects are stored in v1 format on disk. They are up-converted to the requested version and down-converted upon update.
210266

211267
###### How can someone using this feature know that it is working for their instance?
212268

@@ -219,13 +275,28 @@ and operation of this feature.
219275
Recall that end users cannot usually observe component logs or access metrics.
220276
-->
221277

222-
- [ ] Events
278+
- [ x ] Events
223279
- Event Reason:
224-
- [ ] API .status
280+
The event type `Normal`, reason `SuccessfulRescale`, note `New size: N; reason: FOO` indicates autoscaling is operating normally.
281+
Abnormal events of type `Warning` include reasons such as `FailedRescale` and `FailedComputeMetricsReplicas` and
282+
will include details about the error in the note.
283+
- [ x ] API .status
225284
- Condition name:
285+
There are three condition types which indicate the operating status of the HPA. They are `ScalingEnabled`, `AbleToScale`
286+
and `ScalingLimited` (see type [comments](https://pkg.go.dev/k8s.io/api/autoscaling/v2beta2#HorizontalPodAutoscalerConditionType)).
287+
Under normal operating circumstances `ScalingEnabled` and `AbleToScale` should be status `true`, indicating the HPA is
288+
successfully reconciling the scale. `ScalingLimited` indicates user configuration is limiting the "ideal" scale with a
289+
minimum, maximum, rate or delay. Which limit is the cause will be indicated in the message.
290+
It's normal for `ScalingLimited` to switch between `true` and `false` periodically.
226291
- Other field:
227-
- [ ] Other (treat as last resort)
292+
- [ x ] Other (treat as last resort)
228293
- Details:
294+
The HPA status includes the current observed metric values, one for each given target. Using these
295+
values an operator can verify the HPA is maintaining the desired target for the dominant metric.
296+
The operator can also see the number of pods the HPA observed under `status.currentReplicas` and the most
297+
recent recommendation under `status.desiredReplicas`.
298+
The latest observed generation is echoed back in status so an operator can verify the HPA is keeping up-to-date
299+
with configuration changes.
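
The status checks described above can be sketched in Go. This is a minimal illustration, not the controller's code: `hpaView` and its methods are hypothetical names, and the struct is a simplified stand-in for the real API object (where, for example, the observed generation is a pointer).

```go
package main

import "fmt"

// hpaView is a hypothetical, trimmed-down view of an HPA object holding only
// the fields discussed above. It is not the real autoscaling API type.
type hpaView struct {
	Generation         int64 // metadata.generation
	ObservedGeneration int64 // status.observedGeneration
	CurrentReplicas    int32 // status.currentReplicas
	DesiredReplicas    int32 // status.desiredReplicas
}

// upToDate reports whether the HPA has processed the latest configuration.
func (h hpaView) upToDate() bool {
	return h.ObservedGeneration >= h.Generation
}

// converged reports whether the observed pod count matches the most recent
// recommendation.
func (h hpaView) converged() bool {
	return h.CurrentReplicas == h.DesiredReplicas
}

func main() {
	h := hpaView{Generation: 4, ObservedGeneration: 3, CurrentReplicas: 5, DesiredReplicas: 8}
	fmt.Println("up to date:", h.upToDate())
	fmt.Println("converged:", h.converged())
}
```

An operator would read the same fields from `kubectl get hpa -o yaml` output; a lagging observed generation suggests the HPA has not yet picked up a configuration change.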
229300

230301
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
231302

@@ -298,6 +369,22 @@ For beta, this section is required: reviewers must answer these questions.
298369
For GA, this section is required: approvers should be able to confirm the
299370
previous answers based on experience in the field.
300371
-->
372+
The HPA v2 APIs allow users to configure multiple metrics, each with a separate target. A recommendation is calculated
373+
for each metric and the largest recommendation is used. The more metrics are added to a given HPA, the longer it will
374+
take to reconcile. The HPA is single-threaded and processes recommendations one at a time. The default reconciliation
375+
period is 15 seconds. If there is too much work to do, reconciliation will slow down and happen less frequently than
376+
every 15 seconds. This will cause autoscaling to be less responsive at high scale.
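
To make the "largest recommendation is used" rule concrete, here is a minimal Go sketch. The type `metricRecommendation` and the function `largestRecommendation` are hypothetical names chosen for illustration; the real controller computes each per-metric recommendation from live metrics and is considerably more involved.

```go
package main

import "fmt"

// metricRecommendation is a hypothetical pairing of a metric name with the
// replica count that this metric alone would recommend.
type metricRecommendation struct {
	Metric   string
	Replicas int32
}

// largestRecommendation walks the per-metric recommendations one at a time
// and returns the one asking for the most replicas. The second return value
// is false when no recommendations were supplied.
func largestRecommendation(recs []metricRecommendation) (metricRecommendation, bool) {
	if len(recs) == 0 {
		return metricRecommendation{}, false
	}
	best := recs[0]
	for _, r := range recs[1:] {
		if r.Replicas > best.Replicas {
			best = r
		}
	}
	return best, true
}

func main() {
	recs := []metricRecommendation{
		{Metric: "cpu", Replicas: 4},
		{Metric: "memory", Replicas: 3},
		{Metric: "requests_per_second", Replicas: 7},
	}
	if best, ok := largestRecommendation(recs); ok {
		fmt.Printf("%s wins with %d replicas\n", best.Metric, best.Replicas)
	}
}
```

The loop is linear in the number of metrics, which mirrors why adding metrics to an HPA lengthens each reconciliation.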
377+
378+
Previously, v1 scaled along two dimensions: the number of HPAs and the number of pods selected by each HPA (linearly).
379+
Now it will scale with the number of metrics defined in HPAs and the number of pods selected by each metric (linearly).
380+
381+
Additionally, v2 adds a behavior structure which allows the user to configure the rate and delay of scaling up and down.
382+
Enforcing these constraints requires storing previous recommendations and scaling events in memory. The longer the
383+
configured interval, the more memory is used. The maximum window allowed is 60 minutes ([code](https://pkg.go.dev/k8s.io/api/autoscaling/v2beta2#HPAScalingRules))
384+
so at the default 15-second reconciliation period there are at most 240 recommendations / events per configured metric. Each recommendation is an `int32` and a `time.Time`.
385+
Each scaling event is an `int32`, a `time.Time` and a `bool` ([code](https://pkg.go.dev/k8s.io/api/autoscaling/v2beta2#HPAScalingRules))
386+
so the memory footprint is relatively small.
387+
It will scale linearly with the number of metrics defined and the size of the HPA's configured window.
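
The arithmetic behind the 240 figure can be checked with a short Go sketch. The names `recordsPerWindow`, `recommendation` and `scaleEvent` are hypothetical and only mirror the field types listed above; the actual in-memory layout used by the controller may differ.

```go
package main

import (
	"fmt"
	"time"
	"unsafe"
)

// recommendation and scaleEvent are hypothetical records shaped after the
// description above: an int32 plus a time.Time, and for events also a bool.
type recommendation struct {
	replicas int32
	at       time.Time
}

type scaleEvent struct {
	change   int32
	at       time.Time
	outdated bool
}

// recordsPerWindow returns how many records accumulate when one record is
// stored per reconciliation period over the whole window.
func recordsPerWindow(windowSeconds, periodSeconds int32) int32 {
	if periodSeconds <= 0 {
		return 0
	}
	return windowSeconds / periodSeconds
}

func main() {
	// 60-minute maximum window, 15-second default reconciliation period.
	n := recordsPerWindow(60*60, 15)
	fmt.Println(n) // 240

	// Rough per-metric footprint on this platform, assuming one record of
	// each kind per period.
	perPair := unsafe.Sizeof(recommendation{}) + unsafe.Sizeof(scaleEvent{})
	fmt.Println(uintptr(n)*perPair, "bytes per metric (approximate)")
}
```

Even with both kinds of record kept for the full window, the result is on the order of kilobytes per metric, which is consistent with the claim that the footprint is relatively small.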
301388

302389
###### Will enabling / using this feature result in any new API calls?
303390

@@ -323,6 +410,7 @@ Describe them, providing:
323410
- Supported number of objects per cluster
324411
- Supported number of objects per namespace (for namespace-scoped objects)
325412
-->
413+
Yes. It will introduce the new autoscaling/v2 API types.
326414

327415
###### Will enabling / using this feature result in any new calls to the cloud provider?
328416

@@ -395,9 +483,6 @@ For each of them, fill in the following information by copying the below templat
395483
- Testing: Are there any tests for failure mode? If not, describe why.
396484
-->
397485

398-
###### What steps should be taken if SLOs are not being met to determine the problem?
399-
400-
401486
## Implementation History
402487

403488
* HPA v1
