Commit 15295c7

Merge pull request #7863 from laoj2/cpu-boost-kep
AEP-7862: Support CPU Startup Boost in VPA
2 parents 990ab04 + 455d290 commit 15295c7

1 file changed: +312 -0 lines changed
  • vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost

# AEP-7862: CPU Startup Boost

<!-- toc -->
- [AEP-7862: CPU Startup Boost](#aep-7862-cpu-startup-boost)
  - [Summary](#summary)
    - [Goals](#goals)
    - [Non-Goals](#non-goals)
  - [Proposal](#proposal)
  - [Design Details](#design-details)
    - [Workflow](#workflow)
    - [API Changes](#api-changes)
      - [Priority of `StartupBoost`](#priority-of-startupboost)
    - [Validation](#validation)
      - [Static Validation](#static-validation)
      - [Dynamic Validation](#dynamic-validation)
    - [Mitigating Failed In-Place Downsizes](#mitigating-failed-in-place-downsizes)
    - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
      - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
    - [Kubernetes Version Compatibility](#kubernetes-version-compatibility)
  - [Test Plan](#test-plan)
  - [Examples](#examples)
    - [CPU Boost Only](#cpu-boost-only)
    - [CPU Boost and Vanilla VPA](#cpu-boost-and-vanilla-vpa)
  - [Implementation History](#implementation-history)
<!-- /toc -->

## Summary

Long application startup times are a known problem for more traditional
workloads running in containers, especially Java workloads. This delay can
negatively impact the user experience and overall application performance. One
potential solution is to provide additional CPU resources to pods during their
startup phase, but this can lead to waste if the extra CPU resources are not
set back to their original values after the pods have started up.

This proposal allows VPA to boost the CPU request and limit of containers during
pod startup and to scale the CPU resources back down once the pod is `Ready`
and an optional, user-configured duration has elapsed, leveraging the
[in-place pod resize Kubernetes feature](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources).

> [!NOTE]
> This feature depends on the new `InPlaceOrRecreate` VPA mode:
> [AEP-4016: Support for in place updates in VPA](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md)

### Goals

* Allow VPA to boost the CPU request and limit of a pod's containers at pod
  (re-)creation time.
* Allow VPA to scale each boosted container down [in-place](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources)
  to the existing VPA recommendation for that container, if any, or to the CPU
  resources configured in the pod spec, as soon as the pod's [`Ready`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions)
  condition is true and `StartupBoost.CPU.Duration` has elapsed.

### Non-Goals

* Allow VPA to boost CPU resources of pods outside of the pod (re-)creation
  time.
* Allow VPA to boost memory resources.
  * This is out of scope for now because the in-place pod resize feature
    [does not support memory limit decrease yet.](https://github.com/kubernetes/enhancements/tree/758ea034908515a934af09d03a927b24186af04c/keps/sig-node/1287-in-place-update-pod-resources#memory-limit-decreases)

## Proposal

* To extend [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191)
  with a new `StartupBoost` field to allow users to configure the CPU startup
  boost.

* To extend [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236)
  with a new `StartupBoostOnly` mode to allow users to enable only the startup
  boost feature, without vanilla VPA.

* To allow CPU startup boost if a `StartupBoost` config is specified in `Auto`
  [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236)
  container policies.

## Design Details

### Workflow

1. The user first configures the CPU startup boost on their VPA object.

1. When a pod targeted by that VPA is created, the kube-apiserver invokes the
   VPA Admission Controller.

1. During pod creation, the VPA Admission Controller modifies the CPU requests
   and limits of the pod's containers to align with the `StartupBoost` policy,
   if specified.

1. The VPA Updater monitors pods targeted by the VPA object. When the pod
   condition is `Ready` and `StartupBoost.CPU.Duration` has elapsed, it scales
   down the CPU resources to the appropriate non-boosted value:
   `existing VPA recommendation for that container` (if any) OR the
   `CPU resources configured in the pod spec`.
   * The scale down is applied [in-place](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources).

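The choice of unboost target in step 4 can be sketched in a few lines. This is an illustrative Python sketch, not VPA code: the function name `unboost_target_millicores` and the use of integer millicores are assumptions made for the example.

```python
from typing import Optional


def unboost_target_millicores(
    vpa_recommendation: Optional[int],
    pod_spec_cpu: int,
) -> int:
    """Return the CPU value (in millicores) to scale back down to.

    Prefers the existing VPA recommendation for the container, if any,
    and otherwise falls back to the CPU configured in the pod spec.
    """
    if vpa_recommendation is not None:
        return vpa_recommendation
    return pod_spec_cpu
```

For instance, a container with a 500m request in its pod spec and no recommendation yet would be scaled back to 500m, while the same container with a 300m recommendation would be scaled back to 300m.
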
### API Changes

The new `StartupBoost` parameter will be added to the [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191)
and contain the following fields:
* `StartupBoost.CPU.Factor`: the factor by which to multiply the initial
  resource request and limit of the containers targeted by the VPA object.
* `StartupBoost.CPU.Value`: the target value of the CPU request or limit
  during the startup boost phase.
* [Optional] `StartupBoost.CPU.Duration`: if specified, it indicates how long
  to keep the pod boosted **after** it becomes `Ready`.

> [!IMPORTANT]
> The boosted CPU value will be capped by the
> [`--container-recommendation-max-allowed-cpu`](https://github.com/kubernetes/autoscaler/blob/4d294562e505431d518a81e8833accc0ec99c9b8/vertical-pod-autoscaler/pkg/recommender/main.go#L122)
> flag value, if set.

> [!IMPORTANT]
> Only one of `Factor` or `Value` may be specified per container policy.

> [!NOTE]
> To ensure that containers are unboosted only after their applications are
> started and ready, it is recommended to configure a
> [Readiness or a Startup probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
> for the containers that will be CPU boosted. Check the [Test Plan](#test-plan)
> section for more details on this feature's behavior for different combinations
> of probes + `StartupBoost.CPU.Duration`.

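To make the interaction between `Factor`, `Value`, and the cap concrete, here is a minimal Python sketch of the arithmetic. It is not the VPA implementation: the function name, the integer-millicore representation, rounding up fractional results, and requiring that exactly one of the two fields is set are all assumptions made for this example.

```python
import math
from typing import Optional


def boosted_cpu_millicores(
    initial: int,
    factor: Optional[float] = None,
    value: Optional[int] = None,
    max_allowed_cpu: Optional[int] = None,
) -> int:
    """Compute a boosted CPU request or limit, in millicores.

    initial:         the container's non-boosted CPU request or limit.
    factor / value:  mirror StartupBoost.CPU.Factor and StartupBoost.CPU.Value.
    max_allowed_cpu: mirrors --container-recommendation-max-allowed-cpu
                     (None means the flag is not set).
    """
    # Assumption: the sketch needs exactly one of the two fields to compute
    # anything; the proposal itself only states that both may not be set.
    if (factor is None) == (value is None):
        raise ValueError("exactly one of factor or value must be set")

    if factor is not None:
        # Assumption: fractional millicores are rounded up.
        boosted = math.ceil(initial * factor)
    else:
        boosted = value

    # The boosted value is capped by the flag value, if set.
    if max_allowed_cpu is not None:
        boosted = min(boosted, max_allowed_cpu)
    return boosted
```

With a 500m request, `factor=2.0` yields 1000m, `value=4000` yields 4000m, and `factor=2.0` with an 800m cap yields 800m.
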
We will also change the [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236):
* **NEW**: `StartupBoostOnly`: new mode that will allow users to enable only
  the startup boost feature for a container, without vanilla VPA.
* **CHANGED**: `Auto`: we will modify the existing `Auto` mode to enable both
  vanilla VPA and CPU Startup Boost (when the `StartupBoost` parameter is
  specified).

#### Priority of `StartupBoost`

The new `StartupBoost` field will take precedence over the rest of the container
resource policy configuration, functioning independently of all other fields
in [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191),
**except for**:
* [`ContainerName`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L192-L195)
* [`Mode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L196-L198)
* [`ControlledValues`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L214-L217)

This means, for example, that a container's CPU request/limit can be boosted
during startup beyond [`MaxAllowed`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L203-L206),
and that it can be boosted even if CPU is explicitly excluded from
[`ControlledResources`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L208-L212).

### Validation

#### Static Validation

* We will check that the `startupBoost` configuration is valid when VPA objects
  are created/updated:
  * The VPA update mode must be `InPlaceOrRecreate` (since it does not
    make sense to use this feature with disruptive modes of VPA).
  * The boost factor must be >= 1 (via CRD validation rules).
  * Only one of `StartupBoost.CPU.Factor` or `StartupBoost.CPU.Value` may be
    specified.
  * The [feature enablement](#feature-enablement-and-rollback) flags must be on.

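The static checks above can be collected into a single function. The following is a hedged Python sketch rather than the admission controller's code: the function name, its flat parameters, and the error strings are hypothetical, and it folds the factor check into the same function even though the proposal places that check in CRD validation rules.

```python
from typing import List, Optional, Set

# Feature gates named in this proposal.
REQUIRED_GATES = {"CPUStartupBoost", "InPlaceOrRecreate"}


def validate_startup_boost(
    update_mode: str,
    factor: Optional[float],
    value: Optional[int],
    enabled_gates: Set[str],
) -> List[str]:
    """Return human-readable validation errors (an empty list means valid)."""
    errors = []
    if update_mode != "InPlaceOrRecreate":
        errors.append("updateMode must be InPlaceOrRecreate")
    if factor is not None and value is not None:
        errors.append("only one of factor or value may be specified")
    if factor is not None and factor < 1:
        errors.append("factor must be >= 1")
    missing = REQUIRED_GATES - enabled_gates
    if missing:
        errors.append("feature gates not enabled: " + ", ".join(sorted(missing)))
    return errors
```

Returning a list instead of failing on the first problem lets the caller report every issue in one rejection message.
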
#### Dynamic Validation

* `StartupBoost.CPU.Value` must be greater than the CPU request or limit of the
  container during the boost phase; otherwise, applying the "boost" would risk
  downscaling the container.

### Mitigating Failed In-Place Downsizes

The VPA Updater **will not** evict a pod if it attempted to scale the pod down
in place (to unboost its CPU resources) and the update failed (see the
[scenarios](https://github.com/kubernetes/autoscaler/blob/0a34bf5d3a71b486bdaa440f1af7f8d50dc8e391/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md?plain=1#L164-L169) where the VPA
Updater will consider that the update failed). This is to avoid an eviction
loop:

1. A pod is created and has its CPU resources boosted.
1. The pod meets the conditions to be unboosted. VPA Updater tries to downscale
   the pod in-place and it fails.
1. VPA Updater evicts the pod. Logic flow goes back to (1).

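The rule that breaks this loop can be expressed as a small decision function. This is an illustrative Python sketch only: the function name and the returned action strings are hypothetical, and the `"evict"` branch assumes the eviction fallback that AEP-4016 describes for regular `InPlaceOrRecreate` updates.

```python
def action_after_failed_in_place_update(is_unboost: bool) -> str:
    """Pick the updater's follow-up after a failed in-place resize."""
    if is_unboost:
        # Never evict here: the recreated pod would be boosted again at
        # admission and would hit the same failed downsize, looping forever.
        return "keep-pod"
    # Assumption: regular InPlaceOrRecreate updates fall back to eviction.
    return "evict"
```
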
### Feature Enablement and Rollback

#### How can this feature be enabled / disabled in a live cluster?

* Feature gate names: `CPUStartupBoost` and `InPlaceOrRecreate` (from
  [AEP-4016](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md#feature-enablement-and-rollback))
* Components depending on the feature gates:
  * admission-controller
  * updater

Enabling both the `CPUStartupBoost` AND `InPlaceOrRecreate` feature gates will
cause the following to happen:
* admission-controller to **accept** new VPA objects being created with
  `StartupBoostOnly` configured.
* admission-controller to **boost** CPU resources.
* updater to **unboost** the CPU resources.

Disabling either the `CPUStartupBoost` OR `InPlaceOrRecreate` feature gate will
cause the following to happen:
* admission-controller to **reject** new VPA objects being created with
  `StartupBoostOnly` configured.
  * A descriptive error message should be returned to the user letting them
    know that they are using a feature-gated feature.
* admission-controller **to not** boost CPU resources, should it encounter a
  VPA configured with a `StartupBoost` config and `StartupBoostOnly` or `Auto`
  `ContainerScalingMode`.
* updater **to not** unboost CPU resources when pods meet the scale down
  requirements, should it encounter a VPA configured with a `StartupBoost`
  config and `StartupBoostOnly` or `Auto` `ContainerScalingMode`.

### Kubernetes Version Compatibility

Similarly to [AEP-4016](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support#kubernetes-version-compatibility),
`StartupBoost` configuration and `StartupBoostOnly` mode are built assuming that
VPA will be running on Kubernetes 1.33+ with the beta version of
[KEP-1287: In-Place Update of Pod Resources](https://github.com/kubernetes/enhancements/issues/1287)
enabled. If this is not the case, VPA's attempt to unboost pods may fail and the
pods may remain boosted for their whole lifecycle.

## Test Plan

Other than comprehensive unit tests, we will also add the following scenarios to
our e2e tests:

* CPU Startup Boost recommendation is applied to a pod controlled by VPA until
  it becomes `Ready` and `StartupBoost.CPU.Duration` has elapsed. Then, the pod
  is scaled back down in-place. We'll also test the following sub-cases:
  * Boost is applied to all containers of a pod.
  * Boost is applied only to a subset of containers in a pod.
  * Combinations of probes + `StartupBoost.CPU.Duration`:
    * No probes and no `StartupBoost.CPU.Duration` specified: unboost will
      likely happen immediately.
    * No probes and a 60s `StartupBoost.CPU.Duration`: unboost will likely
      happen after 60s.
    * A readiness/startup probe and no `StartupBoost.CPU.Duration` specified:
      unboost will likely happen as soon as the pod becomes `Ready`.
    * A readiness/startup probe and a 60s `StartupBoost.CPU.Duration`
      specified: unboost will likely happen 60s **after** the pod becomes
      `Ready`.

* Pod is not evicted if the in-place update fails when scaling the pod back
  down.

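The four probe/duration combinations above reduce to one timing rule: unboost once the pod has been `Ready` for at least `StartupBoost.CPU.Duration`, treating a missing duration as zero. The Python sketch below illustrates that rule under stated assumptions; the function name and the `datetime`-based signature are hypothetical and do not reflect the updater's actual code.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_unboost(
    now: datetime,
    ready_since: Optional[datetime],
    duration: Optional[timedelta] = None,
) -> bool:
    """Decide whether a boosted pod is due to be scaled back down.

    ready_since: when the pod's Ready condition became true (None if the
                 pod is not Ready yet).
    duration:    mirrors StartupBoost.CPU.Duration; None means "not set".
    """
    if ready_since is None:
        return False  # never unboost a pod that is not Ready
    # An unset duration is treated as zero: unboost as soon as the pod is Ready.
    return now >= ready_since + (duration or timedelta(0))
```

Without probes, a pod becomes `Ready` as soon as its containers are running, so `ready_since` is close to the container start time. That is why the "no probes" cases unboost immediately or after roughly the configured duration.
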
## Examples

Here are some examples of the VPA CR incorporating CPU boosting for different
scenarios.

### CPU Boost Only

All containers under the `example` deployment will receive "regular" VPA
updates, **except for** `boosted-container-name`. `boosted-container-name` will
only be CPU boosted/unboosted, because it has a `StartupBoostOnly` container
policy.

```yaml
apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: example
  updatePolicy:
    # VPA Update mode must be InPlaceOrRecreate
    updateMode: "InPlaceOrRecreate"
  resourcePolicy:
    containerPolicies:
      - containerName: "boosted-container-name"
        mode: "StartupBoostOnly"
        startupBoost:
          cpu:
            factor: 2.0
```

### CPU Boost and Vanilla VPA

All containers under the `example` deployment will receive "regular" VPA
updates, **including** `boosted-container-name`. Additionally,
`boosted-container-name` will be CPU boosted/unboosted, because it has a
`StartupBoost` config in its container policy and `Auto` container policy mode.

```yaml
apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: example
  updatePolicy:
    # VPA Update mode must be InPlaceOrRecreate
    updateMode: "InPlaceOrRecreate"
  resourcePolicy:
    containerPolicies:
      - containerName: "boosted-container-name"
        mode: "Auto" # Vanilla VPA mode + Startup Boost
        minAllowed:
          cpu: "250m"
          memory: "100Mi"
        maxAllowed:
          cpu: "500m"
          memory: "600Mi"
        # The CPU boosted resources can go beyond maxAllowed.
        startupBoost:
          cpu:
            value: 4
```

## Implementation History

* 2025-03-20: Initial version.
