|
| 1 | +# AEP-7862: CPU Startup Boost |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [AEP-7862: CPU Startup Boost](#aep-7862-cpu-startup-boost) |
| 5 | + - [Summary](#summary) |
| 6 | + - [Goals](#goals) |
| 7 | + - [Non-Goals](#non-goals) |
| 8 | + - [Proposal](#proposal) |
| 9 | + - [Design Details](#design-details) |
| 10 | + - [Workflow](#workflow) |
| 11 | + - [API Changes](#api-changes) |
| 12 | + - [Priority of `StartupBoost`](#priority-of-startupboost) |
| 13 | + - [Validation](#validation) |
| 14 | + - [Static Validation](#static-validation) |
| 15 | + - [Dynamic Validation](#dynamic-validation) |
| 16 | + - [Mitigating Failed In-Place Downsizes](#mitigating-failed-in-place-downsizes) |
| 17 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 18 | + - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster) |
| 19 | + - [Kubernetes Version Compatibility](#kubernetes-version-compatibility) |
| 20 | + - [Test Plan](#test-plan) |
| 21 | + - [Examples](#examples) |
| 22 | + - [CPU Boost Only](#cpu-boost-only) |
| 23 | + - [CPU Boost and Vanilla VPA](#cpu-boost-and-vanilla-vpa) |
| 24 | + - [Implementation History](#implementation-history) |
| 25 | +<!-- /toc --> |
| 26 | + |
| 27 | +## Summary |
| 28 | + |
| 29 | +Long application start time is a known problem for more traditional workloads |
| 30 | +running in containers, especially Java workloads. This delay can |
| 31 | +negatively impact the user experience and overall application performance. One |
| 32 | +potential solution is to provide additional CPU resources to pods during their |
| 33 | +startup phase, but this can lead to waste if the extra CPU resources are not |
| 34 | +set back to their original values after the pods have started up. |
| 35 | + |
| 36 | +This proposal allows VPA to boost the CPU request and limit of containers during |
| 37 | +the pod startup and to scale the CPU resources back down once the pod is |
| 38 | +`Ready` and an optional, configured duration has elapsed, leveraging the |
| 39 | +[in-place pod resize Kubernetes feature](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources). |
| 40 | + |
| 41 | +> [!NOTE] |
| 42 | +> This feature depends on the new `InPlaceOrRecreate` VPA mode: |
| 43 | +> [AEP-4016: Support for in place updates in VPA](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md) |
| 44 | + |
| 45 | +### Goals |
| 46 | + |
| 47 | +* Allow VPA to boost the CPU request and limit of a pod's containers during the |
| 48 | +pod (re-)creation time. |
| 49 | +* Allow VPA to scale pods down [in-place](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources) |
| 50 | +to the existing VPA recommendation for that container, if any, or to the CPU |
| 51 | +resources configured in the pod spec, as soon as their [`Ready`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions) |
| 52 | +condition is true and `StartupBoost.CPU.Duration` has elapsed. |
| 53 | + |
| 54 | +### Non-Goals |
| 55 | + |
| 56 | +* Allow VPA to boost CPU resources of pods outside of the pod (re-)creation |
| 57 | +time. |
| 58 | +* Allow VPA to boost memory resources. |
| 59 | + * This is out of scope for now because the in-place pod resize feature |
| 60 | + [does not support memory limit decrease yet.](https://github.com/kubernetes/enhancements/tree/758ea034908515a934af09d03a927b24186af04c/keps/sig-node/1287-in-place-update-pod-resources#memory-limit-decreases) |
| 61 | + |
| 62 | +## Proposal |
| 63 | + |
| 64 | +* To extend [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191) |
| 65 | +with a new `StartupBoost` field to allow users to configure the CPU startup |
| 66 | +boost. |
| 67 | + |
| 68 | +* To extend [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236) |
| 69 | +with a new `StartupBoostOnly` mode to allow users to enable only the startup |
| 70 | +boost feature, without regular ("vanilla") VPA updates. |
| 71 | + |
| 72 | +* To allow CPU startup boost if a `StartupBoost` config is specified in `Auto` |
| 73 | +[`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236) |
| 74 | +container policies. |
| 75 | + |
| 76 | +## Design Details |
| 77 | + |
| 78 | +### Workflow |
| 79 | + |
| 80 | +1. The user first configures the CPU startup boost on their VPA object. |
| 81 | + |
| 82 | +1. When a pod targeted by that VPA is created, the kube-apiserver invokes the |
| 83 | +VPA Admission Controller. |
| 84 | + |
| 85 | +1. During pod creation, the VPA Admission Controller modifies the CPU requests |
| 86 | +and limits of the pod's containers to align with the `StartupBoost` policy, if |
| 87 | +one is specified. |
| 88 | + |
| 89 | +1. The VPA Updater monitors pods targeted by the VPA object and when the pod |
| 90 | +condition is `Ready` and `StartupBoost.CPU.Duration` has elapsed, it scales |
| 91 | +down the CPU resources to the appropriate non-boosted value: |
| 92 | +`existing VPA recommendation for that container` (if any) OR the |
| 93 | +`CPU resources configured in the pod spec`. |
| 94 | + * The scale down is applied [in-place](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources). |
| 95 | + |
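
The unboost condition in step 4 can be sketched as follows. This is only an illustration of the rule as described in this proposal, not the Updater's actual code: the function names, the `ready_since` argument, and the use of plain strings for CPU quantities are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_unboost(
    ready_since: Optional[datetime],
    boost_duration: Optional[timedelta],
    now: datetime,
) -> bool:
    """Return True once the pod is Ready and the optional duration has elapsed.

    ready_since is the time the pod's Ready condition became true, or None
    while the pod is not Ready (hypothetical input for this sketch).
    """
    if ready_since is None:
        return False  # a pod that is not Ready yet stays boosted
    wait = boost_duration or timedelta(0)  # no duration: unboost right away
    return now - ready_since >= wait


def unboost_target(vpa_recommendation_cpu: Optional[str], pod_spec_cpu: str) -> str:
    """Pick the non-boosted CPU value: the VPA recommendation if any, else the pod spec."""
    if vpa_recommendation_cpu is not None:
        return vpa_recommendation_cpu
    return pod_spec_cpu
```

With a 60s duration, a pod that became `Ready` 59 seconds ago stays boosted, and it qualifies for the in-place scale down one second later.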
| 96 | +### API Changes |
| 97 | + |
| 98 | +The new `StartupBoost` parameter will be added to the [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191) |
| 99 | +and contain the following fields: |
| 100 | + * `StartupBoost.CPU.Factor`: the factor by which to multiply the initial |
| 101 | + CPU request and limit of the containers targeted by the VPA object. |
| 102 | + * `StartupBoost.CPU.Value`: the target value of the CPU request or limit |
| 103 | + during the startup boost phase. |
| 104 | + * [Optional] `StartupBoost.CPU.Duration`: if specified, how long to keep the |
| 105 | + pod boosted **after** it becomes `Ready`. |
| 106 | + |
| 107 | +> [!IMPORTANT] |
| 108 | +> The boosted CPU value will be capped by the value of the |
| 109 | +> [`--container-recommendation-max-allowed-cpu`](https://github.com/kubernetes/autoscaler/blob/4d294562e505431d518a81e8833accc0ec99c9b8/vertical-pod-autoscaler/pkg/recommender/main.go#L122) |
| 110 | +> flag, if set. |
| 111 | + |
| 112 | +> [!IMPORTANT] |
| 113 | +> Only one of `Factor` or `Value` may be specified per container policy. |
| 114 | + |
| 115 | + |
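
To make the interaction between `Factor`, `Value`, and the cap concrete, here is a minimal sketch of how a boosted CPU amount could be derived for one container. It works in millicores for simplicity; the function name and parameters are hypothetical, and requiring exactly one of the two fields is an assumption of this sketch (the proposal only states that both may not be set together).

```python
from typing import Optional


def boosted_cpu_millis(
    original_millis: int,
    factor: Optional[float] = None,
    value_millis: Optional[int] = None,
    max_allowed_millis: Optional[int] = None,
) -> int:
    """Compute the boosted CPU amount (in millicores) for one container."""
    # Sketch assumption: exactly one of factor / value is configured.
    if (factor is None) == (value_millis is None):
        raise ValueError("specify exactly one of factor or value")
    if factor is not None:
        boosted = round(original_millis * factor)  # multiply the initial amount
    else:
        boosted = value_millis  # absolute target during the boost phase
    # Optional global cap, standing in for the recommender flag value.
    if max_allowed_millis is not None:
        boosted = min(boosted, max_allowed_millis)
    return boosted
```

For example, a container requesting 500m with a factor of 2.0 would be boosted to 1000m, while a `Value` of 4 CPUs (4000m) with a 2000m cap would be boosted to 2000m.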
| 116 | +> [!NOTE] |
| 117 | +> To ensure that containers are unboosted only after their applications are |
| 118 | +> started and ready, it is recommended to configure a |
| 119 | +> [Readiness or a Startup probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) |
| 120 | +> for the containers that will be CPU boosted. Check the [Test Plan](#test-plan) |
| 121 | +> section for more details on this feature's behavior for different combinations |
| 122 | +> of probes + `StartupBoost.CPU.Duration`. |
| 123 | + |
| 124 | +We will also add a new mode to the [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236): |
| 125 | + * **NEW**: `StartupBoostOnly`: new mode that will allow users to enable only |
| 126 | + the startup boost feature for a container, without regular VPA updates. |
| 127 | + * **CHANGED**: `Auto`: we will modify the existing `Auto` mode to enable both |
| 128 | + vanilla VPA and CPU Startup Boost (when the `StartupBoost` parameter is |
| 129 | + specified). |
| 130 | + |
| 131 | +#### Priority of `StartupBoost` |
| 132 | + |
| 133 | +The new `StartupBoost` field will take precedence over the rest of the container |
| 134 | +resource policy configurations, functioning independently of all other fields |
| 135 | +in [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191), |
| 136 | +**except for**: |
| 137 | + * [`ContainerName`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L192-L195) |
| 138 | + * [`Mode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L196-L198) |
| 139 | + * [`ControlledValues`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L214-L217) |
| 140 | + |
| 141 | +This means that a container's CPU request/limit can be boosted during startup |
| 142 | +beyond [`MaxAllowed`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L203-L206), |
| 143 | +for example, or boosted even if CPU is explicitly |
| 144 | +excluded from [`ControlledResources`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L208-L212). |
| 145 | + |
| 146 | +### Validation |
| 147 | + |
| 148 | +#### Static Validation |
| 149 | + |
| 150 | +* We will check that the `startupBoost` configuration is valid when VPA objects |
| 151 | +are created/updated: |
| 152 | + * The VPA autoscaling mode must be `InPlaceOrRecreate` (since it does not |
| 153 | + make sense to use this feature with disruptive modes of VPA). |
| 154 | + * The boost factor is >= 1 (via CRD validation rules) |
| 155 | + * Only one of `StartupBoost.CPU.Factor` or `StartupBoost.CPU.Value` is |
| 156 | + specified |
| 157 | + * The [feature enablement](#feature-enablement-and-rollback) flags must be on. |
| 158 | + |
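
The static checks listed above can be summarized in a short sketch. It is illustrative only: in the proposal the factor bound is enforced through CRD validation rules, and the function name, argument shapes, and error strings below are assumptions made for the example.

```python
from typing import List, Optional


def validate_startup_boost(
    update_mode: str,
    factor: Optional[float],
    value: Optional[str],
    enabled_gates: frozenset,
) -> List[str]:
    """Return a list of validation errors; an empty list means the config is valid."""
    errors = []
    if update_mode != "InPlaceOrRecreate":
        errors.append("updateMode must be InPlaceOrRecreate")
    if factor is not None and value is not None:
        errors.append("only one of factor or value may be set")
    if factor is not None and factor < 1:
        errors.append("factor must be >= 1")
    # Both gates named in this proposal must be enabled.
    missing = {"CPUStartupBoost", "InPlaceOrRecreate"} - set(enabled_gates)
    if missing:
        errors.append("feature gates not enabled: " + ", ".join(sorted(missing)))
    return errors
```

Collecting all errors instead of stopping at the first one is a design choice of this sketch; it lets a user fix every problem in a rejected VPA object in one pass.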
| 159 | + |
| 160 | +#### Dynamic Validation |
| 161 | + |
| 162 | +* When the boost is applied, `StartupBoost.CPU.Value` must be greater than the |
| 163 | + container's CPU request or limit; otherwise the boost would downscale the container. |
| 164 | + |
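
A minimal sketch of this dynamic check, in millicores, is shown below. It assumes the absolute value must exceed every CPU amount it would replace (both request and limit, when set); the proposal does not spell out that detail, and the function name is hypothetical.

```python
from typing import Optional


def value_boost_is_safe(
    value_millis: int,
    request_millis: Optional[int],
    limit_millis: Optional[int],
) -> bool:
    """True if an absolute boost value is above every CPU amount it would replace."""
    current = [m for m in (request_millis, limit_millis) if m is not None]
    # With no request or limit set there is nothing to downscale.
    return all(value_millis > m for m in current)
```

For example, a `Value` of 800m is rejected for a container with a 500m request and a 1000m limit, because applying it would lower the limit.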
| 165 | +### Mitigating Failed In-Place Downsizes |
| 166 | + |
| 167 | +The VPA Updater **will not** evict a pod if it attempted to scale the pod down |
| 168 | +in place (to unboost its CPU resources) and the update failed (see the |
| 169 | +[scenarios](https://github.com/kubernetes/autoscaler/blob/0a34bf5d3a71b486bdaa440f1af7f8d50dc8e391/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md?plain=1#L164-L169) in which the VPA |
| 170 | +Updater considers the update failed). This is to avoid an eviction |
| 171 | +loop: |
| 172 | + |
| 173 | +1. A pod is created and has its CPU resources boosted |
| 174 | +1. The pod meets the conditions to be unboosted. VPA Updater tries to downscale |
| 175 | +the pod in-place and it fails. |
| 176 | +1. VPA Updater evicts the pod. Logic flow goes back to (1). |
| 177 | + |
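
The rule above can be expressed as a small decision function. This sketch assumes that a failed regular `InPlaceOrRecreate` update falls back to eviction, as described in AEP-4016; the `Action` enum and the function name are hypothetical and only serve the illustration.

```python
from enum import Enum


class Action(Enum):
    EVICT = "evict"
    KEEP_POD = "keep-pod"


def on_failed_in_place_resize(is_unboost: bool) -> Action:
    """Decide what to do after an in-place resize is considered failed."""
    if is_unboost:
        # Evicting would recreate the pod boosted and restart the loop,
        # so the pod is left running with its boosted CPU resources.
        return Action.KEEP_POD
    # Assumed AEP-4016 behavior for regular updates: fall back to eviction.
    return Action.EVICT
```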
| 178 | +### Feature Enablement and Rollback |
| 179 | + |
| 180 | +#### How can this feature be enabled / disabled in a live cluster? |
| 181 | + |
| 182 | +* Feature gate names: `CPUStartupBoost` and `InPlaceOrRecreate` (from |
| 183 | +[AEP-4016](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md#feature-enablement-and-rollback)) |
| 184 | +* Components depending on the feature gates: |
| 185 | + * admission-controller |
| 186 | + * updater |
| 187 | + |
| 188 | +Enabling both feature gates, `CPUStartupBoost` AND `InPlaceOrRecreate`, will |
| 189 | +cause the following to happen: |
| 190 | + * admission-controller to **accept** new VPA objects being created with |
| 191 | + `StartupBoostOnly` configured. |
| 192 | + * admission-controller to **boost** CPU resources. |
| 193 | + * updater to **unboost** the CPU resources. |
| 194 | + |
| 195 | +Disabling either feature gate, `CPUStartupBoost` OR `InPlaceOrRecreate`, will |
| 196 | +cause the following to happen: |
| 197 | + * admission-controller to **reject** new VPA objects being created with |
| 198 | + `StartupBoostOnly` configured. |
| 199 | + * A descriptive error message should be returned to the user letting them |
| 200 | + know that they are using a feature that is behind a disabled feature gate. |
| 201 | + * admission-controller **to not** boost CPU resources, should it encounter a |
| 202 | + VPA configured with a `StartupBoost` config and `StartupBoostOnly` or `Auto` |
| 203 | + `ContainerScalingMode`. |
| 204 | + * updater **to not** unboost CPU resources when pods meet the scale down |
| 205 | + requirements, should it encounter a VPA configured with a `StartupBoost` |
| 206 | + config and `StartupBoostOnly` or `Auto` `ContainerScalingMode`. |
| 207 | + |
| 208 | +### Kubernetes Version Compatibility |
| 209 | + |
| 210 | +Similarly to [AEP-4016](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support#kubernetes-version-compatibility), |
| 211 | +`StartupBoost` configuration and `StartupBoostOnly` mode are built assuming that |
| 212 | +VPA will be running on Kubernetes 1.33+ with the beta version of |
| 213 | +[KEP-1287: In-Place Update of Pod Resources](https://github.com/kubernetes/enhancements/issues/1287) |
| 214 | +enabled. If this is not the case, VPA's attempt to unboost pods may fail and the |
| 215 | +pods may remain boosted for their whole lifecycle. |
| 216 | + |
| 217 | +## Test Plan |
| 218 | + |
| 219 | +In addition to comprehensive unit tests, we will add the following scenarios to |
| 220 | +our e2e tests: |
| 221 | + |
| 222 | +* CPU Startup Boost recommendation is applied to pod controlled by VPA until it |
| 223 | +becomes `Ready` and `StartupBoost.CPU.Duration` has elapsed. Then, the pod is |
| 224 | +scaled back down in-place. We'll also test the following sub-cases: |
| 225 | + * Boost is applied to all containers of a pod. |
| 226 | + * Boost is applied only to a subset of containers in a pod. |
| 227 | + * Combinations of probes + `StartupBoost.CPU.Duration`: |
| 228 | + * No probes and no `StartupBoost.CPU.Duration` specified: unboost will |
| 229 | + likely happen immediately. |
| 230 | + * No probes and a 60s `StartupBoost.CPU.Duration`: unboost will likely |
| 231 | + happen after 60s. |
| 232 | + * A readiness/startup probe and no `StartupBoost.CPU.Duration` specified: |
| 233 | + unboost will likely happen as soon as the pod becomes `Ready`. |
| 234 | + * A readiness/startup probe and a 60s `StartupBoost.CPU.Duration` |
| 235 | + specified: unboost will likely happen 60s **after** the pod becomes `Ready`. |
| 236 | + |
| 237 | +* Pod is not evicted if the in-place update fails when scaling the pod back |
| 238 | +down. |
| 239 | + |
| 240 | +## Examples |
| 241 | + |
| 242 | +Here are some examples of the VPA CR incorporating CPU boosting for different |
| 243 | +scenarios. |
| 244 | + |
| 245 | +### CPU Boost Only |
| 246 | + |
| 247 | +All containers under `example` deployment will receive "regular" VPA updates, |
| 248 | +**except for** `boosted-container-name`. `boosted-container-name` will only be |
| 249 | +CPU boosted/unboosted, because it has a `StartupBoostOnly` container policy. |
| 250 | + |
| 251 | +```yaml |
| 252 | +apiVersion: "autoscaling.k8s.io/v1" |
| 253 | +kind: VerticalPodAutoscaler |
| 254 | +metadata: |
| 255 | + name: example-vpa |
| 256 | +spec: |
| 257 | + targetRef: |
| 258 | + apiVersion: "apps/v1" |
| 259 | + kind: Deployment |
| 260 | + name: example |
| 261 | + updatePolicy: |
| 262 | + # VPA Update mode must be InPlaceOrRecreate |
| 263 | + updateMode: "InPlaceOrRecreate" |
| 264 | + resourcePolicy: |
| 265 | + containerPolicies: |
| 266 | + - containerName: "boosted-container-name" |
| 267 | + mode: "StartupBoostOnly" |
| 268 | + startupBoost: |
| 269 | + cpu: |
| 270 | + factor: 2.0 |
| 271 | +``` |
| 272 | + |
| 273 | +### CPU Boost and Vanilla VPA |
| 274 | + |
| 275 | +All containers under `example` deployment will receive "regular" VPA updates, |
| 276 | +**including** `boosted-container-name`. Additionally, `boosted-container-name` |
| 277 | +will be CPU boosted/unboosted, because it has a `StartupBoost` config in its |
| 278 | +container policy and `Auto` container policy mode. |
| 279 | + |
| 280 | +```yaml |
| 281 | +apiVersion: "autoscaling.k8s.io/v1" |
| 282 | +kind: VerticalPodAutoscaler |
| 283 | +metadata: |
| 284 | + name: example-vpa |
| 285 | +spec: |
| 286 | + targetRef: |
| 287 | + apiVersion: "apps/v1" |
| 288 | + kind: Deployment |
| 289 | + name: example |
| 290 | + updatePolicy: |
| 291 | + # VPA Update mode must be InPlaceOrRecreate |
| 292 | + updateMode: "InPlaceOrRecreate" |
| 293 | + resourcePolicy: |
| 294 | + containerPolicies: |
| 295 | + - containerName: "boosted-container-name" |
| 296 | + mode: "Auto" # Vanilla VPA mode + Startup Boost |
| 297 | + minAllowed: |
| 298 | + cpu: "250m" |
| 299 | + memory: "100Mi" |
| 300 | + maxAllowed: |
| 301 | + cpu: "500m" |
| 302 | + memory: "600Mi" |
| 303 | + # The CPU boosted resources can go beyond maxAllowed. |
| 304 | + startupBoost: |
| 305 | + cpu: |
| 306 | + value: 4 |
| 307 | +``` |
| 308 | + |
| 309 | +## Implementation History |
| 310 | + |
| 311 | +* 2025-03-20: Initial version. |
| 312 | + |