Merge pull request #7863 from laoj2/cpu-boost-kep

k8s-ci-robot · web-flow · commit 15295c7b1b80 · 2025-03-23T05:42:31.000-07:00
AEP-7862: Support CPU Startup Boost in VPA
diff --git a/vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost/README.md b/vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost/README.md
@@ -0,0 +1,312 @@
+# AEP-7862: CPU Startup Boost
+
+<!-- toc -->
+- [AEP-7862: CPU Startup Boost](#aep-7862-cpu-startup-boost)
+  - [Summary](#summary)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+  - [Proposal](#proposal)
+  - [Design Details](#design-details)
+    - [Workflow](#workflow)
+    - [API Changes](#api-changes)
+      - [Priority of `StartupBoost`](#priority-of-startupboost)
+    - [Validation](#validation)
+      - [Static Validation](#static-validation)
+      - [Dynamic Validation](#dynamic-validation)
+    - [Mitigating Failed In-Place Downsizes](#mitigating-failed-in-place-downsizes)
+    - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+      - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
+    - [Kubernetes Version Compatibility](#kubernetes-version-compatibility)
+  - [Test Plan](#test-plan)
+  - [Examples](#examples)
+    - [CPU Boost Only](#cpu-boost-only)
+    - [CPU Boost and Vanilla VPA](#cpu-boost-and-vanilla-vpa)
+  - [Implementation History](#implementation-history)
+<!-- /toc -->
+
+## Summary
+
+Long application start time is a known problem for more traditional workloads
+running in containerized applications, especially Java workloads. This delay can
+negatively impact the user experience and overall application performance. One
+potential solution is to provide additional CPU resources to pods during their
+startup phase, but this can lead to waste if the extra CPU resources are not
+set back to their original values after the pods have started up.
+
+This proposal allows VPA to boost the CPU request and limit of containers during
+the pod startup and to scale the CPU resources back down when the pod is
+`Ready` or after certain time has elapsed, leveraging the
+[in-place pod resize Kubernetes feature](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources).
+
+> [!NOTE]
+> This feature depends on the new `InPlaceOrRecreate` VPA mode:
+> [AEP-4016: Support for in place updates in VPA](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md)
+
+### Goals
+
+* Allow VPA to boost the CPU request and limit of a pod's containers during the
+pod (re-)creation time.
+* Allow VPA to scale pods down [in-place](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources)
+to the existing VPA recommendation for that container, if any, or to the CPU
+resources configured in the pod spec, as soon as their [`Ready`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions)
+condition is true and `StartupBoost.CPU.Duration` has elapsed.
+
+### Non-Goals
+
+* Allow VPA to boost CPU resources of pods outside of the pod (re-)creation
+time.
+* Allow VPA to boost memory resources.
+  * This is out of scope for now because the in-place pod resize feature
+  [does not support memory limit decrease yet.](https://github.com/kubernetes/enhancements/tree/758ea034908515a934af09d03a927b24186af04c/keps/sig-node/1287-in-place-update-pod-resources#memory-limit-decreases)
+
+## Proposal
+
+* To extend [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191)
+with a new `StartupBoost` field to allow users to configure the CPU startup
+boost.
+
+* To extend [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236)
+with a new `StartupBoostOnly` mode to allow users to only enable the startup
+boost feature and not vanilla VPA altogether.
+
+* To allow CPU startup boost if a `StartupBoost` config is specified in `Auto`
+[`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236)
+container policies.
+
+## Design Details
+
+### Workflow
+
+1.  The user first configures the CPU startup boost on their VPA object
+
+1. When a pod targeted by that VPA is created, the kube-apiserver invokes the
+VPA Admission Controller
+
+1. The VPA Admission Controller modifies the pod's containers CPU request and
+limits to align with its `StartupBoost` policy, if specified, during the pod
+creation.
+
+1. The VPA Updater monitors pods targeted by the VPA object and when the pod
+condition is `Ready` and  `StartupBoost.CPU.Duration` has elapsed, it scales
+down the CPU resources to the appropriate non-boosted value:
+`existing VPA recommendation for that container` (if any) OR the
+`CPU resources configured in the pod spec`.
+    * The scale down is applied [in-place](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources).
+
+### API Changes
+
+The new `StartupBoost` parameter will be added to the [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191)
+and contain the following fields:
+  * `StartupBoost.CPU.Factor`: the factor by which to multiply the initial
+  resource request and limit of the containers' targeted by the VPA object.
+  * `StartupBoost.CPU.Value`: the target value of the CPU request or limit
+  during the startup boost phase.
+  * [Optional] `StartupBoost.CPU.Duration`: if specified, it indicates for how
+  long to keep the pod boosted **after** it goes to `Ready`.
+
+> [!IMPORTANT]
+> The boosted CPU value will be capped by
+> [`--container-recommendation-max-allowed-cpu`](https://github.com/kubernetes/autoscaler/blob/4d294562e505431d518a81e8833accc0ec99c9b8/vertical-pod-autoscaler/pkg/recommender/main.go#L122)
+> flag value, if set.
+
+> [!IMPORTANT]
+> Only one of `Factor` or `Value` may be specified per container policy.
+
+
+> [!NOTE]
+> To ensure that containers are unboosted only after their applications are
+> started and ready, it is recommended to configure a
+> [Readiness or a Startup probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
+> for the containers that will be CPU boosted. Check the [Test Plan](#test-plan)
+> section for more details on this feature's behavior for different combinations
+> of probers + `StartupBoost.CPU.Duration`.
+
+We will also add a new mode to the [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236):
+  * **NEW**: `StartupBoostOnly`: new mode that will allow users to only enable
+  the startup boost feature for a container and not vanilla VPA altogether.
+  * **NEW**: `Auto`: we will modify the existing `Auto` mode to enable both
+  vanilla VPA and CPU Startup Boost (when `StartupBoost` parameter is
+  specified).
+
+#### Priority of `StartupBoost`
+
+The new `StartupBoost` field will take precedence over the rest of the container
+resource policy configurations. Functioning independently from all other fields
+in [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191),
+**except for**:
+   * [`ContainerName`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L192-L195)
+   * [`Mode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L196-L198)
+   * [`ControlledValues`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L214-L217)
+
+This means that a container's CPU request/limit can be boosted during startup
+beyond [`MaxAllowed`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L203-L206),
+for example, or it will be able to be boosted even if CPU is explicitly
+excluded from [`ControlledResources`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L208-L212).
+
+### Validation
+
+#### Static Validation
+
+* We will check that the `startupBoost` configuration is valid when VPA objects
+are created/updated:
+   * The VPA autoscaling mode must be `InPlaceOrRecreate` (since it does not
+   make sense to use this feature with disruptive modes of VPA).
+   * The boost factor is >= 1 (via CRD validation rules)
+   * Only one of `StartupBoost.CPU.Factor` or `StartupBoost.CPU.Value` is
+   specified
+   * The [feature enablement](#feature-enablement) flags must be on.
+
+
+#### Dynamic Validation
+
+* `StartupBoost.CPU.Value` must be greater than the CPU request or limit of the
+  container during the boost phase, otherwise we risk downscaling the container.
+
+### Mitigating Failed In-Place Downsizes
+
+The VPA Updater **will not** evict a pod if it attempted to scaled the pod down
+in place (to unboost its CPU resources) and the update failed (see the
+[scenarios](https://github.com/kubernetes/autoscaler/blob/0a34bf5d3a71b486bdaa440f1af7f8d50dc8e391/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md?plain=1#L164-L169 ) where the VPA
+updater will consider that the update failed). This is to avoid an eviction
+loop:
+
+1. A pod is created and has its CPU resources boosted
+1. The pod meets the conditions to be unboosted. VPA Updater tries to downscale
+the pod in-place and it fails.
+1. VPA Updater evicts the pod. Logic flow goes back to (1).
+
+### Feature Enablement and Rollback
+
+#### How can this feature be enabled / disabled in a live cluster?
+
+* Feature gates names: `CPUStartupBoost` and `InPlaceOrRecreate` (from
+[AEP-4016](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md#feature-enablement-and-rollback))
+* Components depending on the feature gates:
+  * admission-controller
+  * updater
+
+Enabling of feature gates `CPUStartupBoost` AND `InPlaceOrRecreate` will cause
+the following to happen:
+  * admission-controller to **accept** new VPA objects being created with
+`StartupBoostOnly` configured.
+  * admission-controller to **boost** CPU resources.
+  * updater to **unboost** the CPU resources.
+
+Disabling of feature gates `CPUStartupBoost` OR `InPlaceOrRecreate` will cause
+the following to happen:
+  * admission-controller to **reject** new VPA objects being created with
+  `StartupBoostOnly` configured.
+    * A descriptive error message should be returned to the user letting them
+    know that they are using a feature gated feature.
+  * admission-controller **to not** boost CPU resources, should it encounter a
+  VPA configured with a `StartupBoost` config and `StartupBoostOnly` or `Auto`
+  `ContainerScalingMode`.
+  * updater **to not** unboost CPU resources when pods meet the scale down
+  requirements, should it encounter a VPA configured with a `StartupBoost`
+  config and `StartupBoostOnly` or `Auto` `ContainerScalingMode`.
+
+### Kubernetes Version Compatibility
+
+Similarly to [AEP-4016](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support#kubernetes-version-compatibility),
+`StartupBoost` configuration and `StartupBoostOnly` mode are built assuming that
+VPA will be running on a Kubernetes 1.33+ with the beta version of
+[KEP-1287: In-Place Update of Pod Resources](https://github.com/kubernetes/enhancements/issues/1287)
+enabled. If this is not the case, VPA's attempt to unboost pods may fail and the
+pods may remain boosted for their whole lifecycle.
+
+## Test Plan
+
+Other than comprehensive unit tests, we will also add the following scenarios to
+our e2e tests:
+
+* CPU Startup Boost recommendation is applied to pod controlled by VPA until it
+becomes `Ready` and `StartupBoost.CPU.Duration` has elapsed. Then, the pod is
+scaled back down in-place. We'll also test the following sub-cases:
+  * Boost is applied to all containers of a pod.
+  * Boost is applied only to a subset of containers in a pod.
+  * Combinations of probes + `StartupBoost.CPU.Duration`:
+    * No probes and no `StartupBoost.CPU.Duration` specified: unboost will
+    likely happen immediately.
+    * No probes and a 60s `StartupBoost.CPU.Duration`: unboost will likely
+    happen after 60s.
+    * A readiness/startup probe and no `StartupBoost.CPU.Duration` specified:
+    unboost will likely as soon as the pod becomes `Ready`.
+    *  A readiness/startup probe and a 60s `StartupBoost.CPU.Duration`
+    specified: unboost will likely happen 60s **after** the pod becomes `Ready`.
+
+* Pod is not evicted if the in-place update fails when scaling the pod back
+down.
+
+## Examples
+
+Here are some examples of the VPA CR incorporating CPU boosting for different
+scenarios.
+
+### CPU Boost Only
+
+All containers under `example` deployment will receive "regular" VPA updates,
+**except for** `boosted-container-name`. `boosted-container-name` will only be
+CPU boosted/unboosted, because it has a `StartupBoostOnly` container policy.
+
+```yaml
+apiVersion: "autoscaling.k8s.io/v1"
+kind: VerticalPodAutoscaler
+metadata:
+  name: example-vpa
+spec:
+  targetRef:
+    apiVersion: "apps/v1"
+    kind: Deployment
+    name: example
+  updatePolicy:
+    # VPA Update mode must be InPlaceOrRecreate
+    updateMode: "InPlaceOrRecreate"
+  resourcePolicy:
+    containerPolicies:
+      - containerName: "boosted-container-name"
+        mode: "StartupBoostOnly"
+        startupBoost:
+          cpu:
+            factor: 2.0
+```
+
+### CPU Boost and Vanilla VPA
+
+All containers under `example` deployment will receive "regular" VPA updates,
+**including** `boosted-container-name`. Additionally, `boosted-container-name`
+will be CPU boosted/unboosted, because it has a `StartupBoost` config in its
+container policy and `Auto` container policy mode.
+
+```yaml
+apiVersion: "autoscaling.k8s.io/v1"
+kind: VerticalPodAutoscaler
+metadata:
+  name: example-vpa
+spec:
+  targetRef:
+    apiVersion: "apps/v1"
+    kind: Deployment
+    name: example
+  updatePolicy:
+    # VPA Update mode must be InPlaceOrRecreate
+    updateMode: "InPlaceOrRecreate"
+  resourcePolicy:
+    containerPolicies:
+      - containerName: "boosted-container-name"
+        mode: "Auto" # Vanilla VPA mode + Startup Boost
+        minAllowed:
+          cpu: "250m"
+          memory: "100Mi"
+        maxAllowed:
+          cpu: "500m"
+          memory: "600Mi"
+        # The CPU boosted resources can go beyond maxAllowed.
+        startupBoost:
+          cpu:
+            value: 4
+```
+
+## Implementation History
+
+* 2025-03-20: Initial version.
+