2525 - [Integration tests](#integration-tests)
2626 - [e2e tests](#e2e-tests)
2727 - [Graduation Criteria](#graduation-criteria)
28- - [Phase 1: Alpha (target 1.35)](#phase-1-alpha-target-135)
28+ - [Phase 1: Alpha (target 1.35) [DONE]](#phase-1-alpha-target-135-done)
2929 - [Phase 2: Beta (target 1.36)](#phase-2--beta-target-136)
3030 - [GA (stable)](#ga-stable)
3131 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
@@ -493,7 +493,7 @@ Following scenarios need to be covered:
493493 ### Graduation Criteria
494494
495495
496- #### Phase 1: Alpha (target 1.35)
496+ #### Phase 1: Alpha (target 1.35) [DONE]
497497* Feature is disabled by default. It is an opt-in feature which can be enabled by
498498 enabling the InPlacePodLevelResourcesVerticalScaling feature gate and by setting
499499 the new resources fields in PodSpec at Pod level.
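
As a hedged illustration of the opt-in described above (the pod-level `resources` stanza follows KEP-2837; the exact field shape may differ by release), a Pod using pod-level resources might look like:

```yaml
# Illustrative sketch only: assumes a cluster with the
# InPlacePodLevelResourcesVerticalScaling feature gate enabled.
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resize-demo
spec:
  resources:            # pod-level requests/limits, resizable in place
    requests:
      cpu: "1"
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 1Gi
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```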
@@ -828,6 +828,7 @@ Focusing mostly on:
828828 - One new PATCH PodStatus API call in response to Pod resize request.
829829 - No additional overhead unless Pod resize is invoked.
830830 - estimated throughput
831+ - Proportional to the number of resize requests issued by users or controllers (e.g., VPA). For a typical cluster this is expected to be < 1% of total Pod update traffic.
831832 - originating component(s) (e.g. Kubelet, Feature-X-controller)
832833 - Kubelet
833834 focusing mostly on:
@@ -866,7 +867,9 @@ Describe them, providing:
866867-->
867868Negligible.
868869- API type(s) :
869- - Estimated increase in size : (e.g., new annotation of size 32B)
870+ - Estimated increase in size : Each Pod object will grow by approximately
871+ 200-400 bytes due to the addition of `Resources` and `AllocatedResources`
872+ fields in `PodStatus`, plus the `Resources` stanza in `PodSpec`.
870873- Estimated amount of new objects : (e.g., new Object X for every existing Pod)
871874 - type PodStatus has 2 new fields of type v1.ResourceRequirements and v1.ResourceList
872875
@@ -924,6 +927,10 @@ details). For now, we leave it here.
924927
926929 ###### How does this feature react if the API server and/or etcd is unavailable?
926929
930+ If the API server or etcd is unavailable, existing pods will continue to run with their last
931+ known resource configurations. No new resize requests can be initiated, and the Kubelet
932+ will be unable to update the `PodStatus` to reflect any locally completed or failed
933+ resizes until connectivity is restored.
927934
928935 ###### What are other known failure modes?
929936
@@ -940,12 +947,35 @@ For each of them, fill in the following information by copying the below templat
940947 - Testing : Are there any tests for failure mode? If not, describe why.
941948-->
942949
950+ - **CRI Runtime doesn't support Pod Sandbox Resize**:
951+ - Detection : `PodStatus.Resize` will be stuck in `InProgress` and Kubelet logs will
952+ show errors calling `UpdatePodSandboxResources`.
953+ - Mitigations : Disable the feature gate or upgrade the container runtime to a
954+ compatible version (e.g., latest containerd/CRI-O).
955+ - Diagnostics : Kubelet logs (search for `UpdatePodSandboxResources` errors) and
956+ `kubectl get pod <name> -o yaml` to check `resizeStatus`.
957+ - **Cgroup update failure (OS level)**:
958+ - Detection : Kubelet will emit an event indicating failure to update cgroups.
959+ - Mitigations : Revert the resize request in the Pod spec to a known-good value.
960+ - Diagnostics : Kubelet logs and `dmesg` on the node for potential OOM or cgroup
961+ permission issues.
962+
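
For the first failure mode above, a stuck resize can be spotted in the Pod status. An illustrative (not authoritative) fragment of `kubectl get pod <name> -o yaml` output, assuming the `status.resize` field and resize conditions used by the in-place resize API, might look like:

```yaml
# Illustrative sketch: field names are based on the in-place resize API
# and may vary by Kubernetes version.
status:
  resize: InProgress   # stays here if the runtime lacks UpdatePodSandboxResources
  conditions:
  - type: PodResizeInProgress
    status: "True"
```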
943963
944964 ###### What steps should be taken if SLOs are not being met to determine the problem?
945965
966+ 1. Verify if the `InPlacePodLevelResourcesVerticalScaling` feature gate is enabled
967+ on all components (apiserver, scheduler, kubelet).
968+ 2. Check `apiserver_request_total{resource="pods", subresource="resize"}` to see
969+ if resize requests are being rejected at the API level.
970+ 3. Inspect Kubelet logs for errors related to `UpdatePodSandboxResources` or
971+ `ResourceCalculation`.
972+ 4. Monitor `node_collector_evictions_total` to ensure pod-level limits aren't
973+ causing unexpected evictions.
974+
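
The API-level check in step 2 can be expressed as a Prometheus query. This is a sketch: the metric name is standard apiserver instrumentation, but the `subresource="resize"` label value is an assumption based on the resize subresource named in this KEP.

```promql
# Rate of non-2xx responses to Pod resize subresource requests over 5m.
sum(rate(apiserver_request_total{resource="pods", subresource="resize", code!~"2.."}[5m]))
```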
946975 ## Implementation History
947976
948977- **2025-06-18:** KEP draft split from [KEP#2837](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md)
978+ - **2026-01-28:** KEP moved to beta for 1.36 release
949979
950980 ## Drawbacks
951981