Commit 61128de

KEP-2837: PodLevelResources changes for 1.33
1 parent 1cee5af commit 61128de

2 files changed (+224, -88 lines)

keps/sig-node/2837-pod-level-resource-spec/README.md

Lines changed: 222 additions & 86 deletions
@@ -17,6 +17,7 @@
 - [Components/Features changes](#componentsfeatures-changes)
 - [Cgroup Structure Remains unchanged](#cgroup-structure-remains-unchanged)
 - [PodSpec API changes](#podspec-api-changes)
+- [PodStatus API changes](#podstatus-api-changes)
 - [PodSpec Validation Rules](#podspec-validation-rules)
 - [Proposed Validation & Defaulting Rules](#proposed-validation--defaulting-rules)
 - [Comprehensive Tabular View](#comprehensive-tabular-view)
@@ -32,17 +33,20 @@
 - [Admission Controller](#admission-controller)
 - [Eviction Manager](#eviction-manager)
 - [Pod Overhead](#pod-overhead)
+- [Hugepages](#hugepages)
+- [Memory Manager](#memory-manager)
+- [In-Place Pod Resize](#in-place-pod-resize)
+- [API changes](#api-changes)
+- [Resize Restart Policy](#resize-restart-policy)
+- [Implementation Details](#implementation-details)
+- [[Scoped for Beta] CPU Manager](#scoped-for-beta-cpu-manager)
+- [[Scoped for Beta] Topology Manager](#scoped-for-beta-topology-manager)
 - [[Scoped for Beta] User Experience Survey](#scoped-for-beta-user-experience-survey)
 - [[Scoped for Beta] Surfacing Pod Resource Requirements](#scoped-for-beta-surfacing-pod-resource-requirements)
 - [The Challenge of Determining Effective Pod Resource Requirements](#the-challenge-of-determining-effective-pod-resource-requirements)
 - [Goals of surfacing Pod Resource Requirements](#goals-of-surfacing-pod-resource-requirements)
-- [Implementation Details](#implementation-details)
+- [Implementation Details](#implementation-details-1)
 - [Notes for implementation](#notes-for-implementation)
-- [[Scoped for Beta] HugeTLB cgroup](#scoped-for-beta-hugetlb-cgroup)
-- [[Scoped for Beta] Topology Manager](#scoped-for-beta-topology-manager)
-- [[Scoped for Beta] Memory Manager](#scoped-for-beta-memory-manager)
-- [[Scoped for Beta] CPU Manager](#scoped-for-beta-cpu-manager)
-- [[Scoped for Beta] In-Place Pod Resize](#scoped-for-beta-in-place-pod-resize)
 - [[Scoped for Beta] VPA](#scoped-for-beta-vpa)
 - [[Scoped for Beta] Cluster Autoscaler](#scoped-for-beta-cluster-autoscaler)
 - [[Scoped for Beta] Support for Windows](#scoped-for-beta-support-for-windows)
@@ -383,7 +387,7 @@ consumption of the pod.
 
 #### PodSpec API changes
 
-New field in `PodSpec`
+New field in `PodSpec`:
 
 ```
 type PodSpec struct {
@@ -396,6 +400,40 @@ type PodSpec struct {
 }
 ```
 
+#### PodStatus API changes
+
+Extend `PodStatus` to include pod-level analogs of the container status resource
+fields. Pod-level resource information in `PodStatus` is essential for pod-level
+[In-Place Pod Update](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1287-in-place-update-pod-resources/README.md#api-changes),
+as it provides a way to track, report, and use the actual resource allocation for the
+pod, both before and after a resize operation.
+
+```
+type PodStatus struct {
+  ...
+  // Resources represents the compute resource requests and limits that have been
+  // applied at the pod level. If pod-level resources are not explicitly specified,
+  // these will be the aggregate resources computed from containers. If limits are
+  // not defined for all containers (and pod-level limits are also not set), those
+  // containers remain unrestricted, and no aggregate pod-level limits will be applied.
+  // Pod-level limit aggregation is performed, and is meaningful, only when all
+  // containers have defined limits.
+  // +featureGate=InPlacePodVerticalScaling
+  // +featureGate=PodLevelResources
+  // +optional
+  Resources *ResourceRequirements
+
+  // AllocatedResources is the total requests allocated for this pod by the node.
+  // Kubelet sets this to the accepted requests when a pod (or resize) is admitted.
+  // If pod-level requests are not set, this will be the total requests aggregated
+  // across containers in the pod.
+  // +featureGate=InPlacePodVerticalScaling
+  // +featureGate=PodLevelResources
+  // +optional
+  AllocatedResources ResourceList
+}
+```
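
Purely as an illustrative sketch (not part of this commit), these fields might serialize in a pod's status roughly as follows, assuming the Go fields above use the conventional JSON names `resources` and `allocatedResources`; the values are arbitrary:

```yaml
status:
  # Requests accepted by the kubelet when the pod (or its latest resize) was admitted.
  allocatedResources:
    cpu: 100m
    memory: 100Mi
  # Pod-level requests and limits currently applied at the pod level.
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 200m
      memory: 200Mi
```
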
 #### PodSpec Validation Rules
 
 ##### Proposed Validation & Defaulting Rules
@@ -1172,6 +1210,183 @@ back to aggregating container requests.
 size of the pod's cgroup. This means the pod cgroup's resource limits will be
 set to accommodate both pod-level requests and pod overhead.
 
+#### Hugepages
+
+With the proposed changes, support for hugepages (resources with the `hugepages-*`
+prefix) will be extended to the pod-level resources specification, alongside CPU and
+memory. The hugetlb cgroup for the pod will then directly reflect the pod-level
+hugepage limits, if specified, rather than using an aggregated value from container
+limits. When scheduling, the scheduler will consider hugepage requests at the pod
+level to find nodes with enough available resources.
+
+Containers will still need to mount an emptyDir volume to access the hugepage
+filesystem (typically /dev/hugepages). This is the standard way for containers to
+interact with hugepages, and this will not change.
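
As a minimal sketch of how this could look, the pod below combines a pod-level hugepage specification with the unchanged emptyDir mount; it assumes pod-level hugepage requests and limits use the same syntax as CPU and memory, and the pod name and sizes are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-hugepages      # illustrative name
spec:
  resources:                     # pod-level resources proposed by this KEP
    requests:
      hugepages-2Mi: 128Mi       # considered by the scheduler at the pod level
      memory: 256Mi
    limits:
      hugepages-2Mi: 128Mi       # reflected directly in the pod's hugetlb cgroup
      memory: 256Mi
  containers:
  - name: app
    image: registry.k8s.io/pause:latest
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages  # containers still mount the hugepage filesystem as before
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```
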
+
+#### Memory Manager
+
+With the introduction of pod-level resource specifications, the Kubernetes Memory
+Manager will evolve to track and enforce resource limits at both the pod and
+container levels. It will need to aggregate memory usage across all containers
+within a pod to calculate the pod's total memory consumption. The Memory Manager
+will then enforce the pod-level limit as the hard cap for the entire pod's memory
+usage, preventing it from exceeding the allocated amount. While still
+maintaining container-level limit enforcement, the Memory Manager will need to
+coordinate with the Kubelet and eviction manager to make decisions about pod
+eviction or individual container termination when the pod-level limit is
+breached.
+
+#### In-Place Pod Resize
+
+##### API changes
+
+IPPR for pod-level resources requires extending `PodStatus` to include pod-level
+resource fields, as detailed in the [PodStatus API changes](#podstatus-api-changes)
+section.
+
+##### Resize Restart Policy
+
+A pod-level resize policy is not supported in the alpha stage of the pod-level
+resources feature. While a pod-level resize policy might be beneficial for VM-based
+runtimes like Kata Containers (potentially allowing the hypervisor to restart the
+entire VM on resize), this is a topic for future consideration. We plan to engage
+with the Kata community to discuss this further and will re-evaluate the need for a
+pod-level policy in subsequent development stages.
+
+The absence of a pod-level resize policy means that container restarts are
+exclusively managed by each container's individual `resizePolicy` configuration. The
+example below of a pod with pod-level resources demonstrates several key aspects of
+this behavior, showing how containers without explicit limits (which inherit
+pod-level limits) interact with resize policy, and how containers with specified
+resources remain unaffected by pod-level resizes.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: pod-level-resources
+spec:
+  resources:
+    requests:
+      cpu: 100m
+      memory: 100Mi
+    limits:
+      cpu: 200m
+      memory: 200Mi
+  containers:
+  - name: c1
+    image: registry.k8s.io/pause:latest
+    resizePolicy:
+    - resourceName: "cpu"
+      restartPolicy: "NotRequired"
+    - resourceName: "memory"
+      restartPolicy: "RestartContainer"
+  - name: c2
+    image: registry.k8s.io/pause:latest
+    resources:
+      requests:
+        cpu: 50m
+        memory: 50Mi
+      limits:
+        cpu: 100m
+        memory: 100Mi
+    resizePolicy:
+    - resourceName: "cpu"
+      restartPolicy: "NotRequired"
+    - resourceName: "memory"
+      restartPolicy: "RestartContainer"
+```
+
+In this example:
+* CPU resizes: Neither container requires a restart for CPU resizes, so CPU resizes
+  at either the container or the pod level will not trigger any restarts.
+* Container c1 (inherited memory limit): c1 does not define any container-level
+  resources, so the container's effective memory limit is determined by the
+  pod-level limit. When the pod's limit is resized, c1's effective memory limit
+  changes. Because c1's memory resizePolicy is RestartContainer, a resize of the
+  pod-level memory limit will trigger a restart of container c1.
+* Container c2 (specified memory limit): c2 does define container-level resources,
+  so c2's effective memory limit is the container-level limit. Therefore, a resize
+  of the pod-level memory limit doesn't change the effective container limit, and
+  c2 is not restarted when the pod-level memory limit is resized.
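
To make the restart behavior concrete, the sketch below shows a hypothetical resize of the example pod in which only the pod-level memory limit changes (the new value is arbitrary and not taken from this KEP):

```yaml
# Hypothetical updated pod-level resources after a resize; only the memory limit changes.
spec:
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 200m       # unchanged, so CPU resize policies are not exercised
      memory: 300Mi   # was 200Mi: c1 inherits this limit and its memory resize policy
                      # requires a restart, so c1 is restarted; c2 keeps its own
                      # 100Mi limit and is not restarted
```
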
+
+##### Implementation Details
+
+###### Allocating Pod-level Resources
+Allocation of pod-level resources will work the same way as for container-level
+resources. The allocated-resources checkpoint will be extended to include pod-level
+resources, and the pod object will be updated with the allocated resources in the
+pod sync loop.
+
+###### Actuating Pod-level Resource Resize
+The mechanism for actuating a pod-level resize remains largely unchanged from the
+existing container-level resize process. When pod-level resource configurations are
+applied, the system handles the resize in a similar manner as it does for
+container-level resources. This includes extending the existing logic to incorporate
+directly configured pod-level resource settings.
+
+The same ordering rules for pod and container resource resizing will be applied for
+each resource as needed:
+1. Increase the pod-level cgroup (if needed)
+2. Decrease container resources
+3. Decrease the pod-level cgroup (if needed)
+4. Increase container resources
+
+###### Tracking Actual Pod-level Resources
+To accurately track actual pod-level resources during in-place pod resizing, several
+changes are required that are analogous to the changes made for container-level
+in-place resizing:
+
+1. Configuration reading: Pod-level resource config is currently read as part of the
+   resize flow, but will also need to be read during pod creation. Critically, the
+   configuration must be read again after the resize operation to capture the
+   updated resource values. Currently, the configuration is only read before a
+   resize.
+
+2. Pod Status Update: Because the pod status is updated before the resize takes
+   effect, the status will not immediately reflect the new resource values. If a
+   container within the pod is also being resized, the container resize operation
+   will trigger a pod synchronization (pod-sync), which will refresh the pod's
+   status. However, if only pod-level resources are being resized, a pod-sync must
+   be explicitly triggered to update the pod status with the new resource
+   allocation.
+
+3. [Scoped for Beta] Caching: Actual pod resource data may be cached to minimize API
+   server load. This cache, if implemented, must be invalidated after each successful
+   pod resize to ensure that subsequent reads retrieve the latest information. The
+   need for and implementation of this caching mechanism will be evaluated in the
+   beta phase. Performance benchmarking will be conducted to determine whether
+   caching is required and, if so, which caching strategy is most appropriate.
+
+**Note on future enhancements for ephemeral containers with pod-level resources and
+IPPR:** Previously, assigning resources to ephemeral containers wasn't allowed
+because pod resource allocations were immutable. With the introduction of in-place
+pod resizing, users could gain more flexibility:
+
+* Adjust pod-level resources to accommodate the needs of ephemeral containers. This
+  allows for a more dynamic allocation of resources within the pod.
+* Specify resource requests and limits directly for ephemeral containers. Kubernetes
+  will then automatically resize the pod to ensure sufficient resources are available
+  for both regular and ephemeral containers.
+
+Setting `resources` for ephemeral containers is currently disallowed because pod
+resource allocations were immutable before the In-Place Pod Resize feature. With
+in-place resize of pod-level resources, users should be able either to modify the
+pod-level resources to accommodate ephemeral containers or to supply container-level
+resources for ephemeral containers, in which case Kubernetes will resize the pod to
+accommodate them.
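
Purely as a hypothetical illustration of that second option (this is not valid today and is outside the alpha scope of this KEP), an ephemeral container carrying its own resources might eventually look like:

```yaml
# Hypothetical future shape; the API currently rejects resources on ephemeral containers.
ephemeralContainers:
- name: debugger
  image: busybox:1.36
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
```
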
+
+#### [Scoped for Beta] CPU Manager
+
+With the introduction of pod-level resource specifications, the CPU Manager in
+Kubernetes will adapt to manage CPU requests and limits at the pod level rather
+than solely at the container level. This change means that the CPU Manager will
+allocate and enforce CPU resources based on the total requirements of the entire
+pod, allowing for more flexible and efficient CPU utilization across all
+containers within a pod. The CPU Manager will need to ensure that the aggregate
+CPU usage of all containers in a pod does not exceed the pod-level limits.
+
+The CPU Manager policies are container-level configurations that control the
+fine-grained allocation of CPU resources to containers. While CPU Manager
+policies will operate within the constraints of pod-level resource limits, they
+do not directly apply at the pod level.
+
+#### [Scoped for Beta] Topology Manager
+
+Note: This section includes only a high-level overview; design details will be added
+in the Beta stage.
+
+* The pod-level scope for topology alignment will consider pod-level requests and
+  limits instead of container-level aggregates.
+* The hint providers will consider pod-level requests and limits instead of
+  container-level aggregates.
+
 #### [Scoped for Beta] User Experience Survey
 
 Before promoting the feature to Beta, we plan to conduct a UX survey to
@@ -1291,85 +1506,6 @@ KEPs. The first change doesn’t present any user visible change, and if
 implemented, will in a small way reduce the effort for both of those KEPs by
 providing a single place to update the pod resource calculation.
 
-#### [Scoped for Beta] HugeTLB cgroup
-
-Note: This section includes only high level overview; Design details will be added in Beta stage.
-
-To support pod-level resource specifications for hugepages, Kubernetes will need to adjust how it handles hugetlb cgroups. Unlike memory, where an unset limit
-means unlimited, an unset hugetlb limit is the same as setting it to 0.
-
-With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
-pod will then directly reflect the pod-level hugepage limits, rather than using an aggregated value from container limits. When scheduling, the scheduler will
-consider hugepage requests at the pod level to find nodes with enough available resources.
-
-
-#### [Scoped for Beta] Topology Manager
-
-Note: This section includes only high level overview; Design details will be added in Beta stage.
-
-
-* (Tentative) Only pod level scope for topology alignment will be supported if pod level requests and limits are specified without container-level requests and limits.
-* The pod level scope for topology aligntment will consider pod level requests and limits instead of container level aggregates.
-* The hint providers will consider pod level requests and limits instead of container level aggregates.
-
-
-#### [Scoped for Beta] Memory Manager
-
-Note: This section includes only high level overview; Design details will be
-added in Beta stage.
-
-With the introduction of pod-level resource specifications, the Kubernetes Memory
-Manager will evolve to track and enforce resource limits at both the pod and
-container levels. It will need to aggregate memory usage across all containers
-within a pod to calculate the pod's total memory consumption. The Memory Manager
-will then enforce the pod-level limit as the hard cap for the entire pod's memory
-usage, preventing it from exceeding the allocated amount. While still
-maintaining container-level limit enforcement, the Memory Manager will need to
-coordinate with the Kubelet and eviction manager to make decisions about pod
-eviction or individual container termination when the pod-level limit is
-breached.
-
-
-#### [Scoped for Beta] CPU Manager
-
-Note: This section includes only high level overview; Design details will be
-added in Beta stage.
-
-With the introduction of pod-level resource specifications, the CPU manager in
-Kubernetes will adapt to manage CPU requests and limits at the pod level rather
-than solely at the container level. This change means that the CPU manager will
-allocate and enforce CPU resources based on the total requirements of the entire
-pod, allowing for more flexible and efficient CPU utilization across all
-containers within a pod. The CPU manager will need to ensure that the aggregate
-CPU usage of all containers in a pod does not exceed the pod-level limits.
-
-#### [Scoped for Beta] In-Place Pod Resize
-
-In-Place Pod resizing of resources is not supported in alpha stage of Pod-level
-resources feature. **Users should avoid using in-place pod resizing if they are
-utilizing pod-level resources.**
-
-In version 1.33, the In-Place Pod resize functionality will be controlled by a
-separate feature gate and introduced as an independent alpha feature. This is
-necessary as it involves new fields in the PodStatus at the pod level.
-
-Note for design & implementation: Previously, assigning resources to ephemeral
-containers wasn't allowed because pod resource allocations were immutable. With
-the introduction of in-place pod resizing, users will gain more flexibility:
-
-* Adjust pod-level resources to accommodate the needs of ephemeral containers. This
-allows for a more dynamic allocation of resources within the pod.
-* Specify resource requests and limits directly for ephemeral containers. Kubernetes will
-then automatically resize the pod to ensure sufficient resources are available
-for both regular and ephemeral containers.
-
-Currently, setting `resources` for ephemeral containers is disallowed as pod
-resource allocations were immutable before In-Place Pod Resizing feature. With
-in-place pod resize for pod-level resource allocation, users should be able to
-either modify the pod-level resources to accommodate ephemeral containers or
-supply resources at container-level for ephemeral containers and kubernetes will
-resize the pod to accommodate the ephemeral containers.
-
 #### [Scoped for Beta] VPA
 
 TBD. Do not review for the alpha stage.

keps/sig-node/2837-pod-level-resource-spec/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -26,11 +26,11 @@ stage: alpha
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.32"
+latest-milestone: "v1.33"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
-  alpha: "v1.32"
+  alpha: "v1.33"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
