Commit 37ac1ab

KEP-1287: Introduce Acknowledged resources concept to replace 'dirty-bit'
1 parent 3406b97 commit 37ac1ab

1 file changed

+66
-12
lines changed
  • keps/sig-node/1287-in-place-update-pod-resources

keps/sig-node/1287-in-place-update-pod-resources/README.md

Lines changed: 66 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -18,6 +18,7 @@
1818
- [CRI Changes](#cri-changes)
1919
- [Risks and Mitigations](#risks-and-mitigations)
2020
- [Design Details](#design-details)
21+
- [Resource States](#resource-states)
2122
- [Kubelet and API Server Interaction](#kubelet-and-api-server-interaction)
2223
- [Kubelet Restart Tolerance](#kubelet-restart-tolerance)
2324
- [Scheduler and API Server Interaction](#scheduler-and-api-server-interaction)
@@ -28,7 +29,7 @@
2829
- [Notes](#notes)
2930
- [Lifecycle Nuances](#lifecycle-nuances)
3031
- [Atomic Resizes](#atomic-resizes)
31-
- [Edge-triggered Resizes](#edge-triggered-resizes)
32+
- [Actuating Resizes](#actuating-resizes)
3233
- [Memory Limit Decreases](#memory-limit-decreases)
3334
- [Sidecars](#sidecars)
3435
- [QOS Class](#qos-class)
@@ -404,6 +405,38 @@ WindowsPodSandboxConfig.
404405

405406
## Design Details
406407

408+
### Resource States
409+
410+
In-place pod resizing adds several new resource states. These are detailed in other sections of
411+
this KEP, but summarized here to help understand how they relate to each other.
412+
413+
The Kubelet now tracks 4 sets of resources for each pod/container:
414+
415+
1. Desired resources
416+
- What the user (or controller) asked for
417+
- Recorded in the API as the spec resources (`.spec.containers[i].resources`)
418+
2. Allocated resources
419+
- The resources that the Kubelet admitted, and intends to actuate
420+
- Reported in the API through the `.status.containerStatuses[i].allocatedResources` field
421+
(allocated requests only)
422+
- Persisted locally on the node (requests + limits) in a checkpoint file
423+
3. Acknowledged resources
424+
- The resource configuration that the Kubelet passed to the runtime to actuate
425+
- Not reported in the API
426+
- Persisted locally on the node in a checkpoint file
427+
- See [Actuating Resizes](#actuating-resizes) for more details
428+
4. Actual resources
429+
- The actual resource configuration the containers are running with, reported by the runtime,
430+
typically read directly from the cgroup configuration
431+
- Reported in the API via the `.status.containerStatuses[i].resources` field
432+
433+
Changes are always propagated through these 4 resource states in order:
434+
435+
```
436+
Desired --> Allocated --> Acknowledged --> Actual
437+
```
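
As a rough illustration of the ordering above, the four states can be modeled as one record per container, with a helper that reports the first state that has not yet caught up with its predecessor. This is a minimal sketch, not Kubelet code: the `Resources` and `ResourceStates` types and the `LaggingStage` method are hypothetical names, and only a CPU request in millicores is tracked.

```go
package main

import "fmt"

// Resources is a simplified, hypothetical stand-in for a container's
// resource configuration. Only a CPU request in millicores is tracked.
type Resources struct {
	CPUMilli int64
}

// ResourceStates groups the four resource states described in the KEP.
// The type itself is an assumption made for illustration.
type ResourceStates struct {
	Desired      Resources // spec resources requested by the user or controller
	Allocated    Resources // admitted by the Kubelet, checkpointed
	Acknowledged Resources // passed to the runtime with a successful response
	Actual       Resources // reported back by the runtime
}

// LaggingStage returns the name of the first state that differs from the
// state before it, following Desired --> Allocated --> Acknowledged --> Actual.
// It returns "" when all four states agree. Note that the KEP allows Actual
// to be only an approximation of the requested configuration, so a mismatch
// at that stage is not necessarily an error.
func (s ResourceStates) LaggingStage() string {
	switch {
	case s.Allocated != s.Desired:
		return "Allocated"
	case s.Acknowledged != s.Allocated:
		return "Acknowledged"
	case s.Actual != s.Acknowledged:
		return "Actual"
	}
	return ""
}

func main() {
	// A resize from 1 CPU to 1.5 CPU that has been admitted but not yet
	// sent to the runtime.
	s := ResourceStates{
		Desired:      Resources{CPUMilli: 1500},
		Allocated:    Resources{CPUMilli: 1500},
		Acknowledged: Resources{CPUMilli: 1000},
		Actual:       Resources{CPUMilli: 1000},
	}
	fmt.Println("first lagging state:", s.LaggingStage())
}
```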
438+
439+
407440
### Kubelet and API Server Interaction
408441

409442
When a new Pod is created, Scheduler is responsible for selecting a suitable
@@ -483,6 +516,7 @@ This is intentionally hitting various edge-cases for demonstration.
483516
- `spec.containers[0].resources.requests[cpu]` = 1
484517
- `status.resize` = unset
485518
- `status.containerStatuses[0].allocatedResources[cpu]` = 1
519+
- `acknowledged[cpu]` = 1
486520
- `status.containerStatuses[0].resources.requests[cpu]` = 1
487521
- actual CPU shares = 1024
488522

@@ -492,13 +526,25 @@ This is intentionally hitting various edge-cases for demonstration.
492526
- `spec.containers[0].resources.requests[cpu]` = 1.5
493527
- `status.resize` = unset
494528
- `status.containerStatuses[0].allocatedResources[cpu]` = 1
529+
- `acknowledged[cpu]` = 1
530+
- `status.containerStatuses[0].resources.requests[cpu]` = 1
531+
- actual CPU shares = 1024
532+
533+
1. Kubelet Restarts!
534+
- The allocated & acknowledged resources are read back from the checkpoint
535+
- Pods are resynced from the API server, but admitted based on the allocated resources
536+
- `spec.containers[0].resources.requests[cpu]` = 1.5
537+
- `status.resize` = unset
538+
- `status.containerStatuses[0].allocatedResources[cpu]` = 1
539+
- `acknowledged[cpu]` = 1
495540
- `status.containerStatuses[0].resources.requests[cpu]` = 1
496541
- actual CPU shares = 1024
497542

498543
1. Kubelet syncs the pod, sees resize #1 and admits it
499544
- `spec.containers[0].resources.requests[cpu]` = 1.5
500545
- `status.resize` = `"InProgress"`
501546
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
547+
- `acknowledged[cpu]` = 1
502548
- `status.containerStatuses[0].resources.requests[cpu]` = 1
503549
- actual CPU shares = 1024
504550

@@ -514,6 +560,7 @@ This is intentionally hitting various edge-cases for demonstration.
514560
- `spec.containers[0].resources.requests[cpu]` = 2
515561
- `status.resize` = `"InProgress"`
516562
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
563+
- `acknowledged[cpu]` = 1.5
517564
- `status.containerStatuses[0].resources.requests[cpu]` = 1
518565
- actual CPU shares = 1536
519566

@@ -522,6 +569,7 @@ This is intentionally hitting various edge-cases for demonstration.
522569
- `spec.containers[0].resources.requests[cpu]` = 2
523570
- `status.resize[cpu]` = `"Deferred"`
524571
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
572+
- `acknowledged[cpu]` = 1.5
525573
- `status.containerStatuses[0].resources.requests[cpu]` = 1.5
526574
- actual CPU shares = 1536
527575

@@ -530,27 +578,31 @@ This is intentionally hitting various edge-cases for demonstration.
530578
- `spec.containers[0].resources.requests[cpu]` = 1.6
531579
- `status.resize[cpu]` = `"Deferred"`
532580
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
581+
- `acknowledged[cpu]` = 1.5
533582
- `status.containerStatuses[0].resources.requests[cpu]` = 1.5
534583
- actual CPU shares = 1536
535584

536585
1. Kubelet syncs the pod, and sees resize #3 and admits it
537586
- `spec.containers[0].resources.requests[cpu]` = 1.6
538587
- `status.resize[cpu]` = `"InProgress"`
539588
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
589+
- `acknowledged[cpu]` = 1.5
540590
- `status.containerStatuses[0].resources.requests[cpu]` = 1.5
541591
- actual CPU shares = 1536
542592

543593
1. Container runtime applied cpu=1.6
544594
- `spec.containers[0].resources.requests[cpu]` = 1.6
545595
- `status.resize[cpu]` = `"InProgress"`
546596
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
597+
- `acknowledged[cpu]` = 1.6
547598
- `status.containerStatuses[0].resources.requests[cpu]` = 1.5
548599
- actual CPU shares = 1638
549600

550601
1. Kubelet syncs the pod
551602
- `spec.containers[0].resources.requests[cpu]` = 1.6
552603
- `status.resize[cpu]` = unset
553604
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
605+
- `acknowledged[cpu]` = 1.6
554606
- `status.containerStatuses[0].resources.requests[cpu]` = 1.6
555607
- actual CPU shares = 1638
556608

@@ -559,6 +611,7 @@ This is intentionally hitting various edge-cases for demonstration.
559611
- `spec.containers[0].resources.requests[cpu]` = 100
560612
- `status.resize[cpu]` = unset
561613
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
614+
- `acknowledged[cpu]` = 1.6
562615
- `status.containerStatuses[0].resources.requests[cpu]` = 1.6
563616
- actual CPU shares = 1638
564617

@@ -567,6 +620,7 @@ This is intentionally hitting various edge-cases for demonstration.
567620
- `spec.containers[0].resources.requests[cpu]` = 100
568621
- `status.resize[cpu]` = `"Infeasible"`
569622
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
623+
- `acknowledged[cpu]` = 1.6
570624
- `status.containerStatuses[0].resources.requests[cpu]` = 1.6
571625
- actual CPU shares = 1638
572626

@@ -695,7 +749,7 @@ intended.
695749

696750
The atomic resize requirement should be reevaluated prior to GA, and in the context of pod-level resources.
697751

698-
### Edge-triggered Resizes
752+
### Actuating Resizes
699753

700754
The resources specified by the Kubelet are not guaranteed to be the actual resources configured for
701755
a pod or container. Examples include:
@@ -706,16 +760,16 @@ a pod or container. Examples include:
706760
Therefore the Kubelet cannot reliably compare desired & actual resources to know whether to trigger
707761
a resize (a level-triggered approach).
708762

709-
To accommodate this, the Kubelet stores a bit along with every resource in the allocated resource
710-
checkpoint which tracks whether the resource has been successfully resized. For container resources,
711-
this means the `UpdateContainerResources` request succeeded. This status bit is persisted in the
712-
allocated resources checkpoint to avoid extra resize requests across Kubelet restarts. There is the
713-
possibility that a poorly timed restart could lead to a resize request being repeated, so
714-
`UpdateContainerResources` should be idempotent.
763+
To accommodate this, the Kubelet stores the set of "acknowledged" resources per container.
764+
Acknowledged resources represent the resource configuration that was passed to the runtime (either
765+
via a `CreateContainer` or `UpdateContainerResources` call) and received a successful response. The
766+
acknowledged resources are checkpointed alongside the allocated resources to persist across
767+
restarts. There is the possibility that a poorly timed restart could lead to a resize request being
768+
repeated, so `UpdateContainerResources` must be idempotent.
715769

716-
When a resize request succeeds, the pod will be marked for resync to read the latest resources. If
770+
When a resize CRI request succeeds, the pod will be marked for resync to read the latest resources. If
717771
the actual configured resources do not match the desired resources, this will be reflected in the
718-
pod status resources.
772+
pod status resources, but not otherwise acted upon.
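
The actuation rule described above can be sketched as follows: a resize request is sent only when the allocated resources differ from the acknowledged resources, and the acknowledged value is updated only after the runtime returns success. This is a simplified sketch under stated assumptions, not the Kubelet implementation: `Runtime`, `Checkpoint`, `Actuate`, and `fakeRuntime` are hypothetical names, the CRI call is reduced to a single method, and persisting the checkpoint to disk is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// Resources is a simplified, hypothetical resource configuration.
type Resources struct {
	CPUMilli int64
}

// Runtime is a hypothetical stand-in for the CRI client.
type Runtime interface {
	UpdateContainerResources(containerID string, r Resources) error
}

// Checkpoint holds the allocated and acknowledged resources per container.
// In the KEP both are persisted on the node; here they live in memory only.
type Checkpoint struct {
	Allocated    map[string]Resources
	Acknowledged map[string]Resources
}

// Actuate sends a resize to the runtime only if the allocated resources have
// not already been acknowledged. It reports whether a CRI call was made.
// The acknowledged entry is updated only on success, so a failed or
// interrupted request is retried on the next sync. That retry is why the
// runtime call must be idempotent.
func Actuate(rt Runtime, cp *Checkpoint, id string) (bool, error) {
	want := cp.Allocated[id]
	if ack, ok := cp.Acknowledged[id]; ok && ack == want {
		return false, nil // already passed to the runtime; nothing to do
	}
	if err := rt.UpdateContainerResources(id, want); err != nil {
		return true, err
	}
	cp.Acknowledged[id] = want
	return true, nil
}

// fakeRuntime counts calls and can be told to fail, for demonstration.
type fakeRuntime struct {
	calls int
	fail  bool
}

func (f *fakeRuntime) UpdateContainerResources(string, Resources) error {
	f.calls++
	if f.fail {
		return errors.New("runtime unavailable")
	}
	return nil
}

func main() {
	rt := &fakeRuntime{}
	cp := &Checkpoint{
		Allocated:    map[string]Resources{"c1": {CPUMilli: 1500}},
		Acknowledged: map[string]Resources{"c1": {CPUMilli: 1000}},
	}
	called, err := Actuate(rt, cp, "c1")
	fmt.Println("first sync: called =", called, "err =", err)
	called, err = Actuate(rt, cp, "c1")
	fmt.Println("second sync: called =", called, "err =", err)
}
```

Comparing allocated against acknowledged, rather than against the actual resources reported by the runtime, avoids repeated resize requests when the runtime applies only an approximation of the requested configuration.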
719773

720774
### Memory Limit Decreases
721775

@@ -1293,7 +1347,7 @@ _This section must be completed when targeting beta graduation to a release._
12931347
container status.
12941348
- The `ResizeStatus` in the pod status should converge to the empty value, indicating the resize has completed.
12951349
- The `Resources` in the container status should converge to the resized resources, or an
1296-
approximation of it (see [Edge-triggered Resizes](#edge-triggered-resizes) for more details on
1350+
approximation of it (see [Actuating Resizes](#actuating-resizes) for more details on
12971351
when these resources can diverge).
12981352

12991353
* **What are the SLIs (Service Level Indicators) an operator can use to determine
@@ -1452,7 +1506,7 @@ _This section must be completed when targeting beta graduation to a release._
14521506
- Rename ResizeRestartPolicy `NotRequired` to `PreferNoRestart`,
14531507
and update CRI `UpdateContainerResources` contract
14541508
- Add back `AllocatedResources` field to resolve a scheduler corner case
1455-
- Switch to edge-triggered resize actuation
1509+
- Introduce Acknowledged resources for actuation
14561510

14571511
## Drawbacks
14581512
