  - [CRI Changes](#cri-changes)
  - [Risks and Mitigations](#risks-and-mitigations)
  - [Design Details](#design-details)
+ - [Resource States](#resource-states)
  - [Kubelet and API Server Interaction](#kubelet-and-api-server-interaction)
  - [Kubelet Restart Tolerance](#kubelet-restart-tolerance)
  - [Scheduler and API Server Interaction](#scheduler-and-api-server-interaction)
  - [Notes](#notes)
  - [Lifecycle Nuances](#lifecycle-nuances)
  - [Atomic Resizes](#atomic-resizes)
- - [Edge-triggered Resizes](#edge-triggered-resizes)
+ - [Actuating Resizes](#actuating-resizes)
  - [Memory Limit Decreases](#memory-limit-decreases)
  - [Sidecars](#sidecars)
  - [QOS Class](#qos-class)
@@ -404,6 +405,38 @@ WindowsPodSandboxConfig.
  ## Design Details

+ ### Resource States
+
+ In-place pod resizing introduces several new resource states. These are detailed in other sections of
+ this KEP, but are summarized here to show how they relate to each other.
+
+ The Kubelet now tracks 4 sets of resources for each pod/container:
+
+ 1. Desired resources
+ - What the user (or controller) asked for
+ - Recorded in the API as the spec resources (`.spec.containers[i].resources`)
+ 2. Allocated resources
+ - The resources that the Kubelet admitted, and intends to actuate
+ - Reported in the API through the `.status.containerStatuses[i].allocatedResources` field
+ (allocated requests only)
+ - Persisted locally on the node (requests + limits) in a checkpoint file
+ 3. Acknowledged resources
+ - The resource configuration that the Kubelet passed to the runtime to actuate
+ - Not reported in the API
+ - Persisted locally on the node in a checkpoint file
+ - See [Actuating Resizes](#actuating-resizes) for more details
+ 4. Actual resources
+ - The actual resource configuration the containers are running with, reported by the runtime,
+ typically read directly from the cgroup configuration
+ - Reported in the API via the `.status.containerStatuses[i].resources` field
+
+ Changes are always propagated through these 4 resource states in order:
+
+ ```
+ Desired --> Allocated --> Acknowledged --> Actual
+ ```
+
+
  ### Kubelet and API Server Interaction

  When a new Pod is created, Scheduler is responsible for selecting a suitable
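As a reading aid for the Resource States hunk above, the four states and their propagation order can be modeled in a few lines. This is only an illustrative sketch: `ContainerCPU`, `STAGES`, and `first_lagging_stage` are hypothetical names, not Kubelet types, and a single CPU request value stands in for the full resource configuration.

```python
from dataclasses import dataclass
from typing import Optional

# Order in which a change flows, per the Resource States section.
STAGES = ("desired", "allocated", "acknowledged", "actual")


@dataclass
class ContainerCPU:
    """CPU request of one container as seen at each stage (hypothetical model)."""
    desired: float       # spec resources
    allocated: float     # admitted by the Kubelet, checkpointed
    acknowledged: float  # successfully passed to the runtime, checkpointed
    actual: float        # reported back by the runtime


def first_lagging_stage(c: ContainerCPU) -> Optional[str]:
    """Return the first stage that differs from the stage before it, or None."""
    values = [c.desired, c.allocated, c.acknowledged, c.actual]
    for prev, cur, name in zip(values, values[1:], STAGES[1:]):
        if cur != prev:
            return name
    return None


# A resize to 1.5 CPUs was requested but not yet admitted.
print(first_lagging_stage(ContainerCPU(1.5, 1, 1, 1)))      # allocated
# Admitted, but the runtime call has not succeeded yet.
print(first_lagging_stage(ContainerCPU(1.5, 1.5, 1, 1)))    # acknowledged
```

The sketch only reports where a change currently sits. It does not imply that the Kubelet acts on a mismatch between acknowledged and actual resources.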
@@ -483,6 +516,7 @@ This is intentionally hitting various edge-cases for demonstration.
  - `spec.containers[0].resources.requests[cpu]` = 1
  - `status.resize` = unset
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1
+ - `acknowledged[cpu]` = 1
  - `status.containerStatuses[0].resources.requests[cpu]` = 1
  - actual CPU shares = 1024
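The "actual CPU shares" values in this walkthrough (1024, 1536, 1638) are consistent with the conventional conversion of 1024 cgroup CPU shares per CPU, truncated to an integer. The sketch below reproduces that arithmetic. The constants and the function name `milli_cpu_to_shares` are assumptions of this example, not something stated in the diff.

```python
# Assumed constants for the conventional cgroup v1 CPU shares conversion.
SHARES_PER_CPU = 1024
MILLI_CPU_PER_CPU = 1000
MIN_SHARES = 2
MAX_SHARES = 262144


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to CPU shares (integer division)."""
    if milli_cpu <= 0:
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    return max(MIN_SHARES, min(MAX_SHARES, shares))


for request in (1000, 1500, 1600):
    print(request, milli_cpu_to_shares(request))
# 1000 1024
# 1500 1536
# 1600 1638
```

Working in millicores keeps the computation in integers, so 1.6 CPUs maps to 1638 shares rather than a rounded floating-point value.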
@@ -492,13 +526,25 @@ This is intentionally hitting various edge-cases for demonstration.
  - `spec.containers[0].resources.requests[cpu]` = 1.5
  - `status.resize` = unset
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1
+ - `acknowledged[cpu]` = 1
+ - `status.containerStatuses[0].resources.requests[cpu]` = 1
+ - actual CPU shares = 1024
+
+ 1. Kubelet Restarts!
+ - The allocated & acknowledged resources are read back from checkpoint
+ - Pods are resynced from the API server, but admitted based on the allocated resources
+ - `spec.containers[0].resources.requests[cpu]` = 1.5
+ - `status.resize` = unset
+ - `status.containerStatuses[0].allocatedResources[cpu]` = 1
+ - `acknowledged[cpu]` = 1
  - `status.containerStatuses[0].resources.requests[cpu]` = 1
  - actual CPU shares = 1024

  1. Kubelet syncs the pod, sees resize #1 and admits it
  - `spec.containers[0].resources.requests[cpu]` = 1.5
  - `status.resize` = `"InProgress"`
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
+ - `acknowledged[cpu]` = 1
  - `status.containerStatuses[0].resources.requests[cpu]` = 1
  - actual CPU shares = 1024
@@ -514,6 +560,7 @@ This is intentionally hitting various edge-cases for demonstration.
  - `spec.containers[0].resources.requests[cpu]` = 2
  - `status.resize` = `"InProgress"`
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
+ - `acknowledged[cpu]` = 1.5
  - `status.containerStatuses[0].resources.requests[cpu]` = 1
  - actual CPU shares = 1536
@@ -522,6 +569,7 @@ This is intentionally hitting various edge-cases for demonstration.
  - `spec.containers[0].resources.requests[cpu]` = 2
  - `status.resize[cpu]` = `"Deferred"`
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
+ - `acknowledged[cpu]` = 1.5
  - `status.containerStatuses[0].resources.requests[cpu]` = 1.5
  - actual CPU shares = 1536
@@ -530,27 +578,31 @@ This is intentionally hitting various edge-cases for demonstration.
  - `spec.containers[0].resources.requests[cpu]` = 1.6
  - `status.resize[cpu]` = `"Deferred"`
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
+ - `acknowledged[cpu]` = 1.5
  - `status.containerStatuses[0].resources.requests[cpu]` = 1.5
  - actual CPU shares = 1536

  1. Kubelet syncs the pod, and sees resize #3 and admits it
  - `spec.containers[0].resources.requests[cpu]` = 1.6
  - `status.resize[cpu]` = `"InProgress"`
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
+ - `acknowledged[cpu]` = 1.5
  - `status.containerStatuses[0].resources.requests[cpu]` = 1.5
  - actual CPU shares = 1536

  1. Container runtime applied cpu=1.6
  - `spec.containers[0].resources.requests[cpu]` = 1.6
  - `status.resize[cpu]` = `"InProgress"`
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
+ - `acknowledged[cpu]` = 1.6
  - `status.containerStatuses[0].resources.requests[cpu]` = 1.5
  - actual CPU shares = 1638

  1. Kubelet syncs the pod
  - `spec.containers[0].resources.requests[cpu]` = 1.6
  - `status.resize[cpu]` = unset
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
+ - `acknowledged[cpu]` = 1.6
  - `status.containerStatuses[0].resources.requests[cpu]` = 1.6
  - actual CPU shares = 1638
@@ -559,6 +611,7 @@ This is intentionally hitting various edge-cases for demonstration.
  - `spec.containers[0].resources.requests[cpu]` = 100
  - `status.resize[cpu]` = unset
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
+ - `acknowledged[cpu]` = 1.6
  - `status.containerStatuses[0].resources.requests[cpu]` = 1.6
  - actual CPU shares = 1638
@@ -567,6 +620,7 @@ This is intentionally hitting various edge-cases for demonstration.
  - `spec.containers[0].resources.requests[cpu]` = 100
  - `status.resize[cpu]` = `"Infeasible"`
  - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
+ - `acknowledged[cpu]` = 1.6
  - `status.containerStatuses[0].resources.requests[cpu]` = 1.6
  - actual CPU shares = 1638
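The admission outcomes in the walkthrough (`InProgress`, `Deferred`, `Infeasible`) can be replayed with a toy admission check. This is a rough sketch under stated assumptions: the node size (4 CPUs) and the usage of other pods (2.3 CPUs) are invented so that the outcomes match the walkthrough, and `admit_resize` is a hypothetical helper rather than the Kubelet's admission logic.

```python
from typing import Tuple


def admit_resize(desired: int, allocated: int, other_pods: int,
                 node_allocatable: int) -> Tuple[str, int]:
    """Return (resize status, new allocated value). All values are CPU millicores."""
    if desired == allocated:
        return "", allocated                 # nothing to do
    if desired > node_allocatable:
        return "Infeasible", allocated       # can never fit on this node
    if other_pods + desired > node_allocatable:
        return "Deferred", allocated         # does not fit right now
    return "InProgress", desired             # admitted: allocated is updated


NODE_ALLOCATABLE = 4000   # assumed node size
OTHER_PODS = 2300         # assumed usage by other pods on the node

allocated = 1000
history = []
for desired in (1500, 2000, 1600, 100000):   # the four resizes in the walkthrough
    status, allocated = admit_resize(desired, allocated, OTHER_PODS, NODE_ALLOCATABLE)
    history.append((desired, status, allocated))

for row in history:
    print(row)
# (1500, 'InProgress', 1500)
# (2000, 'Deferred', 1500)
# (1600, 'InProgress', 1600)
# (100000, 'Infeasible', 1600)
```

Note that the allocated value only changes when a resize is admitted, which is why it stays at 1.5 and then 1.6 while the spec moves to 2 and 100.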
@@ -695,7 +749,7 @@ intended.
  The atomic resize requirement should be reevaluated prior to GA, and in the context of pod-level resources.

- ### Edge-triggered Resizes
+ ### Actuating Resizes

  The resources specified by the Kubelet are not guaranteed to be the actual resources configured for
  a pod or container. Examples include:
@@ -706,16 +760,16 @@ a pod or container. Examples include:
  Therefore the Kubelet cannot reliably compare desired & actual resources to know whether to trigger
  a resize (a level-triggered approach).

- To accommodate this, the Kubelet stores a bit along with every resource in the allocated resource
- checkpoint which tracks whether the resource has been successfully resized. For container resources,
- this means the `UpdateContainerResources` request succeeded. This status bit is persisted in the
- allocated resources checkpoint to avoid extra resize requests across Kubelet restarts. There is the
- possibility that a poorly timed restart could lead to a resize request being repeated, so
- `UpdateContainerResources` should be idempotent.
+ To accommodate this, the Kubelet stores the set of "acknowledged" resources per container.
+ Acknowledged resources represent the resource configuration that was passed to the runtime (either
+ via a `CreateContainer` or `UpdateContainerResources` call) and received a successful response. The
+ acknowledged resources are checkpointed alongside the allocated resources to persist across
+ restarts. There is the possibility that a poorly timed restart could lead to a resize request being
+ repeated, so `UpdateContainerResources` must be idempotent.

- When a resize request succeeds, the pod will be marked for resync to read the latest resources. If
+ When a resize CRI request succeeds, the pod will be marked for resync to read the latest resources. If
  the actual configured resources do not match the desired resources, this will be reflected in the
- pod status resources.
+ pod status resources, but not otherwise acted upon.

  ### Memory Limit Decreases
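To illustrate the acknowledged-resources approach described in this hunk, the sketch below compares allocated against acknowledged resources to decide whether to call the runtime, and records the acknowledgment only after the call succeeds. `FakeRuntime`, `actuate`, and the dictionary standing in for the checkpoint file are all hypothetical; the real CRI call and checkpoint format are not shown in the diff.

```python
from typing import Dict, List, Tuple


class FakeRuntime:
    """Stand-in for a container runtime (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, int]] = []
        self.config: Dict[str, int] = {}

    def update_container_resources(self, container_id: str, cpu_milli: int) -> None:
        # Idempotent: applying the same value twice leaves the same configuration.
        self.calls.append((container_id, cpu_milli))
        self.config[container_id] = cpu_milli


def actuate(container_id: str, allocated: int,
            checkpoint: Dict[str, int], runtime: FakeRuntime) -> bool:
    """Call the runtime only if allocated differs from the acknowledged value."""
    if checkpoint.get(container_id) == allocated:
        return False                          # already acknowledged, nothing to do
    runtime.update_container_resources(container_id, allocated)
    checkpoint[container_id] = allocated      # acknowledge only after success
    return True


runtime = FakeRuntime()
checkpoint: Dict[str, int] = {}
print(actuate("c0", 1500, checkpoint, runtime))  # True: runtime is called
print(actuate("c0", 1500, checkpoint, runtime))  # False: no repeated call

# Poorly timed restart: the runtime applied 1500 but the checkpoint still says 1000.
stale_checkpoint = {"c0": 1000}
print(actuate("c0", 1500, stale_checkpoint, runtime))  # True: request is repeated
print(runtime.config["c0"])                            # 1500: repeat was harmless
```

The last case is the reason the text requires `UpdateContainerResources` to be idempotent: the repeated request must leave the container in the same configuration.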
@@ -1293,7 +1347,7 @@ _This section must be completed when targeting beta graduation to a release._
  container status.
  - The `ResizeStatus` in the pod status should converge to the empty value, indicating the resize has completed.
  - The `Resources` in the container status should converge to the resized resources, or an
- approximation of it (see [Edge-triggered Resizes](#edge-triggered-resizes) for more details on
+ approximation of it (see [Actuating Resizes](#actuating-resizes) for more details on
  when these resources can diverge).

  * **What are the SLIs (Service Level Indicators) an operator can use to determine
@@ -1452,7 +1506,7 @@ _This section must be completed when targeting beta graduation to a release._
  - Rename ResizeRestartPolicy `NotRequired` to `PreferNoRestart`,
  and update CRI `UpdateContainerResources` contract
  - Add back `AllocatedResources` field to resolve a scheduler corner case
- - Switch to edge-triggered resize actuation
+ - Introduce Acknowledged resources for actuation

  ## Drawbacks