Commit c6b13b4

KEP-1287: Kubelet restart analysis
1 parent 3bf6b82 commit c6b13b4

1 file changed: +37 -0 lines changed
  • keps/sig-node/1287-in-place-update-pod-resources

keps/sig-node/1287-in-place-update-pod-resources/README.md

Lines changed: 37 additions & 0 deletions
@@ -26,6 +26,7 @@
 - [Container resource limit update ordering](#container-resource-limit-update-ordering)
 - [Container resource limit update failure handling](#container-resource-limit-update-failure-handling)
 - [CRI Changes Flow](#cri-changes-flow)
+- [Kubelet Restart Analysis](#kubelet-restart-analysis)
 - [Notes](#notes)
 - [Lifecycle Nuances](#lifecycle-nuances)
 - [Atomic Resizes](#atomic-resizes)
@@ -702,6 +703,42 @@ Pod Status in response to user changing the desired resources in Pod Spec.
 in ContainerStatus.Resources to update ContainerStatuses[i].Resources.Limits
 for that Container in the Pod's Status.
 
+#### Kubelet Restart Analysis
+
+Analysis of Kubelet restarts at various points during a resize, and how the Kubelet recovers.
+Impacts of a restart outside of resource configuration are out of scope.
+
+1. Kubelet admits a new pod
+    - Resource allocation checkpointed before sending the pod to the pod workers
+    - Restart before checkpointing: pod goes through admission again as if new
+    - Restart after checkpointing: pod goes through admission using the allocated resources
+1. Kubelet creates a container
+    - Resources acknowledged after CreateContainer call succeeds
+    - Restart before acknowledgement: Kubelet issues a superfluous UpdatePodResources request
+    - Restart after acknowledgement: No resize needed
+1. Container starts, triggering a pod sync event
+    - Kubelet updates status with actual resources reported by runtime, allocated resources from checkpoint
+    - Allocated == Acknowledged, so no resize needed
+    - No races around restart.
+1. Pod is resized in the API, Kubelet observes the update
+    - Triggers a pod sync
+    - On restart, Kubelet reads the latest pod from the API and triggers a pod sync, so the effect
+      is the same as observing the update.
+1. Updated pod is synced: Check if pod can be admitted
+    - No: resize status is deferred, no change to allocated resources
+      - Restart: redo admission check, still deferred.
+    - Yes: resize status is in-progress, update allocated checkpoint
+      - Restart before update: readmit, then update allocated
+      - Restart after update: allocated != acknowledged --> proceed with resize
+1. Allocated != Acknowledged
+    - Trigger an `UpdateContainerResources` CRI call, then update Acknowledged resources on success
+    - Restart before CRI call: allocated != acknowledged, will still trigger the update call
+    - Restart after CRI call, before acknowledged update: will redo update call
+    - Restart after acknowledged update: allocated == acknowledged, resize status cleared
+1. PLEG updates PodStatus cache, triggers pod sync
+    - Pod status updated with actual resources, resize status cleared
+    - Desired == Allocated == Acknowledged, no resize changes needed.
+
 #### Notes
 
 * If CPU Manager policy for a Node is set to 'static', then only integral
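
The restart cases in the Kubelet Restart Analysis section added by this commit can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Kubelet's implementation: `Checkpoint`, `FakeRuntime`, `sync_pod`, and the crash-point names are hypothetical, and the model covers only the "updated pod is synced" and "Allocated != Acknowledged" steps. It assumes the checkpoint and the runtime state survive a restart, and that a restart is followed by a fresh pod sync.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Resources = Dict[str, str]

# Hypothetical points at which the simulated Kubelet may exit mid-sync.
CRASH_POINTS = (None, "before_allocate", "after_allocate",
                "before_cri", "after_cri", "after_ack")


class Restart(Exception):
    """Simulates the Kubelet process exiting at a named point."""


@dataclass
class Checkpoint:
    """Hypothetical durable state that survives a restart."""
    allocated: Resources
    acknowledged: Resources


@dataclass
class FakeRuntime:
    """Stand-in for the container runtime; counts resize calls."""
    actual: Resources
    update_calls: int = 0

    def update_container_resources(self, resources: Resources) -> None:
        self.actual = dict(resources)
        self.update_calls += 1


def sync_pod(desired: Resources, cp: Checkpoint, rt: FakeRuntime,
             fits: Callable[[Resources], bool],
             crash_at: Optional[str] = None) -> Optional[str]:
    """Run one pod sync; return 'Deferred' or None (resize status cleared)."""
    def crash_if(point: str) -> None:
        if crash_at == point:
            raise Restart(point)

    status = None
    if desired != cp.allocated:
        if fits(desired):
            crash_if("before_allocate")
            cp.allocated = dict(desired)      # update allocated checkpoint
            crash_if("after_allocate")
        else:
            status = "Deferred"               # allocated stays unchanged
    if cp.allocated != cp.acknowledged:
        crash_if("before_cri")
        rt.update_container_resources(cp.allocated)
        crash_if("after_cri")
        cp.acknowledged = dict(cp.allocated)  # acknowledge only on success
        crash_if("after_ack")
    return status


def run_with_restart(crash_at: Optional[str]):
    """Resize cpu 1 -> 2, crash at the given point, then sync again."""
    old, new = {"cpu": "1"}, {"cpu": "2"}
    cp = Checkpoint(allocated=dict(old), acknowledged=dict(old))
    rt = FakeRuntime(actual=dict(old))
    try:
        sync_pod(new, cp, rt, fits=lambda r: True, crash_at=crash_at)
    except Restart:
        pass  # in-memory progress is lost; cp and rt persist
    status = sync_pod(new, cp, rt, fits=lambda r: True)
    return cp, rt, status


if __name__ == "__main__":
    for point in CRASH_POINTS:
        cp, rt, status = run_with_restart(point)
        print(point, cp.allocated, cp.acknowledged, rt.update_calls, status)
```

In this model every crash point converges to Desired == Allocated == Acknowledged after the follow-up sync. The only visible difference is a repeated runtime call when the restart falls between the CRI call and the acknowledged update, which matches the "will redo update call" case and relies on that call being safe to repeat.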
