Introduce WG Checkpoint Restore #8508

adrianreber · 2025-07-03T13:33:14Z

As described in sig-wg-lifecycle.md, this PR is the next step after sending an email to [email protected] about the creation of the Working Group Checkpoint Restore.

CC: @rst0git, @viktoriaas, @xhejtman

k8s-ci-robot · 2025-07-03T13:33:24Z

Welcome @adrianreber!

It looks like this is your first PR to kubernetes/community 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/community has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-07-03T13:33:24Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adrianreber
Once this PR has been reviewed and has the lgtm label, please assign saschagrunert for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-07-03T13:33:25Z

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kannon92 · 2025-07-10T18:31:17Z

/ok-to-test

kannon92 · 2025-07-15T02:42:37Z

Looking at #8519,

I see that we are missing a charter.

adrianreber · 2025-07-15T06:34:13Z

> Looking at #8519,
>
> I see that we are missing a charter.

In https://github.com/kubernetes/community/blob/master/sig-wg-lifecycle.md#GitHub it says to add a charter once this initial PR has been merged. That's why I skipped it.

SergeyKanzhelev · 2025-07-18T17:40:53Z

sigs.yaml

+    the integration of Checkpoint/Restore functionality into Kubernetes.
+
+  charter_link: charter.md
+  stakeholder_sigs:


SIG Auth may have a big say in the security of this whole restoration pipeline.

Thank you for pointing this out! Security is definitely an important topic that we need to discuss with SIG Auth, both for the checkpoint API and the restoration pipeline. The following paper and master's thesis describe our recent work on this topic:

Towards Efficient End-to-End Encryption for Container Checkpointing Systems

Improving Checkpoint/Restore Functionality in Kubernetes

I added SIG Auth to the list of stakeholder SIGs.

this showed up in the sig-auth meeting, we may have missed the discussion around this WG

if this WG is contemplating taking state from a running pod / saving it / letting it be consumed on another node or from another pod or another namespace, then sig-auth is definitely interested in making sure the permissions model around that exists and is ~consistent with similar things Kubernetes does elsewhere (like PVC / snapshots)

We're happy to consult on that, I'm not sure our awareness / involvement rises to the level of sponsoring the WG :)

cc @kubernetes/sig-auth-leads

SergeyKanzhelev · 2025-07-18T17:43:14Z

sigs.yaml

+    This working group aims to provide a central location for the community to discuss
+    the integration of Checkpoint/Restore functionality into Kubernetes.
+
+  charter_link: charter.md


Is the charter included in this PR?

Now it is. I didn't add it initially because the lifecycle document says it is added later, but looking at other WG PRs, it seems common to include a charter in the initial PR.

adrianreber · 2025-07-25T12:51:09Z

/test pull-community-verify

adrianreber · 2025-07-25T12:57:55Z

/verify-owners

janetkuo · 2025-07-25T22:12:37Z

sigs.yaml

+    the integration of Checkpoint/Restore functionality into Kubernetes.
+
+  charter_link: charter.md
+  stakeholder_sigs:


This is a valuable initiative. The charter mentions that the scope includes checkpointing and restoring 'workloads' and providing 'guidance for developers on checkpoint-friendly app design.' Given this focus, it's essential for SIG Apps to be involved as a key stakeholder.

@janetkuo This is a good idea, thank you so much for suggesting it!

Yes, thanks @janetkuo. I added SIG Apps to the proposal.

I agree with Janet here, but please make sure to show up and present the scope of this proposal to one of the future SIG-Apps calls.

SergeyKanzhelev · 2025-07-28T17:46:26Z

wg-checkpoint-restore/charter.md

+- Investigate and propose Kubernetes APIs for checkpoint/restore operations.
+- Work with SIGs for the best integration of checkpoint/restore functionality
+  and APIs.
+- Provide guidance for developers on checkpoint-friendly app design and


There may be an API needed to communicate between the app and the API server that a checkpoint is requested and/or that the app is ready for a checkpoint. That is something beyond just guidance.

That is actually something we have discussed for years in the context of containers (outside of Kubernetes), but we never found the right way to do it. We looked at kernel interfaces and systemd interfaces, because for many applications it could be helpful to free temporary memory to reduce the checkpoint size, or even to drop confidential information. Also, after a restore it would be good to tell the application that certain cryptographic values may need to be reset or regenerated. I will try to include something mentioning this. Thanks.

wg-checkpoint-restore/charter.md

Co-authored-by: Sergey Kanzhelev <[email protected]> Signed-off-by: Adrian Reber <[email protected]>

k8s-ci-robot · 2025-07-28T18:05:55Z

The following users are mentioned in OWNERS file(s) but are untrusted for the following reasons. One way to make the user trusted is to add them as members of the kubernetes org. You can then trigger verification by writing /verify-owners in a comment.

viktoriaas
- User is not a member of the org. Satisfy at least one of these conditions to make the user trusted.
rst0git
- User is not a member of the org. Satisfy at least one of these conditions to make the user trusted.

aramase · 2025-08-04T16:28:16Z

/assign ritazh

(assigned as part of SIG Auth triage; to review the SIG Auth updates)

k8s-ci-robot · 2025-08-05T05:54:08Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

BenTheElder · 2025-08-20T00:11:48Z

wg-checkpoint-restore/charter.md

+Stakeholders in this working group span multiple SIGs that own parts of the
+code in core kubernetes components and addons.
+
+- SIG CLI


Has there been outreach to these SIGs yet? I see some SIG Node participants/leaders, but not SIG CLI yet, for example.

Changing from SIG CLI to SIG API Machinery after SIG CLI mentioned that new commands are first introduced via a plugin.

kubernetes/enhancements#5091

I have a simple prototype of a plugin for the kubectl checkpoint command that we used in our demos and can discuss further in the working group.

OK, we should remove the SIG if they are not a stakeholder, or else the WG organizers should reach out to the SIGs.

BenTheElder · 2025-08-20T00:11:53Z

@kubernetes/sig-node-leads are you all +1, officially?

haircommander · 2025-08-20T13:16:04Z

+1 from me

aojea · 2025-08-21T08:26:41Z

wg-checkpoint-restore/charter.md

+
+- maintain a solid communication line between the Kubernetes groups and the
+  wider CNCF community
+- submit a proposal to the KubeCon/CloudNativeCon maintainers track


I have doubts about whether this incentivizes the right behavior; it may encourage people to form WGs just to get a slot at KubeCon.

I agree with Antonio: this particular line should be removed. What the previous point says is sufficient.

soltysh · 2025-08-27T11:59:57Z

sig-cli/README.md

@@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
 ## Working Groups

 The following [working groups][working-group-definition] are sponsored by sig-cli:
+* [WG Checkpoint Restore](/wg-checkpoint-restore)


With my SIG CLI hat on, I'm raising a similar comment to the other one: this topic wasn't brought to SIG CLI's attention.

cc @kubernetes/sig-cli-leads

soltysh · 2025-08-27T12:03:07Z

sigs.yaml

+    the integration of Checkpoint/Restore functionality into Kubernetes.
+
+  charter_link: charter.md
+  stakeholder_sigs:


I agree with Janet here, but please make sure to show up and present the scope of this proposal to one of the future SIG-Apps calls.

soltysh · 2025-08-27T12:05:50Z

sigs.yaml

+  meetings: []
+  contact:
+    slack: wg-checkpoint-restore
+    mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore


The mailing lists for all WGs/SIGs are part of our managed Google Groups, with a kubernetes.io domain. So this should be https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore. Once you get the group created, we'll be able to provision this for you.
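
For illustration, the `contact` entry from the diff above might look as follows after this change. This is only a sketch: it assumes the managed group ends up being created under exactly the name given in the comment, and it leaves the other fields as they appear in the diff.

```yaml
  contact:
    slack: wg-checkpoint-restore
    # Assumption: the managed kubernetes.io Google Group is created under this name.
    mailing_list: https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore
```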

soltysh · 2025-08-27T12:07:12Z

wg-checkpoint-restore/charter.md

+
+The Checkpoint/Restore Working Group aims to solve the problem of transparently
+checkpointing and restoring workloads in Kubernetes, a functionality discussed
+for over five years. The group will deliver the design and implementation of


Can you link to those discussions?

I believe most of the previous discussions are linked here: kubernetes/enhancements#2008

soltysh · 2025-08-27T12:08:45Z

wg-checkpoint-restore/charter.md

+The Checkpoint/Restore Working Group aims to solve the problem of transparently
+checkpointing and restoring workloads in Kubernetes, a functionality discussed
+for over five years. The group will deliver the design and implementation of
+Checkpoint/Restore functionality in Kubernetes, serving as a central hub for


Why does it have to be a central part of Kubernetes when multiple external solutions already exist?

I personally like the idea of having it integrated, because then the ecosystem can rely on it. For instance, we could make eviction or preemption less disruptive in kubelet/Kueue, respectively.

> Why does it have to be a central part of Kubernetes when multiple external solutions already exist?

Can you be more specific about what already exists? I'm not sure what you are referring to.

soltysh · 2025-08-27T12:11:22Z

wg-checkpoint-restore/charter.md

+Checkpoint/Restore functionality in Kubernetes, serving as a central hub for
+community information and discussion. This initiative addresses a wide range of
+problems, including fault tolerance, improved resource utilization, and
+accelerated application startup times.


The first thing that I'd like to point out is that there are two main use cases:

- the whole control-plane snapshot
- workload

Which one is this group planning to cover? As I read this document, I see both used interchangeably, which is very confusing. That's why I'd start by clearly drawing the line between the two and properly documenting which of these two (or both) you are planning to tackle.

What do you mean by "control-plane-snapshot"?

In our presentations, we use the following diagram to illustrate how checkpoint/restore operations work in Kubernetes:

Reference: Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes

soltysh · 2025-08-27T12:19:01Z

wg-checkpoint-restore/charter.md

+
+- Ability to checkpoint and restore a container using kubectl
+- Ability to checkpoint and restore a pod using kubectl
+- Integration of container/pod checkpointing in scheduling decisions


Why would pod checkpointing have anything to do with scheduling?

Because, as far as I know, a pod is always scheduled on one node. It doesn't sound useful to base the scheduling on the possibility to migrate containers. Container migration is an important first step, but for automatic scheduling decisions, it would make more sense to be able to easily migrate a complete pod.

Our use-case is similar to how CRIU is integrated with Google's Borg ¹ and Microsoft's Singularity ² to enable preemptive and elastic scheduling.

Footnotes

1. Task Migration at Scale Using CRIU

2. Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads

soltysh · 2025-08-27T12:20:09Z

wg-checkpoint-restore/charter.md

+
+- maintain a solid communication line between the Kubernetes groups and the
+  wider CNCF community
+- submit a proposal to the KubeCon/CloudNativeCon maintainers track


I agree with Antonio: this particular line should be removed. What the previous point says is sufficient.

Co-authored-by: viktoriaas <[email protected]> Co-authored-by: Antonio Ojea <[email protected]>

k8s-ci-robot · 2025-09-01T15:25:08Z

@adrianreber: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-community-verify | `224da25` | link | true | `/test pull-community-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels Jul 3, 2025

k8s-ci-robot requested review from deads2k and macsko July 3, 2025 13:33

k8s-ci-robot added the sig/node (categorizes an issue or PR as relevant to SIG Node) and sig/scheduling (categorizes an issue or PR as relevant to SIG Scheduling) labels Jul 3, 2025

github-project-automation bot added this to SIG Scheduling Jul 3, 2025

github-project-automation bot moved this to Needs Triage in SIG Scheduling Jul 3, 2025

k8s-ci-robot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Jul 3, 2025

adrianreber mentioned this pull request Jul 3, 2025

Checkpointing API kubernetes/enhancements#5091

Open

6 tasks

haircommander reviewed Jul 8, 2025

OWNERS_ALIASES

adrianreber force-pushed the 2025-07-03-wg-cr branch from 33e97fb to abc1c26 Compare July 9, 2025 12:15

kannon92 reviewed Jul 10, 2025

OWNERS_ALIASES

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Jul 10, 2025

adrianreber mentioned this pull request Jul 17, 2025

REQUEST: New membership for adrianreber kubernetes/org#5706

Closed

11 tasks

SergeyKanzhelev reviewed Jul 18, 2025

adrianreber force-pushed the 2025-07-03-wg-cr branch from abc1c26 to 8bc6968 Compare July 20, 2025 15:15

k8s-ci-robot added the committee/steering Denotes an issue or PR intended to be handled by the steering committee. label Jul 20, 2025

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2025

k8s-ci-robot requested review from dom4ha and fedebongio July 22, 2025 16:02

janetkuo reviewed Jul 25, 2025

SergeyKanzhelev reviewed Jul 28, 2025

wg-checkpoint-restore/charter.md (outdated)

Introduce WG Checkpoint Restore

b60c587

Co-authored-by: Sergey Kanzhelev <[email protected]> Signed-off-by: Adrian Reber <[email protected]>

adrianreber force-pushed the 2025-07-03-wg-cr branch from 72d980d to b60c587 Compare July 28, 2025 18:05

k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Jul 28, 2025

rst0git mentioned this pull request Aug 4, 2025

REQUEST: New membership for rst0git kubernetes/org#5754

Closed

11 tasks

k8s-ci-robot assigned ritazh Aug 4, 2025

aramase moved this from Needs Triage to In Review in SIG Auth Aug 4, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 5, 2025

viktoriaas mentioned this pull request Aug 11, 2025

REQUEST: New membership for viktoriaas kubernetes/org#5764

Closed

11 tasks

BenTheElder reviewed Aug 20, 2025

sigs.yaml

aojea reviewed Aug 21, 2025

wg-checkpoint-restore/charter.md (outdated)

aojea reviewed Aug 21, 2025

wg-checkpoint-restore/charter.md

viktoriaas reviewed Aug 21, 2025

wg-checkpoint-restore/README.md

soltysh reviewed Aug 27, 2025

Apply suggestions from code review

224da25

Co-authored-by: viktoriaas <[email protected]> Co-authored-by: Antonio Ojea <[email protected]>
