diff --git a/OWNERS_ALIASES b/OWNERS_ALIASES index 0839cb169b6..1f90e3076c8 100644 --- a/OWNERS_ALIASES +++ b/OWNERS_ALIASES @@ -130,6 +130,11 @@ aliases: - mwielgus - soltysh - swatisehgal + wg-checkpoint-restore-leads: + - adrianreber + - haircommander + - rst0git + - viktoriaas wg-data-protection-leads: - xing-yang - yuxiangqian diff --git a/sig-api-machinery/README.md b/sig-api-machinery/README.md index d2389c6600a..dbc765d5670 100644 --- a/sig-api-machinery/README.md +++ b/sig-api-machinery/README.md @@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. ## Working Groups The following [working groups][working-group-definition] are sponsored by sig-api-machinery: +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Structured Logging](/wg-structured-logging) diff --git a/sig-apps/README.md b/sig-apps/README.md index fa2e645ea70..a3b7d21e645 100644 --- a/sig-apps/README.md +++ b/sig-apps/README.md @@ -58,6 +58,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-apps: * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Data Protection](/wg-data-protection) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sig-auth/README.md b/sig-auth/README.md index ee26f690abb..a072e740b3d 100644 --- a/sig-auth/README.md +++ b/sig-auth/README.md @@ -62,6 +62,12 @@ subprojects, and resolve cross-subproject technical issues and decisions. 
- [@kubernetes/sig-auth-test-failures](https://github.com/orgs/kubernetes/teams/sig-auth-test-failures) - Test Failures and Triage - Steering Committee Liaison: Patrick Ohly (**[@pohly](https://github.com/pohly)**) +## Working Groups + +The following [working groups][working-group-definition] are sponsored by sig-auth: +* [WG Checkpoint Restore](/wg-checkpoint-restore) + + ## Subprojects The following [subprojects][subproject-definition] are owned by sig-auth: diff --git a/sig-cli/README.md b/sig-cli/README.md index 3fe661cb7ea..7f19251f64e 100644 --- a/sig-cli/README.md +++ b/sig-cli/README.md @@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. ## Working Groups The following [working groups][working-group-definition] are sponsored by sig-cli: +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Node Lifecycle](/wg-node-lifecycle) diff --git a/sig-list.md b/sig-list.md index f9bd8485863..4cb30eae780 100644 --- a/sig-list.md +++ b/sig-list.md @@ -62,6 +62,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md) | Name | Label | Stakeholder SIGs |Organizers | Contact | Meetings | |------|-------|------------------|-----------|---------|----------| |[Batch](wg-batch/README.md)|[batch](https://github.com/kubernetes/kubernetes/labels/wg%2Fbatch)|* Apps
* Autoscaling
* Node
* Scheduling
|* [Kevin Hannon](https://github.com/kannon92), Red Hat
* [Marcin Wielgus](https://github.com/mwielgus), Google
* [Maciej Szulik](https://github.com/soltysh), Defense Unicorns
* [Swati Sehgal](https://github.com/swatisehgal), Red Hat
|* [Slack](https://kubernetes.slack.com/messages/wg-batch)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-batch)|* Regular Meeting ([calendar](https://calendar.google.com/calendar/embed?src=8ulop9k0jfpuo0t7kp8d9ubtj4%40group.calendar.google.com)): [Thursdays (starting February 15th 2024)s at 3PM CET (Central European Time) (monthly)](https://zoom.us/j/98329676612?pwd=c0N2bVV1aTh2VzltckdXSitaZXBKQT09)
+|[Checkpoint Restore](wg-checkpoint-restore/README.md)|[checkpoint-restore](https://github.com/kubernetes/kubernetes/labels/wg%2Fcheckpoint-restore)|* API Machinery
* Apps
* Auth
* CLI
* Node
* Scheduling
|* [Adrian Reber](https://github.com/adrianreber), Red Hat
* [Peter Hunt](https://github.com/haircommander), Red Hat
* [Radostin Stoyanov](https://github.com/rst0git), University of Oxford
* [Viktória Spišaková](https://github.com/viktoriaas), Masaryk University
|* [Slack](https://kubernetes.slack.com/messages/wg-checkpoint-restore)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore)| |[Data Protection](wg-data-protection/README.md)|[data-protection](https://github.com/kubernetes/kubernetes/labels/wg%2Fdata-protection)|* Apps
* Storage
|* [Xing Yang](https://github.com/xing-yang), VMware
* [Xiangqian Yu](https://github.com/yuxiangqian), Google
|* [Slack](https://kubernetes.slack.com/messages/wg-data-protection)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-data-protection)|* Regular WG Meeting: [Wednesdays at 9:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/j/6933410772)
|[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture
* Autoscaling
* Network
* Node
* Scheduling
|* [John Belamaric](https://github.com/johnbelamaric), Google
* [Kevin Klues](https://github.com/klueska), NVIDIA
* [Patrick Ohly](https://github.com/pohly), Intel
|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting (Asia/Europe): [Wednesdays at 9:00 CET (Central European Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)
* Regular WG Meeting (Europe/America): [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle
* etcd
|* [Benjamin Wang](https://github.com/ahrtr), VMware
* [Ciprian Hacman](https://github.com/hakman), Microsoft
* [Josh Berkus](https://github.com/jberkus), Red Hat
* [James Blair](https://github.com/jmhbnz), Red Hat
* [Justin Santa Barbara](https://github.com/justinsb), Google
|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)
diff --git a/sig-node/README.md b/sig-node/README.md index 1ef3742dbb3..1826db21a11 100644 --- a/sig-node/README.md +++ b/sig-node/README.md @@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-node: * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Device Management](/wg-device-management) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sig-scheduling/README.md b/sig-scheduling/README.md index 1d6b8c590f0..f0365b30bc6 100644 --- a/sig-scheduling/README.md +++ b/sig-scheduling/README.md @@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-scheduling: * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Device Management](/wg-device-management) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sigs.yaml b/sigs.yaml index 5eda987da38..303efc25973 100644 --- a/sigs.yaml +++ b/sigs.yaml @@ -3583,6 +3583,43 @@ workinggroups: liaison: github: aojea name: Antonio Ojea +- dir: wg-checkpoint-restore + name: Checkpoint Restore + mission_statement: > + This working group aims to provide a central location for the community to discuss + the integration of Checkpoint/Restore functionality into Kubernetes. 
+ + charter_link: charter.md + stakeholder_sigs: + - API Machinery + - Apps + - Auth + - CLI + - Node + - Scheduling + label: checkpoint-restore + leadership: + chairs: + - github: adrianreber + name: Adrian Reber + company: Red Hat + email: areber@redhat.com + - github: haircommander + name: Peter Hunt + company: Red Hat + email: pehunt@redhat.com + - github: rst0git + name: Radostin Stoyanov + company: University of Oxford + email: radostin.stoyanov@eng.ox.ac.uk + - github: viktoriaas + name: Viktória Spišaková + company: Masaryk University + email: spisakova@ics.muni.cz + meetings: [] + contact: + slack: wg-checkpoint-restore + mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore - dir: wg-data-protection name: Data Protection mission_statement: > diff --git a/wg-checkpoint-restore/README.md b/wg-checkpoint-restore/README.md new file mode 100644 index 00000000000..7de361d4213 --- /dev/null +++ b/wg-checkpoint-restore/README.md @@ -0,0 +1,38 @@ + +# Checkpoint Restore Working Group + +This working group aims to provide a central location for the community to discuss the integration of Checkpoint/Restore functionality into Kubernetes. + +The [charter](charter.md) defines the scope and governance of the Checkpoint Restore Working Group. 
+ +## Stakeholder SIGs +* [SIG API Machinery](/sig-api-machinery) +* [SIG Apps](/sig-apps) +* [SIG Auth](/sig-auth) +* [SIG CLI](/sig-cli) +* [SIG Node](/sig-node) +* [SIG Scheduling](/sig-scheduling) + + + +## Organizers + +* Adrian Reber (**[@adrianreber](https://github.com/adrianreber)**), Red Hat +* Peter Hunt (**[@haircommander](https://github.com/haircommander)**), Red Hat +* Radostin Stoyanov (**[@rst0git](https://github.com/rst0git)**), University of Oxford +* Viktória Spišaková (**[@viktoriaas](https://github.com/viktoriaas)**), Masaryk University + +## Contact +- Slack: [#wg-checkpoint-restore](https://kubernetes.slack.com/messages/wg-checkpoint-restore) +- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore) +- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore) + + + diff --git a/wg-checkpoint-restore/charter.md b/wg-checkpoint-restore/charter.md new file mode 100644 index 00000000000..f777899ad11 --- /dev/null +++ b/wg-checkpoint-restore/charter.md @@ -0,0 +1,91 @@ + +# WG Checkpoint Restore Charter + +This charter adheres to the conventions described in the [Kubernetes Charter README] and uses +the Roles and Organization Management outlined in [sig-governance]. + +## Scope + +The Checkpoint/Restore Working Group aims to solve the problem of transparently +checkpointing and restoring workloads in Kubernetes, functionality that has been +discussed for over five years. The group will deliver the design and implementation of +Checkpoint/Restore functionality in Kubernetes and will serve as a central hub for +community information and discussion. This initiative addresses a wide range of +problems, including fault tolerance, improved resource utilization, and +accelerated application startup times. + +### In scope + +- Identify core Kubernetes checkpoint/restore use cases (e.g., live migration, + fault tolerance, debugging, snapshotting) and gather stakeholder requirements. 
+- Investigate and propose Kubernetes APIs for checkpoint/restore operations. +- Work with SIGs for the best integration of checkpoint/restore functionality + and APIs. +- Provide guidance for developers on checkpoint-friendly app design and + recommendations for operators on feature management. +- Work closely with relevant upstream projects (CRI-O, containerd, CRIU, gVisor) + for alignment and integration. +- Revisit the existing implementations to find and remedy possible inefficiencies. + One example is the existing checkpoint archive format, which has already been + identified as being a major source of slowdown. + +### Out of scope + +- Not focused on general OS-level checkpointing outside Kubernetes + pods/containers. +- Will not dictate internal application checkpointing logic; focuses on + Kubernetes platform orchestration of container/pod state. + +## Stakeholders + +Stakeholders in this working group span multiple SIGs that own parts of the +code in core Kubernetes components and addons. + +- SIG CLI +- SIG API Machinery +- SIG Node +- SIG Scheduling +- SIG Auth +- SIG Apps + +## Deliverables + +The list of deliverables includes the following high-level features: + +- In the early stage, we mainly want to offer a well-defined location for the + community to find information, ask questions, and discuss the next steps of + enabling checkpoint and restore in Kubernetes. + +Later: + +- Ability to checkpoint and restore a container using kubectl +- Ability to checkpoint and restore a pod using kubectl +- Integration of container/pod checkpointing in scheduling decisions + +## Roles and Organization Management + +This WG adheres to the Roles and Organization Management outlined in [wg-governance] +and opts in to updates and modifications to [wg-governance]. 
+ +[wg-governance]: /committee-steering/governance/wg-governance.md + +Additionally, the WG commits to: + +- maintain a solid communication line between the Kubernetes groups and the + wider CNCF community +- submit a proposal to the KubeCon/CloudNativeCon maintainers track + +## Timelines and Disbanding + +As a first mandate, the WG will define a roadmap and tasks in the first quarter +of operation. + +After that, the WG will distribute the different tasks to different community +members to define possible APIs and how they can be integrated into Kubernetes. + +Achieving the aforementioned deliverables, also mentioned in the `In scope` +section, will allow us to decide when to disband this WG. There is no +expectation that the Working Group will be converted into a SIG long term. + +[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md +[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md