|
2 | 2 |
|
3 | 3 | ## Current initiatives and Project Health
|
4 | 4 |
|
5 |
| - |
6 | 5 | 1. What work did the WG do this year that should be highlighted?
|
7 | 6 |
|
8 |
| -<!-- |
9 |
| - Some example items that might be worth highlighting: |
10 |
| - - artifacts |
11 |
| - - reports |
12 |
| - - white papers |
13 |
| - - work not tracked in KEPs |
14 |
| ---> |
| 7 | +See [2024 Highlights](#2024-highlights). |
15 | 8 |
|
16 | 9 | 2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
|
17 | 10 |
|
| 11 | + Yes, JobSet has 1 active owner at the moment. |
| 12 | + |
| 13 | +### 2024 Highlights |
| 14 | + |
| 15 | +We will break down our highlights into Sub Projects, KEPs, talks, and community adoption. |
| 16 | + |
| 17 | +#### Sub Projects |
| 18 | + |
| 19 | +##### Kueue |
| 20 | + |
| 21 | +Kueue has had 5 minor releases in 2024. |
| 22 | + |
| 23 | +- [Release 0.6](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.6.0) |
| 24 | + |
| 25 | +- [Release 0.7](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.7.0) |
| 26 | + |
| 27 | +- [Release 0.8](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.8.0) |
| 28 | + |
| 29 | +- [Release 0.9](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.9.0) |
| 30 | + |
| 31 | +- [Release 0.10](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.10.0) |
| 32 | + |
| 33 | +In 2024, the Kueue community would like to highlight Topology aware scheduling, MultiKueue, Kueue Dashboard, KueueCtl, Deployment/StatefulSet integration for serving, and Fair sharing. |
| 34 | + |
| 35 | +[Topology aware scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/) facilitates scheduling workloads in a way that takes data center topology into account. |
| 36 | +Workloads benefit from using interconnects that are physically close together. |
| 37 | + |
| 38 | +[MultiKueue](https://kueue.sigs.k8s.io/docs/concepts/multikueue/) provides a way of dispatching batch workloads to worker clusters. |
| 39 | +Kueue provides multicluster dispatching for popular batch workloads such as Ray, Job, Kubeflow and JobSet. |
| 40 | +This feature was promoted to beta in 0.9. |
| 41 | + |
| 42 | +[Kueue Dashboard](https://github.com/kubernetes-sigs/kueue/tree/release-0.10/cmd/experimental/kueue-viz) has been a popular request for Kueue. |
| 43 | +Users want a visual representation of queueing, and we are happy to announce that a dashboard has been created for Kueue. |
| 44 | +It was merged into Kueue in late 2024, and a big focus of 2025 will be hardening it for production. |
| 45 | + |
| 46 | +[KueueCtl](https://kueue.sigs.k8s.io/docs/reference/kubectl-kueue/) provides a CLI for creating Kueue objects. |
| 47 | +The plugin is hosted in Krew and is easily installed as a kubectl plugin. |
| 48 | + |
| 49 | +[Deployment](https://kueue.sigs.k8s.io/docs/tasks/run/deployment/) and [StatefulSet](https://kueue.sigs.k8s.io/docs/tasks/run/statefulset/) integration makes it possible to use Kueue for serving workloads. Serving creates a need for sharing and preemption among model servers that may use accelerators. Kueue integrates with the popular methods of deploying services (Deployment/StatefulSet). |
| 50 | + |
| 51 | +##### JobSet |
| 52 | + |
| 53 | +JobSet has had 4 minor releases in 2024. |
| 54 | + |
| 55 | +- [Release 0.4](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.4.0) |
| 56 | + |
| 57 | +- [Release 0.5](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.5.0) |
| 58 | + |
| 59 | +- [Release 0.6](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.6.0) |
| 60 | + |
| 61 | +- [Release 0.7](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.7.0) |
| 62 | + |
| 63 | +A major achievement has been the adoption of JobSet |
| 64 | +as a component of [Kubeflow Trainer](https://github.com/kubeflow/trainer) V2, the next generation of the Kubeflow Training Operator project. |
| 65 | + |
| 66 | +[Metaflow](https://github.com/Netflix/metaflow/pull/1804) has adopted the use of JobSet for distributed ML training. |
| 67 | + |
| 68 | +##### KJob |
| 69 | + |
| 70 | +[KJob](https://github.com/kubernetes-sigs/kjob) was started to provide a CLI-friendly way for users to submit batch jobs. |
| 71 | +The HPC/ML community tends to prefer a CLI over YAML, so the focus was on providing a templated solution for submitting batch jobs. |
| 72 | +Another focus of this project is to provide a smooth transition for Slurm users. |
| 73 | + |
| 74 | +#### KEPs |
| 75 | + |
| 76 | +WG-Batch provided a series of Kubernetes enhancements that improved the experience of running batch workloads on Kubernetes. In 2024, this group proposed or implemented the following KEPs. |
| 77 | + |
| 78 | +- [Job Managed By](https://github.com/kubernetes/enhancements/issues/4368) |
| 79 | + - Promoted to beta. |
| 80 | + |
| 81 | +- [Job Success Policy](https://github.com/kubernetes/enhancements/issues/3998) |
| 82 | + - Promoted to beta. |
| 83 | + |
| 84 | +- [Elastic Indexed Jobs](https://github.com/kubernetes/enhancements/issues/3715) |
| 85 | + - Promoted to stable. |
| 86 | + |
| 87 | +- [Pod Failure Policy](https://github.com/kubernetes/enhancements/issues/3329) |
| 88 | + - Promoted to stable. |
| 89 | + |
| 90 | +- [Pod Index Label](https://github.com/kubernetes/enhancements/issues/4017) |
| 91 | + - Promoted to stable. |
| 92 | + |
| 93 | +#### Talks |
| 94 | + |
| 95 | +- Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet |
| 96 | + - Speakers: Andrey Velichkevich and Yuki Iwai |
| 97 | + - KubeCon NA, Salt Lake City |
| 98 | + - [Recording](https://www.youtube.com/watch?v=Lgy4ir1AhYw) |
| 99 | + |
| 100 | +- WG-Batch Update at KubeCon |
| 101 | + - Speakers: Kevin Hannon and Marcin Wielgus |
| 102 | + - KubeCon NA, Salt Lake City |
| 103 | + - [Recording](https://www.youtube.com/watch?v=C2ABOEzZTWg&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=283&pp=iAQB) |
| 104 | + |
| 105 | +- Keynote: MultiCluster Batch Jobs Dispatching with Kueue at CERN |
| 106 | + - Speakers: Ricardo Rocha and Marcin Wielgus |
| 107 | + - KubeCon NA, Salt Lake City |
| 108 | + - [Recording](https://www.youtube.com/watch?v=xMmskWIlktA&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=193&pp=iAQB) |
| 109 | + |
| 110 | +- Multitenancy and Fairness at Scale with Kueue: A Case Study |
| 111 | + - Speakers: Aldo Culquicondor and Rajat Phull |
| 112 | + - KubeCon NA, Salt Lake City |
| 113 | + - [Recording](https://www.youtube.com/watch?v=GYiuTQCvTx8&list=PLj6h78yzYM2Mvqk_mNejD7kbe3tldxxsr&index=5&pp=iAQB) |
| 114 | + |
| 115 | +- Advanced Resource Management for Running AI/ML Workloads with Kueue |
| 116 | + - Speakers: Michał Woźniak and Yuki Iwai |
| 117 | + - KubeCon EU, Paris |
| 118 | + - [Recording](https://www.youtube.com/watch?v=6k_8Go3u8Qk) |
| 119 | + |
| 120 | +- Scale Your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler |
| 121 | + - Speakers: Antonin Stefanutti and Anish Asthana |
| 122 | + - KubeCon EU, Paris |
| 123 | + - [Recording](https://www.youtube.com/watch?v=Ij5EAnuF-jk&list=PLj6h78yzYM2PWGv34W6w5ssq1b1meRmY7&index=15&pp=iAQB) |
| 124 | + |
| 125 | +- WG-Batch Update |
| 126 | + - Speakers: Michał Woźniak and Yuki Iwai |
| 127 | + - KubeCon EU, Paris |
| 128 | + - [Recording](https://www.youtube.com/watch?v=2D2QSzUnS0M&list=PLj6h78yzYM2N8nw1YcqqKveySH6_0VnI0&index=84&pp=iAQB) |
| 129 | + |
| 130 | +- How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads |
| 131 | + - Speaker: Kevin Hannon |
| 132 | + - FOSDEM 2024 |
| 133 | + - [Recording](https://live.fosdem.org/watch/ua2118) |
| 134 | + |
| 135 | +#### Community adoption |
| 136 | + |
| 137 | +- [Kubeflow Trainer v2](https://github.com/kubeflow/trainer/tree/62e958fa8c07ae73be0b10a30e1fb5e4c3d0e8f3/docs/proposals/2170-kubeflow-training-v2) will use JobSet as a critical component for distributed training and LLM fine-tuning. |
| 138 | +- [Metaflow supports JobSet](https://github.com/Netflix/metaflow/pull/1804) for distributed training. |
| 139 | + |
| 140 | +- Airflow has built an [integration](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/_api/airflow/providers/cncf/kubernetes/operators/kueue/index.html) with Kueue. |
| 141 | + |
18 | 142 | ## Operational
|
19 | 143 |
|
20 | 144 | Operational tasks in [wg-governance.md]:
|
21 | 145 |
|
22 |
| -- [ ] [README.md] reviewed for accuracy and updated if needed |
23 |
| -- [ ] WG leaders in [sigs.yaml] are accurate and active, and updated if needed |
24 |
| -- [ ] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed |
25 |
| -- [ ] Updates provided to sponsoring SIGs in 2024 |
26 |
| - - [$sig-name](https://git.k8s.io/community/$sig-id/) |
27 |
| - - links to email, meeting notes, slides, or recordings, etc |
28 |
| - - [$sig-name](https://git.k8s.io/community/$sig-id/) |
29 |
| - - links to email, meeting notes, slides, or recordings, etc |
30 |
| - - |
31 |
| - |
| 146 | +- [x] [README.md] reviewed for accuracy and updated if needed |
| 147 | +- [x] WG leaders in [sigs.yaml] are accurate and active, and updated if needed |
| 148 | +- [x] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed |
| 149 | +- [x] Updates provided to sponsoring SIGs in 2024 |
| 150 | + - [WG-Batch Updates at KubeCon EU 2024](https://www.youtube.com/watch?v=2D2QSzUnS0M&list=PLj6h78yzYM2N8nw1YcqqKveySH6_0VnI0&index=84&pp=iAQB) |
| 151 | + - [WG-Batch Updates at KubeCon NA 2024](https://www.youtube.com/watch?v=C2ABOEzZTWg&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=283&pp=iAQB) |
32 | 152 | [wg-governance.md]: https://git.k8s.io/community/committee-steering/governance/wg-governance.md
|
33 | 153 | [README.md]: https://git.k8s.io/community/wg-batch/README.md
|
34 | 154 | [sigs.yaml]: https://git.k8s.io/community/sigs.yaml
|