Commit ac49e23

Merge pull request #8287 from kannon92/wg-batch-report-2024
annual report for wg-batch 2024
2 parents 818f0a6 + d2777b0 commit ac49e23

1 file changed

+138
-18
lines changed

wg-batch/annual-report-2024.md

@@ -2,33 +2,153 @@
22

33
## Current initiatives and Project Health
44

5-
65
1. What work did the WG do this year that should be highlighted?
76

8-
<!--
9-
Some example items that might be worth highlighting:
10-
- artifacts
11-
- reports
12-
- white papers
13-
- work not tracked in KEPs
14-
-->
7+
See [2024 Highlights](#2024-highlights).
158

169
2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
1710

11+
Yes, JobSet has 1 active owner at the moment.
12+
13+
### 2024 Highlights
14+
15+
We will break down our highlights into Sub Projects, KEPs, talks, and community adoption.
16+
17+
#### Sub Projects
18+
19+
##### Kueue
20+
21+
Kueue had 5 minor releases in 2024.
22+
23+
- [Release 0.6](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.6.0)
24+
25+
- [Release 0.7](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.7.0)
26+
27+
- [Release 0.8](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.8.0)
28+
29+
- [Release 0.9](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.9.0)
30+
31+
- [Release 0.10](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.10.0)
32+
33+
For 2024, the Kueue community would like to highlight topology-aware scheduling, MultiKueue, the Kueue dashboard, KueueCtl, Deployment/StatefulSet integration for serving, and fair sharing.
34+
35+
[Topology aware scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/) takes data center topology into account when scheduling workloads.
36+
Workloads benefit from using interconnects that are physically close together.
37+
38+
[MultiKueue](https://kueue.sigs.k8s.io/docs/concepts/multikueue/) provides a way of dispatching batch workloads to worker clusters.
39+
Kueue provides multicluster dispatching for popular batch workloads such as Ray, Job, Kubeflow and JobSet.
40+
This feature was promoted to beta in Kueue 0.9.
41+
42+
A [Kueue dashboard](https://github.com/kubernetes-sigs/kueue/tree/release-0.10/cmd/experimental/kueue-viz) has been a popular request.
43+
Users want a visual representation of queueing, and we are happy to announce that a dashboard has been created for Kueue.
44+
It landed in Kueue in late 2024, and a major focus for 2025 will be hardening it for production.
45+
46+
[KueueCtl](https://kueue.sigs.k8s.io/docs/reference/kubectl-kueue/) provides a CLI for creating Kueue objects.
47+
The plugin is hosted in Krew and is easily installed as a kubectl plugin.
48+
49+
[Deployment](https://kueue.sigs.k8s.io/docs/tasks/run/deployment/) and [StatefulSet](https://kueue.sigs.k8s.io/docs/tasks/run/statefulset/) integrations make it possible to use Kueue for serving workloads. Serving creates a need for sharing and preemption among model servers that may use accelerators, so Kueue integrates with the common ways of deploying services (Deployment and StatefulSet).
50+
51+
##### JobSet
52+
53+
JobSet had 4 minor releases in 2024.
54+
55+
- [Release 0.4](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.4.0)
56+
57+
- [Release 0.5](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.5.0)
58+
59+
- [Release 0.6](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.6.0)
60+
61+
- [Release 0.7](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.7.0)
62+
63+
A major achievement for JobSet has been its adoption
64+
as a component for [Kubeflow Trainer](https://github.com/kubeflow/trainer) V2, the next generation of the Kubeflow Training Operator project.
65+
66+
[Metaflow](https://github.com/Netflix/metaflow/pull/1804) has adopted the use of JobSet for distributed ML training.
67+
68+
##### KJob
69+
70+
[KJob](https://github.com/kubernetes-sigs/kjob) was started to provide a CLI-friendly way for users to submit batch jobs.
71+
The HPC/ML community tends to prefer a CLI over YAML, so the focus was to provide a templated solution for submitting batch jobs.
72+
Another focus of this project is to provide a smooth transition for Slurm users.
73+
74+
#### KEPs
75+
76+
WG-Batch contributed a series of Kubernetes enhancements that improve the experience of running batch workloads on Kubernetes. In 2024, the group proposed or implemented the following KEPs.
77+
78+
- [Job Managed By](https://github.com/kubernetes/enhancements/issues/4368)
79+
- Promoted to beta.
80+
81+
- [Job Success Policy](https://github.com/kubernetes/enhancements/issues/3998)
82+
- Promoted to beta.
83+
84+
- [Elastic Indexed Jobs](https://github.com/kubernetes/enhancements/issues/3715)
85+
- Promoted to stable.
86+
87+
- [Pod Failure Policy](https://github.com/kubernetes/enhancements/issues/3329)
88+
- Promoted to stable.
89+
90+
- [Pod Index Label](https://github.com/kubernetes/enhancements/issues/4017)
91+
- Promoted to stable.
92+
93+
#### Talks
94+
95+
- Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet
96+
- Speakers: Andrey Velichkevich and Yuki Iwai
97+
  - KubeCon NA, Salt Lake City
98+
- [Recording](https://www.youtube.com/watch?v=Lgy4ir1AhYw)
99+
100+
- WG-Batch Update at KubeCon
101+
- Speakers: Kevin Hannon and Marcin Wielgus
102+
  - KubeCon NA, Salt Lake City
103+
- [Recording](https://www.youtube.com/watch?v=C2ABOEzZTWg&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=283&pp=iAQB)
104+
105+
- Keynote: MultiCluster Batch Jobs Dispatching with Kueue at CERN
106+
- Speakers: Ricardo Rocha and Marcin Wielgus
107+
  - KubeCon NA, Salt Lake City
108+
- [Recording](https://www.youtube.com/watch?v=xMmskWIlktA&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=193&pp=iAQB)
109+
110+
- Multitenancy and Fairness at Scale with Kueue: A Case Study
111+
- Speakers: Aldo Culquicondor and Rajat Phull
112+
  - KubeCon NA, Salt Lake City
113+
- [Recording](https://www.youtube.com/watch?v=GYiuTQCvTx8&list=PLj6h78yzYM2Mvqk_mNejD7kbe3tldxxsr&index=5&pp=iAQB)
114+
115+
- Advanced Resource Management for Running AI/ML Workloads with Kueue
116+
- Speakers: Michał Woźniak and Yuki Iwai
117+
  - KubeCon EU, Paris
118+
- [Recording](https://www.youtube.com/watch?v=6k_8Go3u8Qk)
119+
120+
- Scale Your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler
121+
  - Speakers: Antonin Stefanutti and Anish Asthana
122+
- KubeCon EU, Paris
123+
- [Recording](https://www.youtube.com/watch?v=Ij5EAnuF-jk&list=PLj6h78yzYM2PWGv34W6w5ssq1b1meRmY7&index=15&pp=iAQB)
124+
125+
- WG-Batch Update
126+
  - Speakers: Michał Woźniak and Yuki Iwai
127+
- KubeCon EU, Paris
128+
- [Recording](https://www.youtube.com/watch?v=2D2QSzUnS0M&list=PLj6h78yzYM2N8nw1YcqqKveySH6_0VnI0&index=84&pp=iAQB)
129+
130+
- How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads
131+
  - Speaker: Kevin Hannon
132+
- FOSDEM 2024
133+
- [Recording](https://live.fosdem.org/watch/ua2118)
134+
135+
#### Community adoption
136+
137+
- [Kubeflow Trainer v2](https://github.com/kubeflow/trainer/tree/62e958fa8c07ae73be0b10a30e1fb5e4c3d0e8f3/docs/proposals/2170-kubeflow-training-v2) will use JobSet as a critical component for distributed training and LLM fine-tuning.
138+
- [Metaflow supports JobSet](https://github.com/Netflix/metaflow/pull/1804) for distributed training.
139+
140+
- Airflow has built an [integration](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/_api/airflow/providers/cncf/kubernetes/operators/kueue/index.html) with Kueue.
141+
18142
## Operational
19143

20144
Operational tasks in [wg-governance.md]:
21145

22-
- [ ] [README.md] reviewed for accuracy and updated if needed
23-
- [ ] WG leaders in [sigs.yaml] are accurate and active, and updated if needed
24-
- [ ] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed
25-
- [ ] Updates provided to sponsoring SIGs in 2024
26-
- [$sig-name](https://git.k8s.io/community/$sig-id/)
27-
- links to email, meeting notes, slides, or recordings, etc
28-
- [$sig-name](https://git.k8s.io/community/$sig-id/)
29-
- links to email, meeting notes, slides, or recordings, etc
30-
-
31-
146+
- [x] [README.md] reviewed for accuracy and updated if needed
147+
- [x] WG leaders in [sigs.yaml] are accurate and active, and updated if needed
148+
- [x] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed
149+
- [x] Updates provided to sponsoring SIGs in 2024
150+
  - [WG-Batch Updates at KubeCon EU 2024](https://www.youtube.com/watch?v=2D2QSzUnS0M&list=PLj6h78yzYM2N8nw1YcqqKveySH6_0VnI0&index=84&pp=iAQB)
151+
  - [WG-Batch Updates at KubeCon NA 2024](https://www.youtube.com/watch?v=C2ABOEzZTWg&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=283&pp=iAQB)
32152
[wg-governance.md]: https://git.k8s.io/community/committee-steering/governance/wg-governance.md
33153
[README.md]: https://git.k8s.io/community/wg-batch/README.md
34154
[sigs.yaml]: https://git.k8s.io/community/sigs.yaml
