
Commit 549c6b3

annual report for wg-batch 2024
1 parent 45b5baa commit 549c6b3

1 file changed

+117
-12
lines changed

wg-batch/annual-report-2024.md

@@ -2,27 +2,132 @@
22

33
## Current initiatives and Project Health
44

5-
65
1. What work did the WG do this year that should be highlighted?
76

8-
<!--
9-
Some example items that might be worth highlighting:
10-
- artifacts
11-
- reports
12-
- white papers
13-
- work not tracked in KEPs
14-
-->
7+
See [2024 Highlights](#2024-highlights).
158

169
2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
1710

11+
None.
12+
13+
### 2024 Highlights
14+
15+
We break our highlights down into sub projects, KEPs, talks, and community adoption.
16+
17+
#### Sub Projects
18+
19+
##### Kueue
20+
21+
Kueue has had 5 releases in 2024.
22+
23+
- [Release 0.6](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.6.0)
24+
25+
- [Release 0.7](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.7.0)
26+
27+
- [Release 0.8](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.8.0)
28+
29+
- [Release 0.9](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.9.0)
30+
31+
- [Release 0.10](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.10.0)
32+
33+
For 2024, the Kueue community would like to highlight topology-aware scheduling, MultiKueue, the Kueue Dashboard, KueueCtrl, Deployment/StatefulSet integration for serving, and fair sharing.
34+
35+
Topology-aware scheduling facilitates scheduling workloads in a way that takes data center topology into account. Workloads benefit from using interconnects that are physically close together.
36+
37+
MultiKueue provides a way of dispatching batch workloads to worker clusters. It supports multicluster dispatching for popular batch workloads such as Ray, Job, Kubeflow, and JobSet. This feature was promoted to beta in 0.9.
38+
39+
A dashboard has been a popular request for Kueue. Users want a visual representation of queueing, and we are happy to announce that a dashboard has been created for Kueue. It went into Kueue in late 2024, and a big focus of 2025 will be to harden it for production.
40+
41+
KueueCtrl provides a CLI for creating Kueue objects. The plugin is hosted in krew and is easily installed as a kubectl plugin.
42+
43+
Deployment/StatefulSet integration makes it possible to use Kueue for serving workloads. Serving creates a need for sharing and preemption among model servers that may use accelerators. Kueue integrates with the popular methods of deploying services (Deployment and StatefulSet).
44+
45+
##### JobSet
46+
47+
JobSet has had 4 releases in 2024.
48+
49+
- [Release 0.4](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.4.0)
50+
51+
- [Release 0.5](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.5.0)
52+
53+
- [Release 0.6](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.6.0)
54+
55+
- [Release 0.7](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.7.0)
56+
57+
A major achievement has been the adoption of JobSet as a component of Kubeflow Training Operator v2.
58+
The Kubeflow community and the batch community have worked together to implement the features needed for this integration.
59+
60+
[Metaflow](https://github.com/Netflix/metaflow/pull/1804) has adopted the use of JobSet for distributed ML training.
61+
62+
##### KJob
63+
64+
[KJob](https://github.com/kubernetes-sigs/kjob?tab=readme-ov-file#kjob) has been started to provide a CLI-friendly way for users to submit batch jobs.
65+
The HPC/ML community tends to prefer a CLI over YAML, so the focus was to provide a templated solution for submitting batch jobs.
66+
Another focus of this project is to provide a smooth transition for Slurm users.
67+
68+
#### KEPs
69+
70+
WG-Batch delivered a series of Kubernetes enhancements that improve the experience of running batch workloads on Kubernetes. In 2024, the group proposed or implemented the following KEPs.
71+
72+
- [Job Managed By](https://github.com/kubernetes/enhancements/issues/4368)
73+
- Promoted to beta in 2024.
74+
75+
- [Job Success Policy](https://github.com/kubernetes/enhancements/issues/3998)
76+
- Promoted to beta.
77+
78+
- [Elastic Indexed Jobs](https://github.com/kubernetes/enhancements/issues/3715)
79+
- Promoted to stable.
80+
81+
- [Pod Failure Policy](https://github.com/kubernetes/enhancements/issues/3329)
82+
- Promoted to stable.
83+
84+
- [Pod Index Label](https://github.com/kubernetes/enhancements/issues/4017)
85+
- Promoted to stable.
86+
87+
#### Talks
88+
89+
- WG-Batch Update at KubeCon NA 2024
90+
- Authors: Kevin Hannon and Marcin Wielgus
91+
92+
- Keynote: MultiCluster Batch Jobs Dispatching with Kueue at CERN
93+
- Authors: Ricardo Rocha and Marcin Wielgus
94+
- KubeCon NA 2024
95+
96+
- Multitenancy and Fairness at Scale with Kueue: A Case Study
97+
- Authors: Aldo Culquicondor & Rajat Phull
98+
- KubeCon NA 2024
99+
100+
- Advanced Resource Management for Running AI/ML Workloads with Kueue
101+
- Authors: Michał Woźniak & Yuki Iwai
102+
- KubeCon EU 2024
103+
104+
- Scale Your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler
105+
- Authors: Antonin Stefanutti & Anish Asthana
106+
- KubeCon EU, March, Paris
107+
108+
- WG-Batch Update at KubeCon EU 2024
109+
- Authors: Marcin Wielgus
110+
111+
- How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads
112+
- Authors: Kevin Hannon
113+
- FOSDEM 2024
114+
115+
#### Community adoption
116+
117+
- [Kubeflow Training Operator v2](https://github.com/kubeflow/training-operator/blob/0c30f5cd306611f061b6dd529d3c7b7981a7d27c/docs/proposals/2170-kubeflow-training-v2/README.md#kep-2170-kubeflow-training-v2-api) will use JobSet as a critical component for training and fine-tuning.
118+
119+
- [Metaflow supports JobSet](https://github.com/Netflix/metaflow/pull/1804) for distributed training.
120+
121+
- Airflow has built an [integration](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/_api/airflow/providers/cncf/kubernetes/operators/kueue/index.html) with Kueue.
122+
18123
## Operational
19124

20125
Operational tasks in [wg-governance.md]:
21126

22-
- [ ] [README.md] reviewed for accuracy and updated if needed
23-
- [ ] WG leaders in [sigs.yaml] are accurate and active, and updated if needed
24-
- [ ] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed
25-
- [ ] Updates provided to sponsoring SIGs in 2024
127+
- [x] [README.md] reviewed for accuracy and updated if needed
128+
- [x] WG leaders in [sigs.yaml] are accurate and active, and updated if needed
129+
- [x] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed
130+
- [ ] Updates provided to sponsoring SIGs in 2024
26131
- [$sig-name](https://git.k8s.io/community/$sig-id/)
27132
- links to email, meeting notes, slides, or recordings, etc
28133
- [$sig-name](https://git.k8s.io/community/$sig-id/)
