|
2 | 2 |
|
3 | 3 | ## Current initiatives and Project Health
|
4 | 4 |
|
5 |
| - |
6 | 5 | 1. What work did the WG do this year that should be highlighted?
|
7 | 6 |
|
8 |
| -<!-- |
9 |
| - Some example items that might be worth highlighting: |
10 |
| - - artifacts |
11 |
| - - reports |
12 |
| - - white papers |
13 |
| - - work not tracked in KEPs |
14 |
| ---> |
| 7 | +See [2024 Highlights](#2024-highlights). |
15 | 8 |
|
16 | 9 | 2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
|
17 | 10 |
|
| 11 | + None. |
| 12 | + |
| 13 | +### 2024 Highlights |
| 14 | + |
| 15 | +We break down our highlights into subprojects, KEPs, talks, and community adoption. |
| 16 | + |
| 17 | +#### Sub Projects |
| 18 | + |
| 19 | +##### Kueue |
| 20 | + |
| 21 | +Kueue has had 5 releases in 2024. |
| 22 | + |
| 23 | +- [Release 0.6](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.6.0) |
| 24 | + |
| 25 | +- [Release 0.7](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.7.0) |
| 26 | + |
| 27 | +- [Release 0.8](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.8.0) |
| 28 | + |
| 29 | +- [Release 0.9](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.9.0) |
| 30 | + |
| 31 | +- [Release 0.10](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.10.0) |
| 32 | + |
| 33 | +In 2024, the Kueue community would like to highlight topology-aware scheduling, MultiKueue, the Kueue dashboard, KueueCtrl, Deployment/StatefulSet integration for serving, and fair sharing. |
| 34 | + |
| 35 | +Topology-aware scheduling takes data center topology into account when scheduling workloads. Workloads benefit from using interconnects that are physically close together. |
| 36 | + |
| 37 | +MultiKueue dispatches batch workloads to worker clusters. It supports multicluster dispatching for popular batch workloads such as Ray, Job, Kubeflow, and JobSet. This feature was promoted to beta in 0.9. |
| 38 | + |
| 39 | +A Kueue dashboard has been a popular request. Users would like a visual representation of queueing, and we are happy to announce that a dashboard has been created for Kueue. It was added to Kueue in late 2024, and a big focus of 2025 will be hardening it for production. |
| 40 | + |
| 41 | +KueueCtrl provides a CLI for creating Kueue objects. The plugin is hosted in Krew and is easily installed as a plugin. |
| 42 | + |
| 43 | +Deployment/StatefulSet integration lets Kueue be used for serving workloads. Serving creates a need for sharing and preemption of model servers that may use accelerators. Kueue integrates with popular methods of deploying services (Deployment/StatefulSet). |
| 44 | + |
| 45 | +##### JobSet |
| 46 | + |
| 47 | +JobSet has had 4 releases in 2024. |
| 48 | + |
| 49 | +- [Release 0.4](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.4.0) |
| 50 | + |
| 51 | +- [Release 0.5](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.5.0) |
| 52 | + |
| 53 | +- [Release 0.6](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.6.0) |
| 54 | + |
| 55 | +- [Release 0.7](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.7.0) |
| 56 | + |
| 57 | +A major achievement has been the adoption of JobSet as a component of Kubeflow Training Operator V2. |
| 58 | +The Kubeflow and batch communities collaborated to implement the features needed for this integration. |
| 59 | + |
| 60 | +[Metaflow](https://github.com/Netflix/metaflow/pull/1804) has adopted the use of JobSet for distributed ML training. |
| 61 | + |
| 62 | +##### KJob |
| 63 | + |
| 64 | +[KJob](https://github.com/kubernetes-sigs/kjob?tab=readme-ov-file#kjob) was started to provide a CLI-friendly way for users to submit batch jobs. |
| 65 | +The HPC/ML community tends to prefer a CLI over YAML, so the focus was to provide a templated solution for submitting batch jobs. |
| 66 | +Another focus of this project is to provide a smooth transition for Slurm users. |
| 67 | + |
| 68 | +#### KEPs |
| 69 | + |
| 70 | +WG-Batch provided a series of Kubernetes enhancements that improved the experience of batch workloads on Kubernetes. In 2024, this group proposed or implemented the following KEPs: |
| 71 | + |
| 72 | +- [Job Managed By](https://github.com/kubernetes/enhancements/issues/4368) |
| 73 | + - Promoted to beta. |
| 74 | + |
| 75 | +- [Job Success Policy](https://github.com/kubernetes/enhancements/issues/3998) |
| 76 | + - Promoted to beta. |
| 77 | + |
| 78 | +- [Elastic Indexed Jobs](https://github.com/kubernetes/enhancements/issues/3715) |
| 79 | + - Promoted to stable. |
| 80 | + |
| 81 | +- [Pod Failure Policy](https://github.com/kubernetes/enhancements/issues/3329) |
| 82 | + - Promoted to stable. |
| 83 | + |
| 84 | +- [Pod Index Label](https://github.com/kubernetes/enhancements/issues/4017) |
| 85 | + - Promoted to stable. |
| 86 | + |
| 87 | +#### Talks |
| 88 | + |
| 89 | +- WG-Batch Update at KubeCon NA 2024 |
| 90 | + - Authors: Kevin Hannon and Marcin Wielgus |
| 91 | + |
| 92 | +- Keynote: MultiCluster Batch Jobs Dispatching with Kueue at CERN |
| 93 | + - Authors: Ricardo Rocha and Marcin Wielgus |
| 94 | + - KubeCon NA 2024 |
| 95 | + |
| 96 | +- Multitenancy and Fairness at Scale with Kueue: A Case Study |
| 97 | + - Authors: Aldo Culquicondor & Rajat Phull |
| 98 | + - KubeCon NA 2024 |
| 99 | + |
| 100 | +- Advanced Resource Management for Running AI/ML Workloads with Kueue |
| 101 | + - Authors: Michał Woźniak & Yuki Iwai |
| 102 | + - KubeCon EU 2024 |
| 103 | + |
| 104 | +- Scale Your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler |
| 105 | + - Authors: Antonin Stefanutti & Anish Asthana |
| 106 | + - KubeCon EU, March, Paris |
| 107 | + |
| 108 | +- WG-Batch Update at KubeCon EU 2024 |
| 109 | + - Authors: Marcin Wielgus |
| 110 | + |
| 111 | +- How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads |
| 112 | + - Authors: Kevin Hannon |
| 113 | + - FOSDEM 2024 |
| 114 | + |
| 115 | +#### Community adoption |
| 116 | + |
| 117 | +- [Kubeflow Training Operator v2](https://github.com/kubeflow/training-operator/blob/0c30f5cd306611f061b6dd529d3c7b7981a7d27c/docs/proposals/2170-kubeflow-training-v2/README.md#kep-2170-kubeflow-training-v2-api) will use JobSet as a critical component for training and fine-tuning. |
| 118 | + |
| 119 | +- [Metaflow supports JobSet](https://github.com/Netflix/metaflow/pull/1804) for distributed training. |
| 120 | + |
| 121 | +- Airflow has built an [integration](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/_api/airflow/providers/cncf/kubernetes/operators/kueue/index.html) with Kueue. |
| 122 | + |
18 | 123 | ## Operational
|
19 | 124 |
|
20 | 125 | Operational tasks in [wg-governance.md]:
|
21 | 126 |
|
22 |
| -- [ ] [README.md] reviewed for accuracy and updated if needed |
23 |
| -- [ ] WG leaders in [sigs.yaml] are accurate and active, and updated if needed |
24 |
| -- [ ] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed |
25 |
| -- [ ] Updates provided to sponsoring SIGs in 2024 |
| 127 | +- [x] [README.md] reviewed for accuracy and updated if needed |
| 128 | +- [x] WG leaders in [sigs.yaml] are accurate and active, and updated if needed |
| 129 | +- [x] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed |
| 130 | +- [ ] Updates provided to sponsoring SIGs in 2024 |
26 | 131 | - [$sig-name](https://git.k8s.io/community/$sig-id/)
|
27 | 132 | - links to email, meeting notes, slides, or recordings, etc
|
28 | 133 | - [$sig-name](https://git.k8s.io/community/$sig-id/)
|
|