diff --git a/KUBEFLOW-STEERING-COMMITTEE.md b/KUBEFLOW-STEERING-COMMITTEE.md index c5aa476db..c5cb3c206 100644 --- a/KUBEFLOW-STEERING-COMMITTEE.md +++ b/KUBEFLOW-STEERING-COMMITTEE.md @@ -1,4 +1,5 @@ # Kubeflow Steering Committee + The Kubeflow Steering Committee (KSC) is the governing body of the Kubeflow project, providing decision-making and oversight pertaining to the Kubeflow project policies, sub-organizations, and financial planning, and defines the project values and structure. The governance of Kubeflow is an open, living document, and will continue to evolve as the community and project change. @@ -16,6 +17,7 @@ The governance of Kubeflow is an open, living document, and will continue to evo ## Committee Meetings KSC currently meets at least bi-weekly, or as-needed. Meetings are open to the public and held online, unless they pertain to sensitive or privileged matters. Examples of such matters are: + - Privacy related issues - Private emails to the committee - Code of conduct violations @@ -29,23 +31,64 @@ Questions and proposals for changes to governance are posted as issues in the ku ## Committee members -KSC is composed of 5 (five) members. They are elected according to the election policy [TODO: add link]. -Seats on the Steering Committee are held by an individual, not by their employer. +KSC is composed of 5 (five) members. They are elected according to [the election policy](proposals/kubeflow-steering-committee-election-proposal.md). +Seats on the Steering Committee are held by an individual, not by their employer. The current membership of the committee is (listed alphabetically by first name): -| Name | Organization | GitHub | Term Start | Term End | -|---------------------|--------------|----------------------------------------------------|------------|------------| -| Andrey Velichkevich | Apple | [andreyvelich](https://github.com/andreyvelich/) | 02/01/2024 | 02/01/2026 | -| Johnu George | Nutanix | [johnugeorge](https://github.com/johnugeorge/) | 02/01/2024 | 02/01/2026 | -| Josh Bottum | Independent | [jbottum](https://github.com/jbottum/) | 02/01/2024 | 02/01/2025 | -| James Wu | Google | [james-jwu](https://github.com/james-jwu/) | 02/01/2024 | 02/01/2025 | -| Yuan Tang | Red Hat | [terrytangyuan](https://github.com/terrytangyuan/) | 02/01/2024 | 02/01/2026 | - +| Name | Organization | GitHub | Term Start | Term End | +| ------------------- | ------------ | ---------------------------------------------------------------- | ---------- | ---------- | +| Andrey Velichkevich | Apple | [andreyvelich](https://github.com/andreyvelich/) | 02/01/2024 | 02/01/2026 | +| Francisco Arceo | Red Hat | [franciscojavierarceo](https://github.com/franciscojavierarceo/) | 02/01/2025 | 02/01/2027 | +| Johnu George | Nutanix | [johnugeorge](https://github.com/johnugeorge/) | 02/01/2024 | 02/01/2026 | +| Julius von Kohout | DHL | [juliusvonkohout](https://github.com/juliusvonkohout/) | 02/01/2025 | 02/01/2027 | +| Yuan Tang | Red Hat | [terrytangyuan](https://github.com/terrytangyuan/) | 02/01/2024 | 02/01/2026 | ## Emeritus Committee Members -[This section will be populated when there are retired committee members.] 
+| Name | Organization | GitHub | Term Start | Term End | +| ----------- | ------------ | ------------------------------------------ | ---------- | ---------- | +| Josh Bottum | Independent | [jbottum](https://github.com/jbottum/) | 02/01/2024 | 02/01/2025 | +| James Wu | Google | [james-jwu](https://github.com/james-jwu/) | 02/01/2024 | 02/01/2025 | + +## Ownership Transfer + +KSC members hold administrative ownership of Kubeflow assets. When new members of the KSC are elected, +a GitHub issue must be created to facilitate the transfer to the incoming members. + +GitHub issue name: + +``` +Transfer Ownership to KSC 2025 +``` + +GitHub issue content: + +- [ ] Update Kubeflow Steering Committee document with the new members and emeritus members. +- [ ] Archive the current Slack channel (e.g. `#archived-ksc-2024`) and create the new Slack channel (e.g. `kubeflow-steering-committee`). +- [ ] Schedule weekly calls with the new members. +- [ ] Update [admins for Kubeflow GitHub org](https://github.com/kubeflow/internal-acls/blob/master/github-orgs/kubeflow/org.yaml#L7). +- [ ] Update the [`kubeflow-steering-committee` GitHub team](https://github.com/kubeflow/internal-acls/blob/master/github-orgs/kubeflow/org.yaml). +- [ ] Update approvers for the following OWNERS files (e.g the past members should be moved to `emeritus_approvers`): + - `kubeflow/kubeflow` [OWNERS file](https://github.com/kubeflow/kubeflow/blob/master/OWNERS). + - `kubeflow/community` [OWNERS file](https://github.com/kubeflow/community/blob/master/OWNERS). + - `kubeflow/internal-acls` [OWNERS file](https://github.com/kubeflow/internal-acls/blob/master/OWNERS). + - `kubeflow/website` [OWNERS file](https://github.com/kubeflow/website/blob/master/OWNERS). + - `kubeflow/blog` [OWNERS file](https://github.com/kubeflow/blog/blob/master/OWNERS). +- [ ] Kubeflow GCP projects under `kubeflow.org` organization for ACLs and DNS management. + - Access for `kf-admin-cluster` GKE cluster in `kubeflow-admin` GCP project for the GitHub ACLs sync. + - Access for `kubeflow-dns` GCP project for the DNS management. +- [ ] Access for Kubeflow GKE cluster `kf-ci-v1` in `kubeflow-ci` GCP project (No Organization) + where Prow is running. +- [ ] Kubeflow [Google Group](https://groups.google.com/g/kubeflow-discuss). +- [ ] Update members for [KSC Google Group](https://groups.google.com/a/kubeflow.org/g/ksc). +- [ ] Access to Kubeflow `1password` account. +- [ ] Kubeflow social media resources. + - Kubeflow [LinkedIn](https://www.linkedin.com/company/kubeflow/) + - Kubeflow [X](https://x.com/kubeflow). + - Kubeflow [Bluesky](https://bsky.app/profile/kubefloworg.bsky.social). + - [Kubeflow Community](https://www.youtube.com/@KubeflowCommunity) YouTube channel. + - [Kubeflow](https://www.youtube.com/@Kubeflow) YouTube channel. ## Decision process @@ -54,6 +97,7 @@ The steering committee desires to always reach consensus. ### Normal decision process Decisions requiring a vote include: + - Issuing written policy - Amending existing written policy - Accepting, or removing a Kubeflow component @@ -70,6 +114,7 @@ Members of KSC may abstain from a vote. Abstaining members will only be consider ### Special decision process Issues that impacts the KSC governance requires a special decision process. Issues include: + - Changes to the KSC charter - KSC voting rules - Election rules @@ -77,12 +122,14 @@ Issues that impacts the KSC governance requires a special decision process. 
Issu The issue may pass with 70% of the members (rounded up) of the committee supporting it. One organization may cast 1 vote. Votes cast by members from the same organization are equally weighted. Example: + - If KSC is made up of employees from organizations A, A, B, C, D, each vote from organization A is weighted by a factor of 0.5. The total number of votes is 4, and 3 votes (70% rounded up) is required to pass a proposal. This rule is designed to remove organization A's ability to defeat a proposal that is supported by all other KSC members. -- Similarly, if KSC is made up of employees from organizations A, A, B, B, C, the total number of votes is 3, and 2.5 votes is required to pass a proposal. +- Similarly, if KSC is made up of employees from organizations A, A, B, B, C, the total number of votes is 3, and 2.5 votes is required to pass a proposal. ### Results The results of the decision process are recorded and made publicly available, unless they pertain to sensitive or privileged matters. The results will include: + - Description of the issue - Names of members who supported, opposed, and abstained from the vote. diff --git a/OWNERS b/OWNERS index 27ca78e1d..b786830cc 100644 --- a/OWNERS +++ b/OWNERS @@ -4,3 +4,9 @@ approvers: - juliusvonkohout - johnugeorge - terrytangyuan + + +emeritus_approvers: + - james-jwu + - jbottum + diff --git a/README.md b/README.md index 488e30072..e606f9be5 100644 --- a/README.md +++ b/README.md @@ -38,3 +38,7 @@ please reach out to ksc@kubeflow.org. * [proposals](https://github.com/kubeflow/community/tree/master/proposals): Kubeflow design proposals * [how-to](https://github.com/kubeflow/community/tree/master/how-to): for documenting community and other project processes + +## Legal + +The Linux Foundation® (TLF) has registered trademarks and uses trademarks. For a list of TLF trademarks, see [Trademark Usage](https://www.linuxfoundation.org/trademark-usage/). diff --git a/calendar/calendar.yaml b/calendar/calendar.yaml index 23f826a96..49b3ab523 100644 --- a/calendar/calendar.yaml +++ b/calendar/calendar.yaml @@ -139,9 +139,9 @@ - id: kf042 name: Kubeflow Spark Operator Meeting - date: 10/18/2024 - time: 4:00PM-5:00PM - frequency: every-4-weeks + date: 03/07/2025 + time: 8:00AM-9:00AM + frequency: bi-weekly video: https://zoom.us/j/93870602975?pwd=NWFNT2xrZU03alVTTXFBTEsvdDdMQT09 attendees: - email: kubeflow-discuss@googlegroups.com @@ -453,7 +453,6 @@ Zoom: Provided in meeting notes organizer: woop - - id: kf034 name: Kubeflow Security Team Call (US West/APAC) date: 12/19/2023 @@ -514,4 +513,4 @@ Join with Phone (USA): +1 669 900 6833 or +1 646 558 8656 International numbers: https://zoom.us/zoomconference?m=Os1EjlUlpb2_XUMaQ6dX1azqMK5CkfWH - organizer: thesuperzapper \ No newline at end of file + organizer: thesuperzapper diff --git a/how-to/join_kubeflow_ecosystem.md b/how-to/join_kubeflow_ecosystem.md new file mode 100644 index 000000000..a4b2678ea --- /dev/null +++ b/how-to/join_kubeflow_ecosystem.md @@ -0,0 +1,157 @@ +# Application for a Project to Join the Kubeflow Ecosystem + +Please see the [proposals/new-project-join-process.md](Documentation) to +better understand the full process for submitting a new project. +In short, copy this Application Template and populate the document. + + +## Changes to the application process +Changes to the application process charter may be proposed through a Pull Request +on this document by a Kubeflow community member. 
+ +Amendments are accepted following the Kubeflow Steering Committee's [Normal Decision Process](../KUBEFLOW-STEERING-COMMITTEE.md#normal-decision-process). + +Proposals and amendments to the application process are available for at +least a period of one week for comments and questions before a vote will occur. + +## CNCF Short Checklist + +- [ ] All project metadata and resources are vendor-neutral +- [ ] Governance structure +- [ ] Contributing guides +- [ ] Public list of adopters + + +## Background information + +1. Submitter Name + - + +1. Submitter’s relationship to project / title + - + +1. Project Name + - + +1. Why is this project is valuable to the Kubeflow Community? + - + +1. Why is it beneficial for this project to be a part of the Kubeflow Community? + - + +1. List of existing (and potential) integrations with Kubeflow Core components + - + +1. Short Description / Functionality + - + +1. Adoption + - + +1. License Agreement + - + +1. Part of an Open Source Foundation? (e.g., Apache, Liniux, CNCF, etc.) + - + +1. Vendor Neutrality + - + +1. Trademark transition + - + +1. CI/CD Infra Requirements + - + +1. Governance Structure + - + +1. Website + - + +1. GitHub repository + - + +1. 1st Release date + - + +1. Project Meeting Times + - + +1. Meeting Notes + - + +1. Installation Documentation + - + +1. Project Documentation + - + +1. Security Profile (CVE scanning, Pod Security Standards, Network Policies) + - + +1. Ownership / Legal Profile (license type, any potential issues for CNCF) + - + +1. Authorization, Isolation mechanisms + - + +1. Project Roadmap + - + +1. Other Information + - + +## Metrics + +- Number of Maintainers and their Affiliations +- Number of Releases in last 12 months +- Number of Contributors +- Number of Users +- Number of Forks +- Number of Stars +- Number of package/project installations/downloads + +## Kubeflow Checklist + +1. Overlap with existing Kubeflow projects + - [ ] Yes (If so please list them) + - [ ] No + +1. Manifest Integration + - [ ] Yes + - [ ] No + - [ ] Planned + +1. Commitment to Kubeflow Conformance Program + - [ ] Yes + - [ ] No + - [ ] Uncertain + +1. Installation + - [ ] Standalone/Self-contained Component + - [ ] Part of Manifests + - [ ] Part of Distributions + +1. Installation Documentation (Current Quality) + - [ ] Good + - [ ] Fair + - [ ] Part of Kubeflow + +1. CI/CD + - [ ] Yes + - [ ] No + +1. Release Process + - [ ] Automated + - [ ] Semi-automated + - [ ] Not Automated + +1. Kubeflow Website Documentation + - [ ] Yes + - [ ] No + +1. 
Blog/Social Media + - [ ] Yes + - [ ] No + diff --git a/proposals/mpi-operator-proposal.md b/proposals/139-mpi-operator/README.md similarity index 87% rename from proposals/mpi-operator-proposal.md rename to proposals/139-mpi-operator/README.md index 4073ce5ab..9bc57ce22 100644 --- a/proposals/mpi-operator-proposal.md +++ b/proposals/139-mpi-operator/README.md @@ -1,6 +1,7 @@ -**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* + +**Table of Contents** _generated with [DocToc](https://github.com/thlorenz/doctoc)_ - [Motivation](#motivation) - [Goals](#goals) @@ -17,11 +18,14 @@ _Status_ -* 2018-05-25 - Accepted -* 2018-06-02 - Implementation Started -* 2018-07-02 - v1alpha1 is released in 0.2 +- 2018-05-25 - Accepted +- 2018-06-02 - Implementation Started +- 2018-07-02 - v1alpha1 is released in 0.2 + +# KEP-139: MPI Operator ## Motivation + Kubeflow currently supports distributed training of TensorFlow models using [tf-operator](https://github.com/kubeflow/tf-operator), which relies on centralized parameter servers for coordination between workers. An alternative @@ -46,18 +50,20 @@ Kubernetes. By providing a CRD and a custom controller, we can make allreduce-style distributed training as simple as training on a single node. ## Goals -* Provide a common Custom Resource Definition (CRD) for defining a single-gpu, -multi-gpu, or multi-node training job. -* Implement a custom controller to manage the CRD, create dependent resources, -and reconcile the desired states. -* The cross-pod communication should be secure, without granting unnecessary -permissions to any pod. -* Though the initial version focuses on TensorFlow/Horovod, the approach can be -applied to other frameworks with MPI support, such as -[ChainerMN](https://github.com/chainer/chainermn) or -[CNTK](https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines). + +- Provide a common Custom Resource Definition (CRD) for defining a single-gpu, + multi-gpu, or multi-node training job. +- Implement a custom controller to manage the CRD, create dependent resources, + and reconcile the desired states. +- The cross-pod communication should be secure, without granting unnecessary + permissions to any pod. +- Though the initial version focuses on TensorFlow/Horovod, the approach can be + applied to other frameworks with MPI support, such as + [ChainerMN](https://github.com/chainer/chainermn) or + [CNTK](https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines). ## Non-Goals + In theory, this operator can be used to run arbitrary MPI jobs (e.g. computational fluid dynamics), but it's not our focus. @@ -75,11 +81,13 @@ It's also expected that the user would invoke `mpirun`, either through an SSH is not needed (or used). ### Custom Resource Definition + The custom resource can be defined in two ways, in terms of how GPU resources are specified. In the simple version, user specifies the total number of GPUs and the operator figures out how to allocate them efficiently: + ```yaml apiVersion: kubeflow.org/v1alpha1 kind: MPIJob @@ -96,6 +104,7 @@ spec: For more flexibility, user can choose to specify the resources explicitly (this example also shows the full `mpirun` command line): + ```yaml apiVersion: kubeflow.org/v1alpha1 kind: MPIJob @@ -125,6 +134,7 @@ spec: Either case would result in a worker `StatefulSet` and a launcher `Job`. 
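In addition to the worker `StatefulSet` and the launcher `Job`, the controller is also expected to generate a small `ConfigMap` that glues them together (see the Design section below). A hypothetical sketch is shown here; the object name, slot counts, and helper script are purely illustrative and not part of the proposed API:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${job-id}-config
data:
  # Hostfile consumed by mpirun; one line per worker pod with its GPU slots.
  hostfile: |
    ${job-id}-worker-0 slots=8
    ${job-id}-worker-1 slots=8
  # Helper used in place of ssh; delegates remote execution to kubectl exec.
  kubexec.sh: |
    #!/bin/sh
    set -x
    POD_NAME=$1
    shift
    exec kubectl exec ${POD_NAME} -- /bin/sh -c "$*"
```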
### Resulting Worker + ```yaml apiVersion: apps/v1 kind: StatefulSet @@ -164,6 +174,7 @@ spec: ``` ### Resulting Launcher + ```yaml apiVersion: batch/v1 kind: Job @@ -219,15 +230,17 @@ The initial handshake is done through `kubectl exec` instead of SSH. Logs can be accessed through the launcher pod. ## Design + We create a new custom controller that listens for `MPIJob` resources. When a -new `MPIJob` is created, the controller goes through the following *logical* +new `MPIJob` is created, the controller goes through the following _logical_ steps: + 1. Create a `ConfigMap` that contains: - * A helper shell script that can be used by `mpirun` in place of ssh. It - invokes `kubectl exec` for remote execution. - * A `hostfile` that lists the pods in the worker `StatefulSet` (in the - form of `${job-id}-worker-0`, `${job-id}-worker-1`, ...), and the - available slots (GPUs) in each pod. + - A helper shell script that can be used by `mpirun` in place of ssh. It + invokes `kubectl exec` for remote execution. + - A `hostfile` that lists the pods in the worker `StatefulSet` (in the + form of `${job-id}-worker-0`, `${job-id}-worker-1`, ...), and the + available slots (GPUs) in each pod. 1. Create the RBAC resources (`Role`, `ServiceAccount`, `RoleBinding`) to allow remote execution (`pods/exec`). 1. Create the worker `StatefulSet` that contains the desired replicas minus 1, @@ -240,7 +253,7 @@ steps: 1. After the launcher job finishes, set the `replicas` to 0 in the worker `StatefulSet`. -![MPI Operator](diagrams/mpi-operator.png) +![MPI Operator](mpi-operator.png) It may be desirable to schedule all the GPUs in a single Kubernetes resource (for example, for gang scheduling). We can add an option to the operator so that @@ -248,6 +261,7 @@ the worker `StatefulSet` does all the work, thus acquiring all the GPUs needed. The launcher job then becomes very light weight and no longer requires any GPUs. ## Alternatives Considered + One option is to add `allreduce` support to the existing tf-operator, but the modes of operation are quite different. Combining them may make the user experience unnecessarily complicated. A user would typically pick one approach diff --git a/proposals/diagrams/mpi-operator.png b/proposals/139-mpi-operator/mpi-operator.png similarity index 100% rename from proposals/diagrams/mpi-operator.png rename to proposals/139-mpi-operator/mpi-operator.png diff --git a/proposals/chainer-operator-proposal.md b/proposals/141-chainer-operator/README.md similarity index 75% rename from proposals/chainer-operator-proposal.md rename to proposals/141-chainer-operator/README.md index 9e5a618e4..b4381c4e9 100644 --- a/proposals/chainer-operator-proposal.md +++ b/proposals/141-chainer-operator/README.md @@ -1,6 +1,7 @@ -**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* + +**Table of Contents** _generated with [DocToc](https://github.com/thlorenz/doctoc)_ - [Motivation](#motivation) - [Goals](#goals) @@ -18,34 +19,40 @@ _Status_ -* 2018-06-01 - Accepted -* 2018-06-14 - Implementation Started +- 2018-06-01 - Accepted +- 2018-06-14 - Implementation Started + +# KEP-141: Chainer Operator ## Motivation -[Chainer][Chainer] is a Python-based, standalone open source framework for deep learning models. Chainer provides a flexible, intuitive, and high-performance means of implementing a full range of deep learning models, including state-of-the-art models such as recurrent neural networks and variational autoencoders. 
+[Chainer][Chainer] is a Python-based, standalone open source framework for deep learning models. Chainer provides a flexible, intuitive, and high-performance means of implementing a full range of deep learning models, including state-of-the-art models such as recurrent neural networks and variational autoencoders. -[ChainerMN][ChainerMN] is an additional package for [Chainer][Chainer], which enables multi-node distributed deep learning in a scalable, flexible and easy way. [ChainerMN][ChainerMN] currently supports MPI to initialize process groups or do collective communications(e.g. broadcast, all-reduce, etc.) among processes attending the distributed learning. They are now planning to extend the support to other communication backends (e.g. [gloo][gloo] or other custom ones). +[ChainerMN][ChainerMN] is an additional package for [Chainer][Chainer], which enables multi-node distributed deep learning in a scalable, flexible, and easy way. [ChainerMN][ChainerMN] currently supports MPI to initialize process groups and to perform collective communications (e.g. broadcast, all-reduce, etc.) among the processes taking part in distributed learning. The maintainers are planning to extend support to other communication backends (e.g. [gloo][gloo] or other custom ones). -Moreover, [Chainer][Chainer]/[ChainerMN][ChainerMN] achieved to [train ResNet-50 on ImageNet in 15 Minutes](https://arxiv.org/pdf/1711.04325.pdf) in the environment equipped with GPUs and InfiniBand FDR. [The recent research](https://chainer.org/general/2018/05/25/chainermn-v1-3.html) revealed that [ChainerMN][ChainerMN]'s latest feature (Double-buffering and All-Reduce in half-precision float values) enables users to expect _almost_ linear scalability without sacrificing model accuracy even in environments (e.g. AWS) which doesn't equip InfiniBand. +Moreover, [Chainer][Chainer]/[ChainerMN][ChainerMN] managed to [train ResNet-50 on ImageNet in 15 minutes](https://arxiv.org/pdf/1711.04325.pdf) in an environment equipped with GPUs and InfiniBand FDR. [Recent research](https://chainer.org/general/2018/05/25/chainermn-v1-3.html) showed that [ChainerMN][ChainerMN]'s latest features (double-buffering and all-reduce in half-precision float values) let users expect _almost_ linear scalability without sacrificing model accuracy, even in environments (e.g. AWS) that are not equipped with InfiniBand. However, [Chainer][Chainer]/[ChainerMN][ChainerMN] currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining how the operator should behave and adding it to Kubeflow. ## Goals -A Kubeflow user should be able to run training using [Chainer][Chainer]/[ChainerMN][ChainerMN] as easily as then can using Tensorflow/PyTorch. This proposal is centered around a Kubernetes operator for [Chainer]/[ChainerMN]. A user should be able to run both single node with [Chainer][Chainer] and distributed training jobs with [ChainerMN][ChainerMN]. + +A Kubeflow user should be able to run training using [Chainer][Chainer]/[ChainerMN][ChainerMN] as easily as they can using TensorFlow/PyTorch. This proposal is centered around a Kubernetes operator for [Chainer]/[ChainerMN]. A user should be able to run both single-node training jobs with [Chainer][Chainer] and distributed training jobs with [ChainerMN][ChainerMN].
This proposal defines the following: + - A Chainer operator - A way to deploy the operator with ksonnet - A single pod Chainer example - A distributed (multiple pods) Chainer example ## Non-Goals + Currently, for the scope of this proposal, we won't be addressing the method for serving the model. ## API (CRD and resulting objects) ### Custom Resource Definition + ```yaml apiVersion: kubeflow.org/v1alpha1 kind: ChainerJob @@ -56,7 +63,7 @@ spec: # "gloo", or custom backend will be supported in the future. backend: mpi # chief would be better like TfJorb? - master: + master: # replicas of master can be ommitted but must be 1 in master. replicas: 1 # In master, only backoffLimit/activeDeadlineSeconds @@ -70,10 +77,20 @@ spec: name: master imagePullPolicy: IfNotPresent command: ["mpiexec"] - args: [ - "-n", "3", "-N", "1", - "python3", "/train_mnist.py", "-e", "2", "-b", "100", "-g" - ] + args: + [ + "-n", + "3", + "-N", + "1", + "python3", + "/train_mnist.py", + "-e", + "2", + "-b", + "100", + "-g", + ] restartPolicy: OnFailure worker: replicas: 3 @@ -85,20 +102,22 @@ spec: restartPolicy: OnFailure ``` -This `ChainerJob` resembles the existing `TfJob`/`PyTorchJob`. The main differences are being the omission of `masterPort` options. +This `ChainerJob` resembles the existing `TfJob`/`PyTorchJob`. The main differences are being the omission of `masterPort` options. -`backend` defines the protocol the [ChainerMN][ChainerMN] processes will use to communicate when initializing the worker group. As stated above, [ChainerMN][ChainerMN] currently support MPI only for backend. But they are now planning to extend the support to other communication backend (e.g. [gloo][gloo] or other custom ones). +`backend` defines the protocol the [ChainerMN][ChainerMN] processes will use to communicate when initializing the worker group. As stated above, [ChainerMN][ChainerMN] currently support MPI only for backend. But they are now planning to extend the support to other communication backend (e.g. [gloo][gloo] or other custom ones). ### Container Image -When `backend: mpi`, the same assumption with [mpi-operator](mpi-operator-proposal.md) would be applied. In addition, to bring out the best performance with CUDA and NVIDIA GPU power, CUDA-aware MPI should be built and installed in the container image. + +When `backend: mpi`, the same assumption with [mpi-operator](mpi-operator-proposal.md) would be applied. In addition, to bring out the best performance with CUDA and NVIDIA GPU power, CUDA-aware MPI should be built and installed in the container image. ### Resulting Master/Workers -This resulting master/workers resembles ones in [mpi-operator](mpi-operator-proposal.md) very much. It is because that when `backend: mpi`, the main mission of chainer operator would be a setup of MPI cluster on Kubernetes which is failt-tolerant in some extent. +This resulting master/workers resembles ones in [mpi-operator](mpi-operator-proposal.md) very much. It is because that when `backend: mpi`, the main mission of chainer operator would be a setup of MPI cluster on Kubernetes which is failt-tolerant in some extent. -The difference is that one of master's initContainers makes sure all the cluster pods are up and can connect to them with `kubectl exec`. It is because that it makes chainer-operator not to needs to watch failure of jobs or StatefulSets. This simplifies implementation of chainer-operator. 
+The difference is that one of master's initContainers makes sure all the cluster pods are up and can connect to them with `kubectl exec`. It is because that it makes chainer-operator not to needs to watch failure of jobs or StatefulSets. This simplifies implementation of chainer-operator. #### Master + ```yaml apiVersion: batch/v1 kind: Job @@ -106,7 +125,7 @@ metadata: name: ${job-id}-master spec: backoffLimit: 5 - activeDeadlineSeconds: 100 + activeDeadlineSeconds: 100 template: spec: initContainers: @@ -235,11 +254,11 @@ The sequence is very similar to [mpi-operator](mpi-operator-proposal.md#Design). - chainer-operator needs not to wait for pods in the `StatefulSet` are up and can connect to them because `master` pod has `initContainer` to do it. - When `Job` finishes (even when `DeadlineExceeded`), it will scale `StatefulSet` to `0`. - ## Alternatives Considered -We know [mpi-operator](mpi-operator-proposal.md) is already proposed. As a design alternative, chiner-operator could emit `kind: MPIJob` custom resource instead of emitting similar constructs. -Please be noted that [ChainerMN][ChainerMN] is now planning to expand backend support other than MPI. So, even in the case which chainer-operator just emmits `kind: MPIJob` resources, chainer-operator would be worth to introduce. +We know [mpi-operator](mpi-operator-proposal.md) is already proposed. As a design alternative, chiner-operator could emit `kind: MPIJob` custom resource instead of emitting similar constructs. + +Please be noted that [ChainerMN][ChainerMN] is now planning to expand backend support other than MPI. So, even in the case which chainer-operator just emmits `kind: MPIJob` resources, chainer-operator would be worth to introduce. [ChainerMN]: https://github.com/chainer/chainermn [Chainer]: https://chainer.org diff --git a/proposals/pvc-template.md b/proposals/263-pvc-template/README.md similarity index 82% rename from proposals/pvc-template.md rename to proposals/263-pvc-template/README.md index 47aa29e73..dac8c2826 100644 --- a/proposals/pvc-template.md +++ b/proposals/263-pvc-template/README.md @@ -1,75 +1,81 @@ -# Kubeflow PVC Template Support +# KEP-263: Kubeflow PVC Template Support ## Table of Contents + - [Summary](#summary) - [Motivation](#motivation) - - [User Stories](#user-stories-optional) - - [Goals](#goals) - - [Non-Goals](#non-goals) + - [User Stories](#user-stories-optional) + - [Goals](#goals) + - [Non-Goals](#non-goals) - [Proposal](#proposal) - - [Implementation Details](#implementation-details) - - [Risks and Mitigations](#risks-and-mitigations) + - [Implementation Details](#implementation-details) + - [Risks and Mitigations](#risks-and-mitigations) - [Testing Plan](#testing-plan) - ## Summary + Kubeflow cannot automatically provision volume for user Pods as their working space. This doc proposes PVC template support which adds an automatic volume provisioning option to user. ## Motivation + Machine learning related workloads usually requires to process large amount of data sets for training purpose, Tensorflow, PyTorch are just example of them, and after the workload finishes, the data is usually discarded. 
Today, to provide such scratch space for KubeFlow workloads, user would have the following options: -* Use host disk such as `EmptyDir` or `HostPath` -* Mount shared file system such as `AWS EFS` -* Pre-provision block devices such as `AWS EBS` -* Implement customized volume provisioning logics via `CSI` or `FlexVolume` + +- Use host disk such as `EmptyDir` or `HostPath` +- Mount shared file system such as `AWS EFS` +- Pre-provision block devices such as `AWS EBS` +- Implement customized volume provisioning logics via `CSI` or `FlexVolume` For the above options, each in its own way has cons: -* EmptyDir - * Requires careful host disk space pre-provisioning - * Scheduling multiple training jobs onto same host might cause tension in host disk space since default - kube-scheduler does not take `EmptyDir` size into consideration - * If training job gets retried and scheduled onto a different host, it need to fetch all its data again -* Shared file system - * Throughput bottleneck, as jobs might use storage in a bursty fashion during phases such as downloading training data - * Shared file system is usually expensive -* Pre provisioned block device - * Requires additional manual / automated work for provisioning devices and synchronize volume naming for replica -* CSI / FlexVolume - * Additional dev work would be required. - * [CSI just GA-ed](https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/), and has not been widely adopped yet - * Flex volume is out of tree and is [deprecated](https://github.com/kubernetes/community/blob/master/sig-storage/volume-plugin-faq.md) - + +- EmptyDir + - Requires careful host disk space pre-provisioning + - Scheduling multiple training jobs onto same host might cause tension in host disk space since default + kube-scheduler does not take `EmptyDir` size into consideration + - If training job gets retried and scheduled onto a different host, it need to fetch all its data again +- Shared file system + - Throughput bottleneck, as jobs might use storage in a bursty fashion during phases such as downloading training data + - Shared file system is usually expensive +- Pre provisioned block device + - Requires additional manual / automated work for provisioning devices and synchronize volume naming for replica +- CSI / FlexVolume + - Additional dev work would be required. + - [CSI just GA-ed](https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/), and has not been widely adopped yet + - Flex volume is out of tree and is [deprecated](https://github.com/kubernetes/community/blob/master/sig-storage/volume-plugin-faq.md) + While block device is the best volume choice as the scratch space of training job, and k8s has native support for auto block device provisioning through [Persistent Volume Claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#lifecycle-of-a-volume-and-claim), it would be a easy leverage for kubeflow to support dynamic scratch volume provisioning for training jobs. 
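For reference, the dynamic-provisioning path requires nothing beyond a `StorageClass` and a `PersistentVolumeClaim` that references it; a minimal sketch follows, where the provisioner, class name, and size are illustrative assumptions rather than part of this proposal:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: training-scratch
# Any dynamic provisioner can be used; the in-tree AWS EBS one is shown as an example.
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-worker-0
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: training-scratch
  resources:
    requests:
      storage: 100Gi
```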
### User Stories + - As a data scientist / engineer, I want to use arbitrary disk type and size for my different training jobs - As a data scientist / engineer, I want the infrastructure of my training jobs get automatically handled and I don't need to worry about provisioning and recycling them - As an infrastructure developer, I don't want to have user application to use host disk space with unlimited space as this increases the risk of using up host disk and affect stability - As an infrastructure developer, I want to provision infrastructure to my customers only when needed or there will be a waste of money ### Goals + - Automatically provision / de-provision block device as scratch space for user in an efficient and reliable way - Design should be generic enough so any distributed training job can take this feature - When a pod gets retried (recreated), it should get it's volume back so it has the option to continue with its work - User don't have to define volume for every single replica of worker - Volume creation throttling - - This has to be added for volume provisioning feature to be production ready since most cloud provider throttles - API calls per account - if there is a burst of volume creation and cloud provider started to throttle calls, all - clusters under the entire cloud account would get affected, and massive back-offs are hard to handle given the - error might propagate through upstream services - + - This has to be added for volume provisioning feature to be production ready since most cloud provider throttles + API calls per account - if there is a burst of volume creation and cloud provider started to throttle calls, all + clusters under the entire cloud account would get affected, and massive back-offs are hard to handle given the + error might propagate through upstream services + ### Non-Goals + - Volume pooling for reducing Create/Delete calls as we can re-use volume for different jobs - Smart throttling such that one training job will not starve other ones due to volume throttling, similar for different worker types within same training job - [Volume resizing](https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/) - Volume reclaim policy, i.e. user can choose if they want their PVC gets deleted after deleting the training job - ## Proposal ### Implementation Details @@ -79,24 +85,27 @@ to older versions. Currently we need to modify `v1.ReplicaSpec`, and add volume to be commonly shared by any type of distributed training job controller. #### API Change + Similar to stateful set, we will add a field in `common.ReplicaSpec` to define PVC template: + ```go // ReplicaSpec is a description of the replica type ReplicaSpec struct { // ... - + // VolumeClaimTemplates specifies a list of volume claim templates, which defines volumes specified in template. VolumeClaimTemplates []corev1.PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"` } ``` + List is used here instead of map since according to Kubernetes convention, [lists is preferred over maps](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#lists-of-named-subobjects-preferred-over-maps). In this case, if user redefine volume claim templates with same name but different spec, **the actual outcome will be undefined**. After v1, `common.ReplicaSpec` will be able to be shared with all types of controllers. 
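Because the outcome with duplicate template names is undefined, the controller (or a validating webhook) would likely want to reject such specs up front. The following is a rough sketch of what that check could look like; it is not part of the proposed API:

```go
package common

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validateVolumeClaimTemplates returns an error if two templates in the same
// replica spec share a name, since the proposal leaves that case undefined.
func validateVolumeClaimTemplates(templates []corev1.PersistentVolumeClaim) error {
	seen := map[string]bool{}
	for _, tmpl := range templates {
		if seen[tmpl.Name] {
			return fmt.Errorf("duplicate volumeClaimTemplate name %q", tmpl.Name)
		}
		seen[tmpl.Name] = true
	}
	return nil
}
```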
- #### PVC Management + VolumeClaimTemplate will be per worker type, which means that all replicas will have same volume configuration. The exact volume type will be up to the user through storage class, not our responsibility. @@ -118,8 +127,7 @@ volume back. Here is an example of TFJob: ```yaml - -# This is a snippet of TFJobSpec +# This is a snippet of TFJobSpec tfReplicaSpecs: Worker: replicas: 1 @@ -147,6 +155,7 @@ tfReplicaSpecs: We will generate volume definition in the actual Pod spec (only showing worker-0, pod and volume will be identical for worker-1 besides the ordinal will change from 0 to 1): + ```yaml # worker 0 pod apiVersion: v1 @@ -172,12 +181,12 @@ kind: PersistentVolumeClaim metadata: name: "my-data-mytf-worker-0" ownerReferences: - - apiVersion: kubeflow.org/v1beta1 - blockOwnerDeletion: true - controller: true - kind: TFJob - name: "mytf" - uid: "xxxxxxx" + - apiVersion: kubeflow.org/v1beta1 + blockOwnerDeletion: true + controller: true + kind: TFJob + name: "mytf" + uid: "xxxxxxx" spec: accessModes: - "ReadWriteOnce" @@ -192,12 +201,15 @@ Once Pod and PVC are created, k8s control plane will handle the rest of the coor #### Volume Provisioning / De-provisioning Throttling ##### Provision + Though k8s PV controller has relatively aggressive exponential and jittered back-off when making cloud provider calls, a burst in creating PVC will still result in a burst in cloud provider calls. Therefore, cluster administrator should be able to configure the rate each application should use to create PVC. ##### De-provision + We do not need to have a centralized throttle of volume de-provision for the following 2 reasons: + 1. Volume de-provision is relatively light weighted (only contains unmount and detach, not necessarily delete volume) 2. Volume de-provision happens in a distributed fashion and have more randomness on timing, as container tear-down time varies @@ -205,6 +217,7 @@ We can rely on k8s garbage collector to recycle Pod and Volume when a training j that it does not start volume tear down routine before all containers are terminated. ##### Throttling + With volume creation throttling, controller will send creation request that contains Pod/PVC pair to a queue and let the volume and pod to be created asynchronously. We always create volume first before we create Pod. The reason is this will make control plane more efficient as Pod without its volume get created first will make k8s controller / scheduler retry @@ -218,7 +231,6 @@ type podVolumeThrottleRequest struct { } ``` - A worker will spin up that takes requests from queue and create PVC with throttled rate. K8s provides well tested [token bucket filter through its client-go](https://godoc.org/k8s.io/client-go/util/flowcontrol#NewTokenBucketRateLimiter), and we will be able to use that directly. Worker will requeue failed requests with back-off. With this throttling, worker @@ -226,25 +238,22 @@ will have head-of-line blocking globally (among all training jobs controlled by as we want to have a global volume creation throttling. Also, this worker should be able to be commonly shared among different types of controllers so should reside in `common`. - - ##### Alternatives + 1. Where to throttle volume creation? - An alternative is to work with k8s upstream to add throttling in PV controller, but given the dev/release cycle, it'd be -more convenient to add it in application, while keeping upstream logic stable and simple. 
Also adding throttling logic in -application has finer granularity as we can configure throttling. - + more convenient to add it in application, while keeping upstream logic stable and simple. Also adding throttling logic in + application has finer granularity as we can configure throttling. #### Status Report and Error Propagation + For now I don't think it necessary for propagating volume status back to `common.JobStatus`, as PVC will have its own status update. If we fail to create PVC, we will f event for the targeting training job. Since we don't have distributed training job specific metrics exporting for now, adding metrics for volume provisioning failure or so will be out of the scope of this design. - - ### Risks and Mitigations + 1. Even though we can throttle volume creation, there is still possibility that we hit [k8s volume binding bug](https://github.com/kubernetes/kubernetes/pull/72045) if user's storage class is `WaitForFirstConsumer` with k8s before 1.14.0. When this bug is hit, manual intervention is needed to restart scheduler to refresh its cache. But from kubeflow perspective, allowing user to specifying volume provisioning @@ -255,9 +264,7 @@ scope of this design. throttling. For simplicity of this feature, we can start with FIFO queue but come up with fairer throttling later (Non Goal #2) - - ## Test Plan + 1. Unit tests are needed for proper transfer volume claim template into Pod's volume definitions 2. Controller integration test for testing volume creation and throttling - diff --git a/proposals/280-issue-triage/README.md b/proposals/280-issue-triage/README.md new file mode 100644 index 000000000..f21c4e0a7 --- /dev/null +++ b/proposals/280-issue-triage/README.md @@ -0,0 +1,92 @@ +# KEP-280: Kubeflow Issue Triage + +## TL;DR + +The purpose of this doc is to define a process for triaging Kubeflow issues. + +## Objectives + +- Establish well accepted criterion for determining whether issues have been triaged +- Establish a process for ensuring issues are triaged in a timely fashion +- Define metrics for measuring whether we are keeping up with issues + +## Triage Conditions + +The following are necessary and sufficient conditions for an issue to be considered triaged. + +- The issue must have a label indicating which one of the following kinds of issues it is + + - **bug** + - Something is not working as intended in general. + - **question** + - Clear question statement + - Something is not working as intended in author's specific use case and he/she doesn't know why. + - **feature** + - Everything is working as intended, but could be better (i.e more user friendly) + - **process** + - Typically used to leave a paper trail for updating Kubeflow infrastructure. It helps to track the changes to infrastructure for easy debugging in the future. + +- The issue must have at least one [area or platform label](https://github.com/kubeflow/community/blob/master/labels-owners.yaml) grouping related issues and relevant owners. + +- The issue must have a priority attached to it. Here is a guideline for priority + + - **P0** - Urgent - Work must begin immediately to fix with a patch release: + - Bugs that state that something is really broken and not working as intended. + - Features/improvements that are blocking the next release. + - **P1** - Rush - Work must be scheduled to assure issue will be fixed in the next release. + - **P2** - Low - Never blocks a release, assigned to a relevant project backlog if applicable. 
+ - **P3** - Very Low - Non-critical or cosmetic issues that could and probably should eventually be fixed but have no specific schedule, assigned to a relavant project backlog if applicable. + +- **P0** & **P1** issues must be attached to a Kanban board corresponding to the release it is targeting + +## Process + +1. Global triagers are responsible for ensuring new issues have an area or platform label + + - A weekly rotation will be established to designate a primary person to apply initial triage + + - Once issues have an area/platform label they should be moved into the appropriate [column "Assigned to Area Owners"](https://github.com/orgs/kubeflow/projects/26#column-7382310) in the Needs Triage Kanban board + + - There is an open issue [kubeflow/code-intelligence#72](https://github.com/kubeflow/code-intelligence/issues/72) to do this automatically + +1. Area/Platform owners are responsible for ensuring issues in their area are triaged + + - The oncall will attempt to satisfy the above criterion or reassign to an appropriate WG if there is some question + +## Tooling + +- The [Needs Triage](https://github.com/orgs/kubeflow/projects/26) Kanban board will be used to track issues that need triage + + - Cards will be setup to monitor various issues; e.g. issues requiring discussion by various WG's + +- The [GitHub Issue Triage action](https://github.com/kubeflow/code-intelligence/tree/master/Issue_Triage/action) can be used to + automatically add/remove issues from the Kanban board depending on whether they need triage or not + + - Follow the [instructions](https://github.com/kubeflow/code-intelligence/tree/master/Issue_Triage/action#installing-the-action-on-a-repository) to install the GitHub action on a repository + +- The [triage notebook](https://github.com/kubeflow/code-intelligence/blob/master/py/code_intelligence/triage.ipynb) can be used to generate reports about number of untriaged issues as well as find issues needing triage + +## Become a contributor + +- Make sure that you have enough permissions to assign labels to an issue and add it to a project. +- In order to get permissions, open a PR to add yourself to [project-maintainers](https://github.com/kubeflow/internal-acls/blob/4e44f623ea4df32132b2e8a973ed0f0dce4f4139/github-orgs/kubeflow/org.yaml#L389) group. + +## Triage guideline + +- Take an issue from "Needs Triage" project and open it in a new tab. +- Carefully read the description. +- Carefully read all comments below. (Some issues might be already resolved). +- Make sure that issue is still relevant. (Some issues might be open for months and still be relevant to current Kubeflow release whereas some might be outdated and can be closed). +- Ping one of the issue repliers if he/she is not replying for a while. +- Make sure that all triage conditions are satisfied. 
+ +## Metrics + +We would like to begin to collect and track the following metrics + +- Time to triage issues +- Issue volume + +## References + +- [kubeflow/community](https://github.com/kubeflow/community/issues/280) diff --git a/proposals/tf-operator-design-v1alpha2.md b/proposals/30-tf-operator-v1alpha2/tf-operator-design-v1alpha2.md similarity index 89% rename from proposals/tf-operator-design-v1alpha2.md rename to proposals/30-tf-operator-v1alpha2/tf-operator-design-v1alpha2.md index 6e08fc0d1..acedc1077 100644 --- a/proposals/tf-operator-design-v1alpha2.md +++ b/proposals/30-tf-operator-v1alpha2/tf-operator-design-v1alpha2.md @@ -1,6 +1,7 @@ -**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* + +**Table of Contents** _generated with [DocToc](https://github.com/thlorenz/doctoc)_ - [TF-Operator Design (v1alpha2)](#tf-operator-design-v1alpha2) - [Motivation](#motivation) @@ -22,15 +23,15 @@ _Authors:_ -* @ScorpioCPH - Penghao Cen <cenph@caicloud.io> +- @ScorpioCPH - Penghao Cen <cenph@caicloud.io> _Status_ -* 2018-03-18 - Accepted -* 2018-04-10 - Implementation Started -* 2018-07-02 - v1alpha2 released in 0.2 +- 2018-03-18 - Accepted +- 2018-04-10 - Implementation Started +- 2018-07-02 - v1alpha2 released in 0.2 -# TF-Operator Design (v1alpha2) +# KEP-30: TF-Operator Design (v1alpha2) ## Motivation @@ -40,10 +41,10 @@ Open this file to summarize the design details and move the version of API to `v ## Goals - Define the structure of API `v1alpha2`. - + Cover most of the refactoring requests we have discussed. - + Simplify the API definition. + - Cover most of the refactoring requests we have discussed. + - Simplify the API definition. - Define an `event-driven` mechanism for TFJob life-cycle management. - + And use `reconciler` mechanism as a double check. + - And use `reconciler` mechanism as a double check. - Clarify the `error handing` logic. - Provide a `test` mechanism to verify the design and implementation. @@ -56,6 +57,7 @@ Open this file to summarize the design details and move the version of API to `v The `TFJob` API v1alpha2 object will have the following structure: **TFJob**: + ```go // TFJob represents the configuration of signal TFJob type TFJob struct { @@ -76,6 +78,7 @@ type TFJob struct { ``` **TFJobSpec**: + ```go // TFJobSpec is a desired state description of the TFJob. type TFJobSpec struct { @@ -91,6 +94,7 @@ type TFJobSpec struct { ``` **TFReplicaSpec**: + ```go // TFReplicaSpec is a description of the TFReplica type TFReplicaSpec struct { @@ -126,6 +130,7 @@ const ( ``` **TFReplicaType**: + ```go // TFReplicaType is the type for TFReplica. type TFReplicaType string @@ -148,6 +153,7 @@ const ( ``` **TFJobStatus**: + ```go // TFJobStatus represents the current observed state of the TFJob. type TFJobStatus struct { @@ -176,6 +182,7 @@ type TFJobStatus struct { ``` **TFReplicaStatus**: + ```go // TFReplicaStatus represents the current observed state of the TFReplica. type TFReplicaStatus struct { @@ -191,6 +198,7 @@ type TFReplicaStatus struct { ``` **TFJobCondition**: + ```go // TFJobCondition describes the state of the TFJob at a certain point. type TFJobCondition struct { @@ -215,6 +223,7 @@ type TFJobCondition struct { ``` **TFJobConditionType**: + ```go // TFJobConditionType defines all kinds of types of TFJobStatus. 
type TFJobConditionType string @@ -321,34 +330,34 @@ Other user-defined arguments can also be passed into container by `Args` field i First, we should follow the `Event-Driven` pattern as other resource controller in kubernetes (e.g. Deployment/Job): - Start `tfJobInformer` to listen on CRUD events of TFJob. - + `tfJobInformer` was automatically generated from API definition by `informer-gen` script. + - `tfJobInformer` was automatically generated from API definition by `informer-gen` script. - Create one pair pod/service for each specify TFReplicaType + replica index in TFJob CreateHandler. - + For example, as a given TFReplicaSpec: - ``` - { - "PS": { - Replicas: 2, - }, - "Worker": { - Replicas: 3, - }, - } - ``` - We will create: - - `two` pair pods/services for PSs: - - tf-job-name-ps-1-uid - - tf-job-name-ps-2-uid - - `three` pair pods/services for Workers: - - tf-job-name-worker-1-uid - - tf-job-name-worker-2-uid - - tf-job-name-worker-3-uid - + We use a postfix `uid` to make each object name unique. - + Then set these objects' `OwnerReferences` to this TFJob object. + - For example, as a given TFReplicaSpec: + ``` + { + "PS": { + Replicas: 2, + }, + "Worker": { + Replicas: 3, + }, + } + ``` + We will create: + - `two` pair pods/services for PSs: + - tf-job-name-ps-1-uid + - tf-job-name-ps-2-uid + - `three` pair pods/services for Workers: + - tf-job-name-worker-1-uid + - tf-job-name-worker-2-uid + - tf-job-name-worker-3-uid + - We use a postfix `uid` to make each object name unique. + - Then set these objects' `OwnerReferences` to this TFJob object. - Listen on pods/services via `podInformer` and `serviceInformer`. - + On pod created/updated/deleted, get TFJob object by parsing `OwnerReferences`, set the `TFJob.Status` as defined above according to the whole TF cluster state. - + Update the `TFJob.Status.Condition` if needed. + - On pod created/updated/deleted, get TFJob object by parsing `OwnerReferences`, set the `TFJob.Status` as defined above according to the whole TF cluster state. + - Update the `TFJob.Status.Condition` if needed. - Terminate/Delete the TFJob object if every pod is completed (or leave pod phase as `Succeeded`). - + This maybe be lead to logs and model checkpoint files unreachable. + - This maybe be lead to logs and model checkpoint files unreachable. ### Reconciler @@ -389,15 +398,15 @@ As `tfJobImformer` provides a forcing resync mechanism by calling `UpdateFunc` w - UpdateFunc return a TFJob object periodically. - Check `LastReconcileTime` to determine whether we should trigger a reconciler call. - `tf-operator` will list all pods/services which related to this TFJob. - + Compare the current state to the spec of this TFJob. - + Try to recovery the failed pod/service to make the training healthy. - + Error handing is described below. + - Compare the current state to the spec of this TFJob. + - Try to recovery the failed pod/service to make the training healthy. + - Error handing is described below. - Update the status of this TFJob. - TODO: we should call this reconciler with an exponential back-off delay (15s, 30s, 60s …) capped at 5 minutes. ### Error Handling -To make the system robust, the tf-operator should be able to locally and automatically recover from errors. +To make the system robust, the tf-operator should be able to locally and automatically recover from errors. 
We extend kubernetes built-in `RestartPolicy` by adding new policy `ExitCode`: @@ -409,10 +418,12 @@ We extend kubernetes built-in `RestartPolicy` by adding new policy `ExitCode`: ``` We let users set this field according to their model code. - + If set RestartPolicy to `OnFailure`/`Always`, user should add reloading checkpoint code by themselves. - + Otherwise restarting will take no effect. + +- If set RestartPolicy to `OnFailure`/`Always`, user should add reloading checkpoint code by themselves. +- Otherwise restarting will take no effect. `ExitCode` policy means that user should add exit code by themselves, `tf-operator` will check these exit codes to determine the behavior when a error occurs: + - 1-127: permanent error, do not restart. - 128-255: retryable error, will restart the pod. @@ -433,7 +444,7 @@ We can use this model from TensorFlow [repo](https://github.com/tensorflow/tenso Apart from the above, we should add these abilities in the future: - Provide a properly mechanism to store training logs and checkpoint files. - + [FYI](https://github.com/kubeflow/tf-operator/issues/128) + - [FYI](https://github.com/kubeflow/tf-operator/issues/128) ### Related Issues diff --git a/proposals/pytorch-operator-proposal.md b/proposals/33-pytorch-operator/README.md similarity index 70% rename from proposals/pytorch-operator-proposal.md rename to proposals/33-pytorch-operator/README.md index 5c8118468..687bc44c5 100644 --- a/proposals/pytorch-operator-proposal.md +++ b/proposals/33-pytorch-operator/README.md @@ -1,6 +1,7 @@ -**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* + +**Table of Contents** _generated with [DocToc](https://github.com/thlorenz/doctoc)_ - [Motivation](#motivation) - [Goals](#goals) @@ -16,29 +17,37 @@ _Status_ -* 2018-03-20 - Accepted -* 2018-03-15 - Implementation Started -* 2018-07-02 - v1alpha1 is released in 0.2 +- 2018-03-20 - Accepted +- 2018-03-15 - Implementation Started +- 2018-07-02 - v1alpha1 is released in 0.2 + +# KEP-33: PyTorch Operator ## Motivation + PyTorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow. ## Goals -A Kubeflow user should be able to run training using PyTorch as easily as then can using Tensorflow. This proposal is centered around a Kubernetes operator for PyTorch. A user should be able to run both single node and distributed training jobs with PyTorch. + +A Kubeflow user should be able to run training using PyTorch as easily as then can using Tensorflow. This proposal is centered around a Kubernetes operator for PyTorch. A user should be able to run both single node and distributed training jobs with PyTorch. This proposal defines the following: + - A PyTorch operator - A way to deploy the operator with ksonnet - A single pod PyTorch example - A distributed PyTorch example ## Non-Goals + For the scope of this proposal, we won't be addressing the method for serving the model. ## API (CRD and resulting objects) ### Custom Resource Definition + The custom resource submitted to the Kubernetes API would look something like this: + ```yaml apiVersion: "kubeflow.org/v1alpha1" kind: "PyTorchJob" @@ -67,13 +76,14 @@ spec: restartPolicy: OnFailure ``` -This PyTorchJob resembles the existing TFJob for the tf-operator. The main differences being the omission of the parameter server replica type, and the addition of `masterPort` and `backend` options. 
+This PyTorchJob resembles the existing TFJob for the tf-operator. The main differences being the omission of the parameter server replica type, and the addition of `masterPort` and `backend` options. `backend` Defines the protocol the PyTorch workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found [here](http://pytorch.org/docs/master/distributed.html). `masterPort` Defines the port the group will use to communicate with the master's Kubernetes service. ### Resulting Master + ```yaml kind: Service apiVersion: v1 @@ -83,9 +93,10 @@ spec: selector: app: pytorch-master-${job_id} ports: - - port: 23456 - targetPort: 23456 + - port: 23456 + targetPort: 23456 ``` + ```yaml apiVersion: v1 kind: Pod @@ -95,28 +106,29 @@ metadata: app: pytorchmaster-${job_id} spec: containers: - - image: pytorch/pytorch:latest - imagePullPolicy: IfNotPresent - name: master - env: - - name: MASTER_PORT - value: "23456" - - name: MASTER_ADDR - value: "localhost" - - name: WORLD_SIZE - value: "3" - # Rank 0 is the master - - name: RANK - value: "0" - ports: - - name: masterPort - containerPort: 23456 + - image: pytorch/pytorch:latest + imagePullPolicy: IfNotPresent + name: master + env: + - name: MASTER_PORT + value: "23456" + - name: MASTER_ADDR + value: "localhost" + - name: WORLD_SIZE + value: "3" + # Rank 0 is the master + - name: RANK + value: "0" + ports: + - name: masterPort + containerPort: 23456 restartPolicy: OnFailure ``` -The master spec will create a service and a pod. The environment variables provided are used when initializing a distributed process group with PyTorch. `WORLD_SIZE` is determined by adding the number of replicas in both 'MASTER' and 'WORKER' replicaSpecs. `RANK` is 0 for the master. +The master spec will create a service and a pod. The environment variables provided are used when initializing a distributed process group with PyTorch. `WORLD_SIZE` is determined by adding the number of replicas in both 'MASTER' and 'WORKER' replicaSpecs. `RANK` is 0 for the master. ### Resulting Worker + ```yaml apiVersion: v1 kind: Pod @@ -124,32 +136,34 @@ metadata: name: py-torchjob-worker-${job_id} spec: containers: - - image: pytorch/pytorch:latest - imagePullPolicy: IfNotPresent - name: worker - env: - - name: MASTER_PORT - value: "23456" - - name: MASTER_ADDR - value: pytorch-master-${job_id} - - name: WORLD_SIZE - value: "3" - - name: RANK - value: "1" + - image: pytorch/pytorch:latest + imagePullPolicy: IfNotPresent + name: worker + env: + - name: MASTER_PORT + value: "23456" + - name: MASTER_ADDR + value: pytorch-master-${job_id} + - name: WORLD_SIZE + value: "3" + - name: RANK + value: "1" restartPolicy: OnFailure ``` The worker spec generates a pod. They will communicate to the master through the master's service name. ## Design + This is an implementaion of the PyTorch distributed design patterns, found [here](http://pytorch.org/tutorials/intermediate/dist_tuto.html), via the lense of TFJob found [here](https://github.com/kubeflow/tf-operator). In the case of Kubernetes, because the operator is able to easily apply configurations to each process, we will use the environment variable initialization method found [here](http://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods). In most training examples, the pods will communicate via the all-reduce function in order to average the gradients. 
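To make the environment-variable contract concrete, here is a minimal sketch (illustrative only, not part of the proposal) of the user code a PyTorchJob container might run; it relies solely on the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` variables injected by the operator above:

```python
import torch
import torch.distributed as dist


def main():
    # "env://" makes PyTorch read MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK
    # from the environment, which is exactly what the operator injects.
    dist.init_process_group(backend="gloo", init_method="env://")

    # Each replica contributes a tensor; all-reduce sums them across the group,
    # which is the primitive used to average gradients in distributed training.
    tensor = torch.ones(1) * dist.get_rank()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    tensor /= dist.get_world_size()
    print("rank %d sees averaged value %f" % (dist.get_rank(), tensor.item()))


if __name__ == "__main__":
    main()
```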
-![All-Reduce Pytorch](diagrams/all-reduce-pytorch-operator.jpeg) - +![All-Reduce Pytorch](all-reduce-pytorch-operator.jpeg) ## Alternatives Considered + One alternative considered for the CRD spec is shown below: + ```yaml apiVersion: "kubeflow.org/v1alpha1" kind: "PyTorchJob" @@ -160,6 +174,7 @@ spec: masterPort: "23456" worldSize: 3 container: - - image: pytorch/pytorch:latest + - image: pytorch/pytorch:latest ``` + The idea was the number of replicas for worker and masters could be derived from the `worldSize` given there would only be one master. It was decided against based on the fact that it deviates from a regular replicaSpec and provides less customization. diff --git a/proposals/diagrams/all-reduce-pytorch-operator.jpeg b/proposals/33-pytorch-operator/all-reduce-pytorch-operator.jpeg similarity index 100% rename from proposals/diagrams/all-reduce-pytorch-operator.jpeg rename to proposals/33-pytorch-operator/all-reduce-pytorch-operator.jpeg diff --git a/proposals/fate-operator-proposal.md b/proposals/335-fate-operator/README.md similarity index 81% rename from proposals/fate-operator-proposal.md rename to proposals/335-fate-operator/README.md index 3f54028ff..7e3dca700 100644 --- a/proposals/fate-operator-proposal.md +++ b/proposals/335-fate-operator/README.md @@ -1,78 +1,94 @@ - -- [Background](#background) -- [Motivation](#motivation) -- [Goals](#goals) -- [Non-Goals](#non-goals) -- [Design](#design) - - [Container images and deploying FATE cluster on - Kubernetes](#container-images-and-deploying-fate-cluster-on-kubernetes) - - [Custom Resource Definition](#custom-resource-definition) - - [Kubefate](#kubefate) - - [FATECluster](#fatecluster) - - [FateJob](#fatejob) - - [Controller design](#controller-design) -- [Reference](#reference) +- [Background](#background) +- [Motivation](#motivation) +- [Goals](#goals) +- [Non-Goals](#non-goals) +- [Design](#design) + - [Container images and deploying FATE cluster on + Kubernetes](#container-images-and-deploying-fate-cluster-on-kubernetes) + - [Custom Resource Definition](#custom-resource-definition) + - [Kubefate](#kubefate) + - [FATECluster](#fatecluster) + - [FateJob](#fatejob) + - [Controller design](#controller-design) +- [Reference](#reference) _Status_ -* 2020-6-3 - Draft v3 -* 2020-5-28 - Draft v2 -* 2020-5-14 – Draft v1 + +- 2020-6-3 - Draft v3 +- 2020-5-28 - Draft v2 +- 2020-5-14 – Draft v1 + +# KEP-335: Fate Operator ## Background + Federated machine learning (FML) is a machine learning setting where many clients (e.g. mobile devices or organizations) collaboratively train a model under the coordination of a central server while keeping the training data decentralized. Only the encrypted mediate parameters are exchanged between clients with MPC or homomorphic encryption. -![Federated Machine Learning](diagrams/fate-operator-fl.png) +![Federated Machine Learning](fate-operator-fl.png) -FML has received significant interest recently, because of its effectiveness to solve data silos and data privacy preserving problems. Companies participated in federated machine learning include 4Paradigm, Ant Financial, Data Republic, Google, Huawei, Intel, JD.com, Microsoft, Nvidia, OpenMind, Pingan Technology, Sharemind, Tencent, VMware, Webank etc. +FML has received significant interest recently, because of its effectiveness to solve data silos and data privacy preserving problems. 
Companies participated in federated machine learning include 4Paradigm, Ant Financial, Data Republic, Google, Huawei, Intel, JD.com, Microsoft, Nvidia, OpenMind, Pingan Technology, Sharemind, Tencent, VMware, Webank etc. -Depending on the differences in features and sample data space, federated machine learning can be classified into _horizontal federated machine learning_, _vertical federated machine learning_ and _federated transfer learning_. Horizontal federated machine learning is also called sample-based federated machine learning, which means data sets share the same feature space but have different samples. With horizontal federated machine learning, we can gather the relatively small or partial datasets into a big one to increase the performance of trained models. Vertical federated machine learning is applicable to the cases where there are two datasets with different feature space but share same sample ID. With vertical federated machine learning we can train a model with attributes from different organizations for a full profile. Vertical federated machine learning is required to redesign most of machine learning algorithms. Federated transfer learning applies to scenarios where there are two datasets with different features space but also different samples. +Depending on the differences in features and sample data space, federated machine learning can be classified into _horizontal federated machine learning_, _vertical federated machine learning_ and _federated transfer learning_. Horizontal federated machine learning is also called sample-based federated machine learning, which means data sets share the same feature space but have different samples. With horizontal federated machine learning, we can gather the relatively small or partial datasets into a big one to increase the performance of trained models. Vertical federated machine learning is applicable to the cases where there are two datasets with different feature space but share same sample ID. With vertical federated machine learning we can train a model with attributes from different organizations for a full profile. Vertical federated machine learning is required to redesign most of machine learning algorithms. Federated transfer learning applies to scenarios where there are two datasets with different features space but also different samples. -[FATE (Federated AI Technology Enabler)](https://fate.fedai.org) is an open source project initialized by Webank, [now hosted at the Linux Foundation](https://fate.fedai.org/2019/09/18/first-digital-only-bank-in-china-joins-linux-foundation/). FATE is the only open source FML framework that supports both horizontal and vertical FML currently. The architecture design of FATE is focused on providing FML platform for enterprises. [KubeFATE](https://github.com/FederatedAI/KubeFATE) is an open source project to deploy FATE on Kubernetes and is a proven effective solution for FML use cases. +[FATE (Federated AI Technology Enabler)](https://fate.fedai.org) is an open source project initialized by Webank, [now hosted at the Linux Foundation](https://fate.fedai.org/2019/09/18/first-digital-only-bank-in-china-joins-linux-foundation/). FATE is the only open source FML framework that supports both horizontal and vertical FML currently. The architecture design of FATE is focused on providing FML platform for enterprises. [KubeFATE](https://github.com/FederatedAI/KubeFATE) is an open source project to deploy FATE on Kubernetes and is a proven effective solution for FML use cases. 
More technologies of Federated machine learning, please refer to [Reference section](#reference) ## Case studies + There are 130+ enterprises and organizations, 150+ colleges participating in FATE project. FATE case studies refer to [Cases](https://www.fedai.org/cases/): + 1. [Utilization of FATE in Risk Management of Credit in Small and Micro Enterprises](https://www.fedai.org/cases/utilization-of-fate-in-risk-management-of-credit-in-small-and-micro-enterprises/) 2. [Computer vision Platform powered by Federated Learning](https://www.fedai.org/cases/computer-vision-platform-powered-by-federated-learning/) 3. [A case of traffic violations insurance-using federated learning](https://www.fedai.org/cases/a-case-of-traffic-violations-insurance-using-federated-learning/) 4. [Utilization of FATE in Anti Money Laundering Through Multiple Banks](https://www.fedai.org/cases/utilization-of-fate-in-anti-money-laundering-through-multiple-banks/) Other Federated Machine Learning cases: + 1. [Federated Learning for Mobile Keyboard Prediction](https://research.google/pubs/pub47586/) 2. [Federated Learning powered by NVIDIA Clara](https://devblogs.nvidia.com/federated-learning-clara/): hospitals and medical institutions collaboratively share and combine their local knowledge 3. [Owkin Launches the Collaborative COVID-19 Open AI Consortium (COAI)](https://www.unite.ai/covid-19-open-ai-consortium/) ## Motivation -Kubeflow provides a toolset for end-to-end machine learning workflow on Kubernetes. Introducing the capability of federated learning to Kubeflow helps FML users and researchers leverage existing Kubeflow toolkits in their workflows and help them more efficiently build FML solutions. + +Kubeflow provides a toolset for end-to-end machine learning workflow on Kubernetes. Introducing the capability of federated learning to Kubeflow helps FML users and researchers leverage existing Kubeflow toolkits in their workflows and help them more efficiently build FML solutions. A FATE-Operator is a start of supporting FML in Kubeflow. This proposal is aimed to defining what FATE operator should look like, and how to apply to Kubeflow. ## Goals + A Kubeflow user should be able to run training using FATE as easily as they can using PyTorch, Tensorflow. This proposal is centered around a Kubernetes Operator for FATE. With the FATE-Operator, a user can: -1. Provision and manage a FATE cluster; -2. Submit an FML job to FATE. + +1. Provision and manage a FATE cluster; +2. Submit an FML job to FATE. This proposal defines the following: -1. A FATE operator with three CRDs: - * FateJob: create an FML job; - * FateCluster: create a FATE cluster to serve FML jobs; - * Kubefate: the resource management component of FATE cluster. -2. Example of full lifecycle to create KubeFATE component, deploy FATE cluster and submit an FML job to created FATE and get the result. Note that, KubeFATE and FATE cluster only needs to be deployed once, and can handle multiple jobs. + +1. A FATE operator with three CRDs: + +- FateJob: create an FML job; +- FateCluster: create a FATE cluster to serve FML jobs; +- Kubefate: the resource management component of FATE cluster. + +2. Example of full lifecycle to create KubeFATE component, deploy FATE cluster and submit an FML job to created FATE and get the result. Note that, KubeFATE and FATE cluster only needs to be deployed once, and can handle multiple jobs. ## Non-Goals + For the scope of this proposal, we won’t be addressing the method of serving the model. 
## Design ### Container images and deploying FATE cluster on Kubernetes -We have built a set of Docker images for FATE cluster, and put into: https://hub.docker.com/orgs/federatedai/repositories . All the images have been already used and verified by community users. + +We have built a set of Docker images for FATE cluster, and put into: https://hub.docker.com/orgs/federatedai/repositories . All the images have been already used and verified by community users. There is a provisioning and management component of FATE cluster, called [KubeFATE](https://github.com/FederatedAI/KubeFATE/tree/master/k8s-deploy). KubeFATE manages FATE clusters of one party in a federation. All images work well and are proven in users’ environments. ### Custom Resource Definition + #### Kubefate + ``` apiVersion: app.kubefate.net/v1beta1 kind: Kubefate @@ -109,10 +125,13 @@ spec: name: kubefate-secret key: kubefatePassword ``` -KubeFATE is a core component to manage and coordinate FATE clusters in one FML party. The above CRD defines the KubeFATE component. -* host defines other components how to access the service service gateway of KubeFATE exposed. + +KubeFATE is a core component to manage and coordinate FATE clusters in one FML party. The above CRD defines the KubeFATE component. + +- host defines other components how to access the service service gateway of KubeFATE exposed. #### FATECluster + ``` apiVersion: app.kubefate.net/v1beta1 kind: FateCluster @@ -129,20 +148,23 @@ spec: partyPort: "30010" egg: replica: 1 - + # KubeFATE service deployed in Org. kubefate: name: kubefate-sample namespace: kube-fate ``` -The FateCluster defines a deployment of FATE on Kubernetes. -* version defines the FATE version deployed in Kubernetes; -* partyId defines the FML party’s ID; -* proxyPort defines the exposed port for exchanging models and parameters between different parties in an FML training. It will be exposed as a node port; -* partyList defines the parties in a federation which take part in collaboratively learning; -* egg is the worker nodes of FATE. + +The FateCluster defines a deployment of FATE on Kubernetes. + +- version defines the FATE version deployed in Kubernetes; +- partyId defines the FML party’s ID; +- proxyPort defines the exposed port for exchanging models and parameters between different parties in an FML training. It will be exposed as a node port; +- partyList defines the parties in a federation which take part in collaboratively learning; +- egg is the worker nodes of FATE. #### FateJob + ``` apiVersion: app.kubefate.net/v1beta1 kind: FateJob @@ -202,34 +224,37 @@ spec: } } ``` + FateJob defines the job sent to FATE cluster: -* fateClusterRef defines the cluster of FATE deployed on Kubernetes. Its value is resource name of FATE cluster created by CRD “FateCluster”; -* jobConf defines the details of an FML job. It includes: - * pipeline: the workflow pipeline of FATE. In FATE, there are many prebuilt algorithm components (ref: https://github.com/FederatedAI/FATE/tree/master/federatedml and https://github.com/FederatedAI/FATE/tree/master/federatedrec), which can be used to train models. The pipeline defines how data are passed through and processed in the whole training flow; - * moduleConf: the detail configuration of each algorithm component, e.g. the optimizers, the batch size etc. + +- fateClusterRef defines the cluster of FATE deployed on Kubernetes. Its value is resource name of FATE cluster created by CRD “FateCluster”; +- jobConf defines the details of an FML job. 
It includes: + - pipeline: the workflow pipeline of FATE. In FATE, there are many prebuilt algorithm components (ref: https://github.com/FederatedAI/FATE/tree/master/federatedml and https://github.com/FederatedAI/FATE/tree/master/federatedrec), which can be used to train models. The pipeline defines how data are passed through and processed in the whole training flow; + - moduleConf: the detail configuration of each algorithm component, e.g. the optimizers, the batch size etc. ### Controller design -We created a new custom controller for FateCluster, FateJob and Kubefate resources. +We created a new custom controller for FateCluster, FateJob and Kubefate resources. The relationship between them and processes to make everything work are shown as following diagrams. Process 1. Creating Kubefate if it does not exist. The custom controller (1) listens for Kubefate resource, (2) creates RBAC resources (Role, Service Account, RoleBinding) to allow remote execution, (3) creates the related resource of Kubefate. (4) The controller waits the resource of Kubefate to be ready and returns the status. -![Process 1](diagrams/fate-operator-1.png) +![Process 1](fate-operator-1.png) Process 2. Creating FATE cluster. In one party, only one Kubefate instance needs to be provisioned. Multiple FATE clusters can be provisioned by KubeFATe for different purposes. (1) The custom FATE controller listens for FateCluster custom resource, (2) and calls Kubefate cluster of the one federated party, sends the metadata and configurations (3) to create a FATE cluster. (4) The controller waits the resource of FATE to be ready and returns the status. -![Process 2](diagrams/fate-operator-2.png) +![Process 2](fate-operator-2.png) -Process 3. Submitting an FML job to FATE cluster. (1) The custom FATE controller listens for FateJob CRD, and sends the job to FATE cluster, which includes the pipeline and modules configuration. (3) The FATE controller waits for the job results from FATE cluster. +Process 3. Submitting an FML job to FATE cluster. (1) The custom FATE controller listens for FateJob CRD, and sends the job to FATE cluster, which includes the pipeline and modules configuration. (3) The FATE controller waits for the job results from FATE cluster. -![Process 3](diagrams/fate-operator-3.png) +![Process 3](fate-operator-3.png) The overall architecture of the federated learning can be presented as following diagram, the FATE cluster will handle the communication and return to FATE controller once the job is done. -![Overall](diagrams/fate-operator-overall.png) +![Overall](fate-operator-overall.png) ## Reference + 1. Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. CoRR, abs/1902.04885, 2019. URL http://arxiv.org/abs/1902.04885 2. Peter Kairouz et al. Advances and open problems in federated learning. 
arXiv preprint arXiv:1912.04977 diff --git a/proposals/diagrams/fate-operator-1.png b/proposals/335-fate-operator/fate-operator-1.png similarity index 100% rename from proposals/diagrams/fate-operator-1.png rename to proposals/335-fate-operator/fate-operator-1.png diff --git a/proposals/diagrams/fate-operator-2.png b/proposals/335-fate-operator/fate-operator-2.png similarity index 100% rename from proposals/diagrams/fate-operator-2.png rename to proposals/335-fate-operator/fate-operator-2.png diff --git a/proposals/diagrams/fate-operator-3.png b/proposals/335-fate-operator/fate-operator-3.png similarity index 100% rename from proposals/diagrams/fate-operator-3.png rename to proposals/335-fate-operator/fate-operator-3.png diff --git a/proposals/diagrams/fate-operator-fl.png b/proposals/335-fate-operator/fate-operator-fl.png similarity index 100% rename from proposals/diagrams/fate-operator-fl.png rename to proposals/335-fate-operator/fate-operator-fl.png diff --git a/proposals/diagrams/fate-operator-overall.png b/proposals/335-fate-operator/fate-operator-overall.png similarity index 100% rename from proposals/diagrams/fate-operator-overall.png rename to proposals/335-fate-operator/fate-operator-overall.png diff --git a/proposals/caffe2-operator-proposal.md b/proposals/41-caffe2-operator/README.md similarity index 82% rename from proposals/caffe2-operator-proposal.md rename to proposals/41-caffe2-operator/README.md index 00a2eb465..4be2d7361 100644 --- a/proposals/caffe2-operator-proposal.md +++ b/proposals/41-caffe2-operator/README.md @@ -1,6 +1,7 @@ -**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* + +**Table of Contents** _generated with [DocToc](https://github.com/thlorenz/doctoc)_ - [Motivation](#motivation) - [Goals](#goals) @@ -16,17 +17,22 @@ _Status_ -* 2018-04-06 - Accepted +- 2018-04-06 - Accepted + +# KEP-41: Caffe2 Operator ## Motivation + Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow. -For distributed training, Caffe2 has no parameter server compared with Tensorflow, so it has to use Redis/MPI to find the other nodes to communicate. +For distributed training, Caffe2 has no parameter server compared with Tensorflow, so it has to use Redis/MPI to find the other nodes to communicate. ## Goals -A Kubeflow user should be able to run training using Caffe2 as easily as they can using Tensorflow. This proposal is centered around a Kubernetes operator for Caffe2. A user should be able to run both single node and distributed training jobs with Caffe2. + +A Kubeflow user should be able to run training using Caffe2 as easily as they can using Tensorflow. This proposal is centered around a Kubernetes operator for Caffe2. A user should be able to run both single node and distributed training jobs with Caffe2. This proposal defines the following: + - A Caffe2 operator - A way to deploy the operator with kubectl - A single pod Caffe2 example @@ -34,11 +40,13 @@ This proposal defines the following: - A distributed Caffe2 proposal with [batchd scheduler](https://github.com/kubernetes-incubator/kube-arbitrator) ## Non-Goals + For the scope of this proposal, we won't be addressing the method for serving the model. 
## API (CRD and resulting objects) ### Custom Resource Definition + The custom resource submitted to the Kubernetes API would look something like this: ```yaml @@ -85,7 +93,7 @@ spec: restartPolicy: Never ``` -This Caffe2Job resembles the existing TFJob for the tf-operator. The main differences being the omission of the parameter server replica type, and the addition of `backend` options and `HELPER` replica type. +This Caffe2Job resembles the existing TFJob for the tf-operator. The main differences being the omission of the parameter server replica type, and the addition of `backend` options and `HELPER` replica type. `backend` Defines the distributed type the Caffe2 master and workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found [here](https://caffe2.ai/docs/distributed-training.html). @@ -98,8 +106,7 @@ apiVersion: v1 kind: Pod metadata: name: caffe2-master-${job_id} - labels: - app=caffe2-job-xx + labels: app=caffe2-job-xx caffe2_job_name=example-job controller-uid=dc3669c6-29f1-11e8-9ccd-ac1f6b8040c6 job-name=example-job-master-20lm-1 @@ -109,22 +116,22 @@ metadata: task_index=0 spec: containers: - - image: carmark/caffe2:latest - imagePullPolicy: IfNotPresent - name: caffe2 + - image: carmark/caffe2:latest + imagePullPolicy: IfNotPresent + name: caffe2 restartPolicy: Never ``` The labels variables provided are used when initializing a distributed process group with Caffe2. `task_index` is determined by adding the number of replicas in each 'MASTER' and 'WORKER' replicaSpecs. `job_type` is `MASTER` for the master. ### Resulting Worker + ```yaml apiVersion: v1 kind: Pod metadata: name: caffe2-worker-${job_id} - labels: - app=caffe2-job-xx + labels: app=caffe2-job-xx caffe2_job_name=example-job controller-uid=dc3669c6-29f1-11e8-9ccd-ac1f6b8040c6 job-name=example-job-worker-20lm-0 @@ -134,21 +141,22 @@ metadata: task_index=0 spec: containers: - - image: carmark/caffe2:latest - imagePullPolicy: IfNotPresent - name: caffe2 + - image: carmark/caffe2:latest + imagePullPolicy: IfNotPresent + name: caffe2 restartPolicy: Never ``` The worker spec generates a pod. They will communicate to the master through the redis's service name. ## Design + This is an implementaion of the Caffe2 distributed design patterns, found [here](https://caffe2.ai/docs/SynchronousSGD.html), via the lense of TFJob found [here](https://github.com/kubeflow/tf-operator). Diagram pending ## Other backends -Form [here](https://caffe2.ai/docs/distributed-training.html), Caffe2 also support [Gloo](https://github.com/facebookincubator/gloo) which is another communications library for multi-machine training. For Gloo with MPI, we do not neet the redis to communicate, the master and workers will communicate by ssh. So it should better to define another sshd port to communicate in container, then start the works first and then master container. +Form [here](https://caffe2.ai/docs/distributed-training.html), Caffe2 also support [Gloo](https://github.com/facebookincubator/gloo) which is another communications library for multi-machine training. For Gloo with MPI, we do not neet the redis to communicate, the master and workers will communicate by ssh. So it should better to define another sshd port to communicate in container, then start the works first and then master container. 
To finish this start process, we may invole the [batchd scheduler](https://github.com/kubernetes-incubator/kube-arbitrator) and use priority class to define the priority. diff --git a/proposals/kubeflow-distributions.md b/proposals/434-kubeflow-distribution/README.md similarity index 62% rename from proposals/kubeflow-distributions.md rename to proposals/434-kubeflow-distribution/README.md index bb0abc9f9..534550e46 100644 --- a/proposals/kubeflow-distributions.md +++ b/proposals/434-kubeflow-distribution/README.md @@ -1,3 +1,4 @@ +# KEP-434: Kubeflow Distributions ## Objective @@ -7,22 +8,21 @@ Clarify how Kubeflow distributions will be owned and developed going forward. Kubeflow can be divided into pieces - 1. Individual Kubeflow applications (e.g. Pipelines, KFServing, notebooks, etc...) - 1. Distributions of Kubeflow (e.g. Kubeflow on GCP, Kubeflow on AWS, MiniKF, etc...) +1. Individual Kubeflow applications (e.g. Pipelines, KFServing, notebooks, etc...) +1. Distributions of Kubeflow (e.g. Kubeflow on GCP, Kubeflow on AWS, MiniKF, etc...) +Since July, the Kubeflow community has been working on forming working groups to create greater +accountability for the different parts of Kubeflow. - Since July, the Kubeflow community has been working on forming working groups to create greater - accountability for the different parts of Kubeflow. +At this point in time, Kubeflow has formed working groups with clear ownership for all of the individual Kubeflow +applications. - At this point in time, Kubeflow has formed working groups with clear ownership for all of the individual Kubeflow - applications. +There is an ongoing debate about who should own and maintain Kubeflow distributions. - There is an ongoing debate about who should own and maintain Kubeflow distributions. +To date there are two categories of distributions - To date there are two categories of distributions - - 1. Kubeflow distributions tied to a specific platform (e.g. AWS, GCP, etc...) - 1. Generic distributions (e.g. for MiniKube, any conformant K8s cluster, etc...) +1. Kubeflow distributions tied to a specific platform (e.g. AWS, GCP, etc...) +1. Generic distributions (e.g. for MiniKube, any conformant K8s cluster, etc...) The former have been owned and maintained by the respective vendors. The general consensus is that these should continue to be owned and maintained by the respective vendors outside any KF working group. @@ -31,7 +31,7 @@ This leaves the question of what to do about generic distributions. In particula ## Proposal -Going forward all distributions of Kubeflow should be owned and maintained outside of Kubeflow. +Going forward all distributions of Kubeflow should be owned and maintained outside of Kubeflow. ### What is a Kubeflow Distribution @@ -41,9 +41,9 @@ A Kubeflow distribution is an opinionated bundle of Kubeflow applications optimi Going forward new distributions of Kubeflow should be developed outside of the Kubeflow GitHub org. 
This ensures - * Accountability for the distribution - * Insulates Kubeflow from the success or failure of the distribution - * Avoid further taxing Kubeflow's overstretched engprod resources(see[kubeflow/testing#737](https://github.com/kubeflow/testing/issues/737)) +- Accountability for the distribution +- Insulates Kubeflow from the success or failure of the distribution +- Avoid further taxing Kubeflow's overstretched engprod resources(see[kubeflow/testing#737](https://github.com/kubeflow/testing/issues/737)) The owners of existing distributions should work with the respective WG/repository/org owners to come up with appropriate transition plans. @@ -56,8 +56,8 @@ As an example, the name "KFCube" for a distribution targeting minikube is highly ### Releasing & Versioning -Releasing and versioning for each distribution is the responsibility of the distribution owners. -This includes determining the release cadence. The release cadence of distributions doesn't need to be in sync +Releasing and versioning for each distribution is the responsibility of the distribution owners. +This includes determining the release cadence. The release cadence of distributions doesn't need to be in sync with Kubeflow releases. ## Alternatives Considered @@ -66,5 +66,5 @@ An alternative would be to spin up a work group to own or maintain one or more g This has the following disadvantages -* Distributions aren't treated uniformly as some distributions are owned by Kubeflow and thus implicitly endorsed by Kubeflow -* Historically, creating accountability for generic distributions has been difficult +- Distributions aren't treated uniformly as some distributions are owned by Kubeflow and thus implicitly endorsed by Kubeflow +- Historically, creating accountability for generic distributions has been difficult diff --git a/proposals/paddle-operator-proposal.md b/proposals/502-paddle-operator/README.md similarity index 67% rename from proposals/paddle-operator-proposal.md rename to proposals/502-paddle-operator/README.md index 52138f87d..4d32c7868 100644 --- a/proposals/paddle-operator-proposal.md +++ b/proposals/502-paddle-operator/README.md @@ -1,4 +1,4 @@ -# Paddle Operator Proposal +# KEP-502: Paddle Operator Proposal ## Motivation @@ -10,10 +10,10 @@ Kubeflow user should be able to run training using PaddlePaddle easily on Kubern The proposal defines the followings: -* Provide a Custom Resource Definition (CRD) for defining PaddlePaddle training job, currently supports running two distributed tasks, ParameterServer (PS) and Collective. -* Implement a controller to manage the CRD, create dependent resources, and reconcile to the desired states. -* The script for operator and controller deployment. -* Several distributed PaddlePaddle training examples. +- Provide a Custom Resource Definition (CRD) for defining PaddlePaddle training job, currently supports running two distributed tasks, ParameterServer (PS) and Collective. +- Implement a controller to manage the CRD, create dependent resources, and reconcile to the desired states. +- The script for operator and controller deployment. +- Several distributed PaddlePaddle training examples. ## Non-Goals @@ -38,6 +38,7 @@ For the model serving part, it will not be included in the paddle-operator. 
``` ### Custom Resource Definition + The custom resource definition yaml example is as following: ```yaml @@ -65,13 +66,13 @@ spec: image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1 ``` -* The optional configuration of withGloo is 0 not enabled, 1 only starts the worker side, 2 starts all (worker and server), it is recommended to set 1; +- The optional configuration of withGloo is 0 not enabled, 1 only starts the worker side, 2 starts all (worker and server), it is recommended to set 1; -* The cleanPodPolicy can be optionally configured as Always/Never/OnFailure/OnCompletion, which indicates whether to delete the pod when the task is terminated (failed or successful). It is recommended to Never during debugging and OnCompletion during production; +- The cleanPodPolicy can be optionally configured as Always/Never/OnFailure/OnCompletion, which indicates whether to delete the pod when the task is terminated (failed or successful). It is recommended to Never during debugging and OnCompletion during production; -* The intranet can be optionally configured as Service/PodIP, which means the communication method between pods. The user does not need to configure it, and PodIP is used by default; +- The intranet can be optionally configured as Service/PodIP, which means the communication method between pods. The user does not need to configure it, and PodIP is used by default; -* The content of ps and worker is podTemplateSpec. Users can add more content according to the Kubernetes specification, such as GPU configuration. +- The content of ps and worker is podTemplateSpec. Users can add more content according to the Kubernetes specification, such as GPU configuration. We also provide PaddlePaddle collective mode with GPU. @@ -107,12 +108,12 @@ spec: medium: Memory ``` -* Here you need to add shared memory to mount to prevent cache errors; - -* This example uses the built-in data set. After the program is started, it will be downloaded. Depending on the network environment, it may wait a long time. +- Here you need to add shared memory to mount to prevent cache errors; +- This example uses the built-in data set. After the program is started, it will be downloaded. Depending on the network environment, it may wait a long time. 
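For users who submit jobs programmatically rather than with `kubectl`, the following is a hypothetical sketch using the official Kubernetes Python client; the `paddlejobs` plural, the manifest path, and the use of PyYAML are assumptions, while the group/version and the `paddle-system` namespace follow the examples in this proposal:

```python
import yaml
from kubernetes import client, config


def submit_paddle_job(manifest_path: str) -> dict:
    # Assumption: running outside the cluster with a local kubeconfig.
    config.load_kube_config()
    with open(manifest_path) as f:
        body = yaml.safe_load(f)  # a PaddleJob manifest like the YAML above

    api = client.CustomObjectsApi()
    # group/version match apiVersion "batch.paddlepaddle.org/v1" shown in this
    # proposal; "paddlejobs" is the assumed plural of the PaddleJob CRD.
    return api.create_namespaced_custom_object(
        group="batch.paddlepaddle.org",
        version="v1",
        namespace=body.get("metadata", {}).get("namespace", "paddle-system"),
        plural="paddlejobs",
        body=body,
    )
```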
### Resulting Master + ```yaml apiVersion: v1 kind: Service @@ -120,21 +121,21 @@ metadata: name: wide-ande-deep-service-ps-0 namespace: paddle-system ownerReferences: - - apiVersion: batch.paddlepaddle.org/v1 - blockOwnerDeletion: true - controller: true - kind: PaddleJob - name: wide-ande-deep-service - uid: 8f432e67-8cda-482c-b147-91f9d4400067 + - apiVersion: batch.paddlepaddle.org/v1 + blockOwnerDeletion: true + controller: true + kind: PaddleJob + name: wide-ande-deep-service + uid: 8f432e67-8cda-482c-b147-91f9d4400067 resourceVersion: "9513616" selfLink: /api/v1/namespaces/paddle-system/services/wide-ande-deep-service-ps-0 uid: e274db1e-ee7f-4b6d-bc0c-034c32f4b7a1 spec: clusterIP: None ports: - - port: 2379 - protocol: TCP - targetPort: 2379 + - port: 2379 + protocol: TCP + targetPort: 2379 selector: paddle-res-name: wide-ande-deep-service-ps-0 sessionAffinity: None @@ -148,36 +149,37 @@ metadata: name: wide-ande-deep-ps-0 namespace: paddle-system ownerReferences: - - apiVersion: batch.paddlepaddle.org/v1 - blockOwnerDeletion: true - controller: true - kind: PaddleJob - name: wide-ande-deep - uid: f206587f-5dee-46f5-9399-e835bde02487 + - apiVersion: batch.paddlepaddle.org/v1 + blockOwnerDeletion: true + controller: true + kind: PaddleJob + name: wide-ande-deep + uid: f206587f-5dee-46f5-9399-e835bde02487 resourceVersion: "9506900" selfLink: /api/v1/namespaces/paddle-system/pods/wide-ande-deep-ps-0 uid: 36b27c8f-9712-474b-b21b-dd6b54aaef29 spec: containers: - - env: - - name: POD_IP - valueFrom: - fieldRef: - apiVersion: v1 - fieldPath: status.podIP - - name: PADDLE_TRAINER_ID - value: "0" - - name: TRAINING_ROLE - value: PSERVER - - name: PADDLE_TRAINING_ROLE - value: PSERVER - envFrom: - - configMapRef: - name: wide-ande-deep - image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1 + - env: + - name: POD_IP + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: status.podIP + - name: PADDLE_TRAINER_ID + value: "0" + - name: TRAINING_ROLE + value: PSERVER + - name: PADDLE_TRAINING_ROLE + value: PSERVER + envFrom: + - configMapRef: + name: wide-ande-deep + image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1 ``` ### Resulting Worker + ```yaml apiVersion: v1 kind: Pod @@ -185,33 +187,33 @@ metadata: name: wide-ande-deep-worker-0 namespace: paddle-system ownerReferences: - - apiVersion: batch.paddlepaddle.org/v1 - blockOwnerDeletion: true - controller: true - kind: PaddleJob - name: wide-ande-deep - uid: f206587f-5dee-46f5-9399-e835bde02487 + - apiVersion: batch.paddlepaddle.org/v1 + blockOwnerDeletion: true + controller: true + kind: PaddleJob + name: wide-ande-deep + uid: f206587f-5dee-46f5-9399-e835bde02487 resourceVersion: "9507629" selfLink: /api/v1/namespaces/paddle-system/pods/wide-ande-deep-worker-0 uid: e8534fe6-7c2e-4849-9a99-ffdcd5df76bb spec: containers: - - env: - - name: POD_IP - valueFrom: - fieldRef: - apiVersion: v1 - fieldPath: status.podIP - - name: PADDLE_TRAINER_ID - value: "0" - - name: TRAINING_ROLE - value: TRAINER - - name: PADDLE_TRAINING_ROLE - value: TRAINER - envFrom: - - configMapRef: - name: wide-ande-deep - image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1 + - env: + - name: POD_IP + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: status.podIP + - name: PADDLE_TRAINER_ID + value: "0" + - name: TRAINING_ROLE + value: TRAINER + - name: PADDLE_TRAINING_ROLE + value: TRAINER + envFrom: + - configMapRef: + name: wide-ande-deep + image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1 ``` The worker spec 
generates a pod. Currently the worker communicates with the master through the master's service name; we'll use a service registry for service discovery. @@ -219,11 +221,10 @@ The worker spec generates a pod. Currently worker will communicate to the master ## Design Here are some original design docs for paddle-operator on Kubernetes. - -* Paddle Operator Architecture on Kubernetes, please check out [design-arch](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-arch.md) -* Paddle training job instance fault tolerant, please check out [design-fault-tolerant](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-fault-tolerant.md) -* Co-scheduling training job to prevent job instances from resource deadlock, please check out [design-coschedule](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-coschedule.md) +- Paddle Operator Architecture on Kubernetes, please check out [design-arch](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-arch.md) +- Paddle training job instance fault tolerant, please check out [design-fault-tolerant](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-fault-tolerant.md) +- Co-scheduling training job to prevent job instances from resource deadlock, please check out [design-coschedule](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-coschedule.md) ## Alternatives Considered @@ -232,5 +233,3 @@ One option is to add PaddlePaddle support to the existing tf-operator, but the p ## Current Status We recently refactored the paddle-operator for better performance and code readability. We'll merge the dev branch back into the main branch soon, so our real code branch is `dev` at this moment. - - diff --git a/proposals/kubeflow-conformance-program-proposal.md b/proposals/524-kubeflow-conformance-program/README.md similarity index 57% rename from proposals/kubeflow-conformance-program-proposal.md rename to proposals/524-kubeflow-conformance-program/README.md index 1092a3e0a..8161aa7be 100644 --- a/proposals/kubeflow-conformance-program-proposal.md +++ b/proposals/524-kubeflow-conformance-program/README.md @@ -1,4 +1,4 @@ -# Kubeflow Conformance Test +# KEP-524: Kubeflow Conformance Test James Wu (james-jwu@) 2021-07-29 @@ -6,14 +6,14 @@ James Wu (james-jwu@) # Overview -The [Kubeflow Project](https://github.com/kubeflow) is currently managed by different [Working Groups](https://github.com/kubeflow/community/blob/master/wg-list.md) whose composition represents a broad spectrum of industry and community. The Kubeflow trademark is owned by Google. +The [Kubeflow Project](https://github.com/kubeflow) is currently managed by different [Working Groups](https://github.com/kubeflow/community/blob/master/wg-list.md) whose composition represents a broad spectrum of industry and community. The Kubeflow trademark is owned by Google. -The [Kubeflow Brand Guidelines](https://github.com/kubeflow/community/blob/master/KUBEFLOW_BRAND_GUIDELINES.pdf) was published in Mar. 2021. The guideline is broadly applicable to usage of Kubeflow trademark by products, events and publications. While the brand guidelines provide general guidance, it does not prescribe the definition of Kubeflow and what makes a distribution/application "Kubeflow" vs. not "Kubeflow". +The [Kubeflow Brand Guidelines](https://github.com/kubeflow/community/blob/master/KUBEFLOW_BRAND_GUIDELINES.pdf) was published in Mar. 2021.
The guideline is broadly applicable to usage of Kubeflow trademark by products, events and publications. While the brand guidelines provide general guidance, it does not prescribe the definition of Kubeflow and what makes a distribution/application "Kubeflow" vs. not "Kubeflow". This document aims to define conformance criteria for the following usage of Kubeflow trademark: -- For Kubeflow distribution - "Certified Kubeflow" -- For Kubeflow application - TBD +- For Kubeflow distribution - "Certified Kubeflow" +- For Kubeflow application - TBD The goal is to ensure these special usages of Kubeflow trademark meet common standards of interoperability, increase cohesiveness of the Kubeflow platform, promote customer confidence, reduce the burden of Kubeflow maintainers, and extend Kubeflow’s influence beyond the Kubeflow project. @@ -21,22 +21,23 @@ The goal is to ensure these special usages of Kubeflow trademark meet common sta **A conformant Kubeflow Distribution is certified to provide a set of core functionalities and API integration options.** -The tests will be designed in a way similar to [Kubernetes conformance program](https://github.com/cncf/k8s-conformance). +The tests will be designed in a way similar to [Kubernetes conformance program](https://github.com/cncf/k8s-conformance). The tests will be versioned. Each versioned certification is valid for 1 year. After 1 year, recertification against the latest version of the test will be required to maintain certification standing. -## Entitlements ## +## Entitlements - Conformant distributions are entitled to refer to the distribution as "**Certified Kubeflow**". Example usage: - - Appending “(Certified Kubeflow)” to the distribution name: e.g. AI-IS-FUN (Certified Kubeflow) - - Reference the Certified Kubeflow designation in discussion with customers, or on public documentation + - Appending “(Certified Kubeflow)” to the distribution name: e.g. AI-IS-FUN (Certified Kubeflow) + - Reference the Certified Kubeflow designation in discussion with customers, or on public documentation - Display a logo (to be designed) on the public website and documentation of the distribution - Be listed under a partner web page under the Kubeflow project. - The naming of the distribution still needs to follow [Kubeflow Brand Guidelines](https://github.com/kubeflow/community/blob/master/KUBEFLOW_BRAND_GUIDELINES.pdf). -## Out of scope ## +## Out of scope The following are out of scope of the conformance tests: + 1. Product quality and supportability The test design is strongly influenced by Kubernetes conformance program, where a very narrow set of tests are established to verify key API functionality. Since the tests are versioned, it is hoped that unsupported distributions will fall out of conformance by discontinuing the certification with the latest test version. @@ -49,24 +50,24 @@ The test will not verify how the distribution is developed (e.g. in Kubeflow org Company X creates a distribution of Kubeflow and plans to name it "X Kubeflow Service". Company X tries to certify for Kubeflow conformance, which entails: -- Install the distribution -- Runs the Kubflow conformance test suite -- Submits the test log as a PR to Kubeflow Trademark Team for approval. 
-Upon approval and following trademark guidelines, Company X changes the distribution name to "X service for Kubeflow", or "X Platform (Certified Kubeflow)" -Optionally, the distribution may be listed in a catalog on Kubeflow website +- Install the distribution +- Runs the Kubeflow conformance test suite +- Submits the test log as a PR to Kubeflow Trademark Team for approval. +- Upon approval and following trademark guidelines, Company X changes the distribution name to "X service for Kubeflow", or "X Platform (Certified Kubeflow)" +- Optionally, the distribution may be listed in a catalog on Kubeflow website The conformance tests make sure "X service for Kubeflow" supports KFP and Metadata. Company X may include more applications (e.g. TFJob, Katib) in the distribution, but it does not affect the conformance standing of "X service for Kubeflow". ## First version of conformance The first version of conformance aims to be inclusive of current components in the Kubeflow organization. The number of tests is intentionally kept small to allow fast progress and iteration. We propose: -- Each Kubeflow Working Group nominate <= 10 tests to be included in the conformance suite - - We recommend the candidate tests to be simple API acceptance tests that run reliably. Please keep in mind that the certification body is looking for a simple pass/fail to determine certification standing. - - There is no precedence for including UI in conformance tests. That said, we will experiment with options to include UI, most likely through self attestation and supporting evidence (e.g. screenshot or video). The details are TBD. -- Each WG works with the conformance test team (currently staffed by Google) to include the nominated tests into the conformance suite. -**Example**: for Kubeflow Pipelines, the first version of conformance will be limited to V1 Pipeline Runtime conformance. A subset of tests outlined in Appendix A will be included. +- Each Kubeflow Working Group nominates <= 10 tests to be included in the conformance suite + - We recommend the candidate tests to be simple API acceptance tests that run reliably. Please keep in mind that the certification body is looking for a simple pass/fail to determine certification standing. + - There is no precedent for including UI in conformance tests. That said, we will experiment with options to include UI, most likely through self attestation and supporting evidence (e.g. screenshot or video). The details are TBD. +- Each WG works with the conformance test team (currently staffed by Google) to include the nominated tests into the conformance suite. +**Example**: for Kubeflow Pipelines, the first version of conformance will be limited to V1 Pipeline Runtime conformance. A subset of tests outlined in Appendix A will be included. # Kubeflow Native (for Kubernetes Application) @@ -84,11 +85,12 @@ We expect this test to evolve, due to the ambiguity of “Kubenetes Application The first version of the test verifies that the application is integrated with Kubeflow Pipelines. Pipeline and Metadata are the binding “glue” for other Kubeflow components. Metadata generation is automatic when Kubeflow Application conforms to standard Kubeflow Pipelines component interface. -## Entitlements ## +## Entitlements + - Conformant applications are entitled to refer to the application as “Kubeflow Native”. Example usage: - - Appending “(Kubeflow Native)” to the application name: e.g.
SUPER-TRAINER (Kubeflow Native) - - Reference the Kubeflow Native designation in discussion with customers, or on public documentation - - Display a logo (to be designed) on the public website and documentation of the application + - Appending “(Kubeflow Native)” to the application name: e.g. SUPER-TRAINER (Kubeflow Native) + - Reference the Kubeflow Native designation in discussion with customers, or on public documentation + - Display a logo (to be designed) on the public website and documentation of the application - The application may be listed under an application catalog (to be created) under Kubeflow project. - The naming of the application still needs to follow [Kubeflow Brand Guidelines](https://github.com/kubeflow/community/blob/master/KUBEFLOW_BRAND_GUIDELINES.pdf). @@ -96,15 +98,15 @@ The first version of the test verifies that the application is integrated with K Company X creates a Kubernetes Custom Resource for model training, and wishes to certify the feature for Kubeflow Application. Company X needs to: -- Create a Kubeflow Pipelines component for launching the custom resource, with inputs and outputs appropriately defined using parameters and artifacts. The component may be published as a Python function or YAML. -- Add self-attestation to the readme file. -- Runs conformance tool against the Python source, by specifying the source file, and the component function (in the case of Python function). -- Submits the test results to Kubeflow Trademark Team for approval. -- Upon approval, Company X may name the component "X Training for Kubeflow", "X Training " +- Create a Kubeflow Pipelines component for launching the custom resource, with inputs and outputs appropriately defined using parameters and artifacts. The component may be published as a Python function or YAML. +- Add self-attestation to the readme file. +- Runs conformance tool against the Python source, by specifying the source file, and the component function (in the case of Python function). +- Submits the test results to Kubeflow Trademark Team for approval. +- Upon approval, Company X may name the component "X Training for Kubeflow", "X Training " ## Test principles -Kubeflow Application conformance test verifies the component function under test conforms to the Kubeflow Pipelines component definition. +Kubeflow Application conformance test verifies the component function under test conforms to the Kubeflow Pipelines component definition. Proposed CLI: @@ -116,45 +118,48 @@ $ kfp conformance verify-component - file=my_component.yaml The tests verify the component integrates well with the following: -- Kubeflow Pipeline integration: a well defined component interface ensures the Kubeflow Application under test plays well with other Kubeflow Applications. The test will not try to verify functionality or code quality. -- Metadata: Kubeflow Pipelines automatically records the input/output parameters and artifacts in metadata. The tests verify the component interfaces. Kubeflow Application candidates can optionally emit metadata, either by using output_metadata mechanism (to be explained), or some other mechanism added to KFP in the future. Kubeflow Application candidates are encouraged to log additional metadata to MLMD but are not required to do so. +- Kubeflow Pipeline integration: a well defined component interface ensures the Kubeflow Application under test plays well with other Kubeflow Applications. The test will not try to verify functionality or code quality. 
+- Metadata: Kubeflow Pipelines automatically records the input/output parameters and artifacts in metadata. The tests verify the component interfaces. Kubeflow Application candidates can optionally emit metadata, either by using output_metadata mechanism (to be explained), or some other mechanism added to KFP in the future. Kubeflow Application candidates are encouraged to log additional metadata to MLMD but are not required to do so. ## References -- Prior [discussion](https://groups.google.com/g/kubeflow-discuss/c/d6whgEgror8) in Kubeflow community on Kubeflow conformance +- Prior [discussion](https://groups.google.com/g/kubeflow-discuss/c/d6whgEgror8) in Kubeflow community on Kubeflow conformance # Appendix - ## Pipeline tests +## Pipeline tests + +- Pipeline runtime + + - V1 conformance + + - The tests use a designated version of KFP SDK to compile a set of pipelines and submit it to the distribution under test. -- Pipeline runtime - - V1 conformance - - The tests use a designated version of KFP SDK to compile a set of pipelines and submit it to the distribution under test. + - V2 conformance - - V2 conformance - - The tests uses a designated version of KFP SDK that compiles pipeline to [IR](https://docs.google.com/document/d/1PUDuSQ8vmeKSBloli53mp7GIvzekaY7sggg6ywy35Dk/edit) (Intermediate Representation) - - The IR is submitted to the pipeline server to exercise and verify different KFP features. + - The tests uses a designated version of KFP SDK that compiles pipeline to [IR](https://docs.google.com/document/d/1PUDuSQ8vmeKSBloli53mp7GIvzekaY7sggg6ywy35Dk/edit) (Intermediate Representation) + - The IR is submitted to the pipeline server to exercise and verify different KFP features. - - Below is a categorization of the features and is not meant to be exhaustive. - - Artifact and parameter passing - - Caching - - Executors: container / importer / resolver - - Control flow features: ParallelFor, conditional, exit handler - - A subset of Kubernetes features (e.g. secrets, volume) + - Below is a categorization of the features and is not meant to be exhaustive. + - Artifact and parameter passing + - Caching + - Executors: container / importer / resolver + - Control flow features: ParallelFor, conditional, exit handler + - A subset of Kubernetes features (e.g. 
secrets, volume) -- Pipeline management - - Pipeline template management - - Pipeline run management (Get / Delete / Cancel / Archive / Retry) - - Recurring runs +- Pipeline management + - Pipeline template management + - Pipeline run management (Get / Delete / Cancel / Archive / Retry) + - Recurring runs ## Metadata tests -- Metadata lineage - pipeline runs must produce the correct lineage graph including - - Context - - Execution - - Artifact - - Event +- Metadata lineage - pipeline runs must produce the correct lineage graph including -- Metrics - verifies metrics artifacts are produced -- Metadata APIs (future work) + - Context + - Execution + - Artifact + - Event +- Metrics - verifies metrics artifacts are produced +- Metadata APIs (future work) diff --git a/proposals/kfserving_transition.md b/proposals/525-kfserving-transition/README.md similarity index 93% rename from proposals/kfserving_transition.md rename to proposals/525-kfserving-transition/README.md index aacdcce3a..560dfadd7 100644 --- a/proposals/kfserving_transition.md +++ b/proposals/525-kfserving-transition/README.md @@ -1,6 +1,7 @@ -# KFServing Transition to independent Github Organization +# KEP-525: KFServing Transition to independent Github Organization ## Background and Objective + KFServing is a project created initially by Google, Bloomberg, IBM, NVidia, Seldon under Kubeflow in 2019. It aims to provide a standard production grade model serving solution on Kubernetes. After publishing the open source project, we have seen an explosion in demand for the software, leading to strong adoption and community growth. The scope of the project has changed since, we have also developed sub components like ModelMesh, Model Web App which now demands its own github organization. @@ -8,19 +9,23 @@ The KFServing WG members decide to move KFServing development code out of Kubefl the stewardship of [Kubeflow Serving Working Group](https://github.com/kubeflow/community/blob/master/wg-serving/README.md) leads. ## Project Rename + The project is renamed to `KServe` from `KFServing` to retain the connection and brand recognition. ## Scope of the project Rebrand ### API group change for core components -The API group is changed from `serving.kubeflow.org` to `serving.kserve.io`. + +The API group is changed from `serving.kubeflow.org` to `serving.kserve.io`. #### InferenceService Controller + - [Issue 1826:](https://github.com/kserve/kserve/issues/1826) Go module and API group change - Regenerate `InferenceService`, `TrainedModel` CRD with the new API group - Regenerate OpenAPI spec and `swagger.json` #### Python SDK + The Python SDK pypi package is renamed to [kserve](https://pypi.org/project/kserve/) from [kfserving](https://pypi.org/project/kfserving/), see [Issue 1827](https://github.com/kserve/kserve/issues/1827). @@ -30,30 +35,35 @@ see [Issue 1827](https://github.com/kserve/kserve/issues/1827). - Update SDK docs #### Installation Manifests + The KServe control plane is installed in `kserve` namespace instead of `kfserving-system`, see [Issue 1824](https://github.com/kserve/kserve/issues/1824). 
- Update API group for the webhook configurations - Update Standalone/Kubeflow installation manifests overlays #### Development Scripts + - Update quick install script - Update `Makefile` and image patch scripts ### KServe CI/CD #### Prow Setup + Prow is designed for using plugins like `/lgtm` `/approve` and integration with github repo members makes it easy to manage all the projects in a fine-grained way, though these can be implemented using individual github plugins. KServe has setup own Prow cluster installed with `Tide` for the github review and approval process using the `KServe OSS Bot`. #### E2E Tests + For now we reuse the current kubeflow AWS e2e testing infra in the kserve github organization by adding the [configuration](https://github.com/kubeflow/testing/blob/master/aws/GitOps/clusters/optional-test-infra-prow/namespaces/prow/config.yaml#L124) to submit the presubmit job to [KServe Github Repository](https://github.com/kserve/kserve/). -- Update e2e presubmit job to use AWS e2e test Bot +- Update e2e presubmit job to use AWS e2e test Bot - Update e2e test scripts with new SDK package `kserve` #### Github Actions + All the new images released from `KServe` should be published to the [kserve docker hub organization](https://hub.docker.com/u/kserve). - Migrate all existing images from gcr.io to docker hub @@ -62,17 +72,22 @@ All the new images released from `KServe` should be published to the [kserve doc ### Ecosystem #### Kubeflow Integration + We plan to integrate `KServe 0.7` in Kubeflow 1.5 release. #### Kubeflow Pipeline + - Create a new `KServe` component for `Kubeflow Pipeline`, see [issue 1829](https://github.com/kserve/kserve/issues/1829) - Add KServe component to Kubeflow conformance test #### Models WebApp -Separate out models web-app to its own repository and setup CI/CD, see [issue 1820](https://github.com/kserve/kserve/issues/1820). -- Test out the models UI after the name change + +Separate out models web-app to its own repository and setup CI/CD, see [issue 1820](https://github.com/kserve/kserve/issues/1820). + +- Test out the models UI after the name change ### Documentation + All the existing documentation and examples are moving to [kserve/website](https://github.com/kserve/website) which is built with `mkdocs` and the website is hosted on `netlify`. - Update main concept, architecture diagrams @@ -82,17 +97,17 @@ All the existing documentation and examples are moving to [kserve/website](https - Update explanation examples - Update logger and monitoring examples - Update drift detect and outlier examples -- Update notebook examples +- Update notebook examples - Update community and contribution guidelines - ### Migration + For users that are migrating from kfserving, `kserve` installs in its own namespace `kserve`. The migration script scales down the kfserving controller in the cluster, it then converts the `InferenceService CR` from `kubeflow.org` to `kserve.io`, and reconciled in the kserve controller. The migration should not impact the running `InferenceServices`. -![Migration Process](diagrams/kfserving_migration.png) - +![Migration Process](kfserving_migration.png) ### Support + The previous `KFServing 0.5.x` and `KFServing 0.6.x` released are still supported in six months time frame after `KServe 0.7` is released. 
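To illustrate the CR conversion described in the Migration section above, the following is a hypothetical sketch (not the actual migration script) that rewrites the API group of an `InferenceService` manifest from `serving.kubeflow.org` to `serving.kserve.io`; the helper name and the use of PyYAML are assumptions:

```python
import yaml

OLD_GROUP = "serving.kubeflow.org"
NEW_GROUP = "serving.kserve.io"


def convert_inference_service(manifest: str) -> str:
    """Rewrite the apiVersion group of an InferenceService manifest."""
    obj = yaml.safe_load(manifest)
    group, _, version = obj.get("apiVersion", "").partition("/")
    if obj.get("kind") == "InferenceService" and group == OLD_GROUP:
        obj["apiVersion"] = f"{NEW_GROUP}/{version}"
    return yaml.safe_dump(obj)
```

The real migration additionally scales down the old KFServing controller so that converted resources are reconciled only by the KServe controller, as described above.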
diff --git a/proposals/diagrams/kfserving_migration.png b/proposals/525-kfserving-transition/kfserving_migration.png similarity index 100% rename from proposals/diagrams/kfserving_migration.png rename to proposals/525-kfserving-transition/kfserving_migration.png diff --git a/proposals/CONFORMANCE-COMMITTEE.md b/proposals/585-kubeflow-governance/CONFORMANCE-COMMITTEE.md similarity index 100% rename from proposals/CONFORMANCE-COMMITTEE.md rename to proposals/585-kubeflow-governance/CONFORMANCE-COMMITTEE.md diff --git a/proposals/GOVERNANCE.md b/proposals/585-kubeflow-governance/GOVERNANCE.md similarity index 100% rename from proposals/GOVERNANCE.md rename to proposals/585-kubeflow-governance/GOVERNANCE.md diff --git a/proposals/TECH-OVERSIGHT-COMMITTEE.md b/proposals/585-kubeflow-governance/TECH-OVERSIGHT-COMMITTEE.md similarity index 100% rename from proposals/TECH-OVERSIGHT-COMMITTEE.md rename to proposals/585-kubeflow-governance/TECH-OVERSIGHT-COMMITTEE.md diff --git a/proposals/kubeflow-steering-committee-election-proposal.md b/proposals/645-kubeflow-steering-committee-election/README.md similarity index 88% rename from proposals/kubeflow-steering-committee-election-proposal.md rename to proposals/645-kubeflow-steering-committee-election/README.md index 8c90f7b37..511d46f83 100644 --- a/proposals/kubeflow-steering-committee-election-proposal.md +++ b/proposals/645-kubeflow-steering-committee-election/README.md @@ -1,5 +1,4 @@ -# KSC Election Process Proposal - +# KEP-645: KSC Election Process Proposal ## 2 cohorts and Staggered elections @@ -52,32 +51,36 @@ Interim KSC member B ## Election Procedure -### Timeline ### +### Timeline + Steering Committee elections are held annually. 4 weeks or more before the election, the Steering Committee will appoint Election Officer(s) (see below). 2 weeks or more before the election, the Election Officer(s) will issue a call for nominations, publish the list of eligible voters, and open the call for exceptions. One week before the election the call for nominations and exceptions will be closed. The election will be open for voting not less than two weeks and not more than four. The results of the election will be announced within one week of closing the election. New Steering Committee members will take office in January of each year on the date the results are announced. The general timeline is as follows: + - November - - Election officers appointed + - Election officers appointed - December - Election preparation - publish the list of eligible voters, open call for exceptions (open for approximately 1 week) - - Call for nominations (open for approximately 2 weeks) + - Call for nominations (open for approximately 2 weeks) - Testimonial Phase (open for approximately 2 weeks) - Start of election (open for approximately 3 weeks) - January - Conclusion of election - Results announced within one week after the election concludes - - New steering committee members take office in January after the conclusion of the election. + - New steering committee members take office in January after the conclusion of the election. + +### Election Officer(s) -### Election Officer(s) ### 4 weeks or more before the election, the Steering Committee will appoint between one and three Election Officer(s) to administer the election. Elections Officers will be Kubeflow community members in good standing who are eligible to vote, are not running for Steering in that election, who are not currently part of the Steering Committee and can make a public promise of impartiality. 
They will be responsible for: -- Making all announcements associated with the election -- Preparing and distributing electronic ballots -- Judging exception requests -- Assisting candidates in preparing and sharing statements -- Tallying voting results according to the rules in this charter +- Making all announcements associated with the election +- Preparing and distributing electronic ballots +- Judging exception requests +- Assisting candidates in preparing and sharing statements +- Tallying voting results according to the rules in this charter + +### Eligibility to Vote -### Eligibility to Vote ### Anyone who has at least 50 contributions including at least 1 merged PR in the last 12 months is eligible to vote in the Steering election. Contributions are defined as opening PRs, reviewing and commenting on PRs, opening and commenting on issues, writing design docs, commenting on design docs, helping people on slack, participating in working groups, and other efforts that help advance the Kubeflow project. This [dashboard](https://kubeflow.devstats.cncf.io/d/9/developer-activity-counts-by-repository-group-table?orgId=1&var-period_name=Last%20year&var-metric=contributions&var-repogroup_name=All&var-country_name=All) shows only GitHub based contributions and does not capture all the contributions we value. We expect this metric not to capture everyone who should be eligible to vote. If a community member has had significant contributions over the past year but is not captured in the dashboard, they will be able to submit an exception form to the Elections Officer(s) who will then review and determine whether this member should be eligible to vote. All exceptions, and the reasons for them, will be recorded in a log that will be available to Steering and the TOC. @@ -88,44 +91,51 @@ We are committed to an inclusive process and will adapt future eligibility requi If you believe you are eligible to vote but are not listed as an elegible voter [you may file an exception using the exception form](https://forms.gle/epaMrirZCNBztoRz5). -## Candidate Eligibility ## -Community members must be eligible to vote in order to stand for election (this includes voters who qualify for an exception). Candidates may self-nominate or be nominated by another eligible member. There are no term limits for KSC members. Nothing prevents a qualified member from serving on the Kubeflow Steering Committee, Technical Oversight Committee and Conformance Committee simultaneously. +## Candidate Eligibility + +Community members must be eligible to vote in order to stand for election (this includes voters who qualify for an exception). Candidates may self-nominate or be nominated by another eligible member. There are no term limits for KSC members. Nothing prevents a qualified member from serving on the Kubeflow Steering Committee, Technical Oversight Committee and Conformance Committee simultaneously. -If you believe you are eligible to run in this election but are not listed as an eligible nominee candidate [you may file and exception using the exception form](https://forms.gle/epaMrirZCNBztoRz5). +If you believe you are eligible to run in this election but are not listed as an eligible nominee candidate [you may file and exception using the exception form](https://forms.gle/epaMrirZCNBztoRz5). -### Voting Procedure ### -Elections will be held using [Condorcet Internet Voting Service (CIVS)](https://civs1.civs.us/), an online voting tool that is used by many of the CNCF projects and other open-source communities. 
This tool has been running since 2003 and is what the [Elekto tool](https://elekto.dev/) is based on.
+### Voting Procedure
+
+Elections will be held using [Condorcet Internet Voting Service (CIVS)](https://civs1.civs.us/), an online voting tool that is used by many of the CNCF projects and other open-source communities. This tool has been running since 2003 and is what the [Elekto tool](https://elekto.dev/) is based on.

After this first election, the details for the KSC elections will be published in the elections folder. This folder will be set up after the conclusion of the first election.

In the rare case that an election ends in a tie, the Election Officer(s) may ask the tied candidates to resolve the tie (e.g. one or more candidates could decide to withdraw). If the tie cannot be resolved among the tied candidates, a runoff election will be conducted. If the runoff election also ends in a tie, the winner(s) will be selected at random, with equal weight given to each runoff candidate.

-### Limitations on Company Representation ###
+### Limitations on Company Representation
+
No more than two seats may be held by employees of the same organization (or conglomerate, in the case of companies owning each other). If the results of an election would leave more than two employees of the same organization on the committee, the lowest vote getters in the current election from that employer will be removed until representation on the committee is down to two.

-In the staggered election schedule, if a particular organization already has two seats among the rotation not affected by the election, no candidates from that organization will be selected by the election. If the organization wants to change its representation in KSC, one or more members from that organization needs to stand down from KSC, which will trigger a "resignation" event as explained below. There is no guarantee that vacancy created will be filled by the organization's candidate.
+In the staggered election schedule, if a particular organization already holds two seats in the cohort not affected by the election, no candidates from that organization will be selected in the election. If the organization wants to change its representation in KSC, one or more members from that organization need to stand down from KSC, which will trigger a "resignation" event as explained below. There is no guarantee that the vacancy created will be filled by the organization's candidate.

If employers change because of job changes, acquisitions, or other events, in a way that would yield more than 2 seats being held by employees of the same organization, sufficient members of the committee must resign until only two employees of the same employer are left. If it is impossible to find sufficient members to resign, all employees of that organization will be removed and new special elections held. In the event of a question of company membership (for example, evaluating the independence of corporate subsidiaries), a majority of all non-involved Steering Committee members will decide.

-#### Changes to take effect in 2025 election and beyond ####
+#### Changes to take effect in 2025 election and beyond
+
No more than one seat may be held by employees of the same organization. Since KSC is a relatively small committee with 5 members, this rule was introduced to encourage diversity of representation in KSC. Exception: The 2024 election result may produce an outcome where the elected 2-member cohort comes from the same organization.
In such scenario, the 2-member cohort may serve their full term of 2 years. -### Vacancies ### -In the event of a resignation or other loss of an elected committee member, the next most preferred candidate from the previous election will be offered the seat. +### Vacancies + +In the event of a resignation or other loss of an elected committee member, the next most preferred candidate from the previous election will be offered the seat. A maximum of one (1) committee member may be selected this way between elections. -In case this fails to fill the seat, a special election for that position will be held as soon as possible. +In case this fails to fill the seat, a special election for that position will be held as soon as possible. Eligible voters from the most recent election will vote in the special election i.e., eligibility will not be redetermined at the time of the special election. -A committee member elected in a special election will serve out the remainder of the term for the person they are replacing, regardless of the length of that remainder. +A committee member elected in a special election will serve out the remainder of the term for the person they are replacing, regardless of the length of that remainder. + +### Resignation -### Resignation ### If a committee member chooses not to continue in their role, for whatever self-elected reason, they must notify the committee in writing. -### Removal - No confidence ### +### Removal - No confidence + A Steering Committee member may be removed by an affirmative vote of four of five members. The call for a vote of no confidence will happen in a public Steering Committee meeting and must be documented as a GitHub issue in the committee's repository. @@ -135,14 +145,14 @@ The committee member who calls for the vote will prepare a statement which provi Once a vote of no confidence has been called, the committee will notify the community through the following channels: -- the community mailing list -- the community slack channel -- In the next Kubeflow Community Meeting +- the community mailing list +- the community slack channel +- In the next Kubeflow Community Meeting This notification will include: -- a link to the aforementioned GitHub issue -- the statement providing context on the reason for the vote +- a link to the aforementioned GitHub issue +- the statement providing context on the reason for the vote There will be a period of two weeks for members of the community to reach out to Steering Committee members to provide feedback. diff --git a/proposals/spark-operator-adoption.md b/proposals/648-spark-operator/README.md similarity index 97% rename from proposals/spark-operator-adoption.md rename to proposals/648-spark-operator/README.md index 1fd3e287d..2d922c14c 100644 --- a/proposals/spark-operator-adoption.md +++ b/proposals/648-spark-operator/README.md @@ -1,4 +1,4 @@ -# Adoption of Spark Kubernetes Operator in Kubeflow +# KEP-648: Adoption of Spark Kubernetes Operator in Kubeflow Original doc: https://docs.google.com/document/d/1rCPEBQZPKnk0m7kcA5aHPf0fISl0MTAzsa4Wg3dfs5M/edit @@ -36,7 +36,7 @@ Marcin from Google confirmed their willingness to donate the project to either o Spark already has a lot of AI/ML use-cases and the Kubeflow ecosystem can help to address those. 
According to the recent Kubeflow survey, Spark is one of the most popular frameworks for Kubeflow users: -![Kubeflow Survey 2022](diagrams/kubeflow-survey-2022.png) +![Kubeflow Survey 2022](kubeflow-survey-2022.png) ## Benefits for Kubeflow @@ -44,7 +44,7 @@ The following diagram shows the main components of Kubeflow. Notebooks for inter Training Operator for distributed ML Training, Katib for HyperParameter Tuning, KServe for Model Serving, Pipelines for ML pipelines, and Profiles to create Kubeflow user profiles. -![Kubeflow Overview](diagrams/kubeflow-overview.png) +![Kubeflow Overview](kubeflow-overview.png) Today, Kubeflow doesn’t have any component for Data Preparation which is an essential step for MLOps lifecycle. Spark is one of the best and most-used frameworks for Data Preparation, @@ -76,7 +76,7 @@ the Training Workers from Spark. We can leverage [Apache Arrow Flight](https://a to store data in-memory and use it in the job workers. We can propose API and controller changes in Spark Operator and Training Operator to support it. -![Spark to Training Operator](diagrams/spark-to-training.png) +![Spark to Training Operator](spark-to-training.png) ### Spark Operator & Kubeflow Distributions diff --git a/proposals/diagrams/kubeflow-overview.png b/proposals/648-spark-operator/kubeflow-overview.png similarity index 100% rename from proposals/diagrams/kubeflow-overview.png rename to proposals/648-spark-operator/kubeflow-overview.png diff --git a/proposals/diagrams/kubeflow-survey-2022.png b/proposals/648-spark-operator/kubeflow-survey-2022.png similarity index 100% rename from proposals/diagrams/kubeflow-survey-2022.png rename to proposals/648-spark-operator/kubeflow-survey-2022.png diff --git a/proposals/diagrams/spark-to-training.png b/proposals/648-spark-operator/spark-to-training.png similarity index 100% rename from proposals/diagrams/spark-to-training.png rename to proposals/648-spark-operator/spark-to-training.png diff --git a/proposals/649-kubeflow-helm-support/README.MD b/proposals/649-kubeflow-helm-support/README.MD new file mode 100644 index 000000000..03284329d --- /dev/null +++ b/proposals/649-kubeflow-helm-support/README.MD @@ -0,0 +1,454 @@ +# 649-Kubeflow-Helm-Support: Support Helm as an Alternative for Kustomize + + +The demand for a Helm chart for a basic Kubeflow installation has increased. Given the KSC's stance in issue 821 on neutral deployment language and user-defined production readiness, this is an opportune time to introduce a Helm chart. Supporting Helm will enhance ease of adoption and simplify deployments while maintaining the flexibility of community-maintained manifests. There have already been community efforts as well as the Kubeflow-Helm-Chart Slack Channel. + + +## Summary + + +Kubeflow manifests provide a fast way to deploy a minimal Kubeflow platform, with best-effort community support. For guaranteed assistance, users can opt for third-party distributions, consultants, or self-managed expertise. This approach extends to Helm chart support. Contributions and bug reports are encouraged, but no support will be guaranteed. The goal is to build a similar folder structure as Argo for Kubeflow Helm charts. + + +## Motivation + + +Currently, because Kubeflow/manifests are based on Kustomize, many potential users and companies that require Helm charts due to company processes/policies have to rely on third-party distributions. While these options are valuable, they require engagement with adjacent projects and communities. 
+ + +As a project, we must ensure that our Helm chart provides a quick and accessible way for users to deploy a complete Kubeflow platform and individual components, enabling them to manage their environments or adopt a vendor solution. + + +Simplifying Kubeflow deployment lowers the barrier to entry, increases adoption, and encourages contributions. Just as Kubernetes enabled a new wave of cloud-native startups, a neutral, accessible deployment path can empower AI/ML startups to leverage tools like the Training Operator or Katib without reinventing common patterns. If support becomes burdensome, teams can hire expertise or use a distribution—both of which drive demand for Kubeflow skills. + + +By making deployment easy, we attract more end users and foster collaboration with broader communities like PyTorch, improving our implementations in service of their users. + +### About Helm + +- Helm is a graduated project in the CNCF and is maintained by the Helm community. +- Helm is supported by and built with a community of over 400 developers. +- Helm helps you manage Kubernetes applications, rollbacks, updates, dependencies, and releases, and the most complex Kubernetes applications. [More about Helm](https://helm.sh/) + +## Value to the Community + +- Streamline the deployment, upgrade, and rollback of the kubeflow installation process +- Provide another way to install Kubeflow to give the community more options, promoting flexibility and choice + +## Goals + + +✅ A fully functional Kubeflow Helm chart for the targeted release. This will install Kubeflow as a platform and as an individual component. +✅ Published Helm chart documentation with straightforward and uncomplicated configuration options. +✅ A step-by-step tutorial simplifying Kubeflow deployment for users. +✅ Contribution to the Kubeflow community effort for Helm-based installation as part of the official Kubeflow repository. +✅ Where possible, rely on upstream Helm charts (i.e., KServe/Istio). +✅ Consolidate community Helm efforts and prevent duplicate efforts. + + +## Non-Goals +* Deep integration with hyperscalers such as AWS managed databases. Nevertheless for example basic Dex/oauth2-proxy configuration for authentication integration with popular Kubernetes platforms such as EKS is a goal, because it is needed for M2M authentication within Kubeflow. +* Infrastructure provisioning. Users can opt into OpenTofu or Crossplane to template the Helm chart with infrastructure. Still, we are focused on the Helm chart where the values would be configured (if a component requires knowledge of external systems). +* Support separate abstracted operators for components. Helm will handle upgrades. +* Provide guaranteed community support. +* Define production for any particular set of users. + + +## Proposal +This proposal introduces official Helm chart support for deploying Kubeflow. The goal is to provide a modular, community-maintained method for installing and managing Kubeflow, making it more accessible for users who prefer Helm over Kustomize-based manifests. + + +## Desired Outcome + + +The Helm chart will allow users to: + + +✅ Deploy Kubeflow with a single Helm command, reducing installation complexity. +✅ Select specific components to install (e.g., Training Operator, Katib, Pipelines) without requiring the entire Kubeflow stack. +✅ Configure installations via Helm values, enabling customization for different environments (e.g., resource allocation, authentication settings, storage options). 
+✅ Upgrade and rollback Kubeflow deployments safely using Helm’s built-in version control. +✅ Integrate with GitOps workflows (e.g., ArgoCD, FluxCD) for automated deployments. +✅ Maintain a Helm chart structure similar to Argo’s, ensuring a familiar experience for Kubernetes users. + + +## Measuring Success + + +### Adoption Metrics + + +* Number of Helm chart downloads from the official Kubeflow repository. +* Community contributions to Helm chart improvements. +* Ease of Use and Community Engagement. +* Successful deployments reported by users via GitHub issues, Slack, and forums. +* Documentation feedback and tutorial completion rates. +* Metrics to compare downloads from kustomize vs Helm charts. + + +### Modularity and Customization + + +* Verified Helm installations of the platform and individual components. +* Flexibility demonstrated in community-reported use cases (e.g., deploying only the Training Operator). + + +### Stability and Maintainability + + +* Helm-based deployments function consistently across Kind, Minikube, AKS, EKS, GKE, Rancher and OpenShift. +* Contributions and maintenance of Helm charts remain sustainable within the Kubeflow community. + + +### User Stories (Optional) + + +#### Alex Conquers Kubeflow + + +**Background** + + +Alex is an ML engineer working at a mid-sized AI startup. The team wants to experiment with Kubeflow Pipelines and Katib for hyperparameter tuning but doesn’t need the full Kubeflow stack. Currently, deploying Kubeflow using Kustomize manifests feels cumbersome and requires significant manual effort and maintenance. + + +**Scenario:** + + +Alex needs a fast and repeatable way to deploy only the necessary Kubeflow components while keeping the installation manageable and configurable. + + +**Steps & Experience:** + + +**Discovering Helm Support for Kubeflow** +Alex reads the updated Kubeflow documentation and finds that Helm is now an official installation method. The documentation provides a simple command to install only the necessary components. + + +**Deploying Kubeflow with Helm** +Alex runs a command to install only Kubeflow Pipelines and Katib. The Helm chart automatically handles dependencies and namespace creation, reducing manual steps. Within minutes, the required services are running in the Kubernetes cluster. + + +**Customizing the Deployment** +Alex configures resource limits and storage settings by modifying the Helm values file. + + +**Scaling and Managing the Deployment** +Later, the team decides to add the Training Operator. Instead of redeploying everything, Alex simply enables it. Helm seamlessly applies the changes, avoiding disruption to the existing setup. + + +**Rolling Back** +A misconfiguration in values.yaml causes an issue. Instead of debugging manually, Alex rolls back to the previous working state. + + +**Outcome & Value:** + + +✅ Fast, modular deployment – No need to install unnecessary components. +✅ Easy configuration – Fine-tune installations using Helm values. +✅ Smooth upgrades and rollbacks – No more breaking changes due to manual YAML edits. +✅ Better DevOps integration – Fits naturally into the team’s GitOps workflow with tools like ArgoCD. + +**Managing updates** +Alex can easily install new updates to include new release versions and update dependencies. + +##### Alex's Outcomes: Easily Deploy Kubeflow Using Helm +Alex could deploy only the necessary Kubeflow components using Helm, avoiding the complexity of managing Kustomize-based manifests. 
Alex installed Kubeflow Pipelines and Katib by running a single command, making the deployment process fast, modular, and repeatable. + + +**Customize Deployments with Helm Values.** +Alex configured the Kubeflow deployment using Helm values, fine-tuning resource limits and storage settings without modifying raw YAML files. By adjusting values.yaml, Alex was able to: + + +* Enable Pipelines and Katib while keeping other components disabled. +* Set up a custom storage backend for Kubeflow Pipelines. +* Adjust CPU and memory limits for Katib experiments. + + +These changes were seamlessly applied with a Helm upgrade, making the system highly customizable and adaptable. + + +**Use Helm’s Standardized Package Management Features** +Alex leveraged Helm’s built-in lifecycle management to ensure a smooth deployment experience: + + +* When a misconfiguration caused an issue, Alex instantly rolled back to a stable deployment using Helm’s versioning feature. +* Helm automatically handled dependencies, ensuring Pipelines and Katib were installed correctly without manual intervention. +* As new versions of Kubeflow components were released, Alex could upgrade seamlessly without reinstalling everything. + + +**Deploy Individual Kubeflow Components** +Since Alex’s team only needed Kubeflow Pipelines and Katib, they didn’t have to deploy the entire Kubeflow stack. Instead, Helm allowed them to deploy only the necessary components, keeping the cluster lightweight and resource-efficient. + + +**Drive Adoption** +As someone new to Kubeflow, Alex benefited from clear documentation and a step-by-step guide for deploying components with Helm. Instead of spending hours understanding manifests and dependencies, Alex got Kubeflow running in minutes. The modular Helm-based approach made it easy for the team to evaluate Kubeflow without committing to a complex setup. + + +**Contribute to and Extend the Helm Chart.** +Alex’s organization saw value in Helm-based deployment and wanted to contribute improvements back to the community. Following a structured approach similar to Argo’s Helm charts, the team could extend the charts to support their infrastructure needs while sharing their updates with the wider Kubeflow community. + + +Thanks to Helm, Kubeflow deployment became effortless, modular, and scalable—allowing Alex’s team to focus on building ML workflows instead of dealing with infrastructure complexity. Alex will get feedback from his ML team using Kubeflow and motivate them to contribute improvements and feature requests to enhance the Kubeflow ecosystem. + + +### Notes/Constraints/Caveats +Alex may choose to use vanilla manifests or go with a vendor. The goal is not to be a distribution but still a part of Kubeflow/manifests. As the appetite for community support grows, the scope may expand, but for now, this is just a simple way to get Kubeflow running using a well-known deployment pattern and provide examples of how to use it. + + +## Risks and Mitigations + + +1. Fragmentation of Deployment Methods + + +**Risk:** Introducing Helm charts as an official deployment method alongside Kustomize may create fragmentation within the Kubeflow ecosystem, leading to confusion between Helm-based, Kustomize-based, and third-party deployment tools. + + +**Mitigation:** +* Clearly position Helm as an alternative to Kustomize, rather than a replacement. +* Maintain alignment with existing manifests, ensuring Helm charts remain consistent with official Kubeflow components. 
+* Provide comprehensive documentation comparing Helm, Kustomize, and third-party solutions. + + +2. Maintenance Burden and Long-Term Support + + +**Risk:** Maintaining a Helm chart requires ongoing updates as Kubeflow components evolve, which could become a burden if not adequately resourced. + + +**Mitigation:** +* Adopt a community-driven maintenance model, similar to Argo’s Helm charts. +* Establish clear ownership within the Kubeflow community and define a process for versioning and deprecating charts. +* Regularly sync Helm charts with upstream manifests to prevent drift. + + +3. Security Considerations + + +**Risk:** Misconfigured Helm deployments could introduce security vulnerabilities, such as exposed services, weak authentication, or misconfigured role-based access control (RBAC). + + +**Mitigation:** +* Follow Kubernetes security best practices, ensuring charts include secure default configurations. +* Conduct security reviews as part of Kubeflow’s release cycle. +* Provide Helm values presets for secure and production-ready configurations. + + +## Design Details + + +### Helm Chart Structure + + +The repository will contain a root Helm chart (kubeflow) that acts as an umbrella for subcharts: + + +``` +kubeflow/manifests/experimental/helm +│── charts/ +│ │── training-operator/ +│ │── katib/ +│ │── pipelines/ +│ │── istio/ +│ │── profiles/ +│ │── common/ +│ │── kserve/ +│── templates/ +│── values.yaml +│── Chart.yaml +│── README.md +``` + + +* The root kubeflow chart will manage dependencies and shared configurations. +* Subcharts for each component (training-operator, katib, pipelines, etc.) allow independent deployments. + + +### Example Helm Chart Configuration (values.yaml) + + +```yaml +# Global settings +global: + namespace: kubeflow + istio: + enabled: true + + +# Enable/Disable specific components +pipelines: + enabled: true + mysql: + persistence: + storageClass: gp2 + size: 10Gi + + +katib: + enabled: false + + +training-operator: + enabled: true + resources: + limits: + cpu: "2" + memory: "4Gi" +``` + + +### Installation + + +To install Kubeflow Pipelines and the Training Operator only: + + +``` +helm install kubeflow ./kubeflow-helm-chart --set pipelines.enabled=true --set training-operator.enabled=true +``` + + +To enable Katib after the initial installation: + + +``` +helm upgrade kubeflow ./kubeflow-helm-chart --set katib.enabled=true +``` + + +To roll back a deployment: + + +``` +helm rollback kubeflow 1 +``` + + +### Security and Default Configurations + + +To ensure secure and production-ready deployments, the Helm chart will include: + + +* Minimal privileges using Role-Based Access Control (RBAC) and Pod Security Standards restricted. +* Network policies to restrict component communication where necessary. +* Secure default values, with optional overrides for users needing customization. + + +Example RBAC template (templates/rbac.yaml): + + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: kubeflow-pipelines-role +rules: +- apiGroups: ["kubeflow.org"] + resources: ["pipelines"] + verbs: ["get", "list", "watch"] +``` + + +### Implementation Plan + + +* Create a subdirectory (kubeflow/manifests/experimental/helm) to host the Helm charts. +* Define the Helm chart structure, ensuring compatibility, synchronizability, and a single source of truth with the Kustomize manifests. +* Develop subcharts for each major Kubeflow component (Pipelines, Platform, Notebooks, Dashboard, Katib, Training Operator, Istio, etc.) 
and make them deployable together to replicate the Kustomize manifests.
+* Write Helm values documentation, including examples for different environments (e.g., Dex vs. oauth2-proxy authentication).
+* Test the Helm charts the same way we test the Kustomize manifests.
+* Engage the Kubeflow community for feedback and contributions.
+
+
+### Test Plan
+
+1. Unit Testing for Helm Templates
+Each Helm template will be tested using Helm unit testing frameworks and tools such as:
+* helm-unittest – Validates Helm templates using YAML-based test cases.
+* `helm template` – Ensures templates render correctly without errors.
+* This shall happen in the same GitHub Actions workflows used for Kustomize, since the rendered output should stay the same.
+
+
+2. Linting and Static Analysis
+* Helm Linting (`helm lint`) – Ensures best practices in chart structure and values.
+* YAML Schema Validation – Ensures manifests follow correct Kubernetes API specifications.
+* Kubeval & Kubeconform – Validate Kubernetes resources before applying them.
+* We already have this repository-wide and can simply add Helm linting.
+
+
+3. End-to-End Testing with CI/CD Pipelines
+* Tests will validate component functionality post-deployment (e.g., the Pipelines UI loads, Katib runs experiments).
+* We shall reuse/extend the Kustomize tests; if the rendered output is the same, we can test Helm and Kustomize simultaneously in the same GitHub Actions workflows.
+
+
+4. Community Testing and User Feedback
+* Early adopters will be encouraged to test pre-release Helm charts and provide feedback via GitHub issues and the Kubeflow Slack #kubeflow-helm-chart channel.
+* A beta phase will allow broader testing before an official Helm release.
+
+
+[x] I/we understand the components' owners may require updates to existing tests to make this code solid before committing the changes necessary to implement this enhancement.
+
+
+#### Prerequisite Testing Updates
+
+Since this is a replication of the Kustomize manifests, most of the testing infrastructure is already in place.
+
+
+#### E2E Tests
+
+The end-to-end tests will be very similar to and based on the ones we have for the Kustomize manifests.
+
+
+#### Integration Tests
+
+The integration tests will be very similar to and based on the ones we have for the Kustomize manifests.
+
+
+### Graduation Criteria
+
+Reach feature parity with the Kustomize manifests.
+
+
+## Implementation History
+
+
+
+## Drawbacks
+
+Potential drawbacks include:
+* Users may expect Helm charts to be fully "production ready" and engage the community for out-of-scope support/contributions.
+* Helm chart complexity may become burdensome to manage and strain community resources.
+* Helm may have unforeseen limitations.
+
+
+## Alternatives
+
+### Glasskube
+[Glasskube](https://github.com/glasskube/glasskube) was initially explored as a potential way to improve our deployment. That community [has made an effort](https://glasskube.dev/blog/kubeflow-setup-guide/), but we've yet to see much traction. Their implementation is not as widely adopted, and we may struggle to find contributors. Should the Glasskube community build a Kubeflow distribution/installation method, we'd gladly support them in this effort, but we have not seen a push for Glasskube like we've seen for Helm.
+
+
+### KPT
+The [GCP Distribution](https://googlecloudplatform.github.io/kubeflow-gke-docs/docs/) uses KPT, but KPT is not as easily integrated with upstream communities that have standardized on Helm. We'd need to consider the upstream vanilla manifests and how to use KPT with them.
+ + +### Crossplane +[The Crossaplane project](https://www.crossplane.io/) could be used to template manifests with a higher-level manifest, but that project is more suited for templating and infrastructure management. We've yet to see any community traction for a Crossplane-powered Kubeflow distribution; therefore, resourcing may be difficult and could lead to a longer lead time versus using Helm. + + + + + diff --git a/proposals/NNNN-template/README.md b/proposals/NNNN-template/README.md new file mode 100644 index 000000000..a8481be20 --- /dev/null +++ b/proposals/NNNN-template/README.md @@ -0,0 +1,199 @@ +# KEP-NNNN: Your short, descriptive title + + + +## Summary + + + +## Motivation + + + +### Goals + + + +### Non-Goals + + + +## Proposal + + + +### User Stories (Optional) + + + +#### Story 1 + +#### Story 2 + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + + + +## Design Details + + + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +#### Prerequisite testing updates + + + +#### Unit Tests + + + + + +- ``: `` - `` + +#### E2E tests + + + +#### Integration tests + + + +### Graduation Criteria + + + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + diff --git a/proposals/README.md b/proposals/README.md new file mode 100644 index 000000000..e0ff02b35 --- /dev/null +++ b/proposals/README.md @@ -0,0 +1,122 @@ +# KEP Format to Propose and Document Enhancements to Kubeflow + +## Summary + +A standardized development process for Kubeflow is recommended, when the scope is sufficiently large, in order to: + +- provide a common structure for proposing changes to Kubeflow +- ensure that the motivation for a change is clear +- allow for the enumeration of stability milestones and stability graduation + criteria +- persist project information in a Version Control System (VCS) for future + contributors +- support the creation of _high-value, user-facing_ information such as: + - an overall project development roadmap + - motivation for impactful user-facing changes +- reserve GitHub issues for tracking work in flight, instead of creating "umbrella" + issues +- ensure community participants can successfully drive changes to + completion across one or more releases while stakeholders are adequately + represented throughout the process + +This process is supported by a unit of work of the Kubeflow Enhancement Proposal format, or KEP for short. +A KEP attempts to combine aspects of + +- a feature, and effort-tracking document +- a product requirements document +- a design document + +into one file. + +We are proposing usage of KEPs in Kubeflow to be better aligned with the wider Kubernetes community. + +This proposal takes heavy inspiration from the [Kubernetes KEP Proposal](https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/0000-kep-process/README.md) + +## Motivation + +Documenting proposals and enhancements are critical in managing the complexity and longevity of software projects. + +Some examples include: + +1. Capturing the "why" behind critical architectural choices, not just the what. + 1. This helps current and future team members understand the reasoning behind decisions, particularly when the rationale is no longer obvious. +2. Improve Communication and Collaboration. +3. Serve as a single source of truth for architectural decisions. +4. 
By documenting options and their trade-offs, KEPs encourage structured decision-making and transparency. +5. Enable Traceability +6. Create a decision history that allows new (and old) contributors to trace architectural choices back to their original context, assumptions, and goals + +### Goals + +- Capture the Why Behind Decisions +- Foster Clear Communication +- Enable Decision Traceability +- Encourage Thoughtful, Deliberate Decisions +- Preserve Institutional Knowledge + +### Non-Goals + +- Not a substitute for technical or user documentation +- Not a substitute or replacement for meaningful commit messages + +## Proposal + +A KEP is broken into sections which can be merged into source control +incrementally in order to support an iterative development process. The KEP process +is intended to create high-quality, uniform design and implementation documents +for maintainers to deliberate. Contributors proposing new KEPs should create a copy +of the [template directory](./NNNN-template) and proceed as per the instructions in the template. + +## Reference-Level Explanation + +### What Type of Work Should Be Tracked by a KEP + +Roughly any user or operator facing enhancement should follow the KEP process. +If an enhancement would be described in either written or verbal communication +to anyone besides the KEP author or developer, then consider creating a KEP. + +Similarly, any technical effort (refactoring, major architectural change) that +will significantly impact a large section of the development community should also be +communicated widely. The KEP process is suited for this even if it will have +zero impact on the typical user or operator. + +### KEP Workflow + +A KEP has the following states: + +- `provisional`: The KEP has been proposed and is actively being defined. + This is the starting state while the KEP is being fleshed out and actively defined and discussed. + The owning SIG has accepted that this work must be done. +- `implementable`: The approvers have approved this KEP for implementation. +- `implemented`: The KEP has been implemented and is no longer actively changed. +- `deferred`: The KEP is proposed but not actively being worked on. +- `rejected`: The approvers and authors have decided that this KEP is not moving forward. + The KEP is kept around as a historical document. +- `withdrawn`: The authors have withdrawn the KEP. +- `replaced`: The KEP has been replaced by a new KEP. + The `superseded-by` metadata value should point to the new KEP. + +Some authors may prefer to prepare a Google Doc (or similar) before creating a KEP on GitHub. +While not a required part of the process, it may be useful to quickly gather initial community feedback. + +We strongly advise KEP authors to present their KEPs in relevant Kubeflow Working Group meetings or the +wider community meeting as relevant. This will help gather community feedback and bring visibility to +upcoming changes + +### Git and GitHub Implementation + +KEPs are checked into the component repo under the `proposals` directory. Note that there isn't yet a standard for where this directory is located. +Some components locate their KEPS at `./docs/proposals`, whereas other components have their KEPS located at `./proposals` + +KEPs affecting multiple Kubeflow projects that do not fit into existing cross-component projects such as `kubeflow/manifests` should be created under `kubeflow/community` + +- New KEPs can be checked in with a file name in the form of `XYZ-my-title.md`, where XYZ is a KEP number. 
+- As significant work is done on the KEP, the authors can assign a KEP number. + - If there isn't already a tracking issue for the KEP in the approriate repository, create it. This issue number should then be used as the KEP number +- No other changes should be put in that PR so that it can be approved quickly and minimize merge conflicts. +- The KEP number can also be done as part of the initial submission if the PR is likely to be uncontested and merged quickly. + +### Prior Art + +Our usage of the KEP process (and most of this file) is almost entirely based on the +[Kubernetes KEP Process](https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/0000-kep-process/README.md) diff --git a/proposals/TEMPLATE.md b/proposals/TEMPLATE.md deleted file mode 100644 index d2da5416d..000000000 --- a/proposals/TEMPLATE.md +++ /dev/null @@ -1,19 +0,0 @@ -> Note to proposers: Please keep this document as brief as possible, preferably not more than two pages. - -## Motivation -A high level description of the problem or opportunity being addressed. - -## Goals -Specific new functionalities or other changes. - -## Non-Goals -Issues or changes not being addressed by this proposal. - -## UI or API -New interfaces or changes to existing interfaces. Backward compatibility must be considered. - -## Design -Description of new software design and any major changes to existing software. Should include figures or diagrams where appropriate. - -## Alternatives Considered -Description of possible alternative solutions and the reasons they were not chosen. diff --git a/proposals/issue_triage.md b/proposals/issue_triage.md deleted file mode 100644 index 2e1b584b7..000000000 --- a/proposals/issue_triage.md +++ /dev/null @@ -1,92 +0,0 @@ -# Kubeflow Issue Triage - -## TL;DR - -The purpose of this doc is to define a process for triaging Kubeflow issues. - -## Objectives - -* Establish well accepted criterion for determining whether issues have been triaged -* Establish a process for ensuring issues are triaged in a timely fashion -* Define metrics for measuring whether we are keeping up with issues - -## Triage Conditions - -The following are necessary and sufficient conditions for an issue to be considered triaged. - -* The issue must have a label indicating which one of the following kinds of issues it is - - * **bug** - * Something is not working as intended in general. - * **question** - * Clear question statement - * Something is not working as intended in author's specific use case and he/she doesn't know why. - * **feature** - * Everything is working as intended, but could be better (i.e more user friendly) - * **process** - * Typically used to leave a paper trail for updating Kubeflow infrastructure. It helps to track the changes to infrastructure for easy debugging in the future. - -* The issue must have at least one [area or platform label](https://github.com/kubeflow/community/blob/master/labels-owners.yaml) grouping related issues and relevant owners. - -* The issue must have a priority attached to it. Here is a guideline for priority - - * **P0** - Urgent - Work must begin immediately to fix with a patch release: - * Bugs that state that something is really broken and not working as intended. - * Features/improvements that are blocking the next release. - * **P1** - Rush - Work must be scheduled to assure issue will be fixed in the next release. - * **P2** - Low - Never blocks a release, assigned to a relevant project backlog if applicable. 
- * **P3** - Very Low - Non-critical or cosmetic issues that could and probably should eventually be fixed but have no specific schedule, assigned to a relavant project backlog if applicable. - -* **P0** & **P1** issues must be attached to a Kanban board corresponding to the release it is targeting - -## Process - -1. Global triagers are responsible for ensuring new issues have an area or platform label - - * A weekly rotation will be established to designate a primary person to apply initial triage - - * Once issues have an area/platform label they should be moved into the appropriate [column "Assigned to Area Owners"](https://github.com/orgs/kubeflow/projects/26#column-7382310) in the Needs Triage Kanban board - - * There is an open issue [kubeflow/code-intelligence#72](https://github.com/kubeflow/code-intelligence/issues/72) to do this automatically - -1. Area/Platform owners are responsible for ensuring issues in their area are triaged - - * The oncall will attempt to satisfy the above criterion or reassign to an appropriate WG if there is some question - -## Tooling - -* The [Needs Triage](https://github.com/orgs/kubeflow/projects/26) Kanban board will be used to track issues that need triage - - * Cards will be setup to monitor various issues; e.g. issues requiring discussion by various WG's - -* The [GitHub Issue Triage action](https://github.com/kubeflow/code-intelligence/tree/master/Issue_Triage/action) can be used to - automatically add/remove issues from the Kanban board depending on whether they need triage or not - - * Follow the [instructions](https://github.com/kubeflow/code-intelligence/tree/master/Issue_Triage/action#installing-the-action-on-a-repository) to install the GitHub action on a repository - -* The [triage notebook](https://github.com/kubeflow/code-intelligence/blob/master/py/code_intelligence/triage.ipynb) can be used to generate reports about number of untriaged issues as well as find issues needing triage - -## Become a contributor - -* Make sure that you have enough permissions to assign labels to an issue and add it to a project. -* In order to get permissions, open a PR to add yourself to [project-maintainers](https://github.com/kubeflow/internal-acls/blob/4e44f623ea4df32132b2e8a973ed0f0dce4f4139/github-orgs/kubeflow/org.yaml#L389) group. - -## Triage guideline - -* Take an issue from "Needs Triage" project and open it in a new tab. -* Carefully read the description. -* Carefully read all comments below. (Some issues might be already resolved). -* Make sure that issue is still relevant. (Some issues might be open for months and still be relevant to current Kubeflow release whereas some might be outdated and can be closed). -* Ping one of the issue repliers if he/she is not replying for a while. -* Make sure that all triage conditions are satisfied. - -## Metrics - -We would like to begin to collect and track the following metrics - -* Time to triage issues -* Issue volume - -## References - -* [kubeflow/community](https://github.com/kubeflow/community/issues/280) diff --git a/proposals/new-project-join-process.md b/proposals/new-project-join-process.md new file mode 100644 index 000000000..ad6f4f21d --- /dev/null +++ b/proposals/new-project-join-process.md @@ -0,0 +1,79 @@ +# KEP-748: Expanding the Kubeflow Ecosystem with a New OSS Project + +## Summary +This KEP outlines how OSS projects can join the Kubeflow Ecosystem. 
+
+
+Note: This process will follow the Kubeflow Steering Committee's [Normal decision process](../KUBEFLOW-STEERING-COMMITTEE.md#normal-decision-process).
+
+## Motivation
+As Kubeflow has become a well-established ecosystem and community, several
+projects may want to join the Kubeflow ecosystem to explicitly be a part of our
+community.
+
+Kubeflow's goal is to cover the entire AI/ML lifecycle, and new projects can help
+address missing stages in that lifecycle.
+
+### Goals
+The goal of this process is to give community members clear guidelines and expectations
+about how a project is formally included in the Kubeflow Ecosystem and what the
+application process involves.
+
+The decision-making process will be separate from the application process and is at the
+discretion of the Kubeflow Steering Committee. The application and the data provided are
+critical for the KSC to make an informed decision that is best for the longevity
+of the project and community.
+
+### Non-Goals
+- Giving specific recommendations for evaluating any individual project.
+- Supporting project add-ons.
+
+## Proposal
+The process to join the Kubeflow Ecosystem is intended to be simple but thorough.
+
+Project owners or maintainers apply to join by following the steps outlined below:
+
+1. Create a GitHub Issue with a Google Document outlining your proposal (please allow for commentary); the document should have a rough outline with:
+   - Authors
+   - Motivation
+   - Benefits for Kubeflow
+   - Benefits for the Project's Community
+   - Community Metrics
+   - Contributor Metrics
+   - Maintainers
+   - Migration Plan
+   - Other Related Projects
+2. Provide a demo during the Kubeflow Community Call.
+3. Submit a Pull Request with the [application form](../proposals/new-project-join-process.md).
+4. Add your proposal to the Kubeflow Community Call agenda to introduce it and collect feedback on
+the application.
+5. Work with the Kubeflow Outreach Committee to send an announcement email to `kubeflow-discuss` and publish messages on Slack, LinkedIn, X/Twitter, and other Kubeflow social resources.
+6. Schedule a meeting with the Kubeflow Steering Committee for the initial vote and to collect feedback.
+7. Identify the appropriate Kubeflow Working Group that should control the project.
+8. Merge or close the Pull Request depending upon the outcome of the final vote.
+
+### Notes/Constraints/Caveats (Optional)
+
+Note that this application does not guarantee acceptance of the proposed project into Kubeflow.
+
+### Risks and Mitigations
+
+The major risks of Kubeflow accepting new projects are:
+1. Accepting projects that do not have active contributors or a healthy user base.
+   - This is why the application asks for community and contributor metrics.
+2. Impacting the delivery speed of Kubeflow releases.
+   - It will be expected that the maintainers invest in incorporating the project into the manifests, or it will be removed.
+3. Additional infrastructure support.
+   - It will be expected that the maintainers invest in providing this support.
+
+## Drawbacks
+
+How could this new project harm the Kubeflow community?
+
+## Alternatives
+
+What other open source projects are there like the one proposed?
+Why should Kubeflow accept the one proposed?