# WG Device Management Charter

This charter adheres to the conventions described in the [Kubernetes Charter
README] and uses the Roles and Organization Management outlined in
[wg-governance].

## Scope

Enable simple and efficient configuration, sharing, and allocation of
accelerators and other specialized devices. This working group focuses on the
APIs, abstractions, and feature designs needed to configure, target, and share
the necessary hardware for both batch and serving (inference) workloads.

### In scope

- Enable efficient utilization of specialized hardware devices. This includes
  sharing one or more resources effectively (many workloads sharing a pool of
  devices), as well as sharing individual devices effectively (several workloads
  dividing up a single device).
- Enable workload authors to specify “just enough” detail about their workload
  requirements to ensure it runs optimally, without having to understand exactly
  how the infrastructure team has provisioned the cluster.
- Enable the scheduler to choose the correct place to run a workload the vast
  majority of the time (rejections should be extremely rare).
- Enable cluster autoscalers and other node auto-provisioning components to
  predict whether creating additional resources will satisfy workload needs,
  before provisioning those resources.
- Enable the shift from “pods run on nodes” to “workloads consume capacity”.
  This allows Kubernetes to provision sets of pods on top of sets of nodes and
  specialized hardware, while taking into account the relationships between
  those infrastructure components.
- Enable in-node devices as well as network-accessible devices.
- Minimize workload disruption due to hardware failures.
- Address fragmentation of accelerators due to fractional use.
- Additional problems that may be identified and deemed in scope as we gather
  use cases and requirements from WG Serving, WG Batch, and other stakeholders.
- Address all of the above with a simple API that is a natural extension of the
  existing Kubernetes APIs and avoids or minimizes any transition effort.

### Out of scope

- Higher-level workload controller APIs (for example, the equivalent of
  Deployment, StatefulSet, or DaemonSet) for specific types of workloads.
- General resource management requirements not related to devices.

## Deliverables

The WG will coordinate the delivery of KEPs and their implementations by the
participating SIGs. Interim artifacts will include documents capturing use
cases, requirements, and designs; however, all of those will eventually result
in KEPs and code owned by SIGs.

Specifically, we expect to need:

| 55 | + |
| 56 | +- APIs for publishing resource capacity of in-node and network-accessible |
| 57 | + devices, as well as sample code to ease creation of drivers to populate this |
| 58 | + information. |
| 59 | +- APIs for specifying workload resource requirements with respect to devices. |
| 60 | +- APIs, algorithms, and implementations for allocating access to and resources on devices, as well as |
| 61 | + persisting the results of those allocations. |
| 62 | +- APIs, algorithms, and implementations for allowing adminstrators to control |
| 63 | + and govern access to devices. |
| 64 | + |
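To make the workload-facing deliverables more concrete, the sketch below shows
what a device request and its consumption by a pod might look like, loosely
modeled on the Dynamic Resource Allocation (DRA) alpha API. The group/version,
field layout, and the `example.com-gpu` class name are illustrative assumptions,
not committed designs of this working group:

```yaml
# Hypothetical sketch only: loosely modeled on the Dynamic Resource
# Allocation (DRA) alpha API. The group/version, field layout, and the
# "example.com-gpu" device class name are illustrative assumptions.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: example.com-gpu  # class published by a device driver
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
  - name: worker
    image: registry.example/worker:latest
    resources:
      claims:
      - name: gpu-claim        # container consumes the claim below
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: single-gpu  # references the ResourceClaim above
```

The key property this shape illustrates is the separation of concerns listed
above: the workload author names only a device class, while drivers publish
capacity and administrators govern which classes are available.
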
## Stakeholders

- SIG Architecture
- SIG Autoscaling
- SIG Network
- SIG Node
- SIG Scheduling

Additionally, a broad set of end users, device vendors, cloud providers,
Kubernetes distribution providers, and ecosystem projects (particularly
autoscaling-related projects) have expressed interest in this effort. There are
five primary groups of stakeholders, from each of which we expect multiple
participants:

- Device vendors that manufacture accelerators and other specialized hardware
  which they would like to make available to Kubernetes users.
- Kubernetes distribution and managed offering providers that would like to make
  specialized hardware available to their users.
- Kubernetes ecosystem projects that help manage workloads utilizing these
  accelerators (e.g., Karpenter, Kueue, Volcano).
- End user workload authors that will create workloads that take advantage of
  the specialized hardware.
- Cluster administrators that operate and govern clusters containing the
  specialized hardware.

## Roles and Organization Management

This working group adheres to the Roles and Organization Management outlined in
[wg-governance] and opts in to updates and modifications to [wg-governance].

## Exit Criteria

The working group will disband when the KEPs resulting from these discussions
have reached a terminal state.

[wg-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md