Commit d18506e

Update charter
Signed-off-by: John Belamaric <[email protected]>
1 parent 425840a commit d18506e

1 file changed

+99
-2
lines changed

wg-device-management/charter.md

Lines changed: 99 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,100 @@
1-
# WG Device Management
1+
# WG Device Management Charter
22

3-
In progress.
3+
This charter adheres to the conventions described in the [Kubernetes Charter
4+
README] and uses the Roles and Organization Management outlined in
5+
[wg-governance].
6+
7+
## Scope
8+
9+
Enable simple and efficient configuration, sharing, and allocation of
10+
accelerators and other specialized devices. This working group focuses on the
11+
APIs, abstractions, and feature designs needed to configure, target, and share
12+
the necessary hardware for both batch and serving (inference) workloads.
13+
14+
### In scope
15+
16+
- Enable efficient utilization of specialized hardware devices. This includes
17+
sharing one or more resources effectively (many workloads sharing a pool of
18+
devices), as well as sharing individual devices effectively (several workloads
19+
dividing up a single device for sharing).
20+
- Enable workload authors to specify “just enough” details about their workload
21+
requirements to ensure it runs optimally, without having to understand exactly
22+
how the infrastructure team has provisioned the cluster.
23+
- Enable the scheduler to choose the correct place to run a workload the vast
24+
majority of the time (rejections should be extremely rare).
25+
- Enable cluster autoscalers and other node auto-provisioning components to
26+
predict whether creating additional resources will satisfy workload needs,
27+
before provisioning those resources.
28+
- Enable the shift from “pods run on nodes” to “workloads consume capacity”.
29+
This allows Kubernetes to provision sets of pods on top of sets of nodes and
30+
specialized hardware, while taking into account the relationships between
31+
those infrastructure components.
32+
- Enable in-node devices as well as network-accessible devices.
33+
- Minimize workload disruption due to hardware failures.
34+
- Address fragmentation of accelerators due to fractional use.
35+
- Additional problems that may be identified and deemed in scope as we gather
36+
use cases and requirements from WG Serving, WG Batch, and other stakeholders.
37+
- Address all of the above with a simple API that is a natural extension
38+
of the existing Kubernetes APIs, and avoids or minimizes any transition
39+
effort.
40+
41+
### Out of scope
42+
43+
- Higher-level workload controller APIs (for example, the equivalent of
44+
Deployment, StatefulSet, or DaemonSet) for specific types of workloads.
45+
- General resource management requirements not related to devices.
46+
47+
## Deliverables
48+
49+
The WG will coordinate the delivery of KEPs and their implementations by the
50+
participating SIGs. Interim artifacts will include documents capturing use
51+
cases, requirements, and designs; however, all of those will eventually result
52+
in KEPs and code owned by SIGs.
53+
54+
Specifically, we expect to need:
55+
56+
- APIs for publishing resource capacity of in-node and network-accessible
57+
devices, as well as sample code to ease creation of drivers to populate this
58+
information.
59+
- APIs for specifying workload resource requirements with respect to devices.
60+
- APIs, algorithms, and implementations for allocating access to devices and the resources on them, as well as
61+
persisting the results of those allocations.
62+
- APIs, algorithms, and implementations for allowing administrators to control
63+
and govern access to devices.
64+
65+
## Stakeholders
66+
67+
- SIG Architecture
68+
- SIG Autoscaling
69+
- SIG Network
70+
- SIG Node
71+
- SIG Scheduling
72+
73+
Additionally, a broad set of end users, device vendors, cloud providers,
74+
Kubernetes distribution providers, and ecosystem projects (particularly
75+
autoscaling-related projects) have expressed interest in this effort. There are
76+
five primary groups of stakeholders, and we expect multiple participants from each:
77+
78+
- Device vendors that manufacture accelerators and other specialized hardware
79+
which they would like to make available to Kubernetes users.
80+
- Kubernetes distribution and managed offering providers that would like to make
81+
specialized hardware available to their users.
82+
- Kubernetes ecosystem projects that help manage workloads utilizing these
83+
accelerators (e.g., Karpenter, Kueue, Volcano).
84+
- End user workload authors that will create workloads that take advantage of
85+
the specialized hardware.
86+
- Cluster administrators that operate and govern clusters containing the
87+
specialized hardware.
88+
89+
## Roles and Organization Management
90+
91+
This working group adheres to the Roles and Organization Management outlined in [wg-governance]
92+
and opts in to updates and modifications to [wg-governance].
93+
94+
## Exit Criteria
95+
96+
The working group will disband when the KEPs resulting from these discussions
97+
have reached a terminal state.
98+
99+
[wg-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md
100+
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md
