Commit 7008f54

blog: add dynamic resource allocation feature blog post
This feature got added as alpha in 1.26. kubernetes/enhancements#3063

1 file changed: content/en/blog/_posts/2022-12-15-dynamic-resource-allocation-alpha (+333 lines)

---
layout: blog
title: "Kubernetes 1.26: Alpha API For Dynamic Resource Allocation"
date: 2022-12-15
slug: dynamic-resource-allocation
---

**Authors:** Patrick Ohly (Intel), Kevin Klues (NVIDIA)

Dynamic resource allocation is a new API for requesting resources. It is a
generalization of the persistent volumes API for generic resources, making it possible to:

- access the same resource instance in different pods and containers,
- attach arbitrary constraints to a resource request to get the exact resource
  you are looking for,
- initialize a resource according to parameters provided by the user.

Third-party resource drivers are responsible for interpreting these parameters
as well as tracking and allocating resources as requests come in.

Dynamic resource allocation is an *alpha feature* and only enabled when the
`DynamicResourceAllocation` [feature
gate](/docs/reference/command-line-tools-reference/feature-gates/) and the
`resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
term_id="api-group" >}} are enabled. For details, see the
`--feature-gates` and `--runtime-config` [kube-apiserver
parameters](/docs/reference/command-line-tools-reference/kube-apiserver/).
The kube-scheduler, kube-controller-manager and kubelet components all need
the feature gate enabled as well.
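
As a rough sketch, enabling the feature on control plane components that are started with
command-line flags could look like this; adapt the flag placement to however your cluster
components are actually deployed and configured:

```console
# kube-apiserver needs both the feature gate and the alpha API group:
$ kube-apiserver \
    --feature-gates=DynamicResourceAllocation=true \
    --runtime-config=resource.k8s.io/v1alpha1=true \
    ...

# kube-scheduler, kube-controller-manager and kubelet need the feature gate:
$ kube-scheduler --feature-gates=DynamicResourceAllocation=true ...
```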

The default configuration of kube-scheduler enables the `DynamicResources`
plugin if and only if the feature gate is enabled. Custom configurations may
have to be modified to include it.
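
As a sketch, a custom scheduler configuration that includes the plugin might look like the
following; this assumes the `kubescheduler.config.k8s.io/v1` configuration API and multi-point
plugin enablement, so adjust it to match your existing profiles:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      # Enable the DynamicResources plugin at all extension points it implements.
      - name: DynamicResources
```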

Once dynamic resource allocation is enabled, resource drivers can be installed
to manage certain kinds of hardware. Kubernetes has a test driver that is used
for end-to-end testing, but it can also be run manually. See
[below](#running-the-test-driver) for step-by-step instructions.

## API

The new `resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
term_id="api-group" >}} provides four new types:

ResourceClass
: Defines which resource driver handles a certain kind of
  resource and provides common parameters for it. ResourceClasses
  are created by a cluster administrator when installing a resource
  driver.

ResourceClaim
: Defines a particular resource instance that is required by a
  workload. Created by a user (lifecycle managed manually, can be shared
  between different Pods) or for individual Pods by the control plane based on
  a ResourceClaimTemplate (automatic lifecycle, typically used by just one
  Pod).

ResourceClaimTemplate
: Defines the spec and some metadata for creating
  ResourceClaims. Created by a user when deploying a workload.

PodScheduling
: Used internally by the control plane and resource drivers
  to coordinate pod scheduling when ResourceClaims need to be allocated
  for a Pod.

Parameters for ResourceClass and ResourceClaim are stored in separate objects,
typically using the type defined by a {{< glossary_tooltip
term_id="CustomResourceDefinition" text="CRD" >}} that was created when
installing a resource driver.

With this alpha feature enabled, the `spec` of a Pod defines the ResourceClaims
that are needed for the Pod to run: this information goes into a new
`resourceClaims` field. Entries in that list reference either a ResourceClaim
or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using
this `.spec` (for example, inside a Deployment or StatefulSet) share the same
ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets
its own ResourceClaim instance.
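
As a minimal sketch (the claim and template names here are placeholders), the two kinds of
entry look like this:

```yaml
spec:
  resourceClaims:
  - name: shared-resource
    source:
      resourceClaimName: existing-claim               # reuse one ResourceClaim across all Pods with this spec
  - name: per-pod-resource
    source:
      resourceClaimTemplateName: some-claim-template  # the control plane creates one ResourceClaim per Pod
```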

For a container defined within a Pod, the `resources.claims` list
defines whether that container gets
access to these resource instances, which makes it possible to share resources
between one or more containers inside the same Pod. For example, an init container could
set up the resource before the application uses it.

Here is an example that uses a fictional resource driver. Two ResourceClaim objects
will get created for this Pod and each container gets access to one of them.

Assuming a resource driver called `resource-driver.example.com` was installed
together with the following resource class:

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: resource.example.com
driverName: resource-driver.example.com
```

An end-user could then allocate two specific resources of type
`resource.example.com` as follows:

```yaml
---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
metadata:
  name: large-black-cats
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cats
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cats
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers: # two example containers; each container claims one cat resource
  - name: first-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: second-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cats
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cats
```
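
If you save manifests like the ones above to a file (the file name below is just an example)
and apply them on a cluster where the feature is enabled and a matching driver is installed,
the generated claims can be inspected like any other API object:

```console
$ kubectl create -f pod-with-cats.yaml
$ kubectl get resourceclaims
$ kubectl describe pod pod-with-cats
```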

## Scheduling

In contrast to native resources (such as CPU or RAM) and
[extended resources](/docs/concepts/configuration/manage-resources-containers/#extended-resources)
(managed by a
device plugin, advertised by kubelet), the scheduler has no knowledge of what
dynamic resources are available in a cluster or how they could be split up to
satisfy the requirements of a specific ResourceClaim. Resource drivers are
responsible for that. Drivers mark ResourceClaims as _allocated_ once the resources
for them are reserved. This also then tells the scheduler where in the cluster a
claimed resource is actually available.

ResourceClaims can get their resources allocated as soon as the ResourceClaim
is created (_immediate allocation_), without considering which Pods will use
the resource. The default (_wait for first consumer_) is to delay allocation until
a Pod that relies on the ResourceClaim becomes eligible for scheduling.
This design with two allocation options is similar to how Kubernetes handles
storage provisioning with PersistentVolumes and PersistentVolumeClaims.
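
For illustration, here is a sketch of a manually created ResourceClaim that opts into
immediate allocation; it reuses the fictional resource class from above and assumes the
alpha API's `allocationMode` field:

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: preallocated-resource
spec:
  resourceClassName: resource.example.com
  # Allocate as soon as the claim is created instead of the
  # default WaitForFirstConsumer behavior.
  allocationMode: Immediate
```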

In the wait for first consumer mode, the scheduler checks all ResourceClaims needed
by a Pod. If the Pod has any ResourceClaims, the scheduler creates a PodScheduling object
(a special object that requests scheduling details on behalf of the Pod). The PodScheduling
object has the same name and namespace as the Pod, and the Pod is its owner.
Using its PodScheduling object, the scheduler informs the resource drivers
responsible for those ResourceClaims about nodes that the scheduler considers
suitable for the Pod. The resource drivers respond by excluding nodes that
don't have enough of the driver's resources left.
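
To make this handshake more concrete, here is a rough, illustrative sketch of what such a
PodScheduling object could contain; the field names follow the alpha design described in the
KEP and should be treated as an assumption rather than authoritative API documentation:

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling
metadata:
  name: pod-with-cats      # same name and namespace as the Pod, which owns this object
spec:
  potentialNodes:          # filled in by the scheduler: candidate nodes for the Pod
  - node-1
  - node-2
status:
  resourceClaims:          # filled in by the resource driver
  - name: cat-0
    unsuitableNodes:       # nodes where this claim cannot be allocated
    - node-2
```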

Once the scheduler has that resource
information, it selects one node and stores that choice in the PodScheduling
object. The resource drivers then allocate resources based on the relevant
ResourceClaims so that the resources will be available on that selected node.
Once that resource allocation is complete, the scheduler attempts to schedule the Pod
to a suitable node. Scheduling can still fail at this point; for example, a different Pod could
be scheduled to the same node in the meantime. If this happens, already allocated
ResourceClaims may get deallocated to enable scheduling onto a different node.

As part of this process, ResourceClaims also get reserved for the
Pod. Currently ResourceClaims can either be used exclusively by a single Pod or
by an unlimited number of Pods.

One key feature is that Pods do not get scheduled to a node unless all of
their resources are allocated and reserved. This avoids the scenario where
a Pod gets scheduled onto one node and then cannot run there, which is bad
because such a pending Pod also blocks all other resources, like RAM or CPU, that were
set aside for it.

## Limitations

The scheduler plugin must be involved in scheduling Pods which use
ResourceClaims. Bypassing the scheduler by setting the `nodeName` field leads
to Pods that the kubelet refuses to start because the ResourceClaims are not
reserved or not even allocated. It may be possible to remove this
[limitation](https://github.com/kubernetes/kubernetes/issues/114005) in the
future.

## Writing a resource driver

A dynamic resource allocation driver typically consists of two separate-but-coordinating
components: a centralized controller, and a DaemonSet of node-local kubelet
plugins. Most of the work required by the centralized controller to coordinate
with the scheduler can be handled by boilerplate code. Only the business logic
required to actually allocate ResourceClaims against the ResourceClasses owned
by the plugin needs to be customized. As such, Kubernetes provides
the following package, including APIs for invoking this boilerplate code as
well as a `Driver` interface that you can implement to provide your custom
business logic:

- [k8s.io/dynamic-resource-allocation/controller](https://github.com/kubernetes/dynamic-resource-allocation/tree/release-1.26/controller)

Likewise, boilerplate code can be used to register the node-local plugin with
the kubelet, as well as start a gRPC server to implement the kubelet plugin
API. For drivers written in Go, the following package is recommended:

- [k8s.io/dynamic-resource-allocation/kubeletplugin](https://github.com/kubernetes/dynamic-resource-allocation/tree/release-1.26/kubeletplugin)

It is up to the driver developer to decide how these two components
communicate. The [KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md) outlines an [approach using
CRDs](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation#implementing-a-plugin-for-node-resources).

Within SIG Node, we also plan to provide a complete [example
driver](https://github.com/kubernetes-sigs/dra-example-driver) that can serve
as a template for other drivers.

## Running the test driver

The following steps bring up a local, one-node cluster directly from the
Kubernetes source code. As a prerequisite, your cluster must have nodes with a container
runtime that supports the
[Container Device Interface](https://github.com/container-orchestrated-devices/container-device-interface)
(CDI). For example, you can run CRI-O [v1.23.2](https://github.com/cri-o/cri-o/releases/tag/v1.23.2) or later.
Once containerd v1.7.0 is released, we expect that you can run that or any later version.
In the example below, we use CRI-O.

First, clone the Kubernetes source code. Inside that directory, run:

```console
$ hack/install-etcd.sh
...

$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
  FEATURE_GATES=DynamicResourceAllocation=true \
  DNS_ADDON="coredns" \
  CGROUP_DRIVER=systemd \
  CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
  LOG_LEVEL=6 \
  ENABLE_CSI_SNAPSHOTTER=false \
  API_SECURE_PORT=6444 \
  ALLOW_PRIVILEGED=1 \
  PATH=$(pwd)/third_party/etcd:$PATH \
  ./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:

  export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...
```

Once the cluster is up, in another
terminal run the test driver controller. `KUBECONFIG` must be set for all of
the following commands.

```console
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller
```

In another terminal, run the kubelet plugin:

```console
$ sudo mkdir -p /var/run/cdi && \
  sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin
```

Changing the permissions of the directories makes it possible to run and (when
using delve) debug the kubelet plugin as a normal user, which is convenient
because it uses the already populated Go cache. Remember to restore permissions
with `sudo chmod go-w` when done. Alternatively, you can also build the binary
and run that as root.

Now the cluster is ready to create objects:

```console
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created

$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created

$ kubectl get resourceclaims
NAME                         RESOURCECLASSNAME   ALLOCATIONMODE         STATE                AGE
test-inline-claim-resource   example             WaitForFirstConsumer   allocated,reserved   8s

$ kubectl get pods
NAME                READY   STATUS      RESTARTS   AGE
test-inline-claim   0/2     Completed   0          21s
```

The test driver doesn't do much; it only sets environment variables as defined
in the ConfigMap. The test pod dumps the environment, so the log can be checked
to verify that everything worked:

```console
$ kubectl logs test-inline-claim with-resource | grep user_a
user_a='b'
```

## Next steps

- See the
  [Dynamic Resource Allocation](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md)
  KEP for more information on the design.
- Read [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
  in the official Kubernetes documentation.
- You can participate in
  [SIG Node](https://github.com/kubernetes/community/blob/master/sig-node/README.md)
  and / or the [CNCF Container Orchestrated Device Working Group](https://github.com/cncf/tag-runtime/blob/master/wg/COD.md).
- You can view or comment on the [project board](https://github.com/orgs/kubernetes/projects/95/views/1)
  for dynamic resource allocation.
- In order to move this feature towards beta, we need feedback from hardware
  vendors, so here's a call to action: try out this feature, consider how it can help
  with problems that your users are having, and write resource drivers…
