Commit 18cf0d0

Merge pull request #51971 from lauralorenz/51179-tutorial-dra-driver-install-and-observe-nits

DRA driver install tutorial nits follow-on

2 parents 2066d14 + 63ede0a

File tree

2 files changed: +93 -104 lines changed

content/en/docs/tutorials/_index.md

Lines changed: 1 addition & 0 deletions
@@ -58,6 +58,7 @@ Before walking through each tutorial, you may want to bookmark the
 ## Cluster Management
 
 * [Running Kubelet in Standalone Mode](/docs/tutorials/cluster-management/kubelet-standalone/)
+* [Install Drivers and Allocate Devices with DRA](/docs/tutorials/cluster-management/install-use-dra/)
 
 ## {{% heading "whatsnext" %}}

content/en/docs/tutorials/cluster-management/install-use-dra.md

Lines changed: 92 additions & 104 deletions
@@ -25,25 +25,24 @@ fun ✨ use cases.
 
 <!-- overview -->
 This tutorial shows you how to install {{< glossary_tooltip term_id="dra"
-text="DRA" >}} drivers in your cluster and how to use them in conjunction with
-the DRA APIs to allocate {{< glossary_tooltip text="devices" term_id="device"
+text="Dynamic Resource Allocation (DRA)" >}} drivers in your cluster and how to
+use them in conjunction with the DRA APIs to allocate {{< glossary_tooltip
+text="devices" term_id="device"
 >}} to Pods. This page is intended for cluster administrators.
 
 {{< glossary_tooltip text="Dynamic Resource Allocation (DRA)" term_id="dra" >}}
-is a Kubernetes feature that allows a cluster to manage availability and
-allocation of hardware resources to satisfy Pod-based claims for hardware
-requirements and preferences (see the [DRA Concept
-page](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) for more
-background). To support this, a mixture of Kubernetes built-in components (like
-the Kubernetes scheduler, kubelet, and kube-controller-manager) and third-party
-components (called DRA drivers) share the responsibility to advertise, allocate,
-prepare, mount, healthcheck, unprepare, and cleanup resources throughout the Pod
-lifecycle. These components share information via a series of DRA specific APIs
-in the `resource.k8s.io` API group, including {{< glossary_tooltip
-text="DeviceClasses" term_id="deviceclass" >}}, {{< glossary_tooltip
-text="ResourceSlices" term_id="resourceslice" >}}, {{< glossary_tooltip
-text="ResourceClaims" term_id="resourceclaim" >}}, as well as new fields in the
-Pod spec itself.
+lets a cluster manage availability and allocation of hardware resources to
+satisfy Pod-based claims for hardware requirements and preferences. To support
+this, a mixture of Kubernetes built-in components (like the Kubernetes
+scheduler, kubelet, and kube-controller-manager) and third-party drivers from
+device owners (called DRA drivers) share the responsibility to advertise,
+allocate, prepare, mount, healthcheck, unprepare, and cleanup resources
+throughout the Pod lifecycle. These components share information via a series of
+DRA specific APIs in the `resource.k8s.io` API group including {{<
+glossary_tooltip text="DeviceClasses" term_id="deviceclass" >}}, {{<
+glossary_tooltip text="ResourceSlices" term_id="resourceslice" >}}, {{<
+glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}}, as well as
+new fields in the Pod spec itself.
 
 <!-- objectives -->
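Aside (not part of this commit): the next hunk covers the feature gates and API groups that DRA requires. If you want a throwaway cluster to follow along with, one way to enable them is a kind configuration like the sketch below. The gate name `DynamicResourceAllocation` and the two `resource.k8s.io` versions come from the tutorial text; the rest assumes kind's `v1alpha4` config format and a Kubernetes version that ships these beta APIs.

```yaml
# kind-dra.yaml -- illustrative sketch only, not from the commit.
# Assumes a kind node image with a Kubernetes version that ships the
# DynamicResourceAllocation feature gate and the resource.k8s.io beta APIs.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  DynamicResourceAllocation: true
runtimeConfig:
  resource.k8s.io/v1beta1: "true"
  resource.k8s.io/v1beta2: "true"
nodes:
- role: control-plane
- role: worker
```

If you use this route, create the cluster with `kind create cluster --config kind-dra.yaml` before continuing.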

@@ -83,20 +82,21 @@ To enable the DRA feature, you must enable the following feature gates and API g
 1. Enable the following
    {{< glossary_tooltip text="API groups" term_id="api-group" >}}:
 
-   * `resource.k8s.io/v1beta1`: required for DRA to function.
-   * `resource.k8s.io/v1beta2`: optional, recommended improvements to the user
-     experience.
+   * `resource.k8s.io/v1beta1`
+   * `resource.k8s.io/v1beta2`
 
    For more information, see
    [Enabling or disabling API groups](/docs/reference/using-api/#enabling-or-disabling).
 
 
 <!-- lessoncontent -->
 
-## Explore the DRA initial state
+## Explore the initial cluster state {#explore-initial-state}
 
-With no driver installed or Pod claims yet to satisfy, you can observe the
-initial state of a cluster with DRA enabled.
+You can spend some time to observe the initial state of a cluster with DRA
+enabled, especially if you have not used these APIs extensively before. If you
+set up a new cluster for this tutorial, with no driver installed and no Pod
+claims yet to satisfy, the output of these commands won't show any resources.
 
 1. Get a list of {{< glossary_tooltip text="DeviceClasses" term_id="deviceclass" >}}:
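As an illustrative aside (the listing command itself is elided between hunks): once a driver is installed, entries appear in this list. A minimal DeviceClass of the kind the example driver provides might look like the sketch below — the name `gpu.example.com` appears later in this tutorial, while the API version and selector are assumptions based on the `resource.k8s.io/v1beta1` schema. The example driver ships its own DeviceClass, so you would not normally write this by hand.

```yaml
# Illustrative sketch only: a cluster-scoped DeviceClass that matches
# every device advertised by the gpu.example.com driver.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
```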

@@ -108,10 +108,6 @@ initial state of a cluster with DRA enabled.
    No resources found
    ```
 
-   If you set up a new blank cluster for this tutorial, it's normal to find that
-   there are no DeviceClasses. [Learn more about DeviceClasses
-   here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#deviceclass)
-
 1. Get a list of {{< glossary_tooltip text="ResourceSlices" term_id="resourceslice" >}}:
 
    ```shell
@@ -122,11 +118,7 @@ initial state of a cluster with DRA enabled.
    No resources found
    ```
 
-   If you set up a new blank cluster for this tutorial, it's normal to find that
-   there are no ResourceSlices advertised. [Learn more about ResourceSlices
-   here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceslice)
-
-1. View {{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}} and {{<
+1. Get a list of {{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}} and {{<
 glossary_tooltip text="ResourceClaimTemplates" term_id="resourceclaimtemplate"
 >}}
@@ -140,12 +132,6 @@ glossary_tooltip text="ResourceClaimTemplates" term_id="resourceclaimtemplate"
    No resources found
    ```
 
-   If you set up a new blank cluster for this tutorial, it's normal to find that
-   there are no ResourceClaims or ResourceClaimTemplates as you, the user, have
-   not created any. [Learn more about ResourceClaims and ResourceClaimTemplates
-   here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceclaims-templates)
-
-
 At this point, you have confirmed that DRA is enabled and configured properly in
 the cluster, and that no DRA drivers have advertised any resources to the DRA
 APIs yet.
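Aside (not part of this commit): the listing steps above mention ResourceClaimTemplates, but no template appears anywhere in this diff. For orientation, a minimal template for the device class used later in the tutorial could look like the following sketch. It assumes the `resource.k8s.io/v1beta1` schema, where a template wraps a claim spec (`spec.spec`) so that each Pod referencing it gets its own automatically created and garbage-collected ResourceClaim; the template name is hypothetical.

```yaml
# Illustrative sketch only -- a ResourceClaimTemplate for the example
# driver's device class. Each Pod that references this template gets a
# dedicated ResourceClaim created for it.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu-template   # hypothetical name
  namespace: dra-tutorial
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
```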
@@ -160,15 +146,22 @@ selection of the nodes (using {{< glossary_tooltip text="selectors"
 term_id="selector" >}} or similar mechanisms) in your cluster.
 
 Check your driver's documentation for specific installation instructions, which
-may include a Helm chart, a set of manifests, or other deployment tooling.
+might include a Helm chart, a set of manifests, or other deployment tooling.
 
 This tutorial uses an example driver which can be found in the
 [kubernetes-sigs/dra-example-driver](https://github.com/kubernetes-sigs/dra-example-driver)
-repository to demonstrate driver installation.
+repository to demonstrate driver installation. This example driver advertises
+simulated GPUs to Kubernetes for your Pods to interact with.
 
-### Prepare your cluster for driver installation
+### Prepare your cluster for driver installation {#prepare-cluster-driver}
+
+To simplify cleanup, create a namespace named dra-tutorial:
+
+1. Create the namespace:
 
-To make it easier to cleanup later, create a namespace called `dra-tutorial` in your cluster.
+   ```shell
+   kubectl create namespace dra-tutorial
+   ```
 
 In a production environment, you would likely be using a previously released or
 qualified image from the driver vendor or your own organization, and your nodes
@@ -177,12 +170,6 @@ hosted. In this tutorial, you will use a publicly released image of the
 dra-example-driver to simulate access to a DRA driver image.
 
 
-1. Create the namespace:
-
-   ```shell
-   kubectl create namespace dra-tutorial
-   ```
-
 1. Confirm your nodes have access to the image by running the following
    from within one of your cluster's nodes:
 
@@ -233,12 +220,10 @@ on this cluster:
    ```
 
 1. Create a {{< glossary_tooltip term_id="priority-class" >}} for the DRA
-   driver. The DRA driver component is responsible for important lifecycle
-   operations for Pods with claims, so you don't want it to be preempted. Learn
-   more about [pod priority and preemption
-   here](/docs/concepts/scheduling-eviction/pod-priority-preemption/). Learn
-   more about [good practices when maintaining a DRA driver
-   here](/docs/concepts/cluster-administration/dra/).
+   driver. The PriorityClass prevents preemption of the DRA driver component,
+   which is responsible for important lifecycle operations for Pods with
+   claims. Learn more about [pod priority and preemption
+   here](/docs/concepts/scheduling-eviction/pod-priority-preemption/).
 
    {{% code_sample language="yaml" file="dra/driver-install/priorityclass.yaml" %}}
 
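The referenced priorityclass.yaml is not included in this diff. A plausible sketch follows: only the resource name is confirmed by the tutorial (it appears in the cleanup section's `kubectl delete priorityclass dra-driver-high-priority` step); the value and description are assumptions.

```yaml
# Illustrative sketch of dra/driver-install/priorityclass.yaml.
# Only the name is confirmed by the commit; value and description
# are assumptions chosen to discourage preemption.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dra-driver-high-priority
value: 1000000
globalDefault: false
description: "High priority for the DRA example driver so that its Pods are not preempted."
```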
@@ -247,21 +232,22 @@ on this cluster:
    ```
 
 1. Deploy the actual DRA driver as a DaemonSet configured to run the example
-   driver binary with the permissions provisioned above.
+   driver binary with the permissions provisioned above. The DaemonSet has the
+   permissions that you granted to the ServiceAccount in the previous steps.
 
    {{% code_sample language="yaml" file="dra/driver-install/daemonset.yaml" %}}
 
    ```shell
    kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/daemonset.yaml
    ```
-   It is configured with
+   The DaemonSet is configured with
    the volume mounts necessary to interact with the underlying Container Device
-   Interface (CDI) directory, and to expose its socket to kubelet via the
-   kubelet plugins directory.
+   Interface (CDI) directory, and to expose its socket to `kubelet` via the
+   `kubelet/plugins` directory.
 
-### Verify the DRA driver installation
+### Verify the DRA driver installation {#verify-driver-install}
 
-1. Observe the Pods of the DRA driver DaemonSet across all worker nodes:
+1. Get a list of the Pods of the DRA driver DaemonSet across all worker nodes:
 
    ```shell
    kubectl get pod -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial
@@ -295,7 +281,7 @@ At this point, you have successfully installed the example DRA driver, and
 confirmed its initial configuration. You're now ready to use DRA to schedule
 Pods.
 
-## Claim resources and deploy a Pod
+## Claim resources and deploy a Pod {#claim-resources-pod}
 
 To request resources using DRA, you create ResourceClaims or
 ResourceClaimTemplates that define the resources that your Pods need. In the
@@ -311,12 +297,11 @@ learn more about ResourceClaims.
 
 ### Create the ResourceClaim
 
-The Pod manifest itself will include a reference to its relevant ResourceClaim
-object, which you will create now. Whatever the claim, the `deviceClassName` is
-a required field, narrowing down the scope of the request to a specific device
-class. The request itself can include a {{< glossary_tooltip term_id="cel" >}}
-expression that references attributes that may be advertised by the driver
-managing that device class.
+In this section, you create a ResourceClaim and reference it in a Pod. Whatever
+the claim, the `deviceClassName` is a required field, narrowing down the scope
+of the request to a specific device class. The request itself can include a {{<
+glossary_tooltip term_id="cel" >}} expression that references attributes that
+may be advertised by the driver managing that device class.
 
 In this example, you will create a request for any GPU advertising over 10Gi
 memory capacity. The attribute exposing capacity from the example driver takes
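(The hunk above is cut mid-sentence by the diff.) The claim and Pod manifests themselves are referenced via k8s.io/examples but not shown in this diff. As a hedged reconstruction for orientation only: the names `some-gpu`, `pod0`, `ctr0`, and `gpu.example.com` all appear in output elsewhere in the tutorial, while the API version, CEL expression, image, and command below are assumptions.

```yaml
# Illustrative reconstruction only -- the commit's real manifests live
# under dra/driver-install/example/. A claim for any GPU advertising
# more than 10Gi of memory, plus a Pod that references it.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: some-gpu
  namespace: dra-tutorial
spec:
  devices:
    requests:
    - name: some-gpu
      deviceClassName: gpu.example.com
      selectors:
      - cel:
          # Assumed CEL form: compare the advertised memory capacity
          # against a 10Gi quantity.
          expression: "device.capacity['gpu.example.com'].memory.compareTo(quantity('10Gi')) > 0"
---
apiVersion: v1
kind: Pod
metadata:
  name: pod0
  namespace: dra-tutorial
spec:
  restartPolicy: Never
  containers:
  - name: ctr0
    image: ubuntu:24.04            # assumed image
    command: ["bash", "-c", "export; sleep 9999"]
    resources:
      claims:
      - name: gpu                  # must match an entry in resourceClaims
  resourceClaims:
  - name: gpu
    resourceClaimName: some-gpu    # binds the Pod to the claim above
```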
@@ -343,20 +328,6 @@ underlying container.
    kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/example/pod.yaml
    ```
 
-### Explore the DRA state
-
-The cluster now tries to schedule that Pod to a node where Kubernetes can
-satisfy the ResourceClaim. In our situation, the DRA driver is deployed on all
-nodes, and is advertising mock GPUs on all nodes, all of which have enough
-capacity advertised to satisfy the Pod's claim, so this Pod may be scheduled to
-any node and any of the mock GPUs on that node may be allocated.
-
-The mock GPU driver injects environment variables in each container it is
-allocated to in order to indicate which GPUs _would_ have been injected into
-them by a real resource driver and how they would have been configured, so you
-can check those environment variables to see how the Pods have been handled by
-the system.
-
 1. Confirm the pod has deployed:
 
    ```shell
@@ -369,7 +340,22 @@ the system.
    pod0   1/1     Running   0          9s
    ```
 
-1. Observe the pod logs which report the name of the mock GPU allocated:
+### Explore the DRA state
+
+After you create the Pod, the cluster tries to schedule that Pod to a node where
+Kubernetes can satisfy the ResourceClaim. In this tutorial, the DRA driver is
+deployed on all nodes, and is advertising mock GPUs on all nodes, all of which
+have enough capacity advertised to satisfy the Pod's claim, so Kubernetes can
+schedule this Pod on any node and can allocate any of the mock GPUs on that
+node.
+
+When Kubernetes allocates a mock GPU to a Pod, the example driver adds
+environment variables in each container it is allocated to in order to indicate
+which GPUs _would_ have been injected into them by a real resource driver and
+how they would have been configured, so you can check those environment
+variables to see how the Pods have been handled by the system.
+
+1. Check the Pod logs, which report the name of the mock GPU that was allocated:
 
    ```shell
    kubectl logs pod0 -c ctr0 -n dra-tutorial | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"
@@ -380,10 +366,7 @@ the system.
    declare -x GPU_DEVICE_4="gpu-4"
    ```
 
-1. Observe the ResourceClaim object:
-
-   You can observe the ResourceClaim more closely, first only to see its state
-   is allocated and reserved.
+1. Check the state of the ResourceClaim object:
 
    ```shell
    kubectl get resourceclaims -n dra-tutorial
@@ -396,8 +379,12 @@ the system.
    some-gpu   allocated,reserved   34s
    ```
 
-   Looking deeper at the `some-gpu` ResourceClaim, you can see that the status stanza includes information about the
-   device that has been allocated and for what pod it has been reserved for:
+   In this output, the `STATE` column shows that the ResourceClaim is allocated
+   and reserved.
+
+1. Check the details of the `some-gpu` ResourceClaim. The `status` stanza of
+   the ResourceClaim has information about the allocated device and the Pod it
+   has been reserved for:
 
    ```shell
    kubectl get resourceclaim some-gpu -n dra-tutorial -o yaml
@@ -454,8 +441,8 @@ the system.
      resourceVersion: ""
    {{< /highlight >}}
 
-1. Observe the driver by checking the pod logs for pods backing the driver
-   daemonset:
+1. To check how the driver handled device allocation, get the logs for the
+   driver DaemonSet Pods:
 
    ```shell
    kubectl logs -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial
@@ -467,19 +454,18 @@ the system.
    I0729 05:11:52.684450       1 driver.go:112] Returning newly prepared devices for claim '79e1e8d8-7e53-4362-aad1-eca97678339e': [&Device{RequestNames:[some-gpu],PoolName:kind-worker,DeviceName:gpu-4,CDIDeviceIDs:[k8s.gpu.example.com/gpu=common k8s.gpu.example.com/gpu=79e1e8d8-7e53-4362-aad1-eca97678339e-gpu-4],}]
    ```
 
-You have now successfully deployed a Pod with a DRA based claim, and seen it
-scheduled to an appropriate node and the associated DRA APIs updated to reflect
-its status.
+You have now successfully deployed a Pod that claims devices using DRA, verified
+that the Pod was scheduled to an appropriate node, and saw that the associated
+DRA API kinds were updated with the allocation status.
 
-## Remove the Pod with a claim
+## Delete a Pod that has a claim {#delete-pod-claim}
 
 When a Pod with a claim is deleted, the DRA driver deallocates the resource so
-it can be available for future scheduling. You can observe that by deleting our
-pod with a claim and seeing that the state of the ResourceClaim changes.
-
-### Delete the pod using the resource claim
+it can be available for future scheduling. To validate this behavior, delete the
+Pod that you created in the previous steps and watch the corresponding changes
+to the ResourceClaim and driver.
 
-1. Delete the pod directly:
+1. Delete the `pod0` Pod:
 
    ```shell
    kubectl delete pod pod0 -n dra-tutorial
@@ -493,10 +479,11 @@ pod with a claim and seeing that the state of the ResourceClaim changes.
 
 ### Observe the DRA state
 
-The driver will deallocate the hardware and update the corresponding
-ResourceClaim resource that previously held the association.
+When the Pod is deleted, the driver deallocates the device from the
+ResourceClaim and updates the ResourceClaim resource in the Kubernetes API. The
+ResourceClaim has a `pending` state until it's referenced in a new Pod.
 
-1. Check the ResourceClaim is now pending:
+1. Check the state of the `some-gpu` ResourceClaim:
 
    ```shell
    kubectl get resourceclaims -n dra-tutorial
@@ -508,8 +495,8 @@ ResourceClaim resource that previously held the association.
    some-gpu   pending   76s
    ```
 
-1. Observe the driver logs and see that it processed unpreparing the device for
-   this claim:
+1. Verify that the driver has processed unpreparing the device for this claim by
+   checking the driver logs:
 
    ```shell
    kubectl logs -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial
@@ -525,13 +512,14 @@ reflect that the resource is available again for future scheduling.
 
 ## {{% heading "cleanup" %}}
 
-To cleanup the resources, delete the namespace for the tutorial which will clean up the ResourceClaims, driver components, and ServiceAccount. Then also delete the cluster level DeviceClass resource and cluster level RBAC resources.
+To clean up the resources that you created in this tutorial, follow these steps:
 
 ```shell
 kubectl delete namespace dra-tutorial
 kubectl delete deviceclass gpu.example.com
 kubectl delete clusterrole dra-example-driver-role
 kubectl delete clusterrolebinding dra-example-driver-role-binding
+kubectl delete priorityclass dra-driver-high-priority
 ```
 
 ## {{% heading "whatsnext" %}}
