@@ -91,10 +91,12 @@ To enable the DRA feature, you must enable the following feature gates and API g

<!-- lessoncontent -->

- ## Explore the DRA initial state
+ ## Explore the initial cluster state {#explore-initial-state}

- With no driver installed or Pod claims yet to satisfy, you can observe the
- initial state of a cluster with DRA enabled.
+ You can spend some time observing the initial state of a cluster with DRA
+ enabled, especially if you have not used these APIs extensively before. If you
+ set up a new cluster for this tutorial, with no driver installed and no Pod
+ claims yet to satisfy, the output of these commands won't show any resources.

1. Get a list of {{< glossary_tooltip text="DeviceClasses" term_id="deviceclass" >}}:

@@ -106,10 +108,6 @@ initial state of a cluster with DRA enabled.
No resources found
```

- If you set up a new blank cluster for this tutorial, it's normal to find that
- there are no DeviceClasses. [Learn more about DeviceClasses
- here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#deviceclass)
-
1. Get a list of {{< glossary_tooltip text="ResourceSlices" term_id="resourceslice" >}}:

```shell
@@ -120,11 +118,7 @@ initial state of a cluster with DRA enabled.
No resources found
```

- If you set up a new blank cluster for this tutorial, it's normal to find that
- there are no ResourceSlices advertised. [Learn more about ResourceSlices
- here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceslice)
-
- 1. View {{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}} and {{<
+ 1. Get a list of {{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}} and {{<
glossary_tooltip text="ResourceClaimTemplates" term_id="resourceclaimtemplate"
>}}

@@ -138,12 +132,6 @@ glossary_tooltip text="ResourceClaimTemplates" term_id="resourceclaimtemplate"
No resources found
```

- If you set up a new blank cluster for this tutorial, it's normal to find that
- there are no ResourceClaims or ResourceClaimTemplates as you, the user, have
- not created any. [Learn more about ResourceClaims and ResourceClaimTemplates
- here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceclaims-templates)
-
-
At this point, you have confirmed that DRA is enabled and configured properly in
the cluster, and that no DRA drivers have advertised any resources to the DRA
APIs yet.
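+
+ As an additional check, you can confirm that the cluster is serving the DRA API
+ group at all. The following is a minimal sketch; the exact kinds and versions
+ listed depend on your Kubernetes version and on which API versions you enabled:
+
+ ```shell
+ # List the API kinds served by the resource.k8s.io group.
+ # An empty result suggests that the DRA APIs are not enabled.
+ kubectl api-resources --api-group=resource.k8s.io
+ ```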
@@ -158,15 +146,22 @@ selection of the nodes (using {{< glossary_tooltip text="selectors"
term_id="selector" >}} or similar mechanisms) in your cluster.

Check your driver's documentation for specific installation instructions, which
- may include a Helm chart, a set of manifests, or other deployment tooling.
+ might include a Helm chart, a set of manifests, or other deployment tooling.
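+
+ For example, a Helm-based installation might look similar to the following
+ sketch. The repository URL and chart name here are placeholders rather than a
+ real published chart; substitute the values from your driver's documentation:
+
+ ```shell
+ # Hypothetical example: add the vendor's chart repository and
+ # install the DRA driver from it.
+ helm repo add dra-vendor https://charts.example.com/dra
+ helm install my-dra-driver dra-vendor/dra-driver --namespace dra-tutorial
+ ```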

This tutorial uses an example driver which can be found in the
[kubernetes-sigs/dra-example-driver](https://github.com/kubernetes-sigs/dra-example-driver)
- repository to demonstrate driver installation.
+ repository to demonstrate driver installation. This example driver advertises
+ simulated GPUs to Kubernetes for your Pods to interact with.

- ### Prepare your cluster for driver installation
+ ### Prepare your cluster for driver installation {#prepare-cluster-driver}
+
+ To simplify cleanup, create a namespace named `dra-tutorial`:
+
+ 1. Create the namespace:
-
+
- To make it easier to cleanup later, create a namespace called `dra-tutorial` in your cluster.
+ ```shell
+ kubectl create namespace dra-tutorial
+ ```

In a production environment, you would likely be using a previously released or
qualified image from the driver vendor or your own organization, and your nodes
@@ -175,12 +170,6 @@ hosted. In this tutorial, you will use a publicly released image of the
dra-example-driver to simulate access to a DRA driver image.


- 1. Create the namespace:
-
- ```shell
- kubectl create namespace dra-tutorial
- ```
-
1. Confirm your nodes have access to the image by running the following
from within one of your cluster's nodes:

@@ -231,12 +220,10 @@ on this cluster:
```

1. Create a {{< glossary_tooltip term_id="priority-class" >}} for the DRA
- driver. The DRA driver component is responsible for important lifecycle
- operations for Pods with claims, so you don't want it to be preempted. Learn
- more about [pod priority and preemption
- here](/docs/concepts/scheduling-eviction/pod-priority-preemption/). Learn
- more about [good practices when maintaining a DRA driver
- here](/docs/concepts/cluster-administration/dra/).
+ driver. The PriorityClass prevents preemption of the DRA driver component,
+ which is responsible for important lifecycle operations for Pods with
+ claims. Learn more about [pod priority and preemption
+ here](/docs/concepts/scheduling-eviction/pod-priority-preemption/).

{{% code_sample language="yaml" file="dra/driver-install/priorityclass.yaml" %}}

@@ -245,21 +232,22 @@ on this cluster:
```

1. Deploy the actual DRA driver as a DaemonSet configured to run the example
- driver binary with the permissions provisioned above.
+ driver binary with the permissions provisioned above. The DaemonSet has the
+ permissions that you granted to the ServiceAccount in the previous steps.

{{% code_sample language="yaml" file="dra/driver-install/daemonset.yaml" %}}

```shell
kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/daemonset.yaml
```
- It is configured with
+ The DaemonSet is configured with
the volume mounts necessary to interact with the underlying Container Device
- Interface (CDI) directory, and to expose its socket to kubelet via the
- kubelet plugins directory.
+ Interface (CDI) directory, and to expose its socket to `kubelet` via the
+ `kubelet/plugins` directory.
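+
+ As an illustration, the relevant parts of such a DaemonSet Pod spec typically
+ look similar to the following sketch. The paths shown are common defaults for
+ DRA drivers, not the exact contents of this tutorial's manifest:
+
+ ```yaml
+ # Sketch: host directories a DRA driver commonly mounts.
+ volumeMounts:
+ - name: plugins-registry   # kubelet plugin registration socket directory
+   mountPath: /var/lib/kubelet/plugins_registry
+ - name: plugins            # the driver's own plugin socket directory
+   mountPath: /var/lib/kubelet/plugins
+ - name: cdi                # where the driver writes CDI spec files
+   mountPath: /var/run/cdi
+ ```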

- ### Verify the DRA driver installation
+ ### Verify the DRA driver installation {#verify-driver-install}

- 1. Observe the Pods of the DRA driver DaemonSet across all worker nodes:
+ 1. Get a list of the Pods of the DRA driver DaemonSet across all worker nodes:

```shell
kubectl get pod -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial
@@ -293,7 +281,7 @@ At this point, you have successfully installed the example DRA driver, and
confirmed its initial configuration. You're now ready to use DRA to schedule
Pods.

- ## Claim resources and deploy a Pod
+ ## Claim resources and deploy a Pod {#claim-resources-pod}

To request resources using DRA, you create ResourceClaims or
ResourceClaimTemplates that define the resources that your Pods need. In the
@@ -309,12 +297,11 @@ learn more about ResourceClaims.

### Create the ResourceClaim

- The Pod manifest itself will include a reference to its relevant ResourceClaim
- object, which you will create now. Whatever the claim, the `deviceClassName` is
- a required field, narrowing down the scope of the request to a specific device
- class. The request itself can include a {{< glossary_tooltip term_id="cel" >}}
- expression that references attributes that may be advertised by the driver
- managing that device class.
+ In this section, you create a ResourceClaim and reference it in a Pod. Whatever
+ the claim, the `deviceClassName` is a required field, narrowing down the scope
+ of the request to a specific device class. The request itself can include a {{<
+ glossary_tooltip term_id="cel" >}} expression that references attributes that
+ may be advertised by the driver managing that device class.
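+
+ To illustrate the shape of such a request, a claim for a device in a
+ hypothetical `gpu.example.com` device class with at least 10Gi of advertised
+ memory capacity might look roughly like the following sketch. The API version
+ and attribute names here are assumptions; check the manifest used later in
+ this tutorial and your driver's documentation for the authoritative values:
+
+ ```yaml
+ apiVersion: resource.k8s.io/v1beta1
+ kind: ResourceClaim
+ metadata:
+   name: example-gpu-claim
+   namespace: dra-tutorial
+ spec:
+   devices:
+     requests:
+     - name: gpu
+       deviceClassName: gpu.example.com
+       selectors:
+       - cel:
+           # Match only devices that advertise at least 10Gi of memory.
+           expression: "device.capacity['gpu.example.com'].memory.compareTo(quantity('10Gi')) >= 0"
+ ```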

In this example, you will create a request for any GPU advertising over 10Gi
memory capacity. The attribute exposing capacity from the example driver takes
@@ -341,20 +328,6 @@ underlying container.
kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/example/pod.yaml
```

- ### Explore the DRA state
-
- The cluster now tries to schedule that Pod to a node where Kubernetes can
- satisfy the ResourceClaim. In our situation, the DRA driver is deployed on all
- nodes, and is advertising mock GPUs on all nodes, all of which have enough
- capacity advertised to satisfy the Pod's claim, so this Pod may be scheduled to
- any node and any of the mock GPUs on that node may be allocated.
-
- The mock GPU driver injects environment variables in each container it is
- allocated to in order to indicate which GPUs _would_ have been injected into
- them by a real resource driver and how they would have been configured, so you
- can check those environment variables to see how the Pods have been handled by
- the system.
-
1. Confirm the pod has deployed:

```shell
@@ -367,7 +340,22 @@ the system.
pod0 1/1 Running 0 9s
```

- 1. Observe the pod logs which report the name of the mock GPU allocated:
+ ### Explore the DRA state {#explore-dra-state}
+
+ After you create the Pod, the cluster tries to schedule that Pod to a node where
+ Kubernetes can satisfy the ResourceClaim. In this tutorial, the DRA driver is
+ deployed on all nodes, and is advertising mock GPUs on all nodes, all of which
+ have enough capacity advertised to satisfy the Pod's claim, so Kubernetes can
+ schedule this Pod on any node and can allocate any of the mock GPUs on that
+ node.
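+
+ To see where the Pod actually landed, you can inspect its node placement. This
+ is an optional check; `pod0` is the Pod that you created earlier in this
+ tutorial:
+
+ ```shell
+ # Show the Pod together with the node it was scheduled to.
+ kubectl get pod pod0 -n dra-tutorial -o wide
+ ```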
+
+ When Kubernetes allocates a mock GPU to a Pod, the example driver sets
+ environment variables in each container that the GPU is allocated to. These
+ variables indicate which GPUs _would_ have been injected by a real resource
+ driver and how they would have been configured, so you can check those
+ environment variables to see how the system handled the Pods.
+
+ 1. Check the Pod logs, which report the name of the mock GPU that was allocated:

```shell
kubectl logs pod0 -c ctr0 -n dra-tutorial | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"
@@ -378,10 +366,7 @@ the system.
declare -x GPU_DEVICE_4="gpu-4"
```

- 1. Observe the ResourceClaim object:
-
- You can observe the ResourceClaim more closely, first only to see its state
- is allocated and reserved.
+ 1. Check the state of the ResourceClaim object:

```shell
kubectl get resourceclaims -n dra-tutorial
@@ -394,9 +379,12 @@ the system.
some-gpu allocated,reserved 34s
```

- Looking deeper at the `some-gpu` ResourceClaim, you can see that the status
- stanza includes information about the device that has been allocated and for
- what pod it has been reserved for:
+ In this output, the `STATE` column shows that the ResourceClaim is allocated
+ and reserved.
+
+ 1. Check the details of the `some-gpu` ResourceClaim. The `status` stanza of
+ the ResourceClaim has information about the allocated device and the Pod it
+ has been reserved for:

```shell
kubectl get resourceclaim some-gpu -n dra-tutorial -o yaml
@@ -453,8 +441,8 @@ the system.
resourceVersion: ""
{{< /highlight >}}

- 1. Observe the driver by checking the pod logs for pods backing the driver
- daemonset:
+ 1. To check how the driver handled device allocation, get the logs for the
+ driver DaemonSet Pods:

```shell
kubectl logs -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial
@@ -466,17 +454,16 @@ the system.
I0729 05:11:52.684450 1 driver.go:112] Returning newly prepared devices for claim '79e1e8d8-7e53-4362-aad1-eca97678339e': [&Device{RequestNames:[some-gpu],PoolName:kind-worker,DeviceName:gpu-4,CDIDeviceIDs:[k8s.gpu.example.com/gpu=common k8s.gpu.example.com/gpu=79e1e8d8-7e53-4362-aad1-eca97678339e-gpu-4],}]
```

- You have now successfully deployed a Pod with a DRA based claim, and seen it
- scheduled to an appropriate node and the associated DRA APIs updated to reflect
- its status.
+ You have now successfully deployed a Pod that claims devices using DRA, verified
+ that the Pod was scheduled to an appropriate node, and seen that the associated
+ DRA API kinds were updated with the allocation status.

- ## Remove the Pod with a claim
+ ## Delete a Pod that has a claim {#delete-pod-claim}

When a Pod with a claim is deleted, the DRA driver deallocates the resource so
- it can be available for future scheduling. You can observe that by deleting our
- pod with a claim and seeing that the state of the ResourceClaim changes.
-
- ### Delete the pod using the resource claim
+ it can be available for future scheduling. To validate this behavior, delete the
+ Pod that you created in the previous steps and watch the corresponding changes
+ to the ResourceClaim and driver.
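+
+ Optionally, before you delete the Pod, you can follow the ResourceClaim from a
+ second terminal so that the state change is visible as it happens. This is a
+ minimal sketch using kubectl's watch flag:
+
+ ```shell
+ # Watch the claim; its STATE should change once the Pod is deleted.
+ kubectl get resourceclaims -n dra-tutorial --watch
+ ```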

1. Delete the `pod0` Pod:
