@@ -25,25 +25,24 @@ fun ✨ use cases.

<!-- overview -->
This tutorial shows you how to install {{< glossary_tooltip term_id="dra"
- text="DRA" >}} drivers in your cluster and how to use them in conjunction with
- the DRA APIs to allocate {{< glossary_tooltip text="devices" term_id="device"
+ text="Dynamic Resource Allocation (DRA)" >}} drivers in your cluster and how to
+ use them in conjunction with the DRA APIs to allocate {{< glossary_tooltip
+ text="devices" term_id="device"
>}} to Pods. This page is intended for cluster administrators.

{{< glossary_tooltip text="Dynamic Resource Allocation (DRA)" term_id="dra" >}}
- is a Kubernetes feature that allows a cluster to manage availability and
- allocation of hardware resources to satisfy Pod-based claims for hardware
- requirements and preferences (see the [DRA Concept
- page](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) for more
- background). To support this, a mixture of Kubernetes built-in components (like
- the Kubernetes scheduler, kubelet, and kube-controller-manager) and third-party
- components (called DRA drivers) share the responsibility to advertise, allocate,
- prepare, mount, healthcheck, unprepare, and cleanup resources throughout the Pod
- lifecycle. These components share information via a series of DRA specific APIs
- in the `resource.k8s.io` API group, including {{< glossary_tooltip
- text="DeviceClasses" term_id="deviceclass" >}}, {{< glossary_tooltip
- text="ResourceSlices" term_id="resourceslice" >}}, {{< glossary_tooltip
- text="ResourceClaims" term_id="resourceclaim" >}}, as well as new fields in the
- Pod spec itself.
+ lets a cluster manage availability and allocation of hardware resources to
+ satisfy Pod-based claims for hardware requirements and preferences. To support
+ this, a mixture of Kubernetes built-in components (like the Kubernetes
+ scheduler, kubelet, and kube-controller-manager) and third-party drivers from
+ device owners (called DRA drivers) share the responsibility to advertise,
+ allocate, prepare, mount, health-check, unprepare, and clean up resources
+ throughout the Pod lifecycle. These components share information via a series of
+ DRA-specific APIs in the `resource.k8s.io` API group, including {{<
+ glossary_tooltip text="DeviceClasses" term_id="deviceclass" >}}, {{<
+ glossary_tooltip text="ResourceSlices" term_id="resourceslice" >}}, {{<
+ glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}}, as well as
+ new fields in the Pod spec itself.

<!-- objectives -->
@@ -83,20 +82,21 @@ To enable the DRA feature, you must enable the following feature gates and API g
1. Enable the following
   {{< glossary_tooltip text="API groups" term_id="api-group" >}}:

-    * `resource.k8s.io/v1beta1`: required for DRA to function.
-    * `resource.k8s.io/v1beta2`: optional, recommended improvements to the user
-      experience.
+    * `resource.k8s.io/v1beta1`
+    * `resource.k8s.io/v1beta2`

   For more information, see
   [Enabling or disabling API groups](/docs/reference/using-api/#enabling-or-disabling).
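
   As an illustration of how these settings might be applied together, the
   following sketch shows one way to enable them when creating a local test
   cluster with [kind](https://kind.sigs.k8s.io/). This is an assumption about
   your tooling, not part of the tutorial; your own cluster may expose
   equivalent flags differently.

   ```yaml
   # Illustrative only: enabling the DRA feature gate and the resource.k8s.io
   # API versions for a kind test cluster. The node layout is also just an
   # assumption; adjust or ignore this for your own cluster tooling.
   kind: Cluster
   apiVersion: kind.x-k8s.io/v1alpha4
   featureGates:
     DynamicResourceAllocation: true
   runtimeConfig:
     "resource.k8s.io/v1beta1": "true"
     "resource.k8s.io/v1beta2": "true"
   nodes:
   - role: control-plane
   - role: worker
   ```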


<!-- lessoncontent -->

- ## Explore the DRA initial state
+ ## Explore the initial cluster state {#explore-initial-state}

- With no driver installed or Pod claims yet to satisfy, you can observe the
- initial state of a cluster with DRA enabled.
+ You can spend some time observing the initial state of a cluster with DRA
+ enabled, especially if you have not used these APIs extensively before. If you
+ set up a new cluster for this tutorial, with no driver installed and no Pod
+ claims yet to satisfy, the output of these commands won't show any resources.

1. Get a list of {{< glossary_tooltip text="DeviceClasses" term_id="deviceclass" >}}:
@@ -108,10 +108,6 @@ initial state of a cluster with DRA enabled.
   No resources found
   ```

-  If you set up a new blank cluster for this tutorial, it's normal to find that
-  there are no DeviceClasses. [Learn more about DeviceClasses
-  here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#deviceclass)
-
1. Get a list of {{< glossary_tooltip text="ResourceSlices" term_id="resourceslice" >}}:

   ```shell
@@ -122,11 +118,7 @@ initial state of a cluster with DRA enabled.
   No resources found
   ```

-  If you set up a new blank cluster for this tutorial, it's normal to find that
-  there are no ResourceSlices advertised. [Learn more about ResourceSlices
-  here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceslice)
-
- 1. View {{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}} and {{<
+ 1. Get a list of {{< glossary_tooltip text="ResourceClaims" term_id="resourceclaim" >}} and {{<
glossary_tooltip text="ResourceClaimTemplates" term_id="resourceclaimtemplate"
>}}
@@ -140,12 +132,6 @@ glossary_tooltip text="ResourceClaimTemplates" term_id="resourceclaimtemplate"
   No resources found
   ```

-  If you set up a new blank cluster for this tutorial, it's normal to find that
-  there are no ResourceClaims or ResourceClaimTemplates as you, the user, have
-  not created any. [Learn more about ResourceClaims and ResourceClaimTemplates
-  here.](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceclaims-templates)
-
-
At this point, you have confirmed that DRA is enabled and configured properly in
the cluster, and that no DRA drivers have advertised any resources to the DRA
APIs yet.
@@ -160,15 +146,22 @@ selection of the nodes (using {{< glossary_tooltip text="selectors"
term_id="selector" >}} or similar mechanisms) in your cluster.

Check your driver's documentation for specific installation instructions, which
- may include a Helm chart, a set of manifests, or other deployment tooling.
+ might include a Helm chart, a set of manifests, or other deployment tooling.

This tutorial uses an example driver which can be found in the
[kubernetes-sigs/dra-example-driver](https://github.com/kubernetes-sigs/dra-example-driver)
- repository to demonstrate driver installation.
+ repository to demonstrate driver installation. This example driver advertises
+ simulated GPUs to Kubernetes for your Pods to interact with.

- ### Prepare your cluster for driver installation
+ ### Prepare your cluster for driver installation {#prepare-cluster-driver}
+
+ To simplify cleanup, create a namespace named `dra-tutorial`:
+
+ 1. Create the namespace:

- To make it easier to cleanup later, create a namespace called `dra-tutorial` in your cluster.
+    ```shell
+    kubectl create namespace dra-tutorial
+    ```

In a production environment, you would likely be using a previously released or
qualified image from the driver vendor or your own organization, and your nodes
@@ -177,12 +170,6 @@ hosted. In this tutorial, you will use a publicly released image of the
dra-example-driver to simulate access to a DRA driver image.


- 1. Create the namespace:
-
-    ```shell
-    kubectl create namespace dra-tutorial
-    ```
-
1. Confirm your nodes have access to the image by running the following
   from within one of your cluster's nodes:
@@ -233,12 +220,10 @@ on this cluster:
   ```

1. Create a {{< glossary_tooltip term_id="priority-class" >}} for the DRA
-  driver. The DRA driver component is responsible for important lifecycle
-  operations for Pods with claims, so you don't want it to be preempted. Learn
-  more about [pod priority and preemption
-  here](/docs/concepts/scheduling-eviction/pod-priority-preemption/). Learn
-  more about [good practices when maintaining a DRA driver
-  here](/docs/concepts/cluster-administration/dra/).
+  driver. The PriorityClass prevents preemption of the DRA driver component,
+  which is responsible for important lifecycle operations for Pods with
+  claims. Learn more about [pod priority and preemption
+  here](/docs/concepts/scheduling-eviction/pod-priority-preemption/).

   {{% code_sample language="yaml" file="dra/driver-install/priorityclass.yaml" %}}

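   The referenced manifest defines the PriorityClass for the driver Pods. As a
   rough sketch of what such an object might look like (the value below is an
   assumption; the tutorial's priorityclass.yaml is authoritative), consider:

   ```yaml
   # Illustrative sketch of a PriorityClass for a DRA driver. The name matches
   # the object deleted during cleanup later in this tutorial; the value is an
   # assumption chosen to be high enough to avoid preemption.
   apiVersion: scheduling.k8s.io/v1
   kind: PriorityClass
   metadata:
     name: dra-driver-high-priority
   value: 1000000
   globalDefault: false
   description: "Used by the DRA example driver so that its Pods are not preempted."
   ```
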
@@ -247,21 +232,22 @@ on this cluster:
   ```

1. Deploy the actual DRA driver as a DaemonSet configured to run the example
-  driver binary with the permissions provisioned above.
+  driver binary with the permissions provisioned above. The DaemonSet has the
+  permissions that you granted to the ServiceAccount in the previous steps.

   {{% code_sample language="yaml" file="dra/driver-install/daemonset.yaml" %}}

   ```shell
   kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/daemonset.yaml
   ```
-  It is configured with
+  The DaemonSet is configured with
   the volume mounts necessary to interact with the underlying Container Device
-  Interface (CDI) directory, and to expose its socket to kubelet via the
-  kubelet plugins directory.
+  Interface (CDI) directory, and to expose its socket to `kubelet` via the
+  `kubelet/plugins` directory.
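
   As a rough illustration of that wiring (not the tutorial's actual manifest;
   the image, names, and host paths below are assumptions, and the referenced
   daemonset.yaml is authoritative), such a DaemonSet typically mounts host
   directories like these:

   ```yaml
   # Illustrative sketch only: the kubelet plugin and CDI hostPath mounts that a
   # DRA driver DaemonSet commonly needs. Paths and the image are assumptions.
   apiVersion: apps/v1
   kind: DaemonSet
   metadata:
     name: dra-driver-sketch
     namespace: dra-tutorial
   spec:
     selector:
       matchLabels:
         app: dra-driver-sketch
     template:
       metadata:
         labels:
           app: dra-driver-sketch
       spec:
         containers:
         - name: driver
           image: example.com/dra-driver:latest # placeholder, not a real image
           volumeMounts:
           - name: plugins-registry
             mountPath: /var/lib/kubelet/plugins_registry
           - name: plugins
             mountPath: /var/lib/kubelet/plugins
           - name: cdi
             mountPath: /var/run/cdi
         volumes:
         - name: plugins-registry
           hostPath:
             path: /var/lib/kubelet/plugins_registry
         - name: plugins
           hostPath:
             path: /var/lib/kubelet/plugins
         - name: cdi
           hostPath:
             path: /var/run/cdi
   ```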

- ### Verify the DRA driver installation
+ ### Verify the DRA driver installation {#verify-driver-install}

- 1. Observe the Pods of the DRA driver DaemonSet across all worker nodes:
+ 1. Get a list of the Pods of the DRA driver DaemonSet across all worker nodes:

   ```shell
   kubectl get pod -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial
@@ -295,7 +281,7 @@ At this point, you have successfully installed the example DRA driver, and
confirmed its initial configuration. You're now ready to use DRA to schedule
Pods.

- ## Claim resources and deploy a Pod
+ ## Claim resources and deploy a Pod {#claim-resources-pod}

To request resources using DRA, you create ResourceClaims or
ResourceClaimTemplates that define the resources that your Pods need. In the
@@ -311,12 +297,11 @@ learn more about ResourceClaims.

### Create the ResourceClaim

- The Pod manifest itself will include a reference to its relevant ResourceClaim
- object, which you will create now. Whatever the claim, the `deviceClassName` is
- a required field, narrowing down the scope of the request to a specific device
- class. The request itself can include a {{< glossary_tooltip term_id="cel" >}}
- expression that references attributes that may be advertised by the driver
- managing that device class.
+ In this section, you create a ResourceClaim and reference it in a Pod. In any
+ claim, `deviceClassName` is a required field that narrows the scope of the
+ request to a specific device class. The request itself can include a {{<
+ glossary_tooltip term_id="cel" >}} expression that references attributes that
+ might be advertised by the driver managing that device class.
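
As a rough illustration of those fields (this is not the manifest that the
tutorial applies; the API version and the capacity attribute are assumptions,
and the referenced example files are authoritative), such a ResourceClaim might
look like the following:

```yaml
# Illustrative sketch only: a ResourceClaim with the required deviceClassName
# and an optional CEL selector. The capacity attribute and quantity are
# assumptions about what a driver such as the example driver might advertise.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: some-gpu
  namespace: dra-tutorial
spec:
  devices:
    requests:
    - name: some-gpu
      deviceClassName: gpu.example.com
      selectors:
      - cel:
          expression: "device.capacity['gpu.example.com'].memory.compareTo(quantity('10Gi')) >= 0"
```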

In this example, you will create a request for any GPU advertising over 10Gi
memory capacity. The attribute exposing capacity from the example driver takes
@@ -343,20 +328,6 @@ underlying container.
   kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/example/pod.yaml
   ```

- ### Explore the DRA state
-
- The cluster now tries to schedule that Pod to a node where Kubernetes can
- satisfy the ResourceClaim. In our situation, the DRA driver is deployed on all
- nodes, and is advertising mock GPUs on all nodes, all of which have enough
- capacity advertised to satisfy the Pod's claim, so this Pod may be scheduled to
- any node and any of the mock GPUs on that node may be allocated.
-
- The mock GPU driver injects environment variables in each container it is
- allocated to in order to indicate which GPUs _would_ have been injected into
- them by a real resource driver and how they would have been configured, so you
- can check those environment variables to see how the Pods have been handled by
- the system.
-
1. Confirm the pod has deployed:

   ```shell
@@ -369,7 +340,22 @@ the system.
   pod0   1/1     Running   0          9s
   ```

- 1. Observe the pod logs which report the name of the mock GPU allocated:
+ ### Explore the DRA state
+
+ After you create the Pod, the cluster tries to schedule that Pod to a node where
+ Kubernetes can satisfy the ResourceClaim. In this tutorial, the DRA driver is
+ deployed on all nodes, and is advertising mock GPUs on all nodes, all of which
+ have enough capacity advertised to satisfy the Pod's claim, so Kubernetes can
+ schedule this Pod on any node and can allocate any of the mock GPUs on that
+ node.
+
+ When Kubernetes allocates a mock GPU to a Pod, the example driver adds
+ environment variables to each container that the GPU is allocated to. These
+ variables indicate which GPUs _would_ have been injected into the container by
+ a real resource driver and how they would have been configured, so you can
+ check them to see how the Pods have been handled by the system.
+
+ 1. Check the Pod logs, which report the name of the mock GPU that was allocated:

   ```shell
   kubectl logs pod0 -c ctr0 -n dra-tutorial | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"
@@ -380,10 +366,7 @@ the system.
   declare -x GPU_DEVICE_4="gpu-4"
   ```

- 1. Observe the ResourceClaim object:
-
-    You can observe the ResourceClaim more closely, first only to see its state
-    is allocated and reserved.
+ 1. Check the state of the ResourceClaim object:

   ```shell
   kubectl get resourceclaims -n dra-tutorial
@@ -396,8 +379,12 @@ the system.
   some-gpu   allocated,reserved   34s
   ```

-  Looking deeper at the `some-gpu` ResourceClaim, you can see that the status stanza includes information about the
-  device that has been allocated and for what pod it has been reserved for:
+  In this output, the `STATE` column shows that the ResourceClaim is allocated
+  and reserved.
+
+ 1. Check the details of the `some-gpu` ResourceClaim. The `status` stanza of
+    the ResourceClaim has information about the allocated device and the Pod it
+    has been reserved for:

   ```shell
   kubectl get resourceclaim some-gpu -n dra-tutorial -o yaml
@@ -454,8 +441,8 @@ the system.
   resourceVersion: ""
   {{< /highlight >}}

- 1. Observe the driver by checking the pod logs for pods backing the driver
-    daemonset:
+ 1. To check how the driver handled device allocation, get the logs for the
+    driver DaemonSet Pods:

   ```shell
   kubectl logs -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial
@@ -467,19 +454,18 @@ the system.
   I0729 05:11:52.684450       1 driver.go:112] Returning newly prepared devices for claim '79e1e8d8-7e53-4362-aad1-eca97678339e': [&Device{RequestNames:[some-gpu],PoolName:kind-worker,DeviceName:gpu-4,CDIDeviceIDs:[k8s.gpu.example.com/gpu=common k8s.gpu.example.com/gpu=79e1e8d8-7e53-4362-aad1-eca97678339e-gpu-4],}]
   ```

- You have now successfully deployed a Pod with a DRA based claim, and seen it
- scheduled to an appropriate node and the associated DRA APIs updated to reflect
- its status.
+ You have now successfully deployed a Pod that claims devices using DRA, verified
+ that the Pod was scheduled to an appropriate node, and seen that the associated
+ DRA API kinds were updated with the allocation status.

- ## Remove the Pod with a claim
+ ## Delete a Pod that has a claim {#delete-pod-claim}

When a Pod with a claim is deleted, the DRA driver deallocates the resource so
- it can be available for future scheduling. You can observe that by deleting our
- pod with a claim and seeing that the state of the ResourceClaim changes.
-
- ### Delete the pod using the resource claim
+ it can be available for future scheduling. To validate this behavior, delete the
+ Pod that you created in the previous steps and watch the corresponding changes
+ to the ResourceClaim and driver.

- 1. Delete the pod directly:
+ 1. Delete the `pod0` Pod:

   ```shell
   kubectl delete pod pod0 -n dra-tutorial
@@ -493,10 +479,11 @@ pod with a claim and seeing that the state of the ResourceClaim changes.

### Observe the DRA state

- The driver will deallocate the hardware and update the corresponding
- ResourceClaim resource that previously held the association.
+ When the Pod is deleted, the driver deallocates the device from the
+ ResourceClaim and updates the ResourceClaim resource in the Kubernetes API. The
+ ResourceClaim has a `pending` state until it's referenced in a new Pod.

- 1. Check the ResourceClaim is now pending:
+ 1. Check the state of the `some-gpu` ResourceClaim:

   ```shell
   kubectl get resourceclaims -n dra-tutorial
@@ -508,8 +495,8 @@ ResourceClaim resource that previously held the association.
   some-gpu   pending   76s
   ```

- 1. Observe the driver logs and see that it processed unpreparing the device for
-    this claim:
+ 1. Verify that the driver has unprepared the device for this claim by
+    checking the driver logs:

   ```shell
   kubectl logs -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial
@@ -525,13 +512,14 @@ reflect that the resource is available again for future scheduling.

## {{% heading "cleanup" %}}

- To cleanup the resources, delete the namespace for the tutorial which will clean up the ResourceClaims, driver components, and ServiceAccount. Then also delete the cluster level DeviceClass resource and cluster level RBAC resources.
+ To clean up the resources that you created in this tutorial, follow these steps:

```shell
kubectl delete namespace dra-tutorial
kubectl delete deviceclass gpu.example.com
kubectl delete clusterrole dra-example-driver-role
kubectl delete clusterrolebinding dra-example-driver-role-binding
+ kubectl delete priorityclass dra-driver-high-priority
```

## {{% heading "whatsnext" %}}