---
layout: blog
title: "Kubernetes 1.26: Alpha API For Dynamic Resource Allocation"
date: 2022-12-15
slug: dynamic-resource-allocation
---

**Authors:** Patrick Ohly (Intel), Kevin Klues (NVIDIA)

Dynamic resource allocation is a new API for requesting resources. It is a
generalization of the persistent volumes API for generic resources, making it possible to:

- access the same resource instance in different pods and containers,
- attach arbitrary constraints to a resource request to get the exact resource
  you are looking for,
- initialize a resource according to parameters provided by the user.

Third-party resource drivers are responsible for interpreting these parameters
as well as tracking and allocating resources as requests come in.

Dynamic resource allocation is an *alpha feature* and only enabled when the
`DynamicResourceAllocation` [feature
gate](/docs/reference/command-line-tools-reference/feature-gates/) and the
`resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
term_id="api-group" >}} are enabled. For details, see the
`--feature-gates` and `--runtime-config` [kube-apiserver
parameters](/docs/reference/command-line-tools-reference/kube-apiserver/).
The kube-scheduler, kube-controller-manager and kubelet components all need
the feature gate enabled as well.
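
How exactly you pass these settings depends on how your cluster is set up. As a
rough sketch, a kubeadm-based cluster could enable everything in one go with a
configuration along the following lines; treat this as an illustration rather
than a complete setup:

```yaml
# Sketch of a kubeadm configuration that enables the alpha feature on the
# control plane components and the kubelet. Adjust to your own cluster.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
    runtime-config: "resource.k8s.io/v1alpha1=true"
controllerManager:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
scheduler:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
---
# The kubelet gets the feature gate through its own configuration.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
```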

The default configuration of kube-scheduler enables the `DynamicResources`
plugin if and only if the feature gate is enabled. Custom configurations may
have to be modified to include it.
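
For example, a custom scheduler configuration could enable the plugin explicitly
through the `multiPoint` extension point, roughly as in this sketch (the profile
shown here is only an illustration):

```yaml
# Sketch of a custom kube-scheduler configuration that explicitly
# enables the DynamicResources plugin.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: DynamicResources
```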

Once dynamic resource allocation is enabled, resource drivers can be installed
to manage certain kinds of hardware. Kubernetes has a test driver that is used
for end-to-end testing, but also can be run manually. See
[below](#running-the-test-driver) for step-by-step instructions.

## API

The new `resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
term_id="api-group" >}} provides four new types:

ResourceClass
: Defines which resource driver handles a certain kind of
  resource and provides common parameters for it. ResourceClasses
  are created by a cluster administrator when installing a resource
  driver.

ResourceClaim
: Defines a particular resource instance that is required by a
  workload. Created by a user (lifecycle managed manually, can be shared
  between different Pods) or for individual Pods by the control plane based on
  a ResourceClaimTemplate (automatic lifecycle, typically used by just one
  Pod).

ResourceClaimTemplate
: Defines the spec and some metadata for creating
  ResourceClaims. Created by a user when deploying a workload.

PodScheduling
: Used internally by the control plane and resource drivers
  to coordinate pod scheduling when ResourceClaims need to be allocated
  for a Pod.

Parameters for ResourceClass and ResourceClaim are stored in separate objects,
typically using the type defined by a {{< glossary_tooltip
term_id="CustomResourceDefinition" text="CRD" >}} that was created when
installing a resource driver.
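
As a sketch of what that looks like, the fictional driver used in the example
below could install a CRD like this for its claim parameters (the group, kind
and fields are invented for this post):

```yaml
# Hypothetical CRD installed together with the fictional example driver.
# It defines the ClaimParameters type referenced further below.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: claimparameters.cats.resource.example.com
spec:
  group: cats.resource.example.com
  names:
    kind: ClaimParameters
    listKind: ClaimParametersList
    plural: claimparameters
    singular: claimparameters
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              color:
                type: string
              size:
                type: string
```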

With this alpha feature enabled, the `spec` of a Pod defines the ResourceClaims that are needed for the Pod
to run: this information goes into a new
`resourceClaims` field. Entries in that list reference either a ResourceClaim
or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using
this `.spec` (for example, inside a Deployment or StatefulSet) share the same
ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets
its own ResourceClaim instance.

For a container defined within a Pod, the `resources.claims` list
defines whether that container gets
access to these resource instances, which makes it possible to share resources
between containers inside the same Pod. For example, an init container could
set up the resource before the application uses it.

Here is an example of a fictional resource driver. Two ResourceClaim objects
will get created for this Pod, and each container gets access to one of them.

Assuming a resource driver called `resource-driver.example.com` was installed
together with the following resource class:

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: resource.example.com
driverName: resource-driver.example.com
```

An end-user could then allocate two specific resources of type
`resource.example.com` as follows:

```yaml
---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
metadata:
  name: large-black-cats
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cats
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cats
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers: # two example containers; each container claims one cat resource
  - name: first-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: second-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cats
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cats
```
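
The claims in the example above come from a template and therefore belong to a
single Pod. To sketch the sharing case mentioned earlier, a manually created
ResourceClaim can instead be referenced by name and used by several containers
of the same Pod, for example by an init container that prepares the resource
and the application container that consumes it (the names here are made up):

```yaml
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: shared-cat
spec:
  resourceClassName: resource.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-shared-cat
spec:
  initContainers:
  - name: prepare-cat
    image: ubuntu:22.04
    command: ["sh", "-c", "echo preparing the resource"]
    resources:
      claims:
      - name: cat
  containers:
  - name: use-cat
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat
  resourceClaims:
  - name: cat
    source:
      resourceClaimName: shared-cat  # reference an existing claim instead of a template
```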

## Scheduling

In contrast to native resources (such as CPU or RAM) and
[extended resources](/docs/concepts/configuration/manage-resources-containers/#extended-resources)
(managed by a
device plugin, advertised by kubelet), the scheduler has no knowledge of what
dynamic resources are available in a cluster or how they could be split up to
satisfy the requirements of a specific ResourceClaim. Resource drivers are
responsible for that. Drivers mark ResourceClaims as _allocated_ once resources
for them are reserved. This also then tells the scheduler where in the cluster a
claimed resource is actually available.

ResourceClaims can get resources allocated as soon as the ResourceClaim
is created (_immediate allocation_), without considering which Pods will use
the resource. The default (_wait for first consumer_) is to delay allocation until
a Pod that relies on the ResourceClaim becomes eligible for scheduling.
This design with two allocation options is similar to how Kubernetes handles
storage provisioning with PersistentVolumes and PersistentVolumeClaims.
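
The allocation mode is chosen per claim. As a sketch, a ResourceClaim for the
fictional class from the example above could opt into immediate allocation like
this (the claim name is made up):

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: prewarmed-cat
spec:
  resourceClassName: resource.example.com
  # Allocate as soon as the claim is created instead of waiting
  # for the first Pod that uses it (the default).
  allocationMode: Immediate
```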

In the wait for first consumer mode, the scheduler checks all ResourceClaims needed
by a Pod. If the Pod has any ResourceClaims, the scheduler creates a PodScheduling
(a special object that requests scheduling details on behalf of the Pod). The PodScheduling
has the same name and namespace as the Pod and the Pod as its owner.
Using its PodScheduling, the scheduler informs the resource drivers
responsible for those ResourceClaims about nodes that the scheduler considers
suitable for the Pod. The resource drivers respond by excluding nodes that
don't have enough of the driver's resources left.
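
Roughly speaking, this negotiation is recorded in the PodScheduling object
itself. The following sketch shows the kind of information it carries in
`resource.k8s.io/v1alpha1`; the node and claim names are invented, and as an
alpha API the exact fields may still change:

```yaml
# Illustrative PodScheduling object for the Pod from the earlier example.
apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling
metadata:
  name: pod-with-cats   # same name and namespace as the Pod
spec:
  # Written by the scheduler: candidate nodes for the Pod.
  potentialNodes:
  - worker-1
  - worker-2
  # Set once the scheduler has picked a node for delayed allocation.
  selectedNode: worker-1
status:
  # Written by the resource driver: nodes that cannot satisfy a claim.
  resourceClaims:
  - name: cat-0
    unsuitableNodes:
    - worker-2
```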

Once the scheduler has that resource
information, it selects one node and stores that choice in the PodScheduling
object. The resource drivers then allocate resources based on the relevant
ResourceClaims so that the resources will be available on that selected node.
Once that resource allocation is complete, the scheduler attempts to schedule the Pod
to a suitable node. Scheduling can still fail at this point; for example, a different Pod could
be scheduled to the same node in the meantime. If this happens, already allocated
ResourceClaims may get deallocated to enable scheduling onto a different node.

As part of this process, ResourceClaims also get reserved for the
Pod. Currently ResourceClaims can either be used exclusively by a single Pod or
by an unlimited number of Pods.

One key feature is that Pods do not get scheduled to a node unless all of
their resources are allocated and reserved. This avoids the scenario where
a Pod gets scheduled onto one node and then cannot run there, which is bad
because such a pending Pod also blocks all other resources like RAM or CPU that were
set aside for it.

## Limitations

The scheduler plugin must be involved in scheduling Pods which use
ResourceClaims. Bypassing the scheduler by setting the `nodeName` field leads
to Pods that the kubelet refuses to start because the ResourceClaims are not
reserved or not even allocated. It may be possible to remove this
[limitation](https://github.com/kubernetes/kubernetes/issues/114005) in the
future.

## Writing a resource driver

A dynamic resource allocation driver typically consists of two separate-but-coordinating
components: a centralized controller, and a DaemonSet of node-local kubelet
plugins. Most of the work required by the centralized controller to coordinate
with the scheduler can be handled by boilerplate code. Only the business logic
required to actually allocate ResourceClaims against the ResourceClasses owned
by the plugin needs to be customized. As such, Kubernetes provides
the following package, including APIs for invoking this boilerplate code as
well as a `Driver` interface that you can implement to provide your custom
business logic:

- [k8s.io/dynamic-resource-allocation/controller](https://github.com/kubernetes/dynamic-resource-allocation/tree/release-1.26/controller)

Likewise, boilerplate code can be used to register the node-local plugin with
the kubelet, as well as start a gRPC server to implement the kubelet plugin
API. For drivers written in Go, the following package is recommended:

- [k8s.io/dynamic-resource-allocation/kubeletplugin](https://github.com/kubernetes/dynamic-resource-allocation/tree/release-1.26/kubeletplugin)
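
How the node-local component gets onto each node is up to the driver author,
but a DaemonSet that mounts the kubelet plugin directories is a common pattern.
The following sketch uses made-up names and a placeholder image; only the
directories correspond to locations a plugin typically needs access to:

```yaml
# Hypothetical DaemonSet for the node-local kubelet plugin of a driver.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-driver-kubelet-plugin
  namespace: example-driver
spec:
  selector:
    matchLabels:
      app: example-driver-kubelet-plugin
  template:
    metadata:
      labels:
        app: example-driver-kubelet-plugin
    spec:
      containers:
      - name: plugin
        image: example.com/resource-driver:v0.1.0
        securityContext:
          privileged: true
        volumeMounts:
        - name: plugins-registry   # kubelet plugin registration
          mountPath: /var/lib/kubelet/plugins_registry
        - name: plugins            # plugin socket directory
          mountPath: /var/lib/kubelet/plugins
        - name: cdi                # CDI spec files for the container runtime
          mountPath: /var/run/cdi
      volumes:
      - name: plugins-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
      - name: plugins
        hostPath:
          path: /var/lib/kubelet/plugins
      - name: cdi
        hostPath:
          path: /var/run/cdi
```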

It is up to the driver developer to decide how these two components
communicate. The [KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md) outlines an [approach using
CRDs](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation#implementing-a-plugin-for-node-resources).

Within SIG Node, we also plan to provide a complete [example
driver](https://github.com/kubernetes-sigs/dra-example-driver) that can serve
as a template for other drivers.

## Running the test driver

The following steps bring up a local, one-node cluster directly from the
Kubernetes source code. As a prerequisite, your cluster must have nodes with a container
runtime that supports the
[Container Device Interface](https://github.com/container-orchestrated-devices/container-device-interface)
(CDI). For example, you can run CRI-O [v1.23.2](https://github.com/cri-o/cri-o/releases/tag/v1.23.2) or later.
Once containerd v1.7.0 is released, we expect that you can run that or any later version.
In the example below, we use CRI-O.

First, clone the Kubernetes source code. Inside that directory, run:

```console
$ hack/install-etcd.sh
...

$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
  FEATURE_GATES=DynamicResourceAllocation=true \
  DNS_ADDON="coredns" \
  CGROUP_DRIVER=systemd \
  CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
  LOG_LEVEL=6 \
  ENABLE_CSI_SNAPSHOTTER=false \
  API_SECURE_PORT=6444 \
  ALLOW_PRIVILEGED=1 \
  PATH=$(pwd)/third_party/etcd:$PATH \
  ./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:

  export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...
```

Once the cluster is up, in another
terminal run the test driver controller. `KUBECONFIG` must be set for all of
the following commands.

```console
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller
```

In another terminal, run the kubelet plugin:

```console
$ sudo mkdir -p /var/run/cdi && \
  sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin
```

Changing the permissions of the directories makes it possible to run and (when
using delve) debug the kubelet plugin as a normal user, which is convenient
because it uses the already populated Go cache. Remember to restore permissions
with `sudo chmod go-w` when done. Alternatively, you can also build the binary
and run that as root.

Now the cluster is ready to create objects:

```console
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created

$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created

$ kubectl get resourceclaims
NAME                         RESOURCECLASSNAME   ALLOCATIONMODE         STATE                AGE
test-inline-claim-resource   example             WaitForFirstConsumer   allocated,reserved   8s

$ kubectl get pods
NAME                READY   STATUS      RESTARTS   AGE
test-inline-claim   0/2     Completed   0          21s
```

The test driver doesn't do much; it only sets environment variables as defined
in the ConfigMap. The test pod dumps the environment, so the log can be checked
to verify that everything worked:

```console
$ kubectl logs test-inline-claim with-resource | grep user_a
user_a='b'
```

## Next steps

- See the
  [Dynamic Resource Allocation](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md)
  KEP for more information on the design.
- Read [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
  in the official Kubernetes documentation.
- You can participate in
  [SIG Node](https://github.com/kubernetes/community/blob/master/sig-node/README.md)
  and / or the [CNCF Container Orchestrated Device Working Group](https://github.com/cncf/tag-runtime/blob/master/wg/COD.md).
- You can view or comment on the [project board](https://github.com/orgs/kubernetes/projects/95/views/1)
  for dynamic resource allocation.
  - In order to move this feature towards beta, we need feedback from hardware
    vendors, so here's a call to action: try out this feature, consider how it can help
    with problems that your users are having, and write resource drivers…