Commit c1a89ef

klueska authored and pohly committed
DRA: add text for supporting partitionable devices
Signed-off-by: Kevin Klues <[email protected]>
1 parent 84d2e99 commit c1a89ef

1 file changed: +201 -8 lines changed

keps/sig-node/4381-dra-structured-parameters/README.md

Lines changed: 201 additions & 8 deletions
@@ -500,7 +500,7 @@ as described in the next paragraph. However, the source of this data may vary; f
 example, a cloud provider controller could populate this based upon information
 from the cloud provider API.
 
-In the kubelet case, each driver running on a node publishes a set of
+In the node-local case, each driver running on a node publishes a set of
 `ResourceSlice` objects to the API server for its own resources, using its
 connection to the apiserver. The collection of these objects form a pool from
 which resources can be allocated. Some additional fields (defined in the API
@@ -514,7 +514,7 @@ and `driverName` fields in each `ResourceSlice` object are used to determine whi
 managed by which driver instance. The owner reference ensures that objects
 belonging to a node get cleaned up when the node gets removed.
 
-In addition, whenever kubelet starts, it first deletes all `ResourceSlices`
+In addition, whenever the kubelet starts, it first deletes all `ResourceSlices`
 belonging to the node with a `DeleteCollection` call that uses the node name in
 a field filter. This ensures that no pods depending on DRA get scheduled to the
 node until the required DRA drivers have started up again (node reboot) and
@@ -530,7 +530,9 @@ reconstructed by the driver. This has no effect on already allocated claims
 because the allocation result is tracked in those claims, not the
 `ResourceSlice` objects (see [below](#state-and-communication)).
 
-Embedded inside each `ResourceSlice` is the representation of one or more devices.
+#### Devices as a named list of attributes
+
+Embedded inside each `ResourceSlice` is a list of one or more devices, represented as a named list of attributes.
 
 ```yaml
 kind: ResourceSlice
@@ -584,6 +586,146 @@ would allow us to delete the pod and trying again with a new one, but is not don
 at the moment because admission checks cannot be retried if a check finds
 a transient problem.
 
+#### Partitionable devices
+
+In addition to devices, a `ResourceSlice` can also embed a list of
+`SharedCapacity` objects. Each `SharedCapacity` object represents some amount
+of shared "capacity" that can be consumed by one or more devices listed in the
+slice. When listing such devices, the sum of the capacity consumed across all
+devices *may* exceed the total amount available in the `SharedCapacity` object.
+This allows one, for example, to provide a way to logically partition a device
+into a set of overlapping sub-devices, each of which "could" be allocated by a
+scheduler (just not at the same time).
+
+As such, scheduler support will need to be added to track the set of
+`SharedCapacity` objects provided by each `ResourceSlice` as well as perform
+the following steps to decide if a given device is a candidate for allocation
+or not:
+
+1. Look at the set of `SharedCapacityConsumed` objects referenced by the device
+1. For each `SharedCapacityConsumed` object, look and see how much capacity is still available in the `SharedCapacity` object it is tracking with the same name
+1. If enough capacity is still available in the `SharedCapacity` object to satisfy the device's consumption, continue to consider it for allocation
+1. If not enough capacity is available, move on to the next device
+1. Upon deciding to allocate a device, subtract all of the capacity in its `SharedCapacityConsumed` objects from the corresponding `SharedCapacity` objects being tracked
+1. Upon freeing a device, add all of the capacity in its `SharedCapacityConsumed` objects back into the corresponding `SharedCapacity` objects being tracked
+
+**Note:** Only devices embedded in the _same_ `ResourceSlice` where a given
+`SharedCapacity` is declared have access to that `SharedCapacity`. This
+restriction simplifies the logic required to track these capacities in the
+scheduler, and shouldn't be too limiting in practice. Also, note that while it
+is described as "subtracting" and "adding" capacity in the `SharedCapacity`
+object, these activities are not written back to the `ResourceSlice` itself.
+Just like all summaries of current allocations, they are maintained in-memory
+by the scheduler based on the totality of device allocations recorded in all
+claim statuses.
+
+As an example, consider the following YAML which declares shared capacity for a
+set of "memory" blocks within a GPU. Multiple devices (including the full GPU)
+are defined which consume some number of these memory blocks at varying
+physical locations within GPU memory. With this in place, the scheduler is free
+to choose any device that matches a provided device selector, so long as no
+other devices have already been allocated that consume overlapping memory
+blocks.
+
+```yaml
+kind: ResourceSlice
+apiVersion: resource.k8s.io/v1alpha3
+...
+spec:
+  # The node name indicates the node.
+  # Each driver on a node provides pools of devices for allocation,
+  # with unique device names inside each pool.
+  # Usually, but not necessarily, that pool name is the same as the
+  # node name.
+  nodeName: worker-1
+  poolName: worker-1
+  driverName: gpu.dra.example.com
+  sharedCapacity:
+  - name: gpu-0-memory-block-0
+    capacity: 1
+  - name: gpu-0-memory-block-1
+    capacity: 1
+  - name: gpu-0-memory-block-2
+    capacity: 1
+  - name: gpu-0-memory-block-3
+    capacity: 1
+  devices:
+  - name: gpu-0
+    attributes:
+    - name: memory
+      quantity: 40Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-0
+      capacity: 1
+    - name: gpu-0-memory-block-1
+      capacity: 1
+    - name: gpu-0-memory-block-2
+      capacity: 1
+    - name: gpu-0-memory-block-3
+      capacity: 1
+  - name: gpu-0-first-half
+    attributes:
+    - name: memory
+      quantity: 20Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-0
+      capacity: 1
+    - name: gpu-0-memory-block-1
+      capacity: 1
+  - name: gpu-0-middle-half
+    attributes:
+    - name: memory
+      quantity: 20Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-1
+      capacity: 1
+    - name: gpu-0-memory-block-2
+      capacity: 1
+  - name: gpu-0-second-half
+    attributes:
+    - name: memory
+      quantity: 20Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-2
+      capacity: 1
+    - name: gpu-0-memory-block-3
+      capacity: 1
+  - name: gpu-0-first-quarter
+    attributes:
+    - name: memory
+      quantity: 10Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-0
+      capacity: 1
+  - name: gpu-0-second-quarter
+    attributes:
+    - name: memory
+      quantity: 10Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-1
+      capacity: 1
+  - name: gpu-0-third-quarter
+    attributes:
+    - name: memory
+      quantity: 10Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-2
+      capacity: 1
+  - name: gpu-0-fourth-quarter
+    attributes:
+    - name: memory
+      quantity: 10Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-3
+      capacity: 1
+```
+
+In this example, `gpu-0-first-half` and `gpu-0-second-half` could be allocated
+simultaneously (because the set of `gpu-0-memory-block`s they consume are
+mutually exclusive). However, `gpu-0-first-half` and `gpu-0-first-quarter`
+could not (because `gpu-0-memory-block-0` is consumed completely by both of
+them).
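
The allocation steps in the "Partitionable devices" text can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the scheduler's implementation: the `tracker` type and its `fits`, `allocate`, and `free` methods are hypothetical names, capacities are plain `int64` values instead of `resource.Quantity`, and each device is assumed to list a given `SharedCapacity` name at most once.

```go
package main

import "fmt"

// SharedCapacity follows the shape of the KEP type, but uses int64 instead
// of resource.Quantity to keep the sketch dependency-free (an assumption).
type SharedCapacity struct {
	Name     string
	Capacity int64
}

// Device is a trimmed-down, hypothetical stand-in for the KEP's Device type.
type Device struct {
	Name                   string
	SharedCapacityConsumed []SharedCapacity
}

// tracker holds the remaining capacity per SharedCapacity name for a single
// ResourceSlice. It is in-memory state only; nothing is written back.
type tracker struct {
	remaining map[string]int64
}

func newTracker(shared []SharedCapacity) *tracker {
	t := &tracker{remaining: make(map[string]int64, len(shared))}
	for _, sc := range shared {
		t.remaining[sc.Name] = sc.Capacity
	}
	return t
}

// fits reports whether every capacity the device consumes is still available.
// Assumes each name appears at most once per device.
func (t *tracker) fits(d Device) bool {
	for _, c := range d.SharedCapacityConsumed {
		if t.remaining[c.Name] < c.Capacity {
			return false
		}
	}
	return true
}

// allocate subtracts the device's consumption if it fits and reports success.
func (t *tracker) allocate(d Device) bool {
	if !t.fits(d) {
		return false
	}
	for _, c := range d.SharedCapacityConsumed {
		t.remaining[c.Name] -= c.Capacity
	}
	return true
}

// free adds the device's consumption back to the tracked capacities.
func (t *tracker) free(d Device) {
	for _, c := range d.SharedCapacityConsumed {
		t.remaining[c.Name] += c.Capacity
	}
}

func main() {
	tr := newTracker([]SharedCapacity{
		{"gpu-0-memory-block-0", 1}, {"gpu-0-memory-block-1", 1},
		{"gpu-0-memory-block-2", 1}, {"gpu-0-memory-block-3", 1},
	})
	firstHalf := Device{"gpu-0-first-half", []SharedCapacity{
		{"gpu-0-memory-block-0", 1}, {"gpu-0-memory-block-1", 1}}}
	secondHalf := Device{"gpu-0-second-half", []SharedCapacity{
		{"gpu-0-memory-block-2", 1}, {"gpu-0-memory-block-3", 1}}}
	firstQuarter := Device{"gpu-0-first-quarter", []SharedCapacity{
		{"gpu-0-memory-block-0", 1}}}

	fmt.Println(tr.allocate(firstHalf))    // true
	fmt.Println(tr.allocate(secondHalf))   // true: disjoint blocks
	fmt.Println(tr.allocate(firstQuarter)) // false: block 0 is taken
	tr.free(firstHalf)
	fmt.Println(tr.allocate(firstQuarter)) // true: block 0 is free again
}
```

Run against the devices from the YAML example, the sketch accepts both halves together and rejects `gpu-0-first-quarter` for as long as `gpu-0-first-half` holds `gpu-0-memory-block-0`.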
+
 ### Using structured parameters
 
 A ResourceClaim is a request to allocate one or more devices. Each request in a
@@ -1038,6 +1180,15 @@ type ResourceSliceSpec struct {
     // seen all slices.
     PoolDeviceCount int64
 
+    // SharedCapacity defines the set of shared capacity consumable by
+    // devices in this ResourceSlice.
+    //
+    // Must not have more than 128 entries.
+    //
+    // +listType=atomic
+    // +optional
+    SharedCapacity []SharedCapacity
+
     // Devices lists all available devices in this pool.
     //
     // Must not have more than 128 entries.
@@ -1048,6 +1199,7 @@ type ResourceSliceSpec struct {
     // them) empty pool.
 }
 
+const ResourceSliceMaxSharedCapacity = 128
 const ResourceSliceMaxDevices = 128
 const PoolNameMaxLength = validation.DNS1123SubdomainMaxLength // Same as for a single node name.
 ```
@@ -1086,14 +1238,23 @@ type Device struct {
     // +optional
     Attributes []DeviceAttribute
 
-    // TODO for 1.31: define how to support partitionable devices
+    // SharedCapacityConsumed defines the set of shared capacity consumed by
+    // this device.
+    //
+    // Must not have more than 32 entries.
+    //
+    // +listType=atomic
+    // +optional
+    SharedCapacityConsumed []SharedCapacity
 }
 
 const ResourceSliceMaxAttributesPerDevice = 32
+const ResourceSliceMaxSharedCapacityConsumedPerDevice = 32
 
-// ResourceSliceMaxDevices and ResourceSliceMaxAttributesPerDevice where chosen
-// so that with the maximum attribute length of 96 characters the total size of
-// the ResourceSlice object is around 420KB.
+// ResourceSliceMaxDevices and ResourceSliceMaxAttributesPerDevice were chosen
+// so that with a maximum `DeviceAttribute` length of 96 characters and a
+// maximum `SharedCapacity` length of ~40 characters (ignoring overhead), the
+// total size of the ResourceSlice object is around 590KB.
 
 // DeviceAttribute is a combination of an attribute name and its value.
 // Exactly one value must be set.
@@ -1135,11 +1296,43 @@ type DeviceAttribute struct {
     VersionValue *string
 }
 
+type SharedCapacity struct {
+    // Name is a unique identifier among all shared capacities managed by the
+    // driver in the pool.
+    //
+    // It is referenced both when defining the total amount of shared capacity
+    // that is available, as well as by individual devices when declaring
+    // how much of this shared capacity they consume.
+    //
+    // SharedCapacity names must be a C-style identifier (e.g. "the_name") with
+    // a maximum length of 32.
+    //
+    // By limiting these names to a C-style identifier, the same validation can
+    // be used for both these names and the identifier portion of a
+    // DeviceAttribute name.
+    //
+    // +required
+    Name string `json:"name"`
+
+    // Capacity is the total capacity of the named resource.
+    // This can either represent the total *available* capacity, or the total
+    // capacity *consumed*, depending on the context where it is referenced.
+    //
+    // +required
+    Capacity resource.Quantity `json:"capacity"`
+}
+
+// CStyleIdentifierMaxLength is the maximum length of a c-style identifier used for naming.
+const CStyleIdentifierMaxLength = 32
+
 // DeviceAttributeMaxIDLength is the maximum length of the identifier in a device attribute name (`<domain>/<ID>`).
-const DeviceAttributeMaxIDLength = 32
+const DeviceAttributeMaxIDLength = CStyleIdentifierMaxLength
 
 // DeviceAttributeMaxValueLength is the maximum length of a string or version attribute value.
 const DeviceAttributeMaxValueLength = 64
+
+// SharedCapacityMaxNameLength is the maximum length of a shared capacity name.
+const SharedCapacityMaxNameLength = CStyleIdentifierMaxLength
 ```
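
The naming and size limits on `SharedCapacity` could be checked along the following lines. This is only a sketch: `validateSharedCapacityName` and `validateSharedCapacityNames` are hypothetical helpers, the regular expression is an assumed reading of "C-style identifier", and uniqueness is checked within a single list rather than across the whole pool. One thing the sketch makes visible is that hyphenated names such as `gpu-0-memory-block-0`, as used in the YAML example, would not pass a C-style identifier check as written.

```go
package main

import (
	"fmt"
	"regexp"
)

// Limits copied from the constants in the diff above.
const (
	CStyleIdentifierMaxLength      = 32
	SharedCapacityMaxNameLength    = CStyleIdentifierMaxLength
	ResourceSliceMaxSharedCapacity = 128
)

// cIdentifier is an assumed definition of a C-style identifier: a letter or
// underscore followed by letters, digits, or underscores.
var cIdentifier = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// validateSharedCapacityName checks the length and identifier rules for one name.
func validateSharedCapacityName(name string) error {
	if len(name) > SharedCapacityMaxNameLength {
		return fmt.Errorf("name %q is longer than %d characters", name, SharedCapacityMaxNameLength)
	}
	if !cIdentifier.MatchString(name) {
		return fmt.Errorf("name %q is not a C-style identifier", name)
	}
	return nil
}

// validateSharedCapacityNames checks the entry limit, each name, and
// uniqueness within the given list.
func validateSharedCapacityNames(names []string) error {
	if len(names) > ResourceSliceMaxSharedCapacity {
		return fmt.Errorf("%d entries exceed the limit of %d", len(names), ResourceSliceMaxSharedCapacity)
	}
	seen := make(map[string]bool, len(names))
	for _, name := range names {
		if err := validateSharedCapacityName(name); err != nil {
			return err
		}
		if seen[name] {
			return fmt.Errorf("duplicate name %q", name)
		}
		seen[name] = true
	}
	return nil
}

func main() {
	for _, name := range []string{"gpu_0_memory_block_0", "gpu-0-memory-block-0"} {
		fmt.Printf("%s: %v\n", name, validateSharedCapacityName(name))
	}
}
```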
 
 ###### ResourceClaim
