keps/sig-node/4381-dra-structured-parameters/README.md (+201 -8)
@@ -500,7 +500,7 @@ as described in the next paragraph. However, the source of this data may vary; f
example, a cloud provider controller could populate this based upon information
from the cloud provider API.

-In the kubelet case, each driver running on a node publishes a set of
+In the node-local case, each driver running on a node publishes a set of
`ResourceSlice` objects to the API server for its own resources, using its
connection to the apiserver. The collection of these objects forms a pool from
which resources can be allocated. Some additional fields (defined in the API
@@ -514,7 +514,7 @@ and `driverName` fields in each `ResourceSlice` object are used to determine whi
managed by which driver instance. The owner reference ensures that objects
belonging to a node get cleaned up when the node gets removed.
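
For illustration, a node-local driver could set such an owner reference on the `ResourceSlice` objects it publishes roughly as follows; the helper and the slice naming scheme are made up for this sketch, and only the object metadata is shown:

```go
// Sketch only: an owner reference pointing at the Node, so that ResourceSlice
// objects are garbage-collected when the Node object is removed.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// nodeOwnedMeta builds ObjectMeta for a slice owned by the given Node.
// The function name and slice naming are illustrative, not part of the KEP.
func nodeOwnedMeta(sliceName, nodeName string, nodeUID types.UID) metav1.ObjectMeta {
	controller := true
	return metav1.ObjectMeta{
		Name: sliceName,
		OwnerReferences: []metav1.OwnerReference{{
			APIVersion: "v1",
			Kind:       "Node",
			Name:       nodeName,
			UID:        nodeUID,
			Controller: &controller,
		}},
	}
}

func main() {
	meta := nodeOwnedMeta("worker-1-gpu.dra.example.com", "worker-1", types.UID("uid-of-worker-1"))
	fmt.Printf("%+v\n", meta.OwnerReferences[0])
}
```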

-In addition, whenever kubelet starts, it first deletes all `ResourceSlices`
+In addition, whenever the kubelet starts, it first deletes all `ResourceSlices`
belonging to the node with a `DeleteCollection` call that uses the node name in
a field filter. This ensures that no pods depending on DRA get scheduled to the
node until the required DRA drivers have started up again (node reboot) and
@@ -530,7 +530,9 @@ reconstructed by the driver. This has no effect on already allocated claims
because the allocation result is tracked in those claims, not the
`ResourceSlice` objects (see [below](#state-and-communication)).
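
A minimal sketch of that startup cleanup, written against client-go, is shown below; the `resource.k8s.io/v1alpha3` typed client and the `spec.nodeName` field selector key are assumptions based on the API version used in this KEP's examples, not a description of the actual kubelet code:

```go
// Sketch only: delete every ResourceSlice that belongs to this node, as the
// kubelet is described to do on startup. The group version and the field
// selector key are assumptions for illustration.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func cleanupNodeResourceSlices(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	return client.ResourceV1alpha3().ResourceSlices().DeleteCollection(
		ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{FieldSelector: "spec.nodeName=" + nodeName},
	)
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := cleanupNodeResourceSlices(context.Background(), client, "worker-1"); err != nil {
		panic(err)
	}
}
```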

-Embedded inside each `ResourceSlice` is the representation of one or more devices.
+#### Devices as a named list of attributes
+
+Embedded inside each `ResourceSlice` is a list of one or more devices, represented as a named list of attributes.

```yaml
kind: ResourceSlice
@@ -584,6 +586,146 @@ would allow us to delete the pod and trying again with a new one, but is not don
at the moment because admission checks cannot be retried if a check finds
a transient problem.

+#### Partitionable devices
+
+In addition to devices, a `ResourceSlice` can also embed a list of
+`SharedCapacity` objects. Each `SharedCapacity` object represents some amount
+of shared "capacity" that can be consumed by one or more devices listed in the
+slice. When listing such devices, the sum of the capacity consumed across all
+devices *may* exceed the total amount available in the `SharedCapacity` object.
+This allows one, for example, to logically partition a device
+into a set of overlapping sub-devices, each of which "could" be allocated by a
+scheduler (just not at the same time).
+
+As such, scheduler support will need to be added to track the set of
+`SharedCapacity` objects provided by each `ResourceSlice` as well as perform
+the following steps to decide if a given device is a candidate for allocation
+or not (a code sketch follows the list):
+
+1. Look at the set of `SharedCapacityConsumed` objects referenced by the device
+1. For each `SharedCapacityConsumed` object, check how much capacity is still available in the `SharedCapacity` object it references by name
+1. If enough capacity is still available in the `SharedCapacity` object to satisfy the device's consumption, continue to consider it for allocation
+1. If not enough capacity is available, move on to the next device
+1. Upon deciding to allocate a device, subtract all of the capacity in its `SharedCapacityConsumed` objects from the corresponding `SharedCapacity` objects being tracked
+1. Upon freeing a device, add all of the capacity in its `SharedCapacityConsumed` objects back into the corresponding `SharedCapacity` objects being tracked
+
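
A rough, self-contained sketch of this bookkeeping is shown below; all names (`capacityTracker`, `device`, and so on) are hypothetical stand-ins for whatever the scheduler plugin ends up using, and capacities are simplified to plain integers:

```go
// Sketch only: in-memory accounting of SharedCapacity for one ResourceSlice.
// Type and function names are hypothetical; capacities are plain integers here.
package main

import "fmt"

// device captures only what the capacity check needs: how much the device
// consumes from each named SharedCapacity (its sharedCapacityConsumed list).
type device struct {
	name     string
	consumed map[string]int64
}

// capacityTracker holds the remaining amount of each SharedCapacity declared
// in a single ResourceSlice. It lives in scheduler memory only and is never
// written back to the slice.
type capacityTracker struct {
	remaining map[string]int64
}

// fits reports whether every SharedCapacity the device consumes still has
// enough remaining capacity (steps 1-4 above).
func (t *capacityTracker) fits(d device) bool {
	for name, want := range d.consumed {
		if t.remaining[name] < want {
			return false
		}
	}
	return true
}

// allocate subtracts the device's consumption from the tracker (step 5).
func (t *capacityTracker) allocate(d device) {
	for name, want := range d.consumed {
		t.remaining[name] -= want
	}
}

// free adds the device's consumption back to the tracker (step 6).
func (t *capacityTracker) free(d device) {
	for name, want := range d.consumed {
		t.remaining[name] += want
	}
}

func main() {
	t := &capacityTracker{remaining: map[string]int64{"block-0": 1, "block-1": 1}}
	half := device{name: "half", consumed: map[string]int64{"block-0": 1, "block-1": 1}}
	quarter := device{name: "quarter", consumed: map[string]int64{"block-0": 1}}

	fmt.Println(t.fits(half)) // true
	t.allocate(half)
	fmt.Println(t.fits(quarter)) // false: block-0 is fully consumed
	t.free(half)
	fmt.Println(t.fits(quarter)) // true again
}
```

Tracking per `ResourceSlice` is sufficient here because, as noted next, devices can only consume `SharedCapacity` declared in the same slice.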
+**Note:** Only devices embedded in the _same_ `ResourceSlice` where a given
+`SharedCapacity` is declared have access to that `SharedCapacity`. This
+restriction simplifies the logic required to track these capacities in the
+scheduler, and shouldn't be too limiting in practice. Also, note that while it
+is described as "subtracting" and "adding" capacity in the `SharedCapacity`
+object, these activities are not written back to the `ResourceSlice` itself.
+Just like all summaries of current allocations, they are maintained in-memory
+by the scheduler based on the totality of device allocations recorded in all
+claim statuses.
+
+As an example, consider the following YAML which declares shared capacity for a
+set of "memory" blocks within a GPU. Multiple devices (including the full GPU)
+are defined which consume some number of these memory blocks at varying
+physical locations within GPU memory. With this in place, the scheduler is free
+to choose any device that matches a provided device selector, so long as no
+other devices have already been allocated that consume overlapping memory
+blocks.
+
+```yaml
+kind: ResourceSlice
+apiVersion: resource.k8s.io/v1alpha3
+...
+spec:
+  # The node name indicates the node.
+  # Each driver on a node provides pools of devices for allocation,
+  # with unique device names inside each pool.
+  # Usually, but not necessarily, that pool name is the same as the
+  # node name.
+  nodeName: worker-1
+  poolName: worker-1
+  driverName: gpu.dra.example.com
+  sharedCapacity:
+  - name: gpu-0-memory-block-0
+    capacity: 1
+  - name: gpu-0-memory-block-1
+    capacity: 1
+  - name: gpu-0-memory-block-2
+    capacity: 1
+  - name: gpu-0-memory-block-3
+    capacity: 1
+  devices:
+  - name: gpu-0
+    attributes:
+    - name: memory
+      quantity: 40Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-0
+      capacity: 1
+    - name: gpu-0-memory-block-1
+      capacity: 1
+    - name: gpu-0-memory-block-2
+      capacity: 1
+    - name: gpu-0-memory-block-3
+      capacity: 1
+  - name: gpu-0-first-half
+    attributes:
+    - name: memory
+      quantity: 20Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-0
+      capacity: 1
+    - name: gpu-0-memory-block-1
+      capacity: 1
+  - name: gpu-0-middle-half
+    attributes:
+    - name: memory
+      quantity: 20Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-1
+      capacity: 1
+    - name: gpu-0-memory-block-2
+      capacity: 1
+  - name: gpu-0-second-half
+    attributes:
+    - name: memory
+      quantity: 20Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-2
+      capacity: 1
+    - name: gpu-0-memory-block-3
+      capacity: 1
+  - name: gpu-0-first-quarter
+    attributes:
+    - name: memory
+      quantity: 10Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-0
+      capacity: 1
+  - name: gpu-0-second-quarter
+    attributes:
+    - name: memory
+      quantity: 10Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-1
+      capacity: 1
+  - name: gpu-0-third-quarter
+    attributes:
+    - name: memory
+      quantity: 10Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-2
+      capacity: 1
+  - name: gpu-0-fourth-quarter
+    attributes:
+    - name: memory
+      quantity: 10Gi
+    sharedCapacityConsumed:
+    - name: gpu-0-memory-block-3
+      capacity: 1
+```
+
+In this example, `gpu-0-first-half` and `gpu-0-second-half` could be allocated
+simultaneously (because the sets of `gpu-0-memory-block`s they consume are
+mutually exclusive). However, `gpu-0-first-half` and `gpu-0-first-quarter`
+could not (because `gpu-0-memory-block-0` is consumed completely by both of
+them).
+
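With the same simplification as in the earlier sketch (every memory block has capacity 1), the conflict described here reduces to a set-overlap check. The helper below is purely illustrative:

```go
// Sketch only: with one-unit memory blocks, two devices can be allocated
// together exactly when the blocks they consume do not overlap.
package main

import "fmt"

func disjoint(a, b map[string]int64) bool {
	for name := range a {
		if _, shared := b[name]; shared {
			return false
		}
	}
	return true
}

func main() {
	firstHalf := map[string]int64{"gpu-0-memory-block-0": 1, "gpu-0-memory-block-1": 1}
	secondHalf := map[string]int64{"gpu-0-memory-block-2": 1, "gpu-0-memory-block-3": 1}
	firstQuarter := map[string]int64{"gpu-0-memory-block-0": 1}

	fmt.Println(disjoint(firstHalf, secondHalf))   // true: can be allocated together
	fmt.Println(disjoint(firstHalf, firstQuarter)) // false: both need block 0
}
```
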
### Using structured parameters

A ResourceClaim is a request to allocate one or more devices. Each request in a
@@ -1038,6 +1180,15 @@ type ResourceSliceSpec struct {
    // seen all slices.
    PoolDeviceCount int64

+    // SharedCapacity defines the set of shared capacity consumable by
+    // devices in this ResourceSlice.
+    //
+    // Must not have more than 128 entries.
+    //
+    // +listType=atomic
+    // +optional
+    SharedCapacity []SharedCapacity

    // Devices lists all available devices in this pool.
    //
    // Must not have more than 128 entries.
@@ -1048,6 +1199,7 @@ type ResourceSliceSpec struct {
    // them) empty pool.
}

+const ResourceSliceMaxSharedCapacity = 128
const ResourceSliceMaxDevices = 128
const PoolNameMaxLength = validation.DNS1123SubdomainMaxLength // Same as for a single node name.
```
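
The `SharedCapacity` type referenced above is not included in this excerpt. Purely as an assumption inferred from the YAML example (a name plus an amount, expressed here as a `resource.Quantity`), it might look roughly like this:

```go
// Assumed shape only; the actual definition lives elsewhere in the KEP and may
// differ. A SharedCapacity pairs a name with the total amount available to
// devices in the same ResourceSlice.
type SharedCapacity struct {
    // Name uniquely identifies this capacity within the ResourceSlice.
    Name string

    // Capacity is the total amount that devices in this slice may consume,
    // in aggregate, at any one time.
    Capacity resource.Quantity
}
```
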
@@ -1086,14 +1238,23 @@ type Device struct {
    // +optional
    Attributes []DeviceAttribute

-    // TODO for 1.31: define how to support partitionable devices
+    // SharedCapacityConsumed defines the set of shared capacity consumed by