
Commit c222629

dynamic resource allocation: remove ResourceClaimStatus.Scheduling
Having the scheduler and drivers exchange availability information through the
PodScheduling object has several advantages:

- users don't need to see the information
- a selected node and potential nodes automatically apply to all pending claims
- drivers can make holistic decisions about resource availability, for example
  when a pod requests two distinct GPUs but only some nodes have more than one
  or when there are interdependencies with other drivers

Deallocate gets renamed to DeallocationRequested to make it describe the state
of the claim, not an imperative. The reason why it needs to remain in
ResourceClaimStatus is explained better.

Because the scheduler extender API has no support for Reserve and Unreserve,
the previous proposal for replacing usage of PodScheduling with webhook calls
is no longer applicable and would have to be extended. This may be feasible,
but is more complicated and is left out for now.
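To make the exchange concrete, below is a minimal, self-contained Go sketch of the negotiation described above. It uses simplified stand-ins for the PodScheduling types introduced in the diff that follows; the field types and the `pickNode` helper are illustrative assumptions, not part of the KEP API.

```go
package main

import "fmt"

// Simplified stand-ins for the API types added in this commit
// (PodSchedulingSpec, PodSchedulingStatus, ResourceClaimSchedulingStatus).
// Only the field names come from the KEP diff; the types are assumptions.
type PodSchedulingSpec struct {
	SelectedNode   string
	PotentialNodes []string
}

type ResourceClaimSchedulingStatus struct {
	PodResourceClaimName string
	UnsuitableNodes      []string
}

type PodSchedulingStatus struct {
	Claims []ResourceClaimSchedulingStatus
}

// pickNode mimics what the scheduler does once all drivers have reported:
// choose a potential node that no pending claim has marked as unsuitable.
func pickNode(spec PodSchedulingSpec, status PodSchedulingStatus) (string, bool) {
	unsuitable := map[string]bool{}
	for _, claim := range status.Claims {
		for _, node := range claim.UnsuitableNodes {
			unsuitable[node] = true
		}
	}
	for _, node := range spec.PotentialNodes {
		if !unsuitable[node] {
			return node, true
		}
	}
	return "", false
}

func main() {
	// Scheduler posts the nodes that fit the pod.
	spec := PodSchedulingSpec{PotentialNodes: []string{"node-a", "node-b", "node-c"}}

	// Each driver reports, per pending claim, where allocation would fail.
	status := PodSchedulingStatus{
		Claims: []ResourceClaimSchedulingStatus{
			{PodResourceClaimName: "gpu-claim", UnsuitableNodes: []string{"node-a"}},
			{PodResourceClaimName: "nic-claim", UnsuitableNodes: []string{"node-c"}},
		},
	}

	// The scheduler then commits to one node by setting spec.selectedNode,
	// which tells the drivers to allocate for that node.
	if node, ok := pickNode(spec, status); ok {
		spec.SelectedNode = node
		fmt.Println("selectedNode:", spec.SelectedNode)
	}
}
```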
1 parent 29e9e83 commit c222629

File tree

  • keps/sig-node/3063-dynamic-resource-allocation/README.md

1 file changed: +119 −119 lines

keps/sig-node/3063-dynamic-resource-allocation/README.md

Lines changed: 119 additions & 119 deletions
@@ -702,8 +702,7 @@ For a resource driver the following components are needed:
 - *Resource driver controller*: a central component which handles resource allocation
   by watching ResourceClaims and updating their status once it is done with
   allocation. It may run inside the cluster or outside of it. The only
-  hard requirement is that it can connect to the API server. Optionally,
-  it may also be configured as a [special scheduler extender](#kube-scheduler).
+  hard requirement is that it can connect to the API server.
 - *Resource kubelet plugin*: a component which cooperates with kubelet to prepare
   the usage of the resource on a node.
 
@@ -803,6 +802,22 @@ Some of the race conditions that need to be handled are:
   put the pod back into the queue, waiting for the ResourceClaim to become
   usable again.
 
+- Two pods get created which both reference the same unallocated claim with
+  delayed allocation. A single scheduler could detect this special situation
+  and then trigger allocation only for one of the two pods. But it is simpler
+  to proceed with pod scheduling for both of them independently, which implies
+  trying to select a node and allocate for it in parallel. Depending on timing,
+  the resource driver will see one of the requests for allocation first and
+  execute it. The other pod then either can share the same resource (if
+  supported) or must wait until the first one is done with it and reallocate
+  it.
+
+- Scheduling a pod and allocating resources for it has been attempted, but one
+  claim needs to be reallocated to fit the overall resource requirements. A second
+  pod gets created which references the same claim that is in the process of
+  being deallocated. Because that is visible in the claim status, scheduling
+  of the second pod cannot proceed.
+
 ### Custom parameters
 
 To support arbitrarily complex parameters, both ResourceClass and ResourceClaim
@@ -937,9 +952,6 @@ The entire scheduling section is tentative. Key opens:
 - Support arbitrary combinations of user- vs. Kubernetes-managed ResourceClaims
   and immediate vs. late allocation?
   https://github.com/kubernetes/enhancements/pull/3064#discussion_r901948474
-- Can and should `SelectedNode`, `SelectedUser`, `Deallocate` be moved to
-  `PodScheduling` or be handled differently?
-  https://github.com/pohly/enhancements/pull/13/files
 <<[/UNRESOLVED]>>
 
 
@@ -968,7 +980,7 @@ while <pod needs to be scheduled> {
        uses delayed allocation, and
        was not available on a node> {
      <randomly pick one of those resources and
-      tell resource driver to deallocate it by setting `Deallocate` and
+      tell resource driver to deallocate it by setting `claim.status.deallocationRequested` and
       removing the pod from `claim.status.reservedFor` (if present there)>
     }
   } else if <all resources allocated> {
@@ -985,7 +997,8 @@ a certain resource class, a node selector can be specified in that class. That
 selector is static and typically will use labels that determine which nodes may
 have resources available.
 
-To gather information about the current state of resource availability, the
+To gather information about the current state of resource availability and to
+trigger allocation of a claim, the
 scheduler creates a PodScheduling object. That object is owned by the pod and
 will either get deleted by the scheduler when it is done with pod scheduling or
 through the garbage collector. In the PodScheduling object, the scheduler posts
@@ -997,9 +1010,7 @@ likely to pick a node for which allocation succeeds.
 
 This scheduling information is optional and does not have to be in sync with
 the current ResourceClaim state, therefore it is okay to store it
-separately. Alternatively, the scheduler may be configured to do node filtering
-through scheduler extenders, in which case the PodScheduling object will not be
-needed.
+separately.
 
 Allowing the scheduler to trigger allocation in parallel to asking for more
 information was chosen because for pods with a single resource claim, the cost
@@ -1030,19 +1041,17 @@ else changes in the system, like for example deleting objects.
 * **scheduler** filters nodes
 * if *delayed allocation and resource not allocated yet*:
   * if *at least one node fits pod*:
-    * **scheduler** creates or updates a `PodScheduling` object with `potentialNodes=<all nodes that fit the pod>`
-    * **scheduler** picks one node, sets `claim.status.scheduling.selectedNode=<the chosen node>` and `claim.status.scheduling.selectedUser=<the pod being scheduled>`
-    * if *resource is available for `claim.status.scheduling.selectedNode`*:
-      * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress
-      * **resource driver** finishes allocation, sets `claim.status.allocation` and the
-        intended user in `claim.status.reservedFor`, clears `claim.status.selectedNode` and `claim.status.selectedUser` -> claim ready for use and reserved
-        for the pod
-    * else *scheduler needs to know that it must avoid this and possibly other nodes*:
-      * **resource driver** retrieves `PodScheduling` object from informer cache or API server
-      * **resource driver** sets `podScheduling.claims[name=name of claim in pod].unsuitableNodes`
-      * **resource driver** clears `claim.status.selectedNode` -> next attempt by scheduler has more information and is more likely to succeed
+    * **scheduler** creates or updates a `PodScheduling` object with `podScheduling.spec.potentialNodes=<nodes that fit the pod>`
+    * if *exactly one claim is pending* or *all drivers have provided information*:
+      * **scheduler** picks one node, sets `podScheduling.spec.selectedNode=<the chosen node>`
+      * if *resource is available for this selected node*:
+        * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress
+        * **resource driver** finishes allocation, sets `claim.status.allocation` and the
+          pod in `claim.status.reservedFor` -> claim ready for use and reserved for the pod
+      * else *scheduler needs to know that it must avoid this and possibly other nodes*:
+        * **resource driver** sets `podScheduling.status.claims[name=name of claim in pod].unsuitableNodes`
   * else *pod cannot be scheduled*:
-    * **scheduler** may trigger deallocation of some claim with delayed allocation by setting `claim.status.deallocate` to true
+    * **scheduler** may trigger deallocation of some claim with delayed allocation by setting `claim.status.deallocationRequested` to true
      (see [pseudo-code above](#coordinating-resource-allocation-through-the-scheduler)) or wait
 * if *pod not listed in `claim.status.reservedFor` yet* (can occur for immediate allocation):
   * **scheduler** adds it to `claim.status.reservedFor`
@@ -1216,26 +1225,22 @@ type ResourceClaimStatus struct {
     // marked for deletion.
     DriverName string
 
-    // Scheduling contains information that is only relevant while the
-    // scheduler and the resource driver are in the process of selecting a
-    // node for a Pod and the allocation mode is AllocationModeWaitForFirstConsumer. The
-    // resource driver should unset this when it has successfully allocated
-    // the resource.
-    Scheduling SchedulingStatus
-
     // Allocation is set by the resource driver once a resource has been
     // allocated successfully. Nil indicates that the resource is not
     // allocated.
     Allocation *AllocationResult
 
-    // Deallocate may be set to true to request deallocation of a resource as soon
-    // as it is unused. The scheduler uses this when it finds that deallocating
-    // the resource and reallocating it elsewhere might unblock a pod.
+    // DeallocationRequested gets set by the scheduler when it detects
+    // the situation where pod scheduling cannot proceed because some
+    // claim was allocated for a node that cannot provide some other
+    // required resource.
+    //
+    // The driver then needs to deallocate this claim and the scheduler
+    // will try again.
     //
-    // The resource driver checks this fields and resets it to false
-    // together with clearing the Allocation field. It also sets it
-    // to false when the resource is not allocated.
-    Deallocate bool
+    // While DeallocationRequested is set, no new users may be added
+    // to ReservedFor.
+    DeallocationRequested bool
 
     // ReservedFor indicates which entities are currently allowed to use
     // the resource. Usually those are Pods, but other objects are
@@ -1245,7 +1250,8 @@ type ResourceClaimStatus struct {
     // A scheduler must add a Pod that it is scheduling. This must be done
    // in an atomic ResourceClaim update because there might be multiple
     // schedulers working on different Pods that compete for access to the
-    // same ResourceClaim.
+    // same ResourceClaim, the ResourceClaim might have been marked
+    // for deletion, or even been deallocated already.
     //
     // kubelet will check this before allowing a Pod to run because a
     // a user might have selected a node manually without reserving
@@ -1269,41 +1275,6 @@ type ResourceClaimStatus struct {
     <<[/UNRESOLVED]>>
 }
 
-// SchedulingStatus contains information that is relevant while
-// a Pod with delayed allocation is being scheduled.
-type SchedulingStatus struct {
-    // When allocation is delayed, the scheduler must set
-    // the node for which it wants the resource to be allocated
-    // before the driver proceeds with allocation.
-    //
-    // This field may only be set by a scheduler, but not get
-    // overwritten. It will get reset by the driver when allocation
-    // succeeds or fails. This ensures that different schedulers
-    // that handle different Pods do not accidentally trigger
-    // allocation for different nodes.
-    //
-    // For immediate allocation, the scheduler will not set
-    // this field. The resource driver controller may
-    // set it to trigger allocation on a specific node if the
-    // resources are local to nodes.
-    //
-    // List/watch requests for ResourceClaims can filter on this field
-    // using a "status.scheduling.scheduler.selectedNode=NAME"
-    // fieldSelector.
-    SelectedNode string
-
-    // SelectedUser may be set by the scheduler together with SelectedNode
-    // to the Pod that it is scheduling. The resource driver then may set
-    // both the Allocation and the ReservedFor field when it is done with
-    // a successful allocation.
-    //
-    // This is an optional optimization that saves one API server call
-    // and one pod scheduling attempt in the scheduler because the resource
-    // will be ready for use by the Pod the next time that the scheduler
-    // tries to schedule it.
-    SelectedUser *metav1.OwnerReference
-}
-
 // AllocationResult contains attributed of an allocated resource.
 type AllocationResult struct {
     // ResourceHandle contains arbitrary data returned by the driver after a
@@ -1336,13 +1307,9 @@ type AllocationResult struct {
     SharedResource bool
 }
 
-// PodScheduling objects get created by a scheduler when it needs
-// information from resource driver(s) while scheduling a Pod that uses
-// one or more unallocated ResourceClaims with delayed allocation.
-//
-// Alternatively, a scheduler extender might be configured for the
-// resource driver(s). If all drivers have one, this object is not
-// needed.
+// PodScheduling objects get created by a scheduler when it handles
+// a pod which uses one or more unallocated ResourceClaims with delayed
+// allocation.
 type PodScheduling struct {
     metav1.TypeMeta
 
@@ -1351,41 +1318,100 @@ type PodScheduling struct {
     // to ensure that the PodScheduling object gets deleted
     // when no longer needed. Normally the scheduler will delete it.
     //
+    // Drivers must ignore PodScheduling objects where the owning
+    // pod already got deleted because such objects are orphaned
+    // and will be removed soon.
+    //
     // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
     metav1.ObjectMeta
 
+    // Spec is set and updated by the scheduler.
+    Spec PodSchedulingSpec
+
+    // Status is updated by resource drivers.
+    Status PodSchedulingStatus
+}
+
+// PodSchedulingSpec contains the request for information about
+// resources required by a pod and eventually communicates
+// the decision of the scheduler to move ahead with pod scheduling
+// for a specific node.
+type PodSchedulingSpec struct {
+    // When allocation is delayed, the scheduler must set
+    // the node for which it wants the resource(s) to be allocated
+    // before the driver(s) start with allocation.
+    //
+    // The driver must ensure that the allocated resource
+    // is available on this node or update ResourceSchedulingStatus.UnsuitableNodes
+    // to indicate where allocation might succeed.
+    //
+    // When allocation succeeds, drivers should immediately add
+    // the pod to the ResourceClaimStatus.ReservedFor field
+    // together with setting ResourceClaimStatus.Allocated. This
+    // optimization may save scheduling attempts and roundtrips
+    // through the API server because the scheduler does not
+    // need to reserve the claim for the pod itself.
+    //
+    // The selected node may change over time, for example
+    // when the initial choice turns out to be unsuitable
+    // after all. Drivers must not reallocate for a different
+    // node when they see such a change because it would
+    // lead to race conditions. Instead, the scheduler
+    // will trigger deallocation of specific claims as
+    // needed through the ResourceClaimStatus.DeallocationRequested
+    // field.
+    SelectedNode string
+
     // When allocation is delayed, and the scheduler needs to
     // decide on which node a Pod should run, it will
     // ask the driver(s) on which nodes the resource might be
     // made available. To trigger that check, the scheduler
     // provides the names of nodes which might be suitable
     // for the Pod. Will be updated periodically until
-    // the claim is allocated.
+    // all resources are allocated.
     //
     // The ResourceClass.SuiteableNodes node selector can be
     // used to filter out nodes based on labels. This prevents
     // adding nodes here that the driver then would need to
     // reject through UnsuitableNodes.
+    //
+    // The size of this field is limited to 256. This is large
+    // enough for many clusters. Larger clusters may need more
+    // attempts to find a node that suits all pending resources.
     PotentialNodes []string
+}
 
+// PodSchedulingStatus is where resource drivers provide
+// information about where they could allocate a resource
+// and whether allocation failed.
+type PodSchedulingStatus struct {
     // Each resource driver is responsible for providing information about
-    // those claims in the Pod that the driver manages. It can skip
-    // adding that information when it already allocated the claim.
+    // those resources in the Pod that the driver manages. It can skip
+    // adding that information when it already allocated the resource.
+    //
+    // A driver must add entries here for all its pending claims, even if
+    // the ResourceSchedulingStatus.UnsuitableNodes field is empty,
+    // because the scheduler may decide to wait with selecting
+    // a node until it has information from all drivers.
     //
     // +listType=map
     // +listMapKey=podResourceClaimName
     // +optional
-    Claims []ResourceClaimScheduling
+    Claims []ResourceClaimSchedulingStatus
+
+    // If there ever is a need to support other kinds of resources
+    // than ResourceClaim, then new fields could get added here
+    // for those other resources.
 }
 
-// ResourceClaimScheduling contains information about one
-// particular claim in a pod while scheduling that pod.
-type ResourceClaimScheduling struct {
+// ResourceClaimSchedulingStatus contains information about one
+// particular claim while scheduling a pod.
+type ResourceClaimSchedulingStatus struct {
     // PodResourceClaimName matches the PodResourceClaim.Name field.
     PodResourceClaimName string
 
-    // A change of the PotentialNodes field in the PodScheduling object
-    // triggers a check in the driver
+    // A change of the PodSchedulingSpec.PotentialNodes field and/or a failed
+    // allocation attempt trigger a check in the driver
     // on which of those nodes the resource might be made available. It
     // then excludes nodes by listing those where that is not the case in
     // UnsuitableNodes.
@@ -1566,28 +1592,6 @@ notices this, the current scheduling attempt for the pod must stop and the pod
 needs to be put back into the work queue. It then gets retried whenever a
 ResourceClaim gets added or modified.
 
-In addition, kube-scheduler can be configured to contact a resource driver
-directly as a scheduler extender. This can avoid the need to communicate the
-list of potential and unsuitable nodes through the apiserver:
-
-```
-type Extender struct {
-    ...
-    // ManagedResourceDrivers is a list of resource driver names that are managed
-    // by this extender. A pod will be sent to the extender on the Filter, Prioritize
-    // and Bind (if the extender is the binder) phases iff the pod requests at least
-    // one ResourceClaim for which the resource driver name in the corresponding
-    // ResourceClass is listed here. In addition, the builtin dynamic resources
-    // plugin will skip creation and updating of the PodScheduling object
-    // if all claims in the pod have an extender with the FilterVerb.
-    ManagedResourceDrivers []string
-```
-
-The existing extender plugin must check this field to decide when to contact
-the extender. It will get added to the most recent scheduler configuration API
-version as a feature-gated field. Not adding it to older versions is meant to
-encourage using the current API version.
-
 The following extension points are implemented in the new plugin:
 
 #### Pre-filter
@@ -1635,7 +1639,7 @@ conditions apply:
 Filter
 
 One of the ResourceClaims satisfying these criteria is picked randomly and deallocation
-is requested by setting the Deallocate field. The scheduler then needs to wait
+is requested by setting the ResourceClaimStatus.DeallocationRequested field. The scheduler then needs to wait
 for the resource driver to react to that change and deallocate the resource.
 
 This may make it possible to run the Pod
@@ -1656,30 +1660,26 @@ gets updated now if the field doesn't
 match the current list already. If no PodScheduling object exists yet,
 it gets created.
 
-When there is a scheduler extender configured for the claim, creating and
-updating PodScheduling objects gets skipped because the scheduler
-extender handles filtering.
-
 #### Reserve
 
 A node has been chosen for the Pod.
 
 If using delayed allocation and the resource has not been allocated yet,
-the SelectedNode field of the ResourceClaim
+the PodSchedulingSpec.SelectedNode field
 gets set here and the scheduling attempt gets stopped for now. It will be
-retried when the ResourceClaim status changes.
+retried when the ResourceClaim or PodScheduling statuses change.
 
 If all resources have been allocated already,
 the scheduler adds the Pod to the `claim.status.reservedFor` field of its ResourceClaims to ensure that
 no-one else gets to use those.
 
 If some resources are not allocated yet or reserving an allocated resource
 fails, the scheduling attempt needs to be aborted and retried at a later time
-or when the ResourceClaims change.
+or when the statuses change (same as above).
 
 #### Unreserve
 
-The scheduler removes the Pod from the ReservedFor field because it cannot be scheduled after
+The scheduler removes the Pod from the ResourceClaimStatus.ReservedFor field because it cannot be scheduled after
 all.
 
 ### Cluster Autoscaler
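For the renamed DeallocationRequested field, here is a minimal Go sketch of how a resource driver's control loop might react to it, based on the field comments in the diff above. The types are simplified stand-ins, the `freeResource` helper is hypothetical, and resetting DeallocationRequested after cleanup mirrors the old Deallocate semantics and is an assumption rather than something this commit specifies.

```go
package main

import "fmt"

// Simplified stand-in for ResourceClaimStatus; only the field names
// (Allocation, DeallocationRequested, ReservedFor) come from the KEP diff.
type AllocationResult struct{ ResourceHandle string }

type ResourceClaimStatus struct {
	Allocation            *AllocationResult
	DeallocationRequested bool
	ReservedFor           []string // pods currently allowed to use the claim (simplified)
}

// reconcile sketches the driver-side reaction: once the scheduler has set
// DeallocationRequested and no pod uses the claim anymore, the driver frees
// the resource and clears Allocation so the scheduler can try again.
func reconcile(status *ResourceClaimStatus) {
	if !status.DeallocationRequested || status.Allocation == nil {
		return
	}
	if len(status.ReservedFor) > 0 {
		// Still in use. While DeallocationRequested is set, no new users
		// may be added to ReservedFor, so the claim will eventually drain.
		return
	}
	freeResource(status.Allocation) // hypothetical driver-specific cleanup
	status.Allocation = nil
	// Assumed: reset the flag together with clearing Allocation,
	// mirroring the documented behavior of the old Deallocate field.
	status.DeallocationRequested = false
}

func freeResource(a *AllocationResult) {
	fmt.Println("deallocated", a.ResourceHandle)
}

func main() {
	claim := &ResourceClaimStatus{
		Allocation:            &AllocationResult{ResourceHandle: "gpu-0"},
		DeallocationRequested: true,
	}
	reconcile(claim)
	fmt.Printf("allocated=%v deallocationRequested=%v\n",
		claim.Allocation != nil, claim.DeallocationRequested)
}
```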
