Having the scheduler and drivers exchange availability information through the
PodScheduling object has several advantages:

- users don't need to see the information
- a selected node and potential nodes automatically apply to all pending claims
- drivers can make holistic decisions about resource availability, for example
  when a pod requests two distinct GPUs but only some nodes have more than one,
  or when there are interdependencies with other drivers (see the sketch after
  this list)
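
A minimal Go sketch of that last point, assuming per-node GPU counts that a
hypothetical driver already tracks (`gpusPerNode`, `unsuitableNodes` and the
claim count are illustrative names, not part of the proposal): a per-claim
check could not notice that two claims which fit individually do not fit
together on the same node.

```go
// Hypothetical illustration of a "holistic" driver decision across all
// pending claims of one pod. Not part of the proposed API.
package main

import "fmt"

// unsuitableNodes returns the potential nodes that cannot satisfy *all*
// pending GPU claims of the pod at once.
func unsuitableNodes(potentialNodes []string, gpusPerNode map[string]int, pendingClaims int) []string {
	var unsuitable []string
	for _, node := range potentialNodes {
		if gpusPerNode[node] < pendingClaims {
			unsuitable = append(unsuitable, node)
		}
	}
	return unsuitable
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	gpus := map[string]int{"node-a": 1, "node-b": 2, "node-c": 0}
	// A pod with two distinct GPU claims: only node-b can host both.
	fmt.Println(unsuitableNodes(nodes, gpus, 2)) // [node-a node-c]
}
```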
`Deallocate` gets renamed to `DeallocationRequested` so that it describes the
state of the claim rather than an imperative. The reason why it needs to remain
in ResourceClaimStatus is now explained in more detail.
Because the scheduler extender API has no support for Reserve and Unreserve,
the previous proposal for replacing usage of PodScheduling with webhook calls
is no longer applicable and would have to be extended. This may be feasible,
but is more complicated and is left out for now.
@@ -968,7 +980,7 @@ while <pod needs to be scheduled> {
        uses delayed allocation, and
        was not available on a node> {
     <randomly pick one of those resources and
-     tell resource driver to deallocate it by setting `Deallocate` and
+     tell resource driver to deallocate it by setting `claim.status.deallocationRequested` and
      removing the pod from `claim.status.reservedFor` (if present there)>
   }
 } else if <all resources allocated> {
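
For illustration, a hedged Go sketch of this deallocation step, using
simplified stand-ins for the proposed API types (`OwnerReference` here only
carries a UID). It mirrors the two status updates from the pseudo-code; it is
not actual scheduler plugin code.

```go
// Sketch of the "request deallocation" step from the pseudo-code above.
package sketch

// OwnerReference stands in for metav1.OwnerReference; only UID matters here.
type OwnerReference struct {
	UID string
}

// ResourceClaimStatus mirrors just the fields used by the pseudo-code.
type ResourceClaimStatus struct {
	DeallocationRequested bool
	ReservedFor           []OwnerReference
}

// requestDeallocation marks the claim for deallocation and drops the pod
// from ReservedFor, matching `claim.status.deallocationRequested` and
// `claim.status.reservedFor` in the pseudo-code.
func requestDeallocation(status *ResourceClaimStatus, podUID string) {
	status.DeallocationRequested = true
	reserved := status.ReservedFor[:0]
	for _, ref := range status.ReservedFor {
		if ref.UID != podUID {
			reserved = append(reserved, ref)
		}
	}
	status.ReservedFor = reserved
}
```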
@@ -985,7 +997,8 @@ a certain resource class, a node selector can be specified in that class. That
 selector is static and typically will use labels that determine which nodes may
 have resources available.

-To gather information about the current state of resource availability, the
+To gather information about the current state of resource availability and to
+trigger allocation of a claim, the
 scheduler creates a PodScheduling object. That object is owned by the pod and
 will either get deleted by the scheduler when it is done with pod scheduling or
 through the garbage collector. In the PodScheduling object, the scheduler posts
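
A small sketch of the ownership part, assuming the PodScheduling type proposed
further down and the existing `metav1.OwnerReference`: the controller owner
reference is what lets the garbage collector remove the object if the scheduler
does not. Naming the object after the pod is an illustrative choice here, not
something the text above prescribes.

```go
// Sketch: building a PodScheduling object that is owned by its pod.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodScheduling here only carries the metadata needed for the example;
// Spec and Status are defined later in the proposal.
type PodScheduling struct {
	metav1.ObjectMeta
}

// newPodScheduling builds a PodScheduling in the pod's namespace with the
// pod as controlling owner, so garbage collection cleans it up.
func newPodScheduling(pod *corev1.Pod) *PodScheduling {
	controller := true
	return &PodScheduling{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name,
			Namespace: pod.Namespace,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       pod.Name,
				UID:        pod.UID,
				Controller: &controller,
			}},
		},
	}
}
```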
@@ -997,9 +1010,7 @@ likely to pick a node for which allocation succeeds.

 This scheduling information is optional and does not have to be in sync with
 the current ResourceClaim state, therefore it is okay to store it
-separately. Alternatively, the scheduler may be configured to do node filtering
-through scheduler extenders, in which case the PodScheduling object will not be
-needed.
+separately.

 Allowing the scheduler to trigger allocation in parallel to asking for more
 information was chosen because for pods with a single resource claim, the cost
@@ -1030,19 +1041,17 @@ else changes in the system, like for example deleting objects.
 * **scheduler** filters nodes
 * if *delayed allocation and resource not allocated yet*:
   * if *at least one node fits pod*:
-    * **scheduler** creates or updates a `PodScheduling` object with `potentialNodes=<all nodes that fit the pod>`
-    * **scheduler** picks one node, sets `claim.status.scheduling.selectedNode=<the chosen node>` and `claim.status.scheduling.selectedUser=<the pod being scheduled>`
-    * if *resource is available for `claim.status.scheduling.selectedNode`*:
-      * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress
-      * **resource driver** finishes allocation, sets `claim.status.allocation` and the
-        intended user in `claim.status.reservedFor`, clears `claim.status.selectedNode` and `claim.status.selectedUser` -> claim ready for use and reserved
-        for the pod
-    * else *scheduler needs to know that it must avoid this and possibly other nodes*:
-      * **resource driver** retrieves `PodScheduling` object from informer cache or API server
-      * **resource driver** sets `podScheduling.claims[name=name of claim in pod].unsuitableNodes`
-      * **resource driver** clears `claim.status.selectedNode` -> next attempt by scheduler has more information and is more likely to succeed
+    * **scheduler** creates or updates a `PodScheduling` object with `podScheduling.spec.potentialNodes=<nodes that fit the pod>`
+    * if *exactly one claim is pending* or *all drivers have provided information*:
+      * **scheduler** picks one node, sets `podScheduling.spec.selectedNode=<the chosen node>`
+      * if *resource is available for this selected node*:
+        * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress
+        * **resource driver** finishes allocation, sets `claim.status.allocation` and the
+          pod in `claim.status.reservedFor` -> claim ready for use and reserved for the pod
+      * else *scheduler needs to know that it must avoid this and possibly other nodes*:
+        * **resource driver** sets `podScheduling.status.claims[name=name of claim in pod].unsuitableNodes`
   * else *pod cannot be scheduled*:
-    * **scheduler** may trigger deallocation of some claim with delayed allocation by setting `claim.status.deallocate` to true
+    * **scheduler** may trigger deallocation of some claim with delayed allocation by setting `claim.status.deallocationRequested` to true
       (see [pseudo-code above](#coordinating-resource-allocation-through-the-scheduler)) or wait
 * if *pod not listed in `claim.status.reservedFor` yet* (can occur for immediate allocation):
   * **scheduler** adds it to `claim.status.reservedFor`
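
A sketch of the resource driver branch in this flow, with simplified stand-ins
for the proposed types and with driver-specific logic reduced to two callback
parameters (`available`, `allocate`, both illustrative). A real controller
would also add the finalizer, update the claim status, and replace an existing
entry for the claim instead of always appending.

```go
// Sketch of the driver-side reaction to the scheduler's selected node.
package sketch

type PodSchedulingSpec struct {
	SelectedNode   string
	PotentialNodes []string
}

type ResourceClaimSchedulingStatus struct {
	PodResourceClaimName string
	UnsuitableNodes      []string
}

type PodSchedulingStatus struct {
	Claims []ResourceClaimSchedulingStatus
}

// handleClaim decides what the driver controller does for one pending claim:
// allocate on the selected node if possible, otherwise report which of the
// potential nodes the scheduler should avoid.
func handleClaim(spec PodSchedulingSpec, status *PodSchedulingStatus, claimName string,
	available func(node string) bool, allocate func(node string)) {
	if spec.SelectedNode != "" && available(spec.SelectedNode) {
		// Corresponds to: add finalizer, allocate, set claim.status.allocation
		// and claim.status.reservedFor (not shown here).
		allocate(spec.SelectedNode)
		return
	}
	// Otherwise tell the scheduler which potential nodes to avoid.
	var unsuitable []string
	for _, node := range spec.PotentialNodes {
		if !available(node) {
			unsuitable = append(unsuitable, node)
		}
	}
	status.Claims = append(status.Claims, ResourceClaimSchedulingStatus{
		PodResourceClaimName: claimName,
		UnsuitableNodes:      unsuitable,
	})
}
```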
@@ -1216,26 +1225,22 @@ type ResourceClaimStatus struct {
     // marked for deletion.
     DriverName string

-    // Scheduling contains information that is only relevant while the
-    // scheduler and the resource driver are in the process of selecting a
-    // node for a Pod and the allocation mode is AllocationModeWaitForFirstConsumer. The
-    // resource driver should unset this when it has successfully allocated
-    // the resource.
-    Scheduling SchedulingStatus
-
     // Allocation is set by the resource driver once a resource has been
     // allocated successfully. Nil indicates that the resource is not
     // allocated.
     Allocation *AllocationResult

-    // Deallocate may be set to true to request deallocation of a resource as soon
-    // as it is unused. The scheduler uses this when it finds that deallocating
-    // the resource and reallocating it elsewhere might unblock a pod.
+    // DeallocationRequested gets set by the scheduler when it detects
+    // the situation where pod scheduling cannot proceed because some
+    // claim was allocated for a node that cannot provide some other
+    // required resource.
+    //
+    // The driver then needs to deallocate this claim and the scheduler
+    // will try again.
     //
-    // The resource driver checks this fields and resets it to false
-    // together with clearing the Allocation field. It also sets it
-    // to false when the resource is not allocated.
-    Deallocate bool
+    // While DeallocationRequested is set, no new users may be added
+    // to ReservedFor.
+    DeallocationRequested bool

     // ReservedFor indicates which entities are currently allowed to use
     // the resource. Usually those are Pods, but other objects are
@@ -1245,7 +1250,8 @@ type ResourceClaimStatus struct {
     // A scheduler must add a Pod that it is scheduling. This must be done
     // in an atomic ResourceClaim update because there might be multiple
     // schedulers working on different Pods that compete for access to the
-    // same ResourceClaim.
+    // same ResourceClaim, the ResourceClaim might have been marked
+    // for deletion, or even been deallocated already.
     //
     // kubelet will check this before allowing a Pod to run because a
     // a user might have selected a node manually without reserving
@@ -1269,41 +1275,6 @@ type ResourceClaimStatus struct {
     <<[/UNRESOLVED]>>
 }

-// SchedulingStatus contains information that is relevant while
-// a Pod with delayed allocation is being scheduled.
-type SchedulingStatus struct {
-    // When allocation is delayed, the scheduler must set
-    // the node for which it wants the resource to be allocated
-    // before the driver proceeds with allocation.
-    //
-    // This field may only be set by a scheduler, but not get
-    // overwritten. It will get reset by the driver when allocation
-    // succeeds or fails. This ensures that different schedulers
-    // that handle different Pods do not accidentally trigger
-    // allocation for different nodes.
-    //
-    // For immediate allocation, the scheduler will not set
-    // this field. The resource driver controller may
-    // set it to trigger allocation on a specific node if the
-    // resources are local to nodes.
-    //
-    // List/watch requests for ResourceClaims can filter on this field
-    // using a "status.scheduling.scheduler.selectedNode=NAME"
-    // fieldSelector.
-    SelectedNode string
-
-    // SelectedUser may be set by the scheduler together with SelectedNode
-    // to the Pod that it is scheduling. The resource driver then may set
-    // both the Allocation and the ReservedFor field when it is done with
-    // a successful allocation.
-    //
-    // This is an optional optimization that saves one API server call
-    // and one pod scheduling attempt in the scheduler because the resource
-    // will be ready for use by the Pod the next time that the scheduler
-    // tries to schedule it.
-    SelectedUser *metav1.OwnerReference
-}
-
 // AllocationResult contains attributed of an allocated resource.
 type AllocationResult struct {
     // ResourceHandle contains arbitrary data returned by the driver after a
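
A sketch of how a driver controller might honor `DeallocationRequested` under
the contract above (no new users while it is set, so `ReservedFor` eventually
drains). The types are simplified stand-ins for the proposed API and
`deallocate` is a placeholder for the driver-specific cleanup.

```go
// Sketch of driver-side handling of DeallocationRequested.
package sketch

type AllocationResult struct {
	ResourceHandle string
}

type ResourceClaimStatus struct {
	Allocation            *AllocationResult
	DeallocationRequested bool
	ReservedFor           []string // pod UIDs, simplified
}

// maybeDeallocate frees the resource once it is no longer in use. It returns
// true when the driver also may remove its finalizer from the claim.
func maybeDeallocate(status *ResourceClaimStatus, deallocate func(AllocationResult) error) (bool, error) {
	if !status.DeallocationRequested || status.Allocation == nil {
		return false, nil
	}
	if len(status.ReservedFor) > 0 {
		// Still in use. No new users may be added while
		// DeallocationRequested is set, so this will drain eventually.
		return false, nil
	}
	if err := deallocate(*status.Allocation); err != nil {
		return false, err
	}
	status.Allocation = nil
	status.DeallocationRequested = false
	return true, nil
}
```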
@@ -1336,13 +1307,9 @@ type AllocationResult struct {
     SharedResource bool
 }

-// PodScheduling objects get created by a scheduler when it needs
-// information from resource driver(s) while scheduling a Pod that uses
-// one or more unallocated ResourceClaims with delayed allocation.
-//
-// Alternatively, a scheduler extender might be configured for the
-// resource driver(s). If all drivers have one, this object is not
-// needed.
+// PodScheduling objects get created by a scheduler when it handles
+// a pod which uses one or more unallocated ResourceClaims with delayed
+// allocation.
 type PodScheduling struct {
     metav1.TypeMeta

@@ -1351,41 +1318,100 @@ type PodScheduling struct {
     // to ensure that the PodScheduling object gets deleted
     // when no longer needed. Normally the scheduler will delete it.
     //
+    // Drivers must ignore PodScheduling objects where the owning
+    // pod already got deleted because such objects are orphaned
+    // and will be removed soon.
+    //
     // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
     metav1.ObjectMeta

+    // Spec is set and updated by the scheduler.
+    Spec PodSchedulingSpec
+
+    // Status is updated by resource drivers.
+    Status PodSchedulingStatus
+}
+
+// PodSchedulingSpec contains the request for information about
+// resources required by a pod and eventually communicates
+// the decision of the scheduler to move ahead with pod scheduling
+// for a specific node.
+type PodSchedulingSpec struct {
+    // When allocation is delayed, the scheduler must set
+    // the node for which it wants the resource(s) to be allocated
+    // before the driver(s) start with allocation.
+    //
+    // The driver must ensure that the allocated resource
+    // is available on this node or update ResourceClaimSchedulingStatus.UnsuitableNodes
+    // to indicate where allocation might succeed.
+    //
+    // When allocation succeeds, drivers should immediately add
+    // the pod to the ResourceClaimStatus.ReservedFor field
+    // together with setting ResourceClaimStatus.Allocation. This
+    // optimization may save scheduling attempts and roundtrips
+    // through the API server because the scheduler does not
+    // need to reserve the claim for the pod itself.
+    //
+    // The selected node may change over time, for example
+    // when the initial choice turns out to be unsuitable
+    // after all. Drivers must not reallocate for a different
+    // node when they see such a change because it would
+    // lead to race conditions. Instead, the scheduler
+    // will trigger deallocation of specific claims as
+    // needed through the ResourceClaimStatus.DeallocationRequested
+    // field.
+    SelectedNode string
+
     // When allocation is delayed, and the scheduler needs to
     // decide on which node a Pod should run, it will
     // ask the driver(s) on which nodes the resource might be
     // made available. To trigger that check, the scheduler
     // provides the names of nodes which might be suitable
     // for the Pod. Will be updated periodically until
-    // the claim is allocated.
+    // all resources are allocated.
     //
     // The ResourceClass.SuiteableNodes node selector can be
     // used to filter out nodes based on labels. This prevents
     // adding nodes here that the driver then would need to
     // reject through UnsuitableNodes.
+    //
+    // The size of this field is limited to 256. This is large
+    // enough for many clusters. Larger clusters may need more
+    // attempts to find a node that suits all pending resources.
     PotentialNodes []string
+}

+// PodSchedulingStatus is where resource drivers provide
+// information about where they could allocate a resource
+// and whether allocation failed.
+type PodSchedulingStatus struct {
     // Each resource driver is responsible for providing information about
-    // those claims in the Pod that the driver manages. It can skip
-    // adding that information when it already allocated the claim.
+    // those resources in the Pod that the driver manages. It can skip
+    // adding that information when it already allocated the resource.
+    //
+    // A driver must add entries here for all its pending claims, even if
+    // the ResourceClaimSchedulingStatus.UnsuitableNodes field is empty,
+    // because the scheduler may decide to wait with selecting
+    // a node until it has information from all drivers.
     //
     // +listType=map
     // +listMapKey=podResourceClaimName
     // +optional
-    Claims []ResourceClaimScheduling
+    Claims []ResourceClaimSchedulingStatus
+
+    // If there ever is a need to support other kinds of resources
+    // than ResourceClaim, then new fields could get added here
+    // for those other resources.
 }

-// ResourceClaimScheduling contains information about one
-// particular claim in a pod while scheduling that pod.
-type ResourceClaimScheduling struct {
+// ResourceClaimSchedulingStatus contains information about one
+// particular claim while scheduling a pod.
+type ResourceClaimSchedulingStatus struct {
     // PodResourceClaimName matches the PodResourceClaim.Name field.
     PodResourceClaimName string

-    // A change of the PotentialNodes field in the PodScheduling object
-    // triggers a check in the driver
+    // A change of the PodSchedulingSpec.PotentialNodes field and/or a failed
+    // allocation attempt trigger a check in the driver
     // on which of those nodes the resource might be made available. It
     // then excludes nodes by listing those where that is not the case in
     // UnsuitableNodes.
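
A sketch of the scheduler-side selection implied by these types: only pick a
node once every pending claim has an entry in `PodSchedulingStatus.Claims` (or
only a single claim is pending), and skip nodes that any driver listed as
unsuitable. Function and parameter names are illustrative, and the types are
reduced to the fields used here.

```go
// Sketch of node selection across all pending claims of a pod.
package sketch

type ClaimSchedulingStatus struct {
	PodResourceClaimName string
	UnsuitableNodes      []string
}

// pickNode returns a node from potentialNodes that no driver ruled out, or
// "" if drivers have not reported on every pending claim yet or nothing fits.
func pickNode(potentialNodes []string, pendingClaims []string, statuses []ClaimSchedulingStatus) string {
	reported := map[string]bool{}
	unsuitable := map[string]bool{}
	for _, s := range statuses {
		reported[s.PodResourceClaimName] = true
		for _, node := range s.UnsuitableNodes {
			unsuitable[node] = true
		}
	}
	// With more than one pending claim, wait until all drivers have
	// provided information before committing to a node.
	if len(pendingClaims) > 1 {
		for _, claim := range pendingClaims {
			if !reported[claim] {
				return ""
			}
		}
	}
	for _, node := range potentialNodes {
		if !unsuitable[node] {
			return node
		}
	}
	return ""
}
```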
@@ -1566,28 +1592,6 @@ notices this, the current scheduling attempt for the pod must stop and the pod
 needs to be put back into the work queue. It then gets retried whenever a
 ResourceClaim gets added or modified.

-In addition, kube-scheduler can be configured to contact a resource driver
-directly as a scheduler extender. This can avoid the need to communicate the
-list of potential and unsuitable nodes through the apiserver:
-
-```
-type Extender struct {
-    ...
-    // ManagedResourceDrivers is a list of resource driver names that are managed
-    // by this extender. A pod will be sent to the extender on the Filter, Prioritize
-    // and Bind (if the extender is the binder) phases iff the pod requests at least
-    // one ResourceClaim for which the resource driver name in the corresponding
-    // ResourceClass is listed here. In addition, the builtin dynamic resources
-    // plugin will skip creation and updating of the PodScheduling object
-    // if all claims in the pod have an extender with the FilterVerb.
-    ManagedResourceDrivers []string
-```
-
-The existing extender plugin must check this field to decide when to contact
-the extender. It will get added to the most recent scheduler configuration API
-version as a feature-gated field. Not adding it to older versions is meant to
-encourage using the current API version.
-
 The following extension points are implemented in the new plugin:

 #### Pre-filter
@@ -1635,7 +1639,7 @@ conditions apply:
 Filter

 One of the ResourceClaims satisfying these criteria is picked randomly and deallocation
-is requested by setting the Deallocate field. The scheduler then needs to wait
+is requested by setting the ResourceClaimStatus.DeallocationRequested field. The scheduler then needs to wait
 for the resource driver to react to that change and deallocate the resource.

 This may make it possible to run the Pod
@@ -1656,30 +1660,26 @@ gets updated now if the field doesn't
 match the current list already. If no PodScheduling object exists yet,
 it gets created.

-When there is a scheduler extender configured for the claim, creating and
-updating PodScheduling objects gets skipped because the scheduler
-extender handles filtering.
-
 #### Reserve

 A node has been chosen for the Pod.

 If using delayed allocation and the resource has not been allocated yet,
-the SelectedNode field of the ResourceClaim
+the PodSchedulingSpec.SelectedNode field
 gets set here and the scheduling attempt gets stopped for now. It will be
-retried when the ResourceClaim status changes.
+retried when the ResourceClaim or PodScheduling statuses change.

 If all resources have been allocated already,
 the scheduler adds the Pod to the `claim.status.reservedFor` field of its ResourceClaims to ensure that
 no-one else gets to use those.

 If some resources are not allocated yet or reserving an allocated resource
 fails, the scheduling attempt needs to be aborted and retried at a later time
-or when the ResourceClaims change.
+or when the statuses change (same as above).

 #### Unreserve

-The scheduler removes the Pod from the ReservedFor field because it cannot be scheduled after
+The scheduler removes the Pod from the ResourceClaimStatus.ReservedFor field because it cannot be scheduled after