@@ -998,8 +998,8 @@ selector is static and typically will use labels that determine which nodes may
 have resources available.
 
 To gather information about the current state of resource availability and to
-trigger allocation of a claim, the
-scheduler creates a PodScheduling object. That object is owned by the pod and
+trigger allocation of a claim, the scheduler creates one PodScheduling object
+for each pod that uses claims. That object is owned by the pod and
 will either get deleted by the scheduler when it is done with pod scheduling or
 through the garbage collector. In the PodScheduling object, the scheduler posts
 the list of all potential nodes that it was left with after considering all
@@ -1042,7 +1042,7 @@ else changes in the system, like for example deleting objects.
 * if *delayed allocation and resource not allocated yet*:
   * if *at least one node fits pod*:
     * **scheduler** creates or updates a `PodScheduling` object with `podScheduling.spec.potentialNodes=<nodes that fit the pod>`
-    * if *exactly one claim is pending* or *all drivers have provided information*:
+    * if *exactly one claim is pending (see below)* or *all drivers have provided information*:
      * **scheduler** picks one node, sets `podScheduling.spec.selectedNode=<the chosen node>`
       * if *resource is available for this selected node*:
         * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress
@@ -1075,6 +1075,14 @@ else changes in the system, like for example deleting objects.
   * **resource driver** clears finalizer and `claim.status.allocation`
   * **API server** removes ResourceClaim
 
+When exactly one claim is pending, it is safe to trigger the allocation: if the
+node is suitable, the allocation will succeed and the pod can get scheduled
+without further delays. If the node is not suitable, allocation fails and the
+next attempt can do better because it has more information. The same should not
+be done when there are multiple claims because allocation might succeed for
+some, but not all of them, which would force the scheduler to recover by asking
+for deallocation. It's better to wait for information in this case.
+
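Expressed as code, that rule could look roughly like the Go sketch below. The `claimStatus` type and `shouldSelectNode` function are simplified, illustrative stand-ins for what the scheduler tracks per claim, not part of the proposed API:

```go
package dynamicresources

// claimStatus is a simplified stand-in for what the scheduler knows about one
// ResourceClaim of a pod: whether it is allocated and whether the driver has
// posted UnsuitableNodes for the current PotentialNodes.
type claimStatus struct {
	allocated              bool
	unsuitableNodesUpdated bool
}

// shouldSelectNode reports whether it is safe to set
// podScheduling.spec.selectedNode now: either exactly one claim is pending,
// or all drivers have provided information for the pending claims.
func shouldSelectNode(claims []claimStatus) bool {
	pending := 0
	allInformed := true
	for _, c := range claims {
		if c.allocated {
			continue
		}
		pending++
		if !c.unsuitableNodesUpdated {
			allInformed = false
		}
	}
	return pending == 1 || (pending > 1 && allInformed)
}
```
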
 The flow is similar for a ResourceClaim that gets created as a stand-alone
 object by the user. In that case, the Pod references that ResourceClaim by
 name. The ResourceClaim does not get deleted at the end and can be reused by
@@ -1373,9 +1381,10 @@ type PodSchedulingSpec {
 	// adding nodes here that the driver then would need to
 	// reject through UnsuitableNodes.
 	//
-	// The size of this field is limited to 256. This is large
-	// enough for many clusters. Larger clusters may need more
-	// attempts to find a node that suits all pending resources.
+	// The size of this field is limited to 256 (=
+	// [PodSchedulingNodeListMaxSize]). This is large enough for many
+	// clusters. Larger clusters may need more attempts to find a node that
+	// suits all pending resources.
 	PotentialNodes []string
 }
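As a rough illustration of how a resource driver might react to PotentialNodes (and to the UnsuitableNodes contract documented below), here is a Go sketch with simplified stand-in types. The `nodeIsSuitable` callback is a placeholder for the driver's own suitability check and not part of the proposal:

```go
package driver

// PodSchedulingNodeListMaxSize mirrors the API limit on the node lists.
const PodSchedulingNodeListMaxSize = 256

// updateUnsuitableNodes recomputes the UnsuitableNodes list for one claim.
// potentialNodes comes from PodSchedulingSpec.PotentialNodes, priorUnsuitable
// from the claim's previous ResourceClaimSchedulingStatus.UnsuitableNodes.
func updateUnsuitableNodes(potentialNodes, priorUnsuitable []string, nodeIsSuitable func(node string) bool) []string {
	// The prior entries must be rechecked, too: the scheduler will not list
	// them again in PotentialNodes, but they might still be unsuitable.
	candidates := append(append([]string{}, potentialNodes...), priorUnsuitable...)
	seen := map[string]bool{}
	var unsuitable []string
	for _, node := range candidates {
		if seen[node] {
			continue
		}
		seen[node] = true
		// Stay within the API limit on the list size.
		if !nodeIsSuitable(node) && len(unsuitable) < PodSchedulingNodeListMaxSize {
			unsuitable = append(unsuitable, node)
		}
	}
	return unsuitable
}
```
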
@@ -1408,23 +1417,36 @@ type ResourceClaimSchedulingStatus struct {
 	// PodResourceClaimName matches the PodResourceClaim.Name field.
 	PodResourceClaimName string
 
-	// A change of the PodSchedulingSpec.PotentialNodes field and/or a failed
-	// allocation attempt trigger a check in the driver
-	// on which of those nodes the resource might be made available. It
-	// then excludes nodes by listing those where that is not the case in
-	// UnsuitableNodes.
-	//
+	// UnsuitableNodes lists nodes that the claim cannot be allocated for.
 	// Nodes listed here will be ignored by the scheduler when selecting a
 	// node for a Pod. All other nodes are potential candidates, either
 	// because no information is available yet or because allocation might
 	// succeed.
 	//
-	// This can change, so the driver must refresh this information
+	// A change of the PodSchedulingSpec.PotentialNodes field and/or a failed
+	// allocation attempt trigger an update of this field: the driver
+	// then checks all nodes listed in PotentialNodes and UnsuitableNodes
+	// and updates UnsuitableNodes.
+	//
+	// It must include the prior UnsuitableNodes in this check because the
+	// scheduler will not list those again in PotentialNodes but they might
+	// still be unsuitable.
+	//
+	// This can change, so the driver also must refresh this information
 	// periodically and/or after changing resource allocation for some
 	// other ResourceClaim until a node gets selected by the scheduler.
+	//
+	// The size of this field is limited to 256 (=
+	// [PodSchedulingNodeListMaxSize]), the same as for
+	// PodSchedulingSpec.PotentialNodes.
 	UnsuitableNodes []string
 }
 
+// PodSchedulingNodeListMaxSize defines the maximum number of entries in the
+// node lists that are stored in PodScheduling objects. This limit is part
+// of the API.
+const PodSchedulingNodeListMaxSize = 256
+
 type PodSpec {
 	...
 	// ResourceClaims defines which ResourceClaims must be allocated
@@ -1657,32 +1679,54 @@ might attempt to improve this.
 #### Pre-score
 
 This is passed a list of nodes that have passed filtering by the resource
-plugin and the other plugins. The PodScheduling.PotentialNodes field
-gets updated now if the field doesn't
-match the current list already. If no PodScheduling object exists yet,
-it gets created.
+plugin and the other plugins. That list is stored by the plugin and will
+be copied to PodSchedulingSpec.PotentialNodes when the plugin creates or updates
+the object in Reserve.
+
+Pre-score is not called when there is only a single potential node. In that
+case Reserve will store the selected node in PodSchedulingSpec.PotentialNodes.
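One possible shape for this, assuming the kube-scheduler framework's PreScore and CycleState interfaces; the plugin struct, the state key, and the `stateData` type are illustrative only, not part of the proposal:

```go
package dynamicresources

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// dynamicResources is a placeholder for the resource plugin struct.
type dynamicResources struct{}

// potentialNodesStateKey is an illustrative key under which the plugin stores
// the filtered node list for the current scheduling cycle.
const potentialNodesStateKey framework.StateKey = "dynamicresources/potentialNodes"

// stateData holds the node names that survived filtering; Reserve copies them
// into PodSchedulingSpec.PotentialNodes when it creates or updates the object.
type stateData struct {
	potentialNodes []string
}

// Clone is required by framework.StateData; the data is written once per
// scheduling cycle, so returning the same pointer is sufficient for a sketch.
func (d *stateData) Clone() framework.StateData { return d }

// PreScore stores the filtered nodes so that Reserve can use them later.
func (pl *dynamicResources) PreScore(ctx context.Context, cs *framework.CycleState, pod *v1.Pod, nodes []*v1.Node) *framework.Status {
	names := make([]string, 0, len(nodes))
	for _, node := range nodes {
		names = append(names, node.Name)
	}
	cs.Write(potentialNodesStateKey, &stateData{potentialNodes: names})
	return nil
}
```
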
 
 #### Reserve
 
 A node has been chosen for the Pod.
 
-If using delayed allocation and the resource has not been allocated yet,
-the PodSchedulingSpec.SelectedNode field
-gets set here and the scheduling attempt gets stopped for now. It will be
-retried when the ResourceClaim or PodScheduling statuses change.
+If using delayed allocation and one or more claims have not been allocated yet,
+the plugin now needs to decide whether it wants to trigger allocation by
+setting the PodSchedulingSpec.SelectedNode field. For a single unallocated
+claim that is safe even if no information about unsuitable nodes is available
+because the allocation will either succeed or fail. For multiple such claims
+allocation only gets triggered when that information is available, to minimize
+the risk of getting only some but not all claims allocated. In both cases the
+PodScheduling object gets created or updated as needed. This is also where the
+PodSchedulingSpec.PotentialNodes field gets set.
 
 If all resources have been allocated already,
-the scheduler adds the Pod to the `claim.status.reservedFor` field of its ResourceClaims to ensure that
-no-one else gets to use those.
+the scheduler ensures that the Pod is listed in the `claim.status.reservedFor` field
+of its ResourceClaims. The driver can and should already have added
+the Pod when specifically allocating the claim for it, so it may
+be possible to skip this update.
 
 If some resources are not allocated yet or reserving an allocated resource
 fails, the scheduling attempt needs to be aborted and retried at a later time
-or when the statuses change (same as above).
+or when the statuses change.
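The reservation step might look roughly like the sketch below, with a simplified stand-in for the reservedFor entries instead of the real API types:

```go
package dynamicresources

// reservedForEntry is a simplified stand-in for the entries in
// claim.status.reservedFor (in the real API these are object references).
type reservedForEntry struct {
	Name string
	UID  string
}

// ensureReserved makes sure that the pod is listed in reservedFor. It returns
// the updated list and whether a status update is needed at all; the update
// can be skipped when the driver already added the pod while allocating the
// claim specifically for it.
func ensureReserved(reservedFor []reservedForEntry, podName, podUID string) ([]reservedForEntry, bool) {
	for _, entry := range reservedFor {
		if entry.UID == podUID {
			return reservedFor, false // already reserved, nothing to do
		}
	}
	return append(reservedFor, reservedForEntry{Name: podName, UID: podUID}), true
}
```
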
 
 #### Unreserve
 
-The scheduler removes the Pod from the ResourceClaimStatus.ReservedFor field because it cannot be scheduled after
-all.
+The scheduler removes the Pod from the ResourceClaimStatus.ReservedFor field
+because it cannot be scheduled after all.
+
+This is necessary to prevent a deadlock: suppose there are two stand-alone
+claims that can only be used by one pod at a time and two pods which both
+reference them. Both pods will get scheduled independently, perhaps even by
+different schedulers. When each pod manages to allocate and reserve one claim,
+then neither of them can get scheduled because they cannot reserve the other
+claim.
+
+Giving up the reservations in Unreserve means that the next pod scheduling
+attempts have a chance to succeed. It's non-deterministic which pod will win,
+but eventually one of them will. Not giving up the reservations would lead to a
+permanent deadlock that somehow would have to be detected and resolved to make
+progress.
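A matching sketch for Unreserve, again with the same simplified stand-in type as in the Reserve sketch:

```go
package dynamicresources

// dropReservation removes the pod from a claim's reservedFor list so that
// other pods referencing the same claim get a chance to reserve it. It
// returns the shortened list and whether a status update is required.
func dropReservation(reservedFor []reservedForEntry, podUID string) ([]reservedForEntry, bool) {
	var result []reservedForEntry
	changed := false
	for _, entry := range reservedFor {
		if entry.UID == podUID {
			changed = true
			continue
		}
		result = append(result, entry)
	}
	return result, changed
}
```
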
 
 ### Cluster Autoscaler