
Commit a705dc7

dynamic resource allocation: API constant and scheduler update

This is in response to kubernetes#3502 (review).

1 parent: 586e8dc

1 file changed: keps/sig-node/3063-dynamic-resource-allocation/README.md (70 additions, 26 deletions)
@@ -998,8 +998,8 @@ selector is static and typically will use labels that determine which nodes may
 have resources available.
 
 To gather information about the current state of resource availability and to
-trigger allocation of a claim, the
-scheduler creates a PodScheduling object. That object is owned by the pod and
+trigger allocation of a claim, the scheduler creates one PodScheduling object
+for each pod that uses claims. That object is owned by the pod and
 will either get deleted by the scheduler when it is done with pod scheduling or
 through the garbage collector. In the PodScheduling object, the scheduler posts
 the list of all potential nodes that it was left with after considering all
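
The per-pod relationship described in this hunk (one PodScheduling object per pod that uses claims, owned by the pod, with the scheduler posting the candidate nodes) can be sketched with simplified Go types. The struct and helper below are illustrative stand-ins, not the real generated API types or client code:

```go
package main

import "fmt"

// Simplified stand-ins for the objects discussed in the KEP text.
// Field names follow the prose; they are not the real API types.
type OwnerReference struct {
    Kind string
    Name string
}

type PodScheduling struct {
    Name            string
    OwnerReferences []OwnerReference // owned by the pod -> garbage collected with it
    PotentialNodes  []string         // filled in by the scheduler
    SelectedNode    string           // set once the scheduler picks a node
}

// newPodSchedulingForPod mirrors what the scheduler does: it creates one
// PodScheduling object for a pod that uses claims, owned by that pod, and
// posts the list of nodes that survived filtering.
func newPodSchedulingForPod(podName string, potentialNodes []string) PodScheduling {
    return PodScheduling{
        Name:            podName, // same name as the pod
        OwnerReferences: []OwnerReference{{Kind: "Pod", Name: podName}},
        PotentialNodes:  potentialNodes,
    }
}

func main() {
    ps := newPodSchedulingForPod("my-pod", []string{"node-a", "node-b"})
    fmt.Printf("%+v\n", ps)
}
```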
@@ -1042,7 +1042,7 @@ else changes in the system, like for example deleting objects.
 * if *delayed allocation and resource not allocated yet*:
   * if *at least one node fits pod*:
     * **scheduler** creates or updates a `PodScheduling` object with `podScheduling.spec.potentialNodes=<nodes that fit the pod>`
-    * if *exactly one claim is pending* or *all drivers have provided information*:
+    * if *exactly one claim is pending (see below)* or *all drivers have provided information*:
      * **scheduler** picks one node, sets `podScheduling.spec.selectedNode=<the chosen node>`
      * if *resource is available for this selected node*:
        * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress
@@ -1075,6 +1075,14 @@ else changes in the system, like for example deleting objects.
 * **resource driver** clears finalizer and `claim.status.allocation`
 * **API server** removes ResourceClaim
 
+When exactly one claim is pending, it is safe to trigger the allocation: if the
+node is suitable, the allocation will succeed and the pod can get scheduled
+without further delays. If the node is not suitable, allocation fails and the
+next attempt can do better because it has more information. The same should not
+be done when there are multiple claims because allocation might succeed for
+some, but not all of them, which would force the scheduler to recover by asking
+for deallocation. It's better to wait for information in this case.
+
 The flow is similar for a ResourceClaim that gets created as a stand-alone
 object by the user. In that case, the Pod references that ResourceClaim by
 name. The ResourceClaim does not get deleted at the end and can be reused by
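
The rule motivated in the new paragraph above can be condensed into a small predicate. This is only a sketch of the decision logic; the function and parameter names are hypothetical and do not come from the scheduler plugin:

```go
package example

// shouldTriggerAllocation sketches the condition in the flow above: with
// exactly one pending claim the scheduler may set selectedNode right away
// (allocation either succeeds or fails cleanly), otherwise it waits until
// every pending claim has UnsuitableNodes information, so that a partial
// allocation followed by deallocation is unlikely.
func shouldTriggerAllocation(pendingClaims, claimsWithNodeInfo int) bool {
    if pendingClaims == 1 {
        return true
    }
    return claimsWithNodeInfo == pendingClaims
}
```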
@@ -1373,9 +1381,10 @@ type PodSchedulingSpec {
     // adding nodes here that the driver then would need to
     // reject through UnsuitableNodes.
     //
-    // The size of this field is limited to 256. This is large
-    // enough for many clusters. Larger clusters may need more
-    // attempts to find a node that suits all pending resources.
+    // The size of this field is limited to 256 (=
+    // [PodSchedulingNodeListMaxSize]). This is large enough for many
+    // clusters. Larger clusters may need more attempts to find a node that
+    // suits all pending resources.
     PotentialNodes []string
 }
 
@@ -1408,23 +1417,36 @@ type ResourceClaimSchedulingStatus struct {
     // PodResourceClaimName matches the PodResourceClaim.Name field.
     PodResourceClaimName string
 
-    // A change of the PodSchedulingSpec.PotentialNodes field and/or a failed
-    // allocation attempt trigger a check in the driver
-    // on which of those nodes the resource might be made available. It
-    // then excludes nodes by listing those where that is not the case in
-    // UnsuitableNodes.
-    //
+    // UnsuitableNodes lists nodes that the claim cannot be allocated for.
     // Nodes listed here will be ignored by the scheduler when selecting a
     // node for a Pod. All other nodes are potential candidates, either
     // because no information is available yet or because allocation might
     // succeed.
     //
-    // This can change, so the driver must refresh this information
+    // A change of the PodSchedulingSpec.PotentialNodes field and/or a failed
+    // allocation attempt trigger an update of this field: the driver
+    // then checks all nodes listed in PotentialNodes and UnsuitableNodes
+    // and updates UnsuitableNodes.
+    //
+    // It must include the prior UnsuitableNodes in this check because the
+    // scheduler will not list those again in PotentialNodes but they might
+    // still be unsuitable.
+    //
+    // This can change, so the driver also must refresh this information
     // periodically and/or after changing resource allocation for some
     // other ResourceClaim until a node gets selected by the scheduler.
+    //
+    // The size of this field is limited to 256 (=
+    // [PodSchedulingNodeListMaxSize]), the same as for
+    // PodSchedulingSpec.PotentialNodes.
     UnsuitableNodes []string
 }
 
+// PodSchedulingNodeListMaxSize defines the maximum number of entries in the
+// node lists that are stored in PodScheduling objects. This limit is part
+// of the API.
+const PodSchedulingNodeListMaxSize = 256
+
 type PodSpec {
     ...
     // ResourceClaims defines which ResourceClaims must be allocated
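
The field comment above describes how a resource driver maintains UnsuitableNodes: it re-checks both PotentialNodes and the previous UnsuitableNodes, and keeps the result within PodSchedulingNodeListMaxSize. Below is a minimal Go sketch of that update, with `isSuitable` standing in for the driver-specific check; the helper name is an assumption for illustration, not driver code:

```go
package example

// PodSchedulingNodeListMaxSize mirrors the API constant introduced above.
const PodSchedulingNodeListMaxSize = 256

// updateUnsuitableNodes sketches the driver-side update described in the
// field comment: it re-checks every node in potentialNodes *and* in the
// previous unsuitable list (the scheduler will not repeat known-unsuitable
// nodes in PotentialNodes), keeps those that are still unsuitable, and caps
// the result at the API limit.
func updateUnsuitableNodes(potentialNodes, priorUnsuitable []string, isSuitable func(node string) bool) []string {
    seen := make(map[string]bool)
    var unsuitable []string
    candidates := append(append([]string{}, potentialNodes...), priorUnsuitable...)
    for _, node := range candidates {
        if seen[node] {
            continue
        }
        seen[node] = true
        if !isSuitable(node) {
            unsuitable = append(unsuitable, node)
        }
    }
    if len(unsuitable) > PodSchedulingNodeListMaxSize {
        unsuitable = unsuitable[:PodSchedulingNodeListMaxSize]
    }
    return unsuitable
}
```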
@@ -1657,32 +1679,54 @@ might attempt to improve this.
 #### Pre-score
 
 This is passed a list of nodes that have passed filtering by the resource
-plugin and the other plugins. The PodScheduling.PotentialNodes field
-gets updated now if the field doesn't
-match the current list already. If no PodScheduling object exists yet,
-it gets created.
+plugin and the other plugins. That list is stored by the plugin and will
+be copied to PodSchedulingSpec.PotentialNodes when the plugin creates or updates
+the object in Reserve.
+
+Pre-score is not called when there is only a single potential node. In that
+case Reserve will store the selected node in PodSchedulingSpec.PotentialNodes.
 
 #### Reserve
 
 A node has been chosen for the Pod.
 
-If using delayed allocation and the resource has not been allocated yet,
-the PodSchedulingSpec.SelectedNode field
-gets set here and the scheduling attempt gets stopped for now. It will be
-retried when the ResourceClaim or PodScheduling statuses change.
+If using delayed allocation and one or more claims have not been allocated yet,
+the plugin now needs to decide whether it wants to trigger allocation by
+setting the PodSchedulingSpec.SelectedNode field. For a single unallocated
+claim that is safe even if no information about unsuitable nodes is available
+because the allocation will either succeed or fail. For multiple such claims
+allocation only gets triggered when that information is available, to minimize
+the risk of getting only some but not all claims allocated. In both cases the
+PodScheduling object gets created or updated as needed. This is also where the
+PodSchedulingSpec.PotentialNodes field gets set.
 
 If all resources have been allocated already,
-the scheduler adds the Pod to the `claim.status.reservedFor` field of its ResourceClaims to ensure that
-no-one else gets to use those.
+the scheduler ensures that the Pod is listed in the `claim.status.reservedFor` field
+of its ResourceClaims. The driver can and should already have added
+the Pod when specifically allocating the claim for it, so it may
+be possible to skip this update.
 
 If some resources are not allocated yet or reserving an allocated resource
 fails, the scheduling attempt needs to be aborted and retried at a later time
-or when the statuses change (same as above).
+or when the statuses change.
 
 #### Unreserve
 
-The scheduler removes the Pod from the ResourceClaimStatus.ReservedFor field because it cannot be scheduled after
-all.
+The scheduler removes the Pod from the ResourceClaimStatus.ReservedFor field
+because it cannot be scheduled after all.
+
+This is necessary to prevent a deadlock: suppose there are two stand-alone
+claims that only can be used by one pod at a time and two pods which both
+reference them. Both pods will get scheduled independently, perhaps even by
+different schedulers. When each pod manages to allocate and reserve one claim,
+then neither of them can get scheduled because they cannot reserve the other
+claim.
+
+Giving up the reservations in Unreserve means that the next pod scheduling
+attempts have a chance to succeed. It's non-deterministic which pod will win,
+but eventually one of them will. Not giving up the reservations would lead to a
+permanent deadlock that somehow would have to be detected and resolved to make
+progress.
 
 ### Cluster Autoscaler
 
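Taken together, the Reserve and Unreserve descriptions above amount to the following control flow. The sketch uses simplified local types instead of the real scheduler framework and API types; it illustrates the decision logic described in the diff, it is not the plugin implementation:

```go
package example

// Simplified stand-ins for the objects referenced in Reserve/Unreserve.
type ResourceClaim struct {
    Allocated   bool
    ReservedFor []string // names of pods allowed to use the claim
}

type PodScheduling struct {
    PotentialNodes []string
    SelectedNode   string
}

// reserve sketches the Reserve step: if claims are still unallocated, decide
// whether to trigger allocation by setting SelectedNode (always safe for a
// single pending claim, otherwise only once unsuitable-node information is
// available); if everything is allocated, make sure the pod is listed in
// ReservedFor. It returns false when the scheduling attempt must stop and be
// retried later.
func reserve(pod, node string, claims []*ResourceClaim, ps *PodScheduling, haveUnsuitableInfo bool) bool {
    pending := 0
    for _, c := range claims {
        if !c.Allocated {
            pending++
        }
    }
    if pending > 0 {
        if pending == 1 || haveUnsuitableInfo {
            ps.SelectedNode = node // triggers allocation by the driver(s)
        }
        return false
    }
    for _, c := range claims {
        if !contains(c.ReservedFor, pod) {
            c.ReservedFor = append(c.ReservedFor, pod)
        }
    }
    return true
}

// unreserve sketches the Unreserve step: give up all reservations so that two
// pods competing for the same stand-alone claims cannot end up deadlocked with
// one claim reserved by each.
func unreserve(pod string, claims []*ResourceClaim) {
    for _, c := range claims {
        c.ReservedFor = removeString(c.ReservedFor, pod)
    }
}

func contains(list []string, s string) bool {
    for _, v := range list {
        if v == s {
            return true
        }
    }
    return false
}

func removeString(list []string, s string) []string {
    var out []string
    for _, v := range list {
        if v != s {
            out = append(out, v)
        }
    }
    return out
}
```

Dropping the reservation unconditionally in unreserve is what gives the competing pod a chance on its next attempt, matching the deadlock example in the diff.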