Commit fa3c5ce

DRA: document PreBind
kubernetes/kubernetes#121876 changed where the cluster gets updated with blocking API calls.
1 parent 2db47ba commit fa3c5ce

File tree

1 file changed

+35
-9
lines changed
  • keps/sig-node/3063-dynamic-resource-allocation

keps/sig-node/3063-dynamic-resource-allocation/README.md

Lines changed: 35 additions & 9 deletions
@@ -111,6 +111,7 @@ SIG Architecture for cross-cutting KEPs).
 - [Post-filter](#post-filter)
 - [Pre-score](#pre-score)
 - [Reserve](#reserve)
+- [PreBind](#prebind)
 - [Unreserve](#unreserve)
 - [Cluster Autoscaler](#cluster-autoscaler)
 - [kubelet](#kubelet)
@@ -150,7 +151,7 @@ SIG Architecture for cross-cutting KEPs).
 - [Improving scheduling performance](#improving-scheduling-performance)
 - [Optimize for network-attached resources](#optimize-for-network-attached-resources)
 - [Moving blocking API calls into goroutines](#moving-blocking-api-calls-into-goroutines)
-- [RPC calls instead of <code>PodSchedulingContext</code>](#rpc-calls-instead-of-)
+- [RPC calls instead of <code>PodSchedulingContext</code>](#rpc-calls-instead-of-podschedulingcontext)
 - [Infrastructure Needed](#infrastructure-needed)
 <!-- /toc -->
 
@@ -1818,13 +1819,11 @@ notices this, the current scheduling attempt for the pod must stop and the pod
 needs to be put back into the work queue. It then gets retried whenever a
 ResourceClaim gets added or modified.
 
-The following extension points are implemented in the new claim plugin. Some of
-them invoke API calls to create or update objects. This is done to simplify
-error handling: a failure during such a call puts the pod into the backoff
-queue where it will be retried after a timeout. The downside is that the
-latency caused by those blocking calls not only affects pods using claims, but
-also all other pending pods because the scheduler only schedules one pod at a
-time.
+The following extension points are implemented in the new claim plugin. Except
+for some unlikely edge cases (see below) there are no API calls during the main
+scheduling cycle. Instead, the plugin collects information and updates the
+cluster in the separate goroutine which invokes PreBind.
+
 
 #### EventsToRegister
 
@@ -1906,6 +1905,12 @@ At the moment, the claim plugin has no information that might enable it to
 prioritize which resource to deallocate first. Future extensions of this KEP
 might attempt to improve this.
 
+This is currently using blocking API calls. They are unlikely because this
+situation can only arise when there are multiple claims per pod and allocation
+for one of them fails despite all drivers agreeing that a node should be
+suitable, or when reusing a claim for multiple pods (not a common use case) and
+the original node became unusable for the next pod.
+
 #### Pre-score
 
 This is passed a list of nodes that have passed filtering by the claim
@@ -1936,9 +1941,21 @@ of its ResourceClaims. The driver can and should already have added
 the Pod when specifically allocating the claim for it, so it may
 be possible to skip this update.
 
+All the PodSchedulingContext and ResourceClaim updates are recorded in the
+plugin state. They will be written to the cluster during PreBind.
+
 If some resources are not allocated yet or reserving an allocated resource
 fails, the scheduling attempt needs to be aborted and retried at a later time
-or when the statuses change.
+or when the statuses change. The Reserve call itself never fails. If resources
+are not currently available, that information is recorded in the plugin state
+and will cause the PreBind call to fail instead.
+
+#### PreBind
+
+This is called in a separate goroutine. The plugin now checks all the
+information gathered earlier and updates the cluster accordingly. If some
+claims are not allocated or not reserved, PreBind fails and the pod must be
+retried.
 
 #### Unreserve
 
@@ -1958,6 +1975,13 @@ but eventually one of them will. Not giving up the reservations would lead to a
 permanent deadlock that somehow would have to be detected and resolved to make
 progress.
 
+Unreserve is called in two scenarios:
+- In the main goroutine when scheduling a pod has failed: in that case the plugin's
+  Reserve call hasn't actually changed the claim status yet, so there is nothing
+  that needs to be rolled back.
+- After binding has failed: this runs in a goroutine, so reverting the
+  `claim.status.reservedFor` with a blocking call is acceptable.
+
 ### Cluster Autoscaler
 
 When [Cluster
@@ -2439,6 +2463,8 @@ For beta:
 
 #### Alpha -> Beta Graduation
 
+- In normal scenarios, scheduling pods with claims must not block scheduling of
+  other pods by doing blocking API calls
 - Implement integration with Cluster Autoscaler through numeric parameters
 - Gather feedback from developers and surveys
 - Positive acknowledgment from 3 would-be implementors of a resource driver,
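
The division of work that this change documents (Reserve only records what needs to happen, PreBind performs the writes, Unreserve rolls back only what was actually written) can be illustrated with a small Go sketch. This is not the scheduler plugin's code: `claim`, `podState`, `fakeCluster` and the three functions are hypothetical names made up for this example, and the in-memory `fakeCluster` merely stands in for blocking API server calls. The choice to skip reservation writes while some claim is still unallocated is also an assumption of the sketch.

```go
package main

import "fmt"

// claim is a hypothetical, simplified view of a ResourceClaim as seen by the plugin.
type claim struct {
	name      string
	allocated bool // the driver has allocated the claim
	reserved  bool // status.reservedFor already lists the pod
}

// podState is what Reserve records for later use (hypothetical plugin state).
type podState struct {
	toReserve   []string // allocated claims that still need a reservedFor update
	unavailable []string // claims that are not allocated yet
	written     []string // reservations that PreBind actually wrote
}

// fakeCluster stands in for the API server; changing it models a blocking API call.
type fakeCluster struct {
	reservedFor    map[string]bool
	contextUpdates int // counts simulated PodSchedulingContext updates
}

// reserve runs in the main scheduling cycle: it touches no cluster state and never fails.
func reserve(claims []claim) *podState {
	s := &podState{}
	for _, c := range claims {
		switch {
		case !c.allocated:
			s.unavailable = append(s.unavailable, c.name)
		case !c.reserved:
			s.toReserve = append(s.toReserve, c.name)
		}
	}
	return s
}

// preBind runs in a separate goroutine in the real scheduler, so writes are acceptable here.
// It fails if Reserve recorded that some claim is not available.
func preBind(c *fakeCluster, s *podState) error {
	if len(s.unavailable) > 0 {
		c.contextUpdates++ // assumption: ask drivers to allocate, then retry the pod
		return fmt.Errorf("claims not allocated yet: %v", s.unavailable)
	}
	for _, name := range s.toReserve {
		c.reservedFor[name] = true
		s.written = append(s.written, name)
	}
	return nil
}

// unreserve reverts only what preBind wrote. If it is called right after
// reserve (scheduling failed in the main goroutine), there is nothing to undo.
func unreserve(c *fakeCluster, s *podState) {
	for _, name := range s.written {
		delete(c.reservedFor, name)
	}
	s.written = nil
}

func main() {
	cluster := &fakeCluster{reservedFor: map[string]bool{}}
	s := reserve([]claim{{name: "gpu", allocated: true}})
	fmt.Println("reservations after Reserve:", len(cluster.reservedFor))
	fmt.Println("PreBind error:", preBind(cluster, s))
	fmt.Println("gpu reserved after PreBind:", cluster.reservedFor["gpu"])
	unreserve(cluster, s)
	fmt.Println("gpu reserved after Unreserve:", cluster.reservedFor["gpu"])
}
```

Keeping all writes out of `reserve` is what lets the main scheduling cycle move on to the next pod without waiting for the API server; the cost is that failures surface later, in `preBind`.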
