@@ -111,6 +111,7 @@ SIG Architecture for cross-cutting KEPs).
- [Post-filter](#post-filter)
- [Pre-score](#pre-score)
- [Reserve](#reserve)
+ - [PreBind](#prebind)
- [Unreserve](#unreserve)
- [Cluster Autoscaler](#cluster-autoscaler)
- [kubelet](#kubelet)
@@ -150,7 +151,7 @@ SIG Architecture for cross-cutting KEPs).
- [Improving scheduling performance](#improving-scheduling-performance)
- [Optimize for network-attached resources](#optimize-for-network-attached-resources)
- [Moving blocking API calls into goroutines](#moving-blocking-api-calls-into-goroutines)
- - [RPC calls instead of <code>PodSchedulingContext</code>](#rpc-calls-instead-of-)
+ - [RPC calls instead of <code>PodSchedulingContext</code>](#rpc-calls-instead-of-podschedulingcontext)
- [Infrastructure Needed](#infrastructure-needed)
<!-- /toc -->
@@ -1818,13 +1819,11 @@ notices this, the current scheduling attempt for the pod must stop and the pod
needs to be put back into the work queue. It then gets retried whenever a
ResourceClaim gets added or modified.

- The following extension points are implemented in the new claim plugin. Some of
- them invoke API calls to create or update objects. This is done to simplify
- error handling: a failure during such a call puts the pod into the backoff
- queue where it will be retried after a timeout. The downside is that the
- latency caused by those blocking calls not only affects pods using claims, but
- also all other pending pods because the scheduler only schedules one pod at a
- time.
+ The following extension points are implemented in the new claim plugin. Except
+ for some unlikely edge cases (see below), there are no API calls during the main
+ scheduling cycle. Instead, the plugin collects information and updates the
+ cluster in the separate goroutine that invokes PreBind.
+

#### EventsToRegister
@@ -1906,6 +1905,12 @@ At the moment, the claim plugin has no information that might enable it to
prioritize which resource to deallocate first. Future extensions of this KEP
might attempt to improve this.

+ Deallocation currently uses blocking API calls. These are expected to be rare
+ because this situation can only arise when a pod has multiple claims and
+ allocation for one of them fails despite all drivers agreeing that a node should
+ be suitable, or when a claim is reused for multiple pods (not a common use case)
+ and the original node has become unusable for the next pod.
+
#### Pre-score

This is passed a list of nodes that have passed filtering by the claim
@@ -1936,9 +1941,21 @@ of its ResourceClaims. The driver can and should already have added
the Pod when specifically allocating the claim for it, so it may
be possible to skip this update.

+ All PodSchedulingContext and ResourceClaim updates are recorded in the
+ plugin state. They are written to the cluster during PreBind.
+
If some resources are not allocated yet or reserving an allocated resource
fails, the scheduling attempt needs to be aborted and retried at a later time
- or when the statuses change.
+ or when the statuses change. The Reserve call itself never fails. If resources
+ are not currently available, that information is recorded in the plugin state
+ and will cause the PreBind call to fail instead.
+
+ #### PreBind
+
+ PreBind is called in a separate goroutine. The plugin checks all the
+ information gathered earlier and updates the cluster accordingly. If some
+ claims are not allocated or not reserved, PreBind fails and the pod must be
+ retried.
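
The split between Reserve and PreBind described above can be illustrated with a minimal Go sketch. It does not use the real scheduler framework: `claim`, `pluginState`, `reserve`, `preBind`, and the `write` callback are hypothetical names chosen for illustration, and the choice to apply the recorded updates before reporting failure is an assumption, not something the KEP text specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// claim is a hypothetical, simplified view of a ResourceClaim.
type claim struct {
	name      string
	allocated bool
}

// pluginState is a hypothetical stand-in for the per-pod plugin state.
type pluginState struct {
	pending  []string // cluster writes deferred until PreBind
	notReady bool     // at least one claim is not allocated yet
}

var errNotReady = errors.New("claims not allocated or reserved, retry pod")

// reserve runs in the main scheduling cycle. It makes no API calls and
// never fails; it only records what preBind has to do later.
func reserve(s *pluginState, claims []claim) {
	for _, c := range claims {
		if c.allocated {
			s.pending = append(s.pending, "reserve "+c.name)
		} else {
			s.pending = append(s.pending, "request allocation of "+c.name)
			s.notReady = true
		}
	}
}

// preBind runs in a separate goroutine, so blocking writes are acceptable.
// write is a hypothetical callback standing in for the API client.
func preBind(s *pluginState, write func(update string) error) error {
	for _, u := range s.pending {
		if err := write(u); err != nil {
			return fmt.Errorf("applying %q: %w", u, err)
		}
	}
	if s.notReady {
		// Recorded during reserve; surfaces here so the pod gets retried.
		return errNotReady
	}
	return nil
}

func main() {
	s := &pluginState{}
	reserve(s, []claim{{"gpu", true}, {"nic", false}})
	err := preBind(s, func(u string) error {
		fmt.Println("write:", u)
		return nil
	})
	fmt.Println("PreBind result:", err)
}
```

In this sketch the main scheduling cycle only appends to an in-memory slice, so a pod with claims adds no API latency for other pending pods; all blocking work happens in `preBind`.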

#### Unreserve
@@ -1958,6 +1975,13 @@ but eventually one of them will. Not giving up the reservations would lead to a
permanent deadlock that somehow would have to be detected and resolved to make
progress.

+ Unreserve is called in two scenarios:
+ - In the main goroutine when scheduling a pod has failed: in that case the plugin's
+   Reserve call has not changed the claim status yet, so there is nothing
+   that needs to be rolled back.
+ - After binding has failed: this runs in a separate goroutine, so reverting
+   `claim.status.reservedFor` with a blocking call is acceptable.
+
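
The two Unreserve scenarios listed above can be sketched as follows. This is an illustrative outline only: `reservation`, `persisted`, `unreserve`, and the `revert` callback are hypothetical names, and tracking whether the reservation was written via a single boolean is an assumption made for brevity.

```go
package main

import "fmt"

// reservation is a hypothetical record of what the plugin did for one claim.
type reservation struct {
	persisted bool // true once reservedFor was written to the cluster
}

// unreserve rolls back only what actually reached the cluster.
// revert is a hypothetical callback that removes the pod from
// claim.status.reservedFor with a blocking API call.
func unreserve(r *reservation, revert func() error) error {
	if !r.persisted {
		// Scheduling failed before PreBind: Reserve only touched
		// in-memory state, so there is nothing to roll back.
		return nil
	}
	// Binding failed after the write: this path runs outside the main
	// scheduling cycle, so a blocking call is acceptable.
	if err := revert(); err != nil {
		return fmt.Errorf("reverting reservedFor: %w", err)
	}
	r.persisted = false
	return nil
}

func main() {
	calls := 0
	revert := func() error { calls++; return nil }

	_ = unreserve(&reservation{persisted: false}, revert)
	fmt.Println("API calls after failed scheduling:", calls)

	_ = unreserve(&reservation{persisted: true}, revert)
	fmt.Println("API calls after failed binding:", calls)
}
```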
### Cluster Autoscaler

When [Cluster
@@ -2439,6 +2463,8 @@ For beta:

#### Alpha -> Beta Graduation

+ - In normal scenarios, scheduling pods with claims must not block scheduling of
+   other pods by making blocking API calls
- Implement integration with Cluster Autoscaler through numeric parameters
- Gather feedback from developers and surveys
- Positive acknowledgment from 3 would-be implementors of a resource driver,