@@ -77,7 +77,6 @@ SIG Architecture for cross-cutting KEPs).
- [User Stories](#user-stories)
- [Cluster add-on development](#cluster-add-on-development)
- [Cluster configuration](#cluster-configuration)
- - [Integration with cluster autoscaling](#integration-with-cluster-autoscaling)
- [Partial GPU allocation](#partial-gpu-allocation)
- [Network-attached accelerator](#network-attached-accelerator)
- [Combined setup of different hardware functions](#combined-setup-of-different-hardware-functions)
@@ -114,10 +113,6 @@ SIG Architecture for cross-cutting KEPs).
- [Reserve](#reserve)
- [Unreserve](#unreserve)
- [Cluster Autoscaler](#cluster-autoscaler)
- - [Generic plugin enhancements](#generic-plugin-enhancements)
- - [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism)
- - [Handling claims without vendor code](#handling-claims-without-vendor-code)
- - [Building a custom Cluster Autoscaler binary](#building-a-custom-cluster-autoscaler-binary)
- [kubelet](#kubelet)
- [Managing resources](#managing-resources)
- [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin)
@@ -441,15 +436,6 @@ parametersRef:
  name: acme-gpu-init
```

- #### Integration with cluster autoscaling
-
- As a cloud provider, I want to support GPUs as part of a hosted Kubernetes
- environment, including cluster autoscaling. I ensure that the kernel is
- configured as required by the hardware and that the container runtime supports
- CDI. I review the Go code provided by the vendor for simulating cluster scaling
- and build it into a customized cluster autoscaler binary that supports my cloud
- infrastructure.
-
#### Partial GPU allocation

As a user, I want to use a GPU as an accelerator, but don't need exclusive access
@@ -676,8 +662,8 @@ allocation also may turn out to be insufficient. Some risks are:
- Network-attached resources may have additional constraints that are not
  captured yet (like limited number of nodes that they can be attached to).

- - Cluster autoscaling will not work as expected unless the autoscaler and
-   resource drivers get extended to support it.
+ - Cluster autoscaling will not work as expected unless the DRA driver
+   uses [numeric parameters](https://github.com/kubernetes/enhancements/issues/4381).

All of these risks will have to be evaluated by gathering feedback from users
and resource driver developers.
@@ -1974,252 +1960,21 @@ progress.

### Cluster Autoscaler

- <<[UNRESOLVED pohly]>>
- The entire autoscaler section is tentative. Key open questions:
- - Are DRA driver authors able and willing to provide implementations of
-   the simulation interface if needed for their driver?
- - Is the simulation interface generic enough to work across a variety
-   of autoscaler forks and/or implementations? What about Karpenter?
- - Is the suggested deployment approach (rebuild binary) workable?
- - Can we really not do something else, ideally RPC-based?
- <<[/UNRESOLVED]>>
-
When [Cluster
Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#cluster-autoscaler)
- encounters a pod that uses a resource claim for node-local resources, the autoscaler needs assistance from
- the resource driver for that claim to make the right decisions. Without that
- assistance, the autoscaler might scale up the wrong node group (resource is
- provided by nodes in another group) or not scale up (pod is pending because of
- a claim that cannot be allocated, but looks like it should be schedulable
- to the autoscaler).
-
- With the following changes, vendors can provide Go code in a package that can
- be built into a custom autoscaler binary to support correct scale-up
- simulations for clusters that use their hardware. Extensions for invoking such
- vendor code through some RPC mechanism, as a WASM plugin, or through some generic
- code which just needs to be parameterized for specific hardware could be added
- later in separate KEPs.
-
- Such vendor code is *not* needed for network-attached resources. Adding or
- removing nodes does not change the availability of such resources.
-
- The in-tree DRA scheduler plugin is still active. It handles the generic checks
- like "can this allocated claim be reserved for this pod" and only calls out to
- vendor code for decisions that only the vendor can handle, like
- "can this claim be allocated" and "what effect does allocating this claim have
- for the cluster".
-
- The underlying assumption is that vendors can determine the capabilities of
- nodes based on labels. Those labels get set by the autoscaler for simulated
- nodes, either by cloning some real node or through configuration during scale up
- from zero. When some vendor code then encounters a node which doesn't exist
- in the real cluster, it can determine what resources the vendor driver would
- be able to make available if the node were created for real.
-
- #### Generic plugin enhancements
-
- The changes in this section are independent of DRA. They could also be used to
- simulate volume provisioning better.
-
- At the start of a scale-up or scale-down cycle, the autoscaler takes a snapshot of
- the current cluster state. Then the autoscaler determines whether a real or
- fictional node fits a pod by calling the pre-filter and filter extension points
- of scheduler plugins. If a pod fits a node, the snapshot is updated by calling
- [NodeInfo.AddPod](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/types.go#L620-L623). This
- influences further checks for other pending pods. During scale down, eviction
- is simulated by
- [SimulateNodeRemoval](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L149),
- which [pretends that pods running on a node that is to be removed are not
- running](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L231-L237).
-
- The DRA scheduler plugin gets integrated into this snapshotting and simulated
- pod scheduling through a new scheduler framework interface:
-
- ```
- // ClusterAutoScalerPlugin is an interface that is used only by the cluster autoscaler.
- // It enables plugins to store state across different scheduling cycles.
- //
- // The usual call sequence of a plugin when used in the scheduler is:
- // - at program startup:
- //   - instantiate plugin
- //   - EventsToRegister
- // - for each new pod:
- //   - PreEnqueue
- // - for each pod that is ready to be scheduled, one pod at a time:
- //   - PreFilter, Filter, etc.
- //
- // Cluster autoscaler works a bit differently. It identifies all pending pods,
- // takes a snapshot of the current cluster state, and then simulates the effect
- // of scheduling those pods with additional nodes added to the cluster. To
- // determine whether a pod fits into one of these simulated nodes, it
- // uses the same PreFilter and Filter plugins as the scheduler. Other extension
- // points (Reserve, Bind) are not used. Plugins which modify the cluster state
- // therefore need a different way of recording the result of scheduling
- // a pod onto a node. This is done through ClusterAutoScalerPlugin.
- //
- // Cluster autoscaler will:
- // - at program startup:
- //   - instantiate plugin, with real informer factory and no Kubernetes client
- //   - start informers
- // - at the start of a simulation:
- //   - call StartSimulation with a clean cycle state
- // - for each pending pod:
- //   - call PreFilter and Filter with the same cycle state that
- //     was passed to StartSimulation
- //   - call SimulateBindPod with the same cycle state that
- //     was passed to StartSimulation (i.e. *not* the one which was modified
- //     by PreFilter or Filter) to indicate that a pod is being scheduled onto a node
- //     as part of the simulation
- //
- // A plugin may:
- // - Take a snapshot of all relevant cluster state as part of StartSimulation
- //   and store it in the cycle state. This signals to the other extension
- //   points that the plugin is being used as part of the cluster autoscaler.
- // - In PreFilter and Filter use the cluster snapshot to make decisions
- //   instead of the normal "live" cluster state.
- // - In SimulateBindPod update the snapshot in the cycle state.
- type ClusterAutoScalerPlugin interface {
-     Plugin
-     // StartSimulation is called when the cluster autoscaler begins
-     // a simulation.
-     StartSimulation(ctx context.Context, state *CycleState) *Status
-     // SimulateBindPod is called when the cluster autoscaler decides to schedule
-     // a pod onto a certain node.
-     SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
-     // SimulateEvictPod is called when the cluster autoscaler simulates removal
-     // of a node. All claims used only by this pod should be considered deallocated,
-     // to enable starting the same pod elsewhere.
-     SimulateEvictPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status
-     // NodeIsReady checks whether some real node has been initialized completely.
-     // Even if it is "ready" as far as Kubernetes is concerned, some DaemonSet pod
-     // might still be missing or not done with its startup yet.
-     NodeIsReady(ctx context.Context, node *v1.Node) (bool, error)
- }
- ```
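For illustration, the following sketch shows roughly how the in-tree DRA scheduler plugin could implement this interface. It is hypothetical: the state key, the `snapshotClaims` helper, and the method bodies are invented here and use the same `...` shorthand as the interface above.

```
// Hypothetical sketch only: state key, snapshot type, and helpers are
// invented for illustration, not part of the proposal.
const draStateKey StateKey = "DynamicResources-autoscaler"

func (pl *dynamicResources) StartSimulation(ctx context.Context, state *CycleState) *Status {
    // Snapshot all claims once. PreFilter and Filter detect this entry in
    // the cycle state and consult it instead of the live informer cache.
    state.Write(draStateKey, pl.snapshotClaims())
    return nil
}

func (pl *dynamicResources) SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status {
    // Allocate the pod's pending claims through vendor code, then record
    // the allocation and the reservation in the snapshot so that later
    // pods in the bin-packing simulation see the reduced capacity.
    ...
}
```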
-
- `NodeIsReady` is needed to solve one particular problem: when a new node first
- starts up, it may be ready to run pods, but the pod from a resource driver's
- DaemonSet may still be starting up. If the resource driver controller needs
- information from such a pod, then it will not be able to filter
- correctly. Similar to how extended resources are handled, the autoscaler then
- first needs to wait until the plugin also considers the node to be ready.
-
- #### DRA scheduler plugin extension mechanism
-
- The in-tree scheduler plugin gets extended by vendors through the following API
- in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code depends
- neither on the k/k/pkg/scheduler package nor on autoscaler packages.
-
- ```
- // Registry stores all known plugins which can simulate claim allocation.
- // It is thread-safe.
- var Registry registry
-
- // PluginName is a special type that is used to look up plugins for a claim.
- // For now it must be the same as the driver name in the resource class of a
- // claim.
- type PluginName string
-
- // Add adds or overwrites the plugin for a certain name.
- func (r *registry) Add(name PluginName, plugin Plugin) { ... }
-
- ...
-
- // Plugin is used to register a plugin.
- type Plugin interface {
-     // Activate will get called to prepare the plugin for usage.
-     Activate(ctx context.Context, client kubernetes.Interface, informerFactory informers.SharedInformerFactory) (ActivePlugin, error)
- }
-
- // ActivePlugin is a plugin which is ready to start a simulation.
- type ActivePlugin interface {
-     // Start will get called at the start of a simulation. The plugin must
-     // capture the current cluster state.
-     Start(ctx context.Context) (StartedPlugin, error)
-
-     // NodeIsReady checks whether some real node has been initialized completely.
-     NodeIsReady(ctx context.Context, node *v1.Node) (bool, error)
- }
-
- // StartedPlugin is a plugin which encapsulates a certain cluster state and
- // can make changes to it.
- type StartedPlugin interface {
-     // Clone must create a new, independent copy of the current state.
-     // This must be fast and cannot fail. If it has to do some long-running
-     // operation, then it must do that in a new goroutine and check the
-     // result when some method is called in the returned instance.
-     Clone() StartedPlugin
-
-     // NodeIsSuitable checks whether a claim could be allocated for
-     // a pod such that it will be available on the node.
-     NodeIsSuitable(ctx context.Context, pod *v1.Pod, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (bool, error)
-
-     // Allocate must adapt the cluster state as if the claim
-     // had been allocated for use on the selected node and return
-     // the result for the claim. It must not modify the claim;
-     // that will be done by the caller.
-     Allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (*resourcev1alpha2.AllocationResult, error)
-
-     // Deallocate must adapt the cluster state as if the claim
-     // had been deallocated. It must not modify the claim;
-     // that will be done by the caller.
-     Deallocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim) error
- }
- ```
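For illustration, a vendor implementation of this API could look roughly like the following sketch. All names are hypothetical (a GPU driver by the fictional vendor ACME), `Allocate` and `Deallocate` are omitted, and the capacity check relies on the node-label assumption described earlier.

```
// Hypothetical vendor-side sketch for the API above; all names invented.
type gpuPlugin struct{}

func (p gpuPlugin) Activate(ctx context.Context, client kubernetes.Interface, informerFactory informers.SharedInformerFactory) (ActivePlugin, error) {
    // Set up informers for the driver's own parameter objects here.
    return &activeGPUPlugin{}, nil
}

type activeGPUPlugin struct{}

func (p *activeGPUPlugin) Start(ctx context.Context) (StartedPlugin, error) {
    // Start from the driver's current view of per-node allocations
    // (empty here for brevity).
    return &startedGPUPlugin{allocated: map[string]int64{}}, nil
}

func (p *activeGPUPlugin) NodeIsReady(ctx context.Context, node *v1.Node) (bool, error) {
    // For example: ready once the driver's DaemonSet pod has reported in.
    ...
}

type startedGPUPlugin struct {
    allocated map[string]int64 // simulated GPU usage per node name
}

func (p *startedGPUPlugin) Clone() StartedPlugin {
    // Copy the map so that forked simulations stay independent.
    ...
}

func (p *startedGPUPlugin) NodeIsSuitable(ctx context.Context, pod *v1.Pod, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (bool, error) {
    // Capacity is derived from a node label (hypothetical name) that the
    // autoscaler also sets on simulated nodes.
    capacity := gpuCountFromLabels(node) // e.g. parses "gpu.acme.example.com/count"
    return p.allocated[node.Name] < capacity, nil
}
```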
-
- When the DRA scheduler plugin gets initialized, it activates all registered
- vendor plugins. When `StartSimulation` is called, all vendor plugins are
- started. When the scheduler plugin's state data is cloned, the plugins also
- get cloned. In addition, `StartSimulation` captures the state of all claims.
-
- `NodeIsSuitable` is called during the `Filter` check to determine whether a
- pending claim could be allocated for a node. `Allocate` is called as part of
- the `SimulateBindPod` implementation. The simulated allocation result is stored
- in the claim snapshot and then the claim is reserved for the pod. If the claim
- cannot be shared between pods, that will prevent other pods from using the
- claim while the autoscaler goes through its bin-packing simulation.
-
- Finally, `NodeIsReady` of each vendor plugin is called to implement the
- scheduler plugin's own `NodeIsReady`.
-
- #### Handling claims without vendor code
+ encounters a pod that uses a resource claim for node-local resources, it needs
+ to understand the parameters for the claim and the available capacity in order
+ to simulate the effect of allocating claims as part of scheduling and of
+ creating or removing nodes.

- When the DRA scheduler plugin does not have specific vendor code for a certain
- resource class, it falls back to the assumption that resources are unlimited,
- i.e. allocation will always work. This is how volume provisioning is currently
- handled during cluster autoscaling.
-
- If a pod is not getting scheduled because a resource claim cannot be allocated
- by the real DRA driver, to the autoscaler it will look like the pod should be
- schedulable, and therefore it will not spin up new nodes for it, which is the
- right decision.
-
- If a pod is not getting scheduled because some other resource requirement is
- not satisfied, the autoscaler will simulate a scale-up and can pick some
- arbitrary node pool, because the DRA scheduler plugin will accept all of those
- nodes.
-
- During scale down, moving a running pod to a different node is assumed to work,
- so that scenario is covered as well.
-
- #### Building a custom Cluster Autoscaler binary
-
- Vendors are encouraged to include an "init" package in their driver
- simulation implementation. That "init" package registers their plugin. Then to
- build a custom autoscaler binary, one additional file alongside `main.go` is
- sufficient:
-
- ```
- package main
-
- import (
-     _ "acme.example.com/dra-resource-driver/simulation-plugin/init"
- )
- ```
+ This is not possible with opaque parameters as described in this KEP. If a DRA
+ driver developer wants to support Cluster Autoscaler, they have to use numeric
+ parameters. Numeric parameters are an extension of this KEP that is defined in
+ [KEP #4381](https://github.com/kubernetes/enhancements/issues/4381).

- This init package may also register additional command-line flags. Care must be
- taken to not cause conflicts between different plugins, so all vendor flags
- should start with a unique prefix.
+ Numeric parameters are not necessary for network-attached resources because
+ adding or removing nodes doesn't change their availability, and thus Cluster
+ Autoscaler does not need to understand their parameters.
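As an illustration of why this matters for the autoscaler: once a claim's requests and a node's capacity are plain numbers, fit checking becomes vendor-neutral arithmetic that Cluster Autoscaler can evaluate during a simulated scale-up, without calling vendor code. The types below are invented for this sketch; the actual API is defined in KEP #4381, not here.

```
// Hypothetical sketch: with numeric parameters, claim fit is arithmetic
// that the autoscaler can do itself. Types and fields are illustrative.
type simulatedNode struct {
    availableDevices int64 // capacity the node would offer, from its node template
}

func claimsFit(requests []int64, node simulatedNode) bool {
    var total int64
    for _, r := range requests {
        total += r
    }
    return total <= node.availableDevices
}
```

With opaque parameters, only the vendor's driver controller can answer the same question, and a simulated scale-up has no running driver to ask.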

### kubelet

@@ -2684,7 +2439,7 @@ For beta:

#### Alpha -> Beta Graduation

- - Implement integration with Cluster Autoscaler
+ - Implement integration with Cluster Autoscaler through numeric parameters
- Gather feedback from developers and surveys
- Positive acknowledgment from 3 would-be implementors of a resource driver,
  from a diversity of companies or projects