The RPC mechanism is likely to have performance challenges. It is better to
focus on an extension mechanism for custom autoscaler binaries first. In
practice, this is likely to be what cloud providers are running anyway.
encounters a pod that uses a resource claim, the autoscaler needs assistance
from the resource driver for that claim to make the right decisions. Without
that assistance, the autoscaler might scale up the wrong node group (the
resource is provided by nodes in another group) or not scale up at all (the
pod is pending because of a claim that cannot be allocated, but looks
schedulable to the autoscaler).

With the following changes, vendors can provide Go code in a package that can
be built into a custom autoscaler binary to support correct scale-up
simulations for clusters that use their hardware. Extensions for invoking such
vendor code through some RPC mechanism, as a WASM plugin, or via some generic
code which just needs to be parameterized for specific hardware could be added
later in separate KEPs.

The in-tree DRA scheduler plugin is still active. It handles the generic checks
like "can this allocated claim be reserved for this pod" and only calls out to
vendor code when it comes to decisions that only the vendor can handle, like
"can this claim be allocated" and "what effect does allocating this claim have
for the cluster".

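To make that split concrete, the vendor callout could be modeled as a small Go interface. This is only a sketch under assumed names: `VendorDriver`, `Claim`, `CanAllocate`, and the `example.com/driver` label are hypothetical and not taken from any existing Kubernetes API.

```go
package main

import "fmt"

// Claim stands in for a resource claim; in the real code this would be the
// DRA ResourceClaim API type. Everything in this sketch is hypothetical.
type Claim struct {
	Name string
}

// VendorDriver is the kind of decision interface that only vendor code can
// implement: the generic DRA scheduler plugin handles checks like "can this
// allocated claim be reserved for this pod" itself and delegates the rest.
type VendorDriver interface {
	// CanAllocate answers "can this claim be allocated" for a node that is
	// identified by its labels (the node may only exist in a simulation).
	CanAllocate(claim Claim, nodeLabels map[string]string) bool
}

// fakeDriver allocates a claim only on nodes carrying a marker label.
type fakeDriver struct{}

func (fakeDriver) CanAllocate(claim Claim, nodeLabels map[string]string) bool {
	return nodeLabels["example.com/driver"] == "installed"
}

func main() {
	var driver VendorDriver = fakeDriver{}
	claim := Claim{Name: "gpu-claim"}
	fmt.Println(driver.CanAllocate(claim, map[string]string{"example.com/driver": "installed"})) // true
	fmt.Println(driver.CanAllocate(claim, nil)) // false
}
```

The generic plugin would hold a `VendorDriver` per resource driver and consult it only for the vendor-specific questions listed above.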
The underlying assumption is that vendors can determine the capabilities of
nodes based on labels. Those labels get set by the autoscaler for simulated
nodes, either by cloning some real node or through configuration during scale
up from zero. When some vendor code then encounters a node which doesn't exist
in the real cluster, it can determine what resources the vendor driver would
be able to make available if the node was created for real.

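As an illustration of that assumption, vendor code might map a well-known node label to the capacity its driver could provide on a node that exists only in the simulation. The `example.com/gpu-count` label key and the `CapacityForNode` helper are hypothetical, chosen here only for the sketch:

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuCountLabel is a hypothetical label that a cloud provider or the
// autoscaler's scale-up-from-zero configuration sets on (simulated) nodes.
const gpuCountLabel = "example.com/gpu-count"

// CapacityForNode is a sketch of vendor code: given only the labels of a
// node (which may not exist yet in the real cluster), it predicts how many
// devices the vendor driver would be able to make available on it.
func CapacityForNode(labels map[string]string) (int, error) {
	value, ok := labels[gpuCountLabel]
	if !ok {
		// No label means the driver would not run on this node type.
		return 0, nil
	}
	count, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf("parse %s=%q: %w", gpuCountLabel, value, err)
	}
	return count, nil
}

func main() {
	// A simulated node cloned from a real one, or built from configuration.
	simulatedNode := map[string]string{gpuCountLabel: "8"}
	count, err := CapacityForNode(simulatedNode)
	if err != nil {
		panic(err)
	}
	fmt.Println(count) // prints 8
}
```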
#### Generic plugin enhancements

The changes in this section are independent of DRA. They could also be used to
simulate volume provisioning better.

At the start of a scale up or scale down cycle, autoscaler takes a snapshot of
the current cluster state. Then autoscaler determines whether a real or […]
of scheduler plugins. If a pod fits a node, the snapshot is updated by calling
[NodeInfo.AddPod](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/types.go#L620-L623). This
influences further checks for other pending pods.

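The snapshot-and-update loop can be sketched as follows. `Pod`, `NodeSnapshot`, `fits`, and `addPod` are simplified stand-ins for the autoscaler's real snapshot types and for `NodeInfo.AddPod`, not the actual implementation:

```go
package main

import "fmt"

// Pod and NodeSnapshot are simplified stand-ins for the real autoscaler types.
type Pod struct {
	Name string
	CPU  int
}

type NodeSnapshot struct {
	FreeCPU int
	Pods    []string
}

// fits mimics the PreFilter/Filter checks of the scheduler plugins.
func (n *NodeSnapshot) fits(pod Pod) bool {
	return pod.CPU <= n.FreeCPU
}

// addPod mimics NodeInfo.AddPod: it records the pod in the snapshot so that
// the decision influences further checks for other pending pods.
func (n *NodeSnapshot) addPod(pod Pod) {
	n.FreeCPU -= pod.CPU
	n.Pods = append(n.Pods, pod.Name)
}

func main() {
	node := &NodeSnapshot{FreeCPU: 4}
	pending := []Pod{{Name: "a", CPU: 3}, {Name: "b", CPU: 2}}
	for _, pod := range pending {
		if node.fits(pod) {
			node.addPod(pod)
		}
	}
	// Pod "b" no longer fits because the snapshot was updated for "a".
	fmt.Println(node.Pods, node.FreeCPU) // prints [a] 1
}
```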
The DRA scheduler plugin gets integrated into this snapshotting and simulated
pod scheduling through a new scheduler framework interface:

```
// ClusterAutoScalerPlugin is an interface that is used only by the cluster autoscaler.
// It enables plugins to store state across different scheduling cycles.
//
// The usual call sequence of a plugin when used in the scheduler is:
// - at program startup:
//   - instantiate plugin
//   - EventsToRegister
// - for each new pod:
//   - PreEnqueue
// - for each pod that is ready to be scheduled, one pod at a time:
//   - PreFilter, Filter, etc.
//
// Cluster autoscaler works a bit differently. It identifies all pending pods,
// takes a snapshot of the current cluster state, and then simulates the effect
// of scheduling those pods with additional nodes added to the cluster. To
// determine whether a pod fits into one of these simulated nodes, it
// uses the same PreFilter and Filter plugins as the scheduler. Other extension
// points (Reserve, Bind) are not used. Plugins which modify the cluster state
// therefore need a different way of recording the result of scheduling
// a pod onto a node. This is done through ClusterAutoScalerPlugin.
//
// Cluster autoscaler will:
// - at program startup:
//   - instantiate plugin, with real informer factory and no Kubernetes client
//   - start informers
// - at the start of a simulation:
//   - call StartSimulation with a clean cycle state
// - for each pending pod:
//   - call PreFilter and Filter with the same cycle state that
//     was passed to StartSimulation
//   - call SimulateBindPod with the same cycle state that
//     was passed to StartSimulation (i.e. *not* the one which was modified
//     by PreFilter or Filter) to indicate that a pod is being scheduled
//     onto a node as part of the simulation
//
// A plugin may:
// - Take a snapshot of all relevant cluster state as part of StartSimulation
//   and store it in the cycle state. This signals to the other extension
//   points that the plugin is being used as part of the cluster autoscaler.
// - In PreFilter and Filter use the cluster snapshot to make decisions
//   instead of the normal "live" cluster state.
// - In SimulateBindPod update the snapshot in the cycle state.
type ClusterAutoScalerPlugin interface {
	Plugin

	// StartSimulation is called when the cluster autoscaler begins
	// a simulation.
	StartSimulation(ctx context.Context, state *CycleState) *Status

	// SimulateBindPod is called when the cluster autoscaler decided to schedule
	// a pod onto a certain node.
	SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status

	// NodeIsReady checks whether some real node has been initialized completely.
	// Even if it is "ready" as far as Kubernetes is concerned, some DaemonSet pod
	// might still be missing or not done with its startup yet.