Commit 7e17126

dra: update Cluster Autoscaler support section

The RPC mechanism is likely to have performance challenges. It is better to focus on an extension mechanism for custom autoscaler binaries first. In practice, this is likely to be what cloud providers are running anyway.

1 parent a1382ec commit 7e17126

File tree

1 file changed: +180 −55 lines changed

  • keps/sig-node/3063-dynamic-resource-allocation

keps/sig-node/3063-dynamic-resource-allocation/README.md

Lines changed: 180 additions & 55 deletions
@@ -111,6 +111,9 @@ SIG Architecture for cross-cutting KEPs).
 - [Reserve](#reserve)
 - [Unreserve](#unreserve)
 - [Cluster Autoscaler](#cluster-autoscaler)
+  - [Generic plugin enhancements](#generic-plugin-enhancements)
+  - [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism)
+  - [Building a custom Cluster Autoscaler binary](#building-a-custom-cluster-autoscaler-binary)
 - [kubelet](#kubelet)
   - [Managing resources](#managing-resources)
   - [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin)
@@ -1930,8 +1933,34 @@ Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autosca
 encounters a pod that uses a resource claim, the autoscaler needs assistance by
 the resource driver for that claim to make the right decisions. Without that
 assistance, the autoscaler might scale up the wrong node group (resource is
-provided by nodes in another group) or scale up unnecessarily (resource is
-network-attached and adding nodes won't help).
+provided by nodes in another group) or not scale up at all (pod is pending
+because of a claim that cannot be allocated, but looks schedulable to the
+autoscaler).
+
+With the following changes, vendors can provide Go code in a package that can
+be built into a custom autoscaler binary to support correct scale-up
+simulations for clusters that use their hardware. Extensions for invoking such
+vendor code through some RPC mechanism, as a WASM plugin, or as generic code
+that merely needs to be parameterized for specific hardware could be added
+later in separate KEPs.
+
+The in-tree DRA scheduler plugin remains active. It handles generic checks
+like "can this allocated claim be reserved for this pod" and only calls out to
+vendor code for decisions that only the vendor can make, like "can this claim
+be allocated" and "what effect does allocating this claim have on the
+cluster".
+
+The underlying assumption is that vendors can determine the capabilities of
+nodes based on labels. Those labels get set by the autoscaler for simulated
+nodes, either by cloning some real node or through configuration during scale
+up from zero. When vendor code encounters a node which doesn't exist in the
+real cluster, it can determine what resources the vendor driver would be able
+to make available if that node were created for real.
+
+#### Generic plugin enhancements
+
+The changes in this section are independent of DRA. They could also be used to
+simulate volume provisioning better.
 
 At the start of a scale up or scale down cycle, autoscaler takes a snapshot of
 the current cluster state. Then autoscaler determines whether a real or
@@ -1940,69 +1969,165 @@ of scheduler plugins. If a pod fits a node, the snapshot is updated by calling
 [NodeInfo.AddPod](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/types.go#L620-L623). This
 influences further checks for other pending pods.
 
-To support the custom allocation logic that a vendor uses for its resources,
-the autoscaler needs an extension mechanism similar to the [scheduler
-extender](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/extender.go#L24-L72). The
-existing scheduler extender API has to be extended to include methods that
-would only get called by the autoscaler, like starting a cycle. Instead of
-adding these methods to the scheduler framework, autoscaler can define its own
-interface that inherits from the framework:
+The DRA scheduler plugin gets integrated into this snapshotting and simulated
+pod scheduling through a new scheduler framework interface:
 
 ```
-import "k8s.io/pkg/scheduler/framework"
-
-type Extender interface {
-	framework.Extender
-
-	// NodeSelected gets called when the autoscaler determined that
-	// a pod should run on a node.
-	NodeSelected(pod *v1.Pod, node *v1.Node) error
-
-	// NodeReady gets called by the autoscaler to check whether
-	// a new node is fully initialized.
-	NodeReady(nodeName string) (bool, error)
+// ClusterAutoScalerPlugin is an interface that is used only by the cluster autoscaler.
+// It enables plugins to store state across different scheduling cycles.
+//
+// The usual call sequence of a plugin when used in the scheduler is:
+// - at program startup:
+//   - instantiate plugin
+//   - EventsToRegister
+// - for each new pod:
+//   - PreEnqueue
+// - for each pod that is ready to be scheduled, one pod at a time:
+//   - PreFilter, Filter, etc.
+//
+// Cluster autoscaler works a bit differently. It identifies all pending pods,
+// takes a snapshot of the current cluster state, and then simulates the effect
+// of scheduling those pods with additional nodes added to the cluster. To
+// determine whether a pod fits into one of these simulated nodes, it
+// uses the same PreFilter and Filter plugins as the scheduler. Other extension
+// points (Reserve, Bind) are not used. Plugins which modify the cluster state
+// therefore need a different way of recording the result of scheduling
+// a pod onto a node. This is done through ClusterAutoScalerPlugin.
+//
+// Cluster autoscaler will:
+// - at program startup:
+//   - instantiate plugin, with real informer factory and no Kubernetes client
+//   - start informers
+// - at the start of a simulation:
+//   - call StartSimulation with a clean cycle state
+// - for each pending pod:
+//   - call PreFilter and Filter with the same cycle state that
+//     was passed to StartSimulation
+//   - call SimulateBindPod with the same cycle state that
+//     was passed to StartSimulation (i.e. *not* the one which was modified
+//     by PreFilter or Filter) to indicate that a pod is being scheduled
+//     onto a node as part of the simulation
+//
+// A plugin may:
+// - Take a snapshot of all relevant cluster state as part of StartSimulation
+//   and store it in the cycle state. This signals to the other extension
+//   points that the plugin is being used as part of the cluster autoscaler.
+// - In PreFilter and Filter use the cluster snapshot to make decisions
+//   instead of the normal "live" cluster state.
+// - In SimulateBindPod update the snapshot in the cycle state.
+type ClusterAutoScalerPlugin interface {
+	Plugin
+	// StartSimulation is called when the cluster autoscaler begins
+	// a simulation.
+	StartSimulation(ctx context.Context, state *CycleState) *Status
+	// SimulateBindPod is called when the cluster autoscaler has decided to
+	// schedule a pod onto a certain node.
+	SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
+	// NodeIsReady checks whether some real node has been initialized completely.
+	// Even if it is "ready" as far as Kubernetes is concerned, some DaemonSet pod
+	// might still be missing or not done with its startup yet.
+	NodeIsReady(ctx context.Context, node *v1.Node) (bool, error)
 }
 ```
 
-The underlying implementation can either be compiled into a custom autoscaler
-binary by a cloud provider who controls the entire cluster or use HTTP similar
-to the [HTTP
-extender](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/extender.go#L41-L53).
-As an initial step, configuring such HTTP webhooks for different resource
-drivers can be added to the configuration file defined by the `--cloud-config`
-configuration file with a common field that gets added in all cloud provider
-configs or a new `--config` parameter can be added. Later, dynamically
-discovering deployed webhooks can be added through an autoscaler CRD.
-
-In contrast to the in-tree HTTP extender implementation, the one for autoscaler
-must be session oriented: when creating the extender for a cycle, a new "start"
-verb needs to be invoked. When this is called in a resource driver controller
-webhook, it needs to take a snapshot of the relevant state and return a session
-ID. This session ID must be included in all following HTTP invocations as a
-"session" header. Ensuring that a "stop" verb gets called reliably would
-complicate the autoscaler. Instead, the webhook should support a small number
-of recent sessions and garbage-collect older ones.
-
-The existing `extenderv1.ExtenderArgs` and `extenderv1.ExtenderFilterResult`
-API can be used for the "filter" operation. The extender can be added to the
-list of active scheduler plugins because it implements the plugin interface.
-Because filter operations may involve fictional nodes, the full `Node` objects
-instead of just the node names must be passed. For fictional nodes, the
-resource driver must determine based on labels which resources it can provide
-on such a node. New APIs are needed for `NodeSelected` and `NodeReady`.
-
-`NodeReady` is needed to solve one particular problem: when a new node first
+`NodeIsReady` is needed to solve one particular problem: when a new node first
 starts up, it may be ready to run pods, but the pod from a resource driver's
 DaemonSet may still be starting up. If the resource driver controller needs
 information from such a pod, then it will not be able to filter
 correctly. Similar to how extended resources are handled, the autoscaler then
-first needs to wait until the extender also considers the node to be ready.
+first needs to wait until the plugin also considers the node to be ready.
 
-Such extenders completely replace the generic scheduler resource plugin. The
-generic plugin would be able to filter out nodes based on already allocated
-resources. But because it is stateless, it would not handle the use count
-restrictions correctly when multiple pods are pending and reference the same
-resource.
+#### DRA scheduler plugin extension mechanism
+
+The in-tree scheduler plugin gets extended by vendors through the following API
+in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code depends neither
+on the k/k/pkg/scheduler package nor on autoscaler packages.
+
+```
+// Registry stores all known plugins which can simulate claim allocation.
+// It is thread-safe.
+var Registry registry
+
+// PluginName is a special type that is used to look up plugins for a claim.
+// For now it must be the same as the driver name in the resource class of a
+// claim.
+type PluginName string
+
+// Add adds or overwrites the plugin for a certain name.
+func (r *registry) Add(name PluginName, plugin Plugin) { ... }
+
+...
+
+// Plugin is used to register a plugin.
+type Plugin interface {
+	// Activate will get called to prepare the plugin for usage.
+	Activate(ctx context.Context, client kubernetes.Interface, informerFactory informers.SharedInformerFactory) (ActivePlugin, error)
+}
+
+// ActivePlugin is a plugin which is ready to start a simulation.
+type ActivePlugin interface {
+	// Start will get called at the start of a simulation. The plugin must
+	// capture the current cluster state.
+	Start(ctx context.Context) (StartedPlugin, error)
+
+	// NodeIsReady checks whether some real node has been initialized completely.
+	NodeIsReady(ctx context.Context, node *v1.Node) (bool, error)
+}
+
+// StartedPlugin is a plugin which encapsulates a certain cluster state and
+// can make changes to it.
+type StartedPlugin interface {
+	// Clone must create a new, independent copy of the current state.
+	// This must be fast and cannot fail. If it has to do some long-running
+	// operation, then it must do that in a new goroutine and check the
+	// result when some method is called in the returned instance.
+	Clone() StartedPlugin
+
+	// NodeIsSuitable checks whether a claim could be allocated for
+	// a pod such that it will be available on the node.
+	NodeIsSuitable(ctx context.Context, pod *v1.Pod, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (bool, error)
+
+	// Allocate must adapt the cluster state as if the claim
+	// had been allocated for use on the selected node and return
+	// the result for the claim. It must not modify the claim,
+	// that will be done by the caller.
+	Allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (*resourcev1alpha2.AllocationResult, error)
+}
+```
+
+When the DRA scheduler plugin gets initialized, it activates all registered
+vendor plugins. When `StartSimulation` is called, all vendor plugins are
+started. When the scheduler plugin's state data is cloned, the plugins also
+get cloned. In addition, `StartSimulation` captures the state of all claims.
+
+`NodeIsSuitable` is called during the `Filter` check to determine whether a
+pending claim could be allocated for a node. `Allocate` is called as part of
+the `SimulateBindPod` implementation. The simulated allocation result is stored
+in the claim snapshot and then the claim is reserved for the pod. If the claim
+cannot be shared between pods, that prevents other pods from using the
+claim while the autoscaler goes through its bin-packing simulation.
+
+Finally, `NodeIsReady` of each vendor plugin is called to implement the
+scheduler plugin's own `NodeIsReady`.
+
+#### Building a custom Cluster Autoscaler binary
+
+Vendors are encouraged to include an "init" package together with their driver
+simulation implementation. That "init" package registers their plugin. Then to
+build a custom autoscaler binary, one additional file alongside `main.go` is
+sufficient:
+
+```
+package main
+
+import (
+	_ "acme.example.com/dra-resource-driver/simulation-plugin/init"
+)
+```
+
+This init package may also register additional command line flags. Care must be
+taken to not cause conflicts between different plugins, so all vendor flags
+should start with a unique prefix.
 
 ### kubelet
 