
Commit 2bcdb5c

DRA: defer to numeric parameters as solution for cluster autoscaling
Technically the support for cluster autoscaling can be defined and implemented as an extension of the core DRA, without changing the core feature. By separating out the specification of "numeric parameters" into a separate KEP it might be easier to make progress on the different aspects because they are better separated.
1 parent 3a55078 commit 2bcdb5c

File tree

  • keps/sig-node/3063-dynamic-resource-allocation

1 file changed (+14, −259)

keps/sig-node/3063-dynamic-resource-allocation/README.md

Lines changed: 14 additions & 259 deletions
@@ -77,7 +77,6 @@ SIG Architecture for cross-cutting KEPs).
 - [User Stories](#user-stories)
 - [Cluster add-on development](#cluster-add-on-development)
 - [Cluster configuration](#cluster-configuration)
-- [Integration with cluster autoscaling](#integration-with-cluster-autoscaling)
 - [Partial GPU allocation](#partial-gpu-allocation)
 - [Network-attached accelerator](#network-attached-accelerator)
 - [Combined setup of different hardware functions](#combined-setup-of-different-hardware-functions)
@@ -114,10 +113,6 @@ SIG Architecture for cross-cutting KEPs).
 - [Reserve](#reserve)
 - [Unreserve](#unreserve)
 - [Cluster Autoscaler](#cluster-autoscaler)
-- [Generic plugin enhancements](#generic-plugin-enhancements)
-- [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism)
-- [Handling claims without vendor code](#handling-claims-without-vendor-code)
-- [Building a custom Cluster Autoscaler binary](#building-a-custom-cluster-autoscaler-binary)
 - [kubelet](#kubelet)
 - [Managing resources](#managing-resources)
 - [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin)
@@ -441,15 +436,6 @@ parametersRef:
   name: acme-gpu-init
 ```
 
-#### Integration with cluster autoscaling
-
-As a cloud provider, I want to support GPUs as part of a hosted Kubernetes
-environment, including cluster autoscaling. I ensure that the kernel is
-configured as required by the hardware and that the container runtime supports
-CDI. I review the Go code provided by the vendor for simulating cluster scaling
-and build it into a customized cluster autoscaler binary that supports my cloud
-infrastructure.
-
 #### Partial GPU allocation
 
 As a user, I want to use a GPU as accelerator, but don't need exclusive access
@@ -676,8 +662,8 @@ allocation also may turn out to be insufficient. Some risks are:
 - Network-attached resources may have additional constraints that are not
   captured yet (like limited number of nodes that they can be attached to).
 
-- Cluster autoscaling will not work as expected unless the autoscaler and
-  resource drivers get extended to support it.
+- Cluster autoscaling will not work as expected unless the DRA driver
+  uses [numeric parameters](https://github.com/kubernetes/enhancements/issues/4381).
 
 All of these risks will have to be evaluated by gathering feedback from users
 and resource driver developers.
@@ -1974,252 +1960,21 @@ progress.
 
 ### Cluster Autoscaler
 
-<<[UNRESOLVED pohly]>>
-The entire autoscaler section is tentative. Key opens:
-- Are DRA driver authors able and willing to provide implementations of
-  the simulation interface if needed for their driver?
-- Is the simulation interface generic enough to work across a variety
-  of autoscaler forks and/or implementations? What about Karpenter?
-- Is the suggested deployment approach (rebuild binary) workable?
-- Can we really not do something else, ideally RPC-based?
-<<[/UNRESOLVED]>>
-
 When [Cluster
 Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#cluster-autoscaler)
-encounters a pod that uses a resource claim for node-local resources, the autoscaler needs assistance by
-the resource driver for that claim to make the right decisions. Without that
-assistance, the autoscaler might scale up the wrong node group (resource is
-provided by nodes in another group) or not scale up (pod is pending because of
-a claim that cannot be allocated, but looks like it should be scheduleable
-to the autoscaler).
-
-With the following changes, vendors can provide Go code in a package that can
-be built into a custom autoscaler binary to support correct scale up
-simulations for clusters that use their hardware. Extensions for invoking such
-vendor code through some RPC mechanism, as WASM plugin, or some generic
-code which just needs to be parameterized for specific hardware could be added
-later in separate KEPs.
-
-Such vendor code is *not* needed for network-attached resources. Adding or
-removing nodes does not change availability of such resources.
-
-The in-tree DRA scheduler plugin is still active. It handles the generic checks
-like "can this allocated claim be reserved for this pod" and only calls out to
-vendor code when it comes to decisions that only the vendor can handle, like
-"can this claim be allocated" and "what effect does allocating this claim have
-for the cluster".
-
-The underlying assumption is that vendors can determine the capabilities of
-nodes based on labels. Those labels get set by the autoscaler for simulated
-nodes either by cloning some real node or through configuration during scale up
-from zero. Then when some vendor code encounters a node which doesn't exit
-in the real cluster, it can determine what resource the vendor driver would
-be able to make available if it was created for real.
-
-#### Generic plugin enhancements
-
-The changes in this section are independent of DRA. They could also be used to
-simulate volume provisioning better.
-
-At the start of a scale up or scale down cycle, autoscaler takes a snapshot of
-the current cluster state. Then autoscaler determines whether a real or
-fictional node fits a pod by calling the pre-filter and filter extension points
-of scheduler plugins. If a pod fits a node, the snapshot is updated by calling
-[NodeInfo.AddPod](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/types.go#L620-L623). This
-influences further checks for other pending pods. During scale down, eviction
-is simulated by
-[SimulateNodeRemoval](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L149)
-which [pretends that pods running on a node that is to be removed are not
-running](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L231-L237).
-
-The DRA scheduler plugin gets integrated into this snapshotting and simulated
-pod scheduling through a new scheduler framework interface:
-
-```
-// ClusterAutoScalerPlugin is an interface that is used only by the cluster autoscaler.
-// It enables plugins to store state across different scheduling cycles.
-//
-// The usual call sequence of a plugin when used in the scheduler is:
-// - at program startup:
-//   - instantiate plugin
-//   - EventsToRegister
-// - for each new pod:
-//   - PreEnqueue
-// - for each pod that is ready to be scheduled, one pod at a time:
-//   - PreFilter, Filter, etc.
-//
-// Cluster autoscaler works a bit differently. It identifies all pending pods,
-// takes a snapshot of the current cluster state, and then simulates the effect
-// of scheduling those pods with additional nodes added to the cluster. To
-// determine whether a pod fits into one of these simulated nodes, it
-// uses the same PreFilter and Filter plugins as the scheduler. Other extension
-// points (Reserve, Bind) are not used. Plugins which modify the cluster state
-// therefore need a different way of recording the result of scheduling
-// a pod onto a node. This is done through ClusterAutoScalerPlugin.
-//
-// Cluster autoscaler will:
-// - at program startup:
-//   - instantiate plugin, with real informer factory and no Kubernetes client
-//   - start informers
-// - at the start of a simulation:
-//   - call StartSimulation with a clean cycle state
-// - for each pending pod:
-//   - call PreFilter and Filter with the same cycle state that
-//     was passed to StartSimulation
-//   - call SimulateBindPod with the same cycle state that
-//     was passed to StartSimulation (i.e. *not* the one which was modified
-//     by PreFilter or Filter) to indicate that a pod is being scheduled onto a node
-//     as part of the simulation
-//
-// A plugin may:
-// - Take a snapshot of all relevant cluster state as part of StartSimulation
-//   and store it in the cycle state. This signals to the other extension
-//   points that the plugin is being used as part of the cluster autoscaler.
-// - In PreFilter and Filter use the cluster snapshot to make decisions
-//   instead of the normal "live" cluster state.
-// - In SimulateBindPod update the snapshot in the cycle state.
-type ClusterAutoScalerPlugin interface {
-	Plugin
-	// StartSimulation is called when the cluster autoscaler begins
-	// a simulation.
-	StartSimulation(ctx context.Context, state *CycleState) *Status
-	// SimulateBindPod is called when the cluster autoscaler decided to schedule
-	// a pod onto a certain node.
-	SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
-	// SimulateEvictPod is called when the cluster autoscaler simulates removal
-	// of a node. All claims used only by this pod should be considered deallocated,
-	// to enable starting the same pod elsewhere.
-	SimulateEvictPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status
-	// NodeIsReady checks whether some real node has been initialized completely.
-	// Even if it is "ready" as far Kubernetes is concerned, some DaemonSet pod
-	// might still be missing or not done with its startup yet.
-	NodeIsReady(ctx context.Context, node *v1.Node) (bool, error)
-}
-```
-
-`NodeIsReady` is needed to solve one particular problem: when a new node first
-starts up, it may be ready to run pods, but the pod from a resource driver's
-DaemonSet may still be starting up. If the resource driver controller needs
-information from such a pod, then it will not be able to filter
-correctly. Similar to how extended resources are handled, the autoscaler then
-first needs to wait until the plugin also considers the node to be ready.
-
-#### DRA scheduler plugin extension mechanism
-
-The in-tree scheduler plugin gets extended by vendors through the following API
-in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code depends
-neither on the k/k/pkg/scheduler package nor on autoscaler packages.
-
-```
-// Registry stores all known plugins which can simulate claim allocation.
-// It is thread-safe.
-var Registry registry
-
-// PluginName is a special type that is used to look up plugins for a claim.
-// For now it must be the same as the driver name in the resource class of a
-// claim.
-type PluginName string
-
-// Add adds or overwrites the plugin for a certain name.
-func (r *registry) Add(name PluginName, plugin Plugin) { ... }
-
-...
-
-// Plugin is used to register a plugin.
-type Plugin interface {
-	// Activate will get called to prepare the plugin for usage.
-	Activate(ctx context.Context, client kubernetes.Interface, informerFactory informers.SharedInformerFactory) (ActivePlugin, error)
-}
-
-// ActivePlugin is a plugin which is ready to start a simulation.
-type ActivePlugin interface {
-	// Start will get called at the start of a simulation. The plugin must
-	// capture the current cluster state.
-	Start(ctx context.Context) (StartedPlugin, error)
-
-	// NodeIsReady checks whether some real node has been initialized completely.
-	NodeIsReady(ctx context.Context, node *v1.Node) (bool, error)
-}
-
-// StartedPlugin is a plugin which encapsulates a certain cluster state and
-// can make changes to it.
-type StartedPlugin interface {
-	// Clone must create a new, independent copy of the current state.
-	// This must be fast and cannot fail. If it has to do some long-running
-	// operation, then it must do that in a new goroutine and check the
-	// result when some method is called in the returned instance.
-	Clone() StartedPlugin
-
-	// NodeIsSuitable checks whether a claim could be allocated for
-	// a pod such that it will be available on the node.
-	NodeIsSuitable(ctx context.Context, pod *v1.Pod, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (bool, error)
-
-	// Allocate must adapt the cluster state as if the claim
-	// had been allocated for use on the selected node and return
-	// the result for the claim. It must not modify the claim,
-	// that will be done by the caller.
-	Allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (*resourcev1alpha2.AllocationResult, error)
-
-	// Deallocate must adapt the cluster state as if the claim
-	// had been deallocated. It must not modify the claim,
-	// that will be done by the caller.
-	Deallocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim) error
-}
-```
-
-When the DRA scheduler plugin gets initialized, it activates all registered
-vendor plugins. When `StartSimulation` is called, all vendor plugins are
-started. When the scheduler plugin's state data is cloned, the plugins also
-get cloned. In addition, `StartSimulation` captures the state of all claims.
-
-`NodeIsSuitable` is called during the `Filter` check to determine whether a
-pending claim could be allocated for a node. `Allocate` is called as part of
-the `SimulateBindPod` implementation. The simulated allocation result is stored
-in the claim snapshot and then the claim is reserved for the pod. If the claim
-cannot be shared between pods, that will prevent other pods from using the
-claim while the autoscaler goes through it's binpacking simulation.
-
-Finally, `NodeIsReady` of each vendor plugin is called to implement the
-scheduler plugin's own `NodeIsReady`.
-
-#### Handling claims without vendor code
+encounters a pod that uses a resource claim for node-local resources, it needs
+to understand the parameters for the claim and available capacity in order
+to simulate the effect of allocating claims as part of scheduling and of
+creating or removing nodes.
 
-When the DRA scheduler plugin does not have specific vendor code for a certain
-resource class, it falls back to the assumption that resources are unlimited,
-i.e. allocation will always work. This is how volume provisioning is currently
-handled during cluster autoscaling.
-
-If a pod is not getting scheduled because a resource claim cannot be allocated
-by the real DRA driver, to the autoscaler it will look like the pod should be
-schedulable and therefore it will not spin up new nodes for it, which is the
-right decision.
-
-If a pod is not getting scheduled because some other resource requirement is
-not satisfied, the autoscaler will simulate scale up and can pick some
-arbitrary node pool because the DRA scheduler plugin will accept all of those
-nodes.
-
-During scale down, moving a running pod to a different node is assumed to work,
-so that scenario also works.
-
-#### Building a custom Cluster Autoscaler binary
-
-Vendors are encouraged to include an "init" package in their driver
-simulation implementation. That "init" package registers their plugin. Then to
-build a custom autoscaler binary, one additional file alongside `main.go` is
-sufficient:
-
-```
-package main
-
-import (
-	_ "acme.example.com/dra-resource-driver/simulation-plugin/init"
-)
-```
+This is not possible with opaque parameters as described in this KEP. If a DRA
+driver developer wants to support Cluster Autoscaler, they have to use numeric
+parameters. Numeric parameters are an extension of this KEP that is defined in
+[KEP #4381](https://github.com/kubernetes/enhancements/issues/4381).
 
-This init package may also register additional command line flags. Care must be
-taken to not cause conflicts between different plugins, so all vendor flags
-should start with a unique prefix.
+Numeric parameters are not necessary for network-attached resources because
+adding or removing nodes doesn't change their availability and thus Cluster
+Autoscaler does not need to understand their parameters.
 
 ### kubelet
 
@@ -2684,7 +2439,7 @@ For beta:
 
 #### Alpha -> Beta Graduation
 
-- Implement integration with Cluster Autoscaler
+- Implement integration with Cluster Autoscaler through numeric parameters
 - Gather feedback from developers and surveys
 - Positive acknowledgment from 3 would-be implementors of a resource driver,
   from a diversity of companies or projects
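For readers of the removed section: the vendor-registration mechanism that this commit drops can be sketched as follows. This is a minimal, self-contained Go stand-in, not the actual removed API — the real `k8s.io/dynamic-resource-allocation/simulation` interfaces took `context.Context`, client-go clients, and `ResourceClaim` objects, and `acmePlugin` plus its driver name are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// PluginName looks up a plugin; in the removed design it had to match
// the driver name in the claim's resource class.
type PluginName string

// Plugin is a stand-in for the removed vendor plugin interface.
type Plugin interface {
	// NodeIsSuitable reports whether a claim handled by this driver
	// could be allocated so that it is usable on the named node.
	NodeIsSuitable(nodeName string) bool
}

// registry is a thread-safe name -> plugin map, like the removed Registry.
type registry struct {
	mu      sync.Mutex
	plugins map[PluginName]Plugin
}

// Add adds or overwrites the plugin for a certain name.
func (r *registry) Add(name PluginName, p Plugin) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.plugins == nil {
		r.plugins = make(map[PluginName]Plugin)
	}
	r.plugins[name] = p
}

// Get looks up the plugin registered for a driver name.
func (r *registry) Get(name PluginName) (Plugin, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	p, ok := r.plugins[name]
	return p, ok
}

// Registry is the process-wide plugin registry.
var Registry registry

// acmePlugin is a hypothetical vendor implementation.
type acmePlugin struct{}

func (acmePlugin) NodeIsSuitable(nodeName string) bool {
	// A real plugin would inspect labels that the autoscaler copies onto
	// simulated nodes; this sketch accepts every node.
	return true
}

// In the removed design this registration lived in the vendor's "init"
// package, imported for side effects from the custom autoscaler's main.go.
func init() {
	Registry.Add("gpu.acme.example.com", acmePlugin{})
}

func main() {
	p, ok := Registry.Get("gpu.acme.example.com")
	fmt.Println(ok, p.NodeIsSuitable("node-1"))
}
```

The numeric-parameters approach of KEP #4381 replaces exactly this per-vendor compiled-in code: the autoscaler reads capacity from the API instead of calling a registered plugin.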
