Commit d4cb60c

DRA: include SimulateEvictPod for scale down, language tweaks
As discussed on Slack, scale down must determine whether some currently running pods could get moved. This simulation depends on simulating deallocation; otherwise the allocated claim prevents moving pods.
1 parent 7e17126 commit d4cb60c

1 file changed, +20 -7 lines changed

keps/sig-node/3063-dynamic-resource-allocation/README.md

Lines changed: 20 additions & 7 deletions
@@ -112,7 +112,7 @@ SIG Architecture for cross-cutting KEPs).
 - [Unreserve](#unreserve)
 - [Cluster Autoscaler](#cluster-autoscaler)
 - [Generic plugin enhancements](#generic-plugin-enhancements)
-- [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism)
+- [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism)
 - [Building a custom Cluster Autoscaler binary](#building-a-custom-cluster-autoscaler-binary)
 - [kubelet](#kubelet)
 - [Managing resources](#managing-resources)
@@ -1967,7 +1967,11 @@ the current cluster state. Then autoscaler determines whether a real or
 fictional node fits a pod by calling the pre-filter and filter extension points
 of scheduler plugins. If a pod fits a node, the snapshot is updated by calling
 [NodeInfo.AddPod](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/types.go#L620-L623). This
-influences further checks for other pending pods.
+influences further checks for other pending pods. During scale down, eviction
+is simulated by
+[SimulateNodeRemoval](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L149)
+which [pretends that pods running on a node that is to be removed are not
+running](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L231-L237).
 
 The DRA scheduler plugin gets integrated into this snapshotting and simulated
 pod scheduling through a new scheduler framework interface:
@@ -2023,6 +2027,10 @@ type ClusterAutoScalerPlugin interface {
     // SimulateBindPod is called when the cluster autoscaler decided to schedule
     // a pod onto a certain node.
     SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
+    // SimulateEvictPod is called when the cluster autoscaler simulates removal
+    // of a node. All claims used only by this pod should be considered deallocated,
+    // to enable starting the same pod elsewhere.
+    SimulateEvictPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status
     // NodeIsReady checks whether some real node has been initialized completely.
     // Even if it is "ready" as far Kubernetes is concerned, some DaemonSet pod
     // might still be missing or not done with its startup yet.
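The rule stated in the new `SimulateEvictPod` comment ("all claims used only by this pod should be considered deallocated") can be illustrated with a small, self-contained sketch. Everything here is hypothetical: the `claim` type and the `simulateEvictPod` helper are simplified stand-ins invented for illustration, not the KEP's types or the real scheduler framework API.

```go
package main

import "fmt"

// claim is a hypothetical, simplified stand-in for a ResourceClaim.
// It only tracks whether the claim is allocated and which pods use it.
type claim struct {
	name        string
	allocated   bool
	reservedFor map[string]bool // names of pods that use the claim
}

// simulateEvictPod drops podName from every claim it uses. A claim that is
// left without any user is marked as deallocated. The names of those
// deallocated claims are returned.
func simulateEvictPod(claims []*claim, podName string) []string {
	var freed []string
	for _, c := range claims {
		if !c.reservedFor[podName] {
			continue // claim is not used by this pod
		}
		delete(c.reservedFor, podName)
		if len(c.reservedFor) == 0 && c.allocated {
			c.allocated = false
			freed = append(freed, c.name)
		}
	}
	return freed
}

func main() {
	claims := []*claim{
		{name: "gpu-a", allocated: true, reservedFor: map[string]bool{"pod-1": true}},
		{name: "gpu-shared", allocated: true, reservedFor: map[string]bool{"pod-1": true, "pod-2": true}},
	}
	// Only gpu-a is used exclusively by pod-1, so only it is freed.
	fmt.Println(simulateEvictPod(claims, "pod-1")) // prints "[gpu-a]"
}
```

In this sketch a claim shared with another pod stays allocated, which mirrors the "used only by this pod" wording: the shared claim still pins its remaining user to the node.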
@@ -2037,11 +2045,11 @@ information from such a pod, then it will not be able to filter
 correctly. Similar to how extended resources are handled, the autoscaler then
 first needs to wait until the plugin also considers the node to be ready.
 
-### DRA scheduler plugin extension mechanism
+#### DRA scheduler plugin extension mechanism
 
 The in-tree scheduler plugin gets extended by vendors through the following API
-in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code does not depend
-on the k/k/pkg/scheduler package nor on autoscaler packages.
+in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code depends
+neither on the k/k/pkg/scheduler package nor on autoscaler packages.
 
 ```
 // Registry stores all known plugins which can simulate claim allocation.
@@ -2092,12 +2100,17 @@ type StartedPlugin interface {
     // the result for the claim. It must not modify the claim,
     // that will be done by the caller.
     Allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (*resourcev1alpha2.AllocationResult, error)
+
+    // Deallocate must adapt the cluster state as if the claim
+    // had been deallocated. It must not modify the claim,
+    // that will be done by the caller.
+    Deallocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim) error
 }
 ```
 
 When the DRA scheduler plugin gets initialized, it activates all registered
 vendor plugins. When `StartSimulation` is called, all vendor plugins are
-started. When the scheduler plugin's state data is cloned, the plugin's also
+started. When the scheduler plugin's state data is cloned, the plugins also
 get cloned. In addition, `StartSimulation` captures the state of all claims.
 
 `NodeIsSuitable` is called during the `Filter` check to determine whether a
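The interplay of cloning and `Deallocate` described in this hunk can be sketched as follows. This is a minimal illustration under stated assumptions: `vendorSim`, `clone`, and `deallocate` are hypothetical names, the state is reduced to a map from claim name to node name, and returning an error for an unknown claim is a choice made for the sketch rather than something the KEP text prescribes.

```go
package main

import "fmt"

// vendorSim is a hypothetical vendor plugin state: for each allocated claim
// it records the node whose simulated resources the claim occupies.
type vendorSim struct {
	inUse map[string]string // claim name -> node name
}

// clone returns a deep copy, so that changes made in one simulation do not
// leak into the state it was cloned from.
func (v *vendorSim) clone() *vendorSim {
	c := &vendorSim{inUse: make(map[string]string, len(v.inUse))}
	for claimName, nodeName := range v.inUse {
		c.inUse[claimName] = nodeName
	}
	return c
}

// deallocate adapts the simulated state as if the claim had been
// deallocated. It only touches the plugin's own bookkeeping. Treating an
// unknown claim as an error is an assumption of this sketch.
func (v *vendorSim) deallocate(claimName string) error {
	if _, ok := v.inUse[claimName]; !ok {
		return fmt.Errorf("claim %q is not allocated", claimName)
	}
	delete(v.inUse, claimName)
	return nil
}

func main() {
	base := &vendorSim{inUse: map[string]string{"claim-a": "node-1"}}
	sim := base.clone()
	if err := sim.deallocate("claim-a"); err != nil {
		fmt.Println("error:", err)
		return
	}
	// The clone lost the allocation, the original state is untouched.
	fmt.Println(len(base.inUse), len(sim.inUse)) // prints "1 0"
}
```

Keeping the deallocation inside the clone is what would let a scale-down simulation be discarded without side effects on the captured cluster state.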
@@ -2112,7 +2125,7 @@ scheduler plugin's own `NodeIsReady`.
 
 #### Building a custom Cluster Autoscaler binary
 
-Vendors are encouraged to include an "init" package together with their driver
+Vendors are encouraged to include an "init" package in their driver
 simulation implementation. That "init" package registers their plugin. Then to
 build a custom autoscaler binary, one additional file alongside `main.go` is
 sufficient:
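The diff context ends before the file itself is shown, and it is not reproduced here. As a rough illustration of the registration pattern the paragraph describes, the single-file sketch below relies on Go's `init` functions. All names (`registry`, `register`, `registeredDrivers`, the driver name `gpu.example.com`) are hypothetical. In a real build the `init` function would presumably live in the vendor's "init" package, and the extra file next to `main.go` would only need to import that package for its side effect.

```go
package main

import (
	"fmt"
	"sort"
)

// registry is a hypothetical stand-in for the plugin registry of the
// simulation package. It maps a driver name to a plugin constructor.
var registry = map[string]func() string{}

// register adds a plugin constructor under the given driver name.
func register(driverName string, start func() string) {
	registry[driverName] = start
}

// init runs before main. In this sketch it plays the role of the vendor's
// "init" package, which registers the vendor plugin as a side effect of
// being linked into the binary.
func init() {
	register("gpu.example.com", func() string { return "simulated gpu plugin" })
}

// registeredDrivers returns the names of all registered drivers, sorted.
func registeredDrivers() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	fmt.Println(registeredDrivers()) // prints "[gpu.example.com]"
}
```

The point of the pattern is that the autoscaler's own sources stay unmodified: linking in one more package is enough to make the plugin known.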
