[KEP] Add resource policy plugin #594
Open

KunWuLuan wants to merge 5 commits into kubernetes-sigs:master from KunWuLuan:kep/resourcepolicy
# Resource Policy

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Use Cases](#use-cases)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [CRD API](#crd-api)
  - [Implementation Details](#implementation-details)
    - [Scheduler Plugins](#scheduler-plugins)
      - [PreFilter](#prefilter)
      - [Filter](#filter)
      - [Score](#score)
    - [Resource Policy Controller](#resource-policy-controller)
- [Known limitations](#known-limitations)
- [Test plans](#test-plans)
- [Graduation criteria](#graduation-criteria)
- [Feature enablement and rollback](#feature-enablement-and-rollback)
<!-- /toc -->

## Summary
This proposal introduces a plugin that enables users to set priorities for different kinds of resources and to define maximum resource consumption limits for workloads on each kind.

## Motivation
A Kubernetes cluster typically consists of heterogeneous machines with varying SKUs for CPU, memory, GPU, and price. To
use the different resources in the cluster efficiently, users can set priorities for the different machine
types and configure resource allocations for different workloads. Additionally, they may prefer to delete pods running
on low-priority nodes rather than those on high-priority ones.

### Use Cases

1. As an administrator of a Kubernetes cluster, I have some static but expensive VM instances and some dynamic but cheaper Spot
instances in my cluster. I want to restrict the resource consumption of different workloads on each kind of resource to limit cost.
I want important workloads to be deployed first on the static VM instances so that they do not have to worry about being preempted. During business peaks, the Pods that are scaled up should land on the cheap Spot instances, and at the end of the peak, the Pods on Spot
instances should be scaled down first.

### Goals

1. Develop a filter plugin that restricts the resource consumption of different workloads on each kind of resource.
2. Develop a score plugin that favors nodes belonging to a higher-priority kind of resource.
3. Automatically set deletion costs on Pods through a controller to control the scale-in order of workloads.

### Non-Goals

1. The scheduler will not delete pods.

## Proposal

### CRD API
```yaml
apiVersion: scheduling.sigs.x-k8s.io/v1alpha1
kind: ResourcePolicy
metadata:
  name: xxx
  namespace: xxx
spec:
  matchLabelKeys:
    - pod-template-hash
  matchPolicy:
    ignoreTerminatingPod: true
  podSelector:
    matchExpressions:
      - key: key1
        operator: In
        values:
          - value1
    matchLabels:
      key1: value1
  strategy: prefer
  units:
    - name: unit1
      priority: 5
      maxCount: 10
      nodeSelector:
        matchExpressions:
          - key: key1
            operator: In
            values:
              - value1
    - name: unit2
      priority: 5
      maxCount: 10
      nodeSelector:
        matchExpressions:
          - key: key1
            operator: In
            values:
              - value2
    - name: unit3
      priority: 4
      maxCount: 20
      nodeSelector:
        matchLabels:
          key1: value3
```

```go
type ResourcePolicy struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec ResourcePolicySpec
}

type ResourcePolicySpec struct {
	// MatchLabelKeys groups the matched pods, similar to matchLabelKeys in PodTopologySpread.
	MatchLabelKeys []string
	// MatchPolicy refines which pods are matched, e.g. whether terminating pods are ignored.
	MatchPolicy MatchPolicy
	// Strategy is either "required" or "prefer".
	Strategy string
	// PodSelector selects the pods this policy applies to, within the policy's namespace.
	PodSelector metav1.LabelSelector
	// Units describes the kinds of resources (groups of nodes) and their priorities.
	Units []Unit
}

type MatchPolicy struct {
	IgnoreTerminatingPod bool
}

type Unit struct {
	Name         string
	Priority     *int32
	MaxCount     *int32
	NodeSelector metav1.LabelSelector
}
```

Pods are matched by a ResourcePolicy in the same namespace when they satisfy `.spec.podSelector`. If `.spec.matchPolicy.ignoreTerminatingPod` is `true`, pods with a non-zero `.metadata.deletionTimestamp` are ignored.
A ResourcePolicy never matches pods in other namespaces, and a pod cannot be matched by more than one ResourcePolicy.

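A minimal sketch of these matching rules, using the Go types above; `matches` is a hypothetical helper for illustration, not the plugin's actual code:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// matches reports whether a pod is matched by a ResourcePolicy: same namespace,
// podSelector satisfied, and terminating pods optionally ignored.
func matches(rp *ResourcePolicy, pod *corev1.Pod) (bool, error) {
	// A ResourcePolicy never matches pods in another namespace.
	if rp.Namespace != pod.Namespace {
		return false, nil
	}
	// Terminating pods carry a non-zero deletionTimestamp.
	if rp.Spec.MatchPolicy.IgnoreTerminatingPod && pod.DeletionTimestamp != nil {
		return false, nil
	}
	selector, err := metav1.LabelSelectorAsSelector(&rp.Spec.PodSelector)
	if err != nil {
		return false, err
	}
	return selector.Matches(labels.Set(pod.Labels)), nil
}
```
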
By default, pods can only be scheduled on the units defined in `.spec.units`; this behavior can be changed with `.spec.strategy`. Each item in `.spec.units` describes a set of nodes selected by its `nodeSelector` and represents one kind of resource in the cluster.

`.spec.units[].priority` defines the priority of each unit. Units with a higher priority get a higher score in the score plugin.
If all units have the same priority, the ResourcePolicy only limits the maximum number of pods on these units.
If `.spec.units[].priority` is not set, it defaults to 0.
`.spec.units[].maxCount` defines the maximum number of pods that can be scheduled on each unit. If `.spec.units[].maxCount` is not set, pods can always be scheduled on the unit as long as it has enough resources.

`.spec.strategy` indicates how nodes that do not match any unit are treated.
If the strategy is `required`, the pod can only be scheduled on nodes that match a unit in the resource policy, and `unschedulable` is returned for nodes that match no unit.
If the strategy is `prefer`, the pod can be scheduled on all nodes, but nodes that match no unit are only considered after all nodes that match a unit.

`.spec.matchLabelKeys` indicates how the pods matched by `podSelector` and `matchPolicy` are grouped; it behaves like `.spec.matchLabelKeys` in `PodTopologySpread`.

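For illustration, a grouping key could be derived from these label keys roughly as follows; the key format here is an assumption, not behavior defined by this proposal:

```go
import (
	"sort"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// groupKey builds a grouping key for a pod from the label keys listed in
// .spec.matchLabelKeys, so that pods sharing those label values (for example
// the same pod-template-hash) are counted as one group.
func groupKey(pod *corev1.Pod, matchLabelKeys []string) string {
	keys := append([]string(nil), matchLabelKeys...)
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+pod.Labels[k])
	}
	return strings.Join(parts, ",")
}
```
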
### Implementation Details

#### PreFilter
PreFilter checks whether the pod is matched by exactly one resource policy; if not, PreFilter rejects the pod.
Otherwise, PreFilter counts the pods already placed on each unit to determine which units are still available for the pod
and writes this information into the CycleState.

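A rough sketch of the per-pod state PreFilter could compute and store; the `unitState` type and `computeUnitState` helper are illustrative assumptions rather than the plugin's actual implementation:

```go
// unitState is the snapshot PreFilter would write into the CycleState for use
// by Filter and Score.
type unitState struct {
	// assigned maps a unit name to the number of matched pods already on it.
	assigned map[string]int
	// available marks the units on which this pod may still be placed.
	available map[string]bool
}

// computeUnitState marks a unit unavailable once the number of pods already
// scheduled on it has reached its maxCount.
func computeUnitState(rp *ResourcePolicy, assigned map[string]int) *unitState {
	s := &unitState{assigned: assigned, available: map[string]bool{}}
	for _, u := range rp.Spec.Units {
		s.available[u.Name] = u.MaxCount == nil || int32(assigned[u.Name]) < *u.MaxCount
	}
	return s
}
```
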
#### Filter
Filter checks whether the node belongs to an available unit. If the node does not belong to any unit, Filter returns
success when `.spec.strategy` is `prefer`, and `unschedulable` otherwise.

In addition, Filter checks whether the pods already scheduled on the unit violate the quantity constraint.
Once the number of pods has reached `.spec.units[].maxCount`, all nodes in that unit are marked unschedulable.

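The Filter logic, sketched with hypothetical helpers and building on the `unitState` sketch above; again an illustration rather than the actual plugin code:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// unitFor returns the name of the first unit whose nodeSelector matches the
// node's labels, or "" if the node belongs to no unit.
func unitFor(rp *ResourcePolicy, node *corev1.Node) (string, error) {
	for _, u := range rp.Spec.Units {
		sel, err := metav1.LabelSelectorAsSelector(&u.NodeSelector)
		if err != nil {
			return "", err
		}
		if sel.Matches(labels.Set(node.Labels)) {
			return u.Name, nil
		}
	}
	return "", nil
}

// filterNode returns true when the pod may be scheduled on the node:
// nodes outside any unit pass only with the "prefer" strategy, and nodes in a
// unit pass only while the unit is still below its maxCount.
func filterNode(rp *ResourcePolicy, state *unitState, node *corev1.Node) (bool, error) {
	unit, err := unitFor(rp, node)
	if err != nil {
		return false, err
	}
	if unit == "" {
		return rp.Spec.Strategy == "prefer", nil
	}
	return state.available[unit], nil
}
```
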
#### Score
If `.spec.units[].priority` is set in the resource policy, pods are scored based on `.spec.units[].priority`. The default
priority is 0, which is also the minimum.

Score calculation details:

1. Calculate the priority score, `scorePriority = priority * 20`, so that nodes whose unit has no priority set still receive the minimum score.
2. Normalize the scores.

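A sketch of the raw score for a node before normalization, reusing the hypothetical `unitFor` helper from the Filter sketch:

```go
// scoreNode scales the priority of the node's unit by 20; nodes that belong to
// no unit, or whose unit has no priority set, receive the minimum score of 0.
func scoreNode(rp *ResourcePolicy, node *corev1.Node) (int64, error) {
	unitName, err := unitFor(rp, node)
	if err != nil || unitName == "" {
		return 0, err
	}
	for _, u := range rp.Spec.Units {
		if u.Name == unitName && u.Priority != nil {
			return int64(*u.Priority) * 20, nil
		}
	}
	return 0, nil
}
```

The normalization step would then map these raw scores onto the scheduler's score range.
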
#### Resource Policy Controller
The resource policy controller sets deletion costs on pods when the related resource policies are added or updated.

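Deletion costs are expressed through the standard `controller.kubernetes.io/pod-deletion-cost` annotation. A sketch of how the controller might assign them, under the assumption that pods on higher-priority units should be scaled down last; the priority-to-cost mapping here is illustrative:

```go
import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// PodDeletionCost is the annotation the ReplicaSet controller consults when
// choosing which pods to remove first during scale-in; lower values go first.
const PodDeletionCost = "controller.kubernetes.io/pod-deletion-cost"

// setDeletionCost gives pods on higher-priority units a higher deletion cost,
// so pods on lower-priority units (for example Spot nodes) are removed first.
// In the real controller this update would be applied as a patch via the API server.
func setDeletionCost(pod *corev1.Pod, unitPriority int32) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[PodDeletionCost] = strconv.Itoa(int(unitPriority))
}
```
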
## Known limitations

- Currently, deletion costs only take effect on Deployment workloads.

## Test plans

1. Add detailed unit and integration tests for the plugin and controller.
2. Add basic e2e tests to ensure all components work together.

## Graduation criteria

This plugin only takes effect when users enable it in the scheduler configuration and create a ResourcePolicy for their pods,
so it is safe to graduate to beta.

* Beta
  - [ ] Add node E2E tests.
  - [ ] Provide beta-level documentation.

## Feature enablement and rollback

To enable this plugin, add `resourcepolicy` to the `multiPoint` plugins in the scheduler configuration, for example:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
  - schedulerName: default-scheduler
    plugins:
      multiPoint:
        enabled:
          - name: resourcepolicy
```

```yaml
title: Resourcepolicy
kep-number: 594
authors:
  - "@KunWuLuan"
  - "@fjding"
```

In a multi-tenant cluster, given that ResourcePolicy is a namespace-scoped CR, nodes/node pools might be shared across tenants. Does this mean that if a tenant sets a high-priority policy with strategy `required`, other tenants who set no policy, or a low-priority one, get less chance of having their workloads scheduled?

I think the ResourcePolicy should only be set by the cluster administrator. It is meant to help reduce cost without changing the YAML of workloads. In a multi-tenant cluster, it can be used to limit each tenant's resource consumption on the different kinds of resources, and it should likewise not be set by the tenants themselves.