Commit 2dd992c (parent e900e43)

Promote pod deletion cost KEP to implementable

3 files changed: +110 −76 lines

3 files changed

+110
-76
lines changed
Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+kep-number: 2255
+alpha:
+  approver: "@wojtek-t"

keps/sig-apps/2255-pod-cost/README.md

Lines changed: 97 additions & 67 deletions

@@ -1,4 +1,4 @@
-# KEP-2255: Add pod-cost annotation for ReplicaSet
+# KEP-2255: ReplicaSet Pod Deletion Cost
 
 
 <!-- toc -->
@@ -27,7 +27,6 @@
 - [Scalability](#scalability)
 - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
-- [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
 <!-- /toc -->
 
@@ -55,122 +54,164 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 
 ## Summary
 
-This feature allows making a suggestion to the ReplicaSet controller, which pod of a Deployment should be deleted first when a scale-down event happens. This can prevent session disruption in stateful applications in a trivial manner.
+This feature allows applications to give a hint to the ReplicaSet controller
+as to which pods should be deleted first on scale down.
 
 ## Motivation
 
-For some applications, it is necessary that the application can tell Kubernetes which pod can be deleted and which replica has to be protected. The reason for this is that some applications have stateful sessions, and it is not possible to run such an application on Kubernetes because of the session termination that results from a "random" down-scale. If the application is able to tell Kubernetes which of the replicas contain no, few, or less important active sessions, this would solve many problems. This feature is non-disruptive to the default behaviour; only if the annotation is present will it make a difference in deletion order.
+Currently, ReplicaSets are scaled down according to criteria that, all else
+being equal, prioritize deleting pods with a more recent creation or
+readiness timestamp. This is not ideal for applications where the cost of
+deleting a pod is unrelated to how recently it was created.
 
 ### Goals
 
-To recommend which pod of a ReplicaSet gets deleted next. This should help avoid major reworks in existing application architectures:
-* [45509](https://github.com/kubernetes/kubernetes/issues/45509) - Scale down a deployment by removing specific pods
+- An API that allows applications to influence the order of deleting
+  pods when scaling down a ReplicaSet
 
 
 ### Non-Goals
 
-Guaranteed (in contrast to the recommendation stated in Goals) deletion of a selected replica.
+- Guarantees on pod deletion order
+- A controller that sets the cost of deleting the pods
 
 ## Proposal
 
-The application can set the `controller.kubernetes.io/pod-cost` annotation on a pod through the Kubernetes API. When a downscale event happens, the pod with the lower value of the previously set annotation will be deleted first. If one pod of the Deployment has no annotation set, it will be treated as having the lowest priority.
-
-If all pods have the same priority, there is no difference from the normal pod-deletion decision behaviour. The same applies if the pod-cost annotation is not used at all.
-
-The pod-cost annotation can be changed during operation, for example, if the workload changes or a new master gets elected.
+Define a well-known annotation, `controller.kubernetes.io/pod-deletion-cost`,
+that applications can set as a hint about the cost of deleting a pod relative
+to other pods belonging to the same ReplicaSet.
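For illustration only (not part of the KEP text itself): a pod protected by the proposed annotation could look like the following sketch. The pod name and image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-1                  # hypothetical pod name
  annotations:
    # Higher cost = deleted later on scale down; pods without the
    # annotation default to cost 0.
    controller.kubernetes.io/pod-deletion-cost: "1000"
spec:
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
```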

8284
### User Stories (optional)
8385

84-
8586
#### Story 1
8687

87-
In an application environment with stateful worker (user-)sessions, it is essential to keep the user sessions alive as good as possible. In case of a scale-down event, the application has to tell the scheduler, which delete decision would have the lowest impact on existing sessions.
88+
The different pods of an application could have different utilization levels.
89+
On scale down, the application may prefer to remove the pods with lower utilization.
90+
To avoid frequently updating the pods, the application should update pod-deletion-cost
91+
once before issuing a scale down. This works if the application itself controls the down
92+
scaling (e.g., the driver pod of a Spark deployment).
8893

8994
#### Story 2
9095

91-
An application consists of identical server processes, but one of the replicas will be the master, which should be kept as long as possible. All other replicas can be treated as cattle workload. Then the master can set the priority annotation with a high priority value as soon as it has finished its startup process. The other replicas can remain either without any priority set, or e.g. with all the same, lower priority. This ensures, that the master replica of this deployment will be protected in a downscale situation.
96+
On scale down, the application may want to remove pods running on the most expensive
97+
nodes first. For example, remove pods from nodes running on standard VMs first
98+
then from ones running on preemptible/spot VMs (which can be 80% cheaper than standard VMs).
9299

93100

94101
### Risks and Mitigations
95102

96-
On previous Kubernetes ReplicaSet controller versions that don't implement the pod-cost annotation feature, the same application might make false assumptions about the protection of a master instance or workers with open (user-)sessions on it. As the pod-cost annotation would be only a suggestion to the ReplicaSet controller, the application developer should, however, handle the case of a failed master instance or broken user sessions. The feature is just an improvement, not a guarantee, as there might happen timing issues between setting the annotation and the next controller scale-down event.
103+
- Users perceive the feature as a guarantee to delete order. Documentation
104+
should stress the fact that this is best effort.
105+
106+
- Users deploy controllers that update the annotation frequently causing a
107+
significant load on the api server. Documentation should include best
108+
practices as to how this feature should be used (e.g., update the
109+
pod-deletion-cost only before scale down). Moreover, [API priority and fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/)
110+
gives operators a new server-side knob that allows them to limit update
111+
qps issued by such controllers.
112+
97113

 ## Design Details
 
+The pod-deletion-cost range will be [-MaxInt, MaxInt]. The default value is 0.
+Invalid values (such as setting the annotation to a non-numeric string) will be
+rejected by the API server with a BadRequest status code.
+
+Having the default value in the middle of the range allows controllers to customize
+the semantics of the deletion cost for pods that don't have the annotation set:
+controllers can use positive pod-deletion-cost values if they always want uninitialized
+pods to be deleted first, or negative values if they want uninitialized pods to
+always be deleted last.
+
+When scaling down a ReplicaSet, controller-manager will prioritize deleting
+pods with lower pod-deletion-cost. Specifically, pod-deletion-cost will be evaluated
+after step 3 and before step 4 as they are currently defined in
+[ActivePodsWithRanks](https://github.com/kubernetes/kubernetes/blob/cac933934b1301665e6e51a81c66c483f4e16c49/pkg/controller/controller_utils.go#L784-L809),
+which means the following criteria are applied when comparing two pods, regardless of their pod-deletion-cost:
+- if one pod is assigned to a node and the other is not, the unassigned pod is deleted first;
+- if the two pods are in different phases, the pod in pending/unknown status is deleted first;
+- if the two pods have different readiness statuses, the not-ready pod is deleted first.
+
+
+If none of the pods set the pod-deletion-cost annotation, or all of them have the same value,
+the scale-down behavior is unchanged.
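The ranking described above can be sketched in Go. This is a simplified illustration, not the actual kube-controller-manager code: `Pod`, `deletionCost`, and `lessForDeletion` are hypothetical stand-ins, and the phase criterion is folded into the readiness check for brevity.

```go
package main

import (
	"fmt"
	"strconv"
)

// Pod is a hypothetical stand-in for the real Kubernetes API type.
type Pod struct {
	Name        string
	Assigned    bool // scheduled to a node
	Ready       bool
	Annotations map[string]string
}

const podDeletionCost = "controller.kubernetes.io/pod-deletion-cost"

// deletionCost reads the annotation, defaulting to 0 when absent.
// (In the real design, non-numeric values never reach the controller:
// the API server rejects them with a BadRequest.)
func deletionCost(p Pod) int {
	v, ok := p.Annotations[podDeletionCost]
	if !ok {
		return 0
	}
	c, err := strconv.Atoi(v)
	if err != nil {
		return 0
	}
	return c
}

// lessForDeletion reports whether pod a should be deleted before pod b.
// The earlier criteria are unchanged; pod-deletion-cost is only consulted
// when both pods tie on assignment and readiness.
func lessForDeletion(a, b Pod) bool {
	if a.Assigned != b.Assigned {
		return !a.Assigned // unassigned pods are deleted first
	}
	if a.Ready != b.Ready {
		return !a.Ready // not-ready pods are deleted first
	}
	return deletionCost(a) < deletionCost(b) // lower cost goes first
}

func main() {
	cheap := Pod{Name: "cheap", Assigned: true, Ready: true,
		Annotations: map[string]string{podDeletionCost: "-100"}}
	pricey := Pod{Name: "pricey", Assigned: true, Ready: true,
		Annotations: map[string]string{podDeletionCost: "500"}}
	fmt.Println(lessForDeletion(cheap, pricey)) // true: cheap is deleted first
}
```

Note how the default of 0 sits in the middle of the range: a pod annotated with a negative cost ranks for deletion even before unannotated pods, matching the "uninitialized pods deleted last" pattern described above.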

 ### Test Plan
 
-* Unit tests in the kube-controller-manager package to test a variety of scenarios.
-* New E2E tests to validate that replicas get deleted as expected, e.g.:
-  * Replicas with lower pod-cost before replicas with higher pod-cost
-  * Replicas with no pod-cost annotation set before replicas with a low value
+- Unit tests in the kube-controller-manager package to test a variety of scenarios.
+- Integration tests to validate that:
+  - Replicas with lower pod-deletion-cost are deleted before replicas with higher pod-deletion-cost
+  - There is no behavior change when pod-deletion-cost is not set or all pods have the same pod-deletion-cost
 
 ### Graduation Criteria
 
 #### Alpha -> Beta Graduation
 * Implemented feedback from alpha testers
-* Thorough E2E and unit testing in place
 
 #### Beta -> GA Graduation
-* Significant number of end-users are using the feature
 * We're confident that no further API changes will be needed to achieve the goals of the KEP
 * All known functional bugs have been fixed

 ### Upgrade / Downgrade Strategy
 
-When upgrading, no changes are needed to maintain existing behaviour, as all of this behaviour is fully optional and disabled by default. To activate this feature, either a user adds an annotation to a pod in a Deployment by hand, or the application annotates a pod in a Deployment through the API.
-
-When downgrading, there is no need to change anything, as this is just a pod annotation, which is uncritical.
+There is no strategy per se. On upgrade, controller-manager will start taking
+the pod-deletion-cost annotation into account for new and existing ReplicaSets
+that set it. On downgrade, controller-manager will stop taking pod-deletion-cost
+into account, reverting to the old behavior.
 
 ### Version Skew Strategy
 
-As this feature is based on pod annotations, there is no issue with different Kubernetes versions. The lack of this feature in older versions may change the efficiency and reliability of the applications.
+N/A

 ## Production Readiness Review Questionnaire
 
 ### Feature enablement and rollback
 
 * **How can this feature be enabled / disabled in a live cluster?**
-  - [x] Other
-    - Make special pod annotations within a live Deployment
+  - [x] Feature gate (also fill in values in `kep.yaml`)
+    - Feature gate name: ReplicaSetPodDeletionCost
+    - Components depending on the feature gate: kube-controller-manager
+  - [ ] Other
+    - Describe the mechanism:
+    - Will enabling / disabling the feature require downtime of the control
+      plane?
+    - Will enabling / disabling the feature require downtime or reprovisioning
+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled.)
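As a sketch of how this gate would be enabled: the name above plugs into the standard Kubernetes feature-gate flag on the controller manager. How the flag actually reaches the component depends on how the control plane is deployed, so treat this as an illustrative fragment, not a complete invocation.

```
kube-controller-manager --feature-gates=ReplicaSetPodDeletionCost=true
```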
 
 
 * **Does enabling the feature change any default behavior?**
-  - No
+  No.
 
 
 * **Can the feature be disabled once it has been enabled (i.e. can we rollback
   the enablement)?**
-  - One can either remove the annotations or downgrade to an older Kubernetes release
+  Yes.
 
 
 * **What happens if we reenable the feature if it was previously rolled back?**
-  - Then the feature will be reenabled. Nothing special to consider here.
+  It should continue to work as expected.
 
 
 * **Are there any tests for feature enablement/disablement?**
-
+  We will add unit tests.

 ### Rollout, Upgrade and Rollback Planning
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **How can a rollout fail? Can it impact already running workloads?**
-  - As the feature is a simple annotation, the worst that could happen is that the annotation is lost or ignored. In the worst case, a pod with a higher value gets deleted before a pod with a lower value.
-
+  It shouldn't impact already running workloads. This is an opt-in feature,
+  since users need to explicitly set the annotation.
 
 * **What specific metrics should inform a rollback?**
-  - None
-
+  None.
 
 * **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**
-  - Was tested. Behaviour changed in both directions, as expected.
-
+  We will do manual testing.
 
 * **Is the rollout accompanied by any deprecations and/or removals of features,
   APIs, fields of API types, flags, etc.?**
-  - No. However, the exact same pod annotation string cannot be used for any other purposes.
-
+  No.
 
 ### Monitoring requirements
 

@@ -179,39 +220,30 @@ _This section must be completed when targeting beta graduation to a release._
 * **How can an operator determine if the feature is in use by workloads?**
   - Search for pod annotations with the exact same pod-cost annotation string.
 
-
 * **What are the SLIs (Service Level Indicators) an operator can use to
   determine the health of the service?**
-  - A pod with a lower pod-cost annotation in a Deployment gets deleted first on a scale-down event.
-
+  N/A
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  - All pods with a lower pod-cost annotation in a Deployment are deleted first on a scale-down event.
+  N/A
 
 * **Are there any missing metrics that would be useful to have to improve
   observability of this feature?**
-  - N/A
+  No.

 ### Dependencies
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **Does this feature depend on any specific services running in the cluster?**
-  - The feature requires the existence of the kube-controller-manager and the ability and permissions to set pod annotations.
-
+  No.
 
 ### Scalability
 
-_For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them._
-
-_For beta, this section is required: reviewers must answer these questions._
-
-_For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field._
-
 * **Will enabling / using this feature result in any new API calls?**
-  - Whenever the application decides that a change in pod-cost is needed for a replica, it will send an API request and set the appropriate pod annotation(s).
+  - Not from the feature itself. However, users may deploy an external controller
+    that updates pod-deletion-cost; documentation should stress that such updates
+    be coarse grained (infrequent).

 
 
 * **Will enabling / using this feature result in introducing new API types?**
@@ -225,21 +257,17 @@ previous answers based on experience in the field._
 
 * **Will enabling / using this feature result in increasing size or count
   of the existing API objects?**
-  Describe them providing:
-  - API type(s): Pod annotation
-  - Estimated increase in size: size of a new annotation
-  - Estimated amount of new objects: a new annotation for potentially every existing Pod
-
+  - No.
 
 * **Will enabling / using this feature result in increasing time taken by any
   operations covered by [existing SLIs/SLOs][]?**
-  - The time it takes to set/delete/change a pod annotation
+  - There are no SLOs covering scale down, but this feature should have negligible
+    impact on scale-down latency since we are only adding an additional sorting key.
 
 
 * **Will enabling / using this feature result in non-negligible increase of
   resource usage (CPU, RAM, disk, IO, ...) in any components?**
-  - The resources it takes to set/delete/change a pod annotation
-
+  - No.
244272
### Troubleshooting
245273

@@ -257,11 +285,13 @@ _This section must be completed when targeting beta graduation to a release._
257285
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
258286

259287
## Implementation History
288+
- 2021-01-13: Initial KEP submitted as provisional
289+
- 2021-01-15: KEP promoted to implementable
260290

261291

262-
## Drawbacks
292+
## Alternatives
263293

294+
One alternative to using an annotation is adding an explicit API field. If the feature gets
295+
enough traction, we may consider promoting the annotation to a Status field.
264296

265-
## Alternatives
266297

267-
Similar behaviour can be achieved through the Operator Framework which however will take a lot more configuration and setup work and is not a built-in Kubernetes feature.

keps/sig-apps/2255-pod-cost/kep.yaml

Lines changed: 10 additions & 9 deletions

@@ -1,23 +1,25 @@
-title: Add pod-cost annotation for ReplicaSet
+title: ReplicaSet Pod Deletion Cost
 kep-number: 2255
 authors:
   - "@drbugfinder-work"
   - "@ahg-g"
   - "@alculquicondor"
 owning-sig: sig-apps
 participating-sigs:
-status: provisional
+status: implementable
 creation-date: 2021-01-12
 reviewers:
   - "@ahg-g"
   - "@janetkuo"
   - "@alculquicondor"
 approvers:
   - "@janetkuo"
+prr-approvers:
+  - "@wojtek-t"
 see-also:
   - https://github.com/kubernetes/kubernetes/issues/45509
-  - https://github.com/kubernetes/enhancements/issues/2255
-replaces:
+  - https://github.com/kubernetes/kubernetes/issues/4301
+
 
 # The target maturity stage in the current dev cycle for this KEP.
 stage: alpha
@@ -35,11 +37,10 @@ milestone:
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
-#feature-gates:
-#  - name: MyFeature
-#    components:
-#      - kube-apiserver
-#      - kube-controller-manager
+feature-gates:
+  - name: ReplicaSetPodDeletionCost
+    components:
+      - kube-controller-manager
 disable-supported: true
 
 # The following PRR answers are required at beta release
