## Summary

This feature allows applications to give a hint to the ReplicaSet controller
as to which pods should be deleted first on scale down.

## Motivation

Currently, ReplicaSets are scaled down based on criteria that ultimately
prioritize deleting pods with a more recent creation/readiness timestamp. This
is not ideal for some applications where the cost of deleting pods is not
related to how recently they were created.

### Goals

- An API that allows applications to influence the order of deleting pods when scaling down a ReplicaSet

### Non-Goals

- Guarantees on pod deletion order
- A controller that sets the cost of deleting the pods

## Proposal

Define a known annotation, namely `controller.kubernetes.io/pod-deletion-cost`, that
applications can set to offer a hint on the cost of deleting a pod compared
to other pods belonging to the same ReplicaSet.

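To illustrate the intended usage, below is a minimal client-go sketch of an
application setting the annotation on one of its pods before triggering a scale
down. The namespace, pod name, and cost value are hypothetical, and annotation
values must be encoded as strings:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// setPodDeletionCost patches the pod-deletion-cost annotation on a pod.
// Annotation values are strings, so the integer cost is quoted in the patch.
func setPodDeletionCost(ctx context.Context, client kubernetes.Interface, namespace, name string, cost int32) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	_, err := client.CoreV1().Pods(namespace).Patch(
		ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// Hypothetical pod: mark "worker-0" as cheaper to delete than its siblings.
	if err := setPodDeletionCost(context.Background(), client, "default", "worker-0", -100); err != nil {
		panic(err)
	}
}
```

Patching only the annotation keeps the update cheap and avoids conflicting with
other writers of the Pod object.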
### User Stories (optional)
#### Story 1

The different pods of an application could have different utilization levels.
On scale down, the application may prefer to remove the pods with lower utilization.
To avoid frequently updating the pods, the application should update pod-deletion-cost
once before issuing a scale down. This works if the application itself controls the
downscaling (e.g., the driver pod of a Spark deployment).

#### Story 2

On scale down, the application may want to remove pods running on the most expensive
nodes first. For example, remove pods from nodes running on standard VMs first,
then from ones running on preemptible/spot VMs (which can be 80% cheaper than
standard VMs). A sketch of a controller implementing such a policy follows.

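As a sketch of how this story could be automated, a small controller could
assign a lower deletion cost to pods scheduled on preemptible nodes. The
`cloud.google.com/gke-preemptible` label is a GKE-specific assumption (other
providers expose different labels), and the code reuses the hypothetical
`setPodDeletionCost` helper and imports from the Proposal sketch above:

```go
// assignCostsBySpotStatus marks pods on preemptible/spot nodes as cheaper to
// delete, so the ReplicaSet controller prefers removing them on scale down.
func assignCostsBySpotStatus(ctx context.Context, client kubernetes.Interface, namespace, selector string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Spec.NodeName == "" {
			continue // not scheduled yet; the default cost of 0 applies
		}
		node, err := client.CoreV1().Nodes().Get(ctx, pod.Spec.NodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		cost := int32(0) // pods on standard VMs keep the default cost
		if node.Labels["cloud.google.com/gke-preemptible"] == "true" {
			cost = -100 // assumed provider label; prefer deleting pods on cheaper nodes
		}
		if err := setPodDeletionCost(ctx, client, namespace, pod.Name, cost); err != nil {
			return err
		}
	}
	return nil
}
```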
### Risks and Mitigations

- Users perceive the feature as a guarantee of deletion order. Documentation
  should stress the fact that this is best effort.

- Users deploy controllers that update the annotation frequently, causing
  significant load on the API server. Documentation should include best
  practices for how this feature should be used (e.g., update the
  pod-deletion-cost only before scale down). Moreover, [API priority and fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/)
  gives operators a server-side knob that allows them to limit the update
  QPS issued by such controllers; a client-side throttling sketch follows this list.

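To complement the server-side controls, a controller that updates
pod-deletion-cost can also throttle itself client-side. A minimal sketch,
assuming `golang.org/x/time/rate` for rate limiting and reusing the
hypothetical `setPodDeletionCost` helper from the Proposal sketch:

```go
package main

import (
	"context"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/kubernetes"
)

// applyCostsThrottled updates pod-deletion-cost for the given pods, but never
// issues patches faster than one per interval, keeping apiserver load bounded.
func applyCostsThrottled(ctx context.Context, client kubernetes.Interface, namespace string,
	desiredCosts map[string]int32, interval time.Duration) error {
	limiter := rate.NewLimiter(rate.Every(interval), 1)
	for pod, cost := range desiredCosts {
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if err := setPodDeletionCost(ctx, client, namespace, pod, cost); err != nil {
			return err
		}
	}
	return nil
}
```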
## Design Details

The pod-deletion-cost range will be [-MaxInt, MaxInt]. The default value is 0.
Invalid values (like setting the annotation to a string) will be rejected by the
api-server with a BadRequest status code.

Having the default value in the middle of the range allows controllers to customize
the semantics of the cost of deleting pods that don't have the annotation set:
controllers can use positive pod-deletion-cost values if they always want uninitialized
pods to be deleted first, or use negative pod-deletion-cost values if they want
uninitialized pods to always be deleted last.

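A minimal sketch of how this parsing and defaulting could look (the package and
helper names are illustrative, not the final apiserver implementation):

```go
package validation

import (
	"fmt"
	"strconv"
)

// PodDeletionCost is the annotation defined by this KEP.
const PodDeletionCost = "controller.kubernetes.io/pod-deletion-cost"

// getPodDeletionCost parses the annotation, defaulting to 0 when it is absent.
// A parse error is what the apiserver would surface as a BadRequest.
func getPodDeletionCost(annotations map[string]string) (int32, error) {
	v, ok := annotations[PodDeletionCost]
	if !ok {
		return 0, nil // default sits in the middle of [-MaxInt, MaxInt]
	}
	cost, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("%s must be a 32-bit integer, got %q: %v", PodDeletionCost, v, err)
	}
	return int32(cost), nil
}
```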
When scaling down a ReplicaSet, controller-manager will prioritize deleting
pods with lower pod-deletion-cost. Specifically, the pod-deletion-cost will be
evaluated after step 3 and before step 4 of the criteria the ReplicaSet
controller currently applies when ranking pods for deletion, which means the
following criteria are applied when comparing two pods regardless of their
pod-deletion-cost:

- if one is assigned a node and the other is not, then the unassigned pod is deleted first.
- if the two pods are in different phases, then the pod in pending/unknown status is deleted first.
- if the two pods have different readiness status, then the not-ready pod is deleted first.

If none of the pods set the pod-deletion-cost annotation or all of them have the
same value, then the scale-down behavior is unchanged. A sketch of the resulting
comparison follows.

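To make the ordering concrete, here is a simplified sketch of the resulting
comparison. The function names are illustrative, and the real controller applies
further recency-based tie-breakers after the cost comparison:

```go
package controller

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// deletionCost reads pod-deletion-cost, defaulting to 0 when absent or
// unparsable (the apiserver rejects invalid values on write).
func deletionCost(pod *corev1.Pod) int32 {
	v, ok := pod.Annotations["controller.kubernetes.io/pod-deletion-cost"]
	if !ok {
		return 0
	}
	cost, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return 0
	}
	return int32(cost)
}

// isReady reports whether the pod's Ready condition is true.
func isReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// deletedBefore reports whether p1 should be deleted before p2. The three
// criteria listed above take precedence; pod-deletion-cost only breaks ties
// among pods that are equally assigned, running, and ready.
func deletedBefore(p1, p2 *corev1.Pod) bool {
	// 1. Unassigned pods are deleted before assigned ones.
	if (p1.Spec.NodeName == "") != (p2.Spec.NodeName == "") {
		return p1.Spec.NodeName == ""
	}
	// 2. Pods in pending/unknown phase are deleted before running ones
	// (simplified: any non-running phase sorts first).
	if p1.Status.Phase != p2.Status.Phase {
		return p1.Status.Phase != corev1.PodRunning
	}
	// 3. Not-ready pods are deleted before ready ones.
	if isReady(p1) != isReady(p2) {
		return !isReady(p1)
	}
	// New step: lower pod-deletion-cost is deleted first.
	if c1, c2 := deletionCost(p1), deletionCost(p2); c1 != c2 {
		return c1 < c2
	}
	// The existing recency-based tie-breakers (step 4 onward) follow here.
	return false
}
```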
### Test Plan

- Unit tests in the kube-controller-manager package to test a variety of scenarios.
- Integration tests to validate that:
  - Replicas with lower pod-deletion-cost are deleted before replicas with higher pod-deletion-cost
  - No behavior change when pod-deletion-cost is not set or all pods have the same pod-deletion-cost

### Graduation Criteria
#### Alpha -> Beta Graduation
* Implemented feedback from alpha testers
#### Beta -> GA Graduation
* We're confident that no further API changes will be needed to achieve the goals of the KEP
* All known functional bugs have been fixed
### Upgrade / Downgrade Strategy

There is no strategy per se. On upgrade, controller-manager will start taking into
account the pod-deletion-cost annotation for new and existing ReplicaSets that set
the annotation. On downgrade, controller-manager will stop taking pod-deletion-cost
into account, reverting to the old behavior.

### Version Skew Strategy

N/A

## Production Readiness Review Questionnaire
### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [x] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: ReplicaSetPodDeletionCost
    - Components depending on the feature gate: kube-controller-manager
  - [ ] Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control plane?
    - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).


* **Does enabling the feature change any default behavior?**

  No.


* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?**

  Yes.


* **What happens if we reenable the feature if it was previously rolled back?**

  It should continue to work as expected.


* **Are there any tests for feature enablement/disablement?**

  We will add unit tests.

### Rollout, Upgrade and Rollback Planning
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**

  It shouldn't impact already running workloads. This is an opt-in feature
  since users need to explicitly set the annotation.


* **What specific metrics should inform a rollback?**

  None.


* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**

  We will do manual testing.


* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?**

  No.

### Monitoring requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**

  Search for pods with the `controller.kubernetes.io/pod-deletion-cost` annotation set.

* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**

  N/A

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

  N/A

* **Are there any missing metrics that would be useful to have to improve observability of this feature?**

  No.

### Dependencies
_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**

  No.

### Scalability

* **Will enabling / using this feature result in any new API calls?**

  No, not from the feature itself. However, users will want to deploy an external
  controller that updates the pod-deletion-cost; documentation should stress that
  the update frequency should be coarse-grained.


* **Will enabling / using this feature result in introducing new API types?**

* **Will enabling / using this feature result in increasing size or count of the existing API objects?**

  No.

* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?**

  There are no SLOs covering scale down, but this feature should have negligible
  impact on scale-down latency since we are only adding an additional sorting key.

* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**

  No.

### Troubleshooting

_This section must be completed when targeting beta graduation to a release._

## Implementation History

- 2021-01-13: Initial KEP submitted as provisional
- 2021-01-15: KEP promoted to implementable

## Alternatives

One alternative to using an annotation is adding an explicit API field. If the
feature gets enough traction, we may consider promoting the annotation to a
Status field.