|
| 1 | +# KEP-2255: Add pod-cost annotation for ReplicaSet |
| 2 | + |
| 3 | + |
| 4 | +<!-- toc --> |
| 5 | +- [Release Signoff Checklist](#release-signoff-checklist) |
| 6 | +- [Summary](#summary) |
| 7 | +- [Motivation](#motivation) |
| 8 | + - [Goals](#goals) |
| 9 | + - [Non-Goals](#non-goals) |
| 10 | +- [Proposal](#proposal) |
| 11 | + - [User Stories (optional)](#user-stories-optional) |
| 12 | + - [Story 1](#story-1) |
| 13 | + - [Story 2](#story-2) |
| 14 | + - [Risks and Mitigations](#risks-and-mitigations) |
| 15 | +- [Design Details](#design-details) |
| 16 | + - [Test Plan](#test-plan) |
| 17 | + - [Graduation Criteria](#graduation-criteria) |
| 18 | + - [Alpha -> Beta Graduation](#alpha---beta-graduation) |
| 19 | + - [Beta -> GA Graduation](#beta---ga-graduation) |
| 20 | + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) |
| 21 | + - [Version Skew Strategy](#version-skew-strategy) |
| 22 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 23 | + - [Feature enablement and rollback](#feature-enablement-and-rollback) |
| 24 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 25 | + - [Monitoring requirements](#monitoring-requirements) |
| 26 | + - [Dependencies](#dependencies) |
| 27 | + - [Scalability](#scalability) |
| 28 | + - [Troubleshooting](#troubleshooting) |
| 29 | +- [Implementation History](#implementation-history) |
| 30 | +- [Drawbacks](#drawbacks) |
| 31 | +- [Alternatives](#alternatives) |
| 32 | +<!-- /toc --> |
| 33 | + |
| 34 | +## Release Signoff Checklist |
| 35 | + |
| 36 | + |
| 37 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 38 | + |
| 39 | +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
| 40 | +- [ ] (R) KEP approvers have approved the KEP status as `implementable` |
| 41 | +- [ ] (R) Design details are appropriately documented |
| 42 | +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
| 43 | +- [ ] (R) Graduation criteria is in place |
| 44 | +- [ ] (R) Production readiness review completed |
| 45 | +- [ ] Production readiness review approved |
| 46 | +- [ ] "Implementation History" section is up-to-date for milestone |
| 47 | +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 48 | +- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 49 | + |
| 50 | + |
| 51 | +[kubernetes.io]: https://kubernetes.io/ |
| 52 | +[kubernetes/enhancements]: https://git.k8s.io/enhancements |
| 53 | +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes |
| 54 | +[kubernetes/website]: https://git.k8s.io/website |
| 55 | + |
| 56 | +## Summary |
| 57 | + |
| 58 | +This feature lets an application suggest to the ReplicaSet controller which pod of a Deployment should be deleted first when a scale-down event happens. This offers a simple way to reduce session disruption in stateful applications. |
| 59 | + |
| 60 | +## Motivation |
| 61 | + |
| 62 | +Some applications need to tell Kubernetes which pod can be deleted and which replica has to be protected. These applications hold stateful sessions, and they cannot be run on Kubernetes today because a "random" scale-down terminates sessions. If the application can tell Kubernetes which replicas hold no, few, or less important active sessions, this problem is largely solved. The feature does not disrupt the default behaviour: deletion order changes only if the annotation is present. |
| 63 | + |
| 64 | +### Goals |
| 65 | + |
| 66 | +Recommend which pod of a ReplicaSet gets deleted next. This should help avoid major rework of existing application architectures: |
| 67 | +* [45509](https://github.com/kubernetes/kubernetes/issues/45509) - Scale down a deployment by removing specific pods |
| 68 | + |
| 69 | + |
| 70 | +### Non-Goals |
| 71 | + |
| 72 | +Guaranteed deletion of a selected replica (in contrast to the recommendation stated in Goals). |
| 73 | + |
| 74 | +## Proposal |
| 75 | + |
| 76 | +The application can set the `controller.kubernetes.io/pod-cost` annotation on a pod through the Kubernetes API. The annotation value expresses the pod's priority. When a scale-down event happens, the pod with the lower annotation value is deleted first. A pod of the Deployment that has no pod-cost annotation is treated as having the lowest priority. |
| 77 | + |
| 78 | +If all pods have the same priority, the normal pod deletion order applies unchanged. The same holds if the pod-cost annotation is not used at all. |
| 79 | + |
| 80 | +The pod-cost annotation can be changed during operation, for example, if the workload changes or a new master gets elected. |
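
The ordering rule described above can be sketched in Go. This is only an illustration, not the controller's implementation: the `Pod` struct and the `podCost` and `DeletionOrder` helpers are hypothetical, and the sketch assumes the annotation value is a base-10 integer, which this proposal does not specify. Treating an unparsable value like a missing annotation is likewise an assumption.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strconv"
)

// PodCostAnnotation is the annotation key named in this proposal.
const PodCostAnnotation = "controller.kubernetes.io/pod-cost"

// Pod is a simplified, hypothetical stand-in for a Kubernetes Pod object.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// podCost returns the cost of a pod.
// Assumption: the value is a base-10 integer. A missing or unparsable
// annotation is mapped to the smallest possible cost (lowest priority).
func podCost(p Pod) int64 {
	raw, ok := p.Annotations[PodCostAnnotation]
	if !ok {
		return math.MinInt64
	}
	v, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return math.MinInt64
	}
	return v
}

// DeletionOrder returns a copy of pods sorted so that the first element is
// the preferred deletion candidate. The sort is stable, so pods with equal
// cost keep their incoming order, i.e. the default decision is unchanged.
func DeletionOrder(pods []Pod) []Pod {
	out := make([]Pod, len(pods))
	copy(out, pods)
	sort.SliceStable(out, func(i, j int) bool {
		return podCost(out[i]) < podCost(out[j])
	})
	return out
}

func main() {
	pods := []Pod{
		{Name: "master", Annotations: map[string]string{PodCostAnnotation: "100"}},
		{Name: "worker-a", Annotations: map[string]string{PodCostAnnotation: "10"}},
		{Name: "worker-b"}, // no annotation: deleted first
	}
	for _, p := range DeletionOrder(pods) {
		fmt.Println(p.Name)
	}
}
```

The stable sort matters here: when every pod has the same cost, or none is annotated, the input order survives untouched, which matches the statement that the default behaviour is not changed.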
| 81 | + |
| 82 | +### User Stories (optional) |
| 83 | + |
| 84 | + |
| 85 | +#### Story 1 |
| 86 | + |
| 87 | +In an application environment with stateful worker (user) sessions, it is essential to keep the user sessions alive as well as possible. In case of a scale-down event, the application has to tell the ReplicaSet controller which deletion decision would have the lowest impact on existing sessions. |
| 88 | + |
| 89 | +#### Story 2 |
| 90 | + |
| 91 | +An application consists of identical server processes, but one of the replicas is the master, which should be kept as long as possible. All other replicas can be treated as cattle workload. The master can set the pod-cost annotation to a high value as soon as it has finished its startup process. The other replicas can either remain without any annotation or all carry the same, lower value. This helps protect the master replica of the Deployment in a scale-down situation. |
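
As a rough sketch of what the master in this story might send, the following Go snippet builds a JSON merge patch body that sets the annotation. The helper name `PodCostPatch` and the value `100` are hypothetical, and the integer encoding of the value is an assumption. Annotation values are strings in the Kubernetes API, so the number is serialized as a string. Sending the patch (for example as `application/merge-patch+json` against the pod resource) is left out to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// PodCostAnnotation is the annotation key named in this proposal.
const PodCostAnnotation = "controller.kubernetes.io/pod-cost"

// PodCostPatch builds a JSON merge patch body that sets the pod-cost
// annotation on a pod. Hypothetical helper; the integer value format is
// an assumption of this sketch.
func PodCostPatch(cost int) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				// Annotation values must be strings.
				PodCostAnnotation: strconv.Itoa(cost),
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := PodCostPatch(100)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// prints {"metadata":{"annotations":{"controller.kubernetes.io/pod-cost":"100"}}}
}
```

A merge patch touches only the named annotation, so other annotations on the pod stay as they are.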
| 92 | + |
| 93 | + |
| 94 | +### Risks and Mitigations |
| 95 | + |
| 96 | +On older versions of the ReplicaSet controller that do not implement the pod-cost annotation, an application might wrongly assume that a master instance or workers with open (user) sessions are protected. Because the pod-cost annotation is only a suggestion to the ReplicaSet controller, the application developer should still handle the case of a failed master instance or broken user sessions. The feature is an improvement, not a guarantee, as timing issues can occur between setting the annotation and the next scale-down event. |
| 97 | + |
| 98 | +## Design Details |
| 99 | + |
| 100 | + |
| 101 | +### Test Plan |
| 102 | + |
| 103 | +* Unit tests in the kube-controller-manager package to cover a variety of scenarios. |
| 104 | +* New E2E tests to validate that replicas get deleted as expected, e.g.: |
| 105 | + * Replicas with lower pod-cost before replicas with higher pod-cost |
| 106 | + * Replicas with no pod-cost annotation set before replicas with low priority |
| 107 | + |
| 108 | +### Graduation Criteria |
| 109 | + |
| 110 | +#### Alpha -> Beta Graduation |
| 111 | +* Implemented feedback from alpha testers |
| 112 | +* Thorough E2E and unit testing in place |
| 113 | + |
| 114 | +#### Beta -> GA Graduation |
| 115 | +* A significant number of end users are using the feature |
| 116 | +* We're confident that no further API changes will be needed to achieve the goals of the KEP |
| 117 | +* All known functional bugs have been fixed |
| 118 | + |
| 119 | +### Upgrade / Downgrade Strategy |
| 120 | + |
| 121 | +When upgrading, no changes are needed to maintain existing behaviour, as the feature is fully optional and has no effect by default. To activate it, either a user annotates a pod in a Deployment by hand or the application annotates a pod in a Deployment through the API. |
| 122 | + |
| 123 | +When downgrading, nothing needs to be changed, as the feature is just a pod annotation, which is harmless. |
| 124 | + |
| 125 | +### Version Skew Strategy |
| 126 | + |
| 127 | +As this feature is based on pod annotations, mixing different Kubernetes versions causes no issues. The lack of this feature in older versions may, however, affect the efficiency and reliability of applications that rely on it. |
| 128 | + |
| 129 | +## Production Readiness Review Questionnaire |
| 130 | + |
| 131 | +### Feature enablement and rollback |
| 132 | + |
| 133 | +* **How can this feature be enabled / disabled in a live cluster?** |
| 134 | + - [x] Other |
| 135 | + - Set the pod-cost annotation on pods of a live Deployment |
| 136 | + |
| 137 | + |
| 138 | +* **Does enabling the feature change any default behavior?** |
| 139 | + - No |
| 140 | + |
| 141 | + |
| 142 | +* **Can the feature be disabled once it has been enabled (i.e. can we rollback |
| 143 | + the enablement)?** |
| 144 | + - One can either remove the annotations or downgrade to an older Kubernetes release |
| 145 | + |
| 146 | + |
| 147 | +* **What happens if we reenable the feature if it was previously rolled back?** |
| 148 | + - Then the feature will be reenabled. Nothing special to consider here. |
| 149 | + |
| 150 | + |
| 151 | +* **Are there any tests for feature enablement/disablement?** |
| 152 | + |
| 153 | + |
| 154 | +### Rollout, Upgrade and Rollback Planning |
| 155 | + |
| 156 | +_This section must be completed when targeting beta graduation to a release._ |
| 157 | + |
| 158 | +* **How can a rollout fail? Can it impact already running workloads?** |
| 159 | + - As the feature is a simple annotation, the worst that can happen is that the annotation is lost or ignored. In that case, a pod with a higher priority gets deleted before a pod with a lower priority. |
| 160 | + |
| 161 | + |
| 162 | +* **What specific metrics should inform a rollback?** |
| 163 | + - None |
| 164 | + |
| 165 | + |
| 166 | +* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** |
| 167 | + - Yes, this was tested. Behaviour changed in both directions, as expected. |
| 168 | + |
| 169 | + |
| 170 | +* **Is the rollout accompanied by any deprecations and/or removals of features, |
| 171 | + APIs, fields of API types, flags, etc.?** |
| 172 | + - No. However, the exact same pod annotation string cannot be used for any other purposes. |
| 173 | + |
| 174 | + |
| 175 | +### Monitoring requirements |
| 176 | + |
| 177 | +_This section must be completed when targeting beta graduation to a release._ |
| 178 | + |
| 179 | +* **How can an operator determine if the feature is in use by workloads?** |
| 180 | + - Search for pods that carry the exact pod-cost annotation key. |
| 181 | + |
| 182 | + |
| 183 | +* **What are the SLIs (Service Level Indicators) an operator can use to |
| 184 | + determine the health of the service?** |
| 185 | + - A pod with a lower pod-cost annotation in a Deployment gets deleted first on a scale-down event. |
| 186 | + |
| 187 | + |
| 188 | +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** |
| 189 | + - All pods with a lower pod-cost annotation in a Deployment are deleted first on a scale-down event. |
| 190 | + |
| 191 | +* **Are there any missing metrics that would be useful to have to improve |
| 192 | + observability of this feature?** |
| 193 | + - N/A |
| 194 | + |
| 195 | +### Dependencies |
| 196 | + |
| 197 | +_This section must be completed when targeting beta graduation to a release._ |
| 198 | + |
| 199 | +* **Does this feature depend on any specific services running in the cluster?** |
| 200 | + - The feature requires a running kube-controller-manager and the ability and permissions to set pod annotations. |
| 201 | + |
| 202 | + |
| 203 | +### Scalability |
| 204 | + |
| 205 | +_For alpha, this section is encouraged: reviewers should consider these questions |
| 206 | +and attempt to answer them._ |
| 207 | + |
| 208 | +_For beta, this section is required: reviewers must answer these questions._ |
| 209 | + |
| 210 | +_For GA, this section is required: approvers should be able to confirm the |
| 211 | +previous answers based on experience in the field._ |
| 212 | + |
| 213 | +* **Will enabling / using this feature result in any new API calls?** |
| 214 | + - Whenever the application decides that a replica's pod-cost needs to change, it sends an API request to set the appropriate pod annotation(s). |
| 215 | + |
| 216 | + |
| 217 | +* **Will enabling / using this feature result in introducing new API types?** |
| 218 | + - No. |
| 219 | + |
| 220 | + |
| 221 | +* **Will enabling / using this feature result in any new calls to cloud |
| 222 | + provider?** |
| 223 | + - No. |
| 224 | + |
| 225 | + |
| 226 | +* **Will enabling / using this feature result in increasing size or count |
| 227 | + of the existing API objects?** |
| 228 | + Describe them providing: |
| 229 | + - API type(s): Pod annotation |
| 230 | + - Estimated increase in size: Size of a new annotation |
| 231 | + - Estimated amount of new objects: new annotation for potentially every existing Pod |
| 232 | + |
| 233 | + |
| 234 | +* **Will enabling / using this feature result in increasing time taken by any |
| 235 | + operations covered by [existing SLIs/SLOs][]?** |
| 236 | + - The time it takes to set/delete/change a pod annotation |
| 237 | + |
| 238 | + |
| 239 | +* **Will enabling / using this feature result in non-negligible increase of |
| 240 | + resource usage (CPU, RAM, disk, IO, ...) in any components?** |
| 241 | + - The resources it takes to set/delete/change a pod annotation |
| 242 | + |
| 243 | + |
| 244 | +### Troubleshooting |
| 245 | + |
| 246 | +_This section must be completed when targeting beta graduation to a release._ |
| 247 | + |
| 248 | +* **How does this feature react if the API server and/or etcd is unavailable?** |
| 249 | + - The pod annotation can't be set. The normal pod deletion behavior will be used for non-annotated pods in a Deployment. |
| 250 | +* **What are other known failure modes?** |
| 251 | + - None. |
| 252 | + |
| 253 | +* **What steps should be taken if SLOs are not being met to determine the problem?** |
| 254 | + - N/A |
| 255 | + |
| 256 | +[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md |
| 257 | +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos |
| 258 | + |
| 259 | +## Implementation History |
| 260 | + |
| 261 | + |
| 262 | +## Drawbacks |
| 263 | + |
| 264 | + |
| 265 | +## Alternatives |
| 266 | + |
| 267 | +Similar behaviour can be achieved with the Operator Framework, which, however, takes considerably more configuration and setup work and is not a built-in Kubernetes feature. |