## Summary

This feature allows applications to give a hint to the ReplicaSet controller
as to which pods should be deleted first on scale down.

## Motivation

Currently, ReplicaSets are scaled down based on criteria that ultimately
prioritize deleting pods with a more recent creation/readiness timestamp. This
is not ideal for some applications where the cost of deleting pods is not
related to how recently they were created.

### Goals

- An API that allows applications to influence the order of deleting pods when scaling down a ReplicaSet

### Non-Goals

- Guarantees on pod deletion order
- A controller that sets the cost of deleting the pods

## Proposal

Define a known annotation, namely `controller.kubernetes.io/pod-deletion-cost`, that
applications can set to offer a hint on the cost of deleting a pod compared
to other pods belonging to the same ReplicaSet.

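To illustrate the intended usage, below is a minimal client-go sketch of an
application setting the annotation on one of its pods before triggering a scale
down. The namespace, pod name, and cost value are hypothetical, and annotation
values must be encoded as strings:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// setPodDeletionCost patches the pod-deletion-cost annotation on a pod.
// Annotation values are strings, so the integer cost is quoted in the patch.
func setPodDeletionCost(ctx context.Context, client kubernetes.Interface, namespace, name string, cost int32) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	_, err := client.CoreV1().Pods(namespace).Patch(
		ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// Hypothetical pod: mark "worker-0" as cheaper to delete than its siblings.
	if err := setPodDeletionCost(context.Background(), client, "default", "worker-0", -100); err != nil {
		panic(err)
	}
}
```

Patching only the annotation keeps the update cheap and avoids conflicting with
other writers of the Pod object.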
### User Stories (optional)
#### Story 1

The different pods of an application could have different utilization levels.
On scale down, the application may prefer to remove the pods with lower utilization.
To avoid frequently updating the pods, the application should update pod-deletion-cost
once before issuing a scale down. This works if the application itself controls the
downscaling (e.g., the driver pod of a Spark deployment).

#### Story 2

On scale down, the application may want to remove pods running on the most expensive
nodes first. For example, remove pods from nodes running on standard VMs first,
then from ones running on preemptible/spot VMs (which can be 80% cheaper than
standard VMs). A sketch of a controller implementing such a policy follows.

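As a sketch of how this story could be automated, a small controller could
assign a lower deletion cost to pods scheduled on preemptible nodes. The
`cloud.google.com/gke-preemptible` label is a GKE-specific assumption (other
providers expose different labels), and the code reuses the hypothetical
`setPodDeletionCost` helper and imports from the Proposal sketch above:

```go
// assignCostsBySpotStatus marks pods on preemptible/spot nodes as cheaper to
// delete, so the ReplicaSet controller prefers removing them on scale down.
func assignCostsBySpotStatus(ctx context.Context, client kubernetes.Interface, namespace, selector string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Spec.NodeName == "" {
			continue // not scheduled yet; the default cost of 0 applies
		}
		node, err := client.CoreV1().Nodes().Get(ctx, pod.Spec.NodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		cost := int32(0) // pods on standard VMs keep the default cost
		if node.Labels["cloud.google.com/gke-preemptible"] == "true" {
			cost = -100 // assumed provider label; prefer deleting pods on cheaper nodes
		}
		if err := setPodDeletionCost(ctx, client, namespace, pod.Name, cost); err != nil {
			return err
		}
	}
	return nil
}
```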
### Risks and Mitigations

- Users perceive the feature as a guarantee of deletion order. Documentation
  should stress the fact that this is best effort.

- Users deploy controllers that update the annotation frequently, causing
  significant load on the API server. Documentation should include best
  practices for how this feature should be used (e.g., update the
  pod-deletion-cost only before scale down). Moreover, [API priority and fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/)
  gives operators a server-side knob that allows them to limit the update
  QPS issued by such controllers; a client-side throttling sketch follows this list.

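To complement the server-side controls, a controller that updates
pod-deletion-cost can also throttle itself client-side. A minimal sketch,
assuming `golang.org/x/time/rate` for rate limiting and reusing the
hypothetical `setPodDeletionCost` helper from the Proposal sketch:

```go
package main

import (
	"context"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/kubernetes"
)

// applyCostsThrottled updates pod-deletion-cost for the given pods, but never
// issues patches faster than one per interval, keeping apiserver load bounded.
func applyCostsThrottled(ctx context.Context, client kubernetes.Interface, namespace string,
	desiredCosts map[string]int32, interval time.Duration) error {
	limiter := rate.NewLimiter(rate.Every(interval), 1)
	for pod, cost := range desiredCosts {
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if err := setPodDeletionCost(ctx, client, namespace, pod, cost); err != nil {
			return err
		}
	}
	return nil
}
```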
## Design Details

The pod-deletion-cost range will be [-MaxInt, MaxInt]. The default value is 0.
Invalid values (like setting the annotation to a string) will be rejected by the
api-server with a BadRequest status code.

Having the default value in the middle of the range allows controllers to customize
the semantics of the cost of deleting pods that don't have the annotation set:
controllers can use positive pod-deletion-cost values if they always want uninitialized
pods to be deleted first, or use negative pod-deletion-cost values if they want
uninitialized pods to always be deleted last.

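A minimal sketch of how this parsing and defaulting could look (the package and
helper names are illustrative, not the final apiserver implementation):

```go
package validation

import (
	"fmt"
	"strconv"
)

// PodDeletionCost is the annotation defined by this KEP.
const PodDeletionCost = "controller.kubernetes.io/pod-deletion-cost"

// getPodDeletionCost parses the annotation, defaulting to 0 when it is absent.
// A parse error is what the apiserver would surface as a BadRequest.
func getPodDeletionCost(annotations map[string]string) (int32, error) {
	v, ok := annotations[PodDeletionCost]
	if !ok {
		return 0, nil // default sits in the middle of [-MaxInt, MaxInt]
	}
	cost, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("%s must be a 32-bit integer, got %q: %v", PodDeletionCost, v, err)
	}
	return int32(cost), nil
}
```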
When scaling down a ReplicaSet, controller-manager will prioritize deleting
pods with lower pod-deletion-cost. Specifically, the pod-deletion-cost will be
evaluated after step 3 and before step 4 of the criteria the ReplicaSet
controller currently applies when ranking pods for deletion, which means the
following criteria are applied when comparing two pods regardless of their
pod-deletion-cost:

- if one is assigned a node and the other is not, then the unassigned pod is deleted first.
- if the two pods are in different phases, then the pod in pending/unknown status is deleted first.
- if the two pods have different readiness status, then the not-ready pod is deleted first.

If none of the pods set the pod-deletion-cost annotation or all of them have the
same value, then the scale-down behavior is unchanged. A sketch of the resulting
comparison follows.

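To make the ordering concrete, here is a simplified sketch of the resulting
comparison. The function names are illustrative, and the real controller applies
further recency-based tie-breakers after the cost comparison:

```go
package controller

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// deletionCost reads pod-deletion-cost, defaulting to 0 when absent or
// unparsable (the apiserver rejects invalid values on write).
func deletionCost(pod *corev1.Pod) int32 {
	v, ok := pod.Annotations["controller.kubernetes.io/pod-deletion-cost"]
	if !ok {
		return 0
	}
	cost, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return 0
	}
	return int32(cost)
}

// isReady reports whether the pod's Ready condition is true.
func isReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// deletedBefore reports whether p1 should be deleted before p2. The three
// criteria listed above take precedence; pod-deletion-cost only breaks ties
// among pods that are equally assigned, running, and ready.
func deletedBefore(p1, p2 *corev1.Pod) bool {
	// 1. Unassigned pods are deleted before assigned ones.
	if (p1.Spec.NodeName == "") != (p2.Spec.NodeName == "") {
		return p1.Spec.NodeName == ""
	}
	// 2. Pods in pending/unknown phase are deleted before running ones
	// (simplified: any non-running phase sorts first).
	if p1.Status.Phase != p2.Status.Phase {
		return p1.Status.Phase != corev1.PodRunning
	}
	// 3. Not-ready pods are deleted before ready ones.
	if isReady(p1) != isReady(p2) {
		return !isReady(p1)
	}
	// New step: lower pod-deletion-cost is deleted first.
	if c1, c2 := deletionCost(p1), deletionCost(p2); c1 != c2 {
		return c1 < c2
	}
	// The existing recency-based tie-breakers (step 4 onward) follow here.
	return false
}
```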
### Test Plan

- Unit tests in the kube-controller-manager package to test a variety of scenarios.
- Integration tests to validate that:
  - Replicas with lower pod-deletion-cost are deleted before replicas with higher pod-deletion-cost
  - No behavior change when pod-deletion-cost is not set or all pods have the same pod-deletion-cost

### Graduation Criteria
#### Alpha -> Beta Graduation
* Implemented feedback from alpha testers
#### Beta -> GA Graduation
* We're confident that no further API changes will be needed to achieve the goals of the KEP
* All known functional bugs have been fixed
### Upgrade / Downgrade Strategy

There is no strategy per se. On upgrade, controller-manager will start taking into
account the pod-deletion-cost annotation for new and existing ReplicaSets that set
the annotation. On downgrade, controller-manager will stop taking pod-deletion-cost
into account, reverting to the old behavior.

### Version Skew Strategy

N/A

## Production Readiness Review Questionnaire
### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [x] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: ReplicaSetPodDeletionCost
    - Components depending on the feature gate: kube-controller-manager
  - [ ] Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control plane?
    - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).


* **Does enabling the feature change any default behavior?**

  No.


* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?**

  Yes.


* **What happens if we reenable the feature if it was previously rolled back?**

  It should continue to work as expected.


* **Are there any tests for feature enablement/disablement?**

  We will add unit tests.

### Rollout, Upgrade and Rollback Planning
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**

  It shouldn't impact already running workloads. This is an opt-in feature
  since users need to explicitly set the annotation.


* **What specific metrics should inform a rollback?**

  None.


* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**

  We will do manual testing.


* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?**

  No.

### Monitoring requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**

  Search for pods with the `controller.kubernetes.io/pod-deletion-cost` annotation set.

* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**

  N/A

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

  N/A

* **Are there any missing metrics that would be useful to have to improve observability of this feature?**

  No.

### Dependencies
_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**

  No.

### Scalability

* **Will enabling / using this feature result in any new API calls?**

  No, not from the feature itself. However, users will want to deploy an external
  controller that updates the pod-deletion-cost; documentation should stress that
  the update frequency should be coarse-grained.


* **Will enabling / using this feature result in introducing new API types?**

* **Will enabling / using this feature result in increasing size or count of the existing API objects?**

  No.

* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?**

  There are no SLOs covering scale down, but this feature should have negligible
  impact on scale-down latency since we are only adding an additional sorting key.

* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**

  No.

### Troubleshooting

_This section must be completed when targeting beta graduation to a release._

## Implementation History

- 2021-01-13: Initial KEP submitted as provisional
- 2021-01-15: KEP promoted to implementable

## Alternatives

One alternative to using an annotation is adding an explicit API field. If the
feature gets enough traction, we may consider promoting the annotation to a
Status field.