@@ -83,8 +83,8 @@ tags, and then generate with `hack/update-toc.sh`.
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- - [Existing backoff curve change: front loaded decay](#existing-backoff-curve-change-front-loaded-decay)
- - [API opt in for max cap decay curve (<code>restartPolicy: Rapid</code>)](#api-opt-in-for-max-cap-decay-curve-restartpolicy-rapid)
+ - [Existing backoff curve change: front loaded decay](#existing-backoff-curve-change-front-loaded-decay)
+ - [API opt in for max cap decay curve (<code>restartPolicy: Rapid</code>)](#api-opt-in-for-max-cap-decay-curve-restartpolicy-rapid)
- [User Stories](#user-stories)
- [Task isolation](#task-isolation)
- [Fast restart on failure](#fast-restart-on-failure)
@@ -351,7 +351,7 @@ Note that this proposal will NOT change:
[Alternatives](#more-complex-heuristics)


- #### Existing backoff curve change: front loaded decay
+ ### Existing backoff curve change: front loaded decay

This KEP proposes changing the existing backoff curve to load more restarts
earlier by changing the initial value of the exponential backoff. A number of
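To make the shape of this change concrete, below is a minimal sketch (not part
of the KEP itself) that prints the delay sequence of today's curve, which
starts at 10s, doubles after each restart, and is capped at 5 minutes, next to
a front loaded variant starting at 1s, the value the `todayvs1sbackoff.png`
figure appears to model. The initial value ultimately chosen is subject to the
analysis described in this section.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelays returns the successive CrashLoopBackoff delays produced by a
// doubling curve that starts at initial and is clamped at maxDelay.
func backoffDelays(initial, maxDelay time.Duration, restarts int) []time.Duration {
	delays := make([]time.Duration, 0, restarts)
	d := initial
	for i := 0; i < restarts; i++ {
		if d > maxDelay {
			d = maxDelay
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

func main() {
	// Today's defaults: 10s initial value, doubling, capped at 5 minutes.
	fmt.Println("today:        ", backoffDelays(10*time.Second, 5*time.Minute, 10))
	// Front loaded variant (illustrative 1s initial value, same 5 minute cap).
	fmt.Println("front loaded: ", backoffDelays(1*time.Second, 5*time.Minute, 10))
}
```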
@@ -362,7 +362,7 @@ analyze its impact on infrastructure during alpha.
![](todayvs1sbackoff.png)


- #### API opt in for max cap decay curve (`restartPolicy: Rapid`)
+ ### API opt in for max cap decay curve (`restartPolicy: Rapid`)

Pods and restartable init (aka sidecar) containers will be able to set a new
OneOf value, `restartPolicy: Rapid`, to opt in to an exponential backoff decay
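As a rough sketch of the proposed API shape only (the `Rapid` value does not
exist in the released `core/v1` API, so it is declared locally below, and the
final surface may change during alpha), a Pod and a restartable init container
might opt in as follows using the Go API types:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// RestartPolicyRapid is the OneOf value proposed by this KEP; it is defined
// here because it is not part of the released core/v1 API.
const RestartPolicyRapid corev1.RestartPolicy = "Rapid"

func main() {
	sidecarRapid := corev1.ContainerRestartPolicy("Rapid")

	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "rapid-restart-example"},
		Spec: corev1.PodSpec{
			// Pod level opt in to the faster, max capped decay curve.
			RestartPolicy: RestartPolicyRapid,
			InitContainers: []corev1.Container{{
				Name:  "sidecar",
				Image: "busybox",
				// Restartable init (sidecar) containers would opt in the same
				// way, in place of today's restartPolicy: Always.
				RestartPolicy: &sidecarRapid,
			}},
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "busybox",
			}},
		},
	}
	fmt.Println(pod.Name, pod.Spec.RestartPolicy)
}
```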
@@ -589,7 +589,7 @@ Among these modeled initial values, we would get between 3-7 excess restarts per
backoff lifetime, mostly within the first three time windows matching today's
restart behavior.

- #### Rapid curve methodology
+ ### Rapid curve methodology

For some users in
[Kubernetes #57291](https://github.com/kubernetes/kubernetes/issues/57291), any
@@ -667,7 +667,7 @@ heterogeneity between "Succeeded" terminating pods, crashing pods whose
`restartPolicy: Always`, and crashing pods whose `restartPolicy: Rapid`,
* what is the load and rate of Pod restart related API requests to the API
server?
- * what are the performance (memory, CPU, and latency) effects on the kubelet
+ * what are the performance (memory, CPU, and pod start latency) effects on the kubelet
component?

In order to answer these questions, metrics tying together the number of
@@ -1429,6 +1429,35 @@ Major milestones might include:
Why should this KEP _not_ be implemented?
-->

+ CrashLoopBackoff behavior has been stable and untouched for most of
+ Kubernetes' lifetime. It could be argued that it "isn't broken", that most
+ people are ok with it or have sufficient and architecturally well placed
+ workarounds using third party reaper processes or application code based
+ solutions, and that changing it invites risk to the platform as a whole rather
+ than to individual end user deployments. However, per the
+ [Motivation](#motivation) section, there are emerging workload use cases and a
+ long history of a vocal minority in favor of changes to this behavior, so
+ trying to change it now is timely. We could still decide not to graduate the
+ change out of alpha if the risks prove too high or the feedback is not
+ positive.
+
+ Though the issue is highly upvoted, an analysis of the comments on the
+ canonical tracking issue
+ [Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291)
+ found 22 unique commenters requesting a constant or instant backoff for
+ `Succeeded` Pods, 19 requesting earlier recovery tries, and 6 requesting
+ better late recovery behavior; the latter is arguably even more highly
+ requested when also considering the related issue
+ [Kubernetes#50375](https://github.com/kubernetes/kubernetes/issues/50375).
+ Though an early version of this KEP also addressed the `Succeeded` case, the
+ current version only addresses the early recovery case, which by our
+ quantitative data is actually the least requested option. That being said,
+ other use cases described in [User Stories](#user-stories) that lack
+ quantitative counts are also driving forces for addressing the early recovery
+ cases now. On top of that, compared to the late recovery cases, early recovery
+ is more approachable and easier to model, and the resulting benchmarking and
+ insight can help us improve late recovery later on (see the related discussion
+ in Alternatives [here](#more-complex-heuristics) and [here](#late-recovery)).
+
## Alternatives
<!--