Commit 0deaea2

Fix some headers, add Drawbacks section
Signed-off-by: lauralorenz <[email protected]>
1 parent 64a1207 commit 0deaea2

File tree

1 file changed: +35 -6
  • keps/sig-node/4603-tune-crashloopbackoff


keps/sig-node/4603-tune-crashloopbackoff/README.md

Lines changed: 35 additions & 6 deletions
@@ -83,8 +83,8 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
-    - [Existing backoff curve change: front loaded decay](#existing-backoff-curve-change-front-loaded-decay)
-    - [API opt in for max cap decay curve (<code>restartPolicy: Rapid</code>)](#api-opt-in-for-max-cap-decay-curve-restartpolicy-rapid)
+  - [Existing backoff curve change: front loaded decay](#existing-backoff-curve-change-front-loaded-decay)
+  - [API opt in for max cap decay curve (<code>restartPolicy: Rapid</code>)](#api-opt-in-for-max-cap-decay-curve-restartpolicy-rapid)
   - [User Stories](#user-stories)
     - [Task isolation](#task-isolation)
     - [Fast restart on failure](#fast-restart-on-failure)
@@ -351,7 +351,7 @@ Note that proposal will NOT change:
 [Alternatives](#more-complex-heuristics)
 
 
-#### Existing backoff curve change: front loaded decay
+### Existing backoff curve change: front loaded decay
 
 This KEP proposes changing the existing backoff curve to load more restarts
 earlier by changing the initial value of the exponential backoff. A number of
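
For orientation (an editor's sketch, not part of this commit): a minimal Go model of the two curves this hunk compares. It assumes today's kubelet defaults of a 10s initial delay doubling up to a 300s cap, and uses 1s as the lowered initial value to match the `todayvs1sbackoff.png` comparison in the next hunk; the 1s constant is a modeling choice here, not a decided value.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSeries models CrashLoopBackoff delays as delay(n) = initial * 2^n,
// capped at max. A simplified model, not the kubelet's implementation.
func backoffSeries(initial, max time.Duration, restarts int) []time.Duration {
	delays := make([]time.Duration, 0, restarts)
	d := initial
	for i := 0; i < restarts; i++ {
		delays = append(delays, d)
		d *= 2
		if d > max {
			d = max
		}
	}
	return delays
}

func main() {
	// Today's curve vs. a front-loaded curve with a 1s initial value.
	fmt.Println("today:   ", backoffSeries(10*time.Second, 300*time.Second, 10))
	fmt.Println("proposed:", backoffSeries(1*time.Second, 300*time.Second, 10))
}
```

Both curves reach the same 300s cap, but the front-loaded one spends its first several retries in the seconds range rather than tens of seconds, which is the "load more restarts earlier" behavior described above.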
@@ -362,7 +362,7 @@ analyze its impact on infrastructure during alpha.
 ![](todayvs1sbackoff.png)
 
 
-#### API opt in for max cap decay curve (`restartPolicy: Rapid`)
+### API opt in for max cap decay curve (`restartPolicy: Rapid`)
 
 Pods and restartable init (aka sidecar) containers will be able to set a new
 OneOf value, `restartPolicy: Rapid`, to opt in to an exponential backoff decay
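
As an illustration (not part of this commit's diff), a manifest sketch of how the opt-in could look. `Rapid` is the new OneOf value this KEP proposes, placed where `restartPolicy` already exists at the pod level and on restartable init (sidecar) containers; all names and images are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rapid-backoff-example   # illustrative name
spec:
  restartPolicy: Rapid          # proposed pod-level opt-in to the faster decay curve
  initContainers:
    - name: log-forwarder       # restartable init (aka sidecar) container
      image: example.com/log-forwarder:latest
      restartPolicy: Rapid      # proposed per-container opt-in for sidecars
  containers:
    - name: app
      image: example.com/app:latest
```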
@@ -589,7 +589,7 @@ Among these modeled initial values, we would get between 3-7 excess restarts per
 backoff lifetime, mostly within the first three time windows matching today's
 restart behavior.
 
-#### Rapid curve methodology
+### Rapid curve methodology
 
 For some users in
 [Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291), any
@@ -667,7 +667,7 @@ heterogenity between "Succeeded" terminating pods, crashing pods whose
 `restartPolicy: Always`, and crashing pods whose `restartPolicy: Rapid`,
 * what is the load and rate of Pod restart related API requests to the API
 server?
-* what are the performance (memory, CPU, and latency) effects on the kubelet
+* what are the performance (memory, CPU, and pod start latency) effects on the kubelet
 component?
 
 In order to answer these questions, metrics tying together the number of
@@ -1429,6 +1429,35 @@ Major milestones might include:
 Why should this KEP _not_ be implemented?
 -->
 
+CrashLoopBackoff behavior has been stable and untouched for most of the
+Kubernetes lifetime. It could be argued that it "isn't broken": that most
+people are OK with it or have sufficient, architecturally well placed
+workarounds using third party reaper processes or application code based
+solutions, and that changing it invites high risk to the platform as a whole
+instead of to individual end user deployments. However, per the
+[Motivation](#motivation) section, there are emerging workload use cases and a
+long history of a vocal minority in favor of changes to this behavior, so
+trying to change it now is timely. We could still decide not to graduate the
+change out of alpha if the risks are determined to be too high or the feedback
+is not positive.
+
+Though the issue is highly upvoted, in an analysis of the comments on the
+canonical tracking issue
+[Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291), 22
+unique commenters requested a constant or instant backoff for `Succeeded`
+Pods, 19 requested earlier recovery tries, and 6 requested better late
+recovery behavior; the latter is arguably even more highly requested when also
+considering the related issue
+[Kubernetes#50375](https://github.com/kubernetes/kubernetes/issues/50375).
+Though an early version of this KEP also addressed the `Succeeded` case, in
+its current version this KEP only addresses the early recovery case, which by
+our quantitative data is actually the least requested option. That being said,
+other use cases described in [User Stories](#user-stories) that don't have
+quantitative counts are also driving forces for addressing the early recovery
+cases now. On top of that, compared to the late recovery cases, early recovery
+is more approachable and more easily modeled, and improved benchmarking and
+insight here can help us improve late recovery later on (see also the related
+discussion in Alternatives [here](#more-complex-heuristics) and
+[here](#late-recovery)).
+
 ## Alternatives
 
 <!--
