@@ -83,8 +83,8 @@ tags, and then generate with `hack/update-toc.sh`.
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- - [Existing backoff curve change: front loaded decay](#existing-backoff-curve-change-front-loaded-decay)
- - [API opt in for max cap decay curve (<code>restartPolicy: Rapid</code>)](#api-opt-in-for-max-cap-decay-curve-restartpolicy-rapid)
+ - [Existing backoff curve change: front loaded decay](#existing-backoff-curve-change-front-loaded-decay)
+ - [API opt in for max cap decay curve (<code>restartPolicy: Rapid</code>)](#api-opt-in-for-max-cap-decay-curve-restartpolicy-rapid)
- [User Stories](#user-stories)
- [Task isolation](#task-isolation)
- [Fast restart on failure](#fast-restart-on-failure)
@@ -351,7 +351,7 @@ Note that this proposal will NOT change:
[Alternatives](#more-complex-heuristics)


- #### Existing backoff curve change: front loaded decay
+ ### Existing backoff curve change: front loaded decay

This KEP proposes changing the existing backoff curve to load more restarts
earlier by changing the initial value of the exponential backoff. A number of
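To make the shape of this change concrete, below is a minimal sketch (not part
of the KEP itself) that prints the delay sequence of today's curve, which
starts at 10s, doubles after each restart, and is capped at 5 minutes, next to
a front loaded variant starting at 1s, the value the `todayvs1sbackoff.png`
figure appears to model. The initial value ultimately chosen is subject to the
analysis described in this section.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelays returns the successive CrashLoopBackoff delays produced by a
// doubling curve that starts at initial and is clamped at maxDelay.
func backoffDelays(initial, maxDelay time.Duration, restarts int) []time.Duration {
	delays := make([]time.Duration, 0, restarts)
	d := initial
	for i := 0; i < restarts; i++ {
		if d > maxDelay {
			d = maxDelay
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

func main() {
	// Today's defaults: 10s initial value, doubling, capped at 5 minutes.
	fmt.Println("today:        ", backoffDelays(10*time.Second, 5*time.Minute, 10))
	// Front loaded variant (illustrative 1s initial value, same 5 minute cap).
	fmt.Println("front loaded: ", backoffDelays(1*time.Second, 5*time.Minute, 10))
}
```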
@@ -362,7 +362,7 @@ analyze its impact on infrastructure during alpha.
![](todayvs1sbackoff.png)


- #### API opt in for max cap decay curve (`restartPolicy: Rapid`)
+ ### API opt in for max cap decay curve (`restartPolicy: Rapid`)

Pods and restartable init (aka sidecar) containers will be able to set a new
OneOf value, `restartPolicy: Rapid`, to opt in to an exponential backoff decay
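As a rough sketch of the proposed API shape only (the `Rapid` value does not
exist in the released `core/v1` API, so it is declared locally below, and the
final surface may change during alpha), a Pod and a restartable init container
might opt in as follows using the Go API types:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// RestartPolicyRapid is the OneOf value proposed by this KEP; it is defined
// here because it is not part of the released core/v1 API.
const RestartPolicyRapid corev1.RestartPolicy = "Rapid"

func main() {
	sidecarRapid := corev1.ContainerRestartPolicy("Rapid")

	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "rapid-restart-example"},
		Spec: corev1.PodSpec{
			// Pod level opt in to the faster, max capped decay curve.
			RestartPolicy: RestartPolicyRapid,
			InitContainers: []corev1.Container{{
				Name:  "sidecar",
				Image: "busybox",
				// Restartable init (sidecar) containers would opt in the same
				// way, in place of today's restartPolicy: Always.
				RestartPolicy: &sidecarRapid,
			}},
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "busybox",
			}},
		},
	}
	fmt.Println(pod.Name, pod.Spec.RestartPolicy)
}
```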
@@ -589,7 +589,7 @@ Among these modeled initial values, we would get between 3-7 excess restarts per
backoff lifetime, mostly within the first three time windows matching today's
restart behavior.

- #### Rapid curve methodology
+ ### Rapid curve methodology

For some users in
[Kubernetes #57291](https://github.com/kubernetes/kubernetes/issues/57291), any
@@ -667,7 +667,7 @@ heterogeneity between "Succeeded" terminating pods, crashing pods whose
`restartPolicy: Always`, and crashing pods whose `restartPolicy: Rapid`,
* what is the load and rate of Pod restart related API requests to the API
server?
- * what are the performance (memory, CPU, and latency) effects on the kubelet
+ * what are the performance (memory, CPU, and pod start latency) effects on the kubelet
component?

In order to answer these questions, metrics tying together the number of
@@ -1429,6 +1429,35 @@ Major milestones might include:
Why should this KEP _not_ be implemented?
-->

+ CrashLoopBackoff behavior has been stable and untouched for most of
+ Kubernetes' lifetime. It could be argued that it "isn't broken", that most
+ people are ok with it or have sufficient and architecturally well placed
+ workarounds using third party reaper processes or application code based
+ solutions, and that changing it invites risk to the platform as a whole rather
+ than to individual end user deployments. However, per the
+ [Motivation](#motivation) section, there are emerging workload use cases and a
+ long history of a vocal minority in favor of changes to this behavior, so
+ trying to change it now is timely. We could still decide not to graduate the
+ change out of alpha if the risks prove too high or the feedback is not
+ positive.
+
+ Though the issue is highly upvoted, an analysis of the comments on the
+ canonical tracking issue
+ [Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291)
+ found 22 unique commenters requesting a constant or instant backoff for
+ `Succeeded` Pods, 19 requesting earlier recovery tries, and 6 requesting
+ better late recovery behavior; the latter is arguably even more highly
+ requested when also considering the related issue
+ [Kubernetes#50375](https://github.com/kubernetes/kubernetes/issues/50375).
+ Though an early version of this KEP also addressed the `Succeeded` case, the
+ current version only addresses the early recovery case, which by our
+ quantitative data is actually the least requested option. That being said,
+ other use cases described in [User Stories](#user-stories) that lack
+ quantitative counts are also driving forces for addressing the early recovery
+ cases now. On top of that, compared to the late recovery cases, early recovery
+ is more approachable and easier to model, and the resulting benchmarking and
+ insight can help us improve late recovery later on (see the related discussion
+ in Alternatives [here](#more-complex-heuristics) and [here](#late-recovery)).
+
## Alternatives
<!--