@@ -219,7 +219,7 @@ are considered too conservative, especially in cases where the exit code was 0
(Success) and the pod is transitioned into a "Completed" state or the expected
length of the pod run is less than 10 minutes.

- This KEP proposes the following changes:
+ This KEP proposes the following change:
* Provide an alpha-gated change to get feedback and periodic scalability tests
on changes to the global initial backoff to 1s and maximum backoff to 1 minute
@@ -228,6 +228,10 @@ CrashLoopBackOffBehavior of today, with the proposed new default, and with the
proposed minimum per node configuration](./restarts-vs-elapsed-all.png "KEP-4603
CrashLoopBackoff proposal comparison")

+ Originally, this KEP included a proposal to lower the maximum CrashLoopBackOff
+ duration. This has been split into [KEP-5593: Configure the max CrashLoopBackOff
+ delay](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5593-configure-the-max-crashloopbackoff-delay).
+
## Motivation

<!--
@@ -488,24 +492,15 @@ additional load, the 2x increase in apiserver cpu usage is probably not a
particularly useful metric. Might be worth mentioning the raw numbers here
instead.>> <<[/UNRESOLVED]>>

- For both of these changes, by passing these changes through the existing
- SIG-scalability tests, while pursuing manual and more detailed periodic
- benchmarking during the alpha period, we can increase the confidence in the
- changes and explore the possibility of reducing the values further in the
- future.
+ By passing the proposed changes through the existing SIG-scalability tests,
+ while pursuing manual and more detailed periodic benchmarking during the alpha
+ period, we can increase the confidence in the changes and explore the
+ possibility of reducing the values further in the future.

In the meantime, during alpha, naturally the first line of defense is that the
- enhancements, even the reduced "default" baseline curve for CrashLoopBackoff,
- are not usable by default and must be opted into. In this specific case they are
- opted into separately with different alpha feature gates, so clusters will only
- be affected by each risk if the cluster operator enables the new features during
- the alpha period.
-
- Beyond this, there are two main mitigations during alpha: conservativism in
- changes to the default behavior based on prior stress testing, and limiting any
- further overrides to be opt-in per Node, and only by users with the permissions
- to modify the kubelet configuration -- in other words, a cluster operator
- persona.
+ enhancement is not usable by default and must be opted into. Further mitigation
+ is conservativism in changes to the default behavior based on prior stress
+ testing.

The alpha changes to the _default_ backoff curve were chosen because they meet
emerging use cases and user sentiment from the canonical feature request issue
@@ -1549,8 +1544,11 @@ Think about adding additional work or introducing new steps in between
Maybe! As containers will be restarting more, this may affect "Startup latency
of schedulable stateless pods", "Startup latency of schedule stateful pods".
This is directly the type of SLI impact that a) the split between the default
- behavior change and the per node opt in is trying to mitigate, and b) one of the
- targets of the benchmarking period during alpha.
+ behavior change and the per node max CrashLoopBackOff delay configuration
+ proposed in [KEP-5593 - Configure the max CrashLoopBackOff
+ delay](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5593-configure-the-max-crashloopbackoff-delay/README.md)
+ is trying to mitigate, and b) one of the targets of the benchmarking period
+ during alpha.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
@@ -1569,7 +1567,8 @@ initial manual benchmarking tests, CPU usage of kubelet increased 2x on nodes
saturated with 110 instantly crashing single-container pods. During the alpha
benchmarking period, we will be quantifying that amount in fully and partially
saturated nodes with both the new default backoff curve and the minimum per node
- backoff curve.
+ backoff curve proposed in [KEP-5593 - Configure the max CrashLoopBackOff
+ delay](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5593-configure-the-max-crashloopbackoff-delay/README.md).

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?