
Commit 1d0be4d

Add new KEP for configuring max CrashLoopBackOff delay
This KEP is mostly a copy of keps/sig-node/4603-tune-crashloopbackoff with all the tuning bits removed (and grammar adjusted to make sense). The desire is to advance this KEP to beta sooner than we'd be able to advance the other one.
1 parent dc6c057 commit 1d0be4d

10 files changed: +2026 -20 lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+kep-number: 5593
+alpha:
+  approver: "@soltysh"

keps/sig-node/4603-tune-crashloopbackoff/README.md

Lines changed: 19 additions & 20 deletions
@@ -219,7 +219,7 @@ are considered too conservative, especially in cases where the exit code was 0
 (Success) and the pod is transitioned into a "Completed" state or the expected
 length of the pod run is less than 10 minutes.

-This KEP proposes the following changes:
+This KEP proposes the following change:
 * Provide an alpha-gated change to get feedback and periodic scalability tests
 on changes to the global initial backoff to 1s and maximum backoff to 1 minute

@@ -228,6 +228,10 @@ CrashLoopBackOffBehavior of today, with the proposed new default, and with the
 proposed minimum per node configuration](./restarts-vs-elapsed-all.png "KEP-4603
 CrashLoopBackoff proposal comparison")

+Originally, this KEP included a proposal to lower the maximum CrashLoopBackOff
+duration. This has been split into [KEP-5593: Configure the max CrashLoopBackOff
+delay](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5593-configure-the-max-crashloopbackoff-delay).
+
 ## Motivation

 <!--
@@ -488,24 +492,15 @@ additional load, the 2x increase in apiserver cpu usage is probably not a
 particularly useful metric. Might be worth mentioning the raw numbers here
 instead.>> <<[/UNRESOLVED]>>

-For both of these changes, by passing these changes through the existing
-SIG-scalability tests, while pursuing manual and more detailed periodic
-benchmarking during the alpha period, we can increase the confidence in the
-changes and explore the possibility of reducing the values further in the
-future.
+By passing the proposed changes through the existing SIG-scalability tests,
+while pursuing manual and more detailed periodic benchmarking during the alpha
+period, we can increase the confidence in the changes and explore the
+possibility of reducing the values further in the future.

 In the meantime, during alpha, naturally the first line of defense is that the
-enhancements, even the reduced "default" baseline curve for CrashLoopBackoff,
-are not usable by default and must be opted into. In this specific case they are
-opted into separately with different alpha feature gates, so clusters will only
-be affected by each risk if the cluster operator enables the new features during
-the alpha period.
-
-Beyond this, there are two main mitigations during alpha: conservativism in
-changes to the default behavior based on prior stress testing, and limiting any
-further overrides to be opt-in per Node, and only by users with the permissions
-to modify the kubelet configuration -- in other words, a cluster operator
-persona.
+enhancement is not usable by default and must be opted into. Further mitigation
+is conservativism in changes to the default behavior based on prior stress
+testing.

 The alpha changes to the _default_ backoff curve were chosen because they meet
 emerging use cases and user sentiment from the canonical feature request issue
@@ -1549,8 +1544,11 @@ Think about adding additional work or introducing new steps in between
 Maybe! As containers will be restarting more, this may affect "Startup latency
 of schedulable stateless pods", "Startup latency of schedule stateful pods".
 This is directly the type of SLI impact that a) the split between the default
-behavior change and the per node opt in is trying to mitigate, and b) one of the
-targets of the benchmarking period during alpha.
+behavior change and the per node max CrashLoopBackOff delay configuration
+proposed in [KEP-5593 - Configure the max CrashLoopBackOff
+delay](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5593-configure-the-max-crashloopbackoff-delay/README.md)
+is trying to mitigate, and b) one of the targets of the benchmarking period
+during alpha.

 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

@@ -1569,7 +1567,8 @@ initial manual benchmarking tests, CPU usage of kubelet increased 2x on nodes
 saturated with 110 instantly crashing single-container pods. During the alpha
 benchmarking period, we will be quantifying that amount in fully and partially
 saturated nodes with both the new default backoff curve and the minimum per node
-backoff curve.
+backoff curve proposed in [KEP-5593 - Configure the max CrashLoopBackOff
+delay](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5593-configure-the-max-crashloopbackoff-delay/README.md).

 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
