
Commit e513863

Add rationale and clarity on late recovery and what this KEP doesn't touch
Signed-off-by: Laura Lorenz <[email protected]>
1 parent 3001a42 commit e513863

1 file changed

+36
-1
lines changed
keps/sig-node/4603-tune-crashloopbackoff/README.md

Lines changed: 36 additions & 1 deletion
@@ -215,7 +215,7 @@ length of the pod run is less than 10 minutes. This KEP proposes a two-pronged
 approach to revisiting the CrashLoopBackoff behaviors for common use cases:
 1. modifying the standard backoff delay to start faster but decay to the same 5m
 threshold
-2. allowing Pods to opt-in to an even faster backoff curve
+2. allowing Pods to opt-in to an even faster backoff curve with a lower max cap
 
 For each of these changes, the exact values are subject to modification in the
 alpha period in order to empirically derive defaults intended to
@@ -345,6 +345,17 @@ metrics will also supply cluster operators the data necessary to better analyze
 and anticipate the change in load and node stability as a result of upgrading to
 these changes.
 
+Note that this proposal will NOT change:
+* backoff behavior for Pods transitioning from the "Success" state -- see
+[Notes/Constraints/Caveats](#on-success-and-the-10-minute-recovery-threshold)
+* the time Kubernetes waits before resetting the backoff counter -- see the
+[Notes/Constraints/Caveats](#on-success-and-the-10-minute-recovery-threshold)
+* the ImagePullBackoff -- out of scope, see [Design
+Details](#relationship-with-imagepullbackoff)
+* changes that address 'late recovery', or modifications to backoff behavior
+once the max cap has been reached -- see
+[Alternatives](#more-complex-heuristics)
+
 
 #### Existing backoff curve change: front loaded decay
 
@@ -1490,6 +1501,30 @@ infrastructure to warrant implementing such a contrived backoff curve.
 
 !["A graph showing the changes to restarts depending on some initial values"](initialvaluesandnumberofrestarts.png "Different CrashLoopBackoff initial values")
 
+### Late recovery
+
+There are many use cases not covered in this KEP's target [User
+Stories](#user-stories) that share the common properties of being concerned with
+the recovery timeline of Pods that have already reached their max cap for their
+backoff. Today, some of these Pods will have their backoff counters reset once
+they have run successfully for 10 minutes. However, user stories exist where
+
+1. the Pod will never successfully run for 10 minutes by design
+2. the user wants to be able to force the decay curve to restart
+([Kubernetes#50375](https://github.com/kubernetes/kubernetes/issues/50375))
+3. the application knows what to wait for and could communicate that to the
+system (like a restart probe)
+
+As discussed in
+[Notes/Constraints/Caveats](#on-success-and-the-10-minute-recovery-threshold),
+the first case is unlikely to be addressed by Kubernetes.
+
+The latter two are considered out of scope for this KEP, as the most common use
+cases concern the initial recovery period. If there is still sufficient
+appetite after this KEP reaches beta to specifically address late recovery
+scenarios, then that would be a good time to address them without the noise and
+change of this KEP.
+
 ### More complex heuristics
 
 The following alternatives are all considered by the author to be in the category of "more complex heuristics", meaning solutions predicated on kubelet making runtime decisions on a variety of system or workload states or trends. These approaches all share the common negatives of being:
