Commit 299d699

Add note about movements in Job API backoffLimit behavior
Signed-off-by: Laura Lorenz <[email protected]>
1 parent a5573b5 commit 299d699

1 file changed: +34 -3 lines changed
keps/sig-node/4603-tune-crashloopbackoff/README.md

Lines changed: 34 additions & 3 deletions
@@ -791,9 +791,11 @@ spec:
 
 The implementation of KEP-3329 is entirely in the Job controller, and the
 restarts are not handled by kubelet at all; in fact, use of this API is only
-available if the `restartPolicy` is set to `Never`. As a result, to expose the
-new backoff curve Jobs using this feature, the updated backoff curve must also
-be implemented in the Job controller.
+available if the `restartPolicy` is set to `Never` (though
+[kubernetes#125677](https://github.com/kubernetes/kubernetes/issues/125677)
+wants to relax this validation to allow it to be used with other `restartPolicy`
+values). As a result, to expose the new backoff curve Jobs using this feature,
+the updated backoff curve must also be implemented in the Job controller.
 
 ### Relationship with ImagePullBackOff
 
@@ -1607,6 +1609,35 @@ and insight can help us improve late recovery later on (see also the related
 discussion in Alternatives [here](#more-complex-heuristics) and
 [here](#late-recovery)).
 
+CrashLoopBackoff behavior has been stable and untouched for most of the
+Kubernetes lifetime. It could be argued that it "isn't broken", that most people
+are ok with it or have sufficient and architecturally well placed workarounds
+using third party reaper processes or application code based solutions, and
+changing it just invites high risk to the platform as a whole instead of
+individual end user deployments. However, per the [Motivation](#motivation)
+section, there are emerging workload use cases and a long history of a vocal
+minority in favor of changes to this behavior, so trying to change it now is
+timely. Obviously we could still decide not to graduate the change out of alpha
+if the risks are determined to be too high or the feedback is not positive.
+
+Though the issue is highly upvoted, on an analysis of the comments presented in
+the canonical tracking issue
+[Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291), 22
+unique commenters were requesting a constant or instant backoff for `Succeeded`
+Pods, 19 for earlier recovery tries, and 6 for better late recovery behavior;
+the latter is arguably even more highly requested when also considering related
+issue [Kubernetes#50375](https://github.com/kubernetes/kubernetes/issues/50375).
+Though an early version of this KEP also addressed the `Success` case, in its
+current version this KEP really only addresses the early recovery case, which by
+our quantitative data is actually the least requested option. That being said,
+other use cases described in [User Stories](#user-stories) that don't have
+quantitative counts are also driving forces on why we should address the early
+recovery cases now. On top of that, compared to the late recovery cases, early
+recovery is more approachable and easily modelable and improving benchmarking
+and insight can help us improve late recovery later on (see also the related
+discussion in Alternatives [here](#more-complex-heuristics) and
+[here](#late-recovery)).
+
 ## Alternatives
 
 <!--
