
Commit 875d4b4

Add Risks & Mitigations section
Signed-off-by: lauralorenz <[email protected]>
1 parent 0deaea2 commit 875d4b4

File tree

  • keps/sig-node/4603-tune-crashloopbackoff

1 file changed: +59 -0 lines changed

keps/sig-node/4603-tune-crashloopbackoff/README.md

Lines changed: 59 additions & 0 deletions
@@ -525,6 +525,65 @@ How will UX be reviewed, and by whom?
Consider including folks who also work outside the SIG or subproject.
-->

The biggest risk of this proposal is reducing the decay of the _default_
CrashLoopBackoff: doing so too severely compromises node stability, risking
that the kubelet component becomes too slow to respond and that pod lifecycle
latency increases, or worse, that entire nodes crash because the kubelet
consumes too much CPU or memory. Since each phase transition for a Pod also has
an accompanying API request, if requests become frequent enough due to fast
churn of Pods through CrashLoopBackoff phases, the central API server could
become unresponsive, effectively taking down the entire cluster.

The same risk exists for the `Rapid` feature, which, while not the default, is
by design a more severe reduction in the decay behavior. It can be abused by
application developers who can edit their Pod template manifests, and in the
worst case it can saturate nodes with `Rapid`ly restarting pods that will never
recover, risking similar issues as above: taking down nodes, or at least
nontrivially slowing the kubelet, or increasing the API requests to store
backoff state so significantly that the central API server becomes unresponsive
and the cluster fails.

During alpha, the first line of defense is naturally that the enhancements,
even the reduced "default" baseline curve for CrashLoopBackoff, are not usable
by default and must be opted into. In this specific case they are opted into
separately as kubelet flags, so clusters will only be affected by each risk if
the cluster operator enables the new features during the alpha period.

Beyond this, there are two main mitigations during alpha: conservatism in
changes to the default behavior, and the API opt-in and redeployment required
for the more aggressive behavior.

The alpha changes to the default backoff curve were chosen because they are
minimal -- the proposal maintains the existing rate and max cap, and reduces
the initial value only to the point that it introduces 3 excess restarts per
pod: the first 2 in the first 10 seconds, and the last in the following 30
seconds (see [Design Details](#front-loaded-decay-curve-methodology)). For a
hypothetical node with the maximum of 110 pods all stuck in a simultaneous
CrashLoopBackoff, API requests to change the state transition would increase,
at their fastest period, from ~110 requests/10s to ~330 requests/10s. By
passing this minimal change through the existing SIG Scalability tests, while
pursuing manual and more detailed periodic benchmarking during the alpha
period, we can increase confidence in this change and in the possibility of
reducing the curve further in the future.
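
To make the excess-restart arithmetic above concrete, the following is a rough,
non-normative sketch (not part of the KEP) that counts restarts for a single
continually crashing pod under a geometric backoff curve, ignoring container
run time between crashes. The 10s initial value, 2x factor, and 300s cap
reflect the existing defaults; the 1s reduced initial value is only an
assumption for illustration, chosen to be consistent with the "3 excess
restarts in the first 40 seconds" figure above -- see Design Details for the
authoritative numbers.

```go
package main

import (
	"fmt"
	"time"
)

// restartsWithin counts how many restarts fit inside the window for a backoff
// that starts at initial, multiplies by factor after each restart, and never
// exceeds maxDelay. Container run time between crashes is ignored.
func restartsWithin(initial, maxDelay, window time.Duration, factor float64) int {
	elapsed, delay, restarts := time.Duration(0), initial, 0
	for {
		elapsed += delay
		if elapsed > window {
			return restarts
		}
		restarts++
		delay = time.Duration(float64(delay) * factor)
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	window := 40 * time.Second    // the first 10s plus the following 30s
	maxDelay := 300 * time.Second // existing 5 minute cap, unchanged in alpha

	current := restartsWithin(10*time.Second, maxDelay, window, 2)
	proposed := restartsWithin(1*time.Second, maxDelay, window, 2) // assumed reduced initial value

	fmt.Printf("restarts in %v: current=%d proposed=%d excess=%d\n",
		window, current, proposed, proposed-current)
	// => restarts in 40s: current=2 proposed=5 excess=3
}
```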

For the `Rapid` case, because the change is more significant, including
lowering the max cap, more risk to node stability is expected. This change is
of interest to be tested by end users during the alpha period, which is why it
is still included, behind an API opt-in, even though the risks are higher. That
being said, it is still a relatively conservative change, in an effort to
minimize unknown changes and get fast feedback during alpha while improved
benchmarking and testing occurs. For a hypothetical node with the maximum of
110 pods all stuck in a simultaneous `Rapid` CrashLoopBackoff, API requests to
change the state transition would increase from ~110 requests/10s to ~440
requests/10s at their fastest period, and, since the max cap would be lowered,
up to 440 excess requests every 300s (5 minutes), or roughly an extra 1.4
requests per second, once all pods reached their max cap backoff. It should
also be noted that, because the required configuration is against an immutable
field in the Pod manifest, the Pods in question must be redeployed to use it.
This makes it unlikely that all Pods on a node will be in a simultaneous
CrashLoopBackoff even if they are designed to crash quickly, since they will
all need to be redeployed and rescheduled.
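
As a cross-check of the steady-state request arithmetic above, the following
rough sketch (not part of the KEP) reproduces the figures for a node saturated
with 110 crash-looping pods. It assumes roughly one state-transition API
request per restart and, purely for illustration, a lowered `Rapid` cap of 60s
against the existing 300s cap; those assumptions are what make the ~440 excess
requests per 300s (about 1.4-1.5 per second) fall out.

```go
package main

import "fmt"

func main() {
	const (
		podsPerNode    = 110   // current default max pods per node
		existingCapSec = 300.0 // existing 5 minute backoff cap
		rapidCapSec    = 60.0  // assumed lowered Rapid cap, for illustration only
	)

	// Steady state: with every pod restarting at its cap and roughly one
	// state-transition API request per restart, the node-wide request rate is
	// the pod count divided by the cap.
	existingRate := podsPerNode / existingCapSec // requests/second today
	rapidRate := podsPerNode / rapidCapSec       // requests/second under Rapid
	extraPerSecond := rapidRate - existingRate
	extraPer300s := extraPerSecond * 300

	fmt.Printf("extra requests per 300s window: %.0f\n", extraPer300s)   // 440
	fmt.Printf("extra requests per second:      %.2f\n", extraPerSecond) // ~1.47
}
```

If the actual `Rapid` cap chosen for alpha differs, the same calculation
applies with that value substituted.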
## Design Details

<!--
