@@ -525,6 +525,65 @@ How will UX be reviewed, and by whom?
Consider including folks who also work outside the SIG or subproject.
-->
-->
+ The biggest risk of this proposal is reducing the decay of the _default_
+ CrashLoopBackoff: doing so too severely compromises node stability, risking
+ that the kubelet becomes too slow to respond and pod lifecycle latency
+ increases, or worse, crashing entire nodes if the kubelet consumes too much
+ CPU or memory. Since each phase transition for a Pod also has an accompanying
+ API request, if the requests become rapid enough due to fast enough churn of
+ Pods through CrashLoopBackoff phases, the central API server could become
+ unresponsive, effectively taking down an entire cluster.
+
+ The same risk exists for the `Rapid` feature, which, while not default, is by
+ design a more severe reduction in the decay behavior. It can be abused by
+ application developers who can edit their Pod template manifests, and in the
+ worst case cause nodes to fully saturate with `Rapid`ly restarting pods that
+ will never recover, risking similar issues as above: taking down nodes, or at
+ least nontrivially slowing the kubelet, or increasing the API requests to
+ store backoff state so significantly that the central API server becomes
+ unresponsive and the cluster fails.
+
+ During alpha, the first line of defense is naturally that the enhancements,
+ even the reduced "default" baseline curve for CrashLoopBackoff, are not
+ usable by default and must be opted into. In this specific case they are
+ opted into separately as kubelet flags, so clusters will only be affected by
+ each risk if the cluster operator enables the new features during the alpha
+ period.
+
+ Beyond this, there are two main mitigations during alpha: conservatism in
+ changes to the default behavior, and requiring API opt-in and redeployment
+ for the more aggressive behavior.
+
+ The alpha changes to the default backoff curve were chosen because they are
+ minimal -- the proposal maintains the existing rate and max cap, and reduces
+ the initial value only to a point that introduces 3 excess restarts per pod:
+ the first 2 excess restarts in the first 10 seconds, and the last one in the
+ next 30 seconds (see [Design Details](#front-loaded-decay-curve-methodology)).
+ For a hypothetical node with the max 110 pods all stuck in a simultaneous
+ CrashLoopBackoff, API requests to change the state transition would increase
+ at the fastest period from ~110 requests/10s to 330 requests/10s. By passing
+ this minimal change through the existing SIG-scalability tests, while
+ pursuing manual and more detailed periodic benchmarking during the alpha
+ period, we can increase confidence both in this change and in the
+ possibility of reducing the curve further in the future.
+
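The excess-restart arithmetic above can be sanity-checked with a short simulation. The specific curve parameters below (existing initial delay of 10s, proposed initial delay of 1s, 2x growth factor, 300s cap) are illustrative assumptions, not normative values:

```python
def restart_times(initial, factor=2.0, cap=300.0, horizon=40.0):
    """Times (seconds) at which a constantly-crashing pod restarts within the
    horizon, assuming each crash is immediate and backoff grows up to the cap."""
    times, t, delay = [], 0.0, initial
    while t + delay <= horizon:
        t += delay
        times.append(t)
        delay = min(delay * factor, cap)
    return times

old = restart_times(initial=10.0)  # existing default curve (assumed 10s start)
new = restart_times(initial=1.0)   # proposed default curve (assumed 1s start)

pods = 110  # kubelet's default max pods per node

def in_first_10s(ts):
    return sum(1 for t in ts if t <= 10)

# Per-pod: old curve restarts once in the first 10s, new curve three times,
# and the new curve adds 3 excess restarts overall by the 40s mark.
print("restarts in first 10s per pod: old=%d new=%d"
      % (in_first_10s(old), in_first_10s(new)))
print("node-wide API requests in first 10s: old=%d new=%d"
      % (pods * in_first_10s(old), pods * in_first_10s(new)))
print("excess restarts per pod by 40s: %d" % (len(new) - len(old)))
```

Under these assumed parameters the simulation reproduces the figures quoted above: 110 vs. 330 requests in the fastest 10-second window, and 3 excess restarts per pod.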
+ For the `Rapid` case, because the change is more significant, including
+ lowering the max cap, more risk to node stability is expected. End users are
+ interested in testing this change during the alpha period, which is why it
+ is still included, behind API opt-in, even though the risks are higher. That
+ said, it is still a relatively conservative change, in an effort to minimize
+ the unknowns for fast feedback during alpha while improved benchmarking and
+ testing occur. For a hypothetical node with the max 110 pods all stuck in a
+ simultaneous `Rapid` CrashLoopBackoff, API requests to change the state
+ transition would increase from ~110 requests/10s to 440 requests/10s, and
+ since the max cap would be lowered, would exhibit up to 440 excess requests
+ every 300s (5 minutes), or an extra ~1.4 requests per second once all pods
+ reached their max cap backoff. It should also be noted that because the
+ required configuration in the Pod manifest changes an immutable field, the
+ Pods in question will need to be redeployed. This makes it unlikely that all
+ Pods will be in a simultaneous CrashLoopBackoff even if they are designed to
+ crash quickly, since they will all need to be redeployed and rescheduled.
+
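As a back-of-envelope check of the steady-state figures above: the 300s cap is the existing default, while the 60s `Rapid` cap used below is an assumed value chosen to reproduce the quoted numbers:

```python
pods = 110                        # kubelet's default max pods per node
old_cap, rapid_cap = 300.0, 60.0  # seconds between restarts once capped (60s is assumed)
window = 300.0                    # 5-minute observation window

# One API request per restart; at the cap, each pod restarts window/cap times.
old_requests = pods * window / old_cap      # existing cap: 110 requests / 300s
rapid_requests = pods * window / rapid_cap  # assumed Rapid cap: 550 requests / 300s
excess = rapid_requests - old_requests      # 440 extra requests / 300s
print("excess requests per 300s: %d (~%.2f/s)" % (excess, excess / window))
```

With these assumptions, the excess works out to 440 requests per 5-minute window, i.e. roughly 1.4-1.5 extra requests per second node-wide once all pods are at the cap.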
## Design Details
<!--