keps/sig-node/4603-tune-crashloopbackoff/README.md
Lines changed: 62 additions & 32 deletions
@@ -712,28 +712,57 @@ does during pod restarts.
 * Logs information about all those container operations (utilizing disk IO and
 “spamming” logs)
 
-### Observability
+### Benchmarking
 
 Again, let it be known that by definition this KEP will cause pods to restart
 faster and more often than the current status quo and such a change is desired.
-However, to do so as safely as possible, improved visibility into cluster
-restart behavior is needed both for benchmarking this change and for cluster
-operators to be able to quantify the risk posed to their specific clusters on
-upgrade.
-
-This KEP requires the ability to determine, for a given percentage of
-heterogenity between "Succeeded" terminating pods, crashing pods whose
-`restartPolicy: Always`, and crashing pods whose `restartPolicy: Rapid`,
+However, to do so as safely as possible, it is required that during the alpha
+period, we reevaluate the SLIs and SLOs and benchmarks related to this change and
+expose clearly the methodology needed for cluster operators to be able to quantify the
+risk posed to their specific clusters on upgrade.
+
+To best reason about the changes in this KEP, we require the ability to
+determine, for a given percentage of heterogeneity between "Succeeded"
+terminating pods, crashing pods whose `restartPolicy: Always`, and crashing pods
+whose `restartPolicy: Rapid`,
 * what is the load and rate of Pod restart related API requests to the API
 server?
-* what are the performance (memory, CPU, and pod start latency) effects on the kubelet
-component?
+* what are the performance (memory, CPU, and pod start latency) effects on the
+kubelet component?
+
+Today there are alpha SLIs in Kubernetes that can observe that impact in
+aggregate:
+* Kubelet component CPU and memory
+* `kubelet_http_inflight_requests`
+* `kubelet_http_requests_duration_seconds`
+* `kubelet_http_requests_total`
+* `kubelet_pod_worker_duration_seconds`
+* `kubelet_runtime_operations_duration_seconds`
+* `kubelet_pod_start_duration_seconds`
+* `kubelet_pod_start_sli_duration_seconds`
+
+In addition, estimates given the currently suggested changes in API requests are
+included in [Risks and Mitigations](#risks-and-mitigations) and were deciding
+factors in specific changes to the backoff curves. Since the changes in this
+proposal are deterministic, this is pre-calculatable for a given heterogeneity of
+quantity and rate of restarting pods.
+
+In addition, the `kube-state-metrics` project already implements
+restart-specific metadata for metrics that can be used to observe pod restart
+latency in more detail, including:
+* `kube_pod_container_status_restarts_total`
+* `kube_pod_restart_policy`
+* `kube_pod_start_time`
+* `kube_pod_created`
+
+During the alpha period, these metrics, the SIG-Scalability benchmarking tests,
+added kubelet performance tests, and manual benchmarking by the author against
+`kube-state-metrics` will be used to answer the above questions, tying together the
+container restart policy (inherited or declared), the terminal state of a
+container before restarting, and the number of container restarts, to articulate
+the rate and load of restart related API requests and the performance effects on
+kubelet.
 
-In order to answer these questions, metrics tying together the number of
-container restarts, the container restart policy (inherited or declared), and
-the terminal state of a container before restarting must be tracked. For a more
-complete picture, pod lifecycle duration in CrashLoopBackoff state as opposed to
-Running state would also be useful.
 
 ### Relationship with Job API podFailurePolicy and backoffLimit
 
@@ -1737,25 +1766,26 @@ change of this KEP.
 
 ### More complex heuristics
 
-The following alternatives are all considered by the author to be in the category of "more complex heuristics", meaning solutions predicated on kubelet making runtime decisions on a variety of system or workload states or trends. These approaches all share the common negatives of being:
+The following alternatives are all considered by the author to be in the
+category of "more complex heuristics", meaning solutions predicated on kubelet
+making runtime decisions on a variety of system or workload states or trends.
+These approaches all share the common negatives of being:
 1. harder to reason about
-2. of unknown return on investment for use cases relative to the investment to implement
+2. of unknown return on investment for use cases relative to the investment to
+implement
 3. expensive to benchmark and test
 
-That being said, after this initial KEP reaches beta and beyond, it is entirely possible that the community will desire more sophisticated behavior based on or inspired by some of these considered alternatives. As mentioned above, the observability and benchmarking work done within the scope of this KEP can help users provide empirical support for further enhancements, and the following review may be useful to such efforts in the future.
-
-### Expose podFailurePolicy to nonJob Pods
-
-TBD
-
-#### Subsidize running time in backoff delay
-FIXME: Subsidize latest successful pod running time/readinessProbe/livenessProbe
-into the CrashLoopBackOff backoff, potentially restarting the backoff counter as
-a result
-
-#### Detect anomalous workload crashes
-
-TBD
+That being said, after this initial KEP reaches beta and beyond, it is entirely
+possible that the community will desire more sophisticated behavior based on or
+inspired by some of these considered alternatives. As mentioned above, the
+observability and benchmarking work done within the scope of this KEP can help
+users provide empirical support for further enhancements, and the following
+review may be useful to such efforts in the future.
+
+* Expose podFailurePolicy to nonJob Pods
+* Subsidize successful running time/readinessProbe/livenessProbe seconds in