Commit 8caa42a

Observability --> Benchmarking, and clean up alternatives
Signed-off-by: lauralorenz <[email protected]>
1 parent 9d3daed commit 8caa42a

keps/sig-node/4603-tune-crashloopbackoff/README.md

Lines changed: 62 additions & 32 deletions
@@ -712,28 +712,57 @@ does during pod restarts.
 * Logs information about all those container operations (utilizing disk IO and
 “spamming” logs)

-### Observability
+### Benchmarking

 Again, let it be known that by definition this KEP will cause pods to restart
 faster and more often than the current status quo and such a change is desired.
-However, to do so as safely as possible, improved visibility into cluster
-restart behavior is needed both for benchmarking this change and for cluster
-operators to be able to quantify the risk posed to their specific clusters on
-upgrade.
-
-This KEP requires the ability to determine, for a given percentage of
-heterogeneity between "Succeeded" terminating pods, crashing pods whose
-`restartPolicy: Always`, and crashing pods whose `restartPolicy: Rapid`,
+However, to do so as safely as possible, it is required that during the alpha
+period, we reevaluate the SLIs and SLOs and benchmarks related to this change and
+expose clearly the methodology needed for cluster operators to be able to quantify the
+risk posed to their specific clusters on upgrade.
+
+To best reason about the changes in this KEP, we require the ability to
+determine, for a given percentage of heterogeneity between "Succeeded"
+terminating pods, crashing pods whose `restartPolicy: Always`, and crashing pods
+whose `restartPolicy: Rapid`,
 * what is the load and rate of Pod restart related API requests to the API
 server?
-* what are the performance (memory, CPU, and pod start latency) effects on the kubelet
-component?
+* what are the performance (memory, CPU, and pod start latency) effects on the
+kubelet component?
+
+Today there are alpha SLIs in Kubernetes that can observe that impact in
+aggregate:
+* Kubelet component CPU and memory
+* `kubelet_http_inflight_requests`
+* `kubelet_http_requests_duration_seconds`
+* `kubelet_http_requests_total`
+* `kubelet_pod_worker_duration_seconds`
+* `kubelet_runtime_operations_duration_seconds`
+* `kubelet_pod_start_duration_seconds`
+* `kubelet_pod_start_sli_duration_seconds`
+
+In addition, estimates given the currently suggested changes in API requests are
+included in [Risks and Mitigations](#risks-and-mitigations) and were deciding
+factors in specific changes to the backoff curves. Since the changes in this
+proposal are deterministic, this is pre-calculatable for a given heterogeneity of
+quantity and rate of restarting pods.
+
+In addition, the `kube-state-metrics` project already implements
+restart-specific metadata for metrics that can be used to observe pod restart
+latency in more detail, including:
+* `kube_pod_container_status_restarts_total`
+* `kube_pod_restart_policy`
+* `kube_pod_start_time`
+* `kube_pod_created`
+
+During the alpha period, these metrics, the SIG-Scalability benchmarking tests,
+added kubelet performance tests, and manual benchmarking by the author against
+`kube-state-metrics` will be used to answer the above questions, tying together the
+container restart policy (inherited or declared), the terminal state of a
+container before restarting, and the number of container restarts, to articulate
+the rate and load of restart related API requests and the performance effects on
+kubelet.

-In order to answer these questions, metrics tying together the number of
-container restarts, the container restart policy (inherited or declared), and
-the terminal state of a container before restarting must be tracked. For a more
-complete picture, pod lifecycle duration in CrashLoopBackoff state as opposed to
-Running state would also be useful.

 ### Relationship with Job API podFailurePolicy and backoffLimit
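
The added Benchmarking text notes that, because the backoff changes are deterministic, restart counts can be pre-calculated for a given mix of restarting pods. A minimal sketch of that arithmetic follows. It rests on assumptions that are not stated in this diff: the container crashes immediately after every start, the backoff never resets, and the default parameters (10s initial delay, doubling, 300s cap) are the commonly documented kubelet CrashLoopBackOff values rather than figures taken from this KEP. The function names are hypothetical.

```python
def backoff_delays(initial_s, factor, cap_s):
    """Yield successive restart delays: initial_s, initial_s*factor, ..., capped at cap_s."""
    delay = initial_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)


def restarts_in_window(window_s, initial_s=10.0, factor=2.0, cap_s=300.0):
    """Count restarts within window_s for a container that crashes instantly.

    Simplified model (an assumption): zero run time between restarts and
    no backoff reset, so only the delays themselves consume the window.
    """
    elapsed = 0.0
    restarts = 0
    for delay in backoff_delays(initial_s, factor, cap_s):
        elapsed += delay
        if elapsed > window_s:
            return restarts
        restarts += 1


def fleet_restarts(crashing_pods, window_s, **curve):
    """Rough upper bound on restarts for a number of identically crashing pods."""
    return crashing_pods * restarts_in_window(window_s, **curve)
```

Under this model, comparing curves is a matter of passing different `initial_s` and `cap_s` values to `restarts_in_window`; any alternative values used for comparison are illustrative inputs, not the KEP's chosen parameters.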

@@ -1737,25 +1766,26 @@ change of this KEP.

 ### More complex heuristics

-The following alternatives are all considered by the author to be in the category of "more complex heuristics", meaning solutions predicated on kubelet making runtime decisions on a variety of system or workload states or trends. These approaches all share the common negatives of being:
+The following alternatives are all considered by the author to be in the
+category of "more complex heuristics", meaning solutions predicated on kubelet
+making runtime decisions on a variety of system or workload states or trends.
+These approaches all share the common negatives of being:
 1. harder to reason about
-2. of unknown return on investment for use cases relative to the investment to implement
+2. of unknown return on investment for use cases relative to the investment to
+implement
 3. expensive to benchmark and test

-That being said, after this initial KEP reaches beta and beyond, it is entirely possible that the community will desire more sophisticated behavior based on or inspired by some of these considered alternatives. As mentioned above, the observability and benchmarking work done within the scope of this KEP can help users provide empirical support for further enhancements, and the following review may be useful to such efforts in the future.
-
-### Expose podFailurePolicy to nonJob Pods
-
-TBD
-
-#### Subsidize running time in backoff delay
-FIXME: Subsidize latest successful pod running time/readinessProbe/livenessProbe
-into the CrashLoopBackOff backoff, potentially restarting the backoff counter as
-a result
-
-#### Detect anomalous workload crashes
-
-TBD
+That being said, after this initial KEP reaches beta and beyond, it is entirely
+possible that the community will desire more sophisticated behavior based on or
+inspired by some of these considered alternatives. As mentioned above, the
+observability and benchmarking work done within the scope of this KEP can help
+users provide empirical support for further enhancements, and the following
+review may be useful to such efforts in the future.
+
+* Expose podFailurePolicy to nonJob Pods
+* Subsidize successful running time/readinessProbe/livenessProbe seconds in
+current backoff delay
+* Detect anomalous workload crashes


 ## Infrastructure Needed (Optional)
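
The Benchmarking text in this commit names `kube_pod_container_status_restarts_total` from `kube-state-metrics` as one input for quantifying restart rate. As a rough illustration of how two scrapes of that counter could be turned into a per-container rate, here is a minimal sketch. The function name, the dictionary layout keyed by `(namespace, pod, container)`, and the handling of counter resets and new series are assumptions made for illustration, not part of the KEP or of `kube-state-metrics`.

```python
def restart_rates(earlier, later, interval_s):
    """Per-container restarts per minute between two scrapes.

    earlier, later: dicts mapping (namespace, pod, container) to the
    counter value of kube_pod_container_status_restarts_total.
    interval_s: seconds between the two scrapes.
    """
    rates = {}
    for key, now in later.items():
        # Assumption: a series absent from the earlier scrape started at zero.
        before = earlier.get(key, 0.0)
        # Assumption: a lower value means the counter reset (for example the
        # pod was recreated), so only the new value is counted.
        delta = now - before if now >= before else now
        rates[key] = delta * 60.0 / interval_s
    return rates
```

Grouping the resulting rates by restart policy (for example using `kube_pod_restart_policy`) would be one way to compare pod classes, though that join is left out of this sketch.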
