keps/sig-node/4603-tune-crashloopbackoff: 1 file changed, +22 -0 lines changed

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

The biggest expected bottleneck will be the kubelet, as it is expected to
receive more restart requests and to incur the overhead discussed in [Design
Details](#kubelet-overhead-analysis) more often. Cluster operators should
closely watch these existing metrics (example queries are sketched after the
list):

* Kubelet component CPU and memory
* `kubelet_http_inflight_requests`
* `kubelet_http_requests_duration_seconds`
* `kubelet_http_requests_total`
* `kubelet_pod_worker_duration_seconds`
* `kubelet_runtime_operations_duration_seconds`

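As a non-authoritative illustration, the sketch below polls a Prometheus server
that scrapes the kubelets for a few saturation signals built from these
metrics. The Prometheus address, the `instance` label, and the 5 minute rate
windows are assumptions about a particular monitoring setup, not part of this
KEP.

```go
// Sketch only: poll an assumed Prometheus endpoint for kubelet saturation
// signals derived from the metrics listed above.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed in-cluster Prometheus address.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Label names such as `instance` depend on the scrape configuration.
	queries := map[string]string{
		"kubelet request rate (req/s)": `sum by (instance) (rate(kubelet_http_requests_total[5m]))`,
		"in-flight kubelet requests":   `sum by (instance) (kubelet_http_inflight_requests)`,
		"p99 pod worker duration (s)":  `histogram_quantile(0.99, sum by (instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket[5m])))`,
		"p99 runtime op duration (s)":  `histogram_quantile(0.99, sum by (instance, le) (rate(kubelet_runtime_operations_duration_seconds_bucket[5m])))`,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	for name, q := range queries {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			fmt.Printf("%s: query error: %v\n", name, err)
			continue
		}
		if len(warnings) > 0 {
			fmt.Printf("%s: warnings: %v\n", name, warnings)
		}
		fmt.Printf("%s:\n%v\n", name, result)
	}
}
```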

Most important to the perception of the end user is the kubelet's actual
ability to create pods, which we measure as the latency between a pod's
creation timestamp and its actual start. The following existing metrics cover
all pods, not just those that are restarting, but at a certain saturation of
restarting pods they would be expected to degrade and must be watched to
determine whether to roll back (see the sketch after this list):

* `kubelet_pod_start_duration_seconds`
* `kubelet_pod_start_sli_duration_seconds`

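A minimal sketch of how such a rollback signal might be automated is below; the
Prometheus address, the 15 minute window, and the 30 second threshold are
illustrative assumptions that an operator would replace with a baseline
measured before enabling the feature.

```go
// Sketch only: compare cluster-wide p95 pod start latency against an assumed
// baseline to decide whether the feature gate should be rolled back.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// Assumed in-cluster Prometheus address.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Cluster-wide p95 pod start SLI latency over the last 15 minutes; this
	// covers all pods, not just restarting ones, as noted above.
	const query = `histogram_quantile(0.95, sum by (le) (rate(kubelet_pod_start_sli_duration_seconds_bucket[15m])))`
	const rollbackThresholdSeconds = 30.0 // assumed per-cluster baseline

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		fmt.Println("no data for pod start SLI duration")
		return
	}
	p95 := float64(vec[0].Value)
	fmt.Printf("p95 pod start SLI duration: %.1fs\n", p95)
	if p95 > rollbackThresholdSeconds {
		fmt.Println("pod start latency degraded; consider rolling back the feature gate")
	}
}
```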

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--