Commit b1feb73

Add metrics discussion re: rollback

Signed-off-by: Laura Lorenz <[email protected]>

1 parent adf8d04 · commit b1feb73

File tree

  • keps/sig-node/4603-tune-crashloopbackoff

1 file changed: +22 −0 lines changed

keps/sig-node/4603-tune-crashloopbackoff/README.md

Lines changed: 22 additions & 0 deletions
@@ -1135,6 +1135,28 @@ What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

The biggest expected bottleneck will be the kubelet, as it is expected to
receive more restart requests and to trigger all the overhead discussed in
[Design Details](#kubelet-overhead-analysis) more often. Cluster operators
should be closely watching these existing metrics:

* Kubelet component CPU and memory
* `kubelet_http_inflight_requests`
* `kubelet_http_requests_duration_seconds`
* `kubelet_http_requests_total`
* `kubelet_pod_worker_duration_seconds`
* `kubelet_runtime_operations_duration_seconds`
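As an illustration, kubelet saturation could be watched with PromQL queries along these lines (a sketch, assuming a standard per-node Prometheus scrape of the kubelet that attaches an `instance` label; neither the queries nor any thresholds derived from them are prescribed by this KEP):

```promql
# Rate of kubelet HTTP requests, as a rough proxy for restart-driven load.
sum by (instance) (rate(kubelet_http_requests_total[5m]))

# p95 pod worker sync duration; sustained growth suggests kubelet overhead.
histogram_quantile(
  0.95,
  sum by (le, instance) (rate(kubelet_pod_worker_duration_seconds_bucket[5m]))
)
```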

Most important to the end user's perception is the kubelet's actual ability to
create pods, which is measured as the latency between a pod's creation
timestamp and its actual start. The following existing metrics cover all pods,
not just ones that are restarting, but at a certain saturation of restarting
pods these metrics would be expected to degrade and must be watched to
determine rollback:

* `kubelet_pod_start_duration_seconds`
* `kubelet_pod_start_sli_duration_seconds`
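A rollback signal on pod start latency could be sketched in PromQL as below. This is an illustrative query, not part of the KEP: the 45-second threshold is a placeholder an operator would tune to their own baseline, and the `instance` grouping assumes a per-node kubelet scrape.

```promql
# p95 pod start SLI latency over the last 5 minutes, per node.
# kubelet_pod_start_sli_duration_seconds is a histogram; a sustained rise
# after enabling the feature is a signal to consider rollback.
histogram_quantile(
  0.95,
  sum by (le, instance) (rate(kubelet_pod_start_sli_duration_seconds_bucket[5m]))
) > 45
```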

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
