Commit 7af241d

Merge pull request #5340 from natasha41575/metrics
KEP 1287: Instrumentation for in-place pod resize
2 parents 119b531 + 0d363e2 commit 7af241d

2 files changed: +75 -1 lines changed

keps/sig-node/1287-in-place-update-pod-resources/README.md

Lines changed: 74 additions & 0 deletions
@@ -38,6 +38,12 @@
 - [QOS Class](#qos-class)
 - [Resource Quota](#resource-quota)
 - [Affected Components](#affected-components)
+- [Instrumentation](#instrumentation)
+- [<code>kubelet_container_requested_resizes_total</code>](#kubelet_container_requested_resizes_total)
+- [<code>kubelet_pod_resize_duration_seconds</code>](#kubelet_pod_resize_duration_seconds)
+- [<code>kubelet_pod_pending_resizes</code>](#kubelet_pod_pending_resizes)
+- [<code>kubelet_pod_in_progress_resizes</code>](#kubelet_pod_in_progress_resizes)
+- [<code>kubelet_pod_deferred_resize_accepted_total</code>](#kubelet_pod_deferred_resize_accepted_total)
 - [Static CPU &amp; Memory Policy](#static-cpu--memory-policy)
 - [Future Enhancements](#future-enhancements)
 - [Mutable QOS Class &quot;Shape&quot;](#mutable-qos-class-shape)
@@ -912,6 +918,74 @@ Other components:
 * check how the change of meaning of resource requests influence other
 Kubernetes components.

+### Instrumentation
+
+The kubelet will record the following metrics:
+
+#### `kubelet_container_requested_resizes_total`
+
+This metric tracks the total number of resize attempts observed by the kubelet, counted at the container level.
+A single pod update that changes multiple containers is counted as a separate resize attempt for each container.
+
+Labels:
+- `resource` - the resource being resized. Possible values: `cpu` or `memory`. If more than one of these changes in the resize request, the counter is incremented once for each.
+- `requirement` - the resource requirement being resized. Possible values: `limits` or `requests`. If more than one of these changes in the resize request, the counter is incremented once for each.
+- `operation` - whether the value is increased, decreased, added, or removed. Possible values: `increase`, `decrease`, `add`, or `remove`.
+- `namespace` - the namespace of the pod.
+
+This metric is recorded as a counter.
+
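The label rules for `kubelet_container_requested_resizes_total` can be sketched as below. This is a minimal illustration in Python, not the kubelet's Go implementation. The function names and the dict-based resource representation are hypothetical, and the reading of `add` and `remove` as "a previously unset value is set" and "a set value is unset" is an assumption that the KEP text does not spell out.

```python
from collections import Counter

RESOURCES = ("cpu", "memory")
REQUIREMENTS = ("requests", "limits")


def resize_operations(old, new):
    """Return one (resource, requirement, operation) tuple per changed value.

    `old` and `new` are hypothetical dicts for a single container, shaped like
    {"requests": {"cpu": 100}, "limits": {"memory": 512}}.
    """
    ops = []
    for requirement in REQUIREMENTS:
        for resource in RESOURCES:
            before = old.get(requirement, {}).get(resource)
            after = new.get(requirement, {}).get(resource)
            if before == after:
                continue  # unchanged values are not counted
            if before is None:
                op = "add"  # assumption: value was unset and is now set
            elif after is None:
                op = "remove"  # assumption: value was set and is now unset
            elif after > before:
                op = "increase"
            else:
                op = "decrease"
            ops.append((resource, requirement, op))
    return ops


def record_resize(counter, namespace, old, new):
    """Increment the counter once per changed resource/requirement pair."""
    for resource, requirement, op in resize_operations(old, new):
        counter[(resource, requirement, op, namespace)] += 1
```

Under these assumptions, a request that raises the CPU request, drops the CPU limit, and sets a memory limit increments the counter three times for that container, once per label combination.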
+#### `kubelet_pod_resize_duration_seconds`
+This metric tracks the duration of [doPodResizeAction](https://github.com/kubernetes/kubernetes/blob/92de70895830ea1a9c2c6554bdab4cbee7ce867d/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L699), which
+is responsible for actuating the resize.
+
+Labels:
+- `namespace` - the namespace of the pod.
+
+This metric is recorded as a histogram.
+
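As a rough illustration of how a duration histogram labeled by namespace behaves, the sketch below times a block of work and records the elapsed seconds into buckets. It is a stdlib-only Python model, not the kubelet's instrumentation code. The `Histogram` class, `timed_resize` helper, and bucket bounds are all hypothetical, since the KEP does not specify bucket boundaries.

```python
import bisect
import time
from contextlib import contextmanager

# Hypothetical bucket upper bounds in seconds; the KEP does not specify them.
BUCKET_BOUNDS = (0.1, 0.5, 1.0, 5.0)


class Histogram:
    def __init__(self, bounds=BUCKET_BOUNDS):
        self.bounds = sorted(bounds)
        # One slot per bound plus a final overflow (+Inf) slot.
        # Counts here are per bucket; Prometheus exposes them cumulatively.
        self.counts = [0] * (len(self.bounds) + 1)
        self.sum = 0.0

    def observe(self, seconds):
        # bisect_left picks the first bound >= seconds ("le" semantics).
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.sum += seconds


@contextmanager
def timed_resize(histograms, namespace):
    """Time the wrapped block and record it under the pod's namespace."""
    hist = histograms.setdefault(namespace, Histogram())
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the wrapped block raises.
        hist.observe(time.perf_counter() - start)
```

A caller would wrap the actuation step, for example `with timed_resize(histograms, pod_namespace): ...`, where `histograms` is a plain dict shared across calls.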
+#### `kubelet_pod_pending_resizes`
+
+This metric tracks the current number of pods whose resize the kubelet marks as pending. This will make it
+easier for us to see which of the current limitations users are running into the most.
+
+Labels:
+- `reason` - why the resize is pending. Possible values: `infeasible` or `deferred`.
+- `reason_detail` - more details about why the resize is pending. Although a more detailed "message" will be provided in the `PodResizePending`
+  condition in the pod, we limit this label to only the following possible values to keep cardinality low:
+  - `guaranteed_pod_cpu_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside CPU Manager static policy.
+  - `guaranteed_pod_memory_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside Memory Manager static policy.
+  - `static_pod` - In-place resize is not supported for static pods.
+  - `swap_limitation` - In-place resize is not supported for containers with swap.
+  - `insufficient_node_allocatable` - The node doesn't have enough capacity for this resize request.
+- `namespace` - the namespace of the pod.
+
+The list of possible `reason_detail` values may shrink or grow depending on limitations that are added or removed in the future.
+
+This metric is recorded as a gauge.
+
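The cardinality constraint on `reason_detail` can be illustrated with a small sketch that recomputes the gauge from the set of currently pending resizes and rejects any label value outside the fixed lists. This is a hypothetical Python model, not kubelet code. The function name, the tuple shape of the input, and raising `ValueError` for unknown values are all assumptions made for illustration, and the sketch does not encode which `reason_detail` values pair with which `reason`, since the KEP text does not state that.

```python
from collections import Counter

ALLOWED_REASONS = {"infeasible", "deferred"}
ALLOWED_REASON_DETAILS = {
    "guaranteed_pod_cpu_manager_static_policy",
    "guaranteed_pod_memory_manager_static_policy",
    "static_pod",
    "swap_limitation",
    "insufficient_node_allocatable",
}


def pending_resize_gauge(pending):
    """Recompute the gauge from the pods that are pending right now.

    `pending` is a hypothetical iterable of
    (namespace, reason, reason_detail) tuples, one per pending pod.
    """
    gauge = Counter()
    for namespace, reason, detail in pending:
        if reason not in ALLOWED_REASONS:
            raise ValueError(f"unexpected reason: {reason!r}")
        if detail not in ALLOWED_REASON_DETAILS:
            # Free-form messages belong in the pod condition, not in a label.
            raise ValueError(f"unexpected reason_detail: {detail!r}")
        gauge[(reason, detail, namespace)] += 1
    return gauge
```

Because a gauge reflects current state, the sketch rebuilds the counts from scratch on each call instead of only ever incrementing them.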
+#### `kubelet_pod_in_progress_resizes`
+
+This metric tracks the current number of resize requests that the kubelet marks as in progress, meaning that
+the resources have been allocated but not yet actuated.
+
+Labels:
+- `namespace` - the namespace of the pod.
+
+This metric is recorded as a gauge.
+
+#### `kubelet_pod_deferred_resize_accepted_total`
+
+This metric tracks the total number of resize requests that the kubelet originally marked as deferred but
+later accepted. This metric primarily exists because a deferred resize that is accepted through the timed retry (as
+opposed to being triggered by an event such as another pod being deleted or sized down) indicates an issue in the kubelet's logic for handling deferred resizes that we should fix.
+
+Labels:
+- `accepted_reason` - whether the resize was accepted through the timed retry or due to another pod event. Possible values: `periodic_retry`, `event_based`.
+- `namespace` - the namespace of the pod.
+
+This metric is recorded as a counter.
+
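One way to read `kubelet_pod_deferred_resize_accepted_total` is to look at the share of acceptances that came from the timed retry, since the text above treats those as a sign of a gap in event handling. The sketch below models that calculation in Python. The function names and the in-memory counter are hypothetical; in practice the same ratio would be computed from scraped metric values.

```python
from collections import Counter

ACCEPTED_REASONS = ("periodic_retry", "event_based")


def record_deferred_accept(counter, namespace, accepted_reason):
    """Count one deferred resize that was later accepted."""
    if accepted_reason not in ACCEPTED_REASONS:
        raise ValueError(f"unexpected accepted_reason: {accepted_reason!r}")
    counter[(accepted_reason, namespace)] += 1


def periodic_retry_share(counter):
    """Fraction of accepted deferred resizes that needed the timed retry."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    periodic = sum(
        count
        for (reason, _namespace), count in counter.items()
        if reason == "periodic_retry"
    )
    return periodic / total
```

A share close to zero would suggest that event-based triggers cover most cases, while a persistently high share would point at events the kubelet is not reacting to.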
 ### Static CPU & Memory Policy

 Resizing pods with static CPU & memory policy configured is out-of-scope for the beta release of

keps/sig-node/1287-in-place-update-pod-resources/kep.yaml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ replaces:

 stage: "beta"

-latest-milestone: "v1.33"
+latest-milestone: "v1.34"

 milestone:
   alpha: "v1.27"
