Commit ca1a000

KEP 1287: Instrumentation for in-place pod resize
1 parent 047426d commit ca1a000

1 file changed: +80 -0 lines changed
keps/sig-node/1287-in-place-update-pod-resources/README.md

Lines changed: 80 additions & 0 deletions
@@ -37,6 +37,12 @@
- [QOS Class](#qos-class)
- [Resource Quota](#resource-quota)
- [Affected Components](#affected-components)
- [Instrumentation](#instrumentation)
- [<code>kubelet_pod_resize_requests_total</code>](#kubelet_pod_resize_requests_total)
- [<code>kubelet_container_resize_requests_total</code>](#kubelet_container_resize_requests_total)
- [<code>kubelet_pod_resize_sli_duration_seconds</code>](#kubelet_pod_resize_sli_duration_seconds)
- [<code>kubelet_pod_infeasible_resize_total</code>](#kubelet_pod_infeasible_resize_total)
- [<code>kubelet_pod_deferred_resize_accepted_total</code>](#kubelet_pod_deferred_resize_accepted_total)
- [Static CPU &amp; Memory Policy](#static-cpu--memory-policy)
- [Future Enhancements](#future-enhancements)
- [Mutable QOS Class &quot;Shape&quot;](#mutable-qos-class-shape)
@@ -881,6 +887,80 @@ Other components:
* check how the change of meaning of resource requests influence other
  Kubernetes components.

### Instrumentation

The kubelet will record the following metrics:

#### `kubelet_pod_resize_requests_total`

This metric tracks the total number of resize requests observed by the kubelet, counted at the pod level.
A single pod update changing multiple containers will be considered a single resize request.

Labels:
- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request,
  we increment the counter multiple times, once for each. This means that a single pod update changing multiple
  resource types will be considered multiple requests for this metric.
- `operation_type` - whether the resize is a net increase or a decrease (taken as an aggregate across
  all containers in the pod). Possible values: `increase`, `decrease`, `add`, or `remove`.

This metric is recorded as a counter.

#### `kubelet_container_resize_requests_total`

This metric tracks the total number of resize requests observed by the kubelet, counted at the container level.
A single pod update changing multiple containers will be considered separate resize requests.

Labels:
- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request,
  we increment the counter multiple times, once for each. This means that a single pod update changing multiple
  resource types will be considered multiple requests for this metric.
- `operation_type` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`.

This metric is recorded as a counter.
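
The difference between pod-level and container-level counting can be sketched as follows. This is a minimal illustration, not the kubelet's implementation: the `amount`, `resources`, and `labels` types and every function name are hypothetical. It also assumes that `add` and `remove` mean a resource field being newly set or unset, and that the pod-level aggregate is the sum across containers; the text above does not define either.

```go
package main

// amount is a hypothetical stand-in for a resource quantity.
// Set is false when the field is absent from the container spec.
type amount struct {
	Value int64
	Set   bool
}

// resources maps a resource_type label value to its amount for one container.
type resources map[string]amount

// labels is one (resource_type, operation_type) pair on a counter.
type labels struct{ ResourceType, OperationType string }

var resourceTypes = []string{"cpu_limits", "cpu_requests", "memory_limits", "memory_requests"}

// operation classifies a single change; "" means nothing changed.
// ASSUMPTION: "add"/"remove" mean the field became set/unset.
func operation(before, after amount) string {
	switch {
	case !before.Set && !after.Set:
		return ""
	case !before.Set:
		return "add"
	case !after.Set:
		return "remove"
	case after.Value > before.Value:
		return "increase"
	case after.Value < before.Value:
		return "decrease"
	}
	return ""
}

// containerRequests counts one request per changed container and resource type.
// Containers are matched by index, which assumes both slices share an order.
func containerRequests(before, after []resources) map[labels]int {
	counts := map[labels]int{}
	for i := 0; i < len(before) && i < len(after); i++ {
		for _, rt := range resourceTypes {
			if op := operation(before[i][rt], after[i][rt]); op != "" {
				counts[labels{rt, op}]++
			}
		}
	}
	return counts
}

// sum aggregates one resource type across all containers in a pod.
func sum(containers []resources, rt string) amount {
	var total amount
	for _, c := range containers {
		if a := c[rt]; a.Set {
			total.Value += a.Value
			total.Set = true
		}
	}
	return total
}

// podRequests counts at most one request per resource type, using the
// net change across all containers (ASSUMPTION: aggregate = sum).
func podRequests(before, after []resources) map[labels]int {
	counts := map[labels]int{}
	for _, rt := range resourceTypes {
		if op := operation(sum(before, rt), sum(after, rt)); op != "" {
			counts[labels{rt, op}]++
		}
	}
	return counts
}
```

Under these assumptions, an update that raises `cpu_requests` on two containers yields two container-level increments but a single pod-level increment for `cpu_requests`/`increase`.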

#### `kubelet_pod_resize_sli_duration_seconds`

This metric tracks the latency between when the kubelet accepts a resize request and when it finishes actuating
the request. More precisely, this metric tracks the total amount of time that the `PodResizeInProgress` condition
is present on a pod.

Labels:
- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request,
  we record the duration multiple times, once for each.
- `operation_type` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`.

This metric is recorded as a gauge.

#### `kubelet_pod_infeasible_resize_total`

This metric tracks the total count of resize requests that the kubelet marks as infeasible. This will make it
easier for us to see which of the current limitations users are running into the most.

Labels:
- `reason` - why the resize is infeasible. Although a more detailed "reason" will be provided in the `PodResizePending`
  condition in the pod, we limit this label to only the following possible values to keep cardinality low:
  - `guaranteed_pod_cpu_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside CPU Manager static policy.
  - `guaranteed_pod_memory_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside Memory Manager static policy.
  - `static_pod` - In-place resize is not supported for static pods.
  - `swap_limitation` - In-place resize is not supported for containers with swap.
  - `node_capacity` - The node doesn't have enough capacity for this resize request.

This list of possible reasons may shrink or grow depending on limitations that are added or removed in the future.

This metric is recorded as a counter.
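
One way to keep the `reason` label bounded is to accept only values from a fixed allow-list, so a free-form condition message can never become a label value. The sketch below assumes exactly that; `infeasibleResizeCounter` and its methods are hypothetical, and only the five reason strings come from the list above.

```go
package main

// infeasibleReasons is the closed set of label values listed above.
var infeasibleReasons = map[string]bool{
	"guaranteed_pod_cpu_manager_static_policy":    true,
	"guaranteed_pod_memory_manager_static_policy": true,
	"static_pod":      true,
	"swap_limitation": true,
	"node_capacity":   true,
}

// infeasibleResizeCounter is a hypothetical in-memory stand-in for the
// counter, with one series per allowed reason.
type infeasibleResizeCounter struct {
	counts map[string]int
}

func newInfeasibleResizeCounter() *infeasibleResizeCounter {
	return &infeasibleResizeCounter{counts: map[string]int{}}
}

// inc increments the series for reason and reports whether it was accepted.
// Unknown reasons are rejected so they cannot create new series.
func (c *infeasibleResizeCounter) inc(reason string) bool {
	if !infeasibleReasons[reason] {
		return false
	}
	c.counts[reason]++
	return true
}
```

With this shape the number of series is capped at the size of the allow-list, and extending the list is a deliberate code change.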

#### `kubelet_pod_deferred_resize_accepted_total`

This metric tracks the total number of resize requests that the kubelet originally marked as deferred but
later accepted. This metric primarily exists because if a deferred resize is accepted through the timed retry as
opposed to being explicitly signaled, it indicates an issue in the kubelet's logic for handling deferred
resizes that we should fix.

Labels:
- `retry_reason` - whether the resize was accepted through the timed retry or explicitly signaled. Possible values: `timed`, `signaled`.

This metric is recorded as a counter.
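
Because acceptances through the timed retry are the signal of interest, a consumer of this metric might look at the share of `timed` acceptances. The sketch below is only an illustration of that reading: `deferredAcceptedCounter` and `timedShare` are hypothetical names, and the text above does not prescribe any particular threshold or ratio.

```go
package main

// deferredAcceptedCounter is a hypothetical in-memory stand-in for the
// counter, split by the two retry_reason values.
type deferredAcceptedCounter struct {
	timed, signaled int
}

// inc increments the series for retryReason and reports whether the
// value was one of the two allowed label values.
func (c *deferredAcceptedCounter) inc(retryReason string) bool {
	switch retryReason {
	case "timed":
		c.timed++
	case "signaled":
		c.signaled++
	default:
		return false
	}
	return true
}

// timedShare returns the fraction of deferred resizes that were accepted
// only through the timed retry; 0 when nothing has been recorded.
func (c *deferredAcceptedCounter) timedShare() float64 {
	total := c.timed + c.signaled
	if total == 0 {
		return 0
	}
	return float64(c.timed) / float64(total)
}
```

A share that stays above zero would point at deferred resizes that were not retried by an explicit signal.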

### Static CPU & Memory Policy

Resizing pods with static CPU & memory policy configured is out-of-scope for the beta release of
