|
37 | 37 | - [QOS Class](#qos-class)
|
38 | 38 | - [Resource Quota](#resource-quota)
|
39 | 39 | - [Affected Components](#affected-components)
|
| 40 | + - [Instrumentation](#instrumentation) |
| 41 | + - [<code>kubelet_pod_resize_requests_total</code>](#kubelet_pod_resize_requests_total) |
| 42 | + - [<code>kubelet_container_resize_requests_total</code>](#kubelet_container_resize_requests_total) |
| 43 | + - [<code>kubelet_pod_resize_sli_duration_seconds</code>](#kubelet_pod_resize_sli_duration_seconds) |
| 44 | + - [<code>kubelet_pod_infeasible_resize_total</code>](#kubelet_pod_infeasible_resize_total) |
| 45 | + - [<code>kubelet_pod_deferred_resize_accepted_total</code>](#kubelet_pod_deferred_resize_accepted_total) |
40 | 46 | - [Static CPU & Memory Policy](#static-cpu--memory-policy)
|
41 | 47 | - [Future Enhancements](#future-enhancements)
|
42 | 48 | - [Mutable QOS Class "Shape"](#mutable-qos-class-shape)
|
@@ -881,6 +887,80 @@ Other components:
|
881 | 887 | * check how the change of meaning of resource requests influences other
|
882 | 888 | Kubernetes components.
|
883 | 889 |
|
| 890 | +### Instrumentation |
| 891 | + |
| 892 | +The kubelet will record the following metrics: |
| 893 | + |
| 894 | +#### `kubelet_pod_resize_requests_total` |
| 895 | + |
| 896 | +This metric tracks the total number of resize requests observed by the kubelet, counted at the pod level. |
| 897 | +A single pod update changing multiple containers will be considered a single resize request. |
| 898 | + |
| 899 | +Labels: |
| 900 | +- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, |
| 901 | +we increment the counter multiple times, once for each. This means that a single pod update changing multiple |
| 902 | +resource types will be considered multiple requests for this metric. |
| 903 | +- `operation_type` - whether the resize is a net increase or a decrease (taken as an aggregate across |
| 904 | +all containers in the pod). Possible values: `increase`, `decrease`, `add`, or `remove`. |
| 905 | + |
| 906 | +This metric is recorded as a counter. |
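
The counting rule above can be illustrated with a small sketch. This is not the kubelet's implementation: the function name `record_pod_resize`, the dictionary shapes, and the decision to skip a net-zero change are all assumptions made for illustration, and the `add` and `remove` operation types are left out because their exact semantics are not spelled out here.

```python
from collections import Counter

RESOURCE_TYPES = ("cpu_limits", "cpu_requests", "memory_limits", "memory_requests")

def record_pod_resize(counter, old, new):
    """Increment `counter` once per resource type whose pod-wide total changed.

    `old` and `new` map container name -> {resource_type: quantity}.
    The operation type is the net direction summed across all containers.
    """
    for rt in RESOURCE_TYPES:
        before = sum(c.get(rt, 0) for c in old.values())
        after = sum(c.get(rt, 0) for c in new.values())
        if after == before:
            continue  # assumption: a net-zero change is not counted
        op = "increase" if after > before else "decrease"
        counter[(rt, op)] += 1

# Hypothetical pod update touching two containers and two resource types.
pod_resize_requests_total = Counter()
old = {"app": {"cpu_requests": 500, "memory_limits": 256}, "sidecar": {"cpu_requests": 100}}
new = {"app": {"cpu_requests": 800, "memory_limits": 128}, "sidecar": {"cpu_requests": 200}}
record_pod_resize(pod_resize_requests_total, old, new)
```

Both containers change `cpu_requests`, yet the pod-level counter for `cpu_requests` goes up only once, while the `memory_limits` change is counted separately under `decrease`.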
| 907 | + |
| 908 | +#### `kubelet_container_resize_requests_total` |
| 909 | + |
| 910 | +This metric tracks the total number of resize requests observed by the kubelet, counted at the container level. |
| 911 | +A single pod update changing multiple containers will be considered separate resize requests. |
| 912 | + |
| 913 | +Labels: |
| 914 | +- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, |
| 915 | +we increment the counter multiple times, once for each. This means that a single pod update changing multiple |
| 916 | +resource types will be considered multiple requests for this metric. |
| 917 | +- `operation_type` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`. |
| 918 | + |
| 919 | +This metric is recorded as a counter. |
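
A possible way to picture the container-level counting is sketched below. The names `classify` and `record_container_resizes` are hypothetical, and the reading of `add` (the resource was previously unset) and `remove` (the resource is now unset) is an assumption, since this section does not define them.

```python
from collections import Counter

RESOURCE_TYPES = ("cpu_limits", "cpu_requests", "memory_limits", "memory_requests")

def classify(old_value, new_value):
    """Return the operation_type for one resource on one container, or None if unchanged."""
    if old_value == new_value:
        return None
    if old_value is None:
        return "add"      # assumption: resource was unset before the resize
    if new_value is None:
        return "remove"   # assumption: resource is unset after the resize
    return "increase" if new_value > old_value else "decrease"

def record_container_resizes(counter, old, new):
    """Increment `counter` once per (container, resource type) that changed."""
    for name in old.keys() & new.keys():
        for rt in RESOURCE_TYPES:
            op = classify(old[name].get(rt), new[name].get(rt))
            if op is not None:
                counter[(rt, op)] += 1

# Hypothetical pod update touching two containers.
container_resize_requests_total = Counter()
old = {"app": {"cpu_requests": 500}, "sidecar": {"cpu_requests": 100, "memory_limits": 64}}
new = {"app": {"cpu_requests": 800, "cpu_limits": 1000}, "sidecar": {"cpu_requests": 200}}
record_container_resizes(container_resize_requests_total, old, new)
```

Here the two `cpu_requests` increases are counted separately, one per container, which is the difference from the pod-level metric.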
| 920 | + |
| 921 | +#### `kubelet_pod_resize_sli_duration_seconds` |
| 922 | + |
| 923 | +This metric tracks the latency between when the kubelet accepts a resize request and when it finishes actuating |
| 924 | +the request. More precisely, this metric tracks the total amount of time that the `PodResizeInProgress` condition |
| 925 | +is present on a pod. |
| 926 | + |
| 927 | +Labels: |
| 928 | +- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, |
| 929 | +we record the duration multiple times, once for each. |
| 930 | +- `operation_type` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`. |
| 931 | + |
| 932 | +This metric is recorded as a gauge. |
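
As a rough sketch of the measurement described above, the duration can be derived from the moments the `PodResizeInProgress` condition appears on and disappears from a pod. The class `ResizeDurationTracker` and its methods are hypothetical, labels are omitted for brevity, and how the value is exported is not modeled.

```python
class ResizeDurationTracker:
    """Tracks how long the PodResizeInProgress condition stays on each pod."""

    def __init__(self):
        self._started = {}       # pod UID -> time the condition first appeared
        self.observations = []   # (pod UID, seconds the condition was present)

    def condition_added(self, pod_uid, now):
        # Keep the earliest timestamp if the condition is reported more than once.
        self._started.setdefault(pod_uid, now)

    def condition_removed(self, pod_uid, now):
        start = self._started.pop(pod_uid, None)
        if start is None:
            return None          # condition was never seen; nothing to record
        duration = now - start
        self.observations.append((pod_uid, duration))
        return duration

# Hypothetical timeline, in seconds.
tracker = ResizeDurationTracker()
tracker.condition_added("pod-a", 10.0)
tracker.condition_added("pod-a", 12.0)   # repeated report, start time unchanged
elapsed = tracker.condition_removed("pod-a", 15.5)
```

Keeping the earliest start time means repeated status syncs during a single resize do not shorten the measured duration.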
| 933 | + |
| 934 | +#### `kubelet_pod_infeasible_resize_total` |
| 935 | + |
| 936 | +This metric tracks the total count of resize requests that the kubelet marks as infeasible. This will make it |
| 937 | +easier for us to see which of the current limitations users are running into the most. |
| 938 | + |
| 939 | +Labels: |
| 940 | +- `reason` - why the resize is infeasible. Although a more detailed "reason" will be provided in the `PodResizePending` |
| 941 | +condition in the pod, we limit this label to only the following possible values to keep cardinality low: |
| 942 | + - `guaranteed_pod_cpu_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside CPU Manager static policy. |
| 943 | + - `guaranteed_pod_memory_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside Memory Manager static policy. |
| 944 | + - `static_pod` - In-place resize is not supported for static pods. |
| 945 | + - `swap_limitation` - In-place resize is not supported for containers with swap. |
| 946 | + - `node_capacity` - The node doesn't have enough capacity for this resize request. |
| 947 | + |
| 948 | +This list of possible reasons may shrink or grow depending on limitations that are added or removed in the future. |
| 949 | + |
| 950 | +This metric is recorded as a counter. |
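
One way to keep the `reason` label bounded is to accept only values from a fixed set and reject free-form messages. The sketch below assumes that approach; the function name `record_infeasible_resize` and the choice to raise on an unknown value are illustrative, not the kubelet's behavior.

```python
from collections import Counter

# The five reasons listed above; free-form messages stay in the pod condition.
INFEASIBLE_REASONS = frozenset({
    "guaranteed_pod_cpu_manager_static_policy",
    "guaranteed_pod_memory_manager_static_policy",
    "static_pod",
    "swap_limitation",
    "node_capacity",
})

def record_infeasible_resize(counter, reason):
    """Increment the counter for a known reason; refuse unbounded label values."""
    if reason not in INFEASIBLE_REASONS:
        raise ValueError(f"unknown infeasible-resize reason: {reason!r}")
    counter[reason] += 1

pod_infeasible_resize_total = Counter()
record_infeasible_resize(pod_infeasible_resize_total, "node_capacity")
record_infeasible_resize(pod_infeasible_resize_total, "static_pod")
record_infeasible_resize(pod_infeasible_resize_total, "node_capacity")
```

Because the set is closed, the number of time series for this metric can never exceed the number of listed reasons.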
| 951 | + |
| 952 | +#### `kubelet_pod_deferred_resize_accepted_total` |
| 953 | + |
| 954 | +This metric tracks the total number of resize requests that the kubelet originally marked as deferred but |
| 955 | +later accepted. It exists primarily because a deferred resize that is accepted through the timed retry, rather |
| 956 | +than by an explicit signal, indicates an issue in the kubelet's logic for handling deferred |
| 957 | +resizes that we should fix. |
| 958 | + |
| 959 | +Labels: |
| 960 | +- `retry_reason` - whether the resize was accepted through the timed retry or explicitly signaled. Possible values: `timed`, `signaled`. |
| 961 | + |
| 962 | +This metric is recorded as a counter. |
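
To show how this counter might be read, the sketch below computes the share of deferred resizes that were accepted only through the timed retry. The helper names `record_deferred_accepted` and `timed_fraction` are hypothetical, and treating a non-zero share as the signal worth investigating is an interpretation of the paragraph above.

```python
from collections import Counter

RETRY_REASONS = ("timed", "signaled")

def record_deferred_accepted(counter, retry_reason):
    """Count one deferred resize that was later accepted."""
    if retry_reason not in RETRY_REASONS:
        raise ValueError(f"unknown retry_reason: {retry_reason!r}")
    counter[retry_reason] += 1

def timed_fraction(counter):
    """Share of accepted deferred resizes that needed the timed retry."""
    total = sum(counter[r] for r in RETRY_REASONS)
    return counter["timed"] / total if total else 0.0

pod_deferred_resize_accepted_total = Counter()
for reason in ("signaled", "signaled", "timed", "signaled"):
    record_deferred_accepted(pod_deferred_resize_accepted_total, reason)
fraction = timed_fraction(pod_deferred_resize_accepted_total)
```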
| 963 | + |
884 | 964 | ### Static CPU & Memory Policy
|
885 | 965 |
|
886 | 966 | Resizing pods with static CPU & memory policy configured is out-of-scope for the beta release of
|