- [QOS Class](#qos-class)
- [Resource Quota](#resource-quota)
- [Affected Components](#affected-components)
- [Instrumentation](#instrumentation)
  - [<code>kubelet_container_requested_resizes_total</code>](#kubelet_container_requested_resizes_total)
  - [<code>kubelet_pod_resize_duration_seconds</code>](#kubelet_pod_resize_duration_seconds)
  - [<code>kubelet_pod_pending_resizes</code>](#kubelet_pod_pending_resizes)
  - [<code>kubelet_pod_in_progress_resizes</code>](#kubelet_pod_in_progress_resizes)
  - [<code>kubelet_pod_deferred_resize_accepted_total</code>](#kubelet_pod_deferred_resize_accepted_total)
- [Static CPU & Memory Policy](#static-cpu--memory-policy)
- [Future Enhancements](#future-enhancements)
- [Mutable QOS Class "Shape"](#mutable-qos-class-shape)
@@ -912,6 +918,74 @@ Other components:
* check how the change in the meaning of resource requests influences other
Kubernetes components.

### Instrumentation

The kubelet will record the following metrics:

#### `kubelet_container_requested_resizes_total`

This metric tracks the total number of resize attempts observed by the kubelet, counted at the container level.
A single pod update that changes multiple containers is counted as a separate resize attempt for each changed container.

Labels:
- `resource` - the resource being resized. Possible values: `cpu` or `memory`. If both change in the same resize request, the counter is incremented once for each.
- `requirement` - whether the change applies to the container's resource requests or limits. Possible values: `limits` or `requests`. If both change in the same resize request, the counter is incremented once for each.
- `operation` - the kind of change. Possible values: `increase` or `decrease` when an existing value is raised or lowered, and `add` or `remove` when a previously unset value is set or a set value is removed.
- `namespace` - the namespace of the pod.

This metric is recorded as a counter.

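The following minimal Python sketch applies the labeling rules above to a single pod update. It is not kubelet code. It assumes a container's resources are modeled as plain integers keyed by requirement and resource name, where a missing key means the value is unset. It also assumes that `add` and `remove` mean a request or limit going from unset to set and from set to unset. The sketch reads the two "once for each" rules together as one increment per changed (`resource`, `requirement`) pair. `resize_increments` and `observe_pod_update` are hypothetical names, and a `collections.Counter` stands in for a real Prometheus counter.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Optional

# Hypothetical model: {"requests": {"cpu": 500}, "limits": {"memory": 256}}.
# Quantities are plain ints; a missing key means the value is unset.
Resources = Dict[str, Dict[str, int]]


class ResizeLabels(NamedTuple):
    resource: str      # "cpu" or "memory"
    requirement: str   # "requests" or "limits"
    operation: str     # "increase", "decrease", "add", or "remove"
    namespace: str


def classify(old: Optional[int], new: Optional[int]) -> Optional[str]:
    """Map an old/new value pair to an `operation` label, or None if unchanged."""
    if old is None and new is None:
        return None
    if old is None:
        return "add"      # assumption: previously unset, now set
    if new is None:
        return "remove"   # assumption: previously set, now unset
    if new > old:
        return "increase"
    if new < old:
        return "decrease"
    return None


def resize_increments(namespace: str, old: Resources, new: Resources) -> List[ResizeLabels]:
    """One label set per changed (resource, requirement) pair of one container."""
    out = []
    for requirement in ("requests", "limits"):
        for resource in ("cpu", "memory"):
            op = classify(old.get(requirement, {}).get(resource),
                          new.get(requirement, {}).get(resource))
            if op is not None:
                out.append(ResizeLabels(resource, requirement, op, namespace))
    return out


# Stand-in for the Prometheus counter: label set -> count.
requested_resizes_total: Counter = Counter()


def observe_pod_update(namespace: str,
                       old_containers: Dict[str, Resources],
                       new_containers: Dict[str, Resources]) -> None:
    # Each container in the pod update is counted separately.
    for name, new in new_containers.items():
        for labels in resize_increments(namespace, old_containers.get(name, {}), new):
            requested_resizes_total[labels] += 1


# One pod update touching two containers.
observe_pod_update(
    "default",
    {"app": {"requests": {"cpu": 500, "memory": 256}, "limits": {"cpu": 1000}},
     "sidecar": {"requests": {"cpu": 100}}},
    {"app": {"requests": {"cpu": 750, "memory": 256}, "limits": {}},
     "sidecar": {"requests": {"cpu": 50}}},
)
```

The example update increments three series, all in `default`:

- (`cpu`, `requests`, `increase`) for the `app` CPU request.
- (`cpu`, `limits`, `remove`) for the `app` CPU limit.
- (`cpu`, `requests`, `decrease`) for the `sidecar` CPU request.

The unchanged memory request produces no increment.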
#### `kubelet_pod_resize_duration_seconds`

This metric tracks the duration of [`doPodResizeAction`](https://github.com/kubernetes/kubernetes/blob/92de70895830ea1a9c2c6554bdab4cbee7ce867d/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L699), which
is responsible for actuating the resize.

Labels:
- `namespace` - the namespace of the pod.

This metric is recorded as a histogram.

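The following minimal Python sketch shows one way a duration histogram like this could be recorded around the actuation call, using only the standard library. The `Histogram` class, the `timed_resize` context manager, and the bucket bounds are all hypothetical. This proposal does not specify buckets, and a real kubelet would use its metrics library rather than this class. Buckets follow the Prometheus convention of cumulative `le` (less than or equal) upper bounds.

```python
import bisect
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Hypothetical bucket upper bounds in seconds (not specified by this proposal).
BUCKETS = (0.1, 0.5, 1.0, 5.0, 10.0, 30.0)


class Histogram:
    def __init__(self, buckets):
        self.bounds = tuple(sorted(buckets)) + (float("inf"),)
        # Per-namespace, non-cumulative count for each bucket (last one is +Inf).
        self.counts: Dict[str, List[int]] = defaultdict(lambda: [0] * len(self.bounds))
        self.sums: Dict[str, float] = defaultdict(float)

    def observe(self, namespace: str, seconds: float) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket,
        # matching "le" (less than or equal) semantics.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[namespace][idx] += 1
        self.sums[namespace] += seconds

    def cumulative(self, namespace: str) -> Dict[float, int]:
        """Cumulative counts per upper bound, as Prometheus exposes them."""
        out, running = {}, 0
        for bound, n in zip(self.bounds, self.counts[namespace]):
            running += n
            out[bound] = running
        return out


pod_resize_duration_seconds = Histogram(BUCKETS)


@contextmanager
def timed_resize(namespace: str) -> Iterator[None]:
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the wrapped call raises (a design assumption).
        pod_resize_duration_seconds.observe(namespace, time.perf_counter() - start)


with timed_resize("default"):
    pass  # stand-in for the call that actuates the resize
```

Because the observation sits in a `finally` block, failed actuations are timed too. That is a choice this sketch makes, not something this proposal specifies.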
#### `kubelet_pod_pending_resizes`

This metric tracks the current number of pods whose resize the kubelet has marked as pending. Broken down by
the labels below, it shows which of the current limitations users run into most often.

Labels:
- `reason` - why the resize is pending. Possible values: `infeasible` or `deferred`.
- `reason_detail` - more detail about why the resize is pending. The `PodResizePending` condition on the pod
  carries a more detailed message, but this label is limited to the following values to keep cardinality low:
  - `guaranteed_pod_cpu_manager_static_policy` - In-place resize is not supported for Guaranteed pods alongside the CPU Manager static policy.
  - `guaranteed_pod_memory_manager_static_policy` - In-place resize is not supported for Guaranteed pods alongside the Memory Manager static policy.
  - `static_pod` - In-place resize is not supported for static pods.
  - `swap_limitation` - In-place resize is not supported for containers with swap.
  - `insufficient_node_allocatable` - The node does not have enough allocatable capacity for this resize request.
- `namespace` - the namespace of the pod.

The set of possible `reason_detail` values may grow or shrink as limitations are added or removed in the future.

This metric is recorded as a gauge.

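The following minimal Python sketch computes this gauge from a snapshot of the kubelet's pending resizes, restricting label values to the lists above. The snapshot shape (pod UID mapped to namespace, `reason`, and `reason_detail`) and the function name are hypothetical. Rejecting unknown values with an error is one way to enforce the low-cardinality rule; this proposal does not say how unexpected values are handled.

```python
from collections import Counter
from typing import Dict, Tuple

REASONS = {"infeasible", "deferred"}
REASON_DETAILS = {
    "guaranteed_pod_cpu_manager_static_policy",
    "guaranteed_pod_memory_manager_static_policy",
    "static_pod",
    "swap_limitation",
    "insufficient_node_allocatable",
}

# Hypothetical snapshot: pod UID -> (namespace, reason, reason_detail).
PendingSnapshot = Dict[str, Tuple[str, str, str]]


def pending_resizes_gauge(snapshot: PendingSnapshot) -> Counter:
    """Count pods with a pending resize per (reason, reason_detail, namespace)."""
    gauge: Counter = Counter()
    for uid, (namespace, reason, detail) in snapshot.items():
        # Free-form condition messages must never leak into label values.
        if reason not in REASONS or detail not in REASON_DETAILS:
            raise ValueError(f"pod {uid}: unexpected label values {reason!r}/{detail!r}")
        gauge[(reason, detail, namespace)] += 1
    return gauge


snapshot = {
    "uid-1": ("default", "deferred", "insufficient_node_allocatable"),
    "uid-2": ("default", "deferred", "insufficient_node_allocatable"),
    "uid-3": ("batch", "infeasible", "static_pod"),
}
print(pending_resizes_gauge(snapshot))
```

The sketch recomputes the gauge from a snapshot instead of incrementing and decrementing on every state transition. This keeps the gauge from drifting if a transition is missed.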
#### `kubelet_pod_in_progress_resizes`

This metric tracks the current number of resize requests that the kubelet has marked as in progress, meaning that
the resources have been allocated but not yet actuated.

Labels:
- `namespace` - the namespace of the pod.

This metric is recorded as a gauge.

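The following minimal Python sketch models the "allocated but not yet actuated" rule behind this gauge. It assumes, hypothetically, that the kubelet keeps two views of each pod's resources: what it has allocated, and what it last actuated through the container runtime. A resize counts as in progress while the two views differ. `PodResources` and `in_progress_resizes` are illustrative names, not kubelet types.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class PodResources:
    namespace: str
    # Hypothetical views: resource name -> quantity (e.g. millicores, bytes).
    allocated: Dict[str, int] = field(default_factory=dict)
    actuated: Dict[str, int] = field(default_factory=dict)


def in_progress_resizes(pods: Iterable[PodResources]) -> Counter:
    """Per-namespace count of pods whose allocation has not been actuated yet."""
    gauge: Counter = Counter()
    for pod in pods:
        if pod.allocated != pod.actuated:
            gauge[pod.namespace] += 1
    return gauge


pods = [
    # Allocation raised to 750m CPU, runtime still at 500m: in progress.
    PodResources("default", {"cpu": 750}, {"cpu": 500}),
    # Allocation and runtime agree: not in progress.
    PodResources("default", {"cpu": 500}, {"cpu": 500}),
    PodResources("batch", {"memory": 512}, {"memory": 256}),
]
print(in_progress_resizes(pods))
```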
#### `kubelet_pod_deferred_resize_accepted_total`

This metric tracks the total number of resize requests that the kubelet initially marked as deferred but
later accepted. It exists mainly to surface bugs. Deferred resizes should normally be accepted in response to an
event, such as another pod being deleted or sized down. If a deferred resize is instead accepted by the timed retry,
that indicates an issue in the kubelet's logic for handling deferred resizes that should be fixed.

Labels:
- `accepted_reason` - whether the resize was accepted by the timed retry or because of another pod event. Possible values: `periodic_retry` or `event_based`.
- `namespace` - the namespace of the pod.

This metric is recorded as a counter.

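The following minimal Python sketch shows how the `accepted_reason` label could be attached, and how a nonzero `periodic_retry` count flags the bug this metric is meant to catch. The single-resource node model, `retry_deferred`, and the other names are hypothetical. The scenario simulates a capacity change that no event-based retry reacted to.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Dict, List


class Trigger(Enum):
    PERIODIC_RETRY = "periodic_retry"
    EVENT_BASED = "event_based"


# Stand-in for the Prometheus counter: (accepted_reason, namespace) -> count.
deferred_resize_accepted_total: Counter = Counter()


def retry_deferred(pending: List[str], namespace_of: Dict[str, str],
                   try_admit: Callable[[str], bool], trigger: Trigger) -> List[str]:
    """Retry every deferred resize once; record each acceptance with its trigger."""
    accepted = []
    for uid in list(pending):  # copy, since accepted entries are removed
        if try_admit(uid):
            pending.remove(uid)
            deferred_resize_accepted_total[(trigger.value, namespace_of[uid])] += 1
            accepted.append(uid)
    return accepted


def periodic_acceptances() -> int:
    # Per the rationale above, any acceptance via the timed retry points at a bug.
    return sum(n for (reason, _), n in deferred_resize_accepted_total.items()
               if reason == Trigger.PERIODIC_RETRY.value)


# Hypothetical single-resource node: free CPU in millicores.
node = {"free_cpu": 300}
need = {"pod-a": 400, "pod-b": 800}
namespace_of = {"pod-a": "default", "pod-b": "default"}
pending = ["pod-a", "pod-b"]


def try_admit(uid: str) -> bool:
    if need[uid] <= node["free_cpu"]:
        node["free_cpu"] -= need[uid]
        return True
    return False


node["free_cpu"] += 200  # another pod is deleted and the event triggers a retry
retry_deferred(pending, namespace_of, try_admit, Trigger.EVENT_BASED)

node["free_cpu"] += 700  # capacity freed without an event-based retry
retry_deferred(pending, namespace_of, try_admit, Trigger.PERIODIC_RETRY)

if periodic_acceptances() > 0:
    print("deferred resize accepted by periodic retry: missed event?")
```

In a real deployment, a query or alert on the `periodic_retry` series would more naturally play the role of the final check than in-process code.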
### Static CPU & Memory Policy

Resizing pods with static CPU & memory policy configured is out-of-scope for the beta release of