| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: 'Kubernetes 1.28: Node podresources API Graduates to GA' |
| 4 | +date: 2023-08-23 |
| 5 | +slug: kubelet-podresources-api-GA |
| 6 | +--- |
| 7 | + |
| 8 | +**Author:** |
| 9 | +Francesco Romani (Red Hat) |
| 10 | + |
| 11 | +The podresources API is an API served by the kubelet locally on the node, which exposes the compute resources exclusively |
| 12 | +allocated to containers. With the release of Kubernetes 1.28, that API is now Generally Available. |
| 13 | + |
| 14 | +## What problem does it solve? |
| 15 | + |
| 16 | +The kubelet can allocate exclusive resources to containers, like |
| 17 | +[CPUs, granting exclusive access to full cores](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/) |
| 18 | +or [memory, either regions or hugepages](https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/). |
| 19 | +Workloads that require high performance, low latency, or both leverage these features. |
| 20 | +The kubelet can also assign [devices to containers](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/). |
| 21 | +Collectively, the kubelet features that enable exclusive assignments are known as "resource managers". |
| 22 | + |
| 23 | +Without an API like podresources, the only way to learn about resource assignments was to read the state files the |
| 24 | +resource managers use. This was done out of necessity, but the path and the format of these files are |
| 25 | +both internal implementation details. Although they have been very stable, the project reserves the right to change them freely. |
| 26 | +Consuming the content of the state files is thus fragile and unsupported, and projects that do so should consider |
| 27 | +moving to the podresources API or to other supported APIs. |
| 28 | + |
| 29 | +## Overview of the API |
| 30 | + |
| 31 | +The podresources API was [initially proposed to enable device monitoring](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#monitoring-device-plugin-resources). |
| 32 | +A key prerequisite for monitoring agents is the ability to inspect device assignments, which are performed by the kubelet. |
| 33 | +Serving this purpose was the initial goal of the API. The first iteration of the API only had a single function implemented, `List`, |
| 34 | +to return information about the assignment of devices to containers. |
| 35 | +The API is used by [multus CNI](https://github.com/k8snetworkplumbingwg/multus-cni) and by |
| 36 | +[GPU monitoring tools](https://github.com/NVIDIA/dcgm-exporter). |
| 37 | + |
| 38 | +Since its inception, the podresources API has widened its scope to cover resource managers other than the device manager. |
| 39 | +Starting from Kubernetes 1.20, the `List` API also reports CPU cores and memory regions (including hugepages); the API also |
| 40 | +reports the NUMA locality of the devices, while the locality of CPUs and memory can be inferred from the system. |
| 41 | + |
| 42 | +In Kubernetes 1.21, the API [gained](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2403-pod-resources-allocatable-resources/README.md) |
| 43 | +the `GetAllocatableResources` function. |
| 44 | +This newer API complements the existing `List` API and enables monitoring agents to determine the unallocated resources, |
| 45 | +thus enabling new features built on top of the podresources API like a |
| 46 | +[NUMA-aware scheduler plugin](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/README.md). |
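The way `List` and `GetAllocatableResources` combine can be sketched as a set difference: whatever is allocatable on the node, minus whatever `List` reports as assigned, is unallocated. The Go sketch below uses simplified, hypothetical stand-in types (`containerCPUs`, `unallocatedCPUs`) rather than the generated gRPC types, and assumes CPU IDs are reported as integers by both functions.

```go
package main

import (
	"fmt"
	"sort"
)

// containerCPUs is a simplified, hypothetical stand-in for the per-container
// data a List call returns; the real API uses generated protobuf types.
type containerCPUs struct {
	Pod       string
	Container string
	CPUIDs    []int64
}

// unallocatedCPUs returns the allocatable CPU IDs that are not exclusively
// assigned to any container, sorted in ascending order.
func unallocatedCPUs(allocatable []int64, assigned []containerCPUs) []int64 {
	used := make(map[int64]bool)
	for _, c := range assigned {
		for _, id := range c.CPUIDs {
			used[id] = true
		}
	}
	free := []int64{}
	for _, id := range allocatable {
		if !used[id] {
			free = append(free, id)
		}
	}
	sort.Slice(free, func(i, j int) bool { return free[i] < free[j] })
	return free
}

func main() {
	// Illustrative data only: what GetAllocatableResources and List might report.
	allocatable := []int64{0, 1, 2, 3, 4, 5, 6, 7}
	assigned := []containerCPUs{
		{Pod: "pod-a", Container: "main", CPUIDs: []int64{2, 3}},
		{Pod: "pod-b", Container: "worker", CPUIDs: []int64{6}},
	}
	fmt.Println(unallocatedCPUs(allocatable, assigned)) // prints [0 1 4 5 7]
}
```

The same subtraction applies to devices and memory; only the key type changes.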
| 47 | + |
| 48 | +Finally, in Kubernetes 1.27, another function, `Get`, was introduced to better suit CNI meta-plugins: it makes it simpler to access the resources |
| 49 | +allocated to a specific pod, rather than having to filter through resources for all pods on the node. The `Get` function is currently alpha level. |
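To illustrate what `Get` saves a client from doing, here is a rough Go sketch of the client-side filtering needed when only `List` is available. The `podResources` type and the `findPod` helper are hypothetical simplifications, not the generated API types; the sketch assumes a pod is identified by its namespace and name.

```go
package main

import "fmt"

// podResources is a hypothetical, simplified view of one entry in a List
// response: the pod identity plus the device IDs assigned to it.
type podResources struct {
	Namespace string
	Name      string
	Devices   []string
}

// findPod scans the node-wide listing for a single pod. This is the
// filtering step a per-pod query makes unnecessary.
func findPod(all []podResources, namespace, name string) (podResources, bool) {
	for _, p := range all {
		if p.Namespace == namespace && p.Name == name {
			return p, true
		}
	}
	return podResources{}, false
}

func main() {
	// Illustrative data only.
	all := []podResources{
		{Namespace: "default", Name: "web", Devices: []string{"nic-0"}},
		{Namespace: "ml", Name: "trainer", Devices: []string{"gpu-0", "gpu-1"}},
	}
	if p, ok := findPod(all, "ml", "trainer"); ok {
		fmt.Println(p.Devices) // prints [gpu-0 gpu-1]
	}
}
```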
| 50 | + |
| 51 | +## Consuming the API |
| 52 | + |
| 53 | +The podresources API is served by the kubelet locally, on the same node on which it is running. |
| 54 | +On Unix flavors, the endpoint is served over a Unix domain socket; the default path is `/var/lib/kubelet/pod-resources/kubelet.sock`. |
| 55 | +On Windows, the endpoint is served over a named pipe; the default path is `npipe://\\.\pipe\kubelet-pod-resources`. |
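A client that runs on both platforms could pick its default endpoint along these lines. This is a minimal sketch: the `defaultEndpoint` helper is hypothetical, and the `unix://` URI prefix for the socket path is an assumption made here for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

const (
	// Default Unix domain socket path, written as a unix:// URI (assumed form).
	unixEndpoint = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"
	// Default Windows named pipe.
	windowsEndpoint = `npipe://\\.\pipe\kubelet-pod-resources`
)

// defaultEndpoint returns the default podresources endpoint for the given
// operating system name, as reported by runtime.GOOS.
func defaultEndpoint(goos string) string {
	if goos == "windows" {
		return windowsEndpoint
	}
	return unixEndpoint
}

func main() {
	fmt.Println(defaultEndpoint(runtime.GOOS))
}
```

Passing the operating system name as a parameter, instead of reading `runtime.GOOS` inside the helper, keeps the function easy to test on any platform.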
| 56 | + |
| 57 | +In order for the containerized monitoring application to consume the API, the socket should be mounted inside the container. |
| 58 | +A good practice is to mount the directory in which the podresources socket endpoint sits rather than the socket directly. |
| 59 | +This ensures that after a kubelet restart, the containerized monitoring application will be able to reconnect to the socket. |
| 60 | + |
| 61 | +An example manifest for a hypothetical monitoring agent consuming the podresources API and deployed as a DaemonSet could look like: |
| 62 | + |
| 63 | +```yaml |
| 64 | +apiVersion: apps/v1 |
| 65 | +kind: DaemonSet |
| 66 | +metadata: |
| 67 | +  name: podresources-monitoring-app |
| 68 | +  namespace: monitoring |
| 69 | +spec: |
| 70 | +  selector: |
| 71 | +    matchLabels: |
| 72 | +      name: podresources-monitoring |
| 73 | +  template: |
| 74 | +    metadata: |
| 75 | +      labels: |
| 76 | +        name: podresources-monitoring |
| 77 | +    spec: |
| 78 | +      containers: |
| 79 | +      - args: |
| 80 | +        - --podresources-socket=unix:///host-podresources/kubelet.sock |
| 81 | +        command: |
| 82 | +        - /bin/podresources-monitor |
| 83 | +        image: podresources-monitor:latest  # just for an example |
| 84 | +        volumeMounts: |
| 85 | +        - mountPath: /host-podresources |
| 86 | +          name: host-podresources |
| 87 | +      serviceAccountName: podresources-monitor |
| 88 | +      volumes: |
| 89 | +      - hostPath: |
| 90 | +          path: /var/lib/kubelet/pod-resources |
| 91 | +          type: Directory |
| 92 | +        name: host-podresources |
| 93 | +``` |
| 94 | + |
| 95 | +I hope you find it straightforward to consume the podresources API programmatically. |
| 96 | +The kubelet API package provides the protocol file and the Go type definitions; however, a client package is not yet available from the project, |
| 97 | +and the existing code should not be used directly. |
| 98 | +The [recommended](https://github.com/kubernetes/kubernetes/blob/v1.28.0-rc.0/pkg/kubelet/apis/podresources/client.go#L32) |
| 99 | +approach is to reimplement the client in your own project, copying and pasting the related functions as, for example, |
| 100 | +the multus project [does](https://github.com/k8snetworkplumbingwg/multus-cni/blob/v4.0.2/pkg/kubeletclient/kubeletclient.go). |
| 101 | + |
| 102 | +When operating the containerized monitoring application consuming the podresources API, a few points are worth highlighting to prevent "gotcha" moments: |
| 103 | + |
| 104 | +- Even though the API only exposes data, and by design doesn't allow clients to mutate the kubelet state, the gRPC request/response model requires |
| 105 | +  read-write access to the podresources API socket. In other words, it is not possible to limit the container mount to `ReadOnly`. |
| 106 | +- Multiple clients are allowed to connect to the podresources socket and consume the API, since it is stateless. |
| 107 | +- The kubelet has [built-in rate limits](https://github.com/kubernetes/kubernetes/pull/116459) to mitigate local Denial of Service attacks from |
| 108 | +  misbehaving or malicious consumers. The consumers of the API must tolerate rate limit errors returned by the server. The rate limit is currently |
| 109 | +  hardcoded and global, so misbehaving clients can consume all the quota and potentially starve correctly behaving clients. |
| 110 | + |
| 111 | +## Future enhancements |
| 112 | + |
| 113 | +For historical reasons, the podresources API has a less precise specification than typical Kubernetes APIs (such as the Kubernetes HTTP API, or the container runtime interface). |
| 114 | +This leads to unspecified behavior in corner cases. |
| 115 | +An [effort](https://issues.k8s.io/119423) is ongoing to rectify this and to produce a more precise specification. |
| 116 | + |
| 117 | +The [Dynamic Resource Allocation (DRA)](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation) infrastructure |
| 118 | +is a major overhaul of resource management. |
| 119 | +Its [integration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3695-pod-resources-for-dra) with the podresources API |
| 120 | +is already underway. |
| 121 | + |
| 122 | +An [effort](https://issues.k8s.io/119817) is ongoing to recommend or create a reference client package ready to be consumed. |
| 123 | + |
| 124 | +## Getting involved |
| 125 | + |
| 126 | +This feature is driven by [SIG Node](https://github.com/Kubernetes/community/blob/master/sig-node/README.md). |
| 127 | +Please join us to connect with the community and share your ideas and feedback around the above feature and |
| 128 | +beyond. We look forward to hearing from you! |