# Kubelet endpoint for device assignment observation details

## Table of Contents

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Proposed API](#proposed-api)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Alpha to Beta Graduation](#alpha-to-beta-graduation)
    - [Beta to G.A Graduation](#beta-to-ga-graduation)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature enablement and rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
  - [Add v1alpha1 Kubelet GRPC service, at <code>/var/lib/kubelet/pod-resources/kubelet.sock</code>, which returns a list of <a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734">CreateContainerRequest</a>s used to create containers.](#add-v1alpha1-kubelet-grpc-service-at--which-returns-a-list-of-createcontainerrequests-used-to-create-containers)
  - [Add a field to Pod Status.](#add-a-field-to-pod-status)
  - [Use the Kubelet Device Manager Checkpoint file](#use-the-kubelet-device-manager-checkpoint-file)
  - [Add a field to the Pod Spec:](#add-a-field-to-the-pod-spec)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/606)
- [X] (R) KEP approvers have approved the KEP status as `implementable`
- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [X] (R) Production readiness review completed
- [X] Production readiness review approved
- [X] "Implementation History" section is up-to-date for milestone
- [ ] ~~User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]~~
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

This document presents the kubelet endpoint that allows third-party consumers to inspect the mapping between devices and pods.

## Motivation

[Device Monitoring](https://docs.google.com/document/d/1NYnqw-HDQ6Y3L_mk85Q3wkxDtGNWTxpsedsgw4NgWpg/edit?usp=sharing) in Kubernetes is expected to be implemented out of the kubernetes tree.

For the metrics to be relevant to cluster administrators or Pod owners, they need to be matched to specific containers and pods (e.g., GPU utilization for pod X).
As such, external monitoring agents need to be able to determine the set of devices in use by containers and attach pod and container metadata to the metrics.

### Goals

* Deprecate and remove current device-specific knowledge from the kubelet, such as [accelerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229)
* Enable external device monitoring agents to provide metrics relevant to Kubernetes

## Proposal

### User Stories

* As a _Cluster Administrator_, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to the [node monitoring guidelines](https://docs.google.com/document/d/1_CdNWIjPBqVDMvu82aJICQsSCbh2BR-y9a8uXjQm4TI/edit?usp=sharing), so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors.
* As a _Device Vendor_, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics.

### Risks and Mitigations

This API is read-only, which removes a large class of risks. The aspects we consider below are as follows:
- What are the risks associated with the API service itself?
- What are the risks associated with the data itself?

| Risk                                                       | Impact | Mitigation |
| ---------------------------------------------------------- | ------ | ---------- |
| Too many requests risk impacting the kubelet's performance | High   | Implement rate limiting and/or passive caching, and follow best practices for gRPC resource management. |
| Improper access to the data                                | Low    | The server listens on a root-owned Unix socket. Access can be further limited with proper pod security policies. |


## Design Details

### Proposed API

We propose to add a new gRPC service to the Kubelet. This service listens on a Unix socket at `/var/lib/kubelet/pod-resources/kubelet.sock` and returns information about the kubelet's assignment of devices to containers.

This information is obtained from the internal state of the kubelet's Device Manager. The gRPC service exposes a single function named `List`; the v1 API is shown below:
```protobuf
// PodResources is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResources {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
}

// ListPodResourcesRequest is the request made to the PodResources service
message ListPodResourcesRequest {}

// ListPodResourcesResponse is the response returned by List function
message ListPodResourcesResponse {
    repeated PodResources pod_resources = 1;
}

// PodResources contains information about the node resources assigned to a pod
message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
}

// ContainerResources contains information about the resources assigned to a container
message ContainerResources {
    string name = 1;
    repeated ContainerDevices devices = 2;
}

// ContainerDevices contains information about the devices assigned to a container
message ContainerDevices {
    string resource_name = 1;
    repeated string device_ids = 2;
}
```
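To illustrate how a monitoring agent might consume a `List` response, the sketch below inverts it into a device-to-workload index, which is the join key needed to label per-device metrics with pod metadata. The structs are local stand-ins mirroring the messages above (a real agent would use stubs generated from the proto and a gRPC client dialed against the Unix socket), and `deviceIndex` is an illustrative name, not kubelet code:

```go
package main

import "fmt"

// Local stand-ins for the generated protobuf types above.
type ContainerDevices struct {
	ResourceName string
	DeviceIDs    []string
}

type ContainerResources struct {
	Name    string
	Devices []ContainerDevices
}

type PodResources struct {
	Name       string
	Namespace  string
	Containers []ContainerResources
}

// deviceIndex maps each device ID to the "namespace/pod/container"
// that the kubelet reported it as assigned to.
func deviceIndex(pods []PodResources) map[string]string {
	idx := make(map[string]string)
	for _, p := range pods {
		for _, c := range p.Containers {
			for _, d := range c.Devices {
				for _, id := range d.DeviceIDs {
					idx[id] = fmt.Sprintf("%s/%s/%s", p.Namespace, p.Name, c.Name)
				}
			}
		}
	}
	return idx
}

func main() {
	pods := []PodResources{{
		Name: "cuda-job", Namespace: "default",
		Containers: []ContainerResources{{
			Name:    "trainer",
			Devices: []ContainerDevices{{ResourceName: "nvidia.com/gpu", DeviceIDs: []string{"GPU-0", "GPU-1"}}},
		}},
	}}
	fmt.Println(deviceIndex(pods)["GPU-0"]) // default/cuda-job/trainer
}
```
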

### Test Plan

Given that the API allows observing which device has been assigned to which container, we need to test different configurations, such as:
* Pods without devices assigned to any containers.
* Pods with devices assigned to some but not all containers.
* Pods with devices assigned to init containers.
* ...

We have identified two main ways of testing this API:
- Unit tests, which won't rely on gRPC. They will test different configurations of pods and devices.
- Node e2e tests, which will allow us to test the service itself.

E2E tests are explicitly not written because they would require us to generate and deploy a custom container.
The infrastructure required is expensive, and it is not clear what additional testing (and hence risk reduction) this would provide compared to node e2e tests.

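A gRPC-free unit test of this kind could look like the sketch below: feed a response builder a fake device-manager state and check what gets reported per container. `fakeState` and `buildResponse` are hypothetical names standing in for the real kubelet internals:

```go
package main

import "fmt"

// fakeState maps pod name -> container name -> assigned device IDs,
// standing in for the Device Manager's internal state.
type fakeState map[string]map[string][]string

type containerReport struct {
	Pod, Container string
	DeviceIDs      []string
}

// buildResponse flattens the fake state into per-container reports,
// mirroring what a List response would carry.
func buildResponse(s fakeState) []containerReport {
	var out []containerReport
	for pod, containers := range s {
		for c, ids := range containers {
			out = append(out, containerReport{Pod: pod, Container: c, DeviceIDs: ids})
		}
	}
	return out
}

func main() {
	// Pod with two containers, devices assigned to only one of them.
	state := fakeState{"p2": {"c1": {"dev-0"}, "c2": nil}}
	reports := buildResponse(state)
	total := 0
	for _, r := range reports {
		total += len(r.DeviceIDs)
	}
	fmt.Println(len(reports), total) // 2 1
}
```
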
### Graduation Criteria

#### Alpha
- [X] Implement the new service API.
- [X] [Ensure proper e2e node tests are in place](https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=DevicePluginProbe).

#### Alpha to Beta Graduation
- [X] Demonstrate that the endpoint can be used to replace in-tree GPU device metrics in production environments (NVIDIA, sig-node April 30, 2019).

#### Beta to G.A Graduation
- [X] Multiple real world examples ([Multus CNI](https://github.com/intel/multus-cni)).
- [X] Allowing time for feedback (2 years).
- [X] [Start Deprecation of Accelerator metrics in kubelet](https://github.com/kubernetes/kubernetes/pull/91930).
- [X] Risks have been addressed.

### Upgrade / Downgrade Strategy

With gRPC, the version is part of the service name.
Old and new versions should always be served and listened on by the kubelet.

For a cluster admin, upgrading to the newest API version means upgrading Kubernetes to a newer version as well as upgrading the monitoring component.

For a vendor, changes in the API should always be backwards compatible.

Downgrades here are related to downgrading the plugin.

### Version Skew Strategy

The kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.

## Production Readiness Review Questionnaire

### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [X] Feature gate (also fill in values in `kep.yaml`).
    - Feature gate name: `KubeletPodResources`.
    - Components depending on the feature gate: N/A.

* **Does enabling the feature change any default behavior?** No.
* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?** Yes, through feature gates.
* **What happens if we reenable the feature if it was previously rolled back?** The service recovers state from the kubelet.
* **Are there any tests for feature enablement/disablement?** No; however, no data is created or deleted.

### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?** The kubelet would fail to start. Errors would be caught in CI.
* **What specific metrics should inform a rollback?** Not applicable; metrics wouldn't be available.
* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** Not applicable.
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** No.

### Monitoring requirements

* **How can an operator determine if the feature is in use by workloads?**
  - Look at the `pod_resources_endpoint_requests_total` metric exposed by the kubelet.
  - Look at hostPath mounts of privileged containers.
* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
  - [X] Metrics
    - Metric name: `pod_resources_endpoint_requests_total`
    - Components exposing the metric: kubelet

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A, or refer to Kubelet SLIs.
* **Are there any missing metrics that would be useful to have to improve observability of this feature?** No.

### Dependencies

* **Does this feature depend on any specific services running in the cluster?** Not applicable.

### Scalability

* **Will enabling / using this feature result in any new API calls?** No.
* **Will enabling / using this feature result in introducing new API types?** No.
* **Will enabling / using this feature result in any new calls to cloud provider?** No.
* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No.
* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?** No. The feature is outside any existing paths in the kubelet.
* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** In 1.18, DDoSing the API can lead to resource exhaustion. This is planned to be addressed as part of G.A.
The feature only collects data when a request comes in; the data is then garbage collected. The data collected is proportional to the number of pods on the node.

### Troubleshooting

* **How does this feature react if the API server and/or etcd is unavailable?** No effect.
* **What are other known failure modes?** No known failure modes.
* **What steps should be taken if SLOs are not being met to determine the problem?** N/A.

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
- 2018-10-30: Demo with example gpu monitoring daemonset
- 2018-11-10: KEP lgtm'd and approved
- 2018-11-15: Implementation and e2e test merged before 1.13 release: kubernetes/kubernetes#70508
- 2019-04-30: Demo of production GPU monitoring by NVIDIA
- 2019-04-30: Agreement in sig-node to move feature to beta in 1.15
- 2020-06-17: Agreement in sig-node to move feature to G.A in 1.19 or 1.20

## Alternatives

### Add v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of [CreateContainerRequest](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734)s used to create containers.
* Pros:
  * Reuses an existing API for describing containers rather than inventing a new one
* Cons:
  * It ties the endpoint to the CreateContainerRequest, and may prevent us from adding other information we want in the future
  * It does not contain any additional information that will be useful to monitoring agents other than devices, and contains lots of irrelevant information for this use case.
* Notes:
  * Does not include any reference to resource names. Monitoring agents must identify devices by the device or environment variables passed to the pod or container.

### Add a field to Pod Status.
* Pros:
  * Allows for observation of container-to-device bindings local to the node through the `/pods` endpoint
* Cons:
  * Only consumed locally, which doesn't justify an API change
  * Device bindings are immutable after allocation, and are _debatably_ observable (they can be "observed" from the local checkpoint file). Device bindings are generally a poor fit for status.

### Use the Kubelet Device Manager Checkpoint file
* Allows for observability of device-to-container bindings through what exists in the checkpoint file
  * Requires adding additional metadata to the checkpoint file as required by the monitoring agent
* Requires implementing versioning for the checkpoint file, and handling version skew between readers and the kubelet
* Future modifications to the checkpoint file are more difficult.

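The versioning burden this alternative implies can be sketched as follows: every checkpoint write embeds a schema version, and every external reader must check it before trusting the payload. The struct and field names here are hypothetical, not the actual Device Manager checkpoint format:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// checkpointV1 is a hypothetical versioned checkpoint schema:
// readers reject any version they don't understand.
type checkpointV1 struct {
	Version string              `json:"version"`
	Devices map[string][]string `json:"devices"` // container -> device IDs
}

func decodeCheckpoint(data []byte) (*checkpointV1, error) {
	var cp checkpointV1
	if err := json.Unmarshal(data, &cp); err != nil {
		return nil, err
	}
	if cp.Version != "v1" {
		return nil, errors.New("unsupported checkpoint version: " + cp.Version)
	}
	return &cp, nil
}

func main() {
	raw := []byte(`{"version":"v1","devices":{"trainer":["GPU-0"]}}`)
	cp, err := decodeCheckpoint(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cp.Devices["trainer"][0]) // GPU-0

	_, err = decodeCheckpoint([]byte(`{"version":"v2"}`))
	fmt.Println(err != nil) // true: readers must reject unknown versions
}
```

Every consumer of the file, not just the kubelet, would have to carry this version-skew handling, which is part of why this alternative was rejected.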
### Add a field to the Pod Spec:
* A new object `ComputeDevice` will be defined and a new variable `ComputeDevices` will be added to the `Container` (Spec) object, representing a list of `ComputeDevice` objects.
```golang
// ComputeDevice describes the devices assigned to this container for a given ResourceName
type ComputeDevice struct {
	// DeviceIDs is the list of devices assigned to this container
	DeviceIDs []string
	// ResourceName is the name of the compute resource
	ResourceName string
}

// Container represents a single container that is expected to be run on the host.
type Container struct {
	...
	// ComputeDevices contains the devices assigned to this container
	// This field is alpha-level and is only honored by servers that enable the ComputeDevices feature.
	// +optional
	ComputeDevices []ComputeDevice
	...
}
```
* During Kubelet pod admission, if `ComputeDevices` is found non-empty, the specified devices will be allocated; otherwise the behaviour will remain the same as it is today.
* Before starting the pod, the kubelet writes the assigned `ComputeDevices` back to the pod spec.
  * Note: Writing to the API server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup.
* Allows devices to potentially be assigned by a custom scheduler.
* Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally.
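
The admission behaviour described in the first bullet can be sketched as a simple branch: honor the devices listed in the spec when present, otherwise fall back to today's device-manager allocation. The function and helper names are illustrative stand-ins, not actual kubelet code:

```go
package main

import "fmt"

// ComputeDevice mirrors the proposed spec field above.
type ComputeDevice struct {
	DeviceIDs    []string
	ResourceName string
}

// devicesToAllocate returns the device IDs to allocate for a container:
// the ones pinned in the spec when non-empty, else the default allocation.
func devicesToAllocate(specified []ComputeDevice, defaultAlloc func() []string) []string {
	if len(specified) > 0 {
		var ids []string
		for _, d := range specified {
			ids = append(ids, d.DeviceIDs...)
		}
		return ids
	}
	return defaultAlloc() // unchanged behaviour when the field is empty
}

func main() {
	def := func() []string { return []string{"auto-0"} }
	fmt.Println(devicesToAllocate(nil, def)) // [auto-0]
	fmt.Println(devicesToAllocate([]ComputeDevice{{DeviceIDs: []string{"gpu-3"}, ResourceName: "nvidia.com/gpu"}}, def)) // [gpu-3]
}
```
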