
Commit fbb9d07

Author: Renaud Gaubert (committed)
Update compute-device-assignment to follow the new KEP template
Signed-off-by: Renaud Gaubert <[email protected]>
1 parent ea4b9cc commit fbb9d07

1 file changed

keps/sig-node/compute-device-assignment.md

Lines changed: 190 additions & 45 deletions
@@ -3,6 +3,7 @@ title: Kubelet endpoint for device assignment observation details
 authors:
 - "@dashpole"
 - "@vikaschoudhary16"
+- "@renaudwastaken"
 owning-sig: sig-node
 reviewers:
 - "@thockin"
@@ -21,46 +22,101 @@ status: implementable
 ## Table of Contents

 <!-- toc -->
-- [Abstract](#abstract)
-- [Background](#background)
-- [Objectives](#objectives)
-- [User Journeys](#user-journeys)
-- [Device Monitoring Agents](#device-monitoring-agents)
-- [Changes](#changes)
-- [Potential Future Improvements](#potential-future-improvements)
-- [Alternatives Considered](#alternatives-considered)
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+- [Goals](#goals)
+- [Proposal](#proposal)
+- [User Stories](#user-stories)
+- [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+- [Proposed API](#proposed-api)
+- [Test Plan](#test-plan)
+- [Graduation Criteria](#graduation-criteria)
+- [Alpha](#alpha)
+- [Alpha to Beta Graduation](#alpha-to-beta-graduation)
+- [Beta to G.A Graduation](#beta-to-ga-graduation)
+- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+- [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+- [Feature enablement and rollback](#feature-enablement-and-rollback)
+- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+- [Monitoring requirements](#monitoring-requirements)
+- [Dependencies](#dependencies)
+- [Scalability](#scalability)
+- [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Alternatives](#alternatives)
 - [Add v1alpha1 Kubelet GRPC service, at <code>/var/lib/kubelet/pod-resources/kubelet.sock</code>, which returns a list of <a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734">CreateContainerRequest</a>s used to create containers.](#add-v1alpha1-kubelet-grpc-service-at--which-returns-a-list-of-createcontainerrequests-used-to-create-containers)
 - [Add a field to Pod Status.](#add-a-field-to-pod-status)
 - [Use the Kubelet Device Manager Checkpoint file](#use-the-kubelet-device-manager-checkpoint-file)
 - [Add a field to the Pod Spec:](#add-a-field-to-the-pod-spec)
-- [Graduation Criteria](#graduation-criteria)
-- [Implementation History](#implementation-history)
 <!-- /toc -->

-## Abstract
-In this document we will discuss the motivation and code changes required for introducing a kubelet endpoint to expose device to container bindings.
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/606)
+- [X] (R) KEP approvers have approved the KEP status as `implementable`
+- [X] (R) Design details are appropriately documented
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [X] (R) Graduation criteria is in place
+- [X] (R) Production readiness review completed
+- [X] Production readiness review approved
+- [X] "Implementation History" section is up-to-date for milestone
+- ~~[ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]~~
+- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+This document presents the kubelet endpoint which allows third party consumers to inspect the mapping between devices and pods.
+
+## Motivation

-## Background
-[Device Monitoring](https://docs.google.com/document/d/1NYnqw-HDQ6Y3L_mk85Q3wkxDtGNWTxpsedsgw4NgWpg/edit?usp=sharing) requires external agents to be able to determine the set of devices in-use by containers and attach pod and container metadata for these devices.
+[Device Monitoring](https://docs.google.com/document/d/1NYnqw-HDQ6Y3L_mk85Q3wkxDtGNWTxpsedsgw4NgWpg/edit?usp=sharing) in Kubernetes is expected to be implemented out of the Kubernetes tree.

-## Objectives
+For the metrics to be relevant to cluster administrators or Pod owners, they need to be matched to a specific container and pod (e.g., GPU utilization for pod X).
+As such, external monitoring agents need to be able to determine the set of devices in use by containers and attach pod and container metadata to the metrics.

-* To remove current device-specific knowledge from the kubelet, such as [accellerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229)
-* To enable future use-cases requiring device-specific knowledge to be out-of-tree
+### Goals

-## User Journeys
+* Deprecate and remove current device-specific knowledge from the kubelet, such as [accelerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229)
+* Enable external device monitoring agents to provide metrics relevant to Kubernetes

-### Device Monitoring Agents
+## Proposal
+
+### User Stories

 * As a _Cluster Administrator_, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to the [node monitoring guidelines](https://docs.google.com/document/d/1_CdNWIjPBqVDMvu82aJICQsSCbh2BR-y9a8uXjQm4TI/edit?usp=sharing), so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors.
 * As a _Device Vendor_, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics:

 ![device monitoring architecture](https://user-images.githubusercontent.com/3262098/43926483-44331496-9bdf-11e8-82a0-14b47583b103.png)

+### Risks and Mitigations
+
+This API is read-only, which removes a large class of risks. The aspects that we consider below are as follows:
+- What are the risks associated with the API service itself?
+- What are the risks associated with the data itself?
+
+| Risk | Impact | Mitigation |
+| ---- | ------ | ---------- |
+| Too many requests risk impacting kubelet performance | High | Implement rate limiting and/or passive caching; follow best practices for gRPC resource management. |
+| Improper access to the data | Low | The server listens on a root-owned unix socket. This can be limited with proper pod security policies. |
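
As an illustration of the rate-limiting mitigation above, here is a minimal Go sketch of a gRPC unary interceptor guarding the socket. It is only a sketch (the kubelet's actual implementation may differ) and assumes `golang.org/x/time/rate` for the limiter:

```go
package main

import (
	"context"
	"net"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// rateLimitInterceptor rejects calls once the request rate exceeds the limiter's
// budget, protecting the kubelet from an overly aggressive monitoring agent.
func rateLimitInterceptor(l *rate.Limiter) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		if !l.Allow() {
			return nil, status.Error(codes.ResourceExhausted, "pod-resources API rate limit exceeded")
		}
		return handler(ctx, req)
	}
}

func main() {
	// Allow on average 5 requests per second, with bursts of up to 10.
	limiter := rate.NewLimiter(rate.Limit(5), 10)
	server := grpc.NewServer(grpc.UnaryInterceptor(rateLimitInterceptor(limiter)))

	// The socket path mirrors the proposal; a real server would also register
	// the pod-resources service implementation before serving.
	lis, err := net.Listen("unix", "/var/lib/kubelet/pod-resources/kubelet.sock")
	if err != nil {
		panic(err)
	}
	_ = server.Serve(lis)
}
```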
+

-## Changes
+## Design Details

-Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. The GRPC Service returns a single PodResourcesResponse, which is shown in proto below:
+### Proposed API
+
+We propose to add a new gRPC service to the Kubelet. This gRPC service would listen on a unix socket at `/var/lib/kubelet/pod-resources/kubelet.sock` and return information about the kubelet's assignment of devices to containers.
+
+This information is obtained from the internal state of the kubelet's Device Manager. The gRPC service has a single function named `List`; the v1 API is shown below:
 ```protobuf
 // PodResources is a service provided by the kubelet that provides information about the
 // node resources consumed by pods and containers on the node
@@ -96,13 +152,120 @@ message ContainerDevices {
 }
 ```
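
For illustration, a minimal Go sketch of a monitoring agent calling `List` over the unix socket. It assumes the Go bindings generated from the proto above (the import path `k8s.io/kubelet/pkg/apis/podresources/v1` and message names may differ across Kubernetes versions):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Connect to the kubelet's pod-resources socket. A monitoring agent typically
	// mounts /var/lib/kubelet/pod-resources into its container via a hostPath volume.
	conn, err := grpc.DialContext(ctx, "unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	// List returns the devices currently assigned to each container on the node.
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				fmt.Printf("%s/%s container=%s resource=%s devices=%v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```

A device monitoring agent would run this kind of query periodically (or cache the result) and attach the returned pod, container, and device IDs to the metrics it exports.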

-### Potential Future Improvements
+### Test Plan
+
+Given that the API allows observing which devices have been associated with which containers, we need to test different configurations, such as:
+* Pods without devices assigned to any containers.
+* Pods with devices assigned to some but not all containers.
+* Pods with devices assigned to init containers.
+* ...
+
+We have identified two main ways of testing this API:
+- Unit tests, which won't rely on gRPC. They will test different configurations of pods and devices.
+- Node e2e tests, which will allow us to test the service itself.
+
+E2E tests are explicitly not written because they would require us to generate and deploy a custom container.
+The infrastructure required is expensive, and it is not clear what additional testing (and hence risk reduction) this would provide compared to node e2e tests.
+
+### Graduation Criteria
+
+#### Alpha
+- [X] Implement the new service API.
+- [X] [Ensure proper e2e node tests are in place](https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=DevicePluginProbe).
+
+#### Alpha to Beta Graduation
+- [X] Demonstrate that the endpoint can be used to replace in-tree GPU device metrics in production environments (NVIDIA, sig-node April 30, 2019).
+
+#### Beta to G.A Graduation
+- [X] Multiple real-world examples ([Multus CNI](https://github.com/intel/multus-cni)).
+- [X] Allowing time for feedback (2 years).
+- [X] [Start Deprecation of Accelerator metrics in kubelet](https://github.com/kubernetes/kubernetes/pull/91930).
+- [X] Risks have been addressed.
+
+### Upgrade / Downgrade Strategy
+
+With gRPC, the version is part of the service name.
+Old and new versions should always be served and listened for by the kubelet.
+
+For a cluster admin, upgrading to the newest API version means upgrading Kubernetes to a newer version as well as upgrading the monitoring component.
+
+For a vendor, changes in the API should always be backwards compatible.
+
+Downgrades here are related to downgrading the plugin.
+
+### Version Skew Strategy
+
+The kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.
+
+## Production Readiness Review Questionnaire
+### Feature enablement and rollback
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [X] Feature gate (also fill in values in `kep.yaml`).
+    - Feature gate name: `KubeletPodResources`.
+    - Components depending on the feature gate: N/A.
+
+* **Does enabling the feature change any default behavior?** No.
+* **Can the feature be disabled once it has been enabled (i.e., can we roll back the enablement)?** Yes, through feature gates.
+* **What happens if we reenable the feature if it was previously rolled back?** The service recovers state from the kubelet.
+* **Are there any tests for feature enablement/disablement?** No; however, no data is created or deleted.

-* Add `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll.
-* Add identifiers for other resources used by pods to the `PodResources` message.
-* For example, persistent volume location on disk
+### Rollout, Upgrade and Rollback Planning

-## Alternatives Considered
+* **How can a rollout fail? Can it impact already running workloads?** The kubelet would fail to start. Errors would be caught in the CI.
+* **What specific metrics should inform a rollback?** Not applicable; metrics wouldn't be available.
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?** Not applicable.
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** No.
+
+### Monitoring requirements
+* **How can an operator determine if the feature is in use by workloads?**
+  - Look at the `pod_resources_requests_total` metric exposed by the kubelet.
+  - Look at hostPath mounts of privileged containers (see the sketch below).
+* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
+  - [X] Metrics
+    - Metric name: `pod_resources_requests_total`
+    - Components exposing the metric: kubelet
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A, or refer to the kubelet SLIs.
+* **Are there any missing metrics that would be useful to have to improve observability of this feature?** No.
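
As a sketch of the hostPath check mentioned above, a short client-go example that flags pods mounting the pod-resources directory (illustrative only; it assumes in-cluster credentials with permission to list pods):

```go
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration; outside of a cluster, clientcmd could be used instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// A pod that mounts the kubelet pod-resources directory via hostPath is a
	// strong hint that a monitoring agent is consuming this endpoint.
	for _, pod := range pods.Items {
		for _, vol := range pod.Spec.Volumes {
			if vol.HostPath != nil && strings.HasPrefix(vol.HostPath.Path, "/var/lib/kubelet/pod-resources") {
				fmt.Printf("%s/%s mounts %s\n", pod.Namespace, pod.Name, vol.HostPath.Path)
			}
		}
	}
}
```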
+
+
+### Dependencies
+
+* **Does this feature depend on any specific services running in the cluster?** Not applicable.
+
+### Scalability
+
+* **Will enabling / using this feature result in any new API calls?** No.
+* **Will enabling / using this feature result in introducing new API types?** No.
+* **Will enabling / using this feature result in any new calls to the cloud provider?** No.
+* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No.
+* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?** No. The feature is outside of any existing paths in the kubelet.
+* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** In 1.18, DDoSing the API can lead to resource exhaustion. This is planned to be addressed as part of G.A.
+The feature only collects data when requests come in; the data is then garbage collected. The data collected is proportional to the number of pods on the node.
+
+### Troubleshooting
+
+* **How does this feature react if the API server and/or etcd is unavailable?** No effect.
+* **What are other known failure modes?** No known failure modes.
+* **What steps should be taken if SLOs are not being met to determine the problem?** N/A
+
+[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+
+## Implementation History
+
+- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
+- 2018-10-30: Demo with example gpu monitoring daemonset
+- 2018-11-10: KEP lgtm'd and approved
+- 2018-11-15: Implementation and e2e test merged before 1.13 release: kubernetes/kubernetes#70508
+- 2019-04-30: Demo of production GPU monitoring by NVIDIA
+- 2019-04-30: Agreement in sig-node to move feature to beta in 1.15
+- 2020-06-17: Agreement in sig-node to move feature to G.A in 1.19 or 1.20
+
+## Alternatives

 ### Add v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of [CreateContainerRequest](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734)s used to create containers.
 * Pros:
@@ -113,7 +276,7 @@ message ContainerDevices {
 * Notes:
 * Does not include any reference to resource names. Monitoring agents must identify devices by the device or environment variables passed to the pod or container.

-### Add a field to Pod Status.
+### Add a field to Pod Status.
 * Pros:
 * Allows for observation of container to device bindings local to the node through the `/pods` endpoint
 * Cons:
@@ -152,21 +315,3 @@ type Container struct {
 * Note: Writing to the Api Server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup.
 * Allows devices to potentially be assigned by a custom scheduler.
 * Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally.
-
-## Graduation Criteria
-
-Alpha:
-- [x] Implement the endpoint as described above
-- [x] E2e node test tests the endpoint: https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=DevicePluginProbe
-
-Beta:
-- [x] Demonstrate in production environments that the endpoint can be used to replace in-tree GPU device metrics (NVIDIA, sig-node April 30, 2019).
-
-## Implementation History
-
-- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
-- 2018-10-30: Demo with example gpu monitoring daemonset
-- 2018-11-10: KEP lgtm'd and approved
-- 2018-11-15: Implementation and e2e test merged before 1.13 release: kubernetes/kubernetes#70508
-- 2019-04-30: Demo of production GPU monitoring by NVIDIA
-- 2019-04-30: Agreement in sig-node to move feature to beta in 1.15
