Commit 58744f8: Merge pull request kubernetes#1865 from RenaudWasTaken/compute-devices-ga ("G.A Plan for Compute Devices + Updated to follow KEP template"; 2 parents: f2a2de3 + 7b3fdb6)

File tree: 3 files changed (+343, -172 lines)

# Kubelet endpoint for device assignment observation details

## Table of Contents

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Proposed API](#proposed-api)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Alpha to Beta Graduation](#alpha-to-beta-graduation)
    - [Beta to G.A Graduation](#beta-to-ga-graduation)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature enablement and rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
  - [Add v1alpha1 Kubelet GRPC service, at <code>/var/lib/kubelet/pod-resources/kubelet.sock</code>, which returns a list of <a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734">CreateContainerRequest</a>s used to create containers.](#add-v1alpha1-kubelet-grpc-service-at--which-returns-a-list-of-createcontainerrequests-used-to-create-containers)
  - [Add a field to Pod Status.](#add-a-field-to-pod-status)
  - [Use the Kubelet Device Manager Checkpoint file](#use-the-kubelet-device-manager-checkpoint-file)
  - [Add a field to the Pod Spec:](#add-a-field-to-the-pod-spec)
<!-- /toc -->
## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/606)
- [X] (R) KEP approvers have approved the KEP status as `implementable`
- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [X] (R) Production readiness review completed
- [X] Production readiness review approved
- [X] "Implementation History" section is up-to-date for milestone
- ~~[ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]~~
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website
## Summary

This document describes a kubelet endpoint that allows third-party consumers to inspect the mapping between devices and pods.

## Motivation

[Device Monitoring](https://docs.google.com/document/d/1NYnqw-HDQ6Y3L_mk85Q3wkxDtGNWTxpsedsgw4NgWpg/edit?usp=sharing) in Kubernetes is expected to be implemented outside of the Kubernetes tree.

For the metrics to be relevant to cluster administrators or pod owners, they need to be able to be matched to specific containers and pods (e.g., GPU utilization for pod X).
As such, external monitoring agents need to be able to determine the set of devices in use by containers and to attach pod and container metadata to the metrics.

### Goals

* Deprecate and remove current device-specific knowledge from the kubelet, such as [accelerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229)
* Enable external device monitoring agents to provide metrics relevant to Kubernetes
## Proposal

### User Stories

* As a _Cluster Administrator_, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to the [node monitoring guidelines](https://docs.google.com/document/d/1_CdNWIjPBqVDMvu82aJICQsSCbh2BR-y9a8uXjQm4TI/edit?usp=sharing), so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors.
* As a _Device Vendor_, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics:

![device monitoring architecture](https://user-images.githubusercontent.com/3262098/43926483-44331496-9bdf-11e8-82a0-14b47583b103.png)
### Risks and Mitigations

This API is read-only, which removes a large class of risks. The aspects that we consider below are as follows:
- What are the risks associated with the API service itself?
- What are the risks associated with the data itself?

| Risk | Impact | Mitigation |
| --------------------------------------------------------- | ------------- | ---------- |
| Too many requests risk impacting kubelet performance | High | Implement rate limiting and/or passive caching; follow best practices for gRPC resource management. |
| Improper access to the data | Low | The server listens on a root-owned unix socket. Access can be limited with proper pod security policies. |
## Design Details

### Proposed API

We propose to add a new gRPC service to the kubelet. This gRPC service listens on a unix socket at `/var/lib/kubelet/pod-resources/kubelet.sock` and returns information about the kubelet's assignment of devices to containers.

This information is obtained from the internal state of the kubelet's Device Manager. The gRPC service exposes a single function, `List`; the v1 API is shown below:
```protobuf
// PodResources is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResources {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
}

// ListPodResourcesRequest is the request made to the PodResources service
message ListPodResourcesRequest {}

// ListPodResourcesResponse is the response returned by the List function
message ListPodResourcesResponse {
    repeated PodResources pod_resources = 1;
}

// PodResources contains information about the node resources assigned to a pod
message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
}

// ContainerResources contains information about the resources assigned to a container
message ContainerResources {
    string name = 1;
    repeated ContainerDevices devices = 2;
}

// ContainerDevices contains information about the devices assigned to a container
message ContainerDevices {
    string resource_name = 1;
    repeated string device_ids = 2;
}
```
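To make the shape of the response concrete, the sketch below shows how a monitoring agent might index a `List` response. The Go structs are hand-written stand-ins for the protoc-generated stubs; a real agent would obtain the response through a gRPC client dialed against the unix socket rather than from a literal value.

```go
package main

import "fmt"

// Hand-written stand-ins for the generated protobuf types; a real agent
// would use the stubs generated from the PodResources service definition.
type ContainerDevices struct {
	ResourceName string
	DeviceIDs    []string
}

type ContainerResources struct {
	Name    string
	Devices []ContainerDevices
}

type PodResources struct {
	Name       string
	Namespace  string
	Containers []ContainerResources
}

// deviceOwners inverts a List response into a map from device ID to the
// "namespace/pod/container" the kubelet assigned it to, which is the join
// a monitoring agent needs in order to label per-device metrics.
func deviceOwners(pods []PodResources, resourceName string) map[string]string {
	owners := map[string]string{}
	for _, p := range pods {
		for _, c := range p.Containers {
			for _, d := range c.Devices {
				if d.ResourceName != resourceName {
					continue
				}
				for _, id := range d.DeviceIDs {
					owners[id] = fmt.Sprintf("%s/%s/%s", p.Namespace, p.Name, c.Name)
				}
			}
		}
	}
	return owners
}

func main() {
	pods := []PodResources{{
		Name:      "training-job",
		Namespace: "ml",
		Containers: []ContainerResources{{
			Name: "trainer",
			Devices: []ContainerDevices{{
				ResourceName: "nvidia.com/gpu",
				DeviceIDs:    []string{"GPU-0", "GPU-1"},
			}},
		}},
	}}
	fmt.Println(deviceOwners(pods, "nvidia.com/gpu")["GPU-0"])
}
```

A real agent would refresh this mapping periodically and attach the resulting pod, namespace, and container labels to its per-device metrics.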

### Test Plan

Given that the API allows observing which device has been associated with which container, we need to test different configurations, such as:
* Pods without devices assigned to any containers.
* Pods with devices assigned to some but not all containers.
* Pods with devices assigned to init containers.
* ...

We have identified two main ways of testing this API:
- Unit tests, which won't rely on gRPC. They will test different configurations of pods and devices.
- Node e2e tests, which will allow us to test the service itself.

E2E tests are explicitly not written because they would require us to generate and deploy a custom container.
The infrastructure required is expensive, and it is not clear what additional testing (and hence risk reduction) this would provide compared to node e2e tests.
### Graduation Criteria

#### Alpha
- [X] Implement the new service API.
- [X] [Ensure proper e2e node tests are in place](https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=DevicePluginProbe).

#### Alpha to Beta Graduation
- [X] Demonstrate that the endpoint can be used to replace in-tree GPU device metrics in production environments (NVIDIA, sig-node April 30, 2019).

#### Beta to G.A Graduation
- [X] Multiple real-world examples ([Multus CNI](https://github.com/intel/multus-cni)).
- [X] Allowing time for feedback (2 years).
- [X] [Start deprecation of accelerator metrics in kubelet](https://github.com/kubernetes/kubernetes/pull/91930).
- [X] Risks have been addressed.
### Upgrade / Downgrade Strategy

With gRPC, the version is part of the service name.
Old and new versions should always be served by the kubelet.

For a cluster admin, upgrading to the newest API version means upgrading Kubernetes to a newer version as well as upgrading the monitoring component.

For a vendor, changes to the API should always be backwards compatible.

Downgrades here are related to downgrading the plugin.
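For illustration, this versioning falls out of proto packaging: because the package name is part of the fully qualified gRPC service name, the kubelet can register several versions of the service on the same socket. The sketch below is illustrative of the naming scheme, not the exact file layout:

```protobuf
// v1alpha1/api.proto, registered as "v1alpha1.PodResources".
syntax = "proto3";
package v1alpha1;

service PodResources {
  rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
}

// A later file declaring `package v1;` with the same service is registered
// as "v1.PodResources" alongside the old one on the same socket, so clients
// built against the older version keep working unchanged.
```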
### Version Skew Strategy

The kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.
## Production Readiness Review Questionnaire

### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [X] Feature gate (also fill in values in `kep.yaml`).
    - Feature gate name: `KubeletPodResources`.
    - Components depending on the feature gate: N/A.

* **Does enabling the feature change any default behavior?** No.
* **Can the feature be disabled once it has been enabled (i.e., can we roll back the enablement)?** Yes, through feature gates.
* **What happens if we reenable the feature if it was previously rolled back?** The service recovers state from the kubelet.
* **Are there any tests for feature enablement/disablement?** No; however, no data is created or deleted.
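As a sketch of enablement, the gate can be turned on via the kubelet's `--feature-gates=KubeletPodResources=true` flag or, equivalently, in the kubelet configuration file (a minimal fragment using the standard `KubeletConfiguration` type):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletPodResources: true
```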
### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?** The kubelet would fail to start. Errors would be caught in CI.
* **What specific metrics should inform a rollback?** Not applicable; metrics wouldn't be available.
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?** Not applicable.
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** No.
### Monitoring requirements

* **How can an operator determine if the feature is in use by workloads?**
  - Look at the `pod_resources_endpoint_requests_total` metric exposed by the kubelet.
  - Look at hostPath mounts of privileged containers.
* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
  - [X] Metrics
    - Metric name: `pod_resources_endpoint_requests_total`
    - Components exposing the metric: kubelet

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A, or refer to the kubelet SLIs.
* **Are there any missing metrics that would be useful to have to improve observability of this feature?** No.
### Dependencies

* **Does this feature depend on any specific services running in the cluster?** Not applicable.
### Scalability

* **Will enabling / using this feature result in any new API calls?** No.
* **Will enabling / using this feature result in introducing new API types?** No.
* **Will enabling / using this feature result in any new calls to the cloud provider?** No.
* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No.
* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?** No; the feature is outside of any existing paths in the kubelet.
* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** In 1.18, DDoSing the API can lead to resource exhaustion; this is planned to be addressed as part of G.A.
  The feature only collects data when a request comes in; the data is then garbage collected. The data collected is proportional to the number of pods on the node.
### Troubleshooting

* **How does this feature react if the API server and/or etcd is unavailable?** No effect.
* **What are other known failure modes?** No known failure modes.
* **What steps should be taken if SLOs are not being met to determine the problem?** N/A.

[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
## Implementation History

- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
- 2018-10-30: Demo with example GPU monitoring daemonset
- 2018-11-10: KEP lgtm'd and approved
- 2018-11-15: Implementation and e2e test merged before 1.13 release: kubernetes/kubernetes#70508
- 2019-04-30: Demo of production GPU monitoring by NVIDIA
- 2019-04-30: Agreement in sig-node to move feature to beta in 1.15
- 2020-06-17: Agreement in sig-node to move feature to G.A in 1.19 or 1.20
## Alternatives

### Add v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of [CreateContainerRequest](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734)s used to create containers.
* Pros:
  * Reuses an existing API for describing containers rather than inventing a new one
* Cons:
  * Ties the endpoint to the CreateContainerRequest, and may prevent us from adding other information we want in the future
  * Does not contain any additional information that would be useful to monitoring agents other than devices, and contains lots of irrelevant information for this use case
* Notes:
  * Does not include any reference to resource names. Monitoring agents must identify devices by the device or environment variables passed to the pod or container.
### Add a field to Pod Status.
* Pros:
  * Allows for observation of container-to-device bindings local to the node through the `/pods` endpoint
* Cons:
  * Only consumed locally, which doesn't justify an API change
  * Device bindings are immutable after allocation, and are only _debatably_ observable (they can be "observed" from the local checkpoint file). Device bindings are generally a poor fit for status.
### Use the Kubelet Device Manager Checkpoint file
* Allows for observability of device-to-container bindings through what exists in the checkpoint file
* Requires adding additional metadata to the checkpoint file as required by the monitoring agent
* Requires implementing versioning for the checkpoint file, and handling version skew between readers and the kubelet
* Makes future modifications to the checkpoint file more difficult.
### Add a field to the Pod Spec:
* A new object `ComputeDevice` will be defined, and a new variable `ComputeDevices` will be added to the `Container` (Spec) object, representing a list of `ComputeDevice` objects.
```golang
// ComputeDevice describes the devices assigned to this container for a given ResourceName
type ComputeDevice struct {
    // DeviceIDs is the list of devices assigned to this container
    DeviceIDs []string
    // ResourceName is the name of the compute resource
    ResourceName string
}

// Container represents a single container that is expected to be run on the host.
type Container struct {
    ...
    // ComputeDevices contains the devices assigned to this container
    // This field is alpha-level and is only honored by servers that enable the ComputeDevices feature.
    // +optional
    ComputeDevices []ComputeDevice
    ...
}
```
* During kubelet pod admission, if `ComputeDevices` is found non-empty, the specified devices will be allocated; otherwise, behaviour remains the same as it is today.
* Before starting the pod, the kubelet writes the assigned `ComputeDevices` back to the pod spec.
  * Note: Writing to the API server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup.
* Allows devices to potentially be assigned by a custom scheduler.
* Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally.
---

KEP metadata file (47 additions):

```yaml
title: Kubelet endpoint for device assignment observation details
kep-number: 606
authors:
  - "@dashpole"
  - "@vikaschoudhary16"
  - "@renaudwastaken"
owning-sig: sig-node
participating-sigs: []
status: implementable
creation-date: "2018-07-19"
reviewers:
  - "@thockin"
  - "@derekwaynecarr"
  - "@dchen1107"
  - "@vishh"
approvers:
  - "@sig-node-leads"
prr-approvers: []
see-also: []
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: stable

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.20"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.13"
  beta: "v1.15"
  stable: "v1.20"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: "KubeletPodResources"
    components:
      - kubelet
      - kube-controller-manager
disable-supported: false

# The following PRR answers are required at beta release
metrics:
  - pod_resources_endpoint_requests_total
```
