- [Add v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of CreateContainerRequests used to create containers.](#add-v1alpha1-kubelet-grpc-service-at--which-returns-a-list-of-createcontainerrequests-used-to-create-containers)
- [Add a field to Pod Status.](#add-a-field-to-pod-status)
- [Use the Kubelet Device Manager Checkpoint file](#use-the-kubelet-device-manager-checkpoint-file)
- [Add a field to the Pod Spec:](#add-a-field-to-the-pod-spec)

In this document we discuss the motivation and code changes required for introducing a kubelet endpoint that exposes device-to-container bindings.

## Release Signoff Checklist
Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/606)
- [X] (R) KEP approvers have approved the KEP status as `implementable`
- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [X] (R) Production readiness review completed
- [X] Production readiness review approved
- [X] "Implementation History" section is up-to-date for milestone
- ~~[ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]~~
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

This document presents the kubelet endpoint which allows third party consumers to inspect the mapping between devices and pods.

## Motivation
[Device Monitoring](https://docs.google.com/document/d/1NYnqw-HDQ6Y3L_mk85Q3wkxDtGNWTxpsedsgw4NgWpg/edit?usp=sharing) in Kubernetes is expected to be implemented out of the kubernetes tree.
For the metrics to be relevant to cluster administrators or pod owners, they need to be matched to a specific container and pod (e.g.: GPU utilization for pod X).
As such, the external monitoring agents need to be able to determine the set of devices in-use by containers and attach pod and container metadata to the metrics.
### Goals
* Deprecate and remove current device-specific knowledge from the kubelet, such as [accelerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229)
* Enable external device monitoring agents to provide metrics relevant to Kubernetes
## Proposal
### User Stories
* As a _Cluster Administrator_, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to the [node monitoring guidelines](https://docs.google.com/document/d/1_CdNWIjPBqVDMvu82aJICQsSCbh2BR-y9a8uXjQm4TI/edit?usp=sharing), so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors.
* As a _Device Vendor_, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics:

### Risks and Mitigations

| Risk | Impact | Mitigation |
| ---- | ------ | ---------- |
| Too many requests risk impacting kubelet performance | High | Implement rate limiting and/or passive caching, and follow best practices for gRPC resource management (see the sketch below this table). |
| Improper access to the data | Low | The server listens on a root-owned unix socket. This can be limited with proper pod security policies. |
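
As a hedged illustration of the rate-limiting mitigation above, a token-bucket interceptor could be attached to the kubelet-side gRPC server roughly as follows. The limits and server wiring are assumptions made for this sketch, not the kubelet's actual implementation.

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// rateLimitInterceptor rejects calls once the shared token bucket is empty,
// shielding the kubelet from an abusive or misbehaving monitoring agent.
func rateLimitInterceptor(limiter *rate.Limiter) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if !limiter.Allow() {
			return nil, status.Error(codes.ResourceExhausted, "pod resources API rate limit exceeded")
		}
		return handler(ctx, req)
	}
}

func newServer() *grpc.Server {
	// Example limits: a sustained 5 requests/second with bursts of up to 10.
	limiter := rate.NewLimiter(rate.Limit(5), 10)
	return grpc.NewServer(grpc.UnaryInterceptor(rateLimitInterceptor(limiter)))
}

func main() {
	_ = newServer() // the PodResources service would be registered on this server
}
```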
## Design Details
### Proposed API
We propose to add a new gRPC service to the Kubelet. This gRPC service would be listening on a unix socket at `/var/lib/kubelet/pod-resources/kubelet.sock` and return information about the kubelet's assignment of devices to containers.
This information is obtained from the internal state of the kubelet's Device Manager. The gRPC service has a single function named `List`; the v1 API is shown below:
```protobuf
// PodResources is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node

// [... service and message definitions elided ...]

message ContainerDevices {
    // [...]
}
```
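
For illustration, a minimal monitoring-agent client for this API could look like the sketch below. It assumes the generated Go bindings for the pod resources API (imported here as `podresourcesapi` from `k8s.io/kubelet/pkg/apis/podresources/v1`; the exact import path and message names depend on the Kubernetes release) and simply dials the unix socket and walks the `List` response.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

const socketPath = "/var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the unix socket exposed by the kubelet.
	conn, err := grpc.DialContext(ctx, socketPath,
		grpc.WithInsecure(),
		grpc.WithBlock(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}

	// Walk the response: one entry per pod, with the devices assigned to each container.
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				fmt.Printf("%s/%s container=%s resource=%s devices=%v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```

A monitoring agent would typically attach this pod/container metadata to device metrics it collects from the vendor's own tooling.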
### Test Plan

Given that the API allows observing which devices have been assigned to which containers, we need to test different configurations, such as:

* Pods without devices assigned to any containers.
* Pods with devices assigned to some but not all containers.
* Pods with devices assigned to init containers.
* ...

We have identified two main ways of testing this API:

- Unit tests, which won't rely on gRPC. They will exercise different configurations of pods and devices (a sketch follows this list).
- Node E2E tests, which will allow us to test the service itself.
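
The following is a rough, hypothetical sketch of how such a unit-test matrix could be laid out as table-driven Go tests. `fakeAssignment` and `listDevices` are placeholders standing in for the kubelet's fake device manager and the `List` implementation under test; they are not the real kubelet code.

```go
package podresources_test

import (
	"reflect"
	"testing"
)

// fakeAssignment describes one device assignment in a test fixture.
type fakeAssignment struct {
	container string
	resource  string
	deviceIDs []string
}

// listDevices stands in for the code under test; real tests would call the kubelet's
// List implementation backed by a fake device manager.
func listDevices(assignments []fakeAssignment) map[string][]string {
	out := map[string][]string{}
	for _, a := range assignments {
		out[a.container] = append(out[a.container], a.deviceIDs...)
	}
	return out
}

func TestListDeviceConfigurations(t *testing.T) {
	cases := []struct {
		name        string
		assignments []fakeAssignment
		want        map[string][]string
	}{
		{
			name:        "pod without devices",
			assignments: []fakeAssignment{{container: "c1"}},
			want:        map[string][]string{"c1": nil},
		},
		{
			name: "devices on some containers only",
			assignments: []fakeAssignment{
				{container: "c1", resource: "vendor.com/gpu", deviceIDs: []string{"dev0"}},
				{container: "c2"},
			},
			want: map[string][]string{"c1": {"dev0"}, "c2": nil},
		},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			got := listDevices(tc.assignments)
			if !reflect.DeepEqual(got, tc.want) {
				t.Fatalf("got %v, want %v", got, tc.want)
			}
		})
	}
}
```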
E2E tests are explicitly not written because they would require us to generate and deploy a custom container.
The infrastructure required is expensive and it is not clear what additional testing (and hence risk reduction) this would provide compared to node e2e tests.
### Graduation Criteria

#### Alpha

- [X] Implement the new service API.
- [X] [Ensure proper e2e node tests are in place](https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=DevicePluginProbe).

#### Alpha to Beta Graduation

- [X] Demonstrate that the endpoint can be used to replace in-tree GPU device metrics in production environments (NVIDIA, sig-node April 30, 2019).

#### Beta to GA Graduation

- [X] Multiple real world examples ([Multus CNI](https://github.com/intel/multus-cni)).
- [X] Allowing time for feedback (2 years).
- [X] [Start deprecation of Accelerator metrics in kubelet](https://github.com/kubernetes/kubernetes/pull/91930).
- [X] Risks have been addressed.

### Upgrade / Downgrade Strategy

With gRPC, the version is part of the service name. Old and new versions should always be served and listened for by the kubelet.
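
To make that concrete, the toy snippet below prints example fully qualified gRPC method names. The exact service and package names are assumptions (they depend on the Kubernetes release), but the version prefix in the method name is what lets the kubelet serve old and new versions side by side on the same socket.

```go
package main

import "fmt"

func main() {
	// Hypothetical fully qualified method names; check the generated API of your
	// Kubernetes release for the exact service and package names.
	methods := []string{
		"/v1alpha1.PodResourcesLister/List", // older clients keep working
		"/v1.PodResourcesLister/List",       // newer clients move to the v1 service
	}
	for _, m := range methods {
		fmt.Println("served on /var/lib/kubelet/pod-resources/kubelet.sock:", m)
	}
}
```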

For a cluster admin, upgrading to the newest API version means upgrading Kubernetes to a newer version as well as upgrading the monitoring component.

For a vendor, changes in the API should always be backwards compatible.

Downgrades here are related to downgrading the plugin.

### Version Skew Strategy
Kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.
## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [X] Feature gate (also fill in values in `kep.yaml`).
    - Feature gate name: `KubeletPodResources`.
    - Components depending on the feature gate: N/A.

* **Does enabling the feature change any default behavior?** No.
* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?** Yes, through feature gates.
* **What happens if we reenable the feature if it was previously rolled back?** The service recovers state from the kubelet.
* **Are there any tests for feature enablement/disablement?** No, however no data is created or deleted.
### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?** Kubelet would fail to start. Errors would be caught in the CI.
* **What specific metrics should inform a rollback?** Not Applicable, metrics wouldn't be available.
* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** Not Applicable.
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** No.

### Monitoring Requirements

* **How can an operator determine if the feature is in use by workloads?**
  - Look at the `pod_resources_requests_total` metric exposed by the kubelet.
  - Look at hostPath mounts of privileged containers.

* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
  - [X] Metrics
    - Metric name: `pod_resources_requests_total`
    - Components exposing the metric: kubelet

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A, or refer to Kubelet SLIs.
* **Are there any missing metrics that would be useful to have to improve observability of this feature?** No.

### Dependencies

* **Does this feature depend on any specific services running in the cluster?** Not applicable.

### Scalability

* **Will enabling / using this feature result in any new API calls?** No.
* **Will enabling / using this feature result in introducing new API types?** No.
* **Will enabling / using this feature result in any new calls to the cloud provider?** No.
* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No.
* **Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?** No. The feature is outside of any existing paths in the kubelet.
* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** In 1.18, DDoSing the API can lead to resource exhaustion. This is planned to be addressed as part of GA.

The feature only collects data when requests come in; the data is then garbage collected. The amount of data collected is proportional to the number of pods on the node.

### Troubleshooting

* **How does this feature react if the API server and/or etcd is unavailable?** No effect.
* **What are other known failure modes?** No known failure modes.
* **What steps should be taken if SLOs are not being met to determine the problem?** N/A.

## Implementation History

- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
- 2018-10-30: Demo with example gpu monitoring daemonset
- 2018-11-10: KEP lgtm'd and approved
- 2018-11-15: Implementation and e2e test merged before 1.13 release: kubernetes/kubernetes#70508
- 2019-04-30: Demo of production GPU monitoring by NVIDIA
- 2019-04-30: Agreement in sig-node to move feature to beta in 1.15
- 2020-06-17: Agreement in sig-node to move feature to GA in 1.19 or 1.20

## Alternatives
### Add v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of [CreateContainerRequest](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734)s used to create containers.
* Pros:
* Notes:
  * Does not include any reference to resource names. Monitoring agents must identify devices by the device or environment variables passed to the pod or container.

### Add a field to Pod Status.
* Pros:
  * Allows for observation of container to device bindings local to the node through the `/pods` endpoint
* Cons:
* Note: Writing to the API server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup.
* Allows devices to potentially be assigned by a custom scheduler.
* Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally.