Commit 308ba8d

Merge pull request kubernetes#4045 from ffromani/podres-getalloc-ga
KEP-2403: graduate to GA
2 parents ab06bb1 + 42c90b7 commit 308ba8d

3 files changed: +161 -44 lines changed

keps/prod-readiness/sig-node/2403.yaml

Lines changed: 3 additions & 1 deletion
@@ -2,4 +2,6 @@ kep-number: 2403
22
alpha:
33
approver: "@johnbelamaric"
44
beta:
5-
approver: "@johnbelamaric"
5+
approver: "@johnbelamaric"
6+
stable:
7+
approver: "@johnbelamaric"

keps/sig-node/2403-pod-resources-allocatable-resources/README.md

Lines changed: 154 additions & 39 deletions
@@ -7,6 +7,7 @@
77
- [Summary](#summary)
88
- [Motivation](#motivation)
99
- [Goals](#goals)
10+
- [Non-Goals](#non-goals)
1011
- [Proposal](#proposal)
1112
- [User Stories](#user-stories)
1213
- [Node Feature Discovery](#node-feature-discovery)
@@ -15,16 +16,21 @@
1516
- [Design Details](#design-details)
1617
- [Proposed API](#proposed-api)
1718
- [Test Plan](#test-plan)
19+
- [Prerequisite testing updates](#prerequisite-testing-updates)
20+
- [Unit tests](#unit-tests)
21+
- [Integration tests](#integration-tests)
22+
- [e2e tests](#e2e-tests)
1823
- [Graduation Criteria](#graduation-criteria)
1924
- [Alpha](#alpha)
2025
- [Alpha to Beta Graduation](#alpha-to-beta-graduation)
2126
- [Beta to G.A Graduation](#beta-to-ga-graduation)
27+
- [Deprecation](#deprecation)
2228
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
2329
- [Version Skew Strategy](#version-skew-strategy)
2430
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
25-
- [Feature enablement and rollback](#feature-enablement-and-rollback)
31+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
2632
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
27-
- [Monitoring requirements](#monitoring-requirements)
33+
- [Monitoring Requirements](#monitoring-requirements)
2834
- [Dependencies](#dependencies)
2935
- [Scalability](#scalability)
3036
- [Troubleshooting](#troubleshooting)
@@ -41,11 +47,15 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
4147
- [X] (R) KEP approvers have approved the KEP status as `implementable`
4248
- [X] (R) Design details are appropriately documented
4349
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
50+
- [X] e2e Tests for all Beta API Operations (endpoints)
51+
- [X] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
52+
- [X] (R) Minimum Two Week Window for GA e2e tests to prove flake free
4453
- [X] (R) Graduation criteria is in place
54+
- [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
4555
- [X] (R) Production readiness review completed
4656
- [X] (R) Production readiness review approved
4757
- [X] "Implementation History" section is up-to-date for milestone
48-
- ~~ [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] ~~
58+
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
4959
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
5060

5161
[kubernetes.io]: https://kubernetes.io/
@@ -64,6 +74,10 @@ compute device allocation, thus, alongside the existing pod resources API endpoi
6474

6575
* Enable node monitoring agents to know the allocatable compute resources on a node, thus properly calculate the node compute resource utilization.
6676

77+
### Non-Goals
78+
79+
* Add new endpoint (like kubelet `/pods`)
80+
6781
## Proposal
6882

6983
### User Stories
@@ -172,6 +186,24 @@ the new proposed `GetAllocatableResources` API.
172186

173187
Add additional tests to prove that unhealthy devices are skipped as part of GetAllocatable and empty NUMA topology is not returned.
174188

189+
[X] I/we understand the owners of the involved components may require updates to
190+
existing tests to make this code solid enough prior to committing the changes necessary
191+
to implement this enhancement.
192+
193+
##### Prerequisite testing updates
194+
195+
##### Unit tests
196+
197+
- `k8s.io/kubernetes/pkg/kubelet/api/podresources`: `20230530` - `68.6%`
198+
199+
##### Integration tests
200+
201+
N/A - node local feature covered by e2e test (`test/e2e_node`)
202+
203+
##### e2e tests
204+
205+
- `NodeFeature:PodResources`: https://storage.googleapis.com/k8s-triage/index.html?sig=node&test=NodeFeature%3APodResources
206+
175207
### Graduation Criteria
176208

177209
#### Alpha
@@ -192,6 +224,9 @@ Add additional tests to prove that unhealthy devices are skipped as part of GetA
192224
#### Beta to G.A Graduation
193225
- [X] Allowing time for feedback (1 year).
194226
- [X] Risks have been addressed.
227+
- [X] Rate limiting implemented as part of the podresources endpoint GA graduation (KEP 606).
228+
229+
#### Deprecation
195230

196231
### Upgrade / Downgrade Strategy
197232

@@ -207,67 +242,147 @@ To a vendor changes in the API should always be backwards compatible.
207242
Kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.
208243

209244
## Production Readiness Review Questionnaire
210-
### Feature enablement and rollback
211245

212-
* **How can this feature be enabled / disabled in a live cluster?**
213-
- [X] Feature gate (also fill in values in `kep.yaml`).
214-
- Feature gate name: `KubeletPodResourcesGetAllocatable`.
215-
- Components depending on the feature gate: kubelet, 3rd party consumers.
246+
### Feature Enablement and Rollback
247+
248+
###### How can this feature be enabled / disabled in a live cluster?
249+
250+
- [X] Feature gate (also fill in values in `kep.yaml`)
251+
- Feature gate name: `KubeletPodResourcesGetAllocatable`.
252+
- Components depending on the feature gate: kubelet, 3rd party consumers.
253+
254+
###### Does enabling the feature change any default behavior?
216255

217-
* **Does enabling the feature change any default behavior?** No
218-
* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?** Yes, through feature gates.
219-
* **What happens if we reenable the feature if it was previously rolled back?** The API becomes available again. The API is stateless, so no recovery is needed, clients can just consume the data.
220-
* **Are there any tests for feature enablement/disablement?** A e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code.
256+
No
257+
258+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
259+
260+
Yes, through the feature gate. Once the feature is GA, it is always enabled and can no longer be disabled.
261+
262+
###### What happens if we reenable the feature if it was previously rolled back?
263+
264+
The API becomes available again. The API is stateless, so no recovery is needed, clients can just consume the data.
265+
266+
###### Are there any tests for feature enablement/disablement?
267+
268+
An e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code.
221269

222270
### Rollout, Upgrade and Rollback Planning
223271

224-
* **How can a rollout fail? Can it impact already running workloads?** Kubelet may fail to start. The new API may report inconsistent data, or may cause the kubelet to crash.
225-
* **What specific metrics should inform a rollback?** `pod_resources_endpoint_errors_get_allocatable` - but only with feature gate enabled. Otherwise the API will always return a known error, giving a false negative signal.
226-
* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** Not Applicable.
227-
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** No.
272+
###### How can a rollout or rollback fail? Can it impact already running workloads?
273+
274+
Kubelet may fail to start. The new API may report inconsistent data, or may cause the kubelet to crash.
275+
276+
###### What specific metrics should inform a rollback?
277+
278+
`pod_resources_endpoint_errors_get_allocatable` - but only with feature gate enabled. Otherwise the API will always return a known error, giving a false negative signal.
279+
280+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
281+
282+
Not Applicable.
283+
284+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
285+
286+
No (Not applicable)
228287

229-
### Monitoring requirements
230-
* **How can an operator determine if the feature is in use by workloads?**
231-
- Look at the `pod_resources_endpoint_requests_get_allocatable` metric exposed by the kubelet.
232-
- Clients are connected to the podresources unix socket, for example bychecking which containers mount the podresources socket path.
233-
* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
234-
- [X] Metrics
235-
- Metric name: `pod_resources_endpoint_requests_total`, `pod_resources_endpoint_requests_list`, `pod_resources_endpoint_requests_get_allocatable`, `pod_resources_endpoint_errors_list`, `pod_resources_endpoint_errors_get_allocatable`
236-
- Components exposing the metric: kubelet
288+
### Monitoring Requirements
237289

238-
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A.
239-
* **Are there any missing metrics that would be useful to have to improve observability if this feature?** As part of this feature enhancement, per-API-endpoint resources metrics are being added; to observe this feature the `pod_resources_endpoint_requests_get_allocatable` metric should be used. We will also add error counting metrics to improve the observability of the API.
290+
###### How can an operator determine if the feature is in use by workloads?
240291

292+
- Look at the `pod_resources_endpoint_requests_get_allocatable` metric exposed by the kubelet.
293+
- Check whether clients are connected to the podresources unix socket, for example by checking which containers mount the podresources socket path.
294+
295+
###### How can someone using this feature know that it is working for their instance?
296+
297+
- [ ] Events
298+
- Event Reason:
299+
- [ ] API .status
300+
- Condition name:
301+
- Other field:
302+
- [X] Other (treat as last resort)
303+
- Look at the `pod_resources_endpoint_requests_get_allocatable` and `pod_resources_endpoint_errors_get_allocatable` metrics exposed by the kubelet.
304+
305+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
306+
307+
100% in normal operation. The proposed API exposes, in read-only mode, internal kubelet data that is critical to the functioning of the kubelet.
308+
The kubelet itself needs this data at all times to function properly, so the API is expected to be available 100% of the time.
309+
The only possible error source is the API calls being throttled by the rate-limiting introduced with the GA graduation of the parent KEP 606.
310+
311+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
312+
313+
- [X] Metrics
314+
- Metric name:
315+
- `pod_resources_endpoint_requests_get_allocatable`
316+
- `pod_resources_endpoint_errors_get_allocatable`
317+
- Components exposing the metric: kubelet
318+
319+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
320+
321+
As part of this feature enhancement, per-API-endpoint resources metrics are being added; to observe this feature the `pod_resources_endpoint_requests_get_allocatable` metric should be used.
322+
We added the `pod_resources_endpoint_errors_get_allocatable` metric to report errors. Because of the nature of the API (exposing data already used by the kubelet with minimal processing),
323+
the error counter is expected to stay at zero.
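
The two counters named above can be compared to check that the error count really stays at zero. The following is a minimal sketch, not the kubelet's own tooling. It assumes the metrics are scraped in Prometheus text format, without timestamps, under exactly the names used in this KEP (a real kubelet may expose them with a prefix or labels). The sample payload and the helper names `read_counter` and `get_allocatable_error_ratio` are hypothetical.

```python
from typing import Optional

# Metric names as written in this KEP; the exposed names may differ (assumption).
REQUESTS = "pod_resources_endpoint_requests_get_allocatable"
ERRORS = "pod_resources_endpoint_errors_get_allocatable"

# Made-up payload for illustration only.
SAMPLE = """\
# TYPE pod_resources_endpoint_requests_get_allocatable counter
pod_resources_endpoint_requests_get_allocatable 40
# TYPE pod_resources_endpoint_errors_get_allocatable counter
pod_resources_endpoint_errors_get_allocatable 0
"""


def read_counter(metrics_text: str, name: str) -> Optional[float]:
    """Sum every sample of `name`; return None if the metric is absent."""
    total = None
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "<sample> <value>" with no trailing timestamp.
        sample, _, value = line.rpartition(" ")
        # The sample is either the bare name or the name followed by {labels}.
        if sample == name or sample.startswith(name + "{"):
            total = (total or 0.0) + float(value)
    return total


def get_allocatable_error_ratio(metrics_text: str) -> Optional[float]:
    """Return errors/requests, or None when no request was ever recorded."""
    requests = read_counter(metrics_text, REQUESTS)
    errors = read_counter(metrics_text, ERRORS) or 0.0
    if not requests:
        return None  # endpoint never called: no signal either way
    return errors / requests


if __name__ == "__main__":
    print(get_allocatable_error_ratio(SAMPLE))
```

A ratio above zero would be the signal to inspect, while `None` only means that no client has called the endpoint yet.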
241324

242325
### Dependencies
243326

244-
* **Does this feature depend on any specific services running in the cluster?** Not applicable.
327+
###### Does this feature depend on any specific services running in the cluster?
328+
329+
No
245330

246331
### Scalability
247332

248-
* **Will enabling / using this feature result in any new API calls?** No.
249-
* **Will enabling / using this feature result in introducing new API types?** No.
250-
* **Will enabling / using this feature result in any new calls to cloud provider?** No.
251-
* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No.
252-
* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?** No. Feature is out of existing any paths in kubelet.
253-
* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** DDOSing the API can lead to resource exhaustion. It is planned to be addressed as part of G.A.
254-
Feature only collects data when requests comes in, data is then garbage collected. Data collected is proportional to the number of pods on the node.
333+
###### Will enabling / using this feature result in any new API calls?
334+
335+
No
336+
337+
###### Will enabling / using this feature result in introducing new API types?
338+
339+
No
340+
341+
###### Will enabling / using this feature result in any new calls to the cloud provider?
342+
343+
No
344+
345+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
346+
347+
No
348+
349+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
350+
351+
No. The feature does not affect hot code paths in the kubelet; it just gives access to cached data that the kubelet already computes for internal bookkeeping.
352+
353+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
354+
355+
Negligible amount of CPU and memory, because the endpoint queries existing data structures inside the kubelet.
356+
357+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
358+
359+
No, because the endpoint queries existing data structures inside the kubelet.
255360

256361
### Troubleshooting
257362

258-
* **How does this feature react if the API server and/or etcd is unavailable?**: No effect.
259-
* **What are other known failure modes?** feature gate disabled: the API will always return a well-known error. In normal operation, the API is expected to never return error and always return a valid response, because it utilizes internal kubelet data which is always available. Bugs may lead to the API to return unexpected errors, or to return inconsistent data. Consumers of the API should treat unexpected errors as bugs of this API.
260-
* **What steps should be taken if SLOs are not being met to determine the problem?** N/A
363+
###### How does this feature react if the API server and/or etcd is unavailable?
364+
365+
No impact, the feature is node-local
366+
367+
###### What are other known failure modes?
368+
369+
Feature gate disabled: the API will always return a well-known error. In normal operation, the API is expected to never return an error and to always return
370+
a valid response, because it utilizes internal kubelet data which is always available.
371+
Bugs may lead the API to return unexpected errors or inconsistent data.
372+
Consumers of the API should treat unexpected errors as bugs of this API.
373+
374+
###### What steps should be taken if SLOs are not being met to determine the problem?
261375

262-
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
263-
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
376+
Check the error code to learn if the consumer of the API is being throttled by the rate limiting introduced in the parent KEP 606.
377+
Check the kubelet logs to learn about resource allocation errors.
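
On the consumer side, the distinction drawn above (throttling is an expected condition, anything else is a bug to report) could be handled as in the sketch below. This is only an illustration under stated assumptions: `Throttled` is a hypothetical exception that the caller's own client wrapper would raise when it sees the rate-limit error code, and `call_with_backoff` is a hypothetical helper. The KEP does not specify which error code signals throttling, so that mapping is left to the wrapper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Throttled(Exception):
    """Hypothetical marker: the client wrapper saw the rate-limit error code."""


def call_with_backoff(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` with exponential backoff, but only when it is throttled.

    Any other exception propagates immediately, because unexpected errors
    from this API should be treated as bugs rather than retried away.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Throttled:
            if attempt == attempts - 1:
                raise  # still throttled after the last attempt
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests. A real consumer would pass a closure that performs the `GetAllocatableResources` call.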
264378

265379
## Implementation History
266380

267381
- 2021-02-02: KEP extracted from [previous iteration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments)
268382
- 2021-02-04: KEP polished, added feature gate, clarified the graduation criteria.
269383
- 2021-02-08: KEP updated adding per-specific-endpoint metrics to the podresources API and clarifying failure modes.
270384
- 2021-09-02: KEP updated to explicitly clarify the behavior of `GetAllocatableResources` and graduate to Beta in 1.23.
385+
- 2023-05-30: KEP updated to the new template and to graduate to GA in 1.28
271386

272387
## Alternatives
273388

keps/sig-node/2403-pod-resources-allocatable-resources/kep.yaml

Lines changed: 4 additions & 4 deletions
@@ -1,14 +1,14 @@
11
title: Extend kubelet pod resource assignment endpoint to return allocatable resources
22
kep-number: 2403
33
authors:
4-
- "@fromanirh"
4+
- "@ffromani"
55
- "@alexeyperevalov"
66
- "@swatisehgal"
77
owning-sig: sig-node
88
participating-sigs: []
99
status: implementable
1010
creation-date: "2021-02-02"
11-
last-updated: "2021-09-02"
11+
last-updated: "2023-06-06"
1212
reviewers:
1313
- "@derekwaynecarr"
1414
- "@renaudwastaken"
@@ -26,13 +26,13 @@ stage: beta
2626
# The most recent milestone for which work toward delivery of this KEP has been
2727
# done. This can be the current (upcoming) milestone, if it is being actively
2828
# worked on.
29-
latest-milestone: "v1.23"
29+
latest-milestone: "v1.28"
3030

3131
# The milestone at which this feature was, or is targeted to be, at each stage.
3232
milestone:
3333
alpha: "v1.21"
3434
beta: "v1.23"
35-
stable: "v1.24"
35+
stable: "v1.28"
3636

3737
# The following PRR answers are required at alpha release
3838
# List the feature gate name and the components for which it must be enabled
