Commit bcda68c

PodOverhead to GA (kubernetes#3146)
* PodOverhead to GA

* Update keps/sig-node/688-pod-overhead/kep.yaml

Co-authored-by: Elana Hashman <[email protected]>

* Update keps/sig-node/688-pod-overhead/README.md

Co-authored-by: Elana Hashman <[email protected]>

Co-authored-by: Elana Hashman <[email protected]>
1 parent dcadd53 commit bcda68c

3 files changed

+163
-13
lines changed

keps/prod-readiness/sig-node/688.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 688
2+
stable:
3+
approver: "@ehashman"

keps/sig-node/688-pod-overhead/README.md

Lines changed: 145 additions & 10 deletions
@@ -26,21 +26,28 @@
2626
- [Alternatives](#alternatives)
2727
- [Introduce pod level resource requirements](#introduce-pod-level-resource-requirements)
2828
- [Leaving the PodSpec unchanged](#leaving-the-podspec-unchanged)
29+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
30+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
31+
- [Monitoring Requirements](#monitoring-requirements)
32+
- [Dependencies](#dependencies)
33+
- [Scalability](#scalability)
34+
- [Troubleshooting](#troubleshooting)
2935
- [Implementation History](#implementation-history)
3036
- [Version 1.16](#version-116)
3137
- [Version 1.18](#version-118)
38+
- [Version 1.24](#version-124)
3239
<!-- /toc -->
3340

3441
## Release Signoff Checklist
3542

36-
- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
37-
- [ ] KEP approvers have set the KEP status to `implementable`
38-
- [ ] Design details are appropriately documented
39-
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
40-
- [ ] Graduation criteria is in place
41-
- [ ] "Implementation History" section is up-to-date for milestone
42-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
43-
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
43+
- [X] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
44+
- [X] KEP approvers have set the KEP status to `implementable`
45+
- [X] Design details are appropriately documented
46+
- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
47+
- [X] Graduation criteria is in place
48+
- [X] "Implementation History" section is up-to-date for milestone
49+
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
50+
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
4451

4552
**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those
4653
approvers is no longer appropriate then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross cutting KEPs).
@@ -311,12 +318,140 @@ Cons:
311318
* Not user perceptible from a workload perspective.
312319
* very complicated if the runtimeClass policy changes after workloads are running
313320

321+
## Production Readiness Review Questionnaire
322+
323+
### Feature Enablement and Rollback
324+
325+
<!--
326+
This section must be completed when targeting alpha to a release.
327+
-->
328+
329+
Skipping this section as the feature was already rolled out to all supported k8s versions.
330+
331+
### Monitoring Requirements
332+
333+
<!--
334+
This section must be completed when targeting beta to a release.
335+
-->
336+
337+
###### How can an operator determine if the feature is in use by workloads?
338+
339+
Using metrics mentioned in documentation https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/#observability:
340+
341+
- `kube_pod_overhead_cpu_cores`
342+
- `kube_pod_overhead_memory_bytes`
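
As a rough illustration of using these metrics, the sketch below scans Prometheus text-format output for the two series and lists the pods that report a non-zero overhead. The sample payload, the `namespace` and `pod` label names, and the `pods_with_overhead` helper are assumptions made for illustration, not output from a real cluster.

```python
import re

# Metric names as listed above; everything else in this sketch is illustrative.
OVERHEAD_METRICS = ("kube_pod_overhead_cpu_cores", "kube_pod_overhead_memory_bytes")

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def pods_with_overhead(exposition: str) -> set:
    """Return (namespace, pod) pairs that report a non-zero overhead sample."""
    found = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") not in OVERHEAD_METRICS:
            continue
        if float(match.group("value")) <= 0:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        found.add((labels.get("namespace", ""), labels.get("pod", "")))
    return found


# Hypothetical scrape output; the label names are assumed.
SAMPLE = """
# TYPE kube_pod_overhead_cpu_cores gauge
kube_pod_overhead_cpu_cores{namespace="default",pod="sandboxed"} 0.25
kube_pod_overhead_memory_bytes{namespace="default",pod="sandboxed"} 125829120
some_other_metric{namespace="default",pod="plain"} 1
"""

print(pods_with_overhead(SAMPLE))
```

An empty result would suggest that no scraped pod currently carries an overhead, assuming the metrics are exported at all.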
343+
344+
###### How can someone using this feature know that it is working for their instance?
345+
346+
Using metrics mentioned in documentation https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/#observability:
347+
348+
- `kube_pod_overhead_cpu_cores`
349+
- `kube_pod_overhead_memory_bytes`
350+
351+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
352+
353+
The ultimate SLO is that a Pod that stays within its set limits is not evicted
354+
because of the overhead introduced by the runtime. Because estimating the
355+
resources that a Pod and its runtime use is complex, this is hard to measure.
356+
357+
The closest approximation to the intended SLO is that the Pod's `Overhead` is
358+
updated on admission and cgroups are adjusted as needed.
359+
360+
Since RuntimeClass Admission controller logic is straightforward and does not
361+
introduce any new API calls, just one value assignment, Pod scheduling
362+
latency is not affected by this feature.
363+
364+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
365+
366+
Excessive Pod evictions on a specific runtime that specifies an Overhead may
367+
indicate that the feature is not working. However, this is a very unreliable
368+
proxy: there is a good chance that the evictions are caused by Pod or
369+
runtime behavior.
370+
371+
Checking Pod object and cgroup settings as described in [Usage Example](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/#usage-example)
372+
section of the documentation may be used as a good proxy to check that the
373+
feature is functional.
374+
375+
Finally, increased pod scheduling latency may indicate an issue with the
376+
RuntimeClass admission controller.
377+
378+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
379+
380+
No
381+
382+
### Dependencies
383+
384+
No
385+
386+
###### Does this feature depend on any specific services running in the cluster?
387+
388+
The feature depends on the RuntimeClass admission controller being present.
389+
390+
### Scalability
391+
392+
393+
###### Will enabling / using this feature result in any new API calls?
394+
395+
No, RuntimeClass is already being checked for every pod in RuntimeClass
396+
Admission Controller and PodOverhead assignment doesn't introduce any new API
397+
calls. Same for the Kubelet.
398+
399+
###### Will enabling / using this feature result in introducing new API types?
400+
401+
No
402+
403+
###### Will enabling / using this feature result in any new calls to the cloud provider?
404+
405+
No
406+
407+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
408+
409+
Every Pod that is scheduled for the RuntimeClass with the Overhead specified
410+
will carry two additional values for the `Overhead` structure.
411+
412+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
413+
414+
No
415+
416+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
417+
418+
N/A.
419+
420+
Note, specifying PodOverhead will increase the allocated resources for pods by design.
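
To make the note above concrete, here is a minimal sketch of the arithmetic: the overhead is added once per pod on top of the summed container requests. The `effective_requests` helper and all values are hypothetical, init containers are ignored, and quantities are plain integers (CPU in millicores, memory in MiB) rather than Kubernetes quantity strings.

```python
def effective_requests(container_requests, overhead):
    """Sum per-container requests, then add the pod overhead once per pod."""
    total = {}
    for requests in container_requests:
        for resource, amount in requests.items():
            total[resource] = total.get(resource, 0) + amount
    for resource, amount in overhead.items():
        total[resource] = total.get(resource, 0) + amount
    return total


# Hypothetical pod: two containers plus a fixed overhead from its RuntimeClass.
containers = [{"cpu": 500, "memory": 100}, {"cpu": 1500, "memory": 100}]
overhead = {"cpu": 250, "memory": 120}

print(effective_requests(containers, overhead))  # {'cpu': 2250, 'memory': 320}
```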
421+
422+
### Troubleshooting
423+
424+
Documentation has troubleshooting steps: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/
425+
426+
###### How does this feature react if the API server and/or etcd is unavailable?
427+
428+
No dependency on etcd availability.
429+
430+
###### What are other known failure modes?
431+
432+
No
433+
434+
###### What steps should be taken if SLOs are not being met to determine the problem?
435+
436+
- Validate the RuntimeClass Admission controller is functional
437+
- Validate that Pod objects are updated correctly
438+
- Validate that cgroups are updated correctly
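
The second step above can be sketched as a comparison between a Pod object and its RuntimeClass. This assumes both objects are already loaded as dictionaries (for example from JSON output); the `overhead_matches` helper and the sample objects are hypothetical. Quantities are compared as literal strings, so equivalent values written differently (such as `0.25` and `250m`) would be reported as a mismatch.

```python
def overhead_matches(pod: dict, runtime_class: dict) -> bool:
    """True when the pod's spec.overhead equals the RuntimeClass overhead.podFixed."""
    spec = pod.get("spec", {})
    class_name = runtime_class.get("metadata", {}).get("name")
    if spec.get("runtimeClassName") != class_name:
        return False
    expected = runtime_class.get("overhead", {}).get("podFixed", {})
    # Literal comparison: no normalization of quantity strings is attempted.
    return spec.get("overhead", {}) == expected


# Hypothetical objects for illustration only.
RUNTIME_CLASS = {
    "metadata": {"name": "kata-fc"},
    "overhead": {"podFixed": {"cpu": "250m", "memory": "120Mi"}},
}
POD = {
    "spec": {
        "runtimeClassName": "kata-fc",
        "overhead": {"cpu": "250m", "memory": "120Mi"},
    }
}

print(overhead_matches(POD, RUNTIME_CLASS))  # True
```

A `False` result for a pod that does reference the RuntimeClass would point back at the first step, checking the admission controller.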
439+
314440
## Implementation History
315441

316-
2019-04-04: Initial KEP published.
442+
- 2019-04-04: Initial KEP published.
317443

318444
### Version 1.16
445+
319446
- Implemented as Alpha.
320447

321448
### Version 1.18
322-
- Promoted to Beta.
449+
450+
- Promoted to Beta.
451+
452+
### Version 1.24
453+
454+
1. Production usage: https://github.com/openshift/sandboxed-containers-operator/blob/0edbfbf353945dec4066a6d127bf9d88fbbc80a7/controllers/openshift_controller.go#L342
455+
2. Documentation is in place: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/
456+
457+
- Promoted to stable

keps/sig-node/688-pod-overhead/kep.yaml

Lines changed: 15 additions & 3 deletions
@@ -14,7 +14,19 @@ reviewers:
1414
approvers:
1515
- "@dchen1107"
1616
- "@derekwaynecarr"
17-
editor: Eric Ernst
17+
stage: stable
18+
latest-milestone: "v1.24"
19+
milestone:
20+
alpha: "v1.16"
21+
beta: "v1.18"
22+
stable: "v1.24"
1823
creation-date: 2019-02-26
19-
last-updated: 2020-10-27
20-
status: implemented (beta)
24+
last-updated: 2022-01-14
25+
status: implementable
26+
feature-gates:
27+
- name: PodOverhead
28+
components:
29+
- kubelet
30+
- kube-apiserver
31+
- kube-scheduler
32+
- controller-manager
