
Commit 8238ac0

Merge pull request kubernetes#2417 from smarterclayton/resource_kep
Update pod resource metrics KEP for beta with PRR
2 parents: 3d0f82d + 0ccf479

3 files changed: +210 -0 lines changed
Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
kep-number: 1748
beta:
  approver: "@johnbelamaric"

keps/sig-instrumentation/1748-pod-resource-metrics/README.md

Lines changed: 200 additions & 0 deletions
@@ -102,6 +102,13 @@ tags, and then generate with `hack/update-toc.sh`.
- [Beta -> GA Graduation](#beta---ga-graduation)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
@@ -409,11 +416,204 @@ enhancement:
CRI or CNI may require updating that component before the kubelet.
-->

## Production Readiness Review Questionnaire

<!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.

The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.

In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.

The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [ ] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name:
    - Components depending on the feature gate:
  - [x] Other
    - Describe the mechanism: A metrics collector may scrape the `/metrics/resources` endpoint of all schedulers, as long as the scheduler exposes metrics of the required stability level. An illustrative scrape configuration is sketched after this list.
    - Will enabling / disabling the feature require downtime of the control plane?
    - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
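
To make the scrape mechanism above concrete, the following is a minimal, illustrative Prometheus scrape configuration; it is not part of this KEP, and the job name, placeholder target address, and secure port are assumptions that depend on how the scheduler is deployed.

```yaml
# Hypothetical scrape job for the scheduler's resource metrics endpoint.
# The target address must be adjusted to the actual deployment; 10259 is the
# default kube-scheduler secure serving port.
scrape_configs:
  - job_name: kube-scheduler-resources
    scheme: https
    metrics_path: /metrics/resources        # endpoint described by this KEP
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true            # or supply the scheduler serving CA
    static_configs:
      - targets: ['kube-scheduler.kube-system.svc:10259']   # placeholder target
```

Disabling collection is simply the reverse: remove or pause this scrape job.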

* **Does enabling the feature change any default behavior?**

  Scraping these metrics does not change the behavior of the system.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**

  Yes, in order of increasing effort or impact to other areas:

  * Administrators may stop scraping the endpoint, which means the metrics are no longer available and any impact caused by scraping stops.
  * The administrator may change the RBAC permissions on the delegated auth for the metrics endpoint to deny access to clients, if a client is excessively targeting metrics and cannot be stopped (a sketch of such a policy follows this list).
  * The administrator may change the HTTP server arguments on the scheduler (the `--port` arguments) to disable serving this information, but doing so may require other changes to scheduler configuration, as it will also disable health checks and standard metrics.
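
As an illustration of the RBAC option above (not prescribed by this KEP), read access to the endpoint could be restricted to a narrowly bound role; the role, binding, service account, and namespace names here are hypothetical.

```yaml
# Hypothetical ClusterRole granting read access only to the resource metrics
# path; binding it narrowly (and removing broader bindings) limits which
# clients can pass delegated authorization for this endpoint.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: scheduler-resource-metrics-reader
rules:
  - nonResourceURLs: ["/metrics/resources"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: scheduler-resource-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: scheduler-resource-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus            # hypothetical collector service account
    namespace: monitoring       # hypothetical namespace
```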

* **What happens if we reenable the feature if it was previously rolled back?**

  Metrics will start being collected again.

* **Are there any tests for feature enablement/disablement?**

  As an opt-in metrics endpoint, enablement is covered by our integration tests.

### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?**

  This cannot impact running workloads unless an unlikely performance issue is triggered by
  excessive scraping of the scheduler metrics endpoints (which is already possible today).

  Since the new metrics are proportionally fewer than the metrics an apiserver or node exposes,
  it is unlikely that scraping this endpoint would break a metrics collector.

* **What specific metrics should inform a rollback?**

  Excessive CPU use from the Kube scheduler when metrics are scraped at a reasonable rate,
  although simply disabling the optional scraping while waiting for the bug to be fixed would be
  a more reasonable path. An illustrative alert for this signal is sketched below.
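
For illustration only, an alerting rule like the following could surface the excessive CPU use described above; the job label, the ~80%-of-one-core threshold, and the durations are assumptions rather than recommendations from this KEP.

```yaml
# Hypothetical Prometheus alerting rule; adjust the job label and threshold
# to the cluster's monitoring setup.
groups:
  - name: scheduler-resource-metrics-cpu
    rules:
      - alert: SchedulerHighCPUWhileScraped
        # process_cpu_seconds_total comes from the scheduler's standard metrics;
        # 0.8 means roughly 80% of one core averaged over 10 minutes.
        expr: rate(process_cpu_seconds_total{job="kube-scheduler"}[10m]) > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: kube-scheduler CPU is unexpectedly high while resource metrics are being scraped
```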

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

  Does not apply.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
  fields of API types, flags, etc.?**

  No.

### Monitoring Requirements

* **How can an operator determine if the feature is in use by workloads?**

  This would be up to the metrics collector component, whose API is not within the
  scope of the Kubernetes project. Some third-party software may use these metrics
  as part of a control loop or visualization, but that is entirely up to the metrics
  collector.

  Administrators and visualization tools are the primary target of these metrics, and
  so polling and canvassing of Kube distributions is one source of feedback.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
  the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [x] Other (treat as last resort)
    - Details: Covered by existing scheduler SLIs (health check, CPU use, pod scheduling rate, HTTP request counts).

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

  The existing scheduler SLOs should be sufficient, and this change should have no measurable impact on them.

  The metrics endpoint should consume a tiny fraction of the scheduler's CPU (less than 5% at idle) when scraped
  every 15s. The endpoint should return quickly (tens of milliseconds at P99) when O(pods) is below 10,000. CPU and
  latency should be proportional to the number of pods only, like the rest of the scheduler, and the metrics
  endpoint should scale linearly with that factor. An illustrative alert on scrape latency is sketched below.
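
For illustration only, scrape latency against the endpoint can be watched from the collector side using `scrape_duration_seconds`, which Prometheus records for every target; the job label and the 100ms threshold below are assumptions, not requirements of this KEP.

```yaml
# Hypothetical alert on sustained slow scrapes of the resource metrics job.
groups:
  - name: scheduler-resource-metrics-latency
    rules:
      - alert: ResourceMetricsScrapeSlow
        expr: scrape_duration_seconds{job="kube-scheduler-resources"} > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: /metrics/resources scrapes are slower than the expected tens of milliseconds
```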

* **Are there any missing metrics that would be useful to have to improve observability
  of this feature?**

  No.

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**

  - Scheduler
    - Hosts the metrics
  - Metrics collector
    - Scrapes the endpoint
    - May run on or off cluster

### Scalability

* **Will enabling / using this feature result in any new API calls?**

  No, this pulls directly from the scheduler's informer cache.

* **Will enabling / using this feature result in introducing new API types?**

  No.

* **Will enabling / using this feature result in any new calls to the cloud
  provider?**

  No.

* **Will enabling / using this feature result in increasing size or count of
  the existing API objects?**

  No.

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs]?**

  The CPU usage of this feature when activated should have a negligible effect on
  scheduler throughput and latency. No additional memory usage is expected.

* **Will enabling / using this feature result in non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**

  Negligible CPU use is expected, along with some increase in network transmit when the scheduler
  is scraped.

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.

* **How does this feature react if the API server and/or etcd is unavailable?**

  It returns the metrics for the last set of data received by the scheduler, or no
  metrics if the scheduler has been restarted since being partitioned from the API server.

* **What are other known failure modes?**

  - Panic due to an unexpected code path or incomplete API objects returned in watch
    - Detection: The scrape of the component should fail (an illustrative alert is sketched after this list)
    - Mitigations: Stop scraping the endpoint
    - Diagnostics: Panic messages in the scheduler logs
    - Testing: We do not inject fake panics because the behavior of metrics endpoints is well known and there is no background processing.
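
A minimal sketch (not part of this KEP) of detecting such scrape failures from the collector side uses the synthetic `up` series that Prometheus records per target; the job name is an assumption.

```yaml
# Hypothetical alert that fires when scrapes of the resource metrics job fail,
# for example because the scheduler panicked while serving the endpoint.
groups:
  - name: scheduler-resource-metrics-availability
    rules:
      - alert: ResourceMetricsScrapeFailing
        expr: up{job="kube-scheduler-resources"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: scrapes of the scheduler /metrics/resources endpoint are failing
```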

* **What steps should be taken if SLOs are not being met to determine the problem?**

  Perform a golang CPU profile of the scheduler and assess the percentage of CPU charged to the functions
  that generate these metrics. If they exceed 5% of total usage, identify which methods are hotspots.
  Look for unexpected allocations via a heap profile (the metrics endpoint should not generate many, if any,
  allocations on the heap).

## Implementation History

* 2020/04/07 - [Prototyped](https://github.com/openshift/openshift-controller-manager/pull/90) in OpenShift after receiving feedback that resource metrics were opaque and difficult to alert on
* 2020/04/21 - Discussed in sig-instrumentation and decided to move forward as KEP
* 2020/07/30 - KEP draft
* 2020/11/12 - Merged implementation https://github.com/kubernetes/kubernetes/pull/94866 for 1.20 Alpha

<!--
Major milestones in the life cycle of a KEP should be tracked in this section.

keps/sig-instrumentation/1748-pod-resource-metrics/kep.yaml

Lines changed: 7 additions & 0 deletions

@@ -16,5 +16,12 @@ approvers:
  - "@brancz"
  - "@dashpole"
  - "@ahg-g"
prr-approvers:
  - "@johnbelamaric"
see-also:
replaces:
latest-milestone: "v1.21"
milestone:
  alpha: "v1.20"
  beta: "v1.21"
  stable: "v1.22"
