[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [ ] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name:
    - Components depending on the feature gate:
  - [x] Other
    - Describe the mechanism: A metrics collector may scrape the `/metrics/resources` endpoint of all schedulers, as long as the scheduler exposes metrics of the required stability level (see the sketch after this list).
    - Will enabling / disabling the feature require downtime of the control
      plane?
    - Will enabling / disabling the feature require downtime or reprovisioning
      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).

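As an illustration only, the following minimal Go sketch shows how a collector might scrape the endpoint. The scheduler address, secure port (10259 is the kube-scheduler default), and service account token path are assumptions that depend on the deployment, and TLS verification is skipped purely for brevity; a real collector such as Prometheus would use a proper scrape configuration and CA bundle.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Assumed values: the kube-scheduler default secure port and the
	// conventional in-cluster service account token path.
	const endpoint = "https://127.0.0.1:10259/metrics/resources"
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	client := &http.Client{Transport: &http.Transport{
		// Skipping TLS verification for brevity only; do not do this in production.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	req, err := http.NewRequest("GET", endpoint, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("status=%d bytes=%d\n", resp.StatusCode, len(body))
}
```

The client identity used for the scrape needs delegated auth permissions on the metrics path, which is the same lever an administrator can later use to revoke access.
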
* **Does enabling the feature change any default behavior?**

Scraping these metrics does not change the behavior of the system.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**

Yes, in order of increasing effort or impact to other areas:

* Administrators may stop scraping the endpoint, which will mean the metrics are not available and any impact caused by scraping will stop.
* The administrator may change the RBAC permissions on the delegated auth for the metrics endpoint to deny access to clients if a client is excessively targeting metrics and cannot be stopped.
* The administrator may change the HTTP server arguments on the scheduler to disable information about the scheduler via the `--port` arguments, but doing so may require other changes to scheduler configuration because this will also disable health checks and standard metrics.

* **What happens if we reenable the feature if it was previously rolled back?**

Metrics will start being collected again.

* **Are there any tests for feature enablement/disablement?**

As an opt-in metrics endpoint, enablement is covered by our integration tests.

### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?**

This cannot impact running workloads unless an unlikely performance issue is triggered by
excessive scraping of the scheduler metrics endpoints (which is already possible today).

Since the new metrics are proportionally fewer than the metrics an apiserver or node exposes,
it is unlikely that scraping this endpoint would break a metrics collector.

* **What specific metrics should inform a rollback?**

Excessive CPU use by the kube-scheduler while metrics are scraped at a reasonable rate
would justify a rollback, although simply disabling the optional scraping while waiting
for the bug to be fixed would be a more reasonable path.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

Does not apply.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
  fields of API types, flags, etc.?**

No.

### Monitoring Requirements

* **How can an operator determine if the feature is in use by workloads?**

This would be up to the metrics collector component, whose API is outside the
scope of the Kubernetes project. Some third-party software may use these metrics
as part of a control loop or visualization, but that is entirely up to the metrics
collector.

Administrators and visualization tools are the primary targets of these metrics, so
polling and canvassing of Kubernetes distributions is one source of feedback.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
  the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [x] Other (treat as last resort)
    - Details: Covered by existing scheduler SLIs (health check, CPU use, pod scheduling rate, HTTP request counts).

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

The existing scheduler SLOs should be sufficient, and this change should have no measurable impact on the existing SLO.

The metrics endpoint should consume a tiny fraction of the CPU of the scheduler (less than 5% at idle) when scraped
every 15s. The endpoint should return quickly (tens of milliseconds at a P99) when O(pods) is below 10,000. CPU and
latency should be proportional to the number of pods only, as with the rest of the scheduler, and the metrics endpoint
should scale linearly with that factor. A rough way to sanity-check this is sketched below.

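The following is a minimal, illustrative Go sketch (not part of the KEP) for timing repeated scrapes against the latency expectation above. The default endpoint URL, port, and token path are assumptions, as in the earlier example, and with only a handful of samples the maximum is used as a crude stand-in for a high percentile.

```go
package main

import (
	"crypto/tls"
	"flag"
	"fmt"
	"io"
	"net/http"
	"os"
	"sort"
	"strings"
	"time"
)

func main() {
	// Defaults are assumptions for illustration: the scheduler's default
	// secure port and the in-cluster service account token path.
	url := flag.String("url", "https://127.0.0.1:10259/metrics/resources", "metrics endpoint to scrape")
	tokenPath := flag.String("token", "/var/run/secrets/kubernetes.io/serviceaccount/token", "bearer token file")
	samples := flag.Int("samples", 20, "number of scrapes to time")
	interval := flag.Duration("interval", 15*time.Second, "time between scrapes")
	flag.Parse()

	token, err := os.ReadFile(*tokenPath)
	if err != nil {
		panic(err)
	}
	client := &http.Client{Transport: &http.Transport{
		// Skipping TLS verification for brevity only.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	var durations []time.Duration
	for i := 0; i < *samples; i++ {
		req, err := http.NewRequest("GET", *url, nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))
		start := time.Now()
		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		// Read the full response so the timing includes serializing the body.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		durations = append(durations, time.Since(start))
		time.Sleep(*interval)
	}

	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	fmt.Printf("median=%v max=%v\n", durations[len(durations)/2], durations[len(durations)-1])
}
```
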
* **Are there any missing metrics that would be useful to have to improve observability
  of this feature?**

No.

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**

  - Scheduler
    - Hosts the metrics
  - Metrics collector
    - Scrapes the endpoint
    - May run on or off cluster

### Scalability

* **Will enabling / using this feature result in any new API calls?**

No, this pulls directly from the scheduler's informer cache, as illustrated in the sketch below.

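To illustrate why no new API calls are made on the scrape path, the sketch below (illustrative only, not the scheduler's actual implementation) serves a trivial handler out of a shared informer's local cache: the apiserver is contacted only by the informer's existing watch, not per scrape. The kubeconfig path and listen address are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; in the real scheduler the client and
	// informers already exist as part of normal operation.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podLister := factory.Core().V1().Pods().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Each request reads only from the informer's local cache; no call is
	// made to the apiserver when the endpoint is scraped.
	http.HandleFunc("/metrics/resources", func(w http.ResponseWriter, r *http.Request) {
		pods, err := podLister.List(labels.Everything())
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// A real implementation would aggregate resource requests/limits here.
		fmt.Fprintf(w, "# %d pods in cache\n", len(pods))
	})
	panic(http.ListenAndServe("127.0.0.1:9999", nil))
}
```
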
* **Will enabling / using this feature result in introducing new API types?**

No.

* **Will enabling / using this feature result in any new calls to the cloud
  provider?**

No.

* **Will enabling / using this feature result in increasing size or count of
  the existing API objects?**

No.

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs]?**

The CPU usage of this feature when activated should have a negligible effect on
scheduler throughput and latency. No additional memory usage is expected.

* **Will enabling / using this feature result in non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**

Negligible CPU use is expected, along with some increase in network transmission when the scheduler
is scraped.

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.

* **How does this feature react if the API server and/or etcd is unavailable?**

It returns metrics computed from the last set of data received by the scheduler, or no
metrics if the scheduler has been restarted since being partitioned from the API server.

* **What are other known failure modes?**

  - Panic due to an unexpected code path or incomplete API objects returned in a watch
    - Detection: The scrape of the component should fail
    - Mitigations: Stop scraping the endpoint
    - Diagnostics: Panic messages in the scheduler logs
    - Testing: We do not inject fake panics because the behavior of metrics endpoints is well known and there is no background processing.

* **What steps should be taken if SLOs are not being met to determine the problem?**

Take a Go CPU profile of the scheduler and assess the percentage of CPU charged to the functions
that generate these metrics. If they exceed 5% of total usage, identify which methods are hotspots.
Look for unexpected allocations via a heap profile (the metrics endpoint should not generate many,
if any, allocations on the heap). One way to capture such a profile is sketched below.

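As a purely illustrative aid (not part of the KEP), the sketch below pulls a 30-second CPU profile from the scheduler and writes it to a file for inspection with `go tool pprof`. It assumes profiling is enabled on the scheduler and reuses the same placeholder address, port, and token path as the earlier examples; adjust all of these for your environment.

```go
package main

import (
	"crypto/tls"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Assumptions for illustration: profiling is enabled on the scheduler,
	// it serves on the default secure port, and a service account token is
	// available at the conventional path.
	const profileURL = "https://127.0.0.1:10259/debug/pprof/profile?seconds=30"
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	client := &http.Client{Transport: &http.Transport{
		// Skipping TLS verification for brevity only.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	req, err := http.NewRequest("GET", profileURL, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("scheduler-cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	// Stream the raw profile to disk; inspect it afterwards with:
	//   go tool pprof scheduler-cpu.pprof
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```
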
## Implementation History

* 2020/04/07 - [Prototyped](https://github.com/openshift/openshift-controller-manager/pull/90) in OpenShift after receiving feedback that resource metrics were opaque and difficult to alert on
* 2020/04/21 - Discussed in sig-instrumentation and decided to move forward as a KEP
* 2020/07/30 - KEP draft
* 2020/11/12 - Merged implementation https://github.com/kubernetes/kubernetes/pull/94866 for 1.20 Alpha

<!--
Major milestones in the life cycle of a KEP should be tracked in this section.