| 1 | +# Disable CAdvisor Json Metrics |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Release Signoff Checklist](#release-signoff-checklist) |
| 5 | +- [Summary](#summary) |
| 6 | +- [Motivation](#motivation) |
| 7 | + - [Goals](#goals) |
| 8 | + - [Non-Goals](#non-goals) |
| 9 | +- [Proposal](#proposal) |
| 10 | + - [Risks and Mitigations](#risks-and-mitigations) |
| 11 | +- [Design Details](#design-details) |
| 12 | + - [Test Plan](#test-plan) |
| 13 | + - [Graduation Criteria](#graduation-criteria) |
| 14 | + - [GA Graduation](#ga-graduation) |
| 15 | + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) |
| 16 | + - [Version Skew Strategy](#version-skew-strategy) |
| 17 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 18 | + - [Feature enablement and rollback](#feature-enablement-and-rollback) |
| 19 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 20 | + - [Monitoring requirements](#monitoring-requirements) |
| 21 | + - [Dependencies](#dependencies) |
| 22 | + - [Scalability](#scalability) |
| 23 | + - [Troubleshooting](#troubleshooting) |
| 24 | +- [Implementation History](#implementation-history) |
| 25 | +- [Drawbacks](#drawbacks) |
| 26 | +- [Alternatives](#alternatives) |
| 27 | +<!-- /toc --> |
| 28 | + |
| 29 | +## Release Signoff Checklist |
| 30 | + |
| 31 | +- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/1867) |
| 32 | +- [ ] (R) KEP approvers have approved the KEP status as `implementable` |
| 33 | +- [X] (R) Design details are appropriately documented |
| 34 | +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
| 35 | +- [X] (R) Graduation criteria is in place |
| 36 | +- [ ] (R) Production readiness review completed |
| 37 | +- [ ] Production readiness review approved |
| 38 | +- [ ] "Implementation History" section is up-to-date for milestone |
| 39 | +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 40 | +- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 41 | + |
| 42 | +[kubernetes.io]: https://kubernetes.io/ |
| 43 | +[kubernetes/enhancements]: https://git.k8s.io/enhancements |
| 44 | +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes |
| 45 | +[kubernetes/website]: https://git.k8s.io/website |
| 46 | + |
| 47 | +## Summary |
| 48 | + |
| 49 | +This KEP outlines the process for deprecating and removing the cAdvisor JSON metrics endpoints exposed by the kubelet. |
| 50 | + |
| 51 | +This is one step towards removing cAdvisor APIs from the kubelet, which has been a long-time goal of sig-node. cAdvisor only supports Linux, and only supports a small set of container runtimes. For that reason, sig-node has historically wanted to allow vendors to replace cAdvisor without paying a double-collection performance penalty. This hasn't been achieved yet, despite some incremental progress. This KEP is another small step towards that goal. |
| 52 | + |
| 53 | +Note that cAdvisor endpoints are not believed to be widely used. They were entirely broken for multiple releases before someone reported it: [kubernetes/kubernetes#62544](https://github.com/kubernetes/kubernetes/pull/62544). |
| 54 | + |
| 55 | +This was originally part of the ["metrics overhaul" KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20181106-kubernetes-metrics-overhaul.md), but was [removed](https://github.com/kubernetes/enhancements/pull/1935) because it had not been completed. There were also [concerns raised](https://github.com/kubernetes/kubernetes/issues/68522#issuecomment-645454818) about removing these metrics. |
| 56 | + |
| 57 | +cAdvisor JSON metrics were disabled by default starting in 1.18, and can be re-enabled by setting `--enable-cadvisor-json-endpoints` to true. However, given the concerns about removing the endpoints, we want to revisit the deprecation and removal of these metrics before removing them permanently. |
| 58 | + |
| 59 | +## Motivation |
| 60 | + |
| 61 | +### Goals |
| 62 | + |
| 63 | +* Remove cAdvisor v1 ContainerInfo JSON metrics (`/stats/container`, `/stats/<podname>/<containername>`, `/stats/<namespace>/<podname>/<poduid>/<containername>`) from the kubelet. |
| 64 | +* Remove cAdvisor v1 MachineInfo JSON metrics (`/spec`) from the kubelet. |
| 65 | + |
| 66 | +### Non-Goals |
| 67 | + |
| 68 | +* Remove or modify cAdvisor Prometheus metrics from the kubelet (`/metrics/cadvisor`). |
| 69 | +* Remove or modify the Summary API. |
| 70 | +* Eliminate the kubelet's dependence on cAdvisor for metrics to supply the Summary API. |
| 71 | + |
| 72 | +## Proposal |
| 73 | + |
| 74 | +### Risks and Mitigations |
| 75 | + |
| 76 | +The main risk is breaking existing consumers of the cAdvisor JSON metrics. |
| 77 | + |
| 78 | +There are a few migration paths for users: |
| 79 | + |
| 80 | +For all metrics: Run cAdvisor as a daemonset. See the [Instructions for running cAdvisor as a daemonset](https://github.com/google/cadvisor/tree/master/deploy/kubernetes#cadvisor-kubernetes-daemonset). |
| 81 | +* Pros: Provides the exact same APIs. |
| 82 | +* Cons: Can be expensive to run another instance of cAdvisor. |
| 83 | + |
| 84 | +For container metrics: Use an alternative kubelet endpoint. Container metrics are available in `/metrics/resource`, `/metrics/cadvisor`, and `/stats/summary`. |
| 85 | +* Pros: No additional cost. |
| 86 | +* Cons: Metrics are in a different format and may not contain all information available in the JSON endpoints. |
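
As a rough illustration of the alternative-endpoint path, the sketch below pulls per-container memory out of a `/stats/summary` response. The sample payload is hand-written and heavily trimmed, and the helper name `container_working_sets` is hypothetical. It assumes the Summary API's `pods[].podRef` and `pods[].containers[].memory.workingSetBytes` fields, which should be checked against the kubelet version in use.

```python
import json

# Hand-written, heavily trimmed sample of a /stats/summary response body.
# Real responses carry many more fields; the values here are made up.
SAMPLE_SUMMARY = """
{
  "node": {"nodeName": "node-1"},
  "pods": [
    {
      "podRef": {"name": "web-0", "namespace": "default", "uid": "1234"},
      "containers": [
        {"name": "app", "memory": {"workingSetBytes": 1048576}},
        {"name": "sidecar", "memory": {"workingSetBytes": 524288}}
      ]
    },
    {
      "podRef": {"name": "job-1", "namespace": "batch", "uid": "5678"},
      "containers": [{"name": "worker"}]
    }
  ]
}
"""


def container_working_sets(summary):
    """Map (namespace, pod, container) to workingSetBytes.

    Containers without memory stats are skipped rather than reported as 0.
    """
    result = {}
    for pod in summary.get("pods", []):
        ref = pod.get("podRef", {})
        for container in pod.get("containers", []):
            working_set = container.get("memory", {}).get("workingSetBytes")
            if working_set is None:
                continue
            key = (ref.get("namespace"), ref.get("name"), container.get("name"))
            result[key] = working_set
    return result


working_sets = container_working_sets(json.loads(SAMPLE_SUMMARY))
for (namespace, pod, container), value in sorted(working_sets.items()):
    print(f"{namespace}/{pod}/{container}: {value} bytes")
```

A consumer of the old `/stats/...` JSON endpoints would need a comparable mapping for each field it reads, since the Summary API nests data by pod and container rather than returning cAdvisor `ContainerInfo` objects.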
| 87 | + |
| 88 | +For machine metrics: Use the Prometheus node exporter. |
| 89 | +* Pros: Community-supported and widely used machine-level monitoring tool. Easy-to-use configuration to enable/disable metrics. |
| 90 | +* Cons: Metrics are in a different format, and may not have the same set of information. |
| 91 | + |
| 92 | +For most metrics: Run a Prometheus server and collect metrics from the `/metrics/cadvisor` endpoint. |
| 93 | +If you want to access the metrics in JSON format, you can use the [Prometheus server's HTTP API](https://prometheus.io/docs/prometheus/latest/querying/api/). |
| 94 | +* Pros: Similar metrics in JSON format. |
| 95 | +* Cons: The JSON structure of the metrics is different, and it requires running a Prometheus server. |
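
To give a sense of what the Prometheus path looks like for a JSON consumer, here is a minimal sketch that builds an instant-query URL for the Prometheus HTTP API and parses the response. The server address, the helper names, and the sample response are all hypothetical. The sketch also assumes the server scrapes `/metrics/cadvisor` and that the metric `container_memory_working_set_bytes` carries `namespace`, `pod`, and `container` labels in your setup.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with your own server.
PROMETHEUS_URL = "http://prometheus.example:9090"


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


def parse_instant_vector(body):
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each sample is {"metric": {labels}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Hand-written sample response; the label values and numbers are made up.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "container_memory_working_set_bytes",
          "namespace": "default",
          "pod": "web-0",
          "container": "app"
        },
        "value": [1612396800.0, "1048576"]
      }
    ]
  }
}
"""

url = build_query_url(
    PROMETHEUS_URL, 'container_memory_working_set_bytes{namespace="default"}'
)
samples = parse_instant_vector(SAMPLE_RESPONSE)
for labels, value in samples:
    print(labels["pod"], labels["container"], value)
```

The response groups samples by label set, which is the structural difference from the cAdvisor JSON endpoints noted in the cons above.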
| 96 | + |
| 97 | +## Design Details |
| 98 | + |
| 99 | +Remove the `--enable-cadvisor-json-endpoints` flag, so that the kubelet no longer serves the paths listed in the Goals section. |
| 100 | + |
| 101 | +### Test Plan |
| 102 | + |
| 103 | +* This will not have any e2e testing. |
| 104 | +* There are no existing Kubernetes e2e tests which check these endpoints. |
| 105 | + |
| 106 | +### Graduation Criteria |
| 107 | + |
| 108 | +#### GA Graduation |
| 109 | + |
| 110 | +* The deprecated flag and relevant code have been removed. |
| 111 | +* We are moving directly to stable, as the endpoints have already been marked deprecated for at least 2 releases. |
| 112 | + |
| 113 | +### Upgrade / Downgrade Strategy |
| 114 | + |
| 115 | +- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to keep previous behavior? N/A. |
| 116 | +- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to make use of the enhancement? N/A. |
| 117 | + |
| 118 | +### Version Skew Strategy |
| 119 | + |
| 120 | +- Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used? N/A. |
| 121 | +- Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet. N/A. |
| 122 | + |
| 123 | +## Production Readiness Review Questionnaire |
| 124 | + |
| 125 | +### Feature enablement and rollback |
| 126 | + |
| 127 | +* **How can this feature be enabled / disabled in a live cluster?** |
| 128 | + As of 1.18, these metrics can be re-enabled using `--enable-cadvisor-json-endpoints=true`. After this KEP, it will not be possible to re-enable these metrics. |
| 129 | + |
| 130 | +* **Does enabling the feature change any default behavior?** No. These metrics are already disabled by default. |
| 131 | +* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?** No. Today the metrics can be re-enabled using the flag, but once this KEP is implemented the flag is removed and it will no longer be possible to do so. |
| 132 | + |
| 133 | +* **What happens if we reenable the feature if it was previously rolled back?** Metrics are collected again. |
| 134 | +* **Are there any tests for feature enablement/disablement?** No. |
| 135 | + |
| 136 | +### Rollout, Upgrade and Rollback Planning |
| 137 | + |
| 138 | +* **How can a rollout fail? Can it impact already running workloads?** N/A. |
| 139 | +* **What specific metrics should inform a rollback?** N/A. |
| 140 | +* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** Not yet, probably N/A. |
| 141 | +* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** Yes. |
| 142 | + |
| 143 | +### Monitoring requirements |
| 144 | + |
| 145 | +* **How can an operator determine if the feature is in use by workloads?** N/A. |
| 146 | +* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?** N/A. |
| 147 | +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A. |
| 148 | +* **Are there any missing metrics that would be useful to have to improve observability of this feature?** N/A. |
| 149 | + |
| 150 | +### Dependencies |
| 151 | + |
| 152 | +* **Does this feature depend on any specific services running in the cluster?** N/A. |
| 153 | + |
| 154 | +### Scalability |
| 155 | + |
| 156 | +* **Will enabling / using this feature result in any new API calls?** No. |
| 157 | +* **Will enabling / using this feature result in introducing new API types?** No. |
| 158 | +* **Will enabling / using this feature result in any new calls to cloud provider?** No. |
| 159 | +* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No. |
| 160 | +* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?** No. |
| 161 | +* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** No. |
| 162 | + |
| 163 | +### Troubleshooting |
| 164 | + |
| 165 | +* **How does this feature react if the API server and/or etcd is unavailable?** N/A. |
| 166 | +* **What are other known failure modes?** N/A. |
| 167 | +* **What steps should be taken if SLOs are not being met to determine the problem?** N/A. |
| 168 | + |
| 169 | +[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md |
| 170 | +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos |
| 171 | + |
| 172 | +## Implementation History |
| 173 | + |
| 174 | +- 2020-10-04: Initial version of the KEP |
| 175 | +- 2021-02-04: Updates based on feedback |
| 176 | + |
| 177 | +## Drawbacks |
| 178 | + |
| 179 | +This change is likely to break existing consumers of the cAdvisor JSON endpoints. |
| 180 | + |
| 181 | +## Alternatives |
| 182 | + |
| 183 | +Keep the cAdvisor JSON endpoints. Either plan to remove them at a later date, or plan to keep cAdvisor in the kubelet. |