
Commit 9fe25b2

Merge pull request kubernetes#2130 from dashpole/remove_cadvisor_json
Add KEP for removing cAdvisor json metrics
2 parents 6576376 + 0c4ea16 commit 9fe25b2

3 files changed: +220 -0 lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 2129
2+
stable:
3+
approver: "@johnbelamaric"
Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
1+
# Disable cAdvisor json Metrics
2+
3+
<!-- toc -->
4+
- [Release Signoff Checklist](#release-signoff-checklist)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Goals](#goals)
8+
- [Non-Goals](#non-goals)
9+
- [Proposal](#proposal)
10+
- [Risks and Mitigations](#risks-and-mitigations)
11+
- [Design Details](#design-details)
12+
- [Test Plan](#test-plan)
13+
- [Graduation Criteria](#graduation-criteria)
14+
- [GA Graduation](#ga-graduation)
15+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
16+
- [Version Skew Strategy](#version-skew-strategy)
17+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
18+
- [Feature enablement and rollback](#feature-enablement-and-rollback)
19+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
20+
- [Monitoring requirements](#monitoring-requirements)
21+
- [Dependencies](#dependencies)
22+
- [Scalability](#scalability)
23+
- [Troubleshooting](#troubleshooting)
24+
- [Implementation History](#implementation-history)
25+
- [Drawbacks](#drawbacks)
26+
- [Alternatives](#alternatives)
27+
<!-- /toc -->
28+
29+
## Release Signoff Checklist
30+
31+
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/1867)
32+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
33+
- [X] (R) Design details are appropriately documented
34+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
35+
- [X] (R) Graduation criteria is in place
36+
- [ ] (R) Production readiness review completed
37+
- [ ] Production readiness review approved
38+
- [ ] "Implementation History" section is up-to-date for milestone
39+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
40+
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
41+
42+
[kubernetes.io]: https://kubernetes.io/
43+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
44+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
45+
[kubernetes/website]: https://git.k8s.io/website
46+
47+
## Summary
48+
49+
This KEP outlines the process to deprecate and remove the cAdvisor json metrics endpoints served by the kubelet.
50+
51+
This is one step towards removing cAdvisor APIs from the kubelet, which has been a long-standing goal of sig-node. cAdvisor only supports Linux, and only supports a small set of container runtimes. For that reason, sig-node has historically wanted to allow vendors to replace cAdvisor without paying a double-collection performance penalty. This has not been achieved yet, despite some incremental progress. This KEP is another small step towards that goal.
52+
53+
Note that cAdvisor endpoints are not believed to be widely used. They were entirely broken for multiple releases before someone reported it: [kubernetes/kubernetes#62544](https://github.com/kubernetes/kubernetes/pull/62544).
54+
55+
This was originally part of the ["metrics overhaul" KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20181106-kubernetes-metrics-overhaul.md), but was [removed](https://github.com/kubernetes/enhancements/pull/1935) because it had not been completed. There were also [concerns raised](https://github.com/kubernetes/kubernetes/issues/68522#issuecomment-645454818) about removing these metrics.
56+
57+
cAdvisor json metrics were disabled by default starting in 1.18, and could be re-enabled by setting `--enable-cadvisor-json-endpoints` to true. However, given the concerns about removing the endpoints, we want to revisit the deprecation and removal of these metrics before removing them permanently.
58+
59+
## Motivation
60+
61+
### Goals
62+
63+
* Remove cAdvisor v1 ContainerInfo json metrics (`/stats/container`, `/stats/<podname>/<containername>`, `/stats/<namespace>/<podname>/<poduid>/<containername>`) from the kubelet.
64+
* Remove cAdvisor v1 MachineInfo json metrics (`/spec`) from the kubelet.
65+
66+
### Non-Goals
67+
68+
* Remove or modify cAdvisor Prometheus metrics from the kubelet (`/metrics/cadvisor`).
69+
* Remove or modify the Summary API.
70+
* Eliminate the kubelet's dependence on cAdvisor for metrics to supply the Summary API.
71+
72+
## Proposal
73+
74+
### Risks and Mitigations
75+
76+
The main risk that we face is breaking existing consumers of the cAdvisor json metrics.
77+
78+
There are a few migration paths for users:
79+
80+
For all metrics: Run cAdvisor as a daemonset. See the [Instructions for running cAdvisor as a daemonset](https://github.com/google/cadvisor/tree/master/deploy/kubernetes#cadvisor-kubernetes-daemonset).
81+
* Pros: Provides the exact same APIs.
82+
* Cons: Can be expensive to run another instance of cAdvisor.
83+
84+
For container metrics: Use an alternative kubelet endpoint. Container metrics are available in `/metrics/resource`, `/metrics/cadvisor`, and `/stats/summary`.
85+
* Pros: No additional cost.
86+
* Cons: Metrics are in a different format and may not contain all information available in the json endpoints.
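
As a rough illustration of the `/stats/summary` migration path, the sketch below pulls per-container CPU and memory usage out of an already-parsed Summary API response. The field names (`pods`, `podRef`, `containers`, `usageNanoCores`, `workingSetBytes`) are assumed from the kubelet Summary API types and should be checked against the kubelet version in use; the sample payload is hypothetical and heavily trimmed.

```python
from typing import Any, Dict, Iterator, Tuple

# (namespace, pod, container, cpu_nanocores, memory_working_set_bytes)
UsageRow = Tuple[str, str, str, int, int]


def container_usage(summary: Dict[str, Any]) -> Iterator[UsageRow]:
    """Yield one usage row per container found in a parsed /stats/summary body.

    Field names are assumptions based on the Summary API; missing fields
    fall back to empty strings or zero instead of raising.
    """
    for pod in summary.get("pods", []):
        ref = pod.get("podRef", {})
        for container in pod.get("containers", []):
            cpu = container.get("cpu", {})
            memory = container.get("memory", {})
            yield (
                ref.get("namespace", ""),
                ref.get("name", ""),
                container.get("name", ""),
                int(cpu.get("usageNanoCores", 0)),
                int(memory.get("workingSetBytes", 0)),
            )


# Hypothetical, trimmed payload; a real response carries many more fields.
SAMPLE_SUMMARY = {
    "pods": [
        {
            "podRef": {"name": "web-0", "namespace": "default", "uid": "hypothetical-uid"},
            "containers": [
                {
                    "name": "nginx",
                    "cpu": {"usageNanoCores": 1500000},
                    "memory": {"workingSetBytes": 20971520},
                }
            ],
        }
    ]
}

rows = list(container_usage(SAMPLE_SUMMARY))
```

A consumer that previously read ContainerInfo from `/stats/container` would still need to map these fields onto whatever it reported before, since the two formats do not line up one-to-one.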
87+
88+
For machine metrics: Use the Prometheus node exporter.
89+
* Pros: Community-supported and widely used machine-level monitoring tool. Easy-to-use configuration to enable/disable metrics.
90+
* Cons: Metrics are in a different format, and may not have the same set of information.
91+
92+
For most metrics: Run a Prometheus server, and collect metrics from the `/metrics/cadvisor` endpoint.
93+
If you want to be able to access metrics in JSON format, you can use the [Prometheus server's HTTP API](https://prometheus.io/docs/prometheus/latest/querying/api/).
94+
* Pros: Similar metrics in JSON format
95+
* Cons: The JSON structure of metrics is different, and it requires running a Prometheus server.
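
To make the Prometheus option more concrete, here is a minimal sketch of building an instant-query URL for the Prometheus HTTP API and flattening the JSON response into plain rows. The server address and sample response are hypothetical; `container_cpu_usage_seconds_total` is used as an example of a metric scraped from `/metrics/cadvisor`. No network call is made, so the HTTP client and authentication are left to the reader.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def flatten_vector(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn an instant-vector response into a list of {labels, timestamp, value}."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response.get("data", {})
    if data.get("resultType") != "vector":
        raise ValueError(f"expected a vector result, got {data.get('resultType')!r}")
    rows = []
    for sample in data.get("result", []):
        timestamp, value = sample["value"]  # value arrives as a string
        rows.append(
            {
                "labels": dict(sample["metric"]),
                "timestamp": float(timestamp),
                "value": float(value),
            }
        )
    return rows


# Hypothetical server address and response body, for illustration only.
url = build_query_url(
    "http://prometheus.example:9090/",
    "rate(container_cpu_usage_seconds_total[5m])",
)
SAMPLE_RESPONSE = json.loads(
    '{"status": "success", "data": {"resultType": "vector", "result": ['
    '{"metric": {"pod": "web-0", "container": "nginx"},'
    ' "value": [1612400000.0, "0.25"]}]}}'
)
samples = flatten_vector(SAMPLE_RESPONSE)
```

The resulting rows are JSON-serializable, but their shape differs from the cAdvisor v1 ContainerInfo structure, which is the cost noted in the cons above.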
96+
97+
## Design Details
98+
99+
Remove the `--enable-cadvisor-json-endpoints` flag, so that the kubelet stops serving the paths listed in the Goals section.
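
Because the flag itself goes away, operators may want to check node configuration for it before upgrading. The sketch below scans a kubelet argument list for `--enable-cadvisor-json-endpoints`. It assumes the arguments are already available as a list of strings, and it assumes that a kubelet started with a flag it no longer recognizes would fail at startup, which should be verified for the target release. The sample arguments are hypothetical.

```python
from typing import List, Sequence

REMOVED_FLAG = "--enable-cadvisor-json-endpoints"


def find_removed_flag(args: Sequence[str]) -> List[str]:
    """Return every argument that sets the removed flag, in either the
    bare form (--flag) or the assignment form (--flag=value)."""
    return [
        arg
        for arg in args
        if arg == REMOVED_FLAG or arg.startswith(REMOVED_FLAG + "=")
    ]


# Hypothetical kubelet command line, for illustration only.
SAMPLE_ARGS = [
    "kubelet",
    "--config=/var/lib/kubelet/config.yaml",
    "--enable-cadvisor-json-endpoints=true",
]

offending = find_removed_flag(SAMPLE_ARGS)
```

Any non-empty result points at configuration that needs to drop the flag, and at a consumer that likely still depends on the json endpoints and needs one of the migration paths above.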
100+
101+
### Test Plan
102+
103+
* This will not have any e2e testing.
104+
* There are no existing Kubernetes e2e tests that check these endpoints.
105+
106+
### Graduation Criteria
107+
108+
#### GA Graduation
109+
110+
* The deprecated flag and relevant code have been removed.
111+
* We are moving directly to stable, as the endpoints have already been marked deprecated for at least 2 releases.
112+
113+
### Upgrade / Downgrade Strategy
114+
115+
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to keep previous behavior? N/A.
116+
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to make use of the enhancement? N/A.
117+
118+
### Version Skew Strategy
119+
120+
- Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used? N/A.
121+
- Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet. N/A.
122+
123+
## Production Readiness Review Questionnaire
124+
125+
### Feature enablement and rollback
126+
127+
* **How can this feature be enabled / disabled in a live cluster?**
128+
As of 1.18, these metrics can be re-enabled using `--enable-cadvisor-json-endpoints=true`. Once this KEP is implemented, it will not be possible to re-enable these metrics.
129+
130+
* **Does enabling the feature change any default behavior?** No. These metrics are already disabled by default.
131+
* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?** No. Today the metrics can be re-enabled using the flag, but once this "feature" lands, it will no longer be possible to do so.
132+
133+
* **What happens if we reenable the feature if it was previously rolled back?** Metrics are collected again.
134+
* **Are there any tests for feature enablement/disablement?** No.
135+
136+
### Rollout, Upgrade and Rollback Planning
137+
138+
* **How can a rollout fail? Can it impact already running workloads?** N/A.
139+
* **What specific metrics should inform a rollback?** N/A.
140+
* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** Not yet, probably N/A.
141+
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** Yes.
142+
143+
### Monitoring requirements
144+
145+
* **How can an operator determine if the feature is in use by workloads?** N/A.
146+
* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?** N/A.
147+
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A.
148+
* **Are there any missing metrics that would be useful to have to improve observability of this feature?** N/A.
149+
150+
### Dependencies
151+
152+
* **Does this feature depend on any specific services running in the cluster?** N/A.
153+
154+
### Scalability
155+
156+
* **Will enabling / using this feature result in any new API calls?** No.
157+
* **Will enabling / using this feature result in introducing new API types?** No.
158+
* **Will enabling / using this feature result in any new calls to cloud provider?** No.
159+
* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No.
160+
* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?** No.
161+
* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** No.
162+
163+
### Troubleshooting
164+
165+
* **How does this feature react if the API server and/or etcd is unavailable?** N/A.
166+
* **What are other known failure modes?** N/A.
167+
* **What steps should be taken if SLOs are not being met to determine the problem?** N/A.
168+
169+
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
170+
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
171+
172+
## Implementation History
173+
174+
- 2020-10-04: Initial version of the KEP
175+
- 2021-02-04: Updates based on feedback
176+
177+
## Drawbacks
178+
179+
This change is likely to break existing consumers of the cAdvisor json endpoints.
180+
181+
## Alternatives
182+
183+
Keep the cAdvisor json endpoints. Either plan to remove them at a later date, or plan to keep cAdvisor in the kubelet.
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
1+
title: Disable cAdvisor json Metrics
2+
kep-number: 2129
3+
authors:
4+
- "@dashpole"
5+
owning-sig: sig-node
6+
participating-sigs: [sig-instrumentation]
7+
status: implementable
8+
creation-date: 2020-10-04
9+
reviewers:
10+
- "derekwaynecarr"
11+
- "dawnchen"
12+
approvers:
13+
- "derekwaynecarr"
14+
- "dawnchen"
15+
prr-approvers:
16+
- "@johnbelamaric"
17+
replaces: []
18+
19+
# The target maturity stage in the current dev cycle for this KEP.
20+
stage: stable
21+
22+
# The most recent milestone for which work toward delivery of this KEP has been
23+
# done. This can be the current (upcoming) milestone, if it is being actively
24+
# worked on.
25+
latest-milestone: "v1.21"
26+
27+
# The milestone at which this feature was, or is targeted to be, at each stage.
28+
milestone:
29+
alpha: "v1.21"
30+
beta: "v1.21"
31+
stable: "v1.21"
32+
33+
metrics:
34+
- "N/A"
