Commit a9c06f9

author
Han Kang
committed
use more recent kep template
1 parent 9f576e4 commit a9c06f9

2 files changed

+284
-35
lines changed

keps/sig-instrumentation/3498-extending-stability/README.md

Lines changed: 252 additions & 18 deletions
@@ -1,4 +1,80 @@
1-
## Table of Contents
1+
<!--
2+
**Note:** When your KEP is complete, all of these comment blocks should be removed.
3+
4+
To get started with this template:
5+
6+
- [ ] **Pick a hosting SIG.**
7+
Make sure that the problem space is something the SIG is interested in taking
8+
up. KEPs should not be checked in without a sponsoring SIG.
9+
- [ ] **Create an issue in kubernetes/enhancements**
10+
When filing an enhancement tracking issue, please make sure to complete all
11+
fields in that template. One of the fields asks for a link to the KEP. You
12+
can leave that blank until this KEP is filed, and then go back to the
13+
enhancement and add the link.
14+
- [ ] **Make a copy of this template directory.**
15+
Copy this template into the owning SIG's directory and name it
16+
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
17+
leading-zero padding) assigned to your enhancement above.
18+
- [ ] **Fill out as much of the kep.yaml file as you can.**
19+
At minimum, you should fill in the "Title", "Authors", "Owning-sig",
20+
"Status", and date-related fields.
21+
- [ ] **Fill out this file as best you can.**
22+
At minimum, you should fill in the "Summary" and "Motivation" sections.
23+
These should be easy if you've preflighted the idea of the KEP with the
24+
appropriate SIG(s).
25+
- [ ] **Create a PR for this KEP.**
26+
Assign it to people in the SIG who are sponsoring this process.
27+
- [ ] **Merge early and iterate.**
28+
Avoid getting hung up on specific details and instead aim to get the goals of
29+
the KEP clarified and merged quickly. The best way to do this is to just
30+
start with the high-level sections and fill out details incrementally in
31+
subsequent PRs.
32+
33+
Just because a KEP is merged does not mean it is complete or approved. Any KEP
34+
marked as `provisional` is a working document and subject to change. You can
35+
denote sections that are under active debate as follows:
36+
37+
```
38+
<<[UNRESOLVED optional short context or usernames ]>>
39+
Stuff that is being argued.
40+
<<[/UNRESOLVED]>>
41+
```
42+
43+
When editing KEPs, aim for tightly-scoped, single-topic PRs to keep discussions
44+
focused. If you disagree with what is already in a document, open a new PR
45+
with suggested changes.
46+
47+
One KEP corresponds to one "feature" or "enhancement" for its whole lifecycle.
48+
You do not need a new KEP to move from beta to GA, for example. If
49+
new details emerge that belong in the KEP, edit the KEP. Once a feature has become
50+
"implemented", major changes should get new KEPs.
51+
52+
The canonical place for the latest set of instructions (and the likely source
53+
of this file) is [here](/keps/NNNN-kep-template/README.md).
54+
55+
**Note:** Any PRs to move a KEP to `implementable`, or significant changes once
56+
it is marked `implementable`, must be approved by each of the KEP approvers.
57+
If none of those approvers are still appropriate, then changes to that list
58+
should be approved by the remaining approvers and/or the owning SIG (or
59+
SIG Architecture for cross-cutting KEPs).
60+
-->
61+
# KEP-NNNN: Extending Metrics Stability
62+
63+
<!--
64+
This is the title of your KEP. Keep it short, simple, and descriptive. A good
65+
title can help communicate what the KEP is and should be considered as part of
66+
any review.
67+
-->
68+
69+
<!--
70+
A table of contents is helpful for quickly jumping to sections of a KEP and for
71+
highlighting any additional information provided beyond the standard KEP
72+
template.
73+
74+
Ensure the TOC is wrapped with
75+
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
76+
tags, and then generate with `hack/update-toc.sh`.
77+
-->
278

379
<!-- toc -->
480
- [Release Signoff Checklist](#release-signoff-checklist)
@@ -10,28 +86,54 @@
1086
- [Risks and Mitigations](#risks-and-mitigations)
1187
- [Design Details](#design-details)
1288
- [Test Plan](#test-plan)
89+
- [Prerequisite testing updates](#prerequisite-testing-updates)
90+
- [Unit tests](#unit-tests)
91+
- [Integration tests](#integration-tests)
92+
- [e2e tests](#e2e-tests)
1393
- [Graduation Criteria](#graduation-criteria)
1494
- [Alpha](#alpha)
1595
- [Beta](#beta)
1696
- [GA](#ga)
97+
- [Deprecation](#deprecation)
98+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
99+
- [Version Skew Strategy](#version-skew-strategy)
17100
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
18101
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
102+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
19103
- [Monitoring Requirements](#monitoring-requirements)
104+
- [Dependencies](#dependencies)
105+
- [Scalability](#scalability)
106+
- [Troubleshooting](#troubleshooting)
20107
- [Implementation History](#implementation-history)
21108
- [Drawbacks](#drawbacks)
22109
- [Alternatives](#alternatives)
110+
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
23111
<!-- /toc -->
24112

25113
## Release Signoff Checklist
26114

115+
<!--
116+
**ACTION REQUIRED:** In order to merge code into a release, there must be an
117+
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
118+
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
119+
of the targeted release**.
120+
121+
For enhancements that make changes to code or processes/procedures in core
122+
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
123+
Signoff checklist to be completed.
124+
125+
Check these off as they are completed for the Release Team to track. These
126+
checklist items _must_ be updated for the enhancement to be released.
127+
-->
128+
27129
Items marked with (R) are required *prior to targeting to a milestone / release*.
28130

29131
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
30132
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
31133
- [ ] (R) Design details are appropriately documented
32134
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
33135
- [ ] e2e Tests for all Beta API Operations (endpoints)
34-
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
136+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
35137
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
36138
- [ ] (R) Graduation criteria is in place
37139
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
@@ -41,6 +143,9 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
41143
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
42144
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
43145

146+
<!--
147+
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
148+
-->
44149

45150
[kubernetes.io]: https://kubernetes.io/
46151
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -61,18 +166,15 @@ It's become more obvious recently that we need additional stability classes, par
61166

62167
Introduce two more metric classes: `beta`, corresponding to the `beta` stage of feature release, and `internal`, which corresponds to metrics related to internal development.
63168

64-
65169
### Non-Goals
66170

67171
- establishing whether specific metrics fall into a stability class; this exercise is left to component owners, who own their own metrics
68-
- establishing the guarantees of the `beta` stability class, this will be an exercise we will defer until the beta version of this KEP.
69-
70172

71173
## Proposal
72174

73175
We're proposing adding additional metadata fields to Kubernetes metrics. Specifically we want to add the following stability levels:
74176

75-
- `Internal` - representing internal usages of metrics (i.e. classes of metrics which do not correspond to features)
177+
- `Internal` - representing internal usages of metrics (i.e. classes of metrics which do not correspond to features) or low-level metrics that a typical operator would not understand (or would not be able to react to properly).
76178
- `Beta` - representing a more mature stage in a feature metric, with greater stability guarantees than alpha or internal metrics, but less than `Stable`
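
As a rough illustration of the levels described above, the Go sketch below models them as an ordered enumeration. The type `StabilityLevel`, the constant names, and the helper `StrongerThan` are assumptions made for this example and are not taken from the metrics framework. Ranking `Internal` and `Alpha` equally is also an assumption.

```go
package main

import "fmt"

// StabilityLevel is a hypothetical stand-in for a metric's stability metadata.
type StabilityLevel string

const (
	Internal StabilityLevel = "INTERNAL"
	Alpha    StabilityLevel = "ALPHA"
	Beta     StabilityLevel = "BETA"
	Stable   StabilityLevel = "STABLE"
)

// guaranteeRank orders levels by the strength of their stability guarantee.
// Internal and Alpha share a rank here (an assumption): neither is treated
// as carrying guarantees in this sketch.
var guaranteeRank = map[StabilityLevel]int{
	Internal: 0,
	Alpha:    0,
	Beta:     1,
	Stable:   2,
}

// StrongerThan reports whether level a carries stricter guarantees than b.
func StrongerThan(a, b StabilityLevel) bool {
	return guaranteeRank[a] > guaranteeRank[b]
}

func main() {
	fmt.Println("Beta stronger than Alpha:", StrongerThan(Beta, Alpha))
	fmt.Println("Beta stronger than Internal:", StrongerThan(Beta, Internal))
	fmt.Println("Beta stronger than Stable:", StrongerThan(Beta, Stable))
}
```

The ordering matches the description of `Beta` as sitting above alpha and internal metrics but below `Stable`.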
77179

78180
We also propose amending the semantic meaning of an `Alpha` metric such that it represents the nascent stage of a KEP-proposed feature, rather than the entire class of metrics without stability guarantees.
@@ -83,7 +185,6 @@ Additionally we propose forced upgrades of metrics stability classes in the simi
83185
### Risks and Mitigations
84186

85187
The primary risk is that these changes break our existing (and working) metrics infrastructure. The mitigation should be straightforward, i.e. roll back the changes to the metrics framework.
86-
87188
## Design Details
88189

89190
Our plan is to add functionality to our static analysis framework which is hosted in the main `k8s/k8s` repo, under `test/instrumentation`. Specifically, we will need to support:
@@ -107,6 +208,32 @@ As an aside, much of this work has already been done, but is stashed in a local
107208
We have static analysis testing for stable metrics; we will extend our test coverage
108209
to include metrics which are `ALPHA` and `BETA` while ignoring `INTERNAL` metrics.
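
The coverage rule above can be sketched as a simple filter. This is a minimal illustration, not the code under `test/instrumentation`: the `declaredMetric` type, the `coveredByAnalysis` function, and the metric names are all hypothetical.

```go
package main

import "fmt"

// declaredMetric is a hypothetical record of a parsed metric definition.
type declaredMetric struct {
	Name      string
	Stability string
}

// coveredByAnalysis returns the metrics the static analysis would track:
// STABLE (already covered) plus ALPHA and BETA. INTERNAL metrics, and any
// unrecognized level, are skipped.
func coveredByAnalysis(metrics []declaredMetric) []declaredMetric {
	var covered []declaredMetric
	for _, m := range metrics {
		switch m.Stability {
		case "ALPHA", "BETA", "STABLE":
			covered = append(covered, m)
		}
	}
	return covered
}

func main() {
	metrics := []declaredMetric{
		{"example_stable_total", "STABLE"},
		{"example_beta_seconds", "BETA"},
		{"example_alpha_total", "ALPHA"},
		{"example_internal_queue_depth", "INTERNAL"},
	}
	for _, m := range coveredByAnalysis(metrics) {
		fmt.Println(m.Name, m.Stability)
	}
}
```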
109210

211+
[ ] I/we understand the owners of the involved components may require updates to
212+
existing tests to make this code solid enough prior to committing the changes necessary
213+
to implement this enhancement.
214+
215+
##### Prerequisite testing updates
216+
217+
We already have thorough testing for the stability framework, which has been GA for years.
218+
219+
##### Unit tests
220+
221+
[ ] parsing variables
222+
[ ] multi-line strings
223+
[ ] evaluating buckets
224+
[ ] buckets which are defined via variables and consts
225+
[ ] evaluation of simple consts
226+
[ ] evaluation of simple variables
227+
228+
- `test/instrumentation`: `09/20/2022` - `full coverage of existing stability framework`
229+
230+
##### Integration tests
231+
232+
We will test the static analysis parser against a test directory containing all permutations of metrics that we expect to parse, plus variants that we expect the parser to reject.
233+
234+
##### e2e tests
235+
236+
The static analysis tooling runs in a precommit pipeline and is therefore exempt from runtime tests.
110237

111238
### Graduation Criteria
112239

@@ -121,53 +248,153 @@ to include metrics which are `ALPHA` and `BETA` while ignoring `INTERNAL` metric
121248

122249
- Kubernetes metrics framework will be enhanced to support marking `Alpha` and `Beta` metrics with a date. The semantics of this are yet to be determined. This date will be used to statically determine whether or not that metric should be automatically deprecated or promoted.
123250
- Kubernetes metrics framework will be enhanced with a script to auto-deprecate metrics which have passed their window of existence as an `Alpha` or `Beta` metric
124-
- It is at this point, we will determine the longevity rules for `Alpha` and `Beta` metrics
251+
- We will determine the semantics for `Alpha` and `Beta` metrics
125252
- The `beta` stage for this framework will be a few releases. During this time, we will evaluate the utility and the ergonomics of the framework, making adjustments as necessary
126253

127254
#### GA
128255

256+
- We will allow bake time before promoting this feature to GA
257+
- At this stage, we will promote our meta-metric for registered metrics to Stable
129258

130-
## Production Readiness Review Questionnaire
259+
#### Deprecation
260+
261+
- This section will cover the deprecation policy for `Alpha` and `Beta` metrics, which will be determined in the `Beta` version of this KEP.
131262

132-
During the `alpha` stage of this KEP, we will not be making any user facing changes, except marking metrics as `Internal` which were previously `Alpha`. The stability guarantees of `Internal` metrics is the same as `Alpha` currently and therefore there will not be any changes to what users can expect from the metrics they are using.
263+
264+
### Upgrade / Downgrade Strategy
265+
266+
The static analysis code does not run in Kubernetes runtime code, with the exception of the registered_metrics metric.
267+
268+
### Version Skew Strategy
269+
270+
This feature does not require a version skew strategy.
271+
272+
## Production Readiness Review Questionnaire
133273

134274
### Feature Enablement and Rollback
135275

136-
We can revert our changes if it breaks the metrics framework. But we will be adding testing coverage as we enhance the framework, so it is unlikely that this will need to occur.
276+
This feature cannot be enabled or rolled back. It is built into the infrastructure of metrics, which will support two additional values in the enumeration of metric stability classes.
137277

138278
###### How can this feature be enabled / disabled in a live cluster?
139279

140-
Metrics stability framework is an internal feature of Kubernetes.
280+
It cannot. This is purely infrastructure based and requires adding additional enumeration values to metrics stability classes.
141281

142282
###### Does enabling the feature change any default behavior?
143283

144-
No.
284+
It will cause metrics previously annotated as `Alpha` metrics to be denoted as `Internal`.
145285

146286
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
147287

148-
Yes, this can be rolled back.
288+
No.
149289

150290
###### What happens if we reenable the feature if it was previously rolled back?
151291

152-
Metrics will be annotated with `Internal` instead of `Alpha` and vice versa.
292+
N/A
153293

154294
###### Are there any tests for feature enablement/disablement?
155295

156-
No.
296+
No.
297+
298+
### Rollout, Upgrade and Rollback Planning
299+
300+
###### How can a rollout or rollback fail? Can it impact already running workloads?
301+
302+
This should not affect rollout. It could affect workloads that depended on `Alpha` metrics, which will be recategorized as `Internal`. But to be fair, we've already explicitly laid out the fact that `Alpha` metrics do not have stability guarantees.
303+
304+
###### What specific metrics should inform a rollback?
305+
306+
`registered_metrics_total` summing to zero.
307+
308+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
309+
310+
This should not affect upgrade/rollback paths.
157311

312+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
313+
314+
`Alpha` metrics will be recategorized as `Internal`.
158315

159316
### Monitoring Requirements
160317

161-
Well, this is a meta-monitoring improvement, so it's a strange thing to monitor. But I suppose we can add metrics around how many metrics are registered divided by stability-level and metric name.
318+
###### How can an operator determine if the feature is in use by workloads?
162319

320+
You can determine this by seeing if workloads depend on any Kubernetes control-plane metrics. If they do, they are using this feature.
163321

164322
###### How can someone using this feature know that it is working for their instance?
165323

166-
You will see metrics from your component.
324+
They will be able to see metrics.
325+
326+
327+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
328+
329+
This tooling runs in precommit. It does not affect runtime SLOs.
330+
331+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
332+
333+
N/A
334+
335+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
336+
337+
`registered_metrics_total` will be used to calculate the number of registered stable metrics.
338+
339+
### Dependencies
340+
341+
Prometheus and the Kubernetes metric framework.
342+
343+
###### Does this feature depend on any specific services running in the cluster?
344+
345+
In order to ingest these metrics, one needs a Prometheus scraping agent and some backend to persist the metric data.
346+
347+
### Scalability
348+
349+
###### Will enabling / using this feature result in any new API calls?
350+
351+
No.
352+
353+
###### Will enabling / using this feature result in introducing new API types?
167354

355+
No.
356+
357+
###### Will enabling / using this feature result in any new calls to the cloud provider?
358+
359+
No.
360+
361+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
362+
363+
No.
364+
365+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
366+
367+
No.
368+
369+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
370+
371+
No.
372+
373+
### Troubleshooting
374+
375+
376+
###### How does this feature react if the API server and/or etcd is unavailable?
377+
378+
The API server needs to be available for its metrics to be scraped. If etcd is not available, you may still be able to scrape metrics from the API server.
379+
380+
###### What are other known failure modes?
381+
382+
Runaway cardinality of metrics, but that is orthogonal to the scope of this KEP.
383+
384+
###### What steps should be taken if SLOs are not being met to determine the problem?
168385

169386
## Implementation History
170387

388+
<!--
389+
Major milestones in the lifecycle of a KEP should be tracked in this section.
390+
Major milestones might include:
391+
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
392+
- the `Proposal` section being merged, signaling agreement on a proposed design
393+
- the date implementation started
394+
- the first Kubernetes release where an initial version of the KEP was available
395+
- the version of Kubernetes where the KEP graduated to general availability
396+
- when the KEP was retired or superseded
397+
-->
171398

172399
## Drawbacks
173400

@@ -177,3 +404,10 @@ This introduces complexity to metrics stability levels, however this has been as
177404

178405
Doing nothing is a viable alternative. However, we end up in a weird spot with feature metrics, where they either have no guarantees or are fully stable.
179406

407+
## Infrastructure Needed (Optional)
408+
409+
<!--
410+
Use this section if you need things from the project/SIG. Examples include a
411+
new subproject, repos requested, or GitHub details. Listing these here allows a
412+
SIG to get the process for these resources started right away.
413+
-->
