**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.

For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.

Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
### Goals

Introduce two more metric classes: `beta`, corresponding to the `beta` stage of feature release, and `internal`, which corresponds to internal development-related metrics.

### Non-Goals

- establishing whether specific metrics fall into a stability class; this exercise is left to component owners, who own their own metrics

## Proposal

We're proposing adding additional metadata fields to Kubernetes metrics. Specifically, we want to add the following stability levels:

- `Internal` - representing internal usages of metrics (i.e. classes of metrics which do not correspond to features) or low-level metrics that a typical operator would not understand (or would not be able to react to properly).
- `Beta` - representing a more mature stage of a feature metric, with greater stability guarantees than alpha or internal metrics, but less than `Stable`.

We also propose amending the semantic meaning of an `Alpha` metric such that it represents the nascent stage of a KEP-proposed feature, rather than the entire class of metrics without stability guarantees.
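To make this concrete, here is a minimal sketch of how a metric declaration might look once the proposed levels exist, using the `k8s.io/component-base/metrics` API that Kubernetes components already use; the metric itself is hypothetical, and `BETA`/`INTERNAL` are the enumeration values proposed here rather than something the library is guaranteed to expose in this exact form:

```go
package example

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// exampleRequestDuration is a hypothetical feature metric declared at the
// proposed Beta stability level. ALPHA and STABLE already exist in
// component-base; BETA and INTERNAL are the new values this KEP proposes.
var exampleRequestDuration = metrics.NewHistogram(
	&metrics.HistogramOpts{
		Subsystem:      "example_component",
		Name:           "request_duration_seconds",
		Help:           "Duration of hypothetical example requests, in seconds.",
		Buckets:        []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
		StabilityLevel: metrics.BETA, // proposed in this KEP
	},
)

func init() {
	// Registration is unchanged; only the stability metadata differs.
	legacyregistry.MustRegister(exampleRequestDuration)
}
```

Under the amended semantics, the same declaration with `metrics.ALPHA` would signal an early, KEP-tracked feature metric, while `metrics.INTERNAL` would mark metrics that do not correspond to a user-facing feature at all.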
### Risks and Mitigations

The primary risk is that these changes break our existing (and working) metrics infrastructure. The mitigation should be straightforward, i.e. rolling back the changes to the metrics framework.

## Design Details

Our plan is to add functionality to our static analysis framework which is hosted in the main `k8s/k8s` repo, under `test/instrumentation`. Specifically, we will need to support:
### Test Plan

We have static analysis testing for stable metrics; we will extend our test coverage to include metrics which are `ALPHA` and `BETA` while ignoring `INTERNAL` metrics.

[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

##### Prerequisite testing updates

We already have thorough testing for the stability framework, which has been GA for years.

##### Unit tests

The static analysis parser's unit tests cover the following (a sketch of the kinds of declarations exercised follows this list):

- [ ] parsing variables
- [ ] multi-line strings
- [ ] evaluating buckets
- [ ] buckets which are defined via variables and consts
- [ ] evaluation of simple consts
- [ ] evaluation of simple variables

- `test/instrumentation`: `09/20/2022` - `full coverage of existing stability framework`
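The sketch below illustrates the kinds of declarations those unit-test cases refer to, i.e. constructs the `test/instrumentation` parser has to evaluate without executing the code; the metric names and values are invented for illustration and assume the `k8s.io/component-base/metrics` API:

```go
package example

import "k8s.io/component-base/metrics"

// Constructs the parser must evaluate statically: simple consts and variables,
// multi-line (concatenated) strings, and buckets built from variables, consts,
// and helpers rather than literals.
const subsystem = "example_subsystem" // simple const

var (
	bucketStart = 0.001 // simple variable feeding the bucket helper

	// multi-line help string assembled by concatenation
	latencyHelp = "Latency of hypothetical example operations," +
		" reported in seconds."

	// buckets defined via a variable and a helper rather than a literal slice
	latencyBuckets = metrics.ExponentialBuckets(bucketStart, 2, 10)
)

var exampleOperationLatency = metrics.NewHistogram(
	&metrics.HistogramOpts{
		Subsystem:      subsystem,
		Name:           "operation_duration_seconds",
		Help:           latencyHelp,
		Buckets:        latencyBuckets,
		StabilityLevel: metrics.ALPHA,
	},
)
```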

##### Integration tests

We will test the static analysis parser on a test directory containing all permutations of metrics which we expect to parse (and variants we expect not to be able to parse).

##### e2e tests

The static analysis tooling runs in a precommit pipeline and is therefore exempt from runtime tests.

### Graduation Criteria

#### Beta

- Kubernetes metrics framework will be enhanced to support marking `Alpha` and `Beta` metrics with a date. The semantics of this are yet to be determined. This date will be used to statically determine whether that metric should be deprecated automatically or promoted (a purely hypothetical sketch of such an annotation follows this list).
- Kubernetes metrics framework will be enhanced with a script to auto-deprecate metrics which have passed their window of existence as an `Alpha` or `Beta` metric
- We will determine the semantics for `Alpha` and `Beta` metrics
- The `beta` stage for this framework will be a few releases. During this time, we will evaluate the utility and the ergonomics of the framework, making adjustments as necessary
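As a purely hypothetical illustration of the date-marking idea in the first bullet above (the KEP explicitly leaves these semantics undecided, and no such field exists in `k8s.io/component-base/metrics` today), an annotated declaration might look roughly like this:

```go
package example

import "k8s.io/component-base/metrics"

// exampleFeatureEnabled is an imaginary Alpha metric. The commented-out field
// sketches one possible shape for a lifetime annotation that an
// auto-deprecation script could compare against the upcoming release date;
// it is not a real field in component-base today.
var exampleFeatureEnabled = metrics.NewGauge(
	&metrics.GaugeOpts{
		Name:           "example_feature_enabled",
		Help:           "Whether the hypothetical example feature is enabled.",
		StabilityLevel: metrics.ALPHA,
		// PromoteOrDeprecateBy: "2023-06-01", // hypothetical annotation, semantics TBD
	},
)
```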

#### GA

- We will allow bake time before promoting this feature to GA
- At this stage, we will promote our meta-metric for registered metrics to Stable

#### Deprecation

- This section will pertain to the deprecation policy of deprecated `Alpha` and `Beta` metrics, which will be determined in the `Beta` version of this KEP.

### Upgrade / Downgrade Strategy

The static analysis code does not run in Kubernetes runtime code, with the exception of the registered_metrics metric.
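For context, the registered-metrics meta-metric mentioned here might look roughly like the following sketch; the exact in-tree name and labels may differ, so treat this purely as an illustration of the idea:

```go
package example

import "k8s.io/component-base/metrics"

// registeredMetricsTotal counts metric registrations, labeled by stability
// level, so the distribution of ALPHA/BETA/STABLE/INTERNAL metrics can be
// observed at runtime. This is a sketch, not the in-tree definition.
var registeredMetricsTotal = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Name:           "registered_metrics_total",
		Help:           "The count of registered metrics broken out by stability level.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"stability_level"},
)
```

A counter like this is also what the rollback question below refers to: if it sums to zero after a change to the framework, registration is broken.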

### Version Skew Strategy

This feature does not require a version skew strategy.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

This feature cannot be enabled or rolled back. It is built into the infrastructure of metrics, which will support two additional values for the enumeration of stable classes of metrics.

###### How can this feature be enabled / disabled in a live cluster?

It cannot. This is purely infrastructure-based and requires adding additional enumeration values to metrics stability classes.

###### Does enabling the feature change any default behavior?

It will cause metrics previously annotated as `Alpha` metrics to be denoted as `Internal`.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

No.

###### What happens if we reenable the feature if it was previously rolled back?

N/A

###### Are there any tests for feature enablement/disablement?

No.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

This should not affect rollout. It could affect workloads that depended on `Alpha` metrics, which will be recategorized as `Internal`. But to be fair, we've already explicitly laid out the fact that `Alpha` metrics do not have stability guarantees.

###### What specific metrics should inform a rollback?

`registered_metrics_total` summing to zero.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

This should not affect upgrade/rollback paths.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

`Alpha` metrics will be recategorized as `Internal`.

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

You can determine this by seeing if workloads depend on any Kubernetes control-plane metrics. If they do, they are using this feature.

###### How can someone using this feature know that it is working for their instance?

They will be able to see metrics.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This tooling runs in precommit. It does not affect runtime SLOs.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

N/A

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

`registered_metrics_total` will be used to calculate the number of registered stable metrics.

### Dependencies

Prometheus and the Kubernetes metric framework.

###### Does this feature depend on any specific services running in the cluster?

In order to ingest these metrics, one needs a Prometheus scraping agent and some backend to persist the metric data.

### Scalability

###### Will enabling / using this feature result in any new API calls?

No.

###### Will enabling / using this feature result in introducing new API types?

No.

###### Will enabling / using this feature result in any new calls to the cloud provider?

No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

The apiserver needs to be available to scrape metrics; if etcd is not available, you may still be able to scrape metrics from the apiserver.

###### What are other known failure modes?

Runaway cardinality of metrics, but that is orthogonal to the scope of this KEP.

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

## Alternatives

Doing nothing is a viable alternative. However, we end up in a weird spot with feature metrics, where they have no guarantees or are fully stable.

## Infrastructure Needed (Optional)

<!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
-->