Update KEP-2305: Metric cardinality enforcement based on the reviews
received on #2305 to continue the effort further.
Signed-off-by: Pranshu Srivastava <[email protected]>
@@ -266,14 +271,47 @@ This would then be interpreted by our machinery as this:
 ```
 
 ## Design Details
+
 ### Test Plan
-For `Alpha`, unit test to verify that the metric label will be set to "unexpected" if the metric encounters label values outside our explicit allowlist of values.
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+N/A
+
+##### Unit tests
+
+For `Alpha`, unit test to .
+
+- `staging/src/k8s.io/component-base/metrics/counter_test.go`: `3/3/2021` - `verify that the metric label will be set to "unexpected" for counters if the metric encounters label values outside our explicit allowlist of values`
+- `staging/src/k8s.io/component-base/metrics/gauge_test.go`: `4/3/21` - `verify that the metric label will be set to "unexpected" for gauges if the metric encounters label values outside our explicit allowlist of values`
+- `staging/src/k8s.io/component-base/metrics/histogram_test.go`: `4/3/21` - `verify that the metric label will be set to "unexpected" for histograms if the metric encounters label values outside our explicit allowlist of values`
+- `staging/src/k8s.io/component-base/metrics/summary_test.go`: `4/3/21` - `verify that the metric label will be set to "unexpected" for summaries if the metric encounters label values outside our explicit allowlist of values`
+
 ### Graduation Criteria
-For `Alpha`, the allowlist of metrics can be configured via the exposed flag and the unit test is passed.
-For `Beta`, the allowlist can be configured from a input file(e.g. yaml file).
+
+#### Alpha
+
+- Feature implemented behind a feature flag
+- The allowlist of metrics can be configured via the exposed flag and the unit test is passed.
+
+#### Beta
+
+- The allowlist can be configured from a manifest.
+
+#### GA
+
+- Allow pattern-matching for labels in the allowlist.
+
 ### Upgrade / Downgrade strategy
+
 N/A
+
 ### Version Skew Strategy
+
 N/A
 
 ## Production Readiness Review Questionnaire
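The unit tests listed in the hunk above all check one behavior: a label value outside the explicit allowlist is recorded as "unexpected". A minimal Go sketch of that behavior follows. The `LabelAllowList` type and its `Constrain` method are hypothetical names chosen for illustration, not the `component-base/metrics` API, and passing labels through unchanged when they have no allowlist entry is an assumption.

```go
package main

import "fmt"

// unexpectedValue is the placeholder named in the KEP's test plan.
const unexpectedValue = "unexpected"

// LabelAllowList maps a label name to the set of values permitted for it.
// Hypothetical type for illustration; not the component-base API.
type LabelAllowList map[string]map[string]struct{}

// Constrain returns a copy of labels in which every value that is not
// allowlisted for its label name is replaced by "unexpected".
// Assumption: labels with no allowlist entry pass through unchanged.
func (a LabelAllowList) Constrain(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for name, value := range labels {
		out[name] = value
		allowed, ok := a[name]
		if !ok {
			continue // no allowlist for this label name
		}
		if _, ok := allowed[value]; !ok {
			out[name] = unexpectedValue
		}
	}
	return out
}

func main() {
	allow := LabelAllowList{
		"verb": {"GET": {}, "POST": {}},
	}
	got := allow.Constrain(map[string]string{"verb": "PATCH", "code": "200"})
	fmt.Println(got["verb"], got["code"]) // prints "unexpected 200"
}
```

Bounding each constrained label to its allowlisted values plus one placeholder is what caps the number of series a metric can produce.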
@@ -284,7 +322,16 @@ _This section must be completed when targeting alpha to a release._
 * **How can this feature be enabled / disabled in a live cluster?**
   - [x] Feature gate (also fill in values in `kep.yaml`)
     - Feature gate name: MetricCardinalityEnforcement
-    - Components depending on the feature gate: All components that emit metrics
+    - Components depending on the feature gate: All components that emit metrics, i.e. (at the time of writing),
+      - cmd/kube-apiserver
+      - cmd/kube-controller-manager
+      - cmd/kubelet
+      - pkg/kubelet/metrics
+      - pkg/kubelet/prober
+      - pkg/kubelet/server
+      - pkg/proxy/metrics
+      - cmd/kube-scheduler
+      - pkg/volume/util
 
 * **Does enabling the feature change any default behavior?**
 Any change of default behavior may be surprising to users or break existing
@@ -298,8 +345,8 @@ _This section must be completed when targeting alpha to a release._
 feature, can it break the existing applications?).
 Yes, disabling the feature gate can revert it back to existing behavior
 
-* **What happens if we reenable the feature if it was previously rolled back?**
-The enable-disable-enable process will not cause problem. But it may be problematic during the rolled back period with the unbounded metrics value.
+* **What happens if we re-enable the feature if it was previously rolled back?**
+The enable-disable-enable process will not cause problem. But it may be problematic during the rolled back period with the unbounded metrics value. Note that metrics are a memory-only construct and do not persist, but re-generated across restarts.
 
 * **Are there any tests for feature enablement/disablement?**
 Using unit tests to cover the combination cases w/wo feature and w/wo allowlist.
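The four combinations mentioned above (with or without the feature gate, with or without an allowlist) lend themselves to a table-driven check. The sketch below is illustrative only: `enforce` is a hypothetical stand-in for the gated code path, and treating "gate on, no allowlist" as pass-through is an assumption rather than something the KEP states.

```go
package main

import "fmt"

// enforce is a hypothetical stand-in for the gated code path.
// With the gate off, the value is returned untouched (the existing behavior).
// Assumption: with the gate on but no allowlist, the value is also untouched.
func enforce(gateEnabled bool, allowed map[string]struct{}, value string) string {
	if !gateEnabled || allowed == nil {
		return value
	}
	if _, ok := allowed[value]; ok {
		return value
	}
	return "unexpected"
}

func main() {
	allow := map[string]struct{}{"GET": {}}
	cases := []struct {
		name  string
		gate  bool
		allow map[string]struct{}
		want  string
	}{
		{"gate off, no allowlist", false, nil, "PATCH"},
		{"gate off, allowlist", false, allow, "PATCH"},
		{"gate on, no allowlist", true, nil, "PATCH"},
		{"gate on, allowlist", true, allow, "unexpected"},
	}
	for _, c := range cases {
		got := enforce(c.gate, c.allow, "PATCH")
		fmt.Printf("%-24s got=%q ok=%t\n", c.name, got, got == c.want)
	}
}
```

Only the last row changes the recorded value, which matches the statement that disabling the gate reverts to the existing behavior.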
@@ -322,6 +369,7 @@ _This section must be completed when targeting beta graduation to a release._
 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
 A component metric flag for ingesting allowlist to be added.
+
 ### Monitoring Requirements
 
 _This section must be completed when targeting beta graduation to a release._
@@ -337,7 +385,7 @@ the health of the service?**
 
 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
-None.
+- `cardinality_enforcement_unexpected_categorizations_total`: Increments whenever any metric falls into the "unexpected" case (i.e., goes out of the defined bounds).
 
 ### Dependencies
 
@@ -346,7 +394,6 @@ _This section must be completed when targeting beta graduation to a release._
 * **Does this feature depend on any specific services running in the cluster?**
 No.
 
-
 ### Scalability
 
 _For alpha, this section is encouraged: reviewers should consider these questions
@@ -379,6 +426,10 @@ operations covered by [existing SLIs/SLOs]?**
 resource usage (CPU, RAM, disk, IO, ...) in any components?**
 No.
 
+* **Can enabling / using this feature result in resource exhaustion of some
+node resources (PIDs, sockets, inodes, etc.)?**
+No.
+
 ### Troubleshooting
 
 The Troubleshooting section currently serves the `Playbook` role. We may consider