## Proposal
We should be able to *dynamically configure* an allowlist of label values for a metric.
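
To make the intent concrete, below is a minimal, hypothetical Go sketch of how an allowlist could constrain label values at observation time. The type and function names are illustrative only and are not the actual `k8s.io/component-base/metrics` implementation; values outside the allowlist are collapsed into the single sentinel value `unexpected`, the behavior described later in this document.

```go
// Illustrative sketch only: labelAllowList and constrainLabelValue are hypothetical
// names, not the actual k8s.io/component-base/metrics implementation.
package main

import "fmt"

// labelAllowList maps "metricName,labelName" to the set of allowed label values.
type labelAllowList map[string]map[string]bool

// constrainLabelValue returns the value unchanged if it is allowlisted and the
// sentinel "unexpected" otherwise, which bounds the metric's cardinality.
func (a labelAllowList) constrainLabelValue(metric, label, value string) string {
	allowed, ok := a[metric+","+label]
	if !ok {
		return value // no allowlist configured for this metric/label
	}
	if allowed[value] {
		return value
	}
	return "unexpected"
}

func main() {
	allow := labelAllowList{
		"apiserver_request_total,verb": {"GET": true, "LIST": true, "WATCH": true},
	}
	fmt.Println(allow.constrainLabelValue("apiserver_request_total", "verb", "LIST"))  // LIST
	fmt.Println(allow.constrainLabelValue("apiserver_request_total", "verb", "bogus")) // unexpected
}
```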
### Risks and Mitigations
None. We are in fact providing a mitigation for cardinality explosion by limiting the number of allowed values for a label.
## Design Details
The simple solution to this problem would be to keep unbounded dimensions in mind for each metric that is added and to prevent them from being introduced. SIG Instrumentation has already stated this explicitly in its instrumentation guidelines, which say that ["one should know a comprehensive list of all possible values for a label at instrumentation time."](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md#dimensionality--cardinality). The problem is more complicated than that. First, SIG Instrumentation has no way to validate adherence to its guidelines other than reviewing PRs that contain instrumentation changes. Not only is this highly manual and error-prone, we also do not have a reliable procedure for ensuring SIG Instrumentation is tagged on relevant PRs. And even if we had such a mechanism, it would not be a fully sufficient solution because:
1. metrics changes can be seemingly innocuous, even to the most diligent of code reviewers (i.e., these issues are hard to catch)
### Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

For `Alpha`, unit tests:
- `staging/src/k8s.io/component-base/metrics/histogram_test.go`: `4/3/21` - `verify that the metric label will be set to "unexpected" for histograms if the metric encounters label values outside our explicit allowlist of values`
- `staging/src/k8s.io/component-base/metrics/summary_test.go`: `4/3/21` - `verify that the metric label will be set to "unexpected" for summaries if the metric encounters label values outside our explicit allowlist of values`
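
As a rough, hypothetical illustration of what these tests check (the real tests live in the files listed above and exercise the actual allowlist machinery in `component-base`):

```go
// Hypothetical test sketch; not the actual histogram_test.go / summary_test.go code.
package metrics

import "testing"

// constrain mirrors the allowlist behavior under test: values outside the
// allowed set are collapsed to the sentinel "unexpected".
func constrain(allowed map[string]bool, value string) string {
	if allowed[value] {
		return value
	}
	return "unexpected"
}

func TestLabelValuesOutsideAllowlistBecomeUnexpected(t *testing.T) {
	allowed := map[string]bool{"GET": true, "LIST": true, "WATCH": true}

	// An allowlisted value is recorded as-is.
	if got := constrain(allowed, "GET"); got != "GET" {
		t.Errorf("expected GET, got %q", got)
	}
	// A value outside the allowlist is recorded under "unexpected".
	if got := constrain(allowed, "PROPFIND"); got != "unexpected" {
		t.Errorf("expected unexpected, got %q", got)
	}
}
```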
##### Integration tests
Unit tests suffice for the testing of this feature.
##### e2e tests
Unit tests suffice for the testing of this feature.
### Graduation Criteria
#### Alpha
## Production Readiness Review Questionnaire
### Feature Enablement and Rollback
###### How can this feature be enabled / disabled in a live cluster?

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: MetricCardinalityEnforcement
  - Components depending on the feature gate: All components that emit metrics, i.e. (at the time of writing):
    - cmd/kube-apiserver
    - cmd/kube-controller-manager
    - cmd/kubelet
    - pkg/kubelet/metrics
    - pkg/kubelet/prober
    - pkg/kubelet/server
    - pkg/proxy/metrics
    - cmd/kube-scheduler
    - pkg/volume/util

###### Does enabling the feature change any default behavior?

Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here.

Using this feature requires restarting the component with the allowlist flag enabled. Once enabled, the metric label will be set to "unexpected" if the metric encounters label values outside our explicit allowlist of values.
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Also set `disable-supported` to `true` or `false` in `kep.yaml`.
Describe the consequences on existing workloads (e.g., if this is a runtime feature, can it break the existing applications?).

Yes, disabling the feature gate reverts the component to the existing behavior.
###### What happens if we reenable the feature if it was previously rolled back?
The enable-disable-enable process will not cause problems. However, during the rolled-back period the metrics may again accumulate unbounded label values. Note that metrics are a memory-only construct; they do not persist and are regenerated across restarts.
###### Are there any tests for feature enablement/disablement?
Unit tests cover the combinations of cases with and without the feature gate, and with and without an allowlist.
### Rollout, Upgrade and Rollback Planning
###### How can a rollout or rollback fail? Can it impact already running workloads?

Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?

Using this feature requires restarting the component with the allowlist flag enabled.
###### What specific metrics should inform a rollback?
None.
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Describe manual testing that was done and the outcomes.

Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.

In alpha, we can do some manual testing of enabling/disabling the allowlist flag and enabling/disabling the feature gate.
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
A component metrics flag for ingesting the allowlist is to be added.
### Monitoring Requirements
###### How can an operator determine if the feature is in use by workloads?
Out-of-allowlist data will be recorded with the label value "unexpected" rather than the specific value.
###### How can someone using this feature know that it is working for their instance?
Out-of-allowlist data will be recorded with the label value "unexpected" rather than the specific value.
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
None.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
None.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
- `cardinality_enforcement_unexpected_categorizations_total`: Increments whenever any metric falls into the "unexpected" case (i.e., goes out of the defined bounds).
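
For illustration, a sketch of how such a counter could be defined, shown here with the upstream Prometheus Go client; in practice it would likely be declared through the `k8s.io/component-base/metrics` wrappers and incremented wherever a label value is collapsed to "unexpected":

```go
// Illustrative sketch using the upstream Prometheus client directly.
package cardinality

import "github.com/prometheus/client_golang/prometheus"

// unexpectedCategorizations counts label observations that fell outside the
// configured allowlist and were therefore recorded as "unexpected".
var unexpectedCategorizations = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "cardinality_enforcement_unexpected_categorizations_total",
	Help: "Number of label observations recorded as \"unexpected\" because the value was outside the configured allowlist.",
})

func init() {
	prometheus.MustRegister(unexpectedCategorizations)
}

// recordUnexpected would be called each time a label value is constrained to "unexpected".
func recordUnexpected() {
	unexpectedCategorizations.Inc()
}
```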
### Dependencies
###### Does this feature depend on any specific services running in the cluster?
No.
### Scalability
###### Will enabling / using this feature result in any new API calls?
No.
###### Will enabling / using this feature result in introducing new API types?
No.
###### Will enabling / using this feature result in any new calls to the cloud provider?
No.
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
No.
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Checking metric labels against the allowlist may increase metric recording time.
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
### Troubleshooting
###### How does this feature react if the API server and/or etcd is unavailable?
No additional impact compared to what already exists.
###### What are other known failure modes?
None.
###### What steps should be taken if SLOs are not being met to determine the problem?
None.
## Implementation History
2023-05-07: Graduate to beta
2024-01-20: Graduate to stable
## Drawbacks
None have been identified.
## Alternatives
Manually keep track of all the values a label can have when adding a new label to a metric, and rely on a rigorous review process for PRs that add new metrics or metric labels. But this approach is manual and error-prone.