Skip to content

Commit f37407f

Browse files
committed
Update README
1 parent 2ed09de commit f37407f

File tree

1 file changed

+126
-101
lines changed
  • keps/sig-instrumentation/2305-metrics-cardinality-enforcement

1 file changed

+126
-101
lines changed

keps/sig-instrumentation/2305-metrics-cardinality-enforcement/README.md

Lines changed: 126 additions & 101 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,4 @@
1-
# Dynamic Cardinality Enforcement
2-
3-
## Table of Contents
1+
# KEP-2305: Dynamic Cardinality Enforcement
42

53
<!-- toc -->
64
- [Release Signoff Checklist](#release-signoff-checklist)
@@ -9,10 +7,13 @@
97
- [Goals](#goals)
108
- [Non-Goals](#non-goals)
119
- [Proposal](#proposal)
10+
- [Risks and Mitigations](#risks-and-mitigations)
1211
- [Design Details](#design-details)
1312
- [Test Plan](#test-plan)
1413
- [Prerequisite testing updates](#prerequisite-testing-updates)
1514
- [Unit tests](#unit-tests)
15+
- [Integration tests](#integration-tests)
16+
- [e2e tests](#e2e-tests)
1617
- [Graduation Criteria](#graduation-criteria)
1718
- [Alpha](#alpha)
1819
- [Beta](#beta)
@@ -27,6 +28,8 @@
2728
- [Scalability](#scalability)
2829
- [Troubleshooting](#troubleshooting)
2930
- [Implementation History](#implementation-history)
31+
- [Drawbacks](#drawbacks)
32+
- [Alternatives](#alternatives)
3033
<!-- /toc -->
3134
## Release Signoff Checklist
3235

@@ -82,6 +85,14 @@ It is *not a goal* to define the allowlist for each Kubernetes component metrics
8285

8386
## Proposal
8487

88+
We should be able to *dynamically configure* an allowlist of label values for a metric.
89+
90+
### Risks and Mitigations
91+
92+
None, we are actually providing a mitigation to cardinality explosion by limiting the number of allowed values for a label.
93+
94+
## Design Details
95+
8596
The simple solution to this problem would be for each metric added to keep the unbounded dimensions in mind and prevent it from happening. SIG instrumentation has already explicitly stated this in our instrumentation guidelines: which says that ["one should know a comprehensive list of all possible values for a label at instrumentation time."](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md#dimensionality--cardinality). The problem is more complicated. First, SIG Instrumentation doesn't have a way to validate adherence to SIG instrumentation guidelines outside of reviewing PRs with instrumentation changes. Not only is this highly manual, and error-prone, we do not have a terrific existing procedure for ensuring SIG Instrumentation is tagged on relevant PRs. Even if we do have such a mechanism, it isn't a fully sufficient solution because:
8697

8798
1. metrics changes can be seemingly innocuous, even to the most diligent of code reviewers (i.e these issues are hard to catch)
@@ -270,8 +281,6 @@ This would then be interpreted by our machinery as this:
270281

271282
```
272283

273-
## Design Details
274-
275284
### Test Plan
276285

277286
[x] I/we understand the owners of the involved components may require updates to
@@ -291,6 +300,14 @@ For `Alpha`, unit test to .
291300
- `staging/src/k8s.io/component-base/metrics/histogram_test.go`: `4/3/21` - `verify that the metric label will be set to "unexpected" for histograms if the metric encounters label values outside our explicit allowlist of values`
292301
- `staging/src/k8s.io/component-base/metrics/summary_test.go`: `4/3/21` - `verify that the metric label will be set to "unexpected" for summaries if the metric encounters label values outside our explicit allowlist of values`
293302

303+
##### Integration tests
304+
305+
Unit tests suffice for the testing of this feature.
306+
307+
##### e2e tests
308+
309+
Unit tests suffice for the testing of this feature.
310+
294311
### Graduation Criteria
295312

296313
#### Alpha
@@ -317,134 +334,134 @@ N/A
317334
## Production Readiness Review Questionnaire
318335
### Feature Enablement and Rollback
319336

320-
_This section must be completed when targeting alpha to a release._
321-
322-
* **How can this feature be enabled / disabled in a live cluster?**
323-
- [x] Feature gate (also fill in values in `kep.yaml`)
324-
- Feature gate name: MetricCardinalityEnforcement
325-
- Components depending on the feature gate: All components that emit metrics, i.e. (at the time of writing),
326-
- cmd/kube-apiserver
327-
- cmd/kube-controller-manager
328-
- cmd/kubelet
329-
- pkg/kubelet/metrics
330-
- pkg/kubelet/prober
331-
- pkg/kubelet/server
332-
- pkg/proxy/metrics
333-
- cmd/kube-scheduler
334-
- pkg/volume/util
335-
336-
* **Does enabling the feature change any default behavior?**
337-
Any change of default behavior may be surprising to users or break existing
338-
automations, so be extremely careful here.
339-
Using this feature requires restarting the component with the allowlist flag enabled. Once enabled, the metric label will be set to "unexpected" if the metric encounters label values outside our explicit allowlist of values.
340-
341-
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
342-
the enablement)?**
343-
Also set `disable-supported` to `true` or `false` in `kep.yaml`.
344-
Describe the consequences on existing workloads (e.g., if this is a runtime
345-
feature, can it break the existing applications?).
346-
Yes, disabling the feature gate can revert it back to existing behavior
337+
###### How can this feature be enabled / disabled in a live cluster?
338+
339+
- [x] Feature gate (also fill in values in `kep.yaml`)
340+
- Feature gate name: MetricCardinalityEnforcement
341+
- Components depending on the feature gate: All components that emit metrics, i.e. (at the time of writing),
342+
- cmd/kube-apiserver
343+
- cmd/kube-controller-manager
344+
- cmd/kubelet
345+
- pkg/kubelet/metrics
346+
- pkg/kubelet/prober
347+
- pkg/kubelet/server
348+
- pkg/proxy/metrics
349+
- cmd/kube-scheduler
350+
- pkg/volume/util
351+
352+
###### Does enabling the feature change any default behavior?
353+
354+
Any change of default behavior may be surprising to users or break existing
355+
automations, so be extremely careful here.
356+
Using this feature requires restarting the component with the allowlist flag enabled. Once enabled, the metric label will be set to "unexpected" if the metric encounters label values outside our explicit allowlist of values.
357+
358+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
347359

348-
* **What happens if we re-enable the feature if it was previously rolled back?**
349-
The enable-disable-enable process will not cause problem. But it may be problematic during the rolled back period with the unbounded metrics value. Note that metrics are a memory-only construct and do not persist, but re-generated across restarts.
360+
Also set `disable-supported` to `true` or `false` in `kep.yaml`.
361+
Describe the consequences on existing workloads (e.g., if this is a runtime
362+
feature, can it break the existing applications?).
363+
Yes, disabling the feature gate can revert it back to existing behavior
350364

351-
* **Are there any tests for feature enablement/disablement?**
352-
Using unit tests to cover the combination cases w/wo feature and w/wo allowlist.
365+
###### What happens if we reenable the feature if it was previously rolled back?
366+
367+
The enable-disable-enable process will not cause problem. But it may be problematic during the rolled back period with the unbounded metrics value. Note that metrics are a memory-only construct and do not persist, but re-generated across restarts.
368+
369+
###### Are there any tests for feature enablement/disablement?
370+
371+
Using unit tests to cover the combination cases w/wo feature and w/wo allowlist.
353372

354373
### Rollout, Upgrade and Rollback Planning
355374

356-
_This section must be completed when targeting beta graduation to a release._
357-
358-
* **How can a rollout fail? Can it impact already running workloads?**
359-
Try to be as paranoid as possible - e.g., what if some components will restart
360-
mid-rollout?
361-
Using this feature requires restarting the component with the allowlist flag enabled.
362-
* **What specific metrics should inform a rollback?**
363-
None.
364-
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
365-
Describe manual testing that was done and the outcomes.
366-
Longer term, we may want to require automated upgrade/rollback tests, but we
367-
are missing a bunch of machinery and tooling and can't do that now.
368-
In alpha, we can do some manual tests on enable/disable allowlist flag and enable/disable feature gate.
369-
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
370-
fields of API types, flags, etc.?**
371-
A component metric flag for ingesting allowlist to be added.
375+
###### How can a rollout or rollback fail? Can it impact already running workloads?
376+
377+
Try to be as paranoid as possible - e.g., what if some components will restart
378+
mid-rollout?
379+
Using this feature requires restarting the component with the allowlist flag enabled.
380+
381+
###### What specific metrics should inform a rollback?
382+
None.
383+
384+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
385+
386+
Describe manual testing that was done and the outcomes.
387+
Longer term, we may want to require automated upgrade/rollback tests, but we
388+
are missing a bunch of machinery and tooling and can't do that now.
389+
In alpha, we can do some manual tests on enable/disable allowlist flag and enable/disable feature gate.
390+
391+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
392+
393+
A component metric flag for ingesting allowlist to be added.
372394

373395
### Monitoring Requirements
374396

375-
_This section must be completed when targeting beta graduation to a release._
397+
###### How can an operator determine if the feature is in use by workloads?
398+
399+
The out-of-bound data will be recorded with label "unexpected" rather than the specific value.
376400

377-
* **How can an operator determine if the feature is in use by workloads?**
378-
The out-of-bound data will be recorded with label "unexpected" rather than the specific value.
379-
* **What are the SLIs (Service Level Indicators) an operator can use to determine
380-
the health of the service?**
381-
None.
401+
###### How can someone using this feature know that it is working for their instance?
382402

383-
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
384-
None.
403+
The out-of-bound data will be recorded with label "unexpected" rather than the specific value.
385404

386-
* **Are there any missing metrics that would be useful to have to improve observability
387-
of this feature?**
388-
- `cardinality_enforcement_unexpected_categorizations_total`: Increments whenever any metric falls into the "unexpected" case (i.e., goes out of the defined bounds).
405+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
406+
407+
None.
408+
409+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
410+
411+
None.
412+
413+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
414+
- `cardinality_enforcement_unexpected_categorizations_total`: Increments whenever any metric falls into the "unexpected" case (i.e., goes out of the defined bounds).
389415

390416
### Dependencies
391417

392-
_This section must be completed when targeting beta graduation to a release._
418+
###### Does this feature depend on any specific services running in the cluster?
393419

394-
* **Does this feature depend on any specific services running in the cluster?**
395-
No.
420+
No.
396421

397422
### Scalability
398423

399-
_For alpha, this section is encouraged: reviewers should consider these questions
400-
and attempt to answer them._
424+
###### Will enabling / using this feature result in any new API calls?
425+
426+
No.
401427

402-
_For beta, this section is required: reviewers must answer these questions._
428+
###### Will enabling / using this feature result in introducing new API types?
403429

404-
_For GA, this section is required: approvers should be able to confirm the
405-
previous answers based on experience in the field._
430+
No.
406431

407-
* **Will enabling / using this feature result in any new API calls?**
408-
No.
432+
###### Will enabling / using this feature result in any new calls to the cloud provider?
409433

410-
* **Will enabling / using this feature result in introducing new API types?**
411-
Describe them, providing:
412-
No.
434+
No.
413435

414-
* **Will enabling / using this feature result in any new calls to the cloud
415-
provider?**
416-
No.
417-
* **Will enabling / using this feature result in increasing size or count of
418-
the existing API objects?**
419-
No.
436+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
420437

421-
* **Will enabling / using this feature result in increasing time taken by any
422-
operations covered by [existing SLIs/SLOs]?**
423-
Checking metrics label against allowlist may increase the metric recording time.
438+
No.
424439

425-
* **Will enabling / using this feature result in non-negligible increase of
426-
resource usage (CPU, RAM, disk, IO, ...) in any components?**
427-
No.
440+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
428441

429-
* **Can enabling / using this feature result in resource exhaustion of some
430-
node resources (PIDs, sockets, inodes, etc.)?**
431-
No.
442+
Checking metrics label against allowlist may increase the metric recording time.
443+
444+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
445+
446+
No.
447+
448+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?**
449+
450+
No.
432451

433452
### Troubleshooting
434453

435-
The Troubleshooting section currently serves the `Playbook` role. We may consider
436-
splitting it into a dedicated `Playbook` document (potentially with some monitoring
437-
details). For now, we leave it here.
454+
###### How does this feature react if the API server and/or etcd is unavailable?
455+
456+
No additional impact comparing to what already exists.
438457

439-
_This section must be completed when targeting beta graduation to a release._
458+
###### What are other known failure modes?
440459

441-
* **How does this feature react if the API server and/or etcd is unavailable?**
442-
No additional impact comparing to what already exists.
443-
* **What are other known failure modes?**
444-
None.
460+
None.
445461

446-
* **What steps should be taken if SLOs are not being met to determine the problem?**
447-
None.
462+
###### What steps should be taken if SLOs are not being met to determine the problem?
463+
464+
None.
448465

449466
## Implementation History
450467

@@ -457,3 +474,11 @@ _This section must be completed when targeting beta graduation to a release._
457474
2023-05-07: Graduate to beta
458475

459476
2024-01-20: Graduate to stable
477+
478+
## Drawbacks
479+
480+
None have been identified.
481+
482+
## Alternatives
483+
484+
Manually keep track of all the values a label can have while adding a new label to a metric. And have a rigorous review process for PRs that add new metrcs/metric labels. But this approach is manual and error prone.

0 commit comments

Comments
 (0)