Commit dd2ec34

Merge pull request kubernetes#2468 from YoyinZyc/update_kep
Update the metric cardinality kep
2 parents 9fe25b2 + 9ff9443 commit dd2ec34

3 files changed: +177 -20 lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+kep-number: 2305
+alpha:
+  approver: "@johnbelamaric"

keps/sig-instrumentation/2305-metrics-cardinality-enforcement/README.md

Lines changed: 166 additions & 18 deletions
@@ -3,16 +3,49 @@
 ## Table of Contents
 
 <!-- toc -->
+- [Release Signoff Checklist](#release-signoff-checklist)
 - [Summary](#summary)
 - [Motivation](#motivation)
 - [Goals](#goals)
 - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
-- [Open-Question](#open-question)
-- [Graduation Criteria](#graduation-criteria)
-- [Post-Beta tasks](#post-beta-tasks)
+- [Design Details](#design-details)
+- [Test Plan](#test-plan)
+- [Graduation Criteria](#graduation-criteria)
+- [Upgrade / Downgrade strategy](#upgrade--downgrade-strategy)
+- [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+- [Monitoring Requirements](#monitoring-requirements)
+- [Dependencies](#dependencies)
+- [Scalability](#scalability)
+- [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 <!-- /toc -->
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) Design details are appropriately documented
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [x] (R) Graduation criteria is in place
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
+- [x] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation (e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes)
+
+<!--
+**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
+-->
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
 
 ## Summary
 
@@ -38,11 +71,9 @@ This KEP proposes a possible solution for this, with the primary goal being to e
 
 ### Non-Goals
 
-We will expose the machinery and tools to bind a metric's labels to a discrete set of values.
+We will expose the machinery and tools to bind a metric's labels to a discrete set of values. The allowlist will be ingested via a newly added component metric flag.
 
-It is *not a goal* to implement and plumb this solution for each Kubernetes component (there are many SIGs and a number of verticals, which may have their own preferred way of doing things). As such it will be up to component owners to leverage this functionality that we provide, by feeding configuration data through whatever mechanism deemed appropriate (i.e. command line flags or reading from a file).
-
-These flags are really only meant to be used as escape hatches, and should not be used to have extremely customized kubernetes setups where our existing dashboards and alerting rule definitions are just not going to apply generally anymore.
+It is *not a goal* to define the allowlist for each Kubernetes component's metrics.
 
 ## Proposal
 
@@ -53,9 +84,7 @@ The simple solution to this problem would be for each metric added to keep the u
 
 Instead, the proposed solution is we will be able to *dynamically configure* an allowlist of label values for a metric. By *dynamically configure*, we mean configure an allowlist *at runtime* rather than during the build/compile step. And by *at runtime*, we mean, more specifically, during the boot sequence for a Kubernetes component (and we mean daemons here, not CLI tools like kubectl).
 
-Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large amount of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). It is also possible to read configuration data from files (like yaml format) during the component boot sequence. This KEP does not have an opinion on the specific mechanism used to load config data into a Kubernetes binary during the boot sequence. What we *actually* care about, is just the fact that it is possible.
-
-Our design is thus config-ingestion agnostic.
+Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large amount of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). Our design is going to add a flag that applies to all components to ingest the allowlist.
 
 Our design is also based on the premise that metrics can be uniquely identified (i.e. by their metric descriptor). Fortunately for us, this is actually a built in constraint of prometheus clients (which is how we instrument the Kubernetes components). This metric ID is resolvable to a unique string (this is pretty evident when looking at a raw prometheus client endpoint).
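As a rough illustration of ingesting an allowlist during a component's boot sequence, the Go sketch below parses one entry into a map keyed by metric name and label name. The entry syntax (`metric,label=v1|v2`), the `Allowlist` type, and the `ParseEntry` method are assumptions made for this example only; the flag name and value format actually adopted by the KEP are not shown in this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// Allowlist maps metric name -> label name -> set of permitted label values.
// The type is hypothetical and exists only for this sketch.
type Allowlist map[string]map[string]map[string]struct{}

// ParseEntry parses one entry of the assumed form "metric,label=v1|v2"
// and merges the permitted values into the allowlist.
func (a Allowlist) ParseEntry(entry string) error {
	key, rawValues, ok := strings.Cut(entry, "=")
	if !ok {
		return fmt.Errorf("missing '=' in %q", entry)
	}
	metric, label, ok := strings.Cut(key, ",")
	if !ok || metric == "" || label == "" {
		return fmt.Errorf("expected metric,label before '=' in %q", entry)
	}
	if a[metric] == nil {
		a[metric] = map[string]map[string]struct{}{}
	}
	if a[metric][label] == nil {
		a[metric][label] = map[string]struct{}{}
	}
	for _, v := range strings.Split(rawValues, "|") {
		a[metric][label][v] = struct{}{}
	}
	return nil
}

func main() {
	allow := Allowlist{}
	// In a real component this string would come from a start-up flag.
	if err := allow.ParseEntry("http_requests_total,code=200|404|500"); err != nil {
		panic(err)
	}
	fmt.Println(len(allow["http_requests_total"]["code"]), "allowed values for label \"code\"")
}
```

Keying on the metric name relies on the premise above that each metric resolves to a unique string.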
 
@@ -236,19 +265,138 @@ This would then be interpreted by our machinery as this:
 
 ```
 
-## Open-Question
-
-
-## Graduation Criteria
+## Design Details
+### Test Plan
+For `Alpha`, unit tests verify that a metric label is set to "unexpected" when the metric encounters a label value outside the explicit allowlist.
+### Graduation Criteria
+For `Alpha`, the allowlist of metrics can be configured via the exposed flag and the unit tests pass.
+For `Beta`, the allowlist can be configured from an input file (e.g. a YAML file).
+### Upgrade / Downgrade strategy
+N/A
+### Version Skew Strategy
+N/A
+
+## Production Readiness Review Questionnaire
+### Feature Enablement and Rollback
+
+_This section must be completed when targeting alpha to a release._
+
+* **How can this feature be enabled / disabled in a live cluster?**
+- [x] Feature gate (also fill in values in `kep.yaml`)
+- Feature gate name: MetricCardinalityEnforcement
+- Components depending on the feature gate: All components that emit metrics
+
+* **Does enabling the feature change any default behavior?**
+Any change of default behavior may be surprising to users or break existing
+automations, so be extremely careful here.
+Using this feature requires restarting the component with the allowlist flag enabled. Once enabled, the metric label will be set to "unexpected" if the metric encounters label values outside our explicit allowlist of values.
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+the enablement)?**
+Also set `disable-supported` to `true` or `false` in `kep.yaml`.
+Describe the consequences on existing workloads (e.g., if this is a runtime
+feature, can it break the existing applications?).
+Yes, disabling the feature gate reverts to the existing behavior.
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+The enable-disable-enable process will not cause problems. However, label values are unbounded during the rolled-back period, which may be problematic.
+
+* **Are there any tests for feature enablement/disablement?**
+Unit tests cover the combinations with and without the feature gate and with and without an allowlist.
+
+### Rollout, Upgrade and Rollback Planning
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can a rollout fail? Can it impact already running workloads?**
+Try to be as paranoid as possible - e.g., what if some components will restart
+mid-rollout?
+Using this feature requires restarting the component with the allowlist flag enabled.
+* **What specific metrics should inform a rollback?**
+None.
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+Describe manual testing that was done and the outcomes.
+Longer term, we may want to require automated upgrade/rollback tests, but we
+are missing a bunch of machinery and tooling and can't do that now.
+In alpha, we can manually test enabling and disabling the allowlist flag and the feature gate.
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+fields of API types, flags, etc.?**
+A component metric flag for ingesting the allowlist will be added.
+### Monitoring Requirements
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can an operator determine if the feature is in use by workloads?**
+Out-of-bound data will be recorded with the label value "unexpected" rather than the specific value.
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+the health of the service?**
+None.
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+None.
+
+* **Are there any missing metrics that would be useful to have to improve observability
+of this feature?**
+None.
+
+### Dependencies
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **Does this feature depend on any specific services running in the cluster?**
+No.
+
+
+### Scalability
+
+_For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them._
+
+_For beta, this section is required: reviewers must answer these questions._
+
+_For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field._
+
+* **Will enabling / using this feature result in any new API calls?**
+No.
+
+* **Will enabling / using this feature result in introducing new API types?**
+Describe them, providing:
+No.
+
+* **Will enabling / using this feature result in any new calls to the cloud
+provider?**
+No.
+* **Will enabling / using this feature result in increasing size or count of
+the existing API objects?**
+No.
+
+* **Will enabling / using this feature result in increasing time taken by any
+operations covered by [existing SLIs/SLOs]?**
+Checking metric labels against the allowlist may increase metric recording time.
 
-todo
+* **Will enabling / using this feature result in non-negligible increase of
+resource usage (CPU, RAM, disk, IO, ...) in any components?**
+No.
 
+### Troubleshooting
 
-## Post-Beta tasks
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
 
-todo
+_This section must be completed when targeting beta graduation to a release._
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+No additional impact compared to what already exists.
+* **What are other known failure modes?**
+None.
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+None.
 
 ## Implementation History
 
-todo
+2020-04-15: KEP opened
 
+2020-05-19: KEP marked implementable
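
The behavior named in the Test Plan and in the Feature Enablement answers (a label value outside the allowlist is recorded as "unexpected") can be sketched in Go roughly as follows. This is a minimal illustration, not the implementation in Kubernetes: the `LabelAllowlist` type and `Constrain` method are hypothetical names, and passing labels that have no allowlist entry through unchanged is an assumption of this sketch.

```go
package main

import "fmt"

// unexpectedValue is the replacement value described in the KEP text.
const unexpectedValue = "unexpected"

// LabelAllowlist maps a label name to its set of permitted values for one
// metric. Hypothetical type, used only for this sketch.
type LabelAllowlist map[string]map[string]struct{}

// Constrain returns a copy of labels in which every value that falls outside
// the allowlist for its label is replaced by "unexpected". Labels without an
// allowlist entry are left unchanged (an assumption of this sketch).
func (a LabelAllowlist) Constrain(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for name, value := range labels {
		if allowed, ok := a[name]; ok {
			if _, permitted := allowed[value]; !permitted {
				value = unexpectedValue
			}
		}
		out[name] = value
	}
	return out
}

func main() {
	allow := LabelAllowlist{"code": {"200": {}, "500": {}}}
	// "418" is not in the allowlist for "code"; "verb" has no allowlist.
	fmt.Println(allow.Constrain(map[string]string{"code": "418", "verb": "GET"}))
}
```

A unit test along the lines of the alpha criterion would assert that a disallowed value maps to "unexpected" while allowed values and unconstrained labels are preserved.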

keps/sig-instrumentation/2305-metrics-cardinality-enforcement/kep.yaml

Lines changed: 8 additions & 2 deletions
@@ -3,6 +3,7 @@ kep-number: 2305
 authors:
 - "@logicalhan"
 - "@lilic"
+- "@yoyinzyc"
 owning-sig: sig-instrumentation
 participating-sigs:
 - sig-instrumentation
@@ -12,8 +13,13 @@ reviewers:
 - "@ehashman"
 - "@x13n"
 approvers:
-- sig-instrumentation
+- "@ehashman"
+prr-approvers:
+- "@johnbelamaric"
 creation-date: 2020-04-15
-last-updated: 2020-05-19
+last-updated: 2021-02-08
 stage: alpha
 status: implementable
+latest-milestone: "v1.21"
+milestone:
+  alpha: "v1.21"
