Commit 96363f8

committed
update the metric cardinality kep.
1 parent 19c03f2 commit 96363f8

3 files changed

+186
-21
lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 2305
2+
alpha:
3+
approver: "@johnbelamaric"

keps/sig-instrumentation/2305-metrics-cardinality-enforcement/README.md

Lines changed: 175 additions & 19 deletions
@@ -3,16 +3,49 @@
33
## Table of Contents
44

55
<!-- toc -->
6+
- [Release Signoff Checklist](#release-signoff-checklist)
67
- [Summary](#summary)
78
- [Motivation](#motivation)
89
- [Goals](#goals)
910
- [Non-Goals](#non-goals)
1011
- [Proposal](#proposal)
11-
- [Open-Question](#open-question)
12-
- [Graduation Criteria](#graduation-criteria)
13-
- [Post-Beta tasks](#post-beta-tasks)
12+
- [Design Details](#design-details)
13+
- [Test Plan](#test-plan)
14+
- [Graduation Criteria](#graduation-criteria)
15+
- [Upgrade / Downgrade strategy](#upgrade--downgrade-strategy)
16+
- [Version Skew Strategy](#version-skew-strategy)
17+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
18+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
19+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
20+
- [Monitoring Requirements](#monitoring-requirements)
21+
- [Dependencies](#dependencies)
22+
- [Scalability](#scalability)
23+
- [Troubleshooting](#troubleshooting)
1424
- [Implementation History](#implementation-history)
1525
<!-- /toc -->
26+
## Release Signoff Checklist
27+
28+
Items marked with (R) are required *prior to targeting to a milestone / release*.
29+
30+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
31+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
32+
- [x] (R) Design details are appropriately documented
33+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
34+
- [x] (R) Graduation criteria is in place
35+
- [x] (R) Production readiness review completed
36+
- [x] (R) Production readiness review approved
37+
- [x] "Implementation History" section is up-to-date for milestone
38+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
39+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
40+
41+
<!--
42+
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
43+
-->
44+
45+
[kubernetes.io]: https://kubernetes.io/
46+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
47+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
48+
[kubernetes/website]: https://git.k8s.io/website
1649

1750
## Summary
1851

@@ -38,11 +71,9 @@ This KEP proposes a possible solution for this, with the primary goal being to e
3871

3972
### Non-Goals
4073

41-
We will expose the machinery and tools to bind a metric's labels to a discrete set of values.
74+
We will expose the machinery and tools to bind a metric's labels to a discrete set of values. The allowlist will be ingested via a newly added component metric flag.
4275

43-
It is *not a goal* to implement and plumb this solution for each Kubernetes component (there are many SIGs and a number of verticals, which may have their own preferred way of doing things). As such it will be up to component owners to leverage this functionality that we provide, by feeding configuration data through whatever mechanism deemed appropriate (i.e. command line flags or reading from a file).
44-
45-
These flags are really only meant to be used as escape hatches, and should not be used to have extremely customized kubernetes setups where our existing dashboards and alerting rule definitions are just not going to apply generally anymore.
76+
It is *not a goal* to define the allowlist for each Kubernetes component's metrics.
4677

4778
## Proposal
4879

@@ -53,9 +84,7 @@ The simple solution to this problem would be for each metric added to keep the u
5384

5485
Instead, the proposed solution is we will be able to *dynamically configure* an allowlist of label values for a metric. By *dynamically configure*, we mean configure an allowlist *at runtime* rather than during the build/compile step. And by *at runtime*, we mean, more specifically, during the boot sequence for a Kubernetes component (and we mean daemons here, not CLI tools like kubectl).
5586

56-
Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large amount of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). It is also possible to read configuration data from files (like yaml format) during the component boot sequence. This KEP does not have an opinion on the specific mechanism used to load config data into a Kubernetes binary during the boot sequence. What we *actually* care about, is just the fact that it is possible.
57-
58-
Our design is thus config-ingestion agnostic.
87+
Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large amount of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). Our design is going to add a flag that applies to all components to ingest the allowlist.
5988

6089
Our design is also based on the premise that metrics can be uniquely identified (i.e. by their metric descriptor). Fortunately for us, this is actually a built in constraint of prometheus clients (which is how we instrument the Kubernetes components). This metric ID is resolvable to a unique string (this is pretty evident when looking at a raw prometheus client endpoint).
6190

@@ -236,19 +265,146 @@ This would then be interpreted by our machinery as this:
236265

237266
```
238267

239-
## Open-Question
240-
241-
242-
## Graduation Criteria
243-
244-
todo
268+
## Design Details
269+
### Test Plan
270+
For `Alpha`, a unit test will verify that the metric label is set to "unexpected" when the metric encounters a label value outside our explicit allowlist of values.
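
A minimal sketch of the behavior such a unit test would exercise, assuming a simple in-memory allowlist. The names `labelAllowlist`, `newLabelAllowlist`, and `constrain` are hypothetical and are not the `k8s.io/component-base` API. Passing through labels that have no configured allowlist is also an assumption made for this illustration.

```go
package main

import "fmt"

// unexpectedValue replaces any label value that is not on the allowlist.
const unexpectedValue = "unexpected"

// labelAllowlist maps a label name to the set of values permitted for it.
// Hypothetical type for illustration only.
type labelAllowlist map[string]map[string]struct{}

// newLabelAllowlist builds the lookup sets from plain slices of allowed values.
func newLabelAllowlist(allowed map[string][]string) labelAllowlist {
	out := make(labelAllowlist, len(allowed))
	for label, values := range allowed {
		set := make(map[string]struct{}, len(values))
		for _, v := range values {
			set[v] = struct{}{}
		}
		out[label] = set
	}
	return out
}

// constrain returns value unchanged when it is allowed for the label.
// Labels with no configured allowlist are passed through (an assumption).
// Any other value is replaced with "unexpected".
func (a labelAllowlist) constrain(label, value string) string {
	values, ok := a[label]
	if !ok {
		return value
	}
	if _, allowed := values[value]; allowed {
		return value
	}
	return unexpectedValue
}

func main() {
	allow := newLabelAllowlist(map[string][]string{
		"code": {"200", "404", "500"},
	})
	fmt.Println(allow.constrain("code", "200"))   // allowed value, kept as-is
	fmt.Println(allow.constrain("code", "418"))   // not allowed, replaced
	fmt.Println(allow.constrain("method", "GET")) // label has no allowlist
}
```

A set lookup keeps the per-observation check cheap, which matters because the check runs every time a metric is recorded.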
271+
### Graduation Criteria
272+
For `Alpha`, the allowlist of metrics can be configured via the exposed flag and the unit test passes.
273+
### Upgrade / Downgrade strategy
274+
N/A
275+
### Version Skew Strategy
276+
N/A
277+
278+
## Production Readiness Review Questionnaire
279+
### Feature Enablement and Rollback
280+
281+
_This section must be completed when targeting alpha to a release._
282+
283+
* **How can this feature be enabled / disabled in a live cluster?**
284+
- [ ] Feature gate (also fill in values in `kep.yaml`)
285+
- Feature gate name:
286+
- Components depending on the feature gate:
287+
- [x] Other
288+
- Describe the mechanism:
289+
A new flag will be used to configure the allowlist of label values for a metric.
290+
This flag will become a standard flag for all k8s components and will be added to
291+
`k8s.io/component-base`.
292+
- Will enabling / disabling the feature require downtime of the control
293+
plane? Yes, the components need to restart with the flag enabled.
294+
- Will enabling / disabling the feature require downtime or reprovisioning
295+
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
296+
Yes, the components need to restart with the flag enabled.
297+
298+
* **Does enabling the feature change any default behavior?**
299+
Any change of default behavior may be surprising to users or break existing
300+
automations, so be extremely careful here.
301+
Using this feature requires restarting the component with the flag enabled. Once enabled, the metric label will be set to "unexpected" if the metric encounters label values outside our explicit allowlist of values.
302+
303+
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
304+
the enablement)?**
305+
Also set `disable-supported` to `true` or `false` in `kep.yaml`.
306+
Describe the consequences on existing workloads (e.g., if this is a runtime
307+
feature, can it break the existing applications?).
308+
Yes, restarting the component without the allowlist flag disables this feature.
309+
310+
* **What happens if we reenable the feature if it was previously rolled back?**
311+
The enable-disable-enable process itself will not cause problems. However, during the rolled-back period metric label values are unbounded again, which may be problematic.
312+
313+
* **Are there any tests for feature enablement/disablement?**
314+
No.
315+
### Rollout, Upgrade and Rollback Planning
316+
317+
_This section must be completed when targeting beta graduation to a release._
318+
319+
* **How can a rollout fail? Can it impact already running workloads?**
320+
Try to be as paranoid as possible - e.g., what if some components will restart
321+
mid-rollout?
322+
Using this feature requires restarting the component with the flag enabled.
323+
* **What specific metrics should inform a rollback?**
324+
None.
325+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
326+
Describe manual testing that was done and the outcomes.
327+
Longer term, we may want to require automated upgrade/rollback tests, but we
328+
are missing a bunch of machinery and tooling and can't do that now.
329+
No.
330+
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
331+
fields of API types, flags, etc.?**
332+
A component metric flag for ingesting the allowlist will be added.
333+
### Monitoring Requirements
334+
335+
_This section must be completed when targeting beta graduation to a release._
336+
337+
* **How can an operator determine if the feature is in use by workloads?**
338+
Out-of-bound data will be recorded with the label value "unexpected" rather than the specific value.
339+
* **What are the SLIs (Service Level Indicators) an operator can use to determine
340+
the health of the service?**
341+
None.
342+
343+
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
344+
None.
345+
346+
* **Are there any missing metrics that would be useful to have to improve observability
347+
of this feature?**
348+
None.
349+
350+
### Dependencies
351+
352+
_This section must be completed when targeting beta graduation to a release._
353+
354+
* **Does this feature depend on any specific services running in the cluster?**
355+
No.
356+
357+
358+
### Scalability
359+
360+
_For alpha, this section is encouraged: reviewers should consider these questions
361+
and attempt to answer them._
362+
363+
_For beta, this section is required: reviewers must answer these questions._
364+
365+
_For GA, this section is required: approvers should be able to confirm the
366+
previous answers based on experience in the field._
367+
368+
* **Will enabling / using this feature result in any new API calls?**
369+
No.
370+
371+
* **Will enabling / using this feature result in introducing new API types?**
372+
Describe them, providing:
373+
No.
374+
375+
* **Will enabling / using this feature result in any new calls to the cloud
376+
provider?**
377+
No.
378+
* **Will enabling / using this feature result in increasing size or count of
379+
the existing API objects?**
380+
No.
381+
382+
* **Will enabling / using this feature result in increasing time taken by any
383+
operations covered by [existing SLIs/SLOs]?**
384+
Checking metric labels against the allowlist may increase metric recording time.
385+
386+
* **Will enabling / using this feature result in non-negligible increase of
387+
resource usage (CPU, RAM, disk, IO, ...) in any components?**
388+
No.
389+
390+
### Troubleshooting
391+
392+
The Troubleshooting section currently serves the `Playbook` role. We may consider
393+
splitting it into a dedicated `Playbook` document (potentially with some monitoring
394+
details). For now, we leave it here.
245395

396+
_This section must be completed when targeting beta graduation to a release._
246397

247-
## Post-Beta tasks
398+
* **How does this feature react if the API server and/or etcd is unavailable?**
399+
No additional impact compared to what already exists.
400+
* **What are other known failure modes?**
401+
None.
248402

249-
todo
403+
* **What steps should be taken if SLOs are not being met to determine the problem?**
404+
None.
250405

251406
## Implementation History
252407

253-
todo
408+
2020-04-15: KEP opened
254409

410+
2020-05-19: KEP marked implementable

keps/sig-instrumentation/2305-metrics-cardinality-enforcement/kep.yaml

Lines changed: 8 additions & 2 deletions
@@ -3,6 +3,7 @@ kep-number: 2305
33
authors:
44
- "@logicalhan"
55
- "@lilic"
6+
- "@yoyinzyc"
67
owning-sig: sig-instrumentation
78
participating-sigs:
89
- sig-instrumentation
@@ -12,8 +13,13 @@ reviewers:
1213
- "@ehashman"
1314
- "@x13n"
1415
approvers:
15-
- sig-instrumentation
16+
- "@ehashman"
17+
prr-approvers:
18+
- "@johnbelamaric"
1619
creation-date: 2020-04-15
17-
last-updated: 2020-05-19
20+
last-updated: 2021-02-08
1821
stage: alpha
1922
status: implementable
23+
latest-milestone: "v1.21"
24+
milestone:
25+
alpha: "v1.21"
