@@ -38,11 +71,9 @@ This KEP proposes a possible solution for this, with the primary goal being to e

 ### Non-Goals

-We will expose the machinery and tools to bind a metric's labels to a discrete set of values.
+We will expose the machinery and tools to bind a metric's labels to a discrete set of values. The allowlist will be ingested via a newly added component metrics flag.

-It is *not a goal* to implement and plumb this solution for each Kubernetes component (there are many SIGs and a number of verticals, which may have their own preferred way of doing things). As such it will be up to component owners to leverage this functionality that we provide, by feeding configuration data through whatever mechanism deemed appropriate (i.e. command line flags or reading from a file).
-
-These flags are really only meant to be used as escape hatches, and should not be used to have extremely customized kubernetes setups where our existing dashboards and alerting rule definitions are just not going to apply generally anymore.
+It is *not a goal* to define the allowlist for each Kubernetes component's metrics.

 ## Proposal

@@ -53,9 +84,7 @@ The simple solution to this problem would be for each metric added to keep the u

 Instead, the proposed solution is we will be able to *dynamically configure* an allowlist of label values for a metric. By *dynamically configure*, we mean configure an allowlist *at runtime* rather than during the build/compile step. And by *at runtime*, we mean, more specifically, during the boot sequence for a Kubernetes component (and we mean daemons here, not CLI tools like kubectl).

-Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large amount of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). It is also possible to read configuration data from files (like yaml format) during the component boot sequence. This KEP does not have an opinion on the specific mechanism used to load config data into a Kubernetes binary during the boot sequence. What we *actually* care about, is just the fact that it is possible.
-
-Our design is thus config-ingestion agnostic.
+Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large amount of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). Our design is going to add a flag that applies to all components to ingest the allowlist.

 Our design is also based on the premise that metrics can be uniquely identified (i.e. by their metric descriptor). Fortunately for us, this is actually a built in constraint of prometheus clients (which is how we instrument the Kubernetes components). This metric ID is resolvable to a unique string (this is pretty evident when looking at a raw prometheus client endpoint).

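The boot-time ingestion described in this hunk could look roughly like the sketch below. It is illustrative only: the entry format `<metric>,<label>=<v1>,<v2>,...`, the type `allowlistKey`, and the function `parseAllowlistEntry` are hypothetical names chosen for this example, not the flag syntax or API defined by the KEP. The one idea taken from the text is that a metric is keyed by its unique string name.

```go
package main

import (
	"fmt"
	"strings"
)

// allowlistKey identifies one label on one metric. The metric is keyed by its
// unique string name, which the proposal relies on. (Hypothetical type.)
type allowlistKey struct {
	Metric string
	Label  string
}

// parseAllowlistEntry parses one entry of the assumed form
// "<metric>,<label>=<v1>,<v2>,..." into a key and a set of allowed values.
// The format is an assumption made for this sketch.
func parseAllowlistEntry(entry string) (allowlistKey, map[string]struct{}, error) {
	lhs, rhs, ok := strings.Cut(entry, "=")
	if !ok {
		return allowlistKey{}, nil, fmt.Errorf("entry %q: missing '='", entry)
	}
	metric, label, ok := strings.Cut(lhs, ",")
	metric, label = strings.TrimSpace(metric), strings.TrimSpace(label)
	if !ok || metric == "" || label == "" {
		return allowlistKey{}, nil, fmt.Errorf("entry %q: want <metric>,<label> before '='", entry)
	}
	// Collect the permitted values into a set, skipping empty items.
	allowed := make(map[string]struct{})
	for _, v := range strings.Split(rhs, ",") {
		if v = strings.TrimSpace(v); v != "" {
			allowed[v] = struct{}{}
		}
	}
	if len(allowed) == 0 {
		return allowlistKey{}, nil, fmt.Errorf("entry %q: no allowed values", entry)
	}
	return allowlistKey{Metric: metric, Label: label}, allowed, nil
}
```

A component would call such a parser once during start-up, after flag parsing and before any metric is recorded, and fail fast on a malformed entry.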
@@ -236,19 +265,138 @@ This would then be interpreted by our machinery as this:
```

-## Open-Question
-
-
-## Graduation Criteria
+## Design Details
+### Test Plan
+For `Alpha`, unit tests verify that a metric label is set to "unexpected" when the metric encounters a label value outside the explicit allowlist.
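The behavior this test plan targets can be sketched as follows. This is a minimal illustration under stated assumptions: `labelAllowlist` and `constrain` are hypothetical names, and passing through labels that have no allowlist entry is an assumption of this sketch rather than something the KEP text above specifies. Only the `"unexpected"` placeholder comes from the proposal.

```go
package main

// unexpectedValue is the placeholder the proposal assigns to label values
// that fall outside the allowlist.
const unexpectedValue = "unexpected"

// labelAllowlist maps a label name to its set of permitted values.
// (Hypothetical type for this sketch.)
type labelAllowlist map[string]map[string]struct{}

// constrain returns a copy of labels in which every value that is not in its
// label's allowlist is replaced by "unexpected". Labels with no allowlist
// entry are passed through unchanged (an assumption of this sketch).
func (a labelAllowlist) constrain(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for name, value := range labels {
		if allowed, ok := a[name]; ok {
			if _, permitted := allowed[value]; !permitted {
				value = unexpectedValue
			}
		}
		out[name] = value
	}
	return out
}
```

A unit test would then assert that an allowlisted value survives unchanged, that any other value for the same label is recorded as `"unexpected"`, and that the input map is not mutated.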
+### Graduation Criteria
+For `Alpha`, the allowlist of metrics can be configured via the exposed flag and the unit tests pass.
+For `Beta`, the allowlist can be configured from an input file (e.g. a YAML file).
+### Upgrade / Downgrade strategy
+N/A
+### Version Skew Strategy
+N/A
+
+## Production Readiness Review Questionnaire
+### Feature Enablement and Rollback
+
+_This section must be completed when targeting alpha to a release._
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [x] Feature gate (also fill in values in `kep.yaml`)
+    - Feature gate name: MetricCardinalityEnforcement
+    - Components depending on the feature gate: All components that emit metrics
+
+* **Does enabling the feature change any default behavior?**
+  Any change of default behavior may be surprising to users or break existing
+  automations, so be extremely careful here.
+  Using this feature requires restarting the component with the allowlist flag set. Once enabled, a metric label is set to "unexpected" whenever the metric encounters a label value outside the explicit allowlist.
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+  the enablement)?**
+  Also set `disable-supported` to `true` or `false` in `kep.yaml`.
+  Describe the consequences on existing workloads (e.g., if this is a runtime
+  feature, can it break the existing applications?).
+  Yes, disabling the feature gate reverts to the existing behavior.
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+  The enable-disable-enable cycle itself will not cause problems. However, metric label values are unbounded again while the feature is rolled back, which may be problematic during that period.
+
+* **Are there any tests for feature enablement/disablement?**
+  Unit tests cover the four combinations: feature gate enabled or disabled, with or without an allowlist.
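The four combinations mentioned above can be laid out as a table-driven sketch. Everything named here (`recordedValue`, `enablementCase`, `enablementCases`) is hypothetical, and the expectation that an enabled gate with no allowlist leaves values untouched is an assumption of this sketch, not a statement from the KEP.

```go
package main

// recordedValue returns the label value that would be recorded for a given
// combination of feature gate and allowlist. A nil allowlist means none was
// configured. (Hypothetical helper for this sketch.)
func recordedValue(gateEnabled bool, allowlist map[string]struct{}, value string) string {
	if !gateEnabled || allowlist == nil {
		return value // existing behavior: the value is recorded as-is
	}
	if _, ok := allowlist[value]; ok {
		return value
	}
	return "unexpected"
}

// enablementCase is one row of the gate x allowlist matrix.
type enablementCase struct {
	name        string
	gateEnabled bool
	allowlist   map[string]struct{}
	value       string
	want        string
}

// enablementCases enumerates the four combinations using a value ("418")
// that is deliberately absent from the allowlist.
func enablementCases() []enablementCase {
	allow := map[string]struct{}{"200": {}, "404": {}}
	return []enablementCase{
		{"gate off, no allowlist", false, nil, "418", "418"},
		{"gate off, allowlist set", false, allow, "418", "418"},
		{"gate on, no allowlist", true, nil, "418", "418"},
		{"gate on, allowlist set", true, allow, "418", "unexpected"},
	}
}
```

Only the last row changes what is recorded, which is also why disabling the gate restores the existing behavior.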
+
+### Rollout, Upgrade and Rollback Planning
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can a rollout fail? Can it impact already running workloads?**
+  Try to be as paranoid as possible - e.g., what if some components will restart
+  mid-rollout?
+  Using this feature requires restarting the component with the allowlist flag set.
+* **What specific metrics should inform a rollback?**
+  None.
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+  Describe manual testing that was done and the outcomes.
+  Longer term, we may want to require automated upgrade/rollback tests, but we
+  are missing a bunch of machinery and tooling and can't do that now.
+  In alpha, we can manually test enabling and disabling both the allowlist flag and the feature gate.
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+  fields of API types, flags, etc.?**
+  A component metrics flag for ingesting the allowlist will be added.
+### Monitoring Requirements
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can an operator determine if the feature is in use by workloads?**
+  Out-of-bound values are recorded with the label value "unexpected" rather than the specific value.
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+  the health of the service?**
+  None.
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  None.
+
+* **Are there any missing metrics that would be useful to have to improve observability
+  of this feature?**
+  None.
+
+### Dependencies
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **Does this feature depend on any specific services running in the cluster?**
+  No.
+
+
+### Scalability
+
+_For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them._
+
+_For beta, this section is required: reviewers must answer these questions._
+
+_For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field._
+
+* **Will enabling / using this feature result in any new API calls?**
+  No.
+
+* **Will enabling / using this feature result in introducing new API types?**
+  Describe them, providing:
+  No.
+
+* **Will enabling / using this feature result in any new calls to the cloud
+  provider?**
+  No.
+* **Will enabling / using this feature result in increasing size or count of
+  the existing API objects?**
+  No.
+
+* **Will enabling / using this feature result in increasing time taken by any
+  operations covered by [existing SLIs/SLOs]?**
+  Checking metric labels against the allowlist may increase metric recording time.

-todo
+* **Will enabling / using this feature result in non-negligible increase of
+  resource usage (CPU, RAM, disk, IO, ...) in any components?**
+  No.

+### Troubleshooting

-## Post-Beta tasks
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.

-todo
+_This section must be completed when targeting beta graduation to a release._
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+  No additional impact compared to what already exists.
+* **What are other known failure modes?**
+  None.
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**