@@ -38,11 +71,9 @@ This KEP proposes a possible solution for this, with the primary goal being to e
38
71
39
72
### Non-Goals
40
73
41
-
We will expose the machinery and tools to bind a metric's labels to a discrete set of values.
74
+
We will expose the machinery and tools to bind a metric's labels to a discrete set of values. The allowlist will be ingested via a newly added component metric flag.
42
75
43
-
It is *not a goal* to implement and plumb this solution for each Kubernetes component (there are many SIGs and a number of verticals, which may have their own preferred way of doing things). As such it will be up to component owners to leverage this functionality that we provide, by feeding configuration data through whatever mechanism deemed appropriate (i.e. command line flags or reading from a file).
44
-
45
-
These flags are really only meant to be used as escape hatches, and should not be used to have extremely customized kubernetes setups where our existing dashboards and alerting rule definitions are just not going to apply generally anymore.
76
+
It is *not a goal* to define the allowlist for each Kubernetes component's metrics.
46
77
47
78
## Proposal
48
79
@@ -53,9 +84,7 @@ The simple solution to this problem would be for each metric added to keep the u
53
84
54
85
Instead, the proposed solution is to *dynamically configure* an allowlist of label values for a metric. By *dynamically configure*, we mean configuring an allowlist *at runtime* rather than during the build/compile step. And by *at runtime*, we mean, more specifically, during the boot sequence for a Kubernetes component (and we mean daemons here, not CLI tools like kubectl).
55
86
56
-
Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large amount of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). It is also possible to read configuration data from files (like yaml format) during the component boot sequence. This KEP does not have an opinion on the specific mechanism used to load config data into a Kubernetes binary during the boot sequence. What we *actually* care about, is just the fact that it is possible.
57
-
58
-
Our design is thus config-ingestion agnostic.
87
+
Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large number of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). Our design adds a flag, applicable to all components, to ingest the allowlist.
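As a rough illustration of ingesting an allowlist through a start-up flag, the sketch below uses Go's standard `flag` package. The flag name `--metric-label-allowlist` and the entry format `<metric>,<label>=<v1>|<v2>` are assumptions made for this example only; the KEP text above does not define either.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// allowlist maps metric name -> label name -> set of permitted label values.
type allowlist map[string]map[string]map[string]struct{}

// String implements flag.Value.
func (a allowlist) String() string { return fmt.Sprintf("%d metric(s)", len(a)) }

// Set implements flag.Value. It parses one entry of the hypothetical form
// "<metric>,<label>=<value1>|<value2>" and merges it into the allowlist.
func (a allowlist) Set(entry string) error {
	key, values, ok := strings.Cut(entry, "=")
	if !ok {
		return fmt.Errorf("invalid entry %q: missing '='", entry)
	}
	metric, label, ok := strings.Cut(key, ",")
	if !ok || metric == "" || label == "" {
		return fmt.Errorf("invalid entry %q: want <metric>,<label>=<values>", entry)
	}
	if a[metric] == nil {
		a[metric] = map[string]map[string]struct{}{}
	}
	if a[metric][label] == nil {
		a[metric][label] = map[string]struct{}{}
	}
	for _, v := range strings.Split(values, "|") {
		a[metric][label][v] = struct{}{}
	}
	return nil
}

// parseAllowlist reads the hypothetical --metric-label-allowlist flag,
// which may be repeated, from the given command-line arguments.
func parseAllowlist(args []string) (allowlist, error) {
	al := allowlist{}
	fs := flag.NewFlagSet("component", flag.ContinueOnError)
	fs.Var(al, "metric-label-allowlist", "allowed label values, as <metric>,<label>=<v1>|<v2>")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return al, nil
}

func main() {
	// Example metric and label names are placeholders.
	al, err := parseAllowlist([]string{"--metric-label-allowlist=http_requests_total,code=200|404"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(al["http_requests_total"]["code"]))
}
```

Because the allowlist is read once while flags are parsed, changing it means restarting the component, which matches the boot-sequence configuration described above.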
59
88
60
89
Our design is also based on the premise that metrics can be uniquely identified (i.e. by their metric descriptor). Fortunately for us, this is actually a built-in constraint of Prometheus clients (which is how we instrument the Kubernetes components). This metric ID is resolvable to a unique string (this is pretty evident when looking at a raw Prometheus client endpoint).
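To make the "unique string" idea concrete, here is a minimal sketch of how a metric's identity can be flattened into a single name by joining its non-empty parts with underscores, which is the convention Prometheus clients follow for fully qualified names. The helper `fqName` is a stand-alone illustration written for this example, not a call into any client library, and the metric name used is only a sample.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName joins the non-empty parts of a metric's identity with underscores.
// An empty name yields an empty result, since a metric must have a name.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// The resulting string could serve as the key under which a
	// per-metric allowlist is stored.
	fmt.Println(fqName("apiserver", "request", "total"))
}
```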
61
90
@@ -236,19 +265,146 @@ This would then be interpreted by our machinery as this:
236
265
237
266
```
238
267
239
-
## Open-Question
240
-
241
-
242
-
## Graduation Criteria
243
-
244
-
todo
268
+
## Design Details
269
+
### Test Plan
270
+
For `Alpha`, a unit test will verify that a metric label is set to "unexpected" when the metric encounters a label value outside the explicit allowlist.
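The behavior such a test would exercise can be sketched as follows. This is a simplified stand-in, not the actual `k8s.io/component-base` machinery: the function `constrain` and its signature are hypothetical, and leaving labels without an allowlist entry untouched is an assumption of this sketch.

```go
package main

import "fmt"

// unexpected is the placeholder value described for out-of-allowlist label values.
const unexpected = "unexpected"

// constrain returns a copy of labels in which every value that is not in the
// allowlist for its label name is replaced by "unexpected". Labels with no
// allowlist entry are left unchanged (an assumption of this sketch).
func constrain(allowed map[string]map[string]struct{}, labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for name, value := range labels {
		if set, ok := allowed[name]; ok {
			if _, permitted := set[value]; !permitted {
				value = unexpected
			}
		}
		out[name] = value
	}
	return out
}

func main() {
	allowed := map[string]map[string]struct{}{
		"code": {"200": {}, "404": {}},
	}
	got := constrain(allowed, map[string]string{"code": "418", "verb": "GET"})
	fmt.Println(got["code"], got["verb"])
}
```

A unit test along these lines would assert that an allowed value passes through unchanged and that any other value for a constrained label is recorded as "unexpected".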
271
+
### Graduation Criteria
272
+
For `Alpha`, the allowlist of metrics can be configured via the exposed flag and the unit test passes.
273
+
### Upgrade / Downgrade strategy
274
+
N/A
275
+
### Version Skew Strategy
276
+
N/A
277
+
278
+
## Production Readiness Review Questionnaire
279
+
### Feature Enablement and Rollback
280
+
281
+
_This section must be completed when targeting alpha to a release._
282
+
283
+
* **How can this feature be enabled / disabled in a live cluster?**
284
+
- [ ] Feature gate (also fill in values in `kep.yaml`)
285
+
- Feature gate name:
286
+
- Components depending on the feature gate:
287
+
- [x] Other
288
+
- Describe the mechanism:
289
+
A new flag will be used to configure the allowlist of label values for a metric.
290
+
This flag will become a standard flag for all k8s components and will be added to
291
+
`k8s.io/component-base`.
292
+
- Will enabling / disabling the feature require downtime of the control
293
+
plane? Yes, the components need to be restarted with the flag set.
294
+
- Will enabling / disabling the feature require downtime or reprovisioning
295
+
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
296
+
Yes, the components need to be restarted with the flag set.
297
+
298
+
* **Does enabling the feature change any default behavior?**
299
+
Any change of default behavior may be surprising to users or break existing
300
+
automations, so be extremely careful here.
301
+
Using this feature requires restarting the component with the flag set. Once it is enabled, a metric label is set to "unexpected" whenever the metric encounters a label value outside the explicit allowlist.
302
+
303
+
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
304
+
the enablement)?**
305
+
Also set `disable-supported` to `true` or `false` in `kep.yaml`.
306
+
Describe the consequences on existing workloads (e.g., if this is a runtime
307
+
feature, can it break the existing applications?).
308
+
Yes, restarting the component without the allowlist flag disables this feature.
309
+
310
+
* **What happens if we reenable the feature if it was previously rolled back?**
311
+
The enable-disable-enable sequence will not cause problems by itself. During the rolled-back period, however, metric label values are unbounded again, which may be problematic.
312
+
313
+
* **Are there any tests for feature enablement/disablement?**
314
+
No.
315
+
### Rollout, Upgrade and Rollback Planning
316
+
317
+
_This section must be completed when targeting beta graduation to a release._
318
+
319
+
* **How can a rollout fail? Can it impact already running workloads?**
320
+
Try to be as paranoid as possible - e.g., what if some components will restart
321
+
mid-rollout?
322
+
Using this feature requires restarting the component with the flag enabled.
323
+
* **What specific metrics should inform a rollback?**
324
+
None.
325
+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
326
+
Describe manual testing that was done and the outcomes.
327
+
Longer term, we may want to require automated upgrade/rollback tests, but we
328
+
are missing a bunch of machinery and tooling and can't do that now.
329
+
No.
330
+
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
331
+
fields of API types, flags, etc.?**
332
+
A component metric flag for ingesting the allowlist will be added.
333
+
### Monitoring Requirements
334
+
335
+
_This section must be completed when targeting beta graduation to a release._
336
+
337
+
* **How can an operator determine if the feature is in use by workloads?**
338
+
Out-of-bound label values will be recorded as "unexpected" rather than as the specific value.
339
+
* **What are the SLIs (Service Level Indicators) an operator can use to determine
340
+
the health of the service?**
341
+
None.
342
+
343
+
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
344
+
None.
345
+
346
+
* **Are there any missing metrics that would be useful to have to improve observability
347
+
of this feature?**
348
+
None.
349
+
350
+
### Dependencies
351
+
352
+
_This section must be completed when targeting beta graduation to a release._
353
+
354
+
* **Does this feature depend on any specific services running in the cluster?**
355
+
No.
356
+
357
+
358
+
### Scalability
359
+
360
+
_For alpha, this section is encouraged: reviewers should consider these questions
361
+
and attempt to answer them._
362
+
363
+
_For beta, this section is required: reviewers must answer these questions._
364
+
365
+
_For GA, this section is required: approvers should be able to confirm the
366
+
previous answers based on experience in the field._
367
+
368
+
* **Will enabling / using this feature result in any new API calls?**
369
+
No.
370
+
371
+
* **Will enabling / using this feature result in introducing new API types?**
372
+
Describe them, providing:
373
+
No.
374
+
375
+
* **Will enabling / using this feature result in any new calls to the cloud
376
+
provider?**
377
+
No.
378
+
* **Will enabling / using this feature result in increasing size or count of
379
+
the existing API objects?**
380
+
No.
381
+
382
+
* **Will enabling / using this feature result in increasing time taken by any
383
+
operations covered by [existing SLIs/SLOs]?**
384
+
Checking metric labels against the allowlist may increase the metric recording time.
385
+
386
+
* **Will enabling / using this feature result in non-negligible increase of
387
+
resource usage (CPU, RAM, disk, IO, ...) in any components?**
388
+
No.
389
+
390
+
### Troubleshooting
391
+
392
+
The Troubleshooting section currently serves the `Playbook` role. We may consider
393
+
splitting it into a dedicated `Playbook` document (potentially with some monitoring
394
+
details). For now, we leave it here.
245
395
396
+
_This section must be completed when targeting beta graduation to a release._
246
397
247
-
## Post-Beta tasks
398
+
* **How does this feature react if the API server and/or etcd is unavailable?**
399
+
No additional impact compared to what already exists.
400
+
* **What are other known failure modes?**
401
+
None.
248
402
249
-
todo
403
+
* **What steps should be taken if SLOs are not being met to determine the problem?**