@@ -125,9 +125,9 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [x] (R) Graduation criteria is in place
- - [ ] (R) Production readiness review completed
+ - [x] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- - [ ] "Implementation History" section is up-to-date for milestone
+ - [x] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Production Readiness Review Questionnaire

- **TODO**: This entire section seems completely irrelevant for KEPs that do not
- target changes to release artifacts. Delete this section?
-
<!--

Production readiness reviews are intended to ensure that features merging into
@@ -411,102 +408,76 @@ you need any help or guidance.
_This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
- - [ ] Feature gate (also fill in values in `kep.yaml`)
- - Feature gate name:
- - Components depending on the feature gate:
- - [ ] Other
- - Describe the mechanism:
- - Will enabling / disabling the feature require downtime of the control
- plane?
- - Will enabling / disabling the feature require downtime or reprovisioning
- of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+ N/A

* **Does enabling the feature change any default behavior?**
- Any change of default behavior may be surprising to users or break existing
- automations, so be extremely careful here.
+
+ N/A

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?**
- Also set `disable-supported` to `true` or `false` in `kep.yaml`.
- Describe the consequences on existing workloads (e.g., if this is a runtime
- feature, can it break the existing applications?).
+
+ N/A

* **What happens if we reenable the feature if it was previously rolled back?**

+ N/A
+
* **Are there any tests for feature enablement/disablement?**
- The e2e framework does not currently support enabling or disabling feature
- gates. However, unit tests in each component dealing with managing data, created
- with and without the feature, are necessary. At the very least, think about
- conversion tests if API types are being modified.
+
+ N/A

### Rollout, Upgrade and Rollback Planning

_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
- Try to be as paranoid as possible - e.g., what if some components will restart
- mid-rollout?
+
+ N/A

* **What specific metrics should inform a rollback?**

+ N/A
+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
- Describe manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling and can't do that now.
+
+ N/A

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
- Even if applying deprecation policies, they may still surprise some users.
+
+ N/A

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
- checking if there are objects with field X set) may be a last resort. Avoid
- logs or events for this purpose.
+
+ N/A

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- - [ ] Metrics
- - Metric name:
- - [Optional] Aggregation method:
- - Components exposing the metric:
- - [ ] Other (treat as last resort)
- - Details:
+
+ N/A

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- At a high level, this usually will be in the form of "high percentile of SLI
- per day <= X". It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99% percentile over day of absolute value from (job creation time minus expected
- job creation time) for cron job <= 10%
- - 99,9% of /health requests per day finish with 200 code
+
+ N/A

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
+
+ N/A

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**
- Think about both cluster-level services (e.g. metrics-server) as well
- as node-level agents (e.g. specific version of CRI). Focus on external or
- optional services that are needed. For example, if this feature depends on
- a cloud provider API, or upon an external software-defined storage or network
- control plane.

- For each of these, fill in the following—thinking about running existing user workloads
- and creating new ones, as well as about cluster-level services (e.g. DNS):
- - [Dependency name]
- - Usage description:
- - Impact of its outage on the feature:
- - Impact of its degraded performance or high-error rates on the feature:
+ N/A

### Scalability
@@ -520,45 +491,32 @@ _For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field._

* **Will enabling / using this feature result in any new API calls?**
- Describe them, providing:
- - API call type (e.g. PATCH pods)
- - estimated throughput
- - originating component(s) (e.g. Kubelet, Feature-X-controller)
- focusing mostly on:
- - components listing and/or watching resources they didn't before
- - API calls that may be triggered by changes of some Kubernetes resources
- (e.g. update of object X triggers new updates of object Y)
- - periodic API calls to reconcile state (e.g. periodic fetching state,
- heartbeats, leader election, etc.)
+
+ N/A

* **Will enabling / using this feature result in introducing new API types?**
- Describe them, providing:
- - API type
- - Supported number of objects per cluster
- - Supported number of objects per namespace (for namespace-scoped objects)
+
+ N/A

* **Will enabling / using this feature result in any new calls to the cloud
provider?**

+ N/A
+
* **Will enabling / using this feature result in increasing size or count of
the existing API objects?**
- Describe them, providing:
- - API type(s):
- - Estimated increase in size: (e.g., new annotation of size 32B)
- - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+
+ N/A

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?**
- Think about adding additional work or introducing new steps in between
- (e.g. need to do X to start a container), etc. Please describe the details.
+
+ N/A

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
- Things to keep in mind include: additional in-memory state, additional
- non-trivial computations, excessive access to disks (including increased log
- volume), significant amount of data sent and/or received over network, etc.
- This through this both in small and large cases, again with respect to the
- [supported limits].
+
+ N/A

### Troubleshooting

@@ -570,22 +528,15 @@ _This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

+ N/A
+
* **What are other known failure modes?**
- For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
- how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
- running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
- levels that could help debug the issue?
- Not required until feature graduated to beta.
- - Testing: Are there any tests for failure mode? If not, describe why.
+
+ N/A

* **What steps should be taken if SLOs are not being met to determine the problem?**

- [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
- [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+ N/A

## Implementation History
@@ -600,6 +551,9 @@ Major milestones might include:
- when the KEP was retired or superseded
-->

+ - 2020-02-04 - Initial KEP draft / provisional [#2421](https://github.com/kubernetes/enhancements/pull/2421)
+ - 2020-02-08 - KEP implementable [#2469](https://github.com/kubernetes/enhancements/pull/2469)
+
## Drawbacks

<!--