Commit 722007d

Merge pull request kubernetes#2469 from BenTheElder/implementable-build
Update Reducing Kubernetes Build Maintenance to implementable
2 parents 3533411 + 8487b06 commit 722007d

3 files changed: +57 -99 lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+kep-number: 2420
+beta:
+  approver: "@johnbelamaric"

keps/sig-testing/2420-reducing-kubernetes-build-maintenance/README.md

Lines changed: 49 additions & 95 deletions
@@ -125,9 +125,9 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Design details are appropriately documented
 - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
 - [x] (R) Graduation criteria is in place
-- [ ] (R) Production readiness review completed
+- [x] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
+- [x] "Implementation History" section is up-to-date for milestone
 - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

@@ -380,9 +380,6 @@ n/a.

 ## Production Readiness Review Questionnaire

-**TODO**: This entire section seems completely irrelevant for KEPs that do not
-target changes to release artifacts. Delete this section?
-
 <!--

 Production readiness reviews are intended to ensure that features merging into
@@ -411,102 +408,76 @@ you need any help or guidance.
 _This section must be completed when targeting alpha to a release._

 * **How can this feature be enabled / disabled in a live cluster?**
-- [ ] Feature gate (also fill in values in `kep.yaml`)
-- Feature gate name:
-- Components depending on the feature gate:
-- [ ] Other
-- Describe the mechanism:
-- Will enabling / disabling the feature require downtime of the control
-plane?
-- Will enabling / disabling the feature require downtime or reprovisioning
-of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+N/A

 * **Does enabling the feature change any default behavior?**
-Any change of default behavior may be surprising to users or break existing
-automations, so be extremely careful here.
+
+N/A

 * **Can the feature be disabled once it has been enabled (i.e. can we roll back
 the enablement)?**
-Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-Describe the consequences on existing workloads (e.g., if this is a runtime
-feature, can it break the existing applications?).
+
+N/A

 * **What happens if we reenable the feature if it was previously rolled back?**

+N/A
+
 * **Are there any tests for feature enablement/disablement?**
-The e2e framework does not currently support enabling or disabling feature
-gates. However, unit tests in each component dealing with managing data, created
-with and without the feature, are necessary. At the very least, think about
-conversion tests if API types are being modified.
+
+N/A

 ### Rollout, Upgrade and Rollback Planning

 _This section must be completed when targeting beta graduation to a release._

 * **How can a rollout fail? Can it impact already running workloads?**
-Try to be as paranoid as possible - e.g., what if some components will restart
-mid-rollout?
+
+N/A

 * **What specific metrics should inform a rollback?**

+N/A
+
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and can't do that now.
+
+N/A

 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
-Even if applying deprecation policies, they may still surprise some users.
+
+N/A

 ### Monitoring Requirements

 _This section must be completed when targeting beta graduation to a release._

 * **How can an operator determine if the feature is in use by workloads?**
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
+
+N/A

 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
-- [ ] Metrics
-- Metric name:
-- [Optional] Aggregation method:
-- Components exposing the metric:
-- [ ] Other (treat as last resort)
-- Details:
+
+N/A

 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-At a high level, this usually will be in the form of "high percentile of SLI
-per day <= X". It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-- per-day percentage of API calls finishing with 5XX errors <= 1%
-- 99% percentile over day of absolute value from (job creation time minus expected
-job creation time) for cron job <= 10%
-- 99,9% of /health requests per day finish with 200 code
+
+N/A

 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
+
+N/A

 ### Dependencies

 _This section must be completed when targeting beta graduation to a release._

 * **Does this feature depend on any specific services running in the cluster?**
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.

-For each of these, fill in the following—thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-- [Dependency name]
-- Usage description:
-- Impact of its outage on the feature:
-- Impact of its degraded performance or high-error rates on the feature:
+N/A


 ### Scalability
@@ -520,45 +491,32 @@ _For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field._

 * **Will enabling / using this feature result in any new API calls?**
-Describe them, providing:
-- API call type (e.g. PATCH pods)
-- estimated throughput
-- originating component(s) (e.g. Kubelet, Feature-X-controller)
-focusing mostly on:
-- components listing and/or watching resources they didn't before
-- API calls that may be triggered by changes of some Kubernetes resources
-(e.g. update of object X triggers new updates of object Y)
-- periodic API calls to reconcile state (e.g. periodic fetching state,
-heartbeats, leader election, etc.)
+
+N/A

 * **Will enabling / using this feature result in introducing new API types?**
-Describe them, providing:
-- API type
-- Supported number of objects per cluster
-- Supported number of objects per namespace (for namespace-scoped objects)
+
+N/A

 * **Will enabling / using this feature result in any new calls to the cloud
 provider?**

+N/A
+
 * **Will enabling / using this feature result in increasing size or count of
 the existing API objects?**
-Describe them, providing:
-- API type(s):
-- Estimated increase in size: (e.g., new annotation of size 32B)
-- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+
+N/A

 * **Will enabling / using this feature result in increasing time taken by any
 operations covered by [existing SLIs/SLOs]?**
-Think about adding additional work or introducing new steps in between
-(e.g. need to do X to start a container), etc. Please describe the details.
+
+N/A

 * **Will enabling / using this feature result in non-negligible increase of
 resource usage (CPU, RAM, disk, IO, ...) in any components?**
-Things to keep in mind include: additional in-memory state, additional
-non-trivial computations, excessive access to disks (including increased log
-volume), significant amount of data sent and/or received over network, etc.
-This through this both in small and large cases, again with respect to the
-[supported limits].
+
+N/A

 ### Troubleshooting

@@ -570,22 +528,15 @@ _This section must be completed when targeting beta graduation to a release._

 * **How does this feature react if the API server and/or etcd is unavailable?**

+N/A
+
 * **What are other known failure modes?**
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-- Detection: How can it be detected via metrics? Stated another way:
-how can an operator troubleshoot without logging into a master or worker node?
-- Mitigations: What can be done to stop the bleeding, especially for already
-running user workloads?
-- Diagnostics: What are the useful log messages and their required logging
-levels that could help debug the issue?
-Not required until feature graduated to beta.
-- Testing: Are there any tests for failure mode? If not, describe why.
+
+N/A

 * **What steps should be taken if SLOs are not being met to determine the problem?**

-[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+N/A

 ## Implementation History

@@ -600,6 +551,9 @@ Major milestones might include:
 - when the KEP was retired or superseded
 -->

+- 2020-02-04 - Initial KEP draft / provisional [#2421](https://github.com/kubernetes/enhancements/pull/2421)
+- 2020-02-08 - KEP implementable [#2469](https://github.com/kubernetes/enhancements/pull/2469)
+
 ## Drawbacks

 <!--

keps/sig-testing/2420-reducing-kubernetes-build-maintenance/kep.yml renamed to keps/sig-testing/2420-reducing-kubernetes-build-maintenance/kep.yaml

Lines changed: 5 additions & 4 deletions
@@ -6,21 +6,22 @@ authors:
 owning-sig: sig-testing
 participating-sigs:
 - sig-release
-status: provisional
+status: implementable
 creation-date: 2021-02-03
 reviewers:
 - dims
 - liggitt
 approvers:
 - spiffxp
 - justaugustus
+# NOTE: there's no production change in this KEP
 prr-approvers:
-- TBD
+- johnbelamaric
 see-also: []
 replaces: []

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
@@ -30,10 +31,10 @@ latest-milestone: "v1.21"
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.21"
-# TODO: figure out if these are the right milestones for beta/stable.
   beta: "v1.21"
   stable: "v1.23"

+# these are N/A
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
 feature-gates: []
