Commit e1231c1

Add toc for volume expansion KEP
1 parent b5afb53 commit e1231c1

1 file changed: +356 -0 lines changed
keps/sig-storage/284-enable-volume-expansion/README.md

Lines changed: 356 additions & 0 deletions
@@ -21,6 +21,17 @@
 - [PVC API Change](#pvc-api-change)
 - [StorageClass API change](#storageclass-api-change)
 - [Other API changes](#other-api-changes)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+- [Monitoring Requirements](#monitoring-requirements)
+- [Dependencies](#dependencies)
+- [Scalability](#scalability)
+- [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->
 
 ## Release Signoff Checklist
@@ -344,3 +355,348 @@ type StorageClass struct {
 
 This proposal relies on the ability to update PVC status from kubelet. While updating a PVC's status,
 a PATCH request must be made from kubelet to update the status.
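
As a rough illustration of the PATCH mentioned above, the sketch below builds a request body that sets `status.capacity.storage` on a PVC. It assumes a JSON merge patch (`application/merge-patch+json`) sent to the PVC's `status` subresource; the patch type kubelet actually uses is not stated here, and `buildPVCStatusPatch` is a hypothetical helper, not code from this KEP or from kubelet.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildPVCStatusPatch returns a JSON merge patch body that sets
// status.capacity.storage on a PersistentVolumeClaim.
// Hypothetical helper for illustration only; not taken from the KEP or kubelet.
func buildPVCStatusPatch(newSize string) ([]byte, error) {
	patch := map[string]interface{}{
		"status": map[string]interface{}{
			"capacity": map[string]string{
				"storage": newSize,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := buildPVCStatusPatch("20Gi")
	if err != nil {
		panic(err)
	}
	// Assumption: this body would be sent as a PATCH to
	// /api/v1/namespaces/<namespace>/persistentvolumeclaims/<name>/status
	fmt.Println(string(body))
}
```

A patch limited to the `status` stanza leaves the user-owned `spec` untouched, which is one reason a PATCH may be preferable to a full update here.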
+
+## Production Readiness Review Questionnaire
+
+<!--
+
+Production readiness reviews are intended to ensure that features merging into
+Kubernetes are observable, scalable and supportable; can be safely operated in
+production environments, and can be disabled or rolled back in the event they
+cause increased failures in production. See more in the PRR KEP at
+https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
+
+The production readiness review questionnaire must be completed and approved
+for the KEP to move to `implementable` status and be included in the release.
+
+In some cases, the questions below should also have answers in `kep.yaml`. This
+is to enable automation to verify the presence of the review, and to reduce review
+burden and latency.
+
+The KEP must have an approver from the
+[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
+team. Please reach out on the
+[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
+you need any help or guidance.
+-->
+
+### Feature Enablement and Rollback
+
+<!--
+This section must be completed when targeting alpha to a release.
+-->
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+<!--
+Pick one of these and delete the rest.
+-->
+
+- [ ] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name:
+  - Components depending on the feature gate:
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+###### Does enabling the feature change any default behavior?
+
+<!--
+Any change of default behavior may be surprising to users or break existing
+automations, so be extremely careful here.
+-->
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+<!--
+Describe the consequences on existing workloads (e.g., if this is a runtime
+feature, can it break the existing applications?).
+
+NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
+-->
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+###### Are there any tests for feature enablement/disablement?
+
+<!--
+The e2e framework does not currently support enabling or disabling feature
+gates. However, unit tests in each component dealing with managing data, created
+with and without the feature, are necessary. At the very least, think about
+conversion tests if API types are being modified.
+-->
+
+### Rollout, Upgrade and Rollback Planning
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+<!--
+Try to be as paranoid as possible - e.g., what if some components will restart
+mid-rollout?
+
+Be sure to consider highly-available clusters, where, for example,
+feature flags will be enabled on some API servers and not others during the
+rollout. Similarly, consider large clusters and how enablement/disablement
+will roll out across nodes.
+-->
+
+###### What specific metrics should inform a rollback?
+
+<!--
+What signals should users be paying attention to when the feature is young
+that might indicate a serious problem?
+-->
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+<!--
+Describe manual testing that was done and the outcomes.
+Longer term, we may want to require automated upgrade/rollback tests, but we
+are missing a bunch of machinery and tooling and can't do that now.
+-->
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+<!--
+Even if applying deprecation policies, they may still surprise some users.
+-->
+
+### Monitoring Requirements
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can an operator determine if the feature is in use by workloads?
+
+<!--
+Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+checking if there are objects with field X set) may be a last resort. Avoid
+logs or events for this purpose.
+-->
+
+###### How can someone using this feature know that it is working for their instance?
+
+<!--
+For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
+for each individual pod.
+Pick one or more of these and delete the rest.
+Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
+and operation of this feature.
+Recall that end users cannot usually observe component logs or access metrics.
+-->
+
+- [ ] Events
+  - Event Reason:
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+<!--
+This is your opportunity to define what "normal" quality of service looks like
+for a feature.
+
+It's impossible to provide comprehensive guidance, but at the very
+high level (needs more precise definitions) those may be things like:
+- per-day percentage of API calls finishing with 5XX errors <= 1%
+- 99th percentile over a day of the absolute value of (job creation time minus expected
+  job creation time) for cron jobs <= 10%
+- 99.9% of /health requests per day finish with 200 code
+
+These goals will help you determine what you need to measure (SLIs) in the next
+question.
+-->
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+<!--
+Pick one or more of these and delete the rest.
+-->
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+<!--
+Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+implementation difficulties, etc.).
+-->
+
+### Dependencies
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### Does this feature depend on any specific services running in the cluster?
+
+<!--
+Think about both cluster-level services (e.g. metrics-server) as well
+as node-level agents (e.g. specific version of CRI). Focus on external or
+optional services that are needed. For example, if this feature depends on
+a cloud provider API, or upon an external software-defined storage or network
+control plane.
+
+For each of these, fill in the following, thinking about running existing user workloads
+and creating new ones, as well as about cluster-level services (e.g. DNS):
+- [Dependency name]
+  - Usage description:
+  - Impact of its outage on the feature:
+  - Impact of its degraded performance or high-error rates on the feature:
+-->
+
+### Scalability
+
+<!--
+For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them.
+
+For beta, this section is required: reviewers must answer these questions.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### Will enabling / using this feature result in any new API calls?
+
+<!--
+Describe them, providing:
+- API call type (e.g. PATCH pods)
+- estimated throughput
+- originating component(s) (e.g. Kubelet, Feature-X-controller)
+Focusing mostly on:
+- components listing and/or watching resources they didn't before
+- API calls that may be triggered by changes of some Kubernetes resources
+  (e.g. update of object X triggers new updates of object Y)
+- periodic API calls to reconcile state (e.g. periodic fetching state,
+  heartbeats, leader election, etc.)
+-->
+
+###### Will enabling / using this feature result in introducing new API types?
+
+<!--
+Describe them, providing:
+- API type
+- Supported number of objects per cluster
+- Supported number of objects per namespace (for namespace-scoped objects)
+-->
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+<!--
+Describe them, providing:
+- Which API(s):
+- Estimated increase:
+-->
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+<!--
+Describe them, providing:
+- API type(s):
+- Estimated increase in size: (e.g., new annotation of size 32B)
+- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+-->
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+<!--
+Look at the [existing SLIs/SLOs].
+
+Think about adding additional work or introducing new steps in between
+(e.g. need to do X to start a container), etc. Please describe the details.
+
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+-->
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+<!--
+Things to keep in mind include: additional in-memory state, additional
+non-trivial computations, excessive access to disks (including increased log
+volume), significant amount of data sent and/or received over network, etc.
+Think through this in both small and large cases, again with respect to the
+[supported limits].
+
+[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+-->
+
+### Troubleshooting
+
+<!--
+This section must be completed when targeting beta to a release.
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+-->
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+<!--
+For each of them, fill in the following information by copying the below template:
+- [Failure mode brief description]
+  - Detection: How can it be detected via metrics? Stated another way:
+    how can an operator troubleshoot without logging into a master or worker node?
+  - Mitigations: What can be done to stop the bleeding, especially for already
+    running user workloads?
+  - Diagnostics: What are the useful log messages and their required logging
+    levels that could help debug the issue?
+    Not required until the feature graduates to beta.
+  - Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+<!--
+Major milestones in the lifecycle of a KEP should be tracked in this section.
+Major milestones might include:
+- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
+- the `Proposal` section being merged, signaling agreement on a proposed design
+- the date implementation started
+- the first Kubernetes release where an initial version of the KEP was available
+- the version of Kubernetes where the KEP graduated to general availability
+- when the KEP was retired or superseded
+-->
+
+## Drawbacks
+
+<!--
+Why should this KEP _not_ be implemented?
+-->
+
+## Alternatives
+
+<!--
+What other approaches did you consider, and why did you rule them out? These do
+not need to be as detailed as the proposal, but should include enough
+information to express the idea and why it was not acceptable.
+-->
+
+## Infrastructure Needed (Optional)
+
+<!--
+Use this section if you need things from the project/SIG. Examples include a
+new subproject, repos requested, or GitHub details. Listing these here allows a
+SIG to get the process for these resources started right away.
+-->
