Commit 80f44e0

PRR approval request for KEPs:
1287-in-place-update-pod-resources 2273-kubelet-container-resources-cri-api-changes
1 parent 27b1053 commit 80f44e0

6 files changed, 443 additions and 2 deletions
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+kep-number: 1287
+alpha:
+  approver: "@ehashman"
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+kep-number: 2273
+alpha:
+  approver: "@ehashman"

keps/sig-node/1287-in-place-update-pod-resources/README.md

Lines changed: 205 additions & 0 deletions
@@ -33,9 +33,17 @@
     - [Alpha](#alpha)
     - [Beta](#beta)
     - [Stable](#stable)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 <!-- /toc -->

+
 ## Summary

 This proposal aims at allowing Pod resource requests & limits to be updated
@@ -629,6 +637,203 @@ TODO: Identify more cases
 - No major bugs reported for three months.
 - Pod-scoped resources are handled if that KEP is past alpha

+## Production Readiness Review Questionnaire
+
+<!--
+
+Production readiness reviews are intended to ensure that features merging into
+Kubernetes are observable, scalable and supportable; can be safely operated in
+production environments, and can be disabled or rolled back in the event they
+cause increased failures in production. See more in the PRR KEP at
+https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md.
+
+The production readiness review questionnaire must be completed for features in
+v1.19 or later, but is non-blocking at this time. That is, approval is not
+required in order to be in the release.
+
+In some cases, the questions below should also have answers in `kep.yaml`. This
+is to enable automation to verify the presence of the review, and to reduce review
+burden and latency.
+
+The KEP must have an approver from the
+[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
+team. Please reach out on the
+[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
+you need any help or guidance.
+
+-->
+
+### Feature Enablement and Rollback
+
+_This section must be completed when targeting alpha to a release._
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [x] Feature gate (also fill in values in `kep.yaml`)
+    - Feature gate name: InPlacePodVerticalScaling
+    - Components depending on the feature gate: kubelet
+  - [ ] Other
+    - Describe the mechanism:
+    - Will enabling / disabling the feature require downtime of the control
+      plane?
+    - Will enabling / disabling the feature require downtime or reprovisioning
+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+* **Does enabling the feature change any default behavior?** No
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+  the enablement)?** Yes
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+
+* **Are there any tests for feature enablement/disablement?** Unit tests
+
+### Rollout, Upgrade and Rollback Planning
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can a rollout fail? Can it impact already running workloads?**
+  Try to be as paranoid as possible - e.g., what if some components will restart
+   mid-rollout?
+
+* **What specific metrics should inform a rollback?**
+
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+  Describe manual testing that was done and the outcomes.
+  Longer term, we may want to require automated upgrade/rollback tests, but we
+  are missing a bunch of machinery and tooling and can't do that now.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+fields of API types, flags, etc.?**
+  Even if applying deprecation policies, they may still surprise some users.
+
+### Monitoring Requirements
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can an operator determine if the feature is in use by workloads?**
+  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+  checking if there are objects with field X set) may be a last resort. Avoid
+  logs or events for this purpose.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+the health of the service?**
+  - [ ] Metrics
+    - Metric name:
+    - [Optional] Aggregation method:
+    - Components exposing the metric:
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  At a high level, this usually will be in the form of "high percentile of SLI
+  per day <= X". It's impossible to provide comprehensive guidance, but at the very
+  high level (needs more precise definitions) those may be things like:
+  - per-day percentage of API calls finishing with 5XX errors <= 1%
+  - 99% percentile over day of absolute value from (job creation time minus expected
+    job creation time) for cron job <= 10%
+  - 99.9% of /health requests per day finish with 200 code
+
+* **Are there any missing metrics that would be useful to have to improve observability
+of this feature?**
+  Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+  implementation difficulties, etc.).
+
+### Dependencies
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **Does this feature depend on any specific services running in the cluster?**
+  Think about both cluster-level services (e.g. metrics-server) as well
+  as node-level agents (e.g. specific version of CRI). Focus on external or
+  optional services that are needed. For example, if this feature depends on
+  a cloud provider API, or upon an external software-defined storage or network
+  control plane.
+
+  For each of these, fill in the following, thinking about running existing user workloads
+  and creating new ones, as well as about cluster-level services (e.g. DNS):
+  - [Dependency name]
+    - Usage description:
+      - Impact of its outage on the feature:
+      - Impact of its degraded performance or high-error rates on the feature:
+
+### Scalability
+
+_For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them._
+
+_For beta, this section is required: reviewers must answer these questions._
+
+_For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field._
+
+* **Will enabling / using this feature result in any new API calls?**
+  Describe them, providing:
+  - API call type (e.g. PATCH pods)
+  - estimated throughput
+  - originating component(s) (e.g. Kubelet, Feature-X-controller)
+  focusing mostly on:
+  - components listing and/or watching resources they didn't before
+  - API calls that may be triggered by changes of some Kubernetes resources
+    (e.g. update of object X triggers new updates of object Y)
+  - periodic API calls to reconcile state (e.g. periodic fetching state,
+    heartbeats, leader election, etc.)
+
+* **Will enabling / using this feature result in introducing new API types?**
+  Describe them, providing:
+  - API type
+  - Supported number of objects per cluster
+  - Supported number of objects per namespace (for namespace-scoped objects)
+
+* **Will enabling / using this feature result in any new calls to the cloud
+provider?**
+
+* **Will enabling / using this feature result in increasing size or count of
+the existing API objects?**
+  Describe them, providing:
+  - API type(s):
+  - Estimated increase in size: (e.g., new annotation of size 32B)
+  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+
+* **Will enabling / using this feature result in increasing time taken by any
+operations covered by [existing SLIs/SLOs]?**
+  Think about adding additional work or introducing new steps in between
+  (e.g. need to do X to start a container), etc. Please describe the details.
+
+* **Will enabling / using this feature result in non-negligible increase of
+resource usage (CPU, RAM, disk, IO, ...) in any components?**
+  Things to keep in mind include: additional in-memory state, additional
+  non-trivial computations, excessive access to disks (including increased log
+  volume), significant amount of data sent and/or received over network, etc.
+  Think through this both in small and large cases, again with respect to the
+  [supported limits].
+
+### Troubleshooting
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+
+* **What are other known failure modes?**
+  For each of them, fill in the following information by copying the below template:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until feature graduated to beta.
+    - Testing: Are there any tests for failure mode? If not, describe why.
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+
+[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+
 ## Implementation History

 - 2018-11-06 - initial KEP draft created

keps/sig-node/1287-in-place-update-pod-resources/kep.yaml

Lines changed: 15 additions & 1 deletion
@@ -23,12 +23,26 @@ approvers:
   - "@mwielgus"
 editor: TBD
 creation-date: 2018-11-06
-last-updated: 2020-01-14
+last-updated: 2021-02-05
 status: implementable
 see-also:
   - "/keps/sig-node/2273-kubelet-container-resources-cri-api-changes"
 replaces:
 superseded-by:

+# PRR
+prr-approvers:
+  - "@ehashman"
+feature-gates:
+  - name: InPlacePodVerticalScaling
+    components:
+      - kube-apiserver
+      - kube-scheduler
+      - kubelet
+disable-supported: true
+milestone:
+  alpha: "v1.22"
+  beta: "v1.23"
+  stable: "v1.24"
 latest-milestone: "0.0"
 stage: "alpha"
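
As an illustration of the `feature-gates` entry in this `kep.yaml`, here is a minimal sketch of how an operator might turn the gate on for a kubelet through its configuration file. It is not part of this commit, which only records PRR metadata, and it assumes a Kubernetes release in which `InPlacePodVerticalScaling` is actually implemented. The other components listed in `kep.yaml` (kube-apiserver and kube-scheduler) would presumably need the same gate, for example through their `--feature-gates` flag.

```yaml
# Hypothetical kubelet configuration fragment, shown for illustration only.
# Assumes the InPlacePodVerticalScaling gate exists in the running release.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  InPlacePodVerticalScaling: true   # set to false (or drop the entry) to roll back
```

Because `disable-supported: true` is declared, flipping the value back to `false` and restarting the component is the expected rollback path.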
