|
33 | 33 | - [Alpha](#alpha)
|
34 | 34 | - [Beta](#beta)
|
35 | 35 | - [Stable](#stable)
|
| 36 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 37 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 38 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 39 | + - [Monitoring Requirements](#monitoring-requirements) |
| 40 | + - [Dependencies](#dependencies) |
| 41 | + - [Scalability](#scalability) |
| 42 | + - [Troubleshooting](#troubleshooting) |
36 | 43 | - [Implementation History](#implementation-history)
|
37 | 44 | <!-- /toc -->
|
38 | 45 |
|
| 46 | + |
39 | 47 | ## Summary
|
40 | 48 |
|
41 | 49 | This proposal aims at allowing Pod resource requests & limits to be updated
|
@@ -629,6 +637,203 @@ TODO: Identify more cases
|
629 | 637 | - No major bugs reported for three months.
|
630 | 638 | - Pod-scoped resources are handled if that KEP is past alpha
|
631 | 639 |
|
| 640 | +## Production Readiness Review Questionnaire |
| 641 | + |
| 642 | +<!-- |
| 643 | +
|
| 644 | +Production readiness reviews are intended to ensure that features merging into |
| 645 | +Kubernetes are observable, scalable and supportable; can be safely operated in |
| 646 | +production environments, and can be disabled or rolled back in the event they |
| 647 | +cause increased failures in production. See more in the PRR KEP at |
| 648 | +https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md. |
| 649 | +
|
| 650 | +The production readiness review questionnaire must be completed for features in |
| 651 | +v1.19 or later, but is non-blocking at this time. That is, approval is not |
| 652 | +required in order to be in the release. |
| 653 | +
|
| 654 | +In some cases, the questions below should also have answers in `kep.yaml`. This |
| 655 | +is to enable automation to verify the presence of the review, and to reduce review |
| 656 | +burden and latency. |
| 657 | +
|
| 658 | +The KEP must have an approver from the |
| 659 | +[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES) |
| 660 | +team. Please reach out on the |
| 661 | +[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if |
| 662 | +you need any help or guidance. |
| 663 | +
|
| 664 | +--> |
| 665 | + |
| 666 | +### Feature Enablement and Rollback |
| 667 | + |
| 668 | +_This section must be completed when targeting alpha to a release._ |
| 669 | + |
| 670 | +* **How can this feature be enabled / disabled in a live cluster?** |
| 671 | + - [x] Feature gate (also fill in values in `kep.yaml`) |
| 672 | + - Feature gate name: InPlacePodVerticalScaling |
| 673 | + - Components depending on the feature gate: kubelet |
| 674 | + - [ ] Other |
| 675 | + - Describe the mechanism: |
| 676 | + - Will enabling / disabling the feature require downtime of the control |
| 677 | + plane? |
| 678 | + - Will enabling / disabling the feature require downtime or reprovisioning |
| 679 | + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). |
| 680 | + |
| 681 | +* **Does enabling the feature change any default behavior?** No |
| 682 | + |
| 683 | +* **Can the feature be disabled once it has been enabled (i.e. can we roll back |
| 684 | + the enablement)?** Yes |
| 685 | + |
| 686 | +* **What happens if we reenable the feature if it was previously rolled back?** |
| 687 | + |
| 688 | +* **Are there any tests for feature enablement/disablement?** Unit tests |
| 689 | + |
| 690 | +### Rollout, Upgrade and Rollback Planning |
| 691 | + |
| 692 | +_This section must be completed when targeting beta graduation to a release._ |
| 693 | + |
| 694 | +* **How can a rollout fail? Can it impact already running workloads?** |
| 695 | + Try to be as paranoid as possible - e.g., what if some components will restart |
| 696 | + mid-rollout? |
| 697 | + |
| 698 | +* **What specific metrics should inform a rollback?** |
| 699 | + |
| 700 | +* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?** |
| 701 | + Describe manual testing that was done and the outcomes. |
| 702 | + Longer term, we may want to require automated upgrade/rollback tests, but we |
| 703 | + are missing a bunch of machinery and tooling and can't do that now. |
| 704 | + |
| 705 | +* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, |
| 706 | +fields of API types, flags, etc.?** |
| 707 | + Even if applying deprecation policies, they may still surprise some users. |
| 708 | + |
| 709 | +### Monitoring Requirements |
| 710 | + |
| 711 | +_This section must be completed when targeting beta graduation to a release._ |
| 712 | + |
| 713 | +* **How can an operator determine if the feature is in use by workloads?** |
| 714 | + Ideally, this should be a metric. Operations against the Kubernetes API (e.g., |
| 715 | + checking if there are objects with field X set) may be a last resort. Avoid |
| 716 | + logs or events for this purpose. |
| 717 | + |
| 718 | +* **What are the SLIs (Service Level Indicators) an operator can use to determine |
| 719 | +the health of the service?** |
| 720 | + - [ ] Metrics |
| 721 | + - Metric name: |
| 722 | + - [Optional] Aggregation method: |
| 723 | + - Components exposing the metric: |
| 724 | + - [ ] Other (treat as last resort) |
| 725 | + - Details: |
| 726 | + |
| 727 | +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** |
| 728 | + At a high level, this usually will be in the form of "high percentile of SLI |
| 729 | + per day <= X". It's impossible to provide comprehensive guidance, but at the very |
| 730 | + high level (needs more precise definitions) those may be things like: |
| 731 | + - per-day percentage of API calls finishing with 5XX errors <= 1% |
| 732 | + - 99th percentile over a day of the absolute value of (job creation time minus expected |
| 733 | + job creation time) for cron jobs <= 10% |
| 734 | + - 99.9% of /health requests per day finish with 200 code |
| 735 | + |
| 736 | +* **Are there any missing metrics that would be useful to have to improve observability |
| 737 | +of this feature?** |
| 738 | + Describe the metrics themselves and the reasons why they weren't added (e.g., cost, |
| 739 | + implementation difficulties, etc.). |
| 740 | + |
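As an illustration of the first example SLO above ("per-day percentage of API calls finishing with 5XX errors <= 1%"), the following is a minimal sketch of how such a check could be computed from one day of response codes. The function names, the sample data, and the idea of feeding in a plain list of status codes are assumptions made for this example; they are not part of this KEP or of any Kubernetes API, and a real deployment would more likely derive the ratio from existing metrics.

```python
def error_rate_percent(status_codes):
    """Return the percentage of responses with a 5XX status code."""
    if not status_codes:
        # No traffic: treat as zero errors (an assumption for this sketch).
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100.0 * errors / len(status_codes)


def meets_slo(status_codes, max_error_percent=1.0):
    """Return True if the share of 5XX responses is at or below the threshold."""
    return error_rate_percent(status_codes) <= max_error_percent


# Hypothetical one-day sample: 995 successful calls and 5 server errors.
sample = [200] * 995 + [503] * 5
print(error_rate_percent(sample))  # 0.5
print(meets_slo(sample))           # True
```

The threshold is a parameter so the same check can be reused for other "percentage per day <= X" style objectives.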
| 741 | +### Dependencies |
| 742 | + |
| 743 | +_This section must be completed when targeting beta graduation to a release._ |
| 744 | + |
| 745 | +* **Does this feature depend on any specific services running in the cluster?** |
| 746 | + Think about both cluster-level services (e.g. metrics-server) as well |
| 747 | + as node-level agents (e.g. specific version of CRI). Focus on external or |
| 748 | + optional services that are needed. For example, if this feature depends on |
| 749 | + a cloud provider API, or upon an external software-defined storage or network |
| 750 | + control plane. |
| 751 | + |
| 752 | + For each of these, fill in the following—thinking about running existing user workloads |
| 753 | + and creating new ones, as well as about cluster-level services (e.g. DNS): |
| 754 | + - [Dependency name] |
| 755 | + - Usage description: |
| 756 | + - Impact of its outage on the feature: |
| 757 | + - Impact of its degraded performance or high-error rates on the feature: |
| 758 | + |
| 759 | +### Scalability |
| 760 | + |
| 761 | +_For alpha, this section is encouraged: reviewers should consider these questions |
| 762 | +and attempt to answer them._ |
| 763 | + |
| 764 | +_For beta, this section is required: reviewers must answer these questions._ |
| 765 | + |
| 766 | +_For GA, this section is required: approvers should be able to confirm the |
| 767 | +previous answers based on experience in the field._ |
| 768 | + |
| 769 | +* **Will enabling / using this feature result in any new API calls?** |
| 770 | + Describe them, providing: |
| 771 | + - API call type (e.g. PATCH pods) |
| 772 | + - estimated throughput |
| 773 | + - originating component(s) (e.g. Kubelet, Feature-X-controller) |
| 774 | + focusing mostly on: |
| 775 | + - components listing and/or watching resources they didn't before |
| 776 | + - API calls that may be triggered by changes of some Kubernetes resources |
| 777 | + (e.g. update of object X triggers new updates of object Y) |
| 778 | + - periodic API calls to reconcile state (e.g. periodic fetching state, |
| 779 | + heartbeats, leader election, etc.) |
| 780 | + |
| 781 | +* **Will enabling / using this feature result in introducing new API types?** |
| 782 | + Describe them, providing: |
| 783 | + - API type |
| 784 | + - Supported number of objects per cluster |
| 785 | + - Supported number of objects per namespace (for namespace-scoped objects) |
| 786 | + |
| 787 | +* **Will enabling / using this feature result in any new calls to the cloud |
| 788 | +provider?** |
| 789 | + |
| 790 | +* **Will enabling / using this feature result in increasing size or count of |
| 791 | +the existing API objects?** |
| 792 | + Describe them, providing: |
| 793 | + - API type(s): |
| 794 | + - Estimated increase in size: (e.g., new annotation of size 32B) |
| 795 | + - Estimated amount of new objects: (e.g., new Object X for every existing Pod) |
| 796 | + |
| 797 | +* **Will enabling / using this feature result in increasing time taken by any |
| 798 | +operations covered by [existing SLIs/SLOs]?** |
| 799 | + Think about adding additional work or introducing new steps in between |
| 800 | + (e.g. need to do X to start a container), etc. Please describe the details. |
| 801 | + |
| 802 | +* **Will enabling / using this feature result in non-negligible increase of |
| 803 | +resource usage (CPU, RAM, disk, IO, ...) in any components?** |
| 804 | + Things to keep in mind include: additional in-memory state, additional |
| 805 | + non-trivial computations, excessive access to disks (including increased log |
| 806 | + volume), significant amount of data sent and/or received over network, etc. |
| 807 | + Think through this both in small and large cases, again with respect to the |
| 808 | + [supported limits]. |
| 809 | + |
| 810 | +### Troubleshooting |
| 811 | + |
| 812 | +The Troubleshooting section currently serves the `Playbook` role. We may consider |
| 813 | +splitting it into a dedicated `Playbook` document (potentially with some monitoring |
| 814 | +details). For now, we leave it here. |
| 815 | + |
| 816 | +_This section must be completed when targeting beta graduation to a release._ |
| 817 | + |
| 818 | +* **How does this feature react if the API server and/or etcd is unavailable?** |
| 819 | + |
| 820 | +* **What are other known failure modes?** |
| 821 | + For each of them, fill in the following information by copying the below template: |
| 822 | + - [Failure mode brief description] |
| 823 | + - Detection: How can it be detected via metrics? Stated another way: |
| 824 | + how can an operator troubleshoot without logging into a master or worker node? |
| 825 | + - Mitigations: What can be done to stop the bleeding, especially for already |
| 826 | + running user workloads? |
| 827 | + - Diagnostics: What are the useful log messages and their required logging |
| 828 | + levels that could help debug the issue? |
| 829 | + Not required until the feature has graduated to beta. |
| 830 | + - Testing: Are there any tests for failure mode? If not, describe why. |
| 831 | + |
| 832 | +* **What steps should be taken if SLOs are not being met to determine the problem?** |
| 833 | + |
| 834 | +[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md |
| 835 | +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos |
| 836 | + |
632 | 837 | ## Implementation History
|
633 | 838 |
|
634 | 839 | - 2018-11-06 - initial KEP draft created
|