|
3 | 3 | ## Table of Contents
|
4 | 4 |
|
5 | 5 | <!-- toc -->
|
| 6 | +- [Release Signoff Checklist](#release-signoff-checklist) |
6 | 7 | - [Summary](#summary)
|
7 | 8 | - [Project Quotas](#project-quotas)
|
8 | 9 | - [Motivation](#motivation)
|
|
27 | 28 | - [Risks and Mitigations](#risks-and-mitigations)
|
28 | 29 | - [Graduation Criteria](#graduation-criteria)
|
29 | 30 | - [Phase 1: Alpha (1.15)](#phase-1-alpha-115)
|
30 |
| - - [Phase 2: Beta (target 1.16)](#phase-2-beta-target-116) |
| 31 | + - [Phase 2: Beta (target 1.22)](#phase-2-beta-target-122) |
31 | 32 | - [Phase 3: GA](#phase-3-ga)
|
32 | 33 | - [Performance Benchmarks](#performance-benchmarks)
|
33 | 34 | - [Elapsed Time](#elapsed-time)
|
34 | 35 | - [User CPU Time](#user-cpu-time)
|
35 | 36 | - [System CPU Time](#system-cpu-time)
|
| 37 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 38 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 39 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 40 | + - [Monitoring Requirements](#monitoring-requirements) |
| 41 | + - [Dependencies](#dependencies) |
| 42 | + - [Scalability](#scalability) |
| 43 | + - [Troubleshooting](#troubleshooting) |
36 | 44 | - [Implementation History](#implementation-history)
|
37 | 45 | - [Version 1.15](#version-115)
|
38 | 46 | - [Drawbacks [optional]](#drawbacks-optional)
|
|
49 | 57 |
|
50 | 58 | [Tools for generating]: https://github.com/ekalinin/github-markdown-toc
|
51 | 59 |
|
| 60 | +## Release Signoff Checklist |
| 61 | + |
| 62 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 63 | + |
| 64 | +- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
| 65 | +- [X] (R) KEP approvers have approved the KEP status as `implementable` |
| 66 | +- [X] (R) Design details are appropriately documented |
| 67 | +- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
| 68 | +- [X] (R) Graduation criteria is in place |
| 69 | +- [X] (R) Production readiness review completed |
| 70 | +- [X] (R) Production readiness review approved |
| 71 | +- [ ] "Implementation History" section is up-to-date for milestone |
| 72 | +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 73 | +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 74 | + |
52 | 75 | ## Summary
|
53 | 76 |
|
54 | 77 | This proposal applies to the use of quotas for ephemeral-storage
|
@@ -610,7 +633,7 @@ The following criteria applies to
|
610 | 633 | - Unit test coverage
|
611 | 634 | - Node e2e test
|
612 | 635 |
|
613 |
| -### Phase 2: Beta (target 1.16) |
| 636 | +### Phase 2: Beta (target 1.22) |
614 | 637 |
|
615 | 638 | - User feedback
|
616 | 639 | - Benchmarks to determine latency and overhead of using quotas
|
@@ -709,6 +732,150 @@ and are not reported here.
|
709 | 732 | | du after umount/mount | 66.0 | 82.4 | 29.2 | 28.1 |
|
710 | 733 | | Remove Files | 188.6 | 156.6 | 90.4 | 81.8 |
|
711 | 734 |
|
| 735 | +## Production Readiness Review Questionnaire |
| 736 | + |
| 737 | +<!-- |
| 738 | +Production readiness reviews are intended to ensure that features merging into |
| 739 | +Kubernetes are observable, scalable and supportable; can be safely operated in |
| 740 | +production environments, and can be disabled or rolled back in the event they |
| 741 | +cause increased failures in production. See more in the PRR KEP at |
| 742 | +https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness. |
| 743 | +The production readiness review questionnaire must be completed and approved |
| 744 | +for the KEP to move to `implementable` status and be included in the release. |
| 745 | +In some cases, the questions below should also have answers in `kep.yaml`. This |
| 746 | +is to enable automation to verify the presence of the review, and to reduce review |
| 747 | +burden and latency. |
| 748 | +The KEP must have an approver from the |
| 749 | +[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES) |
| 750 | +team. Please reach out on the |
| 751 | +[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if |
| 752 | +you need any help or guidance. |
| 753 | +--> |
| 754 | + |
| 755 | +### Feature Enablement and Rollback |
| 756 | + |
| 757 | +###### How can this feature be enabled / disabled in a live cluster? |
| 758 | + |
| 759 | +- [x] Feature gate (also fill in values in `kep.yaml`) |
| 760 | + - Feature gate name: LocalStorageCapacityIsolationFSQuotaMonitoring |
| 761 | + - Components depending on the feature gate: kubelet |
| 762 | + |
| 763 | +###### Does enabling the feature change any default behavior? |
| 764 | + |
| 765 | +No. User-visible behavior does not change. |
| 766 | +When LocalStorageCapacityIsolation is enabled for local ephemeral storage, and the filesystem backing emptyDir volumes supports project quotas and has them enabled, kubelet uses project quotas instead of a filesystem walk to monitor emptyDir volume storage consumption, which gives better performance and accuracy. |
| 767 | + |
| 768 | + |
| 769 | +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? |
| 770 | + |
| 771 | +Yes. If a pod was created with an enforcing quota, disabling the feature gate does not change the running pod. |
| 772 | +After the feature gate is set to false, newly created pods will not use the enforcing quota. |
| 773 | + |
| 774 | +###### What happens if we reenable the feature if it was previously rolled back? |
| 775 | + |
| 776 | +Only performance changes. Once re-enabled, the feature again uses project quotas rather than a filesystem walk to monitor emptyDir volume storage consumption, for better performance and accuracy. |
| 777 | + |
| 778 | +###### Are there any tests for feature enablement/disablement? |
| 779 | + |
| 780 | +Yes, test/e2e_node/quota_lsci_test.go |
| 781 | + |
| 782 | +### Rollout, Upgrade and Rollback Planning |
| 783 | + |
| 784 | + |
| 785 | +###### How can a rollout or rollback fail? Can it impact already running workloads? |
| 786 | + |
| 787 | +No failure modes are known. Rollout and rollback do not impact running workloads. |
| 788 | + |
| 789 | +###### What specific metrics should inform a rollback? |
| 790 | + |
| 791 | +None. To check the feature's status, read the kubelet log for eviction-related messages or use `xfs_quota` to inspect the quota settings. |
| 792 | + |
| 793 | +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? |
| 794 | + |
| 795 | +Yes. |
| 796 | + |
| 797 | +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? |
| 798 | + |
| 799 | +LocalStorageCapacityIsolationFSQuotaMonitoring should be turned on only if LocalStorageCapacityIsolation is enabled as well. |
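
The dependency between the two gates can be checked mechanically before rollout. The following is a minimal sketch, not part of kubelet: it parses a string in the `Name=true,Other=false` form accepted by `--feature-gates` and flags the combination described above. The helper names and the default values passed in are assumptions for illustration; confirm the actual defaults for your release.

```python
QUOTA_GATE = "LocalStorageCapacityIsolationFSQuotaMonitoring"
LSCI_GATE = "LocalStorageCapacityIsolation"


def parse_feature_gates(flag_value):
    """Parse a 'Name=true,Other=false' string into a dict of booleans."""
    gates = {}
    for item in flag_value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        value = value.strip().lower()
        if not sep or value not in ("true", "false"):
            raise ValueError("malformed feature gate entry: %r" % item)
        gates[name.strip()] = value == "true"
    return gates


def quota_monitoring_misconfigured(gates, lsci_default=True, quota_default=False):
    """Return True when quota monitoring is on but its prerequisite is off.

    The defaults are assumptions used when a gate is not set explicitly.
    """
    quota = gates.get(QUOTA_GATE, quota_default)
    lsci = gates.get(LSCI_GATE, lsci_default)
    return quota and not lsci
```

For example, `quota_monitoring_misconfigured(parse_feature_gates("LocalStorageCapacityIsolationFSQuotaMonitoring=true,LocalStorageCapacityIsolation=false"))` reports the invalid combination.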
| 800 | + |
| 801 | +### Monitoring Requirements |
| 802 | + |
| 803 | +* **How can an operator determine if the feature is in use by workloads?** |
| 804 | + - A cluster admin configures the feature gate on the kubelet of each node. If the feature gate is disabled on a node, workloads on that node will not use the feature. |
| 805 | + To check the quota settings on a device, run, for example, `xfs_quota -x -c 'report -h' /dev/sdc`. |
| 806 | + Also check `spec.containers[].resources.limits.ephemeral-storage` of each container. |
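
The per-container check above can be scripted. This is a minimal sketch that reads a pod manifest in JSON form (for instance the output of `kubectl get pod <name> -o json`) and lists each container's `ephemeral-storage` limit; the function name is hypothetical and only the `spec.containers[].resources.limits` path from the text is assumed.

```python
import json


def ephemeral_storage_limits(pod_json):
    """Map container name -> ephemeral-storage limit (None if unset)."""
    pod = json.loads(pod_json)
    limits = {}
    for container in pod.get("spec", {}).get("containers", []):
        # "resources" and "limits" may be missing or empty.
        resources = container.get("resources") or {}
        container_limits = resources.get("limits") or {}
        limits[container["name"]] = container_limits.get("ephemeral-storage")
    return limits
```

Containers that map to `None` have no ephemeral-storage limit set in their own spec.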
| 807 | + |
| 808 | +* **What are the SLIs (Service Level Indicators) an operator can use to determine |
| 809 | +the health of the service?** |
| 810 | + - Set a quota for the specified volume, then write to the volume to check whether the limit takes effect. |
| 811 | + |
| 812 | +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** |
| 813 | + - N/A. |
| 814 | + |
| 815 | +* **Are there any missing metrics that would be useful to have to improve observability of this feature?** |
| 816 | + - No. |
| 817 | + |
| 818 | + |
| 819 | +### Dependencies |
| 820 | +* **Does this feature depend on any specific services running in the cluster?** |
| 821 | + - No. |
| 822 | + |
| 823 | +### Scalability |
| 824 | +* **Will enabling / using this feature result in any new API calls?** |
| 825 | + - No. |
| 826 | + |
| 827 | +* **Will enabling / using this feature result in introducing new API types?** |
| 828 | + - No. |
| 829 | + |
| 830 | +* **Will enabling / using this feature result in any new calls to the cloud |
| 831 | +provider?** |
| 832 | + - No. |
| 833 | + |
| 834 | +* **Will enabling / using this feature result in increasing size or count of |
| 835 | +the existing API objects?** |
| 836 | + - No. |
| 837 | + |
| 838 | +* **Will enabling / using this feature result in increasing time taken by any |
| 839 | +operations covered by [existing SLIs/SLOs]?** |
| 840 | + - No. |
| 841 | + |
| 842 | +* **Will enabling / using this feature result in non-negligible increase of |
| 843 | +resource usage (CPU, RAM, disk, IO, ...) in any components?** |
| 844 | + - No, resource usage decreases. The feature uses less CPU time and IO during ephemeral storage monitoring. `kubelet` now allows use of XFS quotas (on XFS and suitably configured ext4fs filesystems) to monitor storage consumption for ephemeral storage (currently for emptyDir volumes only). This method of monitoring consumption is faster and more accurate than the old method of walking the filesystem tree. It does not enforce limits; it only monitors consumption. |
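
To illustrate why the tree walk is the expensive path, here is a minimal sketch of a `du`-style walk. It is not kubelet's implementation, and the function name is hypothetical. The work grows with the number of files in the volume, whereas reading a project quota is a single query regardless of file count.

```python
import os


def walk_usage_bytes(root):
    """Sum apparent file sizes under root by visiting every file."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat avoids following symlinks; st_size is the apparent
                # size, not the allocated blocks that du reports by default.
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # A file may disappear while the walk is in progress.
                pass
    return total
```

Because files can change during the walk, the result is only a snapshot estimate, which is one reason the quota-based figure is described as more accurate.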
| 845 | + |
| 846 | +### Troubleshooting |
| 847 | + |
| 848 | +<!-- |
| 849 | +This section must be completed when targeting beta to a release. |
| 850 | +The Troubleshooting section currently serves the `Playbook` role. We may consider |
| 851 | +splitting it into a dedicated `Playbook` document (potentially with some monitoring |
| 852 | +details). For now, we leave it here. |
| 853 | +--> |
| 854 | + |
| 855 | +###### How does this feature react if the API server and/or etcd is unavailable? |
| 856 | + |
| 857 | +###### What are other known failure modes? |
| 858 | + |
| 859 | +If the ephemeral storage limit is reached, kubelet evicts the pod. |
| 860 | + |
| 861 | +It should skip when the image is not configured correctly (unsupported FS or quota not enabled). |
| 862 | + |
| 863 | +<!-- |
| 864 | +For each of them, fill in the following information by copying the below template: |
| 865 | + - [Failure mode brief description] |
| 866 | + - Detection: How can it be detected via metrics? Stated another way: |
| 867 | + how can an operator troubleshoot without logging into a master or worker node? |
| 868 | + - Mitigations: What can be done to stop the bleeding, especially for already |
| 869 | + running user workloads? |
| 870 | + - Diagnostics: What are the useful log messages and their required logging |
| 871 | + levels that could help debug the issue? |
| 872 | + Not required until feature graduated to beta. |
| 873 | + - Testing: Are there any tests for failure mode? If not, describe why. |
| 874 | +--> |
| 875 | + |
| 876 | +###### What steps should be taken if SLOs are not being met to determine the problem? |
| 877 | + |
| 878 | + |
712 | 879 | ## Implementation History
|
713 | 880 |
|
714 | 881 | ### Version 1.15
|
|