Commit c1a54f0

promote ephemeral-storage-quotas to beta in 1.22
Signed-off-by: pacoxu <[email protected]>
1 parent 4989a15 commit c1a54f0

3 files changed: +179 -4 lines changed
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+kep-number: 1029
+alpha:
+  approver: "@deads2k"
+beta:
+  approver: "@deads2k"

keps/sig-node/1029-ephemeral-storage-quotas/README.md

Lines changed: 169 additions & 2 deletions
@@ -3,6 +3,7 @@
 ## Table of Contents
 
 <!-- toc -->
+- [Release Signoff Checklist](#release-signoff-checklist)
 - [Summary](#summary)
 - [Project Quotas](#project-quotas)
 - [Motivation](#motivation)
@@ -27,12 +28,19 @@
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Graduation Criteria](#graduation-criteria)
 - [Phase 1: Alpha (1.15)](#phase-1-alpha-115)
-- [Phase 2: Beta (target 1.16)](#phase-2-beta-target-116)
+- [Phase 2: Beta (target 1.22)](#phase-2-beta-target-122)
 - [Phase 3: GA](#phase-3-ga)
 - [Performance Benchmarks](#performance-benchmarks)
 - [Elapsed Time](#elapsed-time)
 - [User CPU Time](#user-cpu-time)
 - [System CPU Time](#system-cpu-time)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+- [Monitoring Requirements](#monitoring-requirements)
+- [Dependencies](#dependencies)
+- [Scalability](#scalability)
+- [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Version 1.15](#version-115)
 - [Drawbacks [optional]](#drawbacks-optional)
@@ -49,6 +57,21 @@
 
 [Tools for generating]: https://github.com/ekalinin/github-markdown-toc
 
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [X] (R) KEP approvers have approved the KEP status as `implementable`
+- [X] (R) Design details are appropriately documented
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [X] (R) Graduation criteria is in place
+- [X] (R) Production readiness review completed
+- [X] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
 ## Summary
 
 This proposal applies to the use of quotas for ephemeral-storage
@@ -610,7 +633,7 @@ The following criteria applies to
 - Unit test coverage
 - Node e2e test
 
-### Phase 2: Beta (target 1.16)
+### Phase 2: Beta (target 1.22)
 
 - User feedback
 - Benchmarks to determine latency and overhead of using quotas
@@ -709,6 +732,150 @@ and are not reported here.
 | du after umount/mount | 66.0 | 82.4 | 29.2 | 28.1 |
 | Remove Files | 188.6 | 156.6 | 90.4 | 81.8 |
 
+## Production Readiness Review Questionnaire
+
+<!--
+Production readiness reviews are intended to ensure that features merging into
+Kubernetes are observable, scalable and supportable; can be safely operated in
+production environments, and can be disabled or rolled back in the event they
+cause increased failures in production. See more in the PRR KEP at
+https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
+The production readiness review questionnaire must be completed and approved
+for the KEP to move to `implementable` status and be included in the release.
+In some cases, the questions below should also have answers in `kep.yaml`. This
+is to enable automation to verify the presence of the review, and to reduce review
+burden and latency.
+The KEP must have an approver from the
+[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
+team. Please reach out on the
+[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
+you need any help or guidance.
+-->
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+- Feature gate name: LocalStorageCapacityIsolationFSQuotaMonitoring
+- Components depending on the feature gate: kubelet
+
+###### Does enabling the feature change any default behavior?
+
+None. Behavior will not change.
+When LocalStorageCapacityIsolation is enabled for local ephemeral storage, and the backing filesystem for emptyDir volumes supports project quotas and has them enabled, the kubelet uses project quotas to monitor emptyDir volume storage consumption rather than a filesystem walk, for better performance and accuracy.
+
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes. If a pod was created with an enforcing quota, disabling the feature gate will not change the running pod.
+After the feature gate is set to false, newly created pods will not use the enforcing quota.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+Performance changes. This feature uses project quotas to monitor emptyDir volume storage consumption rather than a filesystem walk, for better performance and accuracy.
+
+###### Are there any tests for feature enablement/disablement?
+
+Yes, test/e2e_node/quota_lsci_test.go
+
+### Rollout, Upgrade and Rollback Planning
+
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+None. The rollout/rollback will not impact running workloads.
+
+###### What specific metrics should inform a rollback?
+
+None. To see its status, read the kubelet log for eviction-related messages or use xfs_quota to check the quota settings.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+Yes.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+LocalStorageCapacityIsolationFSQuotaMonitoring should be turned on only if LocalStorageCapacityIsolation is enabled as well.
+
+### Monitoring Requirements
+
+* **How can an operator determine if the feature is in use by workloads?**
+- A cluster admin can configure the kubelet on each node. If the feature gate is disabled, workloads on that node will not use it.
+For example, run `xfs_quota -x -c 'report -h' /dev/sdc` to check quota settings on the device.
+Check `spec.containers[].resources.limits.ephemeral-storage` of each container.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+the health of the service?**
+- Set a quota for the specified volume and try to write to the volume to check if there is a limitation.
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+- N/A.
+
+* **Are there any missing metrics that would be useful to have to improve observability of this feature?**
+- No.
+
+
+### Dependencies
+* **Does this feature depend on any specific services running in the cluster?**
+- No.
+
+### Scalability
+* **Will enabling / using this feature result in any new API calls?**
+- No.
+
+* **Will enabling / using this feature result in introducing new API types?**
+- No.
+
+* **Will enabling / using this feature result in any new calls to the cloud
+provider?**
+- No.
+
+* **Will enabling / using this feature result in increasing size or count of
+the existing API objects?**
+- No.
+
+* **Will enabling / using this feature result in increasing time taken by any
+operations covered by [existing SLIs/SLOs]?**
+- No.
+
+* **Will enabling / using this feature result in non-negligible increase of
+resource usage (CPU, RAM, disk, IO, ...) in any components?**
+- Yes. It will use less CPU time and IO during ephemeral storage monitoring. `kubelet` now allows use of XFS quotas (on XFS and suitably configured ext4fs filesystems) to monitor storage consumption for ephemeral storage (currently for emptyDir volumes only). This method of monitoring consumption is faster and more accurate than the old method of walking the filesystem tree. It does not enforce limits, only monitors consumption.
+
+### Troubleshooting
+
+<!--
+This section must be completed when targeting beta to a release.
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+-->
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+If the ephemeral storage limitation is reached, the pod will be evicted by kubelet.
+
+It should skip when the image is not configured correctly (unsupported FS or quota not enabled).
+
+<!--
+For each of them, fill in the following information by copying the below template:
+- [Failure mode brief description]
+- Detection: How can it be detected via metrics? Stated another way:
+how can an operator troubleshoot without logging into a master or worker node?
+- Mitigations: What can be done to stop the bleeding, especially for already
+running user workloads?
+- Diagnostics: What are the useful log messages and their required logging
+levels that could help debug the issue?
+Not required until feature graduated to beta.
+- Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+
 ## Implementation History
 
 ### Version 1.15
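
Illustration (not part of this commit): the questionnaire in the README diff above names the feature gate `LocalStorageCapacityIsolationFSQuotaMonitoring`, lists the kubelet as the only component depending on it, and says it should be turned on only together with `LocalStorageCapacityIsolation`. A minimal sketch of a kubelet configuration file that sets both gates might look like this. Where the file lives and how it is passed to the kubelet depend on the node setup and are not specified in the KEP text shown here.

```yaml
# Sketch of a kubelet configuration file; not taken from the commit.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Prerequisite named in the questionnaire.
  LocalStorageCapacityIsolation: true
  # The gate this KEP promotes.
  LocalStorageCapacityIsolationFSQuotaMonitoring: true
```

The questionnaire also points operators at `spec.containers[].resources.limits.ephemeral-storage`. A hypothetical pod that sets this limit and writes to an emptyDir volume could be used to observe the monitoring and eviction behavior described above. The pod name, image, command, mount path, and sizes below are assumptions chosen only for illustration.

```yaml
# Hypothetical test pod; every name and size here is illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: quota-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      resources:
        limits:
          ephemeral-storage: "1Gi"   # the field the questionnaire says to check
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 500Mi             # emptyDir volumes are the only ones monitored via quotas
```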

keps/sig-node/1029-ephemeral-storage-quotas/kep.yaml

Lines changed: 5 additions & 2 deletions
@@ -13,8 +13,11 @@ approvers:
 - "@derekwaynecarr"
 editor: TBD
 creation-date: 2018-09-06
-last-updated: 2019-06-04
+last-updated: 2021-05-08
 status: implementable
 
-latest-milestone: "0.0"
+latest-milestone: "1.22"
 stage: "alpha"
+milestone:
+  alpha: "1.15"
+  beta: "1.22"
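
Illustration (not part of this commit): the questionnaire asks authors to "also fill in values in `kep.yaml`" for the feature gate, but the hunk above adds only milestone fields. Assuming the file follows the field names used by the upstream KEP template (`feature-gates`, `components`, `disable-supported`), the corresponding entries might look like the sketch below. The values are taken from the questionnaire answers; whether and where they were eventually added is not shown in this diff.

```yaml
# Sketch only; field names assumed from the KEP template, values from the PRR answers.
feature-gates:
  - name: LocalStorageCapacityIsolationFSQuotaMonitoring
    components:
      - kubelet
# The questionnaire answers "Yes" to whether the feature can be disabled after enablement.
disable-supported: true
```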