Commit 7eb9674

Merge pull request kubernetes#4093 from ndixita/memory-qos-beta
Memory QoS Beta Update and Production Readiness
2 parents 9e1a36e + 50463ff commit 7eb9674

3 files changed: +40 −11 lines changed
Lines changed: 2 additions & 0 deletions

@@ -1,3 +1,5 @@
 kep-number: 2570
 alpha:
+  approver: "@johnbelamaric"
+beta:
   approver: "@johnbelamaric"

keps/sig-node/2570-memory-qos/README.md

Lines changed: 34 additions & 8 deletions
@@ -8,6 +8,7 @@
 - [Proposal](#proposal)
   - [Alpha v1.22](#alpha-v122)
   - [Alpha v1.27](#alpha-v127)
+  - [Beta v1.28](#beta-v128)
 - [User Stories (Optional)](#user-stories-optional)
   - [Memory Sensitive Workload](#memory-sensitive-workload)
   - [Node Availability](#node-availability)
@@ -214,7 +215,7 @@ Some more examples to compare memory.high using Alpha v1.22 and Alpha v1.27 are
 
 ###### Quality of Service for Pods
 
-In addition to the change in formula for memory.high, we are also adding the support for memory.high to be set as per `Quality of Service(QoS) for Pod` classes. Based on user feedback in Alpha v1.22, some users would like to opt-out of MemoryQoS on a per pod basis to ensure there is no early memory throttling. By making user's pods guaranteed, they will be able to do so. Guaranteed pod ,by definition, are not overcommitted, so memory.high does not provide significant value.
+In addition to the change in the formula for memory.high, we are also adding support for setting memory.high per `Quality of Service (QoS) for Pod` class. Based on user feedback in Alpha v1.22, some users would like to opt out of MemoryQoS on a per-pod basis to ensure there is no early memory throttling. By making their pods Guaranteed, users will be able to do so. Guaranteed pods, by definition, are not overcommitted, so memory.high does not provide significant value.
 
 Following are the different cases for setting memory.high per QoS class:
 1. Guaranteed
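To make the computation above concrete, here is a minimal Go sketch of how `memory.high` can be derived under the Alpha v1.27 formula: the request plus the throttling factor times the headroom up to the limit (or node allocatable memory when no limit is set), floored to a page boundary. The names, the fixed page size, and the Guaranteed short-circuit are illustrative assumptions, not the kubelet's actual code:

```go
// Illustrative sketch of the Alpha v1.27 memory.high formula; not the
// kubelet's actual implementation.
package main

import "fmt"

const pageSize = 4096 // assumed page size

// memoryHigh returns the byte value to write to memory.high, or -1 to mean
// "max" (i.e. memory.high should be left unset).
func memoryHigh(request, limit, nodeAllocatable int64, throttlingFactor float64, guaranteed bool) int64 {
	if guaranteed {
		// Guaranteed pods are not overcommitted; memory.high is left at "max".
		return -1
	}
	ceiling := limit
	if ceiling == 0 {
		// No memory limit set: fall back to node allocatable memory.
		ceiling = nodeAllocatable
	}
	high := int64(float64(request) + throttlingFactor*float64(ceiling-request))
	high = high / pageSize * pageSize // floor to a page boundary
	if high >= ceiling {
		return -1 // would never throttle before the hard limit
	}
	return high
}

func main() {
	// Burstable container: 1Gi request, 2Gi limit, default factor 0.9.
	fmt.Println(memoryHigh(1<<30, 2<<30, 8<<30, 0.9, false))
}
```

With the v1.27 default throttling factor of 0.9, `memory.high` lands 90% of the way from the request to the limit, so reclaim pressure starts well before the container hits its hard limit.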
@@ -271,6 +272,10 @@ Alternative solutions that were discussed (but not preferred) before finalizing
 * It is simple to understand, as it requires setting only one kubelet configuration option for the memory throttling factor.
 * It doesn't involve API changes, and doesn't expose low-level detail to customers.
 
+#### Beta v1.28
+The feature graduates to Beta in v1.28. Its implementation in Beta is the same as in Alpha
+v1.27.
+
 ### User Stories (Optional)
 #### Memory Sensitive Workload
 Some workloads are sensitive to memory allocation and availability; slight delays may cause a service outage. In this case, a mechanism is needed to ensure the quality of memory.
@@ -485,6 +490,9 @@ The tests will reside in `test/e2e_node`.
 - Metrics and graphs to show the amount of reclaim done on a cgroup as it moves from below-request to above-request to throttling
 - Memory QoS is covered by unit and e2e-node tests
 - Memory QoS supports containerd, cri-o and dockershim
+- Expose memory events, e.g. the memory.high field of memory.events, which reports
+how many times memory.high was breached and the cgroup was throttled. See
+https://docs.kernel.org/admin-guide/cgroup-v2.html
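To sketch what this exposure could build on, the snippet below (illustrative, not kubelet or KEP code) parses a cgroup v2 `memory.events` file and reports its `high` counter, which counts how many times usage exceeded `memory.high` and the cgroup was throttled; the path is a placeholder:

```go
// Minimal sketch: count memory.high throttling events for a cgroup.
// Illustrative only; not part of the kubelet.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Placeholder path; substitute the pod's actual cgroup directory.
	f, err := os.Open("/sys/fs/cgroup/kubepods.slice/memory.events")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	events := map[string]uint64{}
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Each line is "<key> <count>", e.g. "high 42".
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue
		}
		n, _ := strconv.ParseUint(fields[1], 10, 64)
		events[fields[0]] = n
	}
	fmt.Printf("memory.high breached %d times\n", events["high"])
}
```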
 
 #### GA Graduation
 - [cgroup_v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) is in `GA`
@@ -538,7 +546,7 @@ Pick one of these and delete the rest.
 Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->
-Yes, the kubelet will set `memory.min` for Guaranteed and Burstable pod/container level cgroup. It also will set `memory.high` for burstable container, which may cause memory allocation throttle. `memory.min` for qos or node level cgroup will be set when `--cgroups-per-qos` or `--enforce-node-allocatable` is satisfied.
+Yes, the kubelet will set `memory.min` for Guaranteed and Burstable pod/container-level cgroups. It will also set `memory.high` for Burstable and BestEffort containers, which may cause memory allocation to be slowed down once memory usage in a container reaches the `memory.high` level. `memory.min` for the QoS- or node-level cgroup will be set when `--cgroups-per-qos` or `--enforce-node-allocatable` is satisfied.
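As an illustration of the `memory.min` half of this answer, the sketch below assumes the mapping described earlier in this KEP: a container-level `memory.min` equals the container's memory request, and the pod-level value is the sum of those requests. The function names are hypothetical:

```go
// Illustrative sketch (not kubelet code): Memory QoS maps memory requests
// to cgroup v2 memory.min values.
package main

import "fmt"

// containerMemoryMin: a container's memory.min is its memory request.
func containerMemoryMin(requestBytes int64) int64 { return requestBytes }

// podMemoryMin: the pod-level cgroup's memory.min is the sum of the
// containers' memory requests.
func podMemoryMin(requests []int64) int64 {
	var sum int64
	for _, r := range requests {
		sum += r
	}
	return sum
}

func main() {
	reqs := []int64{256 << 20, 512 << 20} // 256Mi and 512Mi requests
	fmt.Println(containerMemoryMin(reqs[0]), podMemoryMin(reqs))
}
```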
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
@@ -568,6 +576,11 @@ Yes, some unit tests are exercised with the feature both enabled and disabled to
 <!--
 This section must be completed when targeting beta to a release.
 -->
+N/A.
+There's no API change involved. MemoryQoS is a kubelet-level feature gate that will be enabled by default in Beta.
+It doesn't require any special opt-in by the user in their PodSpec.
+
+The kubelet will reconcile `memory.min`/`memory.high` with the related cgroups, depending on whether the feature gate is enabled, separately for each node.
 
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
@@ -580,6 +593,10 @@ feature flags will be enabled on some API servers and not others during the
 rollout. Similarly, consider large clusters and how enablement/disablement
 will roll out across nodes.
 -->
+Already running workloads will not have `memory.min`/`memory.high` set at the Pod level. Only `memory.min` will be
+set on the Node-level cgroup when the kubelet restarts. Existing workloads will be impacted only if the kernel
+isn't able to maintain at least a `memory.min` level of memory for the non-guaranteed workloads within the
+Node-level cgroup.
 
 ###### What specific metrics should inform a rollback?
 
@@ -601,6 +618,7 @@ are missing a bunch of machinery and tooling and can't do that now.
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
+No
 
 ### Monitoring Requirements
 
@@ -619,6 +637,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+An operator could run `ls /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<SOME_ID>.slice` on a node with cgroup v2 enabled to confirm the presence of the `memory.min` file, which indicates that the feature is in use by the workloads.
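The same check can be scripted. Below is a small illustrative Go program (not part of the KEP) that reads `memory.min` from a pod's cgroup directory; the pod ID in the path is a placeholder:

```go
// Illustrative sketch: verify Memory QoS is active by checking that
// memory.min exists in a pod's cgroup v2 directory.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Placeholder path; substitute the real pod's cgroup directory.
	dir := "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podSOME_ID.slice"
	data, err := os.ReadFile(filepath.Join(dir, "memory.min"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "memory.min not found; Memory QoS likely inactive:", err)
		os.Exit(1)
	}
	fmt.Println("memory.min =", strings.TrimSpace(string(data)))
}
```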
+
 ###### How can someone using this feature know that it is working for their instance?
 
 <!--
@@ -630,13 +650,15 @@ and operation of this feature.
 Recall that end users cannot usually observe component logs or access metrics.
 -->
 
 - [ ] Events
   - Event Reason:
 - [ ] API .status
   - Condition name:
   - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+- [X] Other (treat as last resort)
+  - Details: Kernel memory events will be available in kubelet logs via cadvisor.
+These events report the number of times the `memory.min` and `memory.high`
+levels were breached.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
@@ -654,6 +676,7 @@ high level (needs more precise definitions) those may be things like:
 These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
+N/A. Same as when running without this feature.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
@@ -665,15 +688,16 @@ Pick one more of these and delete the rest.
   - Metric name:
   - [Optional] Aggregation method:
   - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+- [X] Other (treat as last resort)
+  - Details: Not a service
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
+No
 
 ### Dependencies
 
@@ -697,6 +721,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
   - Impact of its outage on the feature:
   - Impact of its degraded performance or high-error rates on the feature:
 -->
+The container runtime must also support cgroup v2.
 
 ### Scalability
 
@@ -835,7 +860,8 @@ For each of them, fill in the following information by copying the below templat
 ## Implementation History
 - 2020/03/14: initial proposal
 - 2020/05/05: target Alpha to v1.22
-
+- 2023/03/03: target Alpha v2 to v1.27
+- 2023/06/14: target Beta to v1.28
 ## Drawbacks
 
 <!--

keps/sig-node/2570-memory-qos/kep.yaml

Lines changed: 4 additions & 3 deletions
@@ -14,11 +14,12 @@ owning-sig: sig-node
 status: implementable
 editor: "@ndixita"
 creation-date: 2021-03-14
-last-updated: 2023-02-02
-stage: alpha
-latest-milestone: "v1.27"
+last-updated: 2023-06-14
+stage: beta
+latest-milestone: "v1.28"
 milestone:
   alpha: "v1.27"
+  beta: "v1.28"
 feature-gates:
   - name: MemoryQoS
     components:
