Commit 54db0aa

Modifying the Memory QoS KEP for production readiness review in the Alpha stage for K8s v1.27
Signed-off-by: Dixita Narang <[email protected]>
1 parent 48599d1 commit 54db0aa

File tree: 3 files changed, +166 -10 lines changed

keps/sig-node/2570-memory-qos/README.md

Lines changed: 161 additions & 7 deletions
@@ -6,6 +6,8 @@
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Alpha v1.22](#alpha-v122)
  - [Alpha v1.27](#alpha-v127)
  - [User Stories (Optional)](#user-stories-optional)
    - [Memory Sensitive Workload](#memory-sensitive-workload)
    - [Node Availability](#node-availability)
@@ -52,7 +54,7 @@
Support memory QoS with cgroups v2.
## Motivation
In the traditional cgroups v1 implementation in Kubernetes, we can only limit CPU resources, such as `cpu_shares / cpu_set / cpu_quota / cpu_period`; memory QoS has not been implemented yet. cgroups v2 brings new capabilities for the memory controller and will help Kubernetes enhance memory isolation quality.

### Goals

- Provide guarantees around memory availability for pod and container memory requests and limits
@@ -87,22 +89,173 @@ Cgroups v2 introduces a better way to protect and guarantee memory quality.

This proposal sets `requests.memory` to `memory.min` for protecting container memory requests. `limits.memory` is set to `memory.max` (this is consistent with the existing `memory.limit_in_bytes` for cgroups v1; we do nothing extra because [cgroup_v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) has already implemented that).

We also introduce `memory.high` on the container-level cgroup to throttle container memory overcommit allocation.

***Note***: memory.high is set on the container-level cgroup, not on the pod-level cgroup. If one container in a pod saw a spike in memory usage, total pod-level memory usage could reach a memory.high level set on the pod-level cgroup, which would throttle the other containers even though they stayed within their own requests. Hence, to keep containers from affecting each other, we set memory.high only on the container-level cgroup.
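For illustration, here is a minimal Go sketch of how these three values map onto the cgroup v2 interface files. The `writeCgroupFile` helper, the cgroup path layout, and the byte values are assumptions for this sketch; the real kubelet configures these values through the container runtime (CRI), not by writing files directly:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeCgroupFile is a hypothetical helper that writes a single value into a
// cgroup v2 interface file.
func writeCgroupFile(cgroupDir, name, value string) error {
	return os.WriteFile(filepath.Join(cgroupDir, name), []byte(value), 0o644)
}

func main() {
	// Assumed container-level cgroup path under the cgroup v2 unified hierarchy.
	dir := "/sys/fs/cgroup/kubepods/pod-example/container-example"

	_ = writeCgroupFile(dir, "memory.min", "52428800")  // requests.memory (50Mi): protected from reclaim
	_ = writeCgroupFile(dir, "memory.max", "104857600") // limits.memory (100Mi): hard limit
	_ = writeCgroupFile(dir, "memory.high", "83886080") // throttling threshold (80Mi): reclaim pressure above this
	fmt.Println("configured memory.min, memory.high, memory.max")
}
```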

#### Alpha v1.22

It is based on the formula:
```
memory.high = (limits.memory or node allocatable memory) * memory throttling factor,
where the default value of memory throttling factor is set to 0.8
```

e.g. If a container has `requests.memory=50, limits.memory=100`, and we have a throttling factor of 0.8, `memory.high` would be 80. If a container has no memory limit specified, we substitute `node allocatable memory` for `limits.memory` and apply the throttling factor of 0.8 to that value.
It must be ensured that `memory.high` is always greater than `memory.min`.
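As a sketch, the Alpha v1.22 rule, including the constraint that memory.high must stay above memory.min, could be written as the following Go helper. The function name and signature are illustrative, not the kubelet's actual code:

```go
// memoryHighV122 returns the Alpha v1.22 memory.high value and ok=true, or
// ok=false when memory.high must be left unset because it would not exceed
// memory.min (i.e. requests.memory).
func memoryHighV122(requests, limits, nodeAllocatable int64, factor float64) (int64, bool) {
	base := limits
	if base == 0 { // no memory limit: fall back to node allocatable memory
		base = nodeAllocatable
	}
	high := int64(factor * float64(base))
	if high <= requests { // memory.high must stay above memory.min
		return 0, false
	}
	return high, true
}
```

With `requests.memory=50, limits.memory=100` and a factor of 0.8, this yields 80, matching the example above.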

Node reserved resources (kube-reserved/system-reserved) are also taken into account: this is tied to `--enforce-node-allocatable`, and `memory.min` will be set accordingly.

Brief map as follows:
| type | memory.min | memory.high |
| -------- | -------- | -------- |
| container | requests.memory | (limits.memory or node allocatable memory) * memory throttling factor |
| pod | sum(requests.memory) | N/A |
| node | pods, kube-reserved, system-reserved | N/A |

#### Alpha v1.27

The formula for memory.high for the container-level cgroup is modified in the Alpha stage of the feature in K8s v1.27. It is now set based on the formula:
```
memory.high = floor[(requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory)) / pageSize] * pageSize,
where the default value of memory throttling factor is set to 0.9
```
Note: If a container has no memory limit specified, we substitute `node allocatable memory` for `limits.memory` and apply the throttling factor of 0.9 to that value.
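A corresponding sketch of the v1.27 calculation, again with an illustrative helper rather than the kubelet's actual code; flooring to a page boundary is done with integer division:

```go
// memoryHighV127 computes
// floor[(requests + factor*(upper - requests)) / pageSize] * pageSize,
// where upper is limits.memory, or node allocatable memory if no limit is set.
func memoryHighV127(requests, limits, nodeAllocatable, pageSize int64, factor float64) int64 {
	upper := limits
	if upper == 0 { // no memory limit: substitute node allocatable memory
		upper = nodeAllocatable
	}
	high := float64(requests) + factor*float64(upper-requests)
	return (int64(high) / pageSize) * pageSize // round down to a page multiple
}
```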

The table below works through examples with different requests.memory values, assuming limits.memory = 1000, a memory throttling factor of 0.9, and a 1Mi pageSize:

| requests.memory | memory.high (limits.memory = 1000, factor = 0.9) |
| ---------------------- | ----------------------------- |
| request 0 | 900 |
| request 100 | 910 |
| request 200 | 920 |
| request 300 | 930 |
| request 400 | 940 |
| request 500 | 950 |
| request 600 | 960 |
| request 700 | 970 |
| request 800 | 980 |
| request 900 | 990 |
| request 1000 | 1000 |
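The table values can be reproduced with the hypothetical `memoryHighV127` helper sketched above (assuming the helper and the `fmt` import are in scope):

```go
func main() {
	// limits.memory = 1000, factor = 0.9, pageSize = 1 (values in Mi units)
	for req := int64(0); req <= 1000; req += 100 {
		fmt.Printf("request %4d -> memory.high %d\n", req,
			memoryHighV127(req, 1000, 0, 1, 0.9))
	}
}
```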

Node reserved resources (kube-reserved/system-reserved) are also taken into account: this is tied to `--enforce-node-allocatable`, and `memory.min` will be set accordingly.

Brief map as follows:
| type | memory.min | memory.high |
| -------- | -------- | -------- |
| container | requests.memory | floor[(requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory)) / pageSize] * pageSize |
| pod | sum(requests.memory) | N/A |
| node | pods, kube-reserved, system-reserved | N/A |

###### Reasons for changing the formula of memory.high calculation in Alpha v1.27

The formula for memory.high has changed in K8s v1.27 because the Alpha v1.22 implementation has the following problems:
1. It fails to throttle when requested memory is close to the memory limit (or node allocatable memory), as it results in memory.high being less than requests.memory.

For example, if `requests.memory = 85, limits.memory = 100`, and we have a throttling factor of 0.8, then as per the Alpha v1.22 implementation memory.high = memory throttling factor * limits.memory, i.e. memory.high = 80. Here the level at which throttling is supposed to occur, i.e. memory.high, is less than requests.memory, so there won't be any throttling: the Alpha v1.22 implementation doesn't allow memory.high to be set below the requested memory.

2. It could result in early throttling, putting the processes under early heavy reclaim pressure.

For example,
* `requests.memory` = 800Mi

  `memory throttling factor` = 0.8

  `limits.memory` = 1000Mi

  As per the Alpha v1.22 implementation,

  `memory.high` = memory throttling factor * limits.memory = 0.8 * 1000Mi = 800Mi

  This results in early throttling and puts the processes under heavy reclaim pressure at 800Mi memory usage levels. There's a significant gap of 200Mi between the memory throttle limit (800Mi) and the memory usage hard limit (1000Mi).
* `requests.memory` = 500Mi

  `memory throttling factor` = 0.6

  `limits.memory` = 1000Mi

  As per the Alpha v1.22 implementation,

  `memory.high` = memory throttling factor * limits.memory = 0.6 * 1000Mi = 600Mi

  Throttling occurs at 600Mi, which is just 100Mi over the requested memory. There's a significant gap of 400Mi between the memory throttle limit (600Mi) and the memory usage hard limit (1000Mi).

3. The default throttling factor of 0.8 may be too aggressive for some latency-sensitive applications that consistently use memory close to their memory limits.

For example, some known Java workloads that use 85% of their memory limit will start to get throttled once this feature is enabled by default. Hence the default memory throttling factor of 0.8 may not be a good value for many applications, as it induces throttling too early.

<br>
Some more examples comparing memory.high under Alpha v1.22 and Alpha v1.27 are listed below:

| Limit 1000Mi <br /> Request, factor | Alpha v1.22: memory.high = memory throttling factor \* memory.limit (or node allocatable if memory.limit is not set) | Alpha v1.27: memory.high = floor[(requests.memory + memory throttling factor \* (limits.memory or node allocatable memory - requests.memory)) / pageSize] \* pageSize, assuming 1Mi pageSize |
| -------------------------------- | ------------------------------------------------------- | ------------------------------------------------ |
| request 500Mi, factor 0.6 | 600Mi (very early throttling when memory usage is just 100Mi above requested memory; 400Mi unused) | 800Mi |
| request 800Mi, factor 0.6 | no throttling (600 < 800, i.e. memory.high < memory.request => no throttling) | 920Mi |
| request 1Gi, factor 0.6 | max | max |
| request 500Mi, factor 0.8 | 800Mi (early throttling at 800Mi, when 200Mi is unused) | 900Mi |
| request 850Mi, factor 0.8 | no throttling (800 < 850, i.e. memory.high < memory.request => no throttling) | 970Mi |
| request 500Mi, factor 0.4 | no throttling (400 < 500, i.e. memory.high < memory.request => no throttling) | 700Mi |

***Note***: As seen from the examples in the table, the formula used in the Alpha v1.27 implementation eliminates the cases of memory.high being less than memory.request. However, it can still result in early throttling if the memory throttling factor is set low. Hence, it is recommended to set a high memory throttling factor to avoid early throttling.

###### Quality of Service for Pods

In addition to the change in the formula for memory.high, we are also adding support for setting memory.high as per the `Quality of Service (QoS) for Pod` classes. Based on user feedback on Alpha v1.22, some users would like to opt out of Memory QoS on a per-pod basis to avoid early memory throttling; by making their pods Guaranteed, they will be able to do so. Guaranteed pods, by definition, are not overcommitted, so memory.high does not provide significant value for them.

Following are the different cases for setting memory.high as per the QoS classes (a combined sketch follows the list):
1. Guaranteed
Guaranteed pods, by their QoS definition, require memory requests to equal memory limits and are not overcommitted. Hence the Memory QoS feature is disabled for those pods by not setting memory.high. This ensures that Guaranteed pods can fully use their memory requests up to their set limit without hitting any throttling.

2. Burstable
Burstable pods, by their QoS definition, require at least one container in the pod with a CPU or memory request or limit set.

Case I: When requests.memory and limits.memory are set, the formula is used as-is:
```
memory.high = floor[(requests.memory + memory throttling factor * (limits.memory - requests.memory)) / pageSize] * pageSize
```

Case II: When requests.memory is set and limits.memory is not set, we substitute node allocatable memory for limits.memory in the formula:
```
memory.high = floor[(requests.memory + memory throttling factor * (node allocatable memory - requests.memory)) / pageSize] * pageSize
```

Case III: When requests.memory is not set and limits.memory is set, we set `requests.memory = 0` in the formula:
```
memory.high = floor[(memory throttling factor * limits.memory) / pageSize] * pageSize
```

3. BestEffort
The pod gets the BestEffort class if limits.memory and requests.memory are not set. We set `requests.memory = 0` and substitute node allocatable memory for limits.memory in the formula:
```
memory.high = floor[(memory throttling factor * node allocatable memory) / pageSize] * pageSize
```
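Taken together, the three QoS cases could look like the following sketch. The `QOSClass` type and the `memoryHigh` function are illustrative; the real kubelet uses its own QoS and cgroup machinery:

```go
type QOSClass int

const (
	Guaranteed QOSClass = iota
	Burstable
	BestEffort
)

// memoryHigh returns the container's memory.high value, or ok=false when
// memory.high should be left unset (Guaranteed pods).
func memoryHigh(qos QOSClass, requests, limits, nodeAllocatable, pageSize int64, factor float64) (int64, bool) {
	switch qos {
	case Guaranteed:
		// requests == limits by definition: no overcommit, no throttling needed.
		return 0, false
	case Burstable:
		upper := limits
		if upper == 0 { // Case II: no limit set, use node allocatable memory
			upper = nodeAllocatable
		}
		// Case III (no request set) falls out naturally with requests == 0.
		high := float64(requests) + factor*float64(upper-requests)
		return (int64(high) / pageSize) * pageSize, true
	default: // BestEffort: requests = 0, limit taken as node allocatable memory
		high := factor * float64(nodeAllocatable)
		return (int64(high) / pageSize) * pageSize, true
	}
}
```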

###### Alternative solutions for implementing memory.high

Alternative solutions that were discussed (but not preferred) before finalizing the implementation for memory.high are:
1. Allow customers to set memoryThrottlingFactor for each pod in annotations.

Proposal: Add a new annotation for customers to set memoryThrottlingFactor, overriding the kubelet-level memoryThrottlingFactor.
* Pros
  * Allows more flexibility.
  * Can be quickly implemented.
* Cons
  * Customers might not need per-pod memoryThrottlingFactor configuration.
  * It is too low-level a detail to expose to customers.
2. Allow customers to set memoryThrottlingFactor in the pod YAML.

Proposal: Add a new API field for customers to set memoryThrottlingFactor, overriding the kubelet-level memoryThrottlingFactor.
* Pros
  * Allows more flexibility.
* Cons
  * Customers might not need per-pod memoryThrottlingFactor configuration.
  * API changes take a lot of time, and we might eventually realize that customers don't need a per-pod setting.
  * It is too low-level a detail to expose to customers, and it is highly unlikely to get API approval.

***[Preferred Alternative]***: Considering the cons of the alternatives mentioned above, setting memory.high based on the pod's QoS class looks preferable for the following reasons:
* Memory QoS complies with QoS, which is a widely known concept.
* It is simple to understand, as it requires setting only one kubelet configuration option for the memory throttling factor.
* It doesn't involve API changes and doesn't expose low-level detail to customers.

### User Stories (Optional)
#### Memory Sensitive Workload
@@ -212,6 +365,7 @@ For `GA`, the introduced e2e tests will be promoted to conformance. It was also

#### Beta Graduation
- [cgroup_v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) is in `Beta`
- Metrics and graphs to show the amount of reclaim done on a cgroup as it moves from below-request to above-request to throttling
- Memory QoS is covered by unit and e2e-node tests
- Memory QoS supports containerd, cri-o and dockershim

keps/sig-node/2570-memory-qos/kep.yaml

Lines changed: 5 additions & 3 deletions
@@ -7,16 +7,18 @@ reviewers:
   - "@bobbypage"
   - "@mrunalp"
   - "@giuseppe"
+  - "@pacoxu"
 approvers:
   - "@derekwaynecarr"
 owning-sig: sig-node
 status: implementable
-editor: TBD
+editor: "@ndixita"
 creation-date: 2021-03-14
+last-updated: 2023-02-02
 stage: alpha
-latest-milestone: "v1.22"
+latest-milestone: "v1.27"
 milestone:
-  alpha: "v1.22"
+  alpha: "v1.27"
 feature-gates:
   - name: MemoryQoS
 components:
(Third changed file: a 48.6 KB file, not rendered here.)