Simple drop of Phase 1 contents. Keep Phase 2 contents
We are splitting Phase 2 into its own KEP here to allow the two phases to move at different paces.
We plan to graduate Phase 1 to Beta while Phase 2 is still being developed and stays in Alpha.
@@ -85,7 +78,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

 ## Summary

-This KEP proposes adding support in kubelet to read Pressure Stall Information (PSI) metric pertaining to CPU, Memory and IO resources exposed from cAdvisor and runc. This will enable kubelet to report node conditions which will be utilized to prevent scheduling of pods on nodes experiencing significant resource constraints.
+This KEP proposes enabling kubelet to report node conditions which will be utilized to prevent scheduling of pods on nodes experiencing significant resource constraints.

 ## Motivation

@@ -96,13 +89,7 @@ In short, PSI metric are like barometers that provide fair warning of impending
 ### Goals

 This proposal aims to:
-1. Enable the kubelet to have the PSI metric of cgroupv2 exposed from cAdvisor and Runc.
-2. Enable the pod level PSI metric and expose it in the Summary API.
-3. Utilize the node level PSI metric to set node condition and node taints.
-
-It will have two phases:
-Phase 1: includes goal 1, 2
-Phase 2: includes goal 3
+1. Utilize the node level PSI metric to set node condition and node taints.

 ### Non-Goals

@@ -115,86 +102,17 @@ userspace OOM kills, and so on, for future KEPs.

 #### Story 1

-Today, to identify disruptions caused by resource crunches, Kubernetes users need to
-install node exporter to read PSI metric. With the feature proposed in this enhancement,
-PSI metric will be available for users in the Kubernetes metrics API.
-
-#### Story 2
-
 Kubernetes users want to prevent new pods to be scheduled on the nodes that have resource starvation. By using PSI metric, the kubelet will set Node Condition to avoid pods being scheduled on nodes under high resource pressure. The node controller could then set a [taint on the node based on these new Node Conditions](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition).

 ### Risks and Mitigations

-There are no significant risks associated with Phase 1 implementation that involves integrating
-the PSI metric in kubelet from either from cadvisor runc libcontainer library or kubelet's CRI runc libcontainer implementation which doesn't involve any shelled binary operations.
-
-Phase 2 involves utilizing the PSI metric to report node conditions. There is a potential
+There is a potential
 risk of early reporting for nodes under pressure. We intend to address this concern
 by conducting careful experimentation with PSI threshold values to identify the optimal
 default threshold to be used for reporting the nodes under heavy resource pressure.

 ## Design Details

-#### Phase 1
-1. Add new Data structures PSIData and PSIStats corresponding to the PSI metric output format as following:
-
-```
-some avg10=0.00 avg60=0.00 avg300=0.00 total=0
-full avg10=0.00 avg60=0.00 avg300=0.00 total=0
-```
-
-```go
-type PSIData struct {
-	Avg10  *float64 `json:"avg10"`
-	Avg60  *float64 `json:"avg60"`
-	Avg300 *float64 `json:"avg300"`
-	Total  *float64 `json:"total"`
-}
-
-type PSIStats struct {
-	Some *PSIData `json:"some,omitempty"`
-	Full *PSIData `json:"full,omitempty"`
-}
-```
-
-2. Summary API includes stats for both system and kubepods level cgroups. Extend the Summary API to include PSI metric data for each resource obtained from cadvisor.
-Note: if cadvisor-less is implemented prior to the implementation of this enhancement, the PSI
-metric data will be available through CRI instead.
-
-##### CPU
-```go
-type CPUStats struct {
-	// PSI stats of the overall node
-	PSI cadvisorapi.PSIStats `json:"psi,omitempty"`
-}
-```
-
-##### Memory
-```go
-type MemoryStats struct {
-	// PSI stats of the overall node
-	PSI cadvisorapi.PSIStats `json:"psi,omitempty"`
-}
-```
-
-##### IO
-```go
-// IOStats contains data about IO usage.
-type IOStats struct {
-	// The time at which these stats were updated.
-	Time metav1.Time `json:"time"`
-
-	// PSI stats of the overall node
-	PSI cadvisorapi.PSIStats `json:"psi,omitempty"`
-}
-
-type NodeStats struct {
-	// Stats about the IO pressure of the node
-	IO *IOStats `json:"io,omitempty"`
-}
-```
-
-#### Phase 2 to add PSI based actions.
 **Note:** These actions are tentative, and will depend on different the outcome from testing and discussions with sig-node members, users, and other folks.

 1. Introduce a new kubelet config parameter, pressure threshold, to let users specify the pressure percentage beyond which the kubelet would report the node condition to disallow workloads to be scheduled on it.
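
To illustrate the pressure threshold described in item 1, here is a minimal Go sketch that is not taken from the KEP. It parses one line of the PSI output format quoted in the removed Phase 1 text (`some avg10=... avg60=... avg300=... total=...`) and compares a chosen average against a percentage threshold. The names `PSILine`, `parsePSILine`, and `underPressure`, and the choice of the 60-second average, are assumptions made for illustration; the KEP leaves the actual parameter name, window, and default value open.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PSILine holds one parsed line ("some" or "full") of a PSI pressure file.
// Hypothetical type for this sketch; it is not the kubelet's data structure.
type PSILine struct {
	Kind   string  // "some" or "full"
	Avg10  float64 // percentage of time stalled over the last 10 seconds
	Avg60  float64 // percentage of time stalled over the last 60 seconds
	Avg300 float64 // percentage of time stalled over the last 300 seconds
	Total  uint64  // cumulative stall time in microseconds
}

// parsePSILine parses a line such as
// "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
func parsePSILine(line string) (PSILine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return PSILine{}, fmt.Errorf("expected 5 fields, got %d", len(fields))
	}
	out := PSILine{Kind: fields[0]}
	for _, kv := range fields[1:] {
		key, val, ok := strings.Cut(kv, "=")
		if !ok {
			return PSILine{}, fmt.Errorf("malformed field %q", kv)
		}
		var err error
		switch key {
		case "avg10":
			out.Avg10, err = strconv.ParseFloat(val, 64)
		case "avg60":
			out.Avg60, err = strconv.ParseFloat(val, 64)
		case "avg300":
			out.Avg300, err = strconv.ParseFloat(val, 64)
		case "total":
			out.Total, err = strconv.ParseUint(val, 10, 64)
		default:
			err = fmt.Errorf("unknown key %q", key)
		}
		if err != nil {
			return PSILine{}, err
		}
	}
	return out, nil
}

// underPressure reports whether the 60-second average meets or exceeds the
// threshold percentage. Using Avg60 is an assumption of this sketch.
func underPressure(l PSILine, thresholdPercent float64) bool {
	return l.Avg60 >= thresholdPercent
}

func main() {
	l, err := parsePSILine("some avg10=12.50 avg60=8.00 avg300=1.25 total=123456")
	if err != nil {
		panic(err)
	}
	fmt.Println(l.Kind, l.Avg60, underPressure(l, 5.0))
}
```

A longer averaging window reacts more slowly but is less likely to flag a node because of a short spike, which is the early-reporting risk the KEP calls out under Risks and Mitigations.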
@@ -318,15 +236,9 @@ We expect no non-infra related flakes in the last month as a GA graduation crite

 ### Graduation Criteria

-#### Phase 1: Alpha
-
-- PSI integrated in kubelet behind a feature flag.
-- Unit tests to check the fields are populated in the
-Summary API response.
+#### Alpha

-#### Phase 2: Alpha
-
-- Implement Phase 2 of the enhancement which enables kubelet to
+- Enables kubelet to
 report node conditions based off PSI values.
 - Initial e2e tests completed and enabled if CRI implementation supports
 it.
@@ -407,7 +319,7 @@ well as the [existing list] of feature gates.

 - [X] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: PSINodeCondition
-  - Components depending on the feature gate: kubelet
+  - Components depending on the feature gate: kubelet, kube-controller-manager, kube-scheduler
 - [ ] Other
   - Describe the mechanism:
   - Will enabling / disabling the feature require downtime of the control
@@ -421,7 +333,7 @@ well as the [existing list] of feature gates.
 Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->
-Not in Phase 1. Phase 2 is TBD in K8s 1.31.
+TBD

 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

@@ -513,11 +425,6 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
-For Phase 1:
-Use `kubectl get --raw "/api/v1/nodes/{$nodeName}/proxy/stats/summary"` to call Summary API. If the PSIStats field is seen in the API response,
-the feature is available to be used by workloads.
-
-For Phase 2:
 TBD

 ###### How can someone using this feature know that it is working for their instance?
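
The Phase 1 check removed in this hunk (look for PSI data in the Summary API response) could be automated roughly as follows. This is a hedged Go sketch: the `psi` JSON key comes from the struct tags in the removed Phase 1 design text, while the surrounding `node`/`cpu`/`memory`/`io` layout, the type names, and the hand-written sample payload are assumptions for illustration and not captured kubelet output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// psiStats keeps the "some"/"full" values raw because this sketch only
// checks for presence, not contents.
type psiStats struct {
	Some json.RawMessage `json:"some,omitempty"`
	Full json.RawMessage `json:"full,omitempty"`
}

// resourceStats models only the "psi" key of a per-resource stats object.
type resourceStats struct {
	PSI *psiStats `json:"psi,omitempty"`
}

// summary models only the node-level parts of the response needed here.
type summary struct {
	Node struct {
		CPU    *resourceStats `json:"cpu,omitempty"`
		Memory *resourceStats `json:"memory,omitempty"`
		IO     *resourceStats `json:"io,omitempty"`
	} `json:"node"`
}

// hasNodePSI reports whether any node-level resource carries a "psi" object.
func hasNodePSI(raw []byte) (bool, error) {
	var s summary
	if err := json.Unmarshal(raw, &s); err != nil {
		return false, err
	}
	for _, r := range []*resourceStats{s.Node.CPU, s.Node.Memory, s.Node.IO} {
		if r != nil && r.PSI != nil {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hand-written sample, not real kubelet output.
	sample := []byte(`{"node":{"cpu":{"psi":{"some":{"avg10":0.5},"full":{"avg10":0}}},"memory":{}}}`)
	ok, err := hasNodePSI(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("PSI present:", ok)
}
```

In practice `raw` would be the body returned by the `kubectl get --raw` call quoted in the removed text.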
@@ -658,12 +565,10 @@ NA
 ## Implementation History

 - 2023/09/13: Initial proposal
+- 2025/06/11: Only keep Phase 2 contents in this new KEP. Phase 1 contents are kept in the original KEP.

 ## Drawbacks

-No drawbacks in Phase 1 identified. There's no reason the enhancement should not be
-implemented. This enhancement now makes it possible to read PSI metric without installing