You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
CPU PSI pressure as mesured by the Kernel is not able to
discriminate self imposed pressure due to CPU throttling
on CPU limited containers from pressure caused
by resource contention with other neighbors.
See: kubernetes/enhancements#5062
In oder to filter out evident false positives,
we can introduce a new ActualUtilizationProfile
that is consuming CPU PSI pressure value only when the
actual average CPU utilization on the node is above a certain
threshold (70%), otherwise 0.
This is not a systematic solution but a mitigation.
Fixes: https://issues.redhat.com/browse/OCPBUGS-52341
Signed-off-by: Simone Tiraboschi <[email protected]>
Copy file name to clipboardExpand all lines: README.md
+4-3Lines changed: 4 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -181,9 +181,10 @@ the `profileCustomizations` field:
181
181
## Prometheus query profiles
182
182
The operator provides the following profiles:
183
183
-`PrometheusCPUUsage`: `instance:node_cpu:rate:sum` (metric available in OpenShift by default)
184
-
-`PrometheusCPUPSIPressure`: `rate(node_pressure_cpu_waiting_seconds_total[1m])` (`node_pressure_cpu_waiting_seconds_total` is a custom metric that needs to be provided)
185
-
-`PrometheusMemoryPSIPressure`: `rate(node_pressure_memory_waiting_seconds_total[1m])` (`node_pressure_memory_waiting_seconds_total` is a custom metric that needs to be provided)
186
-
-`PrometheusIOPSIPressure`: `rate(node_pressure_io_waiting_seconds_total[1m])` (`node_pressure_memory_waiting_seconds_total` is a custom metric that needs to be provided)
184
+
-`PrometheusCPUPSIPressure`: `rate(node_pressure_cpu_waiting_seconds_total[1m])` (`node_pressure_cpu_waiting_seconds_total` is reported in OpenShift only for nodes configured with psi=1 kernel argument)
185
+
-`PrometheusCPUPSIPressureByUtilization`: `avg by (instance) ( rate(node_pressure_cpu_waiting_seconds_total[1m])) and (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > 0.7 or avg by (instance) ( rate(node_pressure_cpu_waiting_seconds_total[1m])) * 0` (`node_pressure_cpu_waiting_seconds_total` is reported in OpenShift only for nodes configured with psi=1 kernel argument; the query is filtering out PSI pressure on nodes with average CPU utilization < 0.7 to filter out false positives pressure spikes due to self imposed CPU throttling)
186
+
-`PrometheusMemoryPSIPressure`: `rate(node_pressure_memory_waiting_seconds_total[1m])` (`node_pressure_memory_waiting_seconds_total` is reported in OpenShift only for nodes configured with psi=1 kernel argument)
187
+
-`PrometheusIOPSIPressure`: `rate(node_pressure_io_waiting_seconds_total[1m])` (`node_pressure_memory_waiting_seconds_total` is reported in OpenShift only for nodes configured with psi=1 kernel argument)
query="avg by (instance) ( rate(node_pressure_cpu_waiting_seconds_total[1m])) and (1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[1m]))) > 0.7 or avg by (instance) ( rate(node_pressure_cpu_waiting_seconds_total[1m])) * 0"
0 commit comments