feat: new container cpu usage recording rule using rate() #1025

sleepyfoodie · 2025-02-06T18:40:56Z

Following the discussion in this issue, this PR adds a rate() version of the irate() recording rule introduced in this earlier PR.

It also updates the dashboards to use the new rate version of the rule.

The existing recording rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is kept for backwards compatibility.

We first tested the irate query wrapped in an outer sum (to avoid hitting limits when too many series are returned):

sum(
    sum by (cluster, namespace, pod, container) (
        irate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
    )
    * on (cluster, namespace, pod) group_left (node)
    topk by (cluster, namespace, pod) (1, #dedup
        max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
    )
)

This took an average of 6.53s to run, based on three query requests.

We then tested the same outer sum on the rate query:

sum(
    sum by (cluster, namespace, pod, container) (
        rate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
    )
    * on (cluster, namespace, pod) group_left (node)
    topk by (cluster, namespace, pod) (1, #dedup
        max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
    )
)

This took an average of 7.87s to run, based on four query requests.

The average evaluation time in our dataset for the existing irate rule is ~12s. Since the irate and rate queries above take roughly the same time to run, we expect the combined evaluation time to roughly double to ~24s once both rules are in place, which is still under the 60s limit.
The 12s figure matches an evaluation from here back on October 27, 2023.

sum by (cluster, namespace, pod, container) (
    rate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
) 
* on (cluster, namespace, pod) group_left (node) 
topk by (cluster, namespace, pod) (1, #dedup
    max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
)
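For readers who consume the rules as plain Prometheus rule files rather than through Jsonnet, the two rules could sit side by side roughly as sketched below. This is only an illustration, not the generated output of the mixin: the group name and the 1m interval are assumptions, the selector is copied from the test queries above (the mixin itself templates it), and the name sum_rate5m is the one quoted later in the review.

```yaml
# Sketch of a rule file carrying both the old and the new recording rule.
# Assumptions: group name and interval are illustrative only; the label
# selector is the one used in the PR's manual test queries.
groups:
  - name: container-cpu-usage-example   # hypothetical group name
    interval: 1m                        # assumed evaluation interval
    rules:
      # Existing rule, kept for backwards compatibility.
      - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
        expr: |
          sum by (cluster, namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
          )
          * on (cluster, namespace, pod) group_left (node)
          topk by (cluster, namespace, pod) (1,
            max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
          )
      # New rule: same shape, but rate() averages over the whole 5m window
      # instead of using only the last two samples.
      - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate5m
        expr: |
          sum by (cluster, namespace, pod, container) (
            rate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
          )
          * on (cluster, namespace, pod) group_left (node)
          topk by (cluster, namespace, pod) (1,
            max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
          )
```

Keeping both rules in one group means they are evaluated one after the other on every interval, which is why the evaluation times add up rather than overlap.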

Testing Steps:

New recording rule is healthy

We added the new recording rule (using rate) to our alert rules. The recording rule is shown there with health OK.

Then we tested that new recording rule query works and has data coming in.

Data on new recording rule query

We built new dashboards and Prometheus rules with the new recording rule, and pointed our local Docker container to the volume with the change. The dashboards using the new recording rule have data coming in locally.
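The rule-level part of these checks can also be reproduced offline with a promtool unit test. The following is a minimal sketch, not the test added in this PR. It assumes a rule file named rules.yaml (hypothetical) that defines the sum_rate5m rule with the rate expression shown earlier, and all pod, node, image, and cluster values are made up. The input counter grows by 60 CPU-seconds per minute, so the rule is expected to record a steady rate of 1 core.

```yaml
# Hypothetical promtool unit test for the rate-based recording rule.
# Run with: promtool test rules test.yaml
rule_files:
  - rules.yaml            # assumed file containing the sum_rate5m rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One container using exactly one core: +60 CPU-seconds every minute.
      - series: 'container_cpu_usage_seconds_total{cluster="c1", namespace="default", pod="pod-1", container="app", image="example/app:1.0", job="kube-system/cadvisor"}'
        values: '1000+60x10'
      # Pod-to-node mapping used by the group_left(node) join.
      - series: 'kube_pod_info{cluster="c1", namespace="default", pod="pod-1", node="node-1", job="kube-state-metrics"}'
        values: '1+0x10'
    promql_expr_test:
      - expr: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate5m
        eval_time: 5m
        exp_samples:
          # The result carries the labels of the inner sum plus node from the join.
          - labels: 'node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate5m{cluster="c1", container="app", namespace="default", node="node-1", pod="pod-1"}'
            value: 1
```

A test like this covers only the rule itself. The dashboard side still needs the kind of manual check described above.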

Fixes #670
Fixes #679

skl · 2025-02-07T18:55:27Z

FYI the tests have moved into the tests/ directory since:

tests: add more tests #1002


skl · 2025-02-13T18:24:25Z

@bboreham would appreciate your review, as this request came from you originally.

bboreham

#670 asked for it first, but I did summarise my opinion in #679.

Presumably the name mismatch is unintended, but it would be good to have some confidence the change as proposed has been tested.

Doubling up the rule will raise overall metrics cardinality by 1 per container, which is likely a low percentage, but a large absolute number for some users. Still, I agree this is the best way to avoid disrupting people who relied on the previous rule.

Maybe the irate version could be declared deprecated so that it could be removed after a suitable period?

bboreham · 2025-02-14T09:30:49Z

dashboards/resources/cluster.libsonnet

          prometheus.new(
            '${datasource}',
-            'sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{%(clusterLabel)s="$cluster"}) by (namespace)' % $._config
+            'sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{%(clusterLabel)s="$cluster"}) by (namespace)' % $._config


The recording rule says node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate5m, but all the code changes say node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate.

skl · 2025-02-14T11:25:51Z

> Maybe the irate version could be declared deprecated so that it could be removed after a suitable period?

This could fit with the recent move to semver, i.e. we could remove the irate recording rule in a future major release. There seems to have been a similar approach in KSM here.

skl · 2025-02-20T19:13:06Z

waiting on end-to-end test evidence before merge (rule --> dashboard)

github-actions · 2025-03-23T00:29:01Z

This PR has been automatically marked as stale because it has not
had any activity in the past 30 days.

The next time this stale check runs, the stale label will be
removed if there is new activity. The issue will be closed in 7
days if there is no new activity.

Thank you for your contributions!

sleepyfoodie requested review from povilasv and skl as code owners February 6, 2025 18:40

skl requested changes Feb 6, 2025

sleepyfoodie requested a review from skl February 7, 2025 13:22

skl reviewed Feb 7, 2025

skl changed the title ~~chore: revert irate function back to rate~~ feat: new container cpu usage recording rule using rate() Feb 7, 2025

sleepyfoodie added 3 commits February 13, 2025 12:30

chore: revert irate function back to rate

943b74d

chore: keep irate for node_namespace_pod_ontainer and add rate in as well

d9264b5

chore: add test for apps

c7ba140

sleepyfoodie force-pushed the serena/rate-function-irate-to-rate branch from 4436883 to c7ba140 Compare February 13, 2025 17:32

skl approved these changes Feb 13, 2025

skl added the enhancement label Feb 13, 2025

bboreham reviewed Feb 14, 2025

sleepyfoodie and others added 3 commits February 14, 2025 13:53

Update tests/tests.yaml

ae2c9a6

Co-authored-by: Stephen Lang <[email protected]>

chore: add 5m to queries to match recording name

c2a9315

chore: update readme

13c42ae

sleepyfoodie force-pushed the serena/rate-function-irate-to-rate branch from a957d7d to 13c42ae Compare February 14, 2025 20:01

sleepyfoodie added 2 commits February 14, 2025 15:11

update readme

b39c5e8

chore: fix readme formatting

48b28f4

skl reviewed Feb 17, 2025

sleepyfoodie and others added 3 commits February 18, 2025 08:40

Update README.md

ff2894b

Co-authored-by: Stephen Lang <[email protected]>

Update dashboards/resources/node.libsonnet

b94046a

Co-authored-by: Stephen Lang <[email protected]>

try adding space

c7cf95c

github-actions bot added the stale label Mar 23, 2025

skl removed the stale label Mar 23, 2025

skl added the keepalive label Mar 23, 2025

skl merged commit 834daaa into kubernetes-monitoring:master Mar 27, 2025
9 checks passed
