Skip to content

Conversation

@sleepyfoodie
Copy link
Contributor

@sleepyfoodie sleepyfoodie commented Feb 6, 2025

From discussion in this issue, this PR adds a rate version of the irate rule introduced from this PR.

Also updates dashboards to use the new rate version of the rule.

Existing recording rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate will be kept for backwards-compatibility.

We tested a sum on the irate query (to avoid hitting limits when too many series are returned):

sum(
	sum by (cluster, namespace, pod, container) (
    	irate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
	) 
	* on (cluster, namespace, pod) group_left (node) 
	topk by (cluster, namespace, pod) (1, #dedup
	    max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
	)
)

This took an average of 6.53s to run based on three query requests:
Screenshot 2025-02-13 at 12 08 59

Then we also tested a sum on rate query :

sum(
	sum by (cluster, namespace, pod, container) (
    	rate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
	) 
	* on (cluster, namespace, pod) group_left (node) 
	topk by (cluster, namespace, pod) (1, #dedup
	    max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
	)
)

This took an average of 7.87s to run based on 4 query requests:
Screenshot 2025-02-13 at 11 00 14

The average evaluation time in our dataset for this rule is roughly ~12s. Since the above recording rules have roughly the same evaluation time, we can conclude that the recording rule evaluation time will double to ~24s, and still be under the 60s limit.
The 12s is the same as an evaluation on from here back on October 27, 2023.

sum by (cluster, namespace, pod, container) (
    rate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
) 
* on (cluster, namespace, pod) group_left (node) 
topk by (cluster, namespace, pod) (1, #dedup
    max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
)

Testing Steps:

New recording rule is healthy

We added the new recording rule (using rate) into our alerts rule. Recording rule here is shown as OK health.
Screenshot 2025-03-27 at 10 55 46

Then we tested that new recording rule query works and has data coming in.
Screenshot 2025-03-27 at 10 57 55
Screenshot 2025-03-27 at 10 59 19

Data on new recording rule query

We built new dashboards and prometheus rules with the new recording rule, and pointed our local docker container to the volume with the change. The dashboards with new recording rule has data coming in locally.
Screenshot 2025-03-27 at 10 57 22

Fixes #670
Fixes #679

@sleepyfoodie sleepyfoodie requested a review from skl February 7, 2025 13:22
@skl skl changed the title chore: revert irate function back to rate feat: new container cpu usage recording rule using rate() Feb 7, 2025
@skl
Copy link
Collaborator

skl commented Feb 7, 2025

FYI the tests have moved into tests/ directory since:

@sleepyfoodie sleepyfoodie force-pushed the serena/rate-function-irate-to-rate branch from 4436883 to c7ba140 Compare February 13, 2025 17:32
@skl skl added the enhancement New feature or request label Feb 13, 2025
@skl
Copy link
Collaborator

skl commented Feb 13, 2025

@bboreham would appreciate your review, as this request came from you originally.

Copy link

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#670 asked for it first, but I did summarise my opinion in #679.

Presumably the name mismatch is unintended, but it would be good to have some confidence the change as proposed has been tested.

Doubling up the rule will raise overall metrics cardinality by 1 per container, which is likely a low percentage, but a large absolute number for some users. Still, I agree this is the best way to avoid disrupting people who relied on the previous rule.

Maybe the irate version could be declared deprecated so that it could be removed after a suitable period?

prometheus.new(
'${datasource}',
'sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{%(clusterLabel)s="$cluster"}) by (namespace)' % $._config
'sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{%(clusterLabel)s="$cluster"}) by (namespace)' % $._config

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recording rule says node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate5m, but all the code changes say node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate.

@skl
Copy link
Collaborator

skl commented Feb 14, 2025

Maybe the irate version could be declared deprecated so that it could be removed after a suitable period?

This could fit with the recent move to semver, i.e. we could remove the irate recording rule in a future major release. There seems to have been a similar approach in KSM here.

@sleepyfoodie sleepyfoodie force-pushed the serena/rate-function-irate-to-rate branch from a957d7d to 13c42ae Compare February 14, 2025 20:01
@skl
Copy link
Collaborator

skl commented Feb 20, 2025

waiting on end-to-end test evidence before merge (rule --> dashboard)

@github-actions
Copy link

This PR has been automatically marked as stale because it has not
had any activity in the past 30 days.

The next time this stale check runs, the stale label will be
removed if there is new activity. The issue will be closed in 7
days if there is no new activity.

Thank you for your contributions!

@github-actions github-actions bot added the stale label Mar 23, 2025
@skl skl removed the stale label Mar 23, 2025
@skl skl added the keepalive Use to prevent automatic closing label Mar 23, 2025
@skl skl merged commit 834daaa into kubernetes-monitoring:master Mar 27, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request keepalive Use to prevent automatic closing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

irate is bad The change from rate() to irate() is a breaking change

3 participants