feat: new container cpu usage recording rule using rate() #1025
FYI the tests have moved into
Force-pushed from 4436883 to c7ba140.
@bboreham would appreciate your review, as this request came from you originally.
#670 asked for it first, but I did summarise my opinion in #679.
Presumably the name mismatch is unintended, but it would be good to have some confidence the change as proposed has been tested.
Doubling up the rule will raise overall metrics cardinality by 1 per container, which is likely a low percentage, but a large absolute number for some users. Still, I agree this is the best way to avoid disrupting people who relied on the previous rule.
Maybe the irate version could be declared deprecated so that it could be removed after a suitable period?
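As a rough illustration of the cardinality point in the comment above, the sketch below estimates how many series the extra rule adds. The container and total-series counts are hypothetical, and the one-series-per-container figure is taken from the comment rather than measured.

```python
def extra_series(containers: int, new_rules: int = 1) -> int:
    """Series added when every container gains `new_rules` recorded series."""
    return containers * new_rules


def relative_increase(total_series: int, containers: int) -> float:
    """Added series as a fraction of the existing series count."""
    return extra_series(containers) / total_series


# Hypothetical cluster: 50,000 containers, 10,000,000 active series in total.
added = extra_series(50_000)
fraction = relative_increase(10_000_000, 50_000)
print(f"{added} extra series, {fraction:.2%} of the total")
```

With these made-up numbers the increase is a small percentage but still tens of thousands of series, which is the trade-off described above.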
  prometheus.new(
    '${datasource}',
-   'sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{%(clusterLabel)s="$cluster"}) by (namespace)' % $._config
+   'sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{%(clusterLabel)s="$cluster"}) by (namespace)' % $._config
The recording rule says node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate5m, but all the code changes say node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate.
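A mismatch like this could be caught mechanically. The following is a minimal sketch, not part of this repository: it assumes the dashboard expressions and the recorded rule names have already been extracted into plain Python collections, and the helper name `undefined_recorded_names` is hypothetical.

```python
import re

# Recorded series names contain at least one colon (level:metric:operations).
RECORDED_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?::[A-Za-z0-9_]+)+")


def undefined_recorded_names(expressions, defined_rules):
    """Return recorded names used in expressions but not defined by any rule."""
    used = set()
    for expr in expressions:
        used.update(RECORDED_NAME.findall(expr))
    return sorted(used - set(defined_rules))


PREFIX = "node_namespace_pod_container:container_cpu_usage_seconds_total"

# Names as quoted in the comment above: the rule defines sum_rate5m,
# while the dashboard query references sum_rate.
defined = {PREFIX + ":sum_irate", PREFIX + ":sum_rate5m"}
dashboard_exprs = ['sum(' + PREFIX + ':sum_rate{cluster="$cluster"}) by (namespace)']

print(undefined_recorded_names(dashboard_exprs, defined))
```

Run against the names quoted above, the check reports the `sum_rate` reference as undefined.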
This could fit with the recent move to semver, i.e. we could remove the irate recording rule in a future major release. There seems to have been a similar approach in KSM here.
Co-authored-by: Stephen Lang <[email protected]>
Force-pushed from a957d7d to 13c42ae.
Waiting on end-to-end test evidence before merge (rule --> dashboard).
Following the discussion in this issue, this PR adds a rate version of the irate rule introduced in this PR.
It also updates the dashboards to use the new rate version of the rule.
Existing recording rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate will be kept for backwards compatibility.

We tested a sum on the irate query (to avoid hitting limits when too many series are returned):

    sum(
      sum by (cluster, namespace, pod, container) (
        irate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
      ) * on (cluster, namespace, pod) group_left (node) topk by (cluster, namespace, pod) (
        1, #dedup
        max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
      )
    )

This took an average of 6.53s to run based on three query requests.

Then we also tested a sum on the rate query:

    sum(
      sum by (cluster, namespace, pod, container) (
        rate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
      ) * on (cluster, namespace, pod) group_left (node) topk by (cluster, namespace, pod) (
        1, #dedup
        max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
      )
    )

This took an average of 7.87s to run based on four query requests.
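For readers less familiar with the difference between the two functions, here is a simplified sketch of what each computes over a 5m window. It is an approximation only: it ignores counter resets and the boundary extrapolation that Prometheus applies in rate(), and the sample values are made up.

```python
def simple_rate(samples):
    """Average per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter samples: (timestamp in seconds, cumulative CPU seconds).
# CPU usage is steady for four minutes, then spikes in the final minute.
samples = [(0, 0.0), (60, 6.0), (120, 12.0), (180, 18.0), (240, 24.0), (300, 60.0)]

print("rate :", simple_rate(samples))
print("irate:", simple_irate(samples))
```

The rate-style value averages the spike over the whole window, while the irate-style value reflects only the last scrape interval, which is why dashboards built on the irate rule look spikier.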

The average evaluation time in our dataset for this rule is ~12s. Since the irate and rate queries above take roughly the same time to run, we expect the recording rule evaluation time to roughly double to ~24s with both rules in place, which is still under the 60s limit.
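The estimate above can be written out as a small calculation. This sketch makes two assumptions that are not stated in the description: that both rules are evaluated one after the other, and that the new rule's cost scales with the measured rate/irate query-time ratio rather than being exactly equal.

```python
IRATE_QUERY_SECONDS = 6.53   # average of three requests, from the description
RATE_QUERY_SECONDS = 7.87    # average of four requests, from the description
CURRENT_EVAL_SECONDS = 12.0  # approximate evaluation time of the existing rule
EVAL_LIMIT_SECONDS = 60.0    # limit quoted in the description


def projected_eval_seconds(current, irate_query, rate_query):
    """Existing rule time plus the new rule, scaled by the measured query-time ratio."""
    return current + current * (rate_query / irate_query)


projected = projected_eval_seconds(
    CURRENT_EVAL_SECONDS, IRATE_QUERY_SECONDS, RATE_QUERY_SECONDS
)
headroom = EVAL_LIMIT_SECONDS - projected
print(f"projected: {projected:.1f}s, headroom: {headroom:.1f}s")
```

Under these assumptions the projection comes out a little above the plain doubling figure, but it stays well under the 60s limit.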
The 12s matches the evaluation time reported here on October 27, 2023.

    sum by (cluster, namespace, pod, container) (
      rate(container_cpu_usage_seconds_total{image!="",job="kube-system/cadvisor"}[5m])
    ) * on (cluster, namespace, pod) group_left (node) topk by (cluster, namespace, pod) (
      1, #dedup
      max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}) #dedup
    )

Testing Steps:
New recording rule is healthy
We added the new recording rule (using rate) to our alerts rule. The recording rule is shown here with OK health.

Then we tested that the new recording rule query works and has data coming in.

Data on new recording rule query
We built new dashboards and Prometheus rules with the new recording rule, and pointed our local Docker container at the volume containing the change. The dashboards using the new recording rule have data coming in locally.
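The first two checks could also be scripted against the Prometheus HTTP API instead of being read off the UI. The sketch below only parses responses shaped like those returned by `/api/v1/rules` and `/api/v1/query`; the canned responses, the group name and the helper names are hypothetical, and the rule name is the one quoted in the review comment, so the merged name may differ.

```python
NEW_RULE = "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate5m"


def rule_health(rules_response, rule_name):
    """Return the health of a recording rule in a /api/v1/rules response, or None."""
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "recording" and rule.get("name") == rule_name:
                return rule.get("health")
    return None


def has_data(query_response):
    """True if a /api/v1/query response succeeded and returned at least one series."""
    if query_response.get("status") != "success":
        return False
    return len(query_response.get("data", {}).get("result", [])) > 0


# Canned responses standing in for real API calls (values are made up).
sample_rules = {
    "status": "success",
    "data": {"groups": [{
        "name": "example-group",  # hypothetical group name
        "rules": [{"type": "recording", "name": NEW_RULE, "health": "ok"}],
    }]},
}
sample_query = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"namespace": "default"}, "value": [1700000000, "0.25"]},
    ]},
}

print(rule_health(sample_rules, NEW_RULE), has_data(sample_query))
```

In a real check the two dictionaries would come from HTTP requests to the Prometheus instance under test.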

Fixes #670
Fixes #679