Skip to content

[Enhancement]: Worst-case performance of most join-based queries can be improved #1034

@solidDoWant

Description

@solidDoWant

What's the general idea for the enhancement?

Quite a few queries in this repo make use of a similar pattern:

sum(
    rate(some_metric{namespace!=""})
    * on(cluster,namespace,pod) group_left()
    kube_pod_info{host_network="false"} # Sometimes this is wrapped by `topk`
) by (namespace) # Or pod

This is essentially just:

  1. Calculate some metric of interest on a (potentially) very large number of series
  2. Perform a join between this metric and another metric with pod information
  3. Sum the resulting metric

Let S represent the number of series returned by rate(some_metric{namespace!=""}), and P be the number of pods in the cluster (across all namespaces), and N be the number of namespaces. Note that in some cases S >> P, if there are additional labels (such as the interface label on network metrics when a pod has multiple metrics).

The series returned in (1) are all eventually grouped by just a few labels, and then aggregated over one. With the way this query is currently written, the query engine needs to store S series for the rate calculation metric, perform a join on P series for pod info (S * P worst case, P * P best case), and then sum the joined series into a resulting N series.

Here's where this can be optimized. If the rate query is summed over the labels used by the join and the outer labels, the resulting number of time series is cut down from S to P. The subsequent join now results in P * P in every case. This reduces the number of series that needs to be stored, and processed, reducing the memory requirements significantly (S / Px less memory required). While summing multiple times is less efficient, there are fewer values to sum. I measured a 20% decrease in latency when switching to this (below).

Please provide any helpful snippets.

Here's an example pulled from the cluster networking dashboard. There are a few dozen places that this could be implemented - just search the repo for group_left\s*\(\s*\).

sum(
  rate(container_network_receive_bytes_total{cluster="",namespace!=""}[3s])
    * on(cluster,namespace,pod) group_left()
  topk(
    1,
    max(kube_pod_info{host_network="false"}) by(cluster,namespace,pod)
  ) by(cluster,namespace,pod)
) by(namespace)

This requires an estimated (by the query engine I'm using) 57,899,664 bytes to process the rate( rate(container_network_receive_bytes_total{cluster="",namespace!=""}[3s]) query, and another 5,095,680 bytes for the max(kube_pod_info{host_network="false"}) by(cluster,namespace,pod) query. The total execution time was about 206ms.

Here's what the query looks like with this improvement:

sum(
  sum(
    rate(container_network_receive_bytes_total{cluster="",namespace!=""}[3s])
  ) by(cluster,namespace,pod)
    * on(cluster,namespace,pod) group_left()
  topk(
    1,
    max(kube_pod_info{host_network="false"}) by(cluster,namespace,pod)
  ) by(cluster,namespace,pod)
) by(namespace)

This requires an estimated (by the query engine I'm using) 10,616,000 bytes to process the rate( rate(container_network_receive_bytes_total{cluster="",namespace!=""}[3s]) query, and another 1,900,264 bytes for the max(kube_pod_info{host_network="false"}) by(cluster,namespace,pod) query. The total execution time was about 166ms.

What parts of the codebase does the enhancement target?

Dashboards, Alerts, Rules

Anything else relevant to the enhancement that would help with the triage process?

No response

I agree to the following terms:

  • I agree to follow this project's Code of Conduct.
  • I have filled out all the required information above to the best of my ability.
  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have confirmed this proposal applies to the default branch of the repository, as of the latest commit at the time of submission.

Metadata

Metadata

Assignees

No one assigned

    Labels

    keepaliveUse to prevent automatic closing

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions