Skip to content

Investigate using approx_top_count for other high-cardinality queries #127

@trieloff

Description

@trieloff

Background

In PR #116 (lambda enhancements), we solved an OOM issue on high-cardinality breakdown refinement queries by using ClickHouse's approx_top_count function. The technique uses bounded memory to find top-N candidates, then computes exact counts via an IN filter, with a UNION ALL to get full-dataset totals.

The query pattern:

WITH top_dims AS (
  SELECT tupleElement(pair, 1) AS dim
  FROM (
    SELECT arrayJoin(approx_top_count(N)(col)) AS pair
    FROM table WHERE ...
  )
)
SELECT col AS dim, count() AS cnt, ...
FROM table WHERE ... AND col IN (SELECT dim FROM top_dims)
GROUP BY dim ORDER BY cnt DESC LIMIT N
UNION ALL
SELECT '' AS dim, count() AS cnt, ...
FROM table WHERE ...

This ran in ~4s on 244M rows with 83M unique message values, vs the regular GROUP BY which OOMed at the 4GB memory limit.

Investigation Areas

  • Main CDN dashboard high-cardinality breakdowns: The main dashboard has high-cardinality facets (hosts, URLs, user agents, client IPs, referers, forwarded hosts, redirect locations) that currently use sampling for both initial and refinement passes. Could approx_top_count replace the refinement sampling for these too, giving exact counts without OOM risk?

  • Facet search: facet-search-initial.sql and facet-search-pattern.sql do GROUP BY queries that could benefit from bounded-memory top-N for very high-cardinality columns.

  • Log investigation queries: investigate-facet.sql does GROUP BY on arbitrary facets — high-cardinality ones could benefit.

  • Preview breakdowns: Time range selection preview queries currently use sampling. Could approx_top_count provide better accuracy for short preview windows?

Current Implementation

  • SQL template: sql/queries/breakdown-approx-top.sql
  • Routing logic: isApproxTopRefinement() in js/breakdowns/index.js
  • Totals extraction: extractApproxTopTotals() in js/breakdowns/index.js
  • Tests: js/breakdowns/approx-top.test.js

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions