-
Notifications
You must be signed in to change notification settings - Fork 1
Description
Background
In PR #116 (lambda enhancements), we solved an OOM issue on high-cardinality breakdown refinement queries by using ClickHouse's approx_top_count function. The technique uses bounded memory to find top-N candidates, then computes exact counts via an IN filter, with a UNION ALL to get full-dataset totals.
The query pattern:
WITH top_dims AS (
SELECT tupleElement(pair, 1) AS dim
FROM (
SELECT arrayJoin(approx_top_count(N)(col)) AS pair
FROM table WHERE ...
)
)
SELECT col AS dim, count() AS cnt, ...
FROM table WHERE ... AND col IN (SELECT dim FROM top_dims)
GROUP BY dim ORDER BY cnt DESC LIMIT N
UNION ALL
SELECT '' AS dim, count() AS cnt, ...
FROM table WHERE ...This ran in ~4s on 244M rows with 83M unique message values, vs the regular GROUP BY which OOMed at the 4GB memory limit.
Investigation Areas
-
Main CDN dashboard high-cardinality breakdowns: The main dashboard has high-cardinality facets (hosts, URLs, user agents, client IPs, referers, forwarded hosts, redirect locations) that currently use sampling for both initial and refinement passes. Could
approx_top_countreplace the refinement sampling for these too, giving exact counts without OOM risk? -
Facet search:
facet-search-initial.sqlandfacet-search-pattern.sqldoGROUP BYqueries that could benefit from bounded-memory top-N for very high-cardinality columns. -
Log investigation queries:
investigate-facet.sqldoesGROUP BYon arbitrary facets — high-cardinality ones could benefit. -
Preview breakdowns: Time range selection preview queries currently use sampling. Could
approx_top_countprovide better accuracy for short preview windows?
Current Implementation
- SQL template:
sql/queries/breakdown-approx-top.sql - Routing logic:
isApproxTopRefinement()injs/breakdowns/index.js - Totals extraction:
extractApproxTopTotals()injs/breakdowns/index.js - Tests:
js/breakdowns/approx-top.test.js