
Optimize DistinctCountHLL aggregation for high-cardinality dictionary-encoded columns #17336

@praveenc7

Description


Problem

The DISTINCTCOUNTHLL aggregation function suffers severe performance degradation when processing high-cardinality dictionary-encoded columns (e.g., ~14 million distinct values). Profiling shows that ~50% of CPU time is spent in RoaringBitmap operations:

[Profiling screenshot: RoaringBitmap operations dominate CPU time]

For dictionary-encoded columns, the current implementation uses RoaringBitmap to track dictionary IDs during aggregation. While memory-efficient for low cardinality, this approach has O(n log n) insertion complexity that becomes prohibitively expensive for high-cardinality columns (>100K distinct values).

Queries on high-cardinality columns (1M - 15M distinct values, e.g., user IDs or member IDs) take about 6 - 10 seconds, with RoaringBitmap operations dominating query execution time. As a result, HLL provides no performance benefit over an exact distinct count.

Proposed Solution

Implement adaptive cardinality handling that dynamically switches from RoaringBitmap to HyperLogLog:

  1. Low cardinality: use RoaringBitmap (memory-efficient, exact counts)
  2. High cardinality: convert to HyperLogLog (O(1) insertions)

Tested with POC code that chooses HyperLogLog for the high-cardinality column; query time improved from ~8 sec to ~700 ms.
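The switching logic above can be sketched as follows. This is a minimal illustration, not the actual POC: the class name, the `SWITCH_THRESHOLD` constant, the use of `java.util.BitSet` as a stand-in for RoaringBitmap, and the toy HLL implementation are all assumptions for the sketch; a real implementation would use Pinot's existing RoaringBitmap and HLL structures.

```java
import java.util.BitSet;

// Sketch of adaptive cardinality handling: exact bitmap of dictionary IDs
// at low cardinality, converted to an HLL sketch once a threshold is crossed.
public class AdaptiveDistinctCounter {
  private static final int SWITCH_THRESHOLD = 100_000; // hypothetical cutoff
  private static final int P = 12;                     // 2^12 = 4096 HLL registers
  private static final int M = 1 << P;

  private BitSet bitmap = new BitSet(); // exact phase (stand-in for RoaringBitmap)
  private byte[] registers;             // HLL phase; null while still exact
  private int exactCount = 0;

  public void add(int dictId) {
    if (registers == null) {
      if (!bitmap.get(dictId)) {
        bitmap.set(dictId);
        exactCount++;
        if (exactCount > SWITCH_THRESHOLD) {
          convertToHll(); // one-time O(threshold) conversion
        }
      }
    } else {
      hllAdd(hash64(dictId)); // O(1) per insertion after the switch
    }
  }

  // Replay every dictionary ID seen so far into the HLL, then drop the bitmap.
  private void convertToHll() {
    registers = new byte[M];
    for (int id = bitmap.nextSetBit(0); id >= 0; id = bitmap.nextSetBit(id + 1)) {
      hllAdd(hash64(id));
    }
    bitmap = null;
  }

  private void hllAdd(long h) {
    int idx = (int) (h >>> (64 - P)); // top P bits pick the register
    // Rank = leading zeros of the remaining bits + 1; sentinel bit caps the rank.
    byte rank = (byte) (Long.numberOfLeadingZeros((h << P) | (1L << (P - 1))) + 1);
    if (rank > registers[idx]) {
      registers[idx] = rank;
    }
  }

  public long estimate() {
    if (registers == null) {
      return exactCount; // still exact
    }
    double sum = 0;
    int zeros = 0;
    for (byte r : registers) {
      sum += 1.0 / (1L << r);
      if (r == 0) {
        zeros++;
      }
    }
    double alpha = 0.7213 / (1 + 1.079 / M);
    double e = alpha * M * (double) M / sum;
    if (e <= 2.5 * M && zeros > 0) {
      e = M * Math.log((double) M / zeros); // small-range correction
    }
    return Math.round(e);
  }

  public boolean isExact() {
    return registers == null;
  }

  private static long hash64(long x) { // splitmix64 finalizer as a stand-in hash
    x += 0x9E3779B97F4A7C15L;
    x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
    x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
    return x ^ (x >>> 31);
  }
}
```

The one-time conversion cost is bounded by the threshold, so the amortized per-row cost stays O(1) regardless of column cardinality, while low-cardinality columns keep exact counts.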
