[VL] Incorrect result with small Parquet pages + Velox aggregation pushdown #10366

@zhztheplayer

Backend

VL (Velox)

Bug description

Found in Gluten + Velox when running ClickBench Q31. The issue is reproducible with the default Gluten / Spark configurations; no specific options need to be set.

The minimal SQL to reproduce:

SELECT SearchEngineID, SUM(IsRefresh) FROM hits WHERE SearchPhrase <> '' AND (SearchEngineID in (1, 2, 4)) AND ClientIP = -807147100 GROUP BY SearchEngineID

Data:

hits.parquet from https://github.com/ClickHouse/ClickBench/?tab=readme-ov-file#data-loading

Spark answer:

+--------------+--------------+
|SearchEngineID|sum(IsRefresh)|
+--------------+--------------+
|             1|             1|
|             2|            35|
|             4|             3|
+--------------+--------------+

Gluten (Velox) answer:

+--------------+--------------+
|SearchEngineID|sum(IsRefresh)|
+--------------+--------------+
|             4|             1|
|             2|            37|
|             1|             1|
+--------------+--------------+
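
For triage, the two answers can be compared independently of row order. A minimal sketch in plain Python, with the rows copied from the two tables above; `diff_by_key` is a hypothetical helper name used only for illustration and is not part of Gluten, Velox, or Spark:

```python
# Result rows copied from the issue: (SearchEngineID, sum(IsRefresh)).
spark_rows = [(1, 1), (2, 35), (4, 3)]
gluten_rows = [(4, 1), (2, 37), (1, 1)]


def diff_by_key(expected, actual):
    """Return {key: (expected_value, actual_value)} for keys whose values differ.

    Hypothetical helper: compares by group key, so row order does not matter.
    """
    exp, act = dict(expected), dict(actual)
    return {
        key: (exp.get(key), act.get(key))
        for key in sorted(exp.keys() | act.keys())
        if exp.get(key) != act.get(key)
    }


mismatches = diff_by_key(spark_rows, gluten_rows)
spark_total = sum(value for _, value in spark_rows)
gluten_total = sum(value for _, value in gluten_rows)

for key, (want, got) in mismatches.items():
    print(f"SearchEngineID={key}: Spark={want}, Gluten={got}")
print(f"Grand totals: Spark={spark_total}, Gluten={gluten_total}")
```

On these rows, groups 2 and 4 differ (35 vs 37 and 3 vs 1) while the grand total is the same in both answers. That may suggest values are being attributed to the wrong group rather than dropped, though this is only an observation from the output and not a confirmed root cause.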

Gluten version

Latest

Spark version

3.5

Labels

bug, triage
