Skip to content

[VL] Poor performance when aggregates contains sum/avg for floating type and the data mainly consists of unique keys #10367

@NEUpanning

Description

@NEUpanning

Backend

VL (Velox)

Bug description

Spark sql fragment that has 35 aggregate functions:

SELECT
supply_id,
mt_main_poi_id,
SUM(CASE WHEN partition_date = '2025-08-02' THEN trade_coupon_cnt ELSE 0 END) AS trade_coupon_cnt_1d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN trade_coupon_cnt ELSE 0 END) AS trade_coupon_cnt_7d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-20' AND '2025-08-02' THEN trade_coupon_cnt ELSE 0 END) AS trade_coupon_cnt_14d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-04' AND '2025-08-02' THEN trade_coupon_cnt ELSE 0 END) AS trade_coupon_cnt_30d,
SUM(CASE WHEN partition_date BETWEEN '2025-06-04' AND '2025-08-02' THEN trade_coupon_cnt ELSE 0 END) AS trade_coupon_cnt_60d,
SUM(CASE WHEN partition_date BETWEEN '2025-05-05' AND '2025-08-02' THEN trade_coupon_cnt ELSE 0 END) AS trade_coupon_cnt_90d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN trade_coupon_cnt ELSE 0 END) / 7.0 AS trade_coupon_cnt_7d_average,
SUM(CASE WHEN partition_date = '2025-08-02' THEN trade_order_cnt ELSE 0 END) AS trade_order_cnt_1d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN trade_order_cnt ELSE 0 END) AS trade_order_cnt_7d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-20' AND '2025-08-02' THEN trade_order_cnt ELSE 0 END) AS trade_order_cnt_14d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-04' AND '2025-08-02' THEN trade_order_cnt ELSE 0 END) AS trade_order_cnt_30d,
SUM(CASE WHEN partition_date BETWEEN '2025-06-04' AND '2025-08-02' THEN trade_order_cnt ELSE 0 END) AS trade_order_cnt_60d,
SUM(CASE WHEN partition_date BETWEEN '2025-05-05' AND '2025-08-02' THEN trade_order_cnt ELSE 0 END) AS trade_order_cnt_90d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN trade_order_cnt ELSE 0 END) / 7.0 AS trade_order_cnt_7d_average,
SUM(CASE WHEN partition_date = '2025-08-02' THEN gtv ELSE 0 END) AS gtv_1d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN gtv ELSE 0 END) AS gtv_7d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-20' AND '2025-08-02' THEN gtv ELSE 0 END) AS gtv_14d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-04' AND '2025-08-02' THEN gtv ELSE 0 END) AS gtv_30d,
SUM(CASE WHEN partition_date BETWEEN '2025-06-04' AND '2025-08-02' THEN gtv ELSE 0 END) AS gtv_60d,
SUM(CASE WHEN partition_date BETWEEN '2025-05-05' AND '2025-08-02' THEN gtv ELSE 0 END) AS gtv_90d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN gtv ELSE 0 END) / 7.0 AS gtv_7d_average,
SUM(CASE WHEN partition_date = '2025-08-02' THEN order_fact_pay_amt ELSE 0 END) AS order_fact_pay_amt_1d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN order_fact_pay_amt ELSE 0 END) AS order_fact_pay_amt_7d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-20' AND '2025-08-02' THEN order_fact_pay_amt ELSE 0 END) AS order_fact_pay_amt_14d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-04' AND '2025-08-02' THEN order_fact_pay_amt ELSE 0 END) AS order_fact_pay_amt_30d,
SUM(CASE WHEN partition_date BETWEEN '2025-06-04' AND '2025-08-02' THEN order_fact_pay_amt ELSE 0 END) AS order_fact_pay_amt_60d,
SUM(CASE WHEN partition_date BETWEEN '2025-05-05' AND '2025-08-02' THEN order_fact_pay_amt ELSE 0 END) AS order_fact_pay_amt_90d,
SUM(CASE WHEN partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN order_fact_pay_amt ELSE 0 END) / 7.0 AS order_fact_pay_amt_7d_average,
COUNT(DISTINCT CASE WHEN is_pay_dx = 1 AND partition_date = '2025-08-02' THEN partition_date ELSE NULL END) AS dx_pay_day_1d,
COUNT(DISTINCT CASE WHEN is_pay_dx = 1 AND partition_date BETWEEN '2025-07-27' AND '2025-08-02' THEN partition_date ELSE NULL END) AS dx_pay_day_7d,
COUNT(DISTINCT CASE WHEN is_pay_dx = 1 AND partition_date BETWEEN '2025-07-20' AND '2025-08-02' THEN partition_date ELSE NULL END) AS dx_pay_day_14d,
COUNT(DISTINCT CASE WHEN is_pay_dx = 1 AND partition_date BETWEEN '2025-07-04' AND '2025-08-02' THEN partition_date ELSE NULL END) AS dx_pay_day_30d,
COUNT(DISTINCT CASE WHEN is_pay_dx = 1 AND partition_date BETWEEN '2025-06-04' AND '2025-08-02' THEN partition_date ELSE NULL END) AS dx_pay_day_60d,
COUNT(DISTINCT CASE WHEN is_pay_dx = 1 AND partition_date BETWEEN '2025-05-05' AND '2025-08-02' THEN partition_date ELSE NULL END) AS dx_pay_day_90d
FROM
mart_aiosupply.daocan_app_aggr_supply_poi_deal_cube_d_extnl_view
WHERE
partition_date <= '2025-08-02'
AND partition_date >= '2025-05-05'
AND dim_type = 'poi#deal#type'
AND product_type IN (1, 2)
GROUP BY
supply_id,
mt_main_poi_id

vanilla spark aggregate metrics:

HashAggregate

spill size total (min, med, max (stageId: taskId))
0.0 B (0.0 B, 0.0 B, 0.0 B (stage 6.0: task 15017))
time in aggregation build total (min, med, max (stageId: taskId))
192.60 h (6.1 m, 11.6 m, 19.6 m (stage 6.0: task 15015))
peak memory total (min, med, max (stageId: taskId))
965.3 GiB (987.5 MiB, 988.4 MiB, 988.4 MiB (stage 6.0: task 15322))
number of output rows: 3,440,719,610
avg hash probe bucket list iters (min, med, max (stageId: taskId)):
(1.6, 1.6, 1.6 (stage 6.0: task 15017))

gluten aggregate metrics that are 3.5x total time compared to vanilla:

RegularHashAggregateExecTransformer

time of extraction total (min, med, max (stageId: taskId))
49.0 m (23 ms, 205 ms, 517 ms (stage 2.0: task 966))
number of memory allocations: 614,976,915
number of output vectors: 44,714,272
number of spilled bytes total (min, med, max (stageId: taskId))
1864.2 GiB (0.0 B, 131.6 MiB, 141.5 MiB (stage 2.0: task 491))
number of output rows: 182,915,536,712
number of output bytes total (min, med, max (stageId: taskId))
101.6 TiB (1179.8 MiB, 7.2 GiB, 7.6 GiB (stage 2.0: task 491))
number of spilled files: 612,736
time of aggregation total (min, med, max (stageId: taskId))
675.75 h (5.6 s, 2.8 m, 8.3 m (stage 2.0: task 9949))
peak memory bytes total (min, med, max (stageId: taskId))
53.4 TiB (1624.6 MiB, 3.8 GiB, 4.3 GiB (stage 2.0: task 12750))
number of spilled rows: 182,905,342,963
cpu wall time count: 478,741,203
number of spilled partitions: 115,360
extraction cpu wall time count: 402,803,446

Gluten version

Gluten-1.3

Spark version

Spark-3.5.x

Spark configurations

No response

System information

No response

Relevant logs

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingtriage

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions