Description
Similar to a push-down model, preserving data distribution for large datasets allows parallel processing to continue longer, resulting in faster execution.
Let us look at TPC-H Q1:

```sql
select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;
```

We experimented with various configurations (scale factors 1, 5, and 10; file counts per table of 3, 4, and 8; partitioning settings of 3 and 4; and 2 or 4 partitions per task), yet consistently observed a similar plan structure, as described below.
In Stage 1, data is repartitioned by the group-by key (l_returnflag, l_linestatus). However, all of that output is then funneled into a single task on one worker in Stage 2 for the final aggregation and sort. For large datasets, it is more efficient to keep the data distributed across the workers and push the final aggregation and sort down into distributed stages.
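To see why the final aggregation does not need a single task, here is a hand-written two-phase equivalent of Q1. This is only a sketch, not the engine's actual rewrite: the `partition_id` column is hypothetical and stands in for whichever input partition a row came from.

```sql
-- Phase 1: partial aggregation, computed independently for every input partition.
-- ("partition_id" is a hypothetical column used only for illustration.)
with q1_partial as (
    select
        partition_id,
        l_returnflag,
        l_linestatus,
        sum(l_quantity)                                       as p_sum_qty,
        sum(l_extendedprice)                                  as p_sum_base_price,
        sum(l_extendedprice * (1 - l_discount))               as p_sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as p_sum_charge,
        sum(l_discount)                                       as p_sum_disc,
        count(*)                                              as p_count
    from lineitem
    where l_shipdate <= date '1998-09-02'
    group by partition_id, l_returnflag, l_linestatus
)
-- Phase 2: final aggregation merges the small partial rows; averages are
-- rebuilt from sums and counts rather than recomputed over raw rows.
select
    l_returnflag,
    l_linestatus,
    sum(p_sum_qty)                        as sum_qty,
    sum(p_sum_base_price)                 as sum_base_price,
    sum(p_sum_disc_price)                 as sum_disc_price,
    sum(p_sum_charge)                     as sum_charge,
    sum(p_sum_qty)        / sum(p_count)  as avg_qty,
    sum(p_sum_base_price) / sum(p_count)  as avg_price,
    sum(p_sum_disc)       / sum(p_count)  as avg_disc,
    sum(p_count)                          as count_order
from q1_partial
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
```

Once the partial results are hash-partitioned on (l_returnflag, l_linestatus), every group lives entirely within one partition, so the final aggregation and the per-partition sort can run on all workers in parallel, with only a merge of the sorted partitions needed at the end.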
The improved plan would consist of three stages, with operators such as the final aggregate and the sort pushed into separate stages to maintain data distribution and parallelism.
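As a rough sketch only (not the actual plan from this issue, and the exact stage boundaries may differ), the described three-stage layout could look like:

```
Stage 1: scan lineitem -> filter l_shipdate <= date '1998-09-02'
         -> partial aggregate by (l_returnflag, l_linestatus)
         -> hash repartition on (l_returnflag, l_linestatus)
Stage 2: final aggregate per hash partition (runs on all workers)
         -> sort by (l_returnflag, l_linestatus) within each partition
Stage 3: sort-preserving merge of the sorted partitions into the final result
```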