Improvement: Preserve Data Distribution for Large Data Set #117

@NGA-TRAN

Description

Similar to a push-down model, preserving data distribution for large datasets allows parallel processing to continue deeper into the plan before results are merged, resulting in faster execution.

Let us look at TPC-H Q1:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

We experimented with various configurations—scale factors (1, 5, 10), file counts per table (3, 4, 8), partitioning settings (3, 4), and partitions per task (2, 4)—yet consistently observed a similar plan structure, as shown below.

[Image: the observed two-stage physical plan]
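
For reference, the plan shape can be reproduced by prefixing the query with EXPLAIN (a minimal sketch; the exact output format depends on the engine and version, and the distributed stage split shown in the image comes from the scheduler rather than EXPLAIN itself):

-- abbreviated Q1; the full query above plans with the same shape
explain
select
    l_returnflag,
    l_linestatus,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;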

In Stage 1, data is repartitioned by the group-by key (l_returnflag, l_linestatus). However, all output is then funneled into a single task on one worker in Stage 2 for the final aggregation and sort. For large datasets, it’s more efficient to keep the data distributed across both workers and push down the final aggregation and sort operations.
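This funneling is not forced by the query itself: every aggregate in Q1 decomposes into partial states that can be merged independently within each hash partition. A hand-written sketch of that decomposition for avg(l_quantity) (the l_orderkey % 4 bucket is a hypothetical stand-in for which worker's data slice a row came from):

-- two-phase evaluation of avg(l_quantity), written out by hand
with partial as (
    select
        l_returnflag,
        l_linestatus,
        l_orderkey % 4 as slice,        -- hypothetical 4-way data split
        sum(l_quantity) as s,           -- partial sum
        count(l_quantity) as c          -- partial count
    from lineitem
    where l_shipdate <= date '1998-09-02'
    group by l_returnflag, l_linestatus, l_orderkey % 4
)
select
    l_returnflag,
    l_linestatus,
    sum(s) / sum(c) as avg_qty          -- merging partials equals avg(l_quantity)
from partial
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

Because the Stage 1 repartition hashes on (l_returnflag, l_linestatus), all partial states for a given group land in the same partition, so this merge step could run in parallel, one partition per task, rather than on a single task.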

The improved plan would look like this:

[Image: the improved three-stage physical plan]

The revised plan consists of three stages, with operators like final aggregate and sort pushed into separate stages to maintain data distribution and parallelism.
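In text form, the three stages would look roughly like this (an illustrative sketch based on the description above, not the engine's exact operator names):

Stage 1 (parallel): scan lineitem -> filter on l_shipdate -> partial aggregate
                    -> hash-repartition on (l_returnflag, l_linestatus)
Stage 2 (parallel): final aggregate -> sort within each partition
Stage 3 (one task): merge the already-sorted partition streams

Only the inexpensive merge of a handful of pre-sorted group streams is serialized; the final aggregation and sort, the expensive operators, stay distributed across the workers.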
