Improvement: Assign each worker to the corresponding non-overlapping partitioned tasks to minimize network communication

By assigning workers to tasks based on how the underlying data is partitioned, we can reduce network communication and avoid unnecessary data serialization and deserialization.

### Let us look at this input plan
Ref: See https://github.com/datafusion-contrib/datafusion-distributed/issues/117 for full context of this query

<img width="871" height="922" alt="Image" src="https://github.com/user-attachments/assets/70f256fa-5b18-488b-b37e-a94003faf062" />

If the files are partitioned or sorted by the hash key (`flag`, `status`), we can split them into non-overlapping groups, so that p0 is identical to p0′, p1 to p1′, and p2 to p2′. With this alignment, assigning workers as shown below eliminates data transfer between workers, except in the final stage.

<img width="907" height="977" alt="Image" src="https://github.com/user-attachments/assets/6b8fc3a4-f87a-4a62-ba00-e68214c2dede" />
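To make the idea concrete, here is a minimal sketch of the assignment rule, not the project's actual implementation. It assumes that partition `i` of one stage feeds only partition `i` of the next stage (the aligned case described above). The stage names, worker names, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a task is identified by (stage name, partition index).
Task = Tuple[str, int]


def assign_colocated(stages: List[str], num_partitions: int,
                     workers: List[str]) -> Dict[Task, str]:
    """Give partition i of every stage to the same worker."""
    return {
        (stage, p): workers[p % len(workers)]
        for stage in stages
        for p in range(num_partitions)
    }


def assign_staggered(stages: List[str], num_partitions: int,
                     workers: List[str]) -> Dict[Task, str]:
    """Baseline for comparison: shift the worker by the stage index,
    so aligned partitions end up on different workers."""
    return {
        (stage, p): workers[(p + s) % len(workers)]
        for s, stage in enumerate(stages)
        for p in range(num_partitions)
    }


def cross_worker_edges(assignment: Dict[Task, str], stages: List[str],
                       num_partitions: int) -> int:
    """Count edges between adjacent stages whose two ends run on
    different workers, assuming partition i only feeds partition i."""
    count = 0
    for upstream, downstream in zip(stages, stages[1:]):
        for p in range(num_partitions):
            if assignment[(upstream, p)] != assignment[(downstream, p)]:
                count += 1
    return count


# Two hypothetical stages with three partitions each (p0..p2 and p0'..p2').
stages = ["stage_1", "stage_2"]
workers = ["worker-0", "worker-1", "worker-2"]

colocated = assign_colocated(stages, 3, workers)
staggered = assign_staggered(stages, 3, workers)

print(cross_worker_edges(colocated, stages, 3))  # 0
print(cross_worker_edges(staggered, stages, 3))  # 3
```

With the co-located assignment, no edge between the two stages crosses a worker boundary, so only the final stage would need to gather data over the network. The staggered baseline exists purely to show the contrast.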

This ties into a broader topic: [understanding data layout and strategically partitioning data to take full advantage of it](https://github.com/datafusion-contrib/datafusion-distributed/issues/119).

