By assigning workers to tasks based on their partitioning knowledge, we can reduce network communication and avoid unnecessary data serialization and deserialization.
Let us look at this input plan
Ref: See #117 for full context of this query
If the files are partitioned or sorted by the hash key flag, status, we can split them into non-overlapping groups—making p0 identical to p0′, p1 to p1′, and p2 to p2′. With this alignment, assigning workers as shown below eliminates the need for data transfer between workers, except in the final stage.
This ties into a broader topic: understanding data layout and strategically partitioning data to take full advantage of it