Add partition coalescing at the head of the plan #164
Conversation
| format!("Stage {:<3}{}", self.num, child_str) | ||
| } | ||
|
|
||
| pub fn try_assign(self, channel_resolver: &impl ChannelResolver) -> Result<Self> { |
Remove as it's unused
NGA-TRAN left a comment:
This is actually a very nice fix and improvement. I just wonder how stage task URL assignment worked before this PR.
```
│ AggregateExec: mode=Partial, gby=[RainToday@0 as RainToday, WindGustDir@1 as WindGustDir], aggr=[]
│ PartitionIsolatorExec Tasks: t0:[p0,p1,__] t1:[__,__,p0]
│ DataSourceExec: file_groups={3 groups: [[/testdata/weather/result-000000.parquet], [/testdata/weather/result-000001.parquet], [/testdata/weather/result-000002.parquet]]}, projection=[RainToday, WindGustDir], file_type=parquet
└──────────────────────────────────────────────────
```
This turns out to be a very good optimization for large data sets where the final aggregation is done across many tasks/workers 👍 🎉.
This isn’t directly related to the current PR, but worth noting as a future optimization:
If the input files are partitioned on the group-by columns RainToday, WindGustDir (i.e., the three files don’t overlap on those values), we can eliminate RoundRobin partitioning. This would allow stage 1 and stage 2 to merge into a single stage, executing both partial and final aggregation together. I’m currently exploring use cases for this and will share query plans soon to illustrate the idea more clearly.
```rust
// once on 1 partition.
if depth == 0 && distributed.data.output_partitioning().partition_count() > 1 {
    distributed.data = Arc::new(CoalescePartitionsExec::new(distributed.data));
}
```
Question: What did we do before this PR? I am asking because I do not see any removed or modified code in this PR.
Previously we did pretty much nothing, and therefore we were potentially executing multiple partitions with potentially conflicting URL assignments.
jayshrivastava left a comment:
Looks good overall :) It reminds me of how SQL databases stream a single stream to the client.
Some random thoughts:
- It will be interesting to see how this evolves if we decide to make a "distributed endpoint" as @fmonjalet mentioned where we may stream different partitions to a distributed consumer / client like kafka etc.
- I'm curious if we can break the distributed optimizer rule into smaller rules (although having more than one rule makes it harder to migrate to distributed DF from vanilla DF)
Actually this is a good idea. I bet in the future we will be able to add more rules.
👍 that makes sense. I'd say that's way too soon for the stage the project is in right now. Hopefully this feature can be driven by demand.
I'd say right now is not too bad. The rule is pretty condensed and all the modifications provided go hand in hand with each other. As this grows, it's something we should definitely keep an eye on.
There are a couple of reasons why we want to manually enforce that the head node of the head stage of a distributed plan is a `CoalescePartitionsExec`:

- The `StageExec` node calls `try_assign_urls` lazily upon calling `.execute()`. This means that `.execute()` must only be called once, as we cannot afford to perform several random URL assignments while calling multiple partitions, as they would differ, producing an invalid plan.
- Having a `CoalescePartitionsExec` on top of the plan allows us to place a new `NetworkCoalesceExec` on top of the plan, executing the full plan in parallel.
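To illustrate the first point, here is a minimal sketch of the idea behind the snippet in this PR: coalescing the head of the plan down to a single partition so that `.execute()`, and with it the lazy URL assignment, runs exactly once. The standalone function name `coalesce_plan_head` is hypothetical and the import paths may vary with the DataFusion version; the PR itself performs this check inside the distributed optimizer rule at `depth == 0` rather than in a separate helper.

```rust
use std::sync::Arc;

use datafusion::physical_plan::coalesce_partitions::CoalescePartitionsExec;
use datafusion::physical_plan::{ExecutionPlan, ExecutionPlanProperties};

/// Hypothetical helper: ensure the root of a physical plan produces a single
/// output partition by wrapping it in a `CoalescePartitionsExec` when needed.
fn coalesce_plan_head(plan: Arc<dyn ExecutionPlan>) -> Arc<dyn ExecutionPlan> {
    if plan.output_partitioning().partition_count() > 1 {
        // With a single output partition, the consumer only ever calls
        // `.execute(0)`, so any work done lazily at execute time (such as the
        // URL assignment described above) happens exactly once.
        Arc::new(CoalescePartitionsExec::new(plan))
    } else {
        // Already a single partition; nothing to do.
        plan
    }
}
```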