-
Notifications
You must be signed in to change notification settings - Fork 14
Rework execution plan hierarchy for better interoperability #178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
gabotechs
merged 5 commits into
main
from
gabrielmusat/rework-stage-hierarchy-for-better-interoperability
Oct 9, 2025
Merged
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
ab113ee
Add task_count to context
gabotechs e469202
Rework stage hierarchy for better interoperability with DataFusion ec…
gabotechs 76cbed2
Merge branch 'main' into gabrielmusat/rework-stage-hierarchy-for-bett…
gabotechs 3ae9c4c
Scope common execution_plan helpers to src/execution_plans/common.rs
gabotechs 6d884f8
Add clarifying comment
gabotechs File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,4 @@ | ||
| mod map_last_stream; | ||
| mod partitioning; | ||
| #[allow(unused)] | ||
| pub mod ttl_map; | ||
|
|
||
| pub(crate) use map_last_stream::map_last_stream; | ||
| pub(crate) use partitioning::{scale_partitioning, scale_partitioning_props}; |
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,126 @@ | ||
| use crate::channel_resolver_ext::get_distributed_channel_resolver; | ||
| use crate::distributed_physical_optimizer_rule::NetworkBoundaryExt; | ||
| use crate::execution_plans::common::require_one_child; | ||
| use crate::protobuf::DistributedCodec; | ||
| use crate::{ExecutionTask, Stage}; | ||
| use datafusion::common::exec_err; | ||
| use datafusion::common::tree_node::{Transformed, TreeNode}; | ||
| use datafusion::execution::{SendableRecordBatchStream, TaskContext}; | ||
| use datafusion::physical_plan::{DisplayAs, DisplayFormatType, ExecutionPlan, PlanProperties}; | ||
| use datafusion_proto::physical_plan::PhysicalExtensionCodec; | ||
| use rand::Rng; | ||
| use std::any::Any; | ||
| use std::fmt::Formatter; | ||
| use std::sync::Arc; | ||
| use url::Url; | ||
|
|
||
| /// [ExecutionPlan] that executes the inner plan in distributed mode. | ||
| /// Before executing it, two modifications are lazily performed on the plan: | ||
| /// 1. Assigns worker URLs to all the stages. A random set of URLs are sampled from the | ||
| /// channel resolver and assigned to each task in each stage. | ||
| /// 2. Encodes all the plans in protobuf format so that network boundary nodes can send them | ||
| /// over the wire. | ||
| #[derive(Debug, Clone)] | ||
| pub struct DistributedExec { | ||
| pub plan: Arc<dyn ExecutionPlan>, | ||
| } | ||
|
|
||
| impl DistributedExec { | ||
| pub fn new(plan: Arc<dyn ExecutionPlan>) -> Self { | ||
| Self { plan } | ||
| } | ||
|
|
||
| fn prepare_plan( | ||
| &self, | ||
| urls: &[Url], | ||
| codec: &dyn PhysicalExtensionCodec, | ||
| ) -> datafusion::common::Result<Arc<dyn ExecutionPlan>> { | ||
| let prepared = Arc::clone(&self.plan).transform_up(|plan| { | ||
| let Some(plan) = plan.as_network_boundary() else { | ||
| return Ok(Transformed::no(plan)); | ||
| }; | ||
|
|
||
| let mut rng = rand::thread_rng(); | ||
| let start_idx = rng.gen_range(0..urls.len()); | ||
|
|
||
| let Some(stage) = plan.input_stage() else { | ||
| return exec_err!( | ||
| "NetworkBoundary '{}' has not been assigned a stage", | ||
| plan.name() | ||
| ); | ||
| }; | ||
|
|
||
| let ready_stage = Stage { | ||
| query_id: stage.query_id, | ||
| num: stage.num, | ||
| plan: stage.plan.to_encoded(codec)?, | ||
| tasks: stage | ||
| .tasks | ||
| .iter() | ||
| .enumerate() | ||
| .map(|(i, _)| ExecutionTask { | ||
| url: Some(urls[(start_idx + i) % urls.len()].clone()), | ||
| }) | ||
| .collect::<Vec<_>>(), | ||
| }; | ||
|
|
||
| Ok(Transformed::yes(plan.with_input_stage(ready_stage)?)) | ||
| })?; | ||
| Ok(prepared.data) | ||
| } | ||
| } | ||
|
|
||
| impl DisplayAs for DistributedExec { | ||
| fn fmt_as(&self, _: DisplayFormatType, f: &mut Formatter) -> std::fmt::Result { | ||
| write!(f, "DistributedExec") | ||
| } | ||
| } | ||
|
|
||
| impl ExecutionPlan for DistributedExec { | ||
| fn name(&self) -> &str { | ||
| "DistributedExec" | ||
| } | ||
|
|
||
| fn as_any(&self) -> &dyn Any { | ||
| self | ||
| } | ||
|
|
||
| fn properties(&self) -> &PlanProperties { | ||
| self.plan.properties() | ||
| } | ||
|
|
||
| fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> { | ||
| vec![&self.plan] | ||
| } | ||
|
|
||
| fn with_new_children( | ||
| self: Arc<Self>, | ||
| children: Vec<Arc<dyn ExecutionPlan>>, | ||
| ) -> datafusion::common::Result<Arc<dyn ExecutionPlan>> { | ||
| Ok(Arc::new(DistributedExec { | ||
| plan: require_one_child(&children)?, | ||
| })) | ||
| } | ||
|
|
||
| fn execute( | ||
| &self, | ||
| partition: usize, | ||
| context: Arc<TaskContext>, | ||
| ) -> datafusion::common::Result<SendableRecordBatchStream> { | ||
| if partition > 0 { | ||
| // The DistributedExec node calls try_assign_urls() lazily upon calling .execute(). This means | ||
| // that .execute() must only be called once, as we cannot afford to perform several | ||
| // random URL assignation while calling multiple partitions, as they will differ, | ||
| // producing an invalid plan | ||
| return exec_err!( | ||
| "DistributedExec must only have 1 partition, but it was called with partition index {partition}" | ||
| ); | ||
| } | ||
|
|
||
| let channel_resolver = get_distributed_channel_resolver(context.session_config())?; | ||
| let codec = DistributedCodec::new_combined_with_user(context.session_config()); | ||
|
|
||
| let plan = self.prepare_plan(&channel_resolver.get_urls()?, &codec)?; | ||
| plan.execute(partition, context) | ||
| } | ||
| } | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,13 +1,12 @@ | ||
| mod common; | ||
| mod distributed; | ||
| mod metrics; | ||
| mod network_coalesce; | ||
| mod network_shuffle; | ||
| mod partition_isolator; | ||
| mod stage; | ||
|
|
||
| pub use distributed::DistributedExec; | ||
| pub use metrics::MetricsWrapperExec; | ||
| pub use network_coalesce::{NetworkCoalesceExec, NetworkCoalesceReady}; | ||
| pub use network_shuffle::{NetworkShuffleExec, NetworkShuffleReadyExec}; | ||
| pub use partition_isolator::PartitionIsolatorExec; | ||
| pub(crate) use stage::InputStage; | ||
| pub use stage::display_plan_graphviz; | ||
| pub use stage::{DistributedTaskContext, ExecutionTask, StageExec}; |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be done during planning? Totally fine to do as a follow up. Curious why we do it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we do two things:
we want to do 1) as late as possible, because otherwise, if we do it during planning, by the time we get to execute the plan, the assigned URLs might no longer be valid. By making the URL assignation happen as close to execution as possible, we reduce the risk of workers no longer being available. This is how it was done previously also, it's not really something new.
we want to do 2) right before execution because plans in the "encoded" state are not visible or inspectable or pretty much anything, and with a plan at hand, we might want to be able to display it or things like that.