-
Notifications
You must be signed in to change notification settings - Fork 13
Introduce with_distributed_execution #209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… from anything that's not a SessionStateBuilder
| if cfg.network_coalesce_tasks.is_none() && cfg.network_shuffle_tasks.is_none() { | ||
| return Ok(plan); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As now DistributedConfig is always in the config, we cannot rely on it not being present as a signal that the query should not be distributed. Now we also need to check that it's not configured at all in order to skip the distribution.
| if plan.as_any().is::<DistributedExec>() { | ||
| return Ok(plan); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really related to this PR, but found this footgun while making the tests pass, so I thought that it might be fine to add it here, it would have helped me.
|
@adriangb any feedback on this? for us (DataDog) this API change makes sense, but let me know what you guys think |
|
I think we may need a bit finer grained control. While we're still in the stage of getting DATAFUSION_DISTRIBUTED_CHANCE=0.25 # run 25% of queries via `datafusion-distributed` by defaultThis sets the default for an -- SET extension.distributed_chance = 1.0
SELECT * FROM t LIMIT 10;The way we do this is currently an optimizer rule wrapper I think we could do something similar with this new system by tracking But overall I think it's a much nicer / simpler API for most if not all users 🥳 |
🤔 it seems like even if you do that then #216 will make things even more complicated to you as that PR just removes My impression then is that we probably should still let people manually inject the |
|
Given that #216 is going to rework how this all works, I'm leaning towards not shipping this PR and just work on #216 to keep something like: let state = SessionStateBuilder::new()
.with_default_features()
.with_distributed_channel_resolver(localhost_resolver)
.with_physical_optimizer_rule(Arc::new(DistributedPhysicalOptimizerRule))
.build();instead of let state = SessionStateBuilder::new()
.with_default_features()
.with_distributed_execution(localhost_resolver)
.build();The ergonomic improvement does not seem significant enough to justify removing the ability to provide rules other than |
|
Why can't we have both? In general I like it when software makes the "default" or "simple" thing easy but allows for unwrapping the onion and poking into the inner layers if users need to do something more complex. |
|
yeah, we could, that'd be fine. Don't have a strong opinion as the ergonomic difference between both options is not super big, but if you think that can give a "cleaner" API for the general use case lets go for it! |
This PR refactors the public API people use for enriching a DataFusion
SessionStatewith distributed capabilities.Before, all the
with_distributed_*andset_distributed_*methods were applicable to a lot of DataFusion structs:SessionContextSessionConfigSessionStateSessionStateBuilderBut there's an ergonomic issue with that:
Users have no other option but to use
SessionStateBuilderfor adding configuring distributed capabilities, because that's the only possible place in DataFusion to add aPhysicalOptimizerRuleimplementation (builder.with_physical_optimizer_rule()).This means that no matter how many ergonomic improvements we bring to other structs, people need to configure things through
SessionStateBuilder.This PR allows configuring distributed DataFusion with a new
with_distributed_execution()that will automatically:ChannelResolverimplementationDistributedConfigstruct in theConfigOptionsDistributedPhysicalOptimizerRuleas an optimization ruleAs these are all things that users need to do proactively one way or another, we now just expose it in one method:
This is a preparation PR for a bigger one: