Introduce with_distributed_execution #209

gabotechs · 2025-10-30T16:15:02Z

This PR refactors the public API people use for enriching a DataFusion SessionState with distributed capabilities.

Previously, the with_distributed_* and set_distributed_* methods were available on several DataFusion structs:

SessionContext
SessionConfig
SessionState
SessionStateBuilder

But there's an ergonomic issue with that:

Users have no option but to use SessionStateBuilder to configure distributed capabilities, because it is the only place in DataFusion where a PhysicalOptimizerRule implementation can be added (builder.with_physical_optimizer_rule()).

This means that no matter how many ergonomic improvements we bring to other structs, people need to configure things through SessionStateBuilder.

This PR allows configuring distributed DataFusion with a new with_distributed_execution() that will automatically:

set the ChannelResolver implementation
inject a DistributedConfig struct in the ConfigOptions
add a DistributedPhysicalOptimizerRule as an optimization rule

Since users have to do all of these one way or another, we now expose them through a single method:

let state = SessionStateBuilder::new()
    .with_distributed_execution(CustomChannelResolver)
    .build();

This is a preparation PR for a bigger one:

Rework task assignation mechanism #216


gabotechs · 2025-10-30T16:16:22Z

src/distributed_planner/distributed_physical_optimizer_rule.rs

+        if cfg.network_coalesce_tasks.is_none() && cfg.network_shuffle_tasks.is_none() {
+            return Ok(plan);
+        }


Now that DistributedConfig is always present in the config, we can no longer rely on its absence as a signal that the query should not be distributed. Instead, we skip distribution when neither network_coalesce_tasks nor network_shuffle_tasks is set.
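A minimal sketch of that check, pulled out into a standalone predicate. The `DistributedConfig` below is a stand-in that models only the two fields from the diff; their `Option<usize>` type and the `should_distribute` helper are assumptions for illustration, not the crate's actual definitions.

```rust
/// Stand-in for the crate's `DistributedConfig`. Only the two fields that
/// appear in the diff are modeled, and `Option<usize>` is an assumed type.
#[derive(Debug, Default, Clone)]
pub struct DistributedConfig {
    pub network_coalesce_tasks: Option<usize>,
    pub network_shuffle_tasks: Option<usize>,
}

/// Hypothetical helper: a query is distributed only when at least one of the
/// task counts is configured. This is the negation of the early-return
/// condition shown in the diff.
pub fn should_distribute(cfg: &DistributedConfig) -> bool {
    cfg.network_coalesce_tasks.is_some() || cfg.network_shuffle_tasks.is_some()
}
```

With a default (fully unset) config the predicate is false, which matches the early return that hands the plan back untouched.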

gabotechs · 2025-10-30T16:17:18Z

src/distributed_planner/distributed_physical_optimizer_rule.rs

+    if plan.as_any().is::<DistributedExec>() {
+        return Ok(plan);
+    }


Not really related to this PR, but I found this footgun while making the tests pass, so I thought it would be fine to add the check here. It would have helped me.
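For readers unfamiliar with the `as_any().is::<T>()` idiom used in the guard, here is a self-contained sketch built on `std::any::Any`. The `Plan` trait and `ScanExec` type are hypothetical simplifications, and `DistributedExec` here is an empty stand-in rather than the real node.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical, simplified plan trait. It exposes `as_any` so that callers
/// can check the concrete type behind a trait object.
pub trait Plan {
    fn as_any(&self) -> &dyn Any;
}

/// Empty stand-in for the real `DistributedExec` node.
pub struct DistributedExec;
/// Hypothetical non-distributed node, used only for contrast.
pub struct ScanExec;

impl Plan for DistributedExec {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Plan for ScanExec {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// True when the root of the plan is already a `DistributedExec`, in which
/// case a rule could return the plan unchanged, as the guard above does.
pub fn is_already_distributed(plan: &Arc<dyn Plan>) -> bool {
    plan.as_any().is::<DistributedExec>()
}
```

The check inspects only the root node, so it is cheap to run at the top of a rule.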

gabotechs · 2025-11-04T21:57:52Z

@adriangb any feedback on this? for us (DataDog) this API change makes sense, but let me know what you guys think

adriangb · 2025-11-05T04:59:17Z

I think we may need a bit finer grained control. While we're still in the stage of getting datafusion-distributed working for all of our queries we are using a feature flag to toggle it on/off. We have a system of runtime feature flags that can be embedded in queries, and usually use floats instead of bools so I can set an env var along the lines of:

DATAFUSION_DISTRIBUTED_CHANCE=0.25  # run 25% of queries via `datafusion-distributed` by default

This sets the default for an ExtensionOptions (similar to DistributedConfig). Then we can tweak it on a per-customer/tenant basis and even per query:

-- SET extension.distributed_chance = 1.0
SELECT * FROM t LIMIT 10;

The way we do this is currently an optimizer rule wrapper FeatureFlaggedOptimizerRule that lets you dynamically run a wrapped optimizer rule (DistributedPhysicalOptimizerRule) depending on ConfigOptions.
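A rough sketch of how such a wrapper could look. Everything here is hypothetical: plans are plain strings, `Rule` is a toy trait rather than DataFusion's `PhysicalOptimizerRule`, and the per-query `sample` is assumed to be drawn once by the caller so the decision is deterministic inside the rule.

```rust
/// Toy stand-in for an optimizer rule; plans are modeled as plain strings.
pub trait Rule {
    fn optimize(&self, plan: String) -> String;
}

/// Toy stand-in for the distributed rule: it just wraps the plan.
pub struct DistributeRule;

impl Rule for DistributeRule {
    fn optimize(&self, plan: String) -> String {
        format!("Distributed({})", plan)
    }
}

/// Hypothetical per-query options. `distributed_chance` is the fraction of
/// queries that should be distributed; `sample` is a value in [0, 1) that the
/// caller draws once per query.
pub struct QueryOptions {
    pub distributed_chance: f64,
    pub sample: f64,
}

/// Wrapper that runs the inner rule only when the query's sample falls below
/// the configured chance, and otherwise returns the plan untouched.
pub struct FeatureFlagged<R: Rule> {
    inner: R,
}

impl<R: Rule> FeatureFlagged<R> {
    pub fn new(inner: R) -> Self {
        Self { inner }
    }

    pub fn optimize(&self, plan: String, opts: &QueryOptions) -> String {
        if opts.sample < opts.distributed_chance {
            self.inner.optimize(plan)
        } else {
            plan
        }
    }
}
```

Setting the chance to 1.0 always runs the inner rule and 0.0 never does, which mirrors the per-query override described above.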

I think we could do something similar with this new system by tracking distributed_chance outside of ConfigOptions and then setting network_coalesce_tasks and cfg.network_shuffle_tasks to None on a per-query basis. I'll try to confirm tomorrow by folding in this PR. That might actually simplify things for us.

But overall I think it's a much nicer / simpler API for most if not all users 🥳

gabotechs · 2025-11-05T11:02:39Z

I think we could do something similar with this new system by tracking distributed_chance outside of ConfigOptions and then setting network_coalesce_tasks and cfg.network_shuffle_tasks to None on a per-query basis.

🤔 it seems like even if you do that, #216 will make things even more complicated for you, as that PR removes network_coalesce_tasks and network_shuffle_tasks entirely.

My impression then is that we should probably still let people manually inject the DistributedPhysicalOptimizerRule (or any other custom rule like FeatureFlaggedOptimizerRule) rather than doing it ourselves in the with_distributed_execution() method.

gabotechs · 2025-11-05T11:44:47Z

Given that #216 is going to rework how this all works, I'm leaning towards not shipping this PR and just working on #216, keeping something like:

    let state = SessionStateBuilder::new()
        .with_default_features()
        .with_distributed_channel_resolver(localhost_resolver)
        .with_physical_optimizer_rule(Arc::new(DistributedPhysicalOptimizerRule))
        .build();

instead of

    let state = SessionStateBuilder::new()
        .with_default_features()
        .with_distributed_execution(localhost_resolver)
        .build();

The ergonomic improvement does not seem significant enough to justify removing the ability to provide rules other than DistributedPhysicalOptimizerRule.

adriangb · 2025-11-05T14:10:48Z

Why can't we have both? In general I like it when software makes the "default" or "simple" thing easy but allows for unwrapping the onion and poking into the inner layers if users need to do something more complex.

gabotechs · 2025-11-05T14:57:15Z

Yeah, we could, that'd be fine. I don't have a strong opinion, as the ergonomic difference between the two options is not very big, but if you think it gives a "cleaner" API for the general use case, let's go for it!
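One way to picture "having both" is a convenience method that simply delegates to the granular ones, which stay public. The sketch below uses a hypothetical `Builder` that records string labels instead of real resolvers and rules; it is not the `SessionStateBuilder` API, only an illustration of the layering.

```rust
/// Hypothetical stand-in for `SessionStateBuilder`. It only records what was
/// configured, using string labels instead of real resolvers and rules.
#[derive(Debug, Default)]
pub struct Builder {
    pub channel_resolver: Option<String>,
    pub optimizer_rules: Vec<String>,
}

impl Builder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Granular layer: set only the channel resolver.
    pub fn with_distributed_channel_resolver(mut self, resolver: &str) -> Self {
        self.channel_resolver = Some(resolver.to_string());
        self
    }

    /// Granular layer: register any optimizer rule, including custom wrappers.
    pub fn with_physical_optimizer_rule(mut self, rule: &str) -> Self {
        self.optimizer_rules.push(rule.to_string());
        self
    }

    /// Convenience layer: the common setup in one call, built on top of the
    /// granular methods so advanced users can still bypass it.
    pub fn with_distributed_execution(self, resolver: &str) -> Self {
        self.with_distributed_channel_resolver(resolver)
            .with_physical_optimizer_rule("DistributedPhysicalOptimizerRule")
    }
}
```

Most callers would use `with_distributed_execution`, while someone needing a wrapper such as FeatureFlaggedOptimizerRule would call the two granular methods directly.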

Introduce with_distributed_execution method and remove DistributedExt…

8b38bf0

… from anything that's not a SessionStateBuilder

gabotechs marked this pull request as ready for review November 4, 2025 21:56
