
Commit e56530a

Tailor examples and docs

1 parent b7e12af commit e56530a
13 files changed: +111 -105 lines changed
Lines changed: 1 addition & 3 deletions
@@ -1,9 +1,7 @@
-# Introduction
+# Index
 
 Welcome to the DataFusion Distributed contributor guide!
 
-## Contents
-
 - [Setup](setup.md) - Getting started with development
 - [Tests](tests.md) - Running unit and integration tests
 - [Benchmarks](benchmarks.md) - Local and remote performance benchmarks

docs/source/index.rst

Lines changed: 5 additions & 0 deletions
@@ -5,13 +5,18 @@ DataFusion Distributed
 DataFusion Distributed is a library that enhances `Apache DataFusion <https://datafusion.apache.org>`_ with distributed
 capabilities.
 
+These docs will guide you towards using the library for building your own Distributed DataFusion cluster, and
+how to contribute changes to the library yourself.
+
 .. _toc.guide:
 .. toctree::
    :maxdepth: 2
    :caption: User Guide
 
    user-guide/index
    user-guide/getting-started
+   user-guide/task-estimator
+   user-guide/channel-resolver
    user-guide/concepts
    user-guide/how-a-distributed-plan-is-built
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Building a ChannelResolver
+
+> WARNING: under construction

docs/source/user-guide/concepts.md

Lines changed: 5 additions & 5 deletions
@@ -13,15 +13,15 @@ These are some terms you should be familiar with before getting started:
 children. Implemented as DataFusion `ExecutionPlan`s: `NetworkShuffle` and `NetworkCoalesce`.
 - `Subplan`: a slice of the overall plan. Each stage will execute a subplan of the overall plan.
 
-## [DistributedPhysicalOptimizerRule](https://github.com/datafusion-contrib/datafusion-distributed/blob/6d014eaebd809bcbe676823698838b2c83d93900/src/distributed_planner/distributed_physical_optimizer_rule.rs#L60)
+## [DistributedPhysicalOptimizerRule](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/src/distributed_planner/distributed_physical_optimizer_rule.rs)
 
 This is a physical optimizer rule that converts a single-node DataFusion query into a distributed query. It reads
 a fully formed physical plan and injects the appropriate nodes to execute the query in a distributed fashion.
 
 It builds the distributed plan from bottom to top, and based on the nodes present in the original plan,
 it will inject network boundaries in the appropriate places.
 
-## [TaskEstimator](https://github.com/datafusion-contrib/datafusion-distributed/blob/6d014eaebd809bcbe676823698838b2c83d93900/src/distributed_planner/task_estimator.rs#L40-L40)
+## [TaskEstimator](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/src/distributed_planner/task_estimator.rs)
 
 Estimates the number of tasks required in the leaf stage of a distributed query.
 
@@ -30,7 +30,7 @@ tasks they need to execute based on the amount of data their leaf nodes will pull
 will have their number of tasks reduced or increased depending on how much the cardinality of the data was reduced in
 previous stages.
 
-## [DistributedTaskContext](https://github.com/datafusion-contrib/datafusion-distributed/blob/6d014eaebd809bcbe676823698838b2c83d93900/src/stage.rs#L137-L137)
+## [DistributedTaskContext](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/src/stage.rs)
 
 An extension present during `ExecutionPlan::execute()` that contains information about the current task in
 which the plan is being executed.
@@ -41,15 +41,15 @@ you are in, you might want to return a different set of data.
 For example, if you are on the task with index 0 of a 3-task stage, you might want to return only the first 1/3 of the
 data. If you are on the task with index 2, you might want to return the last 1/3 of the data, and so on.
 
-## [ChannelResolver](https://github.com/datafusion-contrib/datafusion-distributed/blob/6d014eaebd809bcbe676823698838b2c83d93900/src/channel_resolver_ext.rs#L57-L57)
+## [ChannelResolver](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/src/channel_resolver_ext.rs)
 
 Establishes the number of workers available in the distributed DataFusion cluster, their URLs, and how to connect
 to them.
 
 Each organization does networking differently, so this extension allows you to plug in a custom networking
 implementation that caters to your organization's needs.
 
-## [NetworkBoundary](https://github.com/datafusion-contrib/datafusion-distributed/blob/6d014eaebd809bcbe676823698838b2c83d93900/src/distributed_planner/network_boundary.rs#L23-L23)
+## [NetworkBoundary](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/src/distributed_planner/network_boundary.rs)
 
 A network boundary is a node that, instead of pulling data from its children by executing them, serializes its
 children and sends them over the wire so that they are executed on a remote worker.
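The 1/N-slice behavior described in the `DistributedTaskContext` section above can be sketched in plain Rust. This is an illustrative, std-only sketch — `task_row_range` is not part of the library's API, just the arithmetic a task-aware node might use:

```rust
// Sketch of per-task slicing: split `total_rows` into `num_tasks` contiguous
// chunks, handing the remainder rows to the earliest tasks so that chunk
// sizes differ by at most one. Names here are illustrative, not library API.
fn task_row_range(task_index: usize, num_tasks: usize, total_rows: usize) -> (usize, usize) {
    let base = total_rows / num_tasks;
    let rem = total_rows % num_tasks;
    // Tasks with index < rem take one extra row each.
    let start = task_index * base + task_index.min(rem);
    let len = base + if task_index < rem { 1 } else { 0 };
    (start, start + len)
}

fn main() {
    // A 3-task stage over 9 rows: task 0 handles rows 0..3, task 1 rows 3..6,
    // task 2 rows 6..9 — exactly the "first 1/3, last 1/3" idea from the text.
    for task in 0..3 {
        println!("task {task} -> {:?}", task_row_range(task, 3, 9));
    }
}
```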

docs/source/user-guide/getting-started.md

Lines changed: 61 additions & 3 deletions
@@ -1,13 +1,71 @@
 # Getting Started
 
 Rather than being opinionated about your setup and how you serve queries to users,
-Distributed DataFusion allows you to plug in your own networking stack and spawn your own gRPC servers that act as workers in the cluster.
+Distributed DataFusion allows you to plug in your own networking stack and spawn your own gRPC servers that act as
+workers in the cluster.
 
 This project heavily relies on the [Tonic](https://github.com/hyperium/tonic) ecosystem for the networking layer.
 Users of this library are responsible for building their own Tonic server, adding the Arrow Flight distributed
-DataFusion service to it, and spawning it on a port so that it can be reached by other workers in the cluster.
+DataFusion service to it and spawning it on a port so that it can be reached by other workers in the cluster.
 
-The best way to get started is to check out the available examples:
+For a basic setup, all you need to do is to enrich your DataFusion `SessionStateBuilder` with the tools this project
+ships:
+
+```rust
+let state = SessionStateBuilder::new()
+    .with_distributed_channel_resolver(my_custom_channel_resolver)
+    .with_physical_optimizer_rule(Arc::new(DistributedPhysicalOptimizerRule))
+    .build();
+```
+
+The `my_custom_channel_resolver` variable should be an implementation of
+the [ChannelResolver](https://github.com/datafusion-contrib/datafusion-distributed/blob/6d014eaebd809bcbe676823698838b2c83d93900/src/channel_resolver_ext.rs#L57-L57)
+trait, which tells Distributed DataFusion how to connect to other workers in the cluster.
+
+A very basic example of such an implementation, which resolves workers on the localhost machine, is:
+
+```rust
+#[derive(Clone)]
+struct LocalhostChannelResolver {
+    ports: Vec<u16>,
+    cached: DashMap<Url, FlightServiceClient<BoxCloneSyncChannel>>,
+}
+
+#[async_trait]
+impl ChannelResolver for LocalhostChannelResolver {
+    fn get_urls(&self) -> Result<Vec<Url>, DataFusionError> {
+        Ok(self.ports.iter().map(|port| Url::parse(&format!("http://localhost:{port}")).unwrap()).collect())
+    }
+
+    async fn get_flight_client_for_url(
+        &self,
+        url: &Url,
+    ) -> Result<FlightServiceClient<BoxCloneSyncChannel>, DataFusionError> {
+        match self.cached.entry(url.clone()) {
+            Entry::Occupied(v) => Ok(v.get().clone()),
+            Entry::Vacant(v) => {
+                let channel = Channel::from_shared(url.to_string())
+                    .unwrap()
+                    .connect_lazy();
+                let channel = FlightServiceClient::new(BoxCloneSyncChannel::new(channel));
+                v.insert(channel.clone());
+                Ok(channel)
+            }
+        }
+    }
+}
+```
+
+> NOTE: This example is not production-ready and is meant to showcase the basic concepts of the library.
+
+## Next steps
+
+The next two sections of this guide will walk you through tailoring the library's traits to your own needs:
+
+- [Build your own ChannelResolver](channel-resolver.md)
+- [Build your own TaskEstimator](task-estimator.md)
+
+Here are some other resources in the codebase:
 
 - [In-memory cluster example](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/examples/in_memory.md)
 - [Localhost cluster example](https://github.com/datafusion-contrib/datafusion-distributed/blob/main/examples/localhost.md)
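The connect-once-then-cache pattern in the `get_flight_client_for_url` implementation above can be sketched with std's `HashMap` entry API (`DashMap` adds thread safety on top of the same idea). In this illustrative sketch a `String` stands in for the real Flight client, and the counter stands in for dialing a connection:

```rust
use std::collections::hash_map::{Entry, HashMap};

// Std-only sketch of the resolver's client cache: each URL is "dialed" at
// most once, and every later lookup reuses the cached client.
struct ClientCache {
    cached: HashMap<String, String>,
    connections_made: usize,
}

impl ClientCache {
    fn get_or_connect(&mut self, url: &str) -> String {
        match self.cached.entry(url.to_string()) {
            // Cache hit: clone the existing (cheap, channel-like) client.
            Entry::Occupied(v) => v.get().clone(),
            // Cache miss: "dial" once and remember the result.
            Entry::Vacant(v) => {
                self.connections_made += 1; // a real resolver would connect here.
                v.insert(format!("client for {url}")).clone()
            }
        }
    }
}

fn main() {
    let mut cache = ClientCache { cached: HashMap::new(), connections_made: 0 };
    cache.get_or_connect("http://localhost:8080");
    cache.get_or_connect("http://localhost:8080"); // hits the cache.
    cache.get_or_connect("http://localhost:8081");
    println!("connections made: {}", cache.connections_made); // 2, not 3.
}
```

Using `connect_lazy()` in the real example means even the first "dial" is cheap: the TCP connection is only established when the first request goes out.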

docs/source/user-guide/index.md

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-# Introduction
+# Index
 
 Distributed DataFusion is a library that brings distributed capabilities to DataFusion.
 It provides a set of execution plans, optimization rules, configuration extensions, and new traits
@@ -7,8 +7,8 @@ to enable distributed execution.
 This user guide will walk you through using the tools in this project to set up
 your own distributed DataFusion cluster.
 
-## Concepts
-
 - [Concepts](concepts.md)
 - [Getting Started](getting-started.md)
+- [Building a ChannelResolver](channel-resolver.md)
+- [Building a TaskEstimator](task-estimator.md)
 - [How a distributed plan is built](how-a-distributed-plan-is-built.md)
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Building a TaskEstimator
+
+> WARNING: under construction

examples/in_memory.md

Lines changed: 7 additions & 19 deletions
@@ -1,9 +1,9 @@
 # In-memory cluster example
 
-This examples shows how queries can be run in a distributed context without making any
+This example shows how queries can be run in a distributed context without making any
 network IO for communicating between workers.
 
-This is specially useful for testing, as no servers need to be spawned in localhost ports,
+This is especially useful for testing, as no servers need to be spawned on localhost ports,
 the setup is quite easy, and the code coverage for running in this mode is the same as
 running in an actual distributed cluster.
 
@@ -19,32 +19,20 @@ git lfs checkout
 
 ### Issuing a distributed SQL query
 
+The `--show-distributed-plan` flag can be passed to render the distributed plan:
+
 ```shell
-cargo run --example in_memory_cluster -- 'SELECT count(*), "MinTemp" FROM weather GROUP BY "MinTemp"'
+cargo run --example in_memory_cluster -- 'SELECT count(*), "MinTemp" FROM weather GROUP BY "MinTemp"' --show-distributed-plan
 ```
 
-Additionally, the `--explain` flag can be passed to render the distributed plan:
+Not passing the flag will execute the query:
 
 ```shell
-cargo run --example in_memory_cluster -- 'SELECT count(*), "MinTemp" FROM weather GROUP BY "MinTemp"' --explain
+cargo run --example in_memory_cluster -- 'SELECT count(*), "MinTemp" FROM weather GROUP BY "MinTemp"'
 ```
 
 ### Available tables
 
-Two tables are available in this example:
-
-- `flights_1m`: Flight data with 1m rows
-
-```
-FL_DATE [INT32]
-DEP_DELAY [INT32]
-ARR_DELAY [INT32]
-AIR_TIME [INT32]
-DISTANCE [INT32]
-DEP_TIME [FLOAT]
-ARR_TIME [FLOAT]
-```
-
 - `weather`: Small dataset of weather data
 
 ```

examples/in_memory_cluster.rs

Lines changed: 9 additions & 27 deletions
@@ -2,13 +2,12 @@ use arrow::util::pretty::pretty_format_batches;
 use arrow_flight::flight_service_client::FlightServiceClient;
 use async_trait::async_trait;
 use datafusion::common::DataFusionError;
-use datafusion::common::utils::get_available_parallelism;
 use datafusion::execution::SessionStateBuilder;
-use datafusion::physical_plan::displayable;
 use datafusion::prelude::{ParquetReadOptions, SessionContext};
 use datafusion_distributed::{
     ArrowFlightEndpoint, BoxCloneSyncChannel, ChannelResolver, DistributedExt,
     DistributedPhysicalOptimizerRule, DistributedSessionBuilderContext, create_flight_client,
+    display_plan_ascii,
 };
 use futures::TryStreamExt;
 use hyper_util::rt::TokioIo;
@@ -20,20 +19,16 @@ use tonic::transport::{Endpoint, Server};
 #[derive(StructOpt)]
 #[structopt(
     name = "run",
-    about = "An in-memory cluster Distributed DataFusion runner"
+    about = "Run a query in an in-memory Distributed DataFusion cluster"
 )]
 struct Args {
+    /// The SQL query to run.
     #[structopt()]
     query: String,
 
+    /// Whether the distributed plan should be rendered instead of executing the query.
     #[structopt(long)]
-    explain: bool,
-
-    #[structopt(long)]
-    files_per_task: Option<usize>,
-
-    #[structopt(long)]
-    cardinality_task_sf: Option<f64>,
+    show_distributed_plan: bool,
 }
 
 #[tokio::main]
@@ -44,31 +39,18 @@ async fn main() -> Result<(), Box<dyn Error>> {
         .with_default_features()
         .with_distributed_channel_resolver(InMemoryChannelResolver::new())
         .with_physical_optimizer_rule(Arc::new(DistributedPhysicalOptimizerRule))
-        .with_distributed_files_per_task(
-            args.files_per_task.unwrap_or(get_available_parallelism()),
-        )?
-        .with_distributed_cardinality_effect_task_scale_factor(
-            args.cardinality_task_sf.unwrap_or(1.),
-        )?
+        .with_distributed_files_per_task(1)?
         .build();
 
     let ctx = SessionContext::from(state);
 
-    ctx.register_parquet(
-        "flights_1m",
-        "testdata/flights-1m.parquet",
-        ParquetReadOptions::default(),
-    )
-    .await?;
-
     ctx.register_parquet("weather", "testdata/weather", ParquetReadOptions::default())
         .await?;
 
     let df = ctx.sql(&args.query).await?;
-    if args.explain {
+    if args.show_distributed_plan {
         let plan = df.create_physical_plan().await?;
-        let display = displayable(plan.as_ref()).indent(true).to_string();
-        println!("{display}");
+        println!("{}", display_plan_ascii(plan.as_ref(), false));
     } else {
         let stream = df.execute_stream().await?;
         let batches = stream.try_collect::<Vec<_>>().await?;
@@ -133,7 +115,7 @@ impl InMemoryChannelResolver {
 #[async_trait]
 impl ChannelResolver for InMemoryChannelResolver {
     fn get_urls(&self) -> Result<Vec<url::Url>, DataFusionError> {
-        Ok(vec![url::Url::parse(DUMMY_URL).unwrap()])
+        Ok(vec![url::Url::parse(DUMMY_URL).unwrap(); 16]) // simulate 16 workers.
    }
 
     async fn get_flight_client_for_url(
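The example now pins `with_distributed_files_per_task(1)`. The rough intuition behind such a knob — sketched here with illustrative, std-only code that makes no claim about the library's internals — is that a scan's file list gets chunked into tasks, so fewer files per task means more (and smaller) leaf tasks:

```rust
// Illustrative sketch: group `files` into tasks holding at most
// `files_per_task` files each. `files_per_task` must be > 0, since
// `chunks` panics on a zero chunk size.
fn group_into_tasks(files: &[&str], files_per_task: usize) -> Vec<Vec<String>> {
    files
        .chunks(files_per_task)
        .map(|chunk| chunk.iter().map(|f| f.to_string()).collect())
        .collect()
}

fn main() {
    let files = ["a.parquet", "b.parquet", "c.parquet"];
    // files_per_task = 1, as in the example, yields one task per file.
    assert_eq!(group_into_tasks(&files, 1).len(), 3);
    // files_per_task = 2 yields ceil(3 / 2) = 2 tasks.
    assert_eq!(group_into_tasks(&files, 2).len(), 2);
    println!("{:?}", group_into_tasks(&files, 2));
}
```

Pinning the value to 1 keeps the in-memory example deterministic, instead of varying with the machine's available parallelism as the removed flags did.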

examples/localhost.md

Lines changed: 4 additions & 5 deletions
@@ -19,15 +19,14 @@ git lfs checkout
 In two different terminals spawn two ArrowFlightEndpoints
 
 ```shell
-cargo run --example localhost_worker -- 8080 --cluster-ports 8080,8081
+cargo run --example localhost_worker -- 8080
 ```
 
 ```shell
-cargo run --example localhost_worker -- 8081 --cluster-ports 8080,8081
+cargo run --example localhost_worker -- 8081
 ```
 
-- The positional numeric argument is the port in which each Arrow Flight endpoint will listen
-- The `--cluster-ports` parameter tells the Arrow Flight endpoint all the available localhost workers in the cluster
+The positional numeric argument is the port on which each Arrow Flight endpoint will listen.
 
 ### Issuing a distributed SQL query
 
@@ -43,7 +42,7 @@ command, but further stages will be delegated to the workers running on ports 80
 Additionally, the `--show-distributed-plan` flag can be passed to render the distributed plan:
 
 ```shell
-cargo run --example localhost_run -- 'SELECT count(*), "MinTemp" FROM weather GROUP BY "MinTemp"' --cluster-ports 8080,8081 --explain
+cargo run --example localhost_run -- 'SELECT count(*), "MinTemp" FROM weather GROUP BY "MinTemp"' --cluster-ports 8080,8081 --show-distributed-plan
 ```
 
 ### Available tables
