@@ -8,8 +8,8 @@ Library that brings distributed execution capabilities to [DataFusion](https://g
88## What can you do with this crate?
99
1010This crate is a toolkit that extends [ DataFusion] ( https://github.com/apache/datafusion ) with distributed capabilities,
11- providing a developer experience as close as possible to vanilla DataFusion and being opinionated about the networking
12- stack used for hosting the different workers involved in a query.
11+ providing a developer experience as close as possible to vanilla DataFusion while being unopinionated about the
12+ networking stack used for hosting the different workers involved in a query.
1313
1414Users of this library can expect to take their existing single-node DataFusion-based systems and add distributed
1515capabilities with minimal changes.
@@ -18,16 +18,16 @@ capabilities with minimal changes.
1818
1919- Be as close as possible to vanilla DataFusion, providing a seamless integration with existing DataFusion systems and
2020 a familiar API for building applications.
21- - Unopinionated networking. This crate does not take any opinion about the networking stack, and users are expected
22- to leverage their own infrastructure for hosting DataFusion nodes.
21+ - Unopinionated about networking. This crate does not take any opinion about the networking stack, and users are
22+ expected to leverage their own infrastructure for hosting DataFusion nodes.
2323- No coordinator-worker architecture. To keep infrastructure simple, any node can act as a coordinator or a worker.
2424
2525## Architecture
2626
2727A distributed DataFusion query is executed in a very similar fashion as a normal DataFusion query with one key
2828difference:
2929
30- The physical plan is divided into stages that can be executed in different machines, and exchange data using Arrow
30+ The physical plan is divided into stages that can be executed on different machines and exchange data using Arrow
3131Flight. All of this is done at the physical plan level, and is implemented as a ` PhysicalOptimizerRule ` that:
3232
33331 . Inspects the non-distributed plan, placing network boundaries (` ArrowFlightReadExec ` nodes) in the appropriate places
@@ -111,7 +111,8 @@ shows the partitioning scheme in each node):
111111 └──────[0][1][2]───────┘
112112```
113113
114- Based on that boundary, the plan is divided into stages and tasks are assigned to each stage:
114+ Based on that boundary, the plan is divided into stages and tasks are assigned to each stage. Each task will be
115+ responsible for the different partitions in the original plan.
115116
116117```
117118 ┌────── (stage 2) ───────┐
@@ -150,9 +151,30 @@ This means that:
1501512 . Upon calling ` .execute() ` on the ` ArrowFlightReadExec ` , instead of recursively calling ` .execute() ` on its children,
151152 they will be serialized and sent over the wire to another node.
1521533 . The next node, which is hosting an Arrow Flight Endpoint listening for gRPC requests over an HTTP server, will pick
153- up the request containing the serialized chunk of the overall plan, and execute it.
154+ up the request containing the serialized chunk of the overall plan and execute it.
1541554 . This is repeated for each stage, and data will start flowing from bottom to top until it reaches the head stage.
155156
157+ ## Getting familiar with distributed DataFusion
158+
159+ There are some runnable examples showcasing how to provide a localhost implementation for Distributed DataFusion in
160+ [ examples/] ( examples ) :
161+
162+ - [ localhost_worker.rs] ( examples/localhost_worker.rs ) : code that spawns an Arrow Flight Endpoint listening for physical
163+ plans over the network.
164+ - [ localhost_run.rs] ( examples/localhost_run.rs ) : code that distributes a query across the spawned Arrow Flight Endpoints
165+ and executes it.
166+
167+ The integration tests also provide an idea about how to use the library and what can be achieved with it:
168+
169+ - [ tpch_validation_test.rs] ( tests/tpch_validation_test.rs ) : executes all TPCH queries and performs assertions over the
170+ distributed plans.
171+ - [ custom_config_extension.rs] ( tests/custom_config_extension.rs ) : showcases how to propagate custom DataFusion config
172+ extensions.
173+ - [ custom_extension_codec.rs] ( tests/custom_extension_codec.rs ) : showcases how to propagate custom physical extension
174+ codecs.
175+ - [ distributed_aggregation.rs] ( tests/distributed_aggregation.rs ) : showcases how to manually place ` ArrowFlightReadExec `
176+ nodes in a plan and build a distributed query out of it.
177+
156178## Development
157179
158180### Prerequisites
0 commit comments