@@ -26,7 +26,7 @@ capabilities with minimal changes.
2626
2727Before diving into the architecture, it's important to clarify some terms and what they mean:
2828
29- - ` worker ` : a physical machine listening to serialized plans over an Arrow Flight interface.
29+ - ` worker ` : a physical machine listening to serialized execution plans over an Arrow Flight interface.
3030- ` network boundary ` : a node in the plan that streams data from a network interface rather than directly from its
3131 children. Implemented as an ` ArrowFlightReadExec ` physical DataFusion node.
3232- ` stage ` : a portion of the plan separated by a network boundary from other parts of the plan. Implemented as any
@@ -37,8 +37,8 @@ Before diving into the architecture, it's important to clarify some terms and wh
3737A distributed DataFusion query is executed in a very similar fashion as a normal DataFusion query with one key
3838difference:
3939
40- The physical plan is divided into stages that can be executed on different workers and exchange data using Arrow
41- Flight. All of this is done at the physical plan level, and is implemented as a ` PhysicalOptimizerRule ` that:
40+ The physical plan is divided into stages, each stage is assigned tasks that run in parallel in different workers. All of
41+ this is done at the physical plan level, and is implemented as a ` PhysicalOptimizerRule ` that:
4242
43431 . Inspects the non-distributed plan, placing network boundaries (` ArrowFlightReadExec ` nodes) in the appropriate places
44442 . Based on the placed network boundaries, divides the plan into stages and assigns tasks to them.
@@ -158,8 +158,8 @@ will issue 3 concurrent Arrow Flight requests to the appropriate physical nodes.
158158This means that:
159159
1601601 . The head stage is executed normally as if the query was not distributed.
161- 2 . Upon calling ` .execute() ` on the ` ArrowFlightReadExec ` , instead of recursively calling ` .execute() ` on its children ,
162- the child subplan will be serialized and sent over the wire to another node .
161+ 2 . Upon calling ` .execute() ` on ` ArrowFlightReadExec ` , instead of propagating the ` .execute() ` call on its child ,
162+ the subplan is serialized and sent over the wire to be executed on another worker .
1631633 . The next node, which is hosting an Arrow Flight Endpoint listening for gRPC requests over an HTTP server, will pick
164164 up the request containing the serialized chunk of the overall plan and execute it.
1651654 . This is repeated for each stage, and data will start flowing from bottom to top until it reaches the head stage.
0 commit comments