-
Notifications
You must be signed in to change notification settings - Fork 14
Add README.md and LICENSE.txt #114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
NGA-TRAN
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIce doc
README.md
Outdated
|
|
||
| 1. The head stage is executed normally as if the query was not distributed. | ||
| 2. Upon calling `.execute()` on the `ArrowFlightReadExec`, instead of recursively calling `.execute()` on its children, | ||
| they will be serialized and sent over the wire to another node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked myself what the word they means here. The next sentence (point 3) lets me know it is a plan. Maybe explain they here will make it clearer? And it is a sub-plan that needs to be executed in stage1, right?
What if the plan has 3 stages: stage3 us the head stage that will get data from stage2 and stage2 gets data from stage1? The subplan stage3 sends to stage2 also include the sub-sub-plan that stage2 will send to stage1, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. Here "they" refers to the immediate previous word "children". As you pointed out, I think this can be clearer, let me fix that
README.md
Outdated
| The integration tests also provide an idea about how to use the library and what can be achieved with it: | ||
|
|
||
| - [tpch_validation_test.rs](tests/tpch_validation_test.rs): executes all TPCH queries and performs assertions over the | ||
| distributed plans. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add bit more saying we validate by comparing distributed results with single node vanilla ones; and also saying we use scale factor 0.001?
0ef17d9 to
c1b6d3c
Compare
README.md
Outdated
|
|
||
| Before diving into the architecture, it's important to clarify some terms and what they mean: | ||
|
|
||
| - `worker`: a physical machine listening to serialized plans over an Arrow Flight interface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logical or execution plans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
README.md
Outdated
| A distributed DataFusion query is executed in a very similar fashion as a normal DataFusion query with one key | ||
| difference: | ||
|
|
||
| The physical plan is divided into stages that can be executed on different workers and exchange data using Arrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super clear: who exchanges data? The workers? The stages? The plan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed that phrase, it should be clearer now
README.md
Outdated
| 2. Upon calling `.execute()` on the `ArrowFlightReadExec`, instead of recursively calling `.execute()` on its children, | ||
| the child subplan will be serialized and sent over the wire to another node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a convoluted way of saying that ArrowFlightReadExec is just a standard input node, with the child plan as part of the request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased that to make it a bit clearer
| The physical plan is divided into stages that can be executed on different workers and exchange data using Arrow | ||
| Flight. All of this is done at the physical plan level, and is implemented as a `PhysicalOptimizerRule` that: | ||
|
|
||
| 1. Inspects the non-distributed plan, placing network boundaries (`ArrowFlightReadExec` nodes) in the appropriate places |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"in the appropriate places" is the core problem: might be worth it to expand a bit on this. Also, how is the number of partitions chosen now that it can't just be the number of CPU on the single node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really like to give a more informative description about what's happening here, but I'm afraid this is in so early stages that whatever I put here might changed several times in the following days, so keeping it ambiguous for now might be the best card to play.
Adds a README.md and a LICENSE file