Skip to content

Conversation

@gabotechs
Copy link
Collaborator

@gabotechs gabotechs commented Sep 2, 2025

Adds a README.md and a LICENSE file

@gabotechs gabotechs changed the title Add README.md Add README.md and LICENSE.txt Sep 2, 2025
Copy link
Collaborator

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIce doc

README.md Outdated

1. The head stage is executed normally as if the query was not distributed.
2. Upon calling `.execute()` on the `ArrowFlightReadExec`, instead of recursively calling `.execute()` on its children,
they will be serialized and sent over the wire to another node.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked myself what the word they means here. The next sentence (point 3) lets me know it is a plan. Maybe explain they here will make it clearer? And it is a sub-plan that needs to be executed in stage1, right?

What if the plan has 3 stages: stage3 us the head stage that will get data from stage2 and stage2 gets data from stage1? The subplan stage3 sends to stage2 also include the sub-sub-plan that stage2 will send to stage1, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. Here "they" refers to the immediate previous word "children". As you pointed out, I think this can be clearer, let me fix that

README.md Outdated
The integration tests also provide an idea about how to use the library and what can be achieved with it:

- [tpch_validation_test.rs](tests/tpch_validation_test.rs): executes all TPCH queries and performs assertions over the
distributed plans.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add bit more saying we validate by comparing distributed results with single node vanilla ones; and also saying we use scale factor 0.001?

@gabotechs gabotechs force-pushed the gabrielmusat/readme-and-license branch from 0ef17d9 to c1b6d3c Compare September 3, 2025 07:53
README.md Outdated

Before diving into the architecture, it's important to clarify some terms and what they mean:

- `worker`: a physical machine listening to serialized plans over an Arrow Flight interface.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logical or execution plans?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

README.md Outdated
A distributed DataFusion query is executed in a very similar fashion as a normal DataFusion query with one key
difference:

The physical plan is divided into stages that can be executed on different workers and exchange data using Arrow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super clear: who exchanges data? The workers? The stages? The plan?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed that phrase, it should be clearer now

README.md Outdated
Comment on lines 161 to 162
2. Upon calling `.execute()` on the `ArrowFlightReadExec`, instead of recursively calling `.execute()` on its children,
the child subplan will be serialized and sent over the wire to another node.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a convoluted way of saying that ArrowFlightReadExec is just a standard input node, with the child plan as part of the request?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased that to make it a bit clearer

The physical plan is divided into stages that can be executed on different workers and exchange data using Arrow
Flight. All of this is done at the physical plan level, and is implemented as a `PhysicalOptimizerRule` that:

1. Inspects the non-distributed plan, placing network boundaries (`ArrowFlightReadExec` nodes) in the appropriate places
Copy link
Collaborator

@geoffreyclaude geoffreyclaude Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"in the appropriate places" is the core problem: might be worth it to expand a bit on this. Also, how is the number of partitions chosen now that it can't just be the number of CPU on the single node?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to give a more informative description about what's happening here, but I'm afraid this is in so early stages that whatever I put here might changed several times in the following days, so keeping it ambiguous for now might be the best card to play.

@gabotechs gabotechs merged commit 9459890 into main Sep 3, 2025
3 checks passed
@gabotechs gabotechs deleted the gabrielmusat/readme-and-license branch September 3, 2025 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants