Remove unnecessary StageExec proto serde overhead #163
Conversation
NGA-TRAN left a comment
This kind of optimization is always needed in distributed systems. Nice work @gabotechs
jayshrivastava left a comment
LGTM :)
I think some comments on stage_from_proto and proto_from_stage might be useful.
`proto_from_stage` never needs to decode `stage.inputs` because these will be passed through. `stage_from_proto` sees a decoded `InputStage` in the coordinator only; otherwise, it should be decoded.
Having two variants for `InputStage` adds some complexity, so it would be nice to make this behavior clearer to the reader.
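As a rough illustration of the kind of doc comments being suggested, here is a sketch with hypothetical, simplified signatures (the real functions live in src/protobuf/mod.rs and their types differ):

```rust
// Placeholder types so the sketch stands alone; the real ones are the
// crate's StageExec and its protobuf counterpart.
struct StageExec;
struct StageExecProto;

/// Encodes a stage for the wire.
///
/// Never needs to decode `stage.inputs`: inputs that are already `Encoded`
/// are passed through as raw bytes, since their only consumer is the gRPC
/// request sent to the next ArrowFlightEndpoint.
fn proto_from_stage(_stage: &StageExec) -> Result<StageExecProto, String> {
    unimplemented!("sketch only")
}

/// Decodes a stage received over the wire.
///
/// Only the coordinator ends up holding `Decoded` input stages; elsewhere
/// the inputs stay `Encoded` and are decoded only by whichever endpoint
/// eventually executes them.
fn stage_from_proto(_proto: StageExecProto) -> Result<StageExec, String> {
    unimplemented!("sketch only")
}
```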
This is the exact improvement that I was hoping we'd get to from the first design of stages; we only need to decode the top of the stage tree. Nice work! I'll get to review the PR later today.
Super happy about this PR. Thank you @gabotechs 🚀 !
Addressing @jayshrivastava's comments about clarity for the library user: perhaps something we can add to an architecture document in the future is a diagram of the execution stage tree and how it gets sent over the network, showing the decoded root and encoded children. Perhaps that will help with clarity.
The alternative is the way we used to have things structured: sending all tasks to workers before execution, which added an additional round of preflight requests and more state to keep track of. I very much prefer this solution, where the execution tree arrives in the do_get_statement payload, so I'm glad this optimization arrived.
Regarding quantifying improvements: scale factor 0.1 is quite small, so the relative importance of communication overhead is amplified. If we had made an improvement to the flow of data instead of to communication overhead, it might get lost at such a small scale. I'm not sure what the right value for the scale factor is, but I think it is somewhere closer to SF=10, where it can still be done in a single-node way but we are also moving a more significant amount of data. WDYT, @gabotechs ?
With SF=10 the overhead of moving data is so large that it outweighs the overhead of serializing/deserializing plans over the network. The truth is that TPCH queries are very small compared to the amount of data they move, so they are not the best tool for measuring the changes in this PR. A more suitable way of benchmarking this would be another set of SQL queries with thousands of LOC, but I don't think we have those benchmarks.
```diff
 fn with_new_children(
     self: Arc<Self>,
-    children: Vec<Arc<dyn ExecutionPlan>>,
+    _children: Vec<Arc<dyn ExecutionPlan>>,
 ) -> Result<Arc<dyn ExecutionPlan>> {
-    Ok(Arc::new(StageExec {
-        query_id: self.query_id,
-        num: self.num,
-        name: self.name.clone(),
-        plan: self.plan.clone(),
-        inputs: children,
-        tasks: self.tasks.clone(),
-        depth: self.depth,
-    }))
+    plan_err!("with_new_children() not supported for StageExec")
 }
```
FYI, with this change datafusion-distributed is now incompatible with datafusion-tracing, which I believe uses the TreeNode API to wrap every node: https://github.com/datafusion-contrib/datafusion-tracing/blob/f0aee9ed2960fa101570ddc7ad11670c8ee64289/datafusion-tracing/src/instrument_rule.rs#L72
And there's no trivial solution: it seems that both this optimizer rule and the instrumentation rule want to be the last / outermost optimizer 😢
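For context, here is a minimal sketch (not datafusion-tracing's actual code) of the shape of TreeNode-based rewrite such an instrumentation rule performs, with a hypothetical `wrap` stand-in for the instrumentation node; any rewrite of this shape now fails once it reaches a StageExec:

```rust
use std::sync::Arc;

use datafusion::common::tree_node::{Transformed, TreeNode};
use datafusion::common::Result;
use datafusion::physical_plan::ExecutionPlan;

/// Wraps every node of a physical plan, bottom-up.
fn wrap_every_node(
    plan: Arc<dyn ExecutionPlan>,
    wrap: impl Fn(Arc<dyn ExecutionPlan>) -> Arc<dyn ExecutionPlan>,
) -> Result<Arc<dyn ExecutionPlan>> {
    // transform_up rebuilds each parent whose children changed by calling
    // ExecutionPlan::with_new_children on it. Since StageExec now returns
    // plan_err! from that method, the rewrite errors out as soon as a
    // StageExec has to be rebuilt around its wrapped children.
    Ok(plan
        .transform_up(|node| Ok(Transformed::yes(wrap(node))))?
        .data)
}
```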
🤔 this seems challenging...
Even if we were able to return children like before, the children are the child StageExecs, not the inner StageExec.plan, which is what actually gets executed, so pretty much the whole plan would be left untraced.
In order to solve this, the children of StageExec would probably need to be its inner plan, and not its InputStages.
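A rough sketch of that idea, written over a hypothetical, simplified StageExec rather than the real ExecutionPlan impl, just to show the shape of the change:

```rust
use std::sync::Arc;

use datafusion::common::Result;
use datafusion::physical_plan::ExecutionPlan;

// Hypothetical, simplified stage; the real StageExec has more fields.
struct StageExec {
    plan: Arc<dyn ExecutionPlan>,
    inputs: Vec<Arc<dyn ExecutionPlan>>,
}

// The TreeNode children of a StageExec would be its executed inner plan
// rather than its input stages, so rules like datafusion-tracing's can
// reach and wrap what actually runs on this stage.
fn stage_children(stage: &StageExec) -> Vec<&Arc<dyn ExecutionPlan>> {
    vec![&stage.plan]
}

fn stage_with_new_children(
    stage: StageExec,
    mut children: Vec<Arc<dyn ExecutionPlan>>,
) -> Result<StageExec> {
    // Exactly one child is exposed: the inner plan.
    Ok(StageExec {
        plan: children.swap_remove(0),
        inputs: stage.inputs,
    })
}
```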
I'll see if I can draft something up, seems difficult but not impossible
Previously, upon crossing a network boundary, we serialized and deserialized the full plan in order to send it over the wire. This resulted in a good amount of unnecessary serialization/deserialization, for example:

- `NetworkBoundary` implementations do not need to read the input stage plan at all; they just need it there so that it can be sent in the gRPC request to the `ArrowFlightEndpoint`. Therefore, it does not need to be deserialized at that point.
- The `ArrowFlightEndpoint` needs the current stage's subplan deserialized, as it needs to be executed there, but it does not need any of the input stages' plans to be deserialized; those can perfectly well live encoded as `Bytes`, as they are only useful for being sent over the wire to other `ArrowFlightEndpoint`s.

This PR introduces a new `InputStage` struct that stores input stages either:

- `Decoded`: the input `StageExec` struct is stored as is
- `Encoded`: the input `StageExec` is stored as `Bytes`

Storing the `InputStage` as `Encoded` allows us to just pass the `Bytes` through without performing any unnecessary decoding + re-encoding in parts of the application that just need them passed along as part of a protobuf request.
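A rough sketch of the idea with hypothetical, simplified types (the real definitions live in src/execution_plans/stage.rs and src/protobuf/mod.rs):

```rust
use std::sync::Arc;

use bytes::Bytes;

// Hypothetical, simplified stage; the real StageExec also holds the
// ExecutionPlan to run plus task metadata.
struct StageExec {
    inputs: Vec<InputStage>,
}

/// An input stage is either already materialized (as on the coordinator)
/// or kept as the raw protobuf bytes it arrived in (as on the workers).
enum InputStage {
    Decoded(Arc<StageExec>),
    Encoded(Bytes),
}

impl InputStage {
    /// Bytes to embed in the outgoing gRPC request. An `Encoded` input is
    /// passed through untouched, which is exactly the decode + re-encode
    /// round trip this PR removes; only a `Decoded` input still needs to
    /// be serialized.
    fn wire_bytes(&self) -> Bytes {
        match self {
            InputStage::Encoded(bytes) => bytes.clone(),
            InputStage::Decoded(stage) => encode_stage(stage),
        }
    }
}

/// Stand-in for the real protobuf encoding of a stage.
fn encode_stage(_stage: &StageExec) -> Bytes {
    unimplemented!("sketch only")
}
```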
The changes in this PR, running against the TPCH benchmarks with a 0.1 scale factor, result in this: