-
Notifications
You must be signed in to change notification settings - Fork 5.5k
feat: Expand procedure architecture for distributed execution, and support iceberg procedure rewrite_data_files
#22659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
7ec819c to
9440737
Compare
f89dc40 to
e796fa2
Compare
acb0351 to
c3eaa96
Compare
05de3c8 to
0dc3dbb
Compare
rewrite_data_filesrewrite_data_files
steveburnett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the draft doc! Some nits about punctuation, formatting, and some suggested rephrasing for readability and conciseness, but the content looks good.
0dc3dbb to
a78c41c
Compare
|
@steveburnett Thanks a lot for your suggestion, all be fixed. Please take a look when convenient! |
steveburnett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull updated branch, new local doc build, looks good. Thanks!
a78c41c to
2fdbab7
Compare
2fdbab7 to
befe9a7
Compare
3f53a40 to
0d5a811
Compare
|
@hantangwangd I'm reviewing this. One quick thing, could you please add documentation in our developer guide (in develop) that explain how these distributed procedures are built and registered? |
|
@tdcmeehan Thanks for the review. Sure, I'll add the relevant documentation as soon as possible. |
tdcmeehan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good work. Well done! I've left some feedback, but it's mostly minor.
I would split this PR into at least 3 parts:
- All of the code in core Presto to support distributed prcoedures
- The C++ counterpart for this code
- The Iceberg integration
| private final BeginCallDistributedProcedure beginCallDistributedProcedure; | ||
| private final FinishCallDistributedProcedure finishCallDistributedProcedure; | ||
|
|
||
| protected DistributedProcedure(DistributedProcedureType type, String schema, String name, List<Argument> arguments, BeginCallDistributedProcedure beginCallDistributedProcedure, FinishCallDistributedProcedure finishCallDistributedProcedure) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to read and more idiomatic to make DistributedProcedure abstract, and make beginCallDistributedProcedure and finishCallDistributedProcedure abstract methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, I completely agree! Initially, I didn't declare DistributedProcedure as abstract because Procedure itself wasn't declared as abstract. Now, I've made both Procedure and DistributedProcedure abstract, and introduced a LocalProcedure to represent the original coordinator-only procedures. This makes the overall procedure architecture much clearer and easier to understand.
presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/StatementAnalyzer.java
Outdated
Show resolved
Hide resolved
presto-main-base/src/main/java/com/facebook/presto/execution/CallTask.java
Outdated
Show resolved
Hide resolved
presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/SymbolMapper.java
Outdated
Show resolved
Hide resolved
| new DynamicFiltersChecker(), | ||
| new WarnOnScanWithoutPartitionPredicate(featuresConfig)); | ||
| new WarnOnScanWithoutPartitionPredicate(featuresConfig), | ||
| new CallDistributedProcedureValidator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd it would be nice to have plan tests, like TestHashGenerationOptimizer, that show the type of plan that gets generated by a distributed procedure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. Since the CALL DISTRIBUTED PROCEDURE statement requires a valid distributed procedure to be invoked, and currently only one has been implemented in Iceberg connector, I've added the test case to TestIcebergLogicalPlanner. Please take a look when you have time, thanks a lot.
presto-main-base/src/main/java/com/facebook/presto/testing/TestProcedureRegistry.java
Show resolved
Hide resolved
presto-spi/src/main/java/com/facebook/presto/spi/procedure/IProcedureRegistry.java
Outdated
Show resolved
Hide resolved
| return source; | ||
| } | ||
|
|
||
| @JsonIgnore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentionally ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is intentionally ignored. Subclasses of WriteTarget are only used during planning -- they will not be serialized.
presto-spi/src/main/java/com/facebook/presto/spi/procedure/DistributedProcedure.java
Outdated
Show resolved
Hide resolved
8d53e4d to
106c2aa
Compare
|
@tdcmeehan thanks for your review and feedback. I've addressed all your comments except the one about adding documentation. Please take a look when you have time.
Are you suggesting that I split this into three separate PRs, or should I squash it into three commits within a single PR? |
|
@hantangwangd since we now squash commits on merge, let's make 3 separate PRs. |
|
@tdcmeehan Sure, I'll do it. |
…ailable in analyzer
Use a subclass `TableDataRewriteDistributedProcedure` for table rewrite tasks, for example, merge small data files, sort table data, repartition table data etc.
Accordingly rename previous ProcedureRegistry to BuiltInProcedureRegistry
… abstract classes And introduce a new class `LocalProcedure` to represent the former coordinator-only procedures
106c2aa to
a192c74
Compare
Description
This PR expand the current procedure architecture in presto, support defining, registering and calling procedures which need to be executed in a distributed way. Then support distributed procedure in Iceberg connector and implement a specific procedure
rewrite_data_filesfor it.Referring to: prestodb/rfcs#12
The whole PR is separated into 6 parts:
Re-factor
ProcedureRegistry/Proceduredata structure to support the creation and register ofDistributedProcedure. And make sureProcedureRegistrybe available in presto-analyzer module, so that we can recognize distributed procedures in call statement during prepare and analyze stages.Handle call statement on distributed procedures in preparer stage. In this stage, we figure out the procedure's type in call statement, and define a new query type
CALL_DISTRIBUTED_PROCEDUREforcall distributed procedureinBuiltInPreparedQuery. In this way,call distributed procedurestatement would be handled bySqlQueryExecutionFactory, then be created and handled as aSqlQueryExecution.Analyze and plan the
call distributed procedurestatement based on the subtype of the distributed procedure. For subtypeTableDataRewriteDistributedProcedure, ultimately generate a logical plan for it as follows:Optimize, segmentation, grouped tag and local plan for the logical plan generated above. The handle logical for
CallDistributedProcedureNodeis similar asTableWriterNode. Besides, a new optimizerRewriteWriterTargetis added, which is placed after all optimization rules. It is used to update theTableHandleheld inTableFinishNodeandCallDistributedProcedureNodebased on the underlyingTableScanNodeafter the entire optimization is completed, considering the possible filter pushing down.Re-factor Iceberg connector to support
call distributed procedure. Introduce Iceberg's procedure context and expandIcebergSplitManagerto support split source planned byIcebergAbstractMetadata.beginCallDistributedProcedure(...). This split source will be set to procedure context, and use procedure context to hold all the files to be rewritten as well.Support Iceberg
rewrite_data_filesprocedure. It build a customized split source, set the split source to procedure context in order to be used inIcebergSplitManager. And register a file scan task consumer to collector and hold all the scanned files into procedure context. Then finally in the commit stage, get all the data files and delete files that has been rewritten, and all the files that has been newly generated, change and commit their metadata through Iceberg table'sRewriteFilestransaction.Motivation and Context
N/A
Impact
N/A
Test Plan
rewrite_data_filesContributor checklist
Release Notes