
Rahil/metaclient executor#5

Closed
rahil-c wants to merge 51 commits into master from rahil/metaclient-executor

Conversation


@rahil-c rahil-c commented Aug 22, 2025

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium, or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

cshuo and others added 30 commits August 1, 2025 16:32
…d nested columns (apache#13663)

* fix bug and clean up
* refactoring to be more efficient

---------

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…riter (apache#13672)

* fix field renaming in spark projection
* should not use full path so that we match the output of the merger
* update comments that were wrong
* avoid string handling when field is not renamed

---------

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
)

* fix restore sequence to be in completion reverse order, still requested time comparison for compaction
* add a custom comparator for the restore instant sort

---------

Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…ge handle migration (apache#13670)

Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…apache#13600)

* use the BufferedRecordMerger to deduplicate inputs for COW and index write path;
* Add a new sub-merger for MIT expression payload.
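
The deduplication idea can be sketched as follows; this is an illustrative Python sketch only, not Hudi's BufferedRecordMerger API, and every name in it is hypothetical:

```python
# Hypothetical sketch of key-based input deduplication for a write path:
# keep the record with the highest ordering value per key.
def deduplicate(records):
    """records: iterable of (key, ordering_value, payload) tuples."""
    latest = {}
    for key, ordering, payload in records:
        kept = latest.get(key)
        # A later-or-equal ordering value wins, so the most recent
        # arrival is kept on ties.
        if kept is None or ordering >= kept[0]:
            latest[key] = (ordering, payload)
    return {key: payload for key, (_, payload) in latest.items()}
```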

---------

Co-authored-by: Lokesh Jain <ljain@192.168.0.234>
Co-authored-by: Timothy Brown <tim@onehouse.ai>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…greater than SIX (apache#13687)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…mor table failure (apache#13667)

Co-authored-by: chenxuehai <chenxuehai@bytedance.com>
…e#13677)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Jonathan Vexler <=>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
lokeshj1703 and others added 21 commits August 13, 2025 21:47
…apache#13717)

* [HUDI-9704] Move remaining APIs from reader context to record context
* Fix compilation

---------

Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local>
…13615)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
…eGroupReader (apache#13699)

The goal of this PR is to ensure consistent behavior while reading and writing data across our Merge-on-Read and Copy-on-Write tables by leveraging the existing HoodieFileGroupReader to manage the merging of records. The FileGroupReaderBasedMergeHandle that is currently used for compaction is updated to allow merging with an incoming stream of records.

Summary of changes:

- FileGroupReaderBasedMergeHandle.java is updated to allow incoming records in the form of an iterator of records directly instead of reading changes exclusively from log files. New callbacks are added to support creating the required outputs for updates to Record Level and Secondary indexes.
- The merge handle is also updated to account for preserving the metadata of records that are not updated while also generating the metadata for updated records. This does not impact the compaction workflow which will preserve the metadata of the records.
- The FileGroupReaderBasedMergeHandle is set as the default merge handle.
- New test cases are added for RLI, including a test where records move between partitions and deletes are sent to partitions that do not contain the original record.
- The delete record ordering value is now converted to the engine-specific type so there are no issues when performing comparisons.

Differences between FileGroupReaderBasedMergeHandle and HoodieWriteMergeHandle:
- Currently the HoodieWriteMergeHandle can handle applying a single update to multiple records with the same key. This functionality does not exist in the FileGroupReaderBasedMergeHandle.
- The FileGroupReaderBasedMergeHandle does not support the shouldFlush functionality in the HoodieRecordMerger.
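
The merge flow described above can be sketched roughly as follows. This is Python for illustration only; the function, the dict-of-updates shape, and the None-as-delete convention are all hypothetical, not Hudi's actual FileGroupReaderBasedMergeHandle:

```python
# Illustrative sketch of a file-group merge: stream base-file records,
# apply incoming updates/deletes keyed by record key, then append the
# remaining incoming records as inserts.
def merge_file_group(base_records, incoming):
    """base_records: iterable of (key, row); incoming: dict key -> row,
    where a value of None marks a delete."""
    pending = dict(incoming)
    out = []
    for key, row in base_records:
        if key in pending:
            update = pending.pop(key)
            if update is not None:
                out.append((key, update))   # updated record, new metadata
        else:
            out.append((key, row))          # untouched record, metadata preserved
    # Anything left in the incoming set is an insert into this file group.
    out.extend((k, v) for k, v in pending.items() if v is not None)
    return out
```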

---------

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: Lokesh Jain <ljain@192.168.1.21>
Co-authored-by: Lokesh Jain <ljain@192.168.0.234>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…or payload deprecation (apache#13519)

This PR implements the upgrade process from table version 8 to version 9, and the downgrade process from v9 to v8. The main changes are that payloads are deprecated and merge modes are derived when migrating from v8 to v9; in some cases a PartialUpdateMode is also added.

The pseudocode of the upgrade and downgrade logic is as follows:

upgrade high level logic
for table with payload class defined in RFC-97,
  remove hoodie.compaction.payload.class from table_configs
  add hoodie.table.legacy.payload.class=payload to table_configs
  set hoodie.table.partial.update.mode based on RFC-97
  set hoodie.table.merge.properties based on RFC-97
  set hoodie.record.merge.mode based on RFC-97
  set hoodie.record.merge.strategy.id based on RFC-97
for table with event_time/commit_time merge mode,
  set hoodie.table.partial.update.mode to default value
  set hoodie.table.merge.properties to default value
for table with custom merger or payload,
  set hoodie.table.partial.update.mode to default value
  set hoodie.table.merge.properties to default value
Since hoodie.table.partial.update.mode and hoodie.table.merge.properties already have default values, no actual operations are needed for the last two kinds of tables.

downgrade high level logic
for all tables:
  remove hoodie.table.partial.update.mode from table_configs
  remove hoodie.table.merge.properties from table_configs
for table with payload class defined in RFC-97,
  remove hoodie.legacy.payload.class from table_configs
  set hoodie.compaction.payload.class=payload
  set hoodie.record.merge.mode=CUSTOM
  set hoodie.record.merge.strategy.id accordingly
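
The pseudocode above can be sketched as a pair of config rewrites. The RFC97_MAPPING contents and the property values below are placeholders, not the real RFC-97 tables; only the config key names come from this description:

```python
# Hedged sketch of the v8 -> v9 table-config rewrite and its reverse.
# The payload class and mapped values are hypothetical examples.
RFC97_MAPPING = {
    "org.example.SomePayload": {            # hypothetical payload class
        "hoodie.table.partial.update.mode": "IGNORE_DEFAULTS",
        "hoodie.record.merge.mode": "EVENT_TIME_ORDERING",
    },
}

def upgrade_v8_to_v9(cfg):
    payload = cfg.get("hoodie.compaction.payload.class")
    if payload in RFC97_MAPPING:
        del cfg["hoodie.compaction.payload.class"]
        cfg["hoodie.table.legacy.payload.class"] = payload
        cfg.update(RFC97_MAPPING[payload])
    # Tables on event_time/commit_time merge modes or custom mergers need
    # no rewrite: the new options simply fall back to their defaults.
    return cfg

def downgrade_v9_to_v8(cfg):
    cfg.pop("hoodie.table.partial.update.mode", None)
    cfg.pop("hoodie.table.merge.properties", None)
    payload = cfg.pop("hoodie.table.legacy.payload.class", None)
    if payload is not None:
        cfg["hoodie.compaction.payload.class"] = payload
        cfg["hoodie.record.merge.mode"] = "CUSTOM"
    return cfg
```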




---------

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Lokesh Jain <ljain@192.168.1.21>
Co-authored-by: Lokesh Jain <ljain@192.168.0.234>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
* rename the table option to `hoodie.table.ordering.fields`;
* deprecate the usage of write option, only uses it for table creation;
* rename Spark SQL 'preCombineField' to 'orderingFields', Flink SQL 'precombine.field' to 'ordering.fields';
* keep the old SQL option compatible;
* add upgrade/downgrade of the table option.
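
Keeping the old SQL option compatible amounts to a fallback lookup, sketched here with a hypothetical resolver (only the option names come from the PR text):

```python
# Illustrative sketch: resolve the renamed 'orderingFields' option first
# and fall back to the legacy 'preCombineField' for compatibility.
def resolve_ordering_fields(options):
    value = options.get("orderingFields") or options.get("preCombineField")
    # Normalize to a list; multiple ordering fields are comma separated.
    return [f.strip() for f in value.split(",")] if value else []
```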

---------

Co-authored-by: Lokesh Jain <ljain@192.168.0.234>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…ans (apache#13719)

* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures

- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards

* [HUDI-9653] Add support for showArchived in the clean procedures

* [HUDI-9653] Add SQL filter expression support for Spark procedures

This commit introduces SQL filter expression support for Hudi Spark procedures,
enabling users to apply standard SQL expressions to filter timeline data.

Key features:
- Added a filter parameter to show_clean_plans, show_cleans, and show_cleans_metadata procedures
- Supports standard SQL expressions for filtering procedure results (e.g., plan_time > '20250101000000')
- Automatic expression parsing and validation using Spark's SQL parser
- Proper column binding and type conversion for expression evaluation
- Comprehensive error handling with descriptive error messages

Examples:
- call show_clean_plans(table => 'my_table', filter => 'plan_time > 20250101000000')
- call show_cleans(table => 'my_table', filter => 'clean_time >= 20250101 AND state = COMPLETED')
- call show_clean_plans(table => 'my_table', filter => 'total_partitions_to_clean > 0')

Implementation:
- Created HoodieProcedureFilterUtils with SQL expression parsing and evaluation
- Added filter validation to prevent invalid column references and syntax errors
- Enhanced existing procedures with optional filter parameter (default empty)
- Added comprehensive test coverage for time-based and state-based filtering
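
The filtering concept can be illustrated with a toy evaluator. The real HoodieProcedureFilterUtils relies on Spark's SQL parser and supports full SQL expressions; this hypothetical sketch only handles a single comparison:

```python
# Toy illustration of filtering procedure rows with a
# "<column> <op> <literal>" expression. Fixed-width timestamp strings
# compare correctly lexicographically, which this sketch relies on.
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}

def apply_filter(rows, expression):
    column, op, literal = expression.split(maxsplit=2)
    # Basic validation, mirroring the idea of rejecting unknown columns.
    if rows and column not in rows[0]:
        raise ValueError(f"unknown column: {column}")
    literal = literal.strip("'")
    return [row for row in rows if OPS[op](str(row[column]), literal)]
```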


* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Support showArchived parameter to support querying both active and archived cleans
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards

)

* [MINOR] Add additional upgrade tests for TestMORDataSource

* address vc comment

* address tim log comment

* Trigger CI/CD pipeline
@rahil-c rahil-c closed this Aug 22, 2025
