
Rahil/metaclient executor#5

Closed
rahil-c wants to merge 51 commits into master from rahil/metaclient-executor

Conversation


@rahil-c rahil-c commented Aug 22, 2025

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium, or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

cshuo and others added 30 commits August 1, 2025 16:32
…d nested columns (apache#13663)

* fix bug and clean up
* refactoring to be more efficient

---------

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…riter (apache#13672)

* fix field renaming in spark projection
* should not use full path so that we match the output of the merger
* update comments that were wrong
* avoid string handling when field is not renamed

---------

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
)

* fix restore sequence to be in completion reverse order, still requested time comparison for compaction
* add a custom comparator for the restore instant sort

---------

Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…ge handle migration (apache#13670)

Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…apache#13600)

* use the BufferedRecordMerger to deduplicate inputs for COW and index write path;
* Add a new sub-merger for MIT expression payload.
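
The deduplication idea can be sketched as follows; this is an illustrative Python sketch only, not Hudi's BufferedRecordMerger API, and every name in it is hypothetical:

```python
# Hypothetical sketch of key-based input deduplication for a write path:
# keep the record with the highest ordering value per key.
def deduplicate(records):
    """records: iterable of (key, ordering_value, payload) tuples."""
    latest = {}
    for key, ordering, payload in records:
        kept = latest.get(key)
        # A later-or-equal ordering value wins, so the most recent
        # arrival is kept on ties.
        if kept is None or ordering >= kept[0]:
            latest[key] = (ordering, payload)
    return {key: payload for key, (_, payload) in latest.items()}
```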

---------

Co-authored-by: Lokesh Jain <ljain@192.168.0.234>
Co-authored-by: Timothy Brown <tim@onehouse.ai>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…greater than SIX (apache#13687)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…mor table failure (apache#13667)

Co-authored-by: chenxuehai <chenxuehai@bytedance.com>
…e#13677)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Jonathan Vexler <=>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
lokeshj1703 and others added 21 commits August 13, 2025 21:47
…apache#13717)

* [HUDI-9704] Move remaining APIs from reader context to record context
* Fix compilation

---------

Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local>
…13615)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
…eGroupReader (apache#13699)

The goal of this PR is to ensure consistent behavior while reading and writing data across our Merge-on-Read and Copy-on-Write tables by leveraging the existing HoodieFileGroupReader to manage the merging of records. The FileGroupReaderBasedMergeHandle that is currently used for compaction is updated to allow merging with an incoming stream of records.

Summary of changes:

- FileGroupReaderBasedMergeHandle.java is updated to allow incoming records in the form of an iterator of records directly instead of reading changes exclusively from log files. New callbacks are added to support creating the required outputs for updates to Record Level and Secondary indexes.
- The merge handle is also updated to account for preserving the metadata of records that are not updated while also generating the metadata for updated records. This does not impact the compaction workflow which will preserve the metadata of the records.
- The FileGroupReaderBasedMergeHandle is set as the default merge handle.
- New test cases are added for RLI, including a test where records move between partitions and deletes are sent to partitions that do not contain the original record.
- The delete record ordering value is now converted to the engine-specific type so there are no issues when performing comparisons.

Differences between FileGroupReaderBasedMergeHandle and HoodieWriteMergeHandle:
- Currently the HoodieWriteMergeHandle can handle applying a single update to multiple records with the same key. This functionality does not exist in the FileGroupReaderBasedMergeHandle.
- The FileGroupReaderBasedMergeHandle does not support the shouldFlush functionality in the HoodieRecordMerger.
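
The merge flow described above can be sketched roughly as follows. This is Python for illustration only; the function, the dict-of-updates shape, and the None-as-delete convention are all hypothetical, not Hudi's actual FileGroupReaderBasedMergeHandle:

```python
# Illustrative sketch of a file-group merge: stream base-file records,
# apply incoming updates/deletes keyed by record key, then append the
# remaining incoming records as inserts.
def merge_file_group(base_records, incoming):
    """base_records: iterable of (key, row); incoming: dict key -> row,
    where a value of None marks a delete."""
    pending = dict(incoming)
    out = []
    for key, row in base_records:
        if key in pending:
            update = pending.pop(key)
            if update is not None:
                out.append((key, update))   # updated record, new metadata
        else:
            out.append((key, row))          # untouched record, metadata preserved
    # Anything left in the incoming set is an insert into this file group.
    out.extend((k, v) for k, v in pending.items() if v is not None)
    return out
```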

---------

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: Lokesh Jain <ljain@192.168.1.21>
Co-authored-by: Lokesh Jain <ljain@192.168.0.234>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…or payload deprecation (apache#13519)

This PR implements the upgrade process from table version 8 to version 9, and the downgrade process from v9 to v8. The main changes are that payloads are deprecated and merge modes are derived when migrating from v8 to v9; in some cases a PartialUpdateMode is also added.

The pseudocode of the upgrade and downgrade logic is as follows:

upgrade high level logic
for table with payload class defined in RFC-97,
  remove hoodie.compaction.payload.class from table_configs
  add hoodie.table.legacy.payload.class=payload to table_configs
  set hoodie.table.partial.update.mode based on RFC-97
  set hoodie.table.merge.properties based on RFC-97
  set hoodie.record.merge.mode based on RFC-97
  set hoodie.record.merge.strategy.id based on RFC-97
for table with event_time/commit_time merge mode,
  set hoodie.table.partial.update.mode to default value
  set hoodie.table.merge.properties to default value
for table with custom merger or payload,
  set hoodie.table.partial.update.mode to default value
  set hoodie.table.merge.properties to default value
Since hoodie.table.partial.update.mode and hoodie.table.merge.properties already have default values, no actual operations are needed for the last two kinds of tables.

downgrade high level logic
for all tables:
  remove hoodie.table.partial.update.mode from table_configs
  remove hoodie.table.merge.properties from table_configs
for table with payload class defined in RFC-97,
  remove hoodie.legacy.payload.class from table_configs
  set hoodie.compaction.payload.class=payload
  set hoodie.record.merge.mode=CUSTOM
  set hoodie.record.merge.strategy.id accordingly
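
The pseudocode above can be sketched as a pair of config rewrites. The RFC97_MAPPING contents and the property values below are placeholders, not the real RFC-97 tables; only the config key names come from this description:

```python
# Hedged sketch of the v8 -> v9 table-config rewrite and its reverse.
# The payload class and mapped values are hypothetical examples.
RFC97_MAPPING = {
    "org.example.SomePayload": {            # hypothetical payload class
        "hoodie.table.partial.update.mode": "IGNORE_DEFAULTS",
        "hoodie.record.merge.mode": "EVENT_TIME_ORDERING",
    },
}

def upgrade_v8_to_v9(cfg):
    payload = cfg.get("hoodie.compaction.payload.class")
    if payload in RFC97_MAPPING:
        del cfg["hoodie.compaction.payload.class"]
        cfg["hoodie.table.legacy.payload.class"] = payload
        cfg.update(RFC97_MAPPING[payload])
    # Tables on event_time/commit_time merge modes or custom mergers need
    # no rewrite: the new options simply fall back to their defaults.
    return cfg

def downgrade_v9_to_v8(cfg):
    cfg.pop("hoodie.table.partial.update.mode", None)
    cfg.pop("hoodie.table.merge.properties", None)
    payload = cfg.pop("hoodie.table.legacy.payload.class", None)
    if payload is not None:
        cfg["hoodie.compaction.payload.class"] = payload
        cfg["hoodie.record.merge.mode"] = "CUSTOM"
    return cfg
```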




---------

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Lokesh Jain <ljain@192.168.1.21>
Co-authored-by: Lokesh Jain <ljain@192.168.0.234>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
* rename the table option to `hoodie.table.ordering.fields`;
* deprecate the usage of write option, only uses it for table creation;
* rename Spark SQL 'preCombineField' to 'orderingFields', Flink SQL 'precombine.field' to 'ordering.fields';
* keep the old SQL option compatible;
* add upgrade/downgrade of the table option.
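
Keeping the old SQL option compatible amounts to a fallback lookup, sketched here with a hypothetical resolver (only the option names come from the PR text):

```python
# Illustrative sketch: resolve the renamed 'orderingFields' option first
# and fall back to the legacy 'preCombineField' for compatibility.
def resolve_ordering_fields(options):
    value = options.get("orderingFields") or options.get("preCombineField")
    # Normalize to a list; multiple ordering fields are comma separated.
    return [f.strip() for f in value.split(",")] if value else []
```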

---------

Co-authored-by: Lokesh Jain <ljain@192.168.0.234>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…ans (apache#13719)

* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures

- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards

* [HUDI-9653] Add support for showArchived in the clean procedures

* [HUDI-9653] Add SQL filter expression support for Spark procedures

This commit introduces SQL filter expression support for Hudi Spark procedures,
enabling users to apply standard SQL expressions to filter timeline data.

Key features:
- Added a filter parameter to show_clean_plans, show_cleans, and show_cleans_metadata procedures
- Supports standard SQL expressions for filtering procedure results (e.g., plan_time > '20250101000000')
- Automatic expression parsing and validation using Spark's SQL parser
- Proper column binding and type conversion for expression evaluation
- Comprehensive error handling with descriptive error messages

Examples:
- call show_clean_plans(table => 'my_table', filter => 'plan_time > 20250101000000')
- call show_cleans(table => 'my_table', filter => 'clean_time >= 20250101 AND state = COMPLETED')
- call show_clean_plans(table => 'my_table', filter => 'total_partitions_to_clean > 0')

Implementation:
- Created HoodieProcedureFilterUtils with SQL expression parsing and evaluation
- Added filter validation to prevent invalid column references and syntax errors
- Enhanced existing procedures with optional filter parameter (default empty)
- Added comprehensive test coverage for time-based and state-based filtering
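
The filtering concept can be illustrated with a toy evaluator. The real HoodieProcedureFilterUtils relies on Spark's SQL parser and supports full SQL expressions; this hypothetical sketch only handles a single comparison:

```python
# Toy illustration of filtering procedure rows with a
# "<column> <op> <literal>" expression. Fixed-width timestamp strings
# compare correctly lexicographically, which this sketch relies on.
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}

def apply_filter(rows, expression):
    column, op, literal = expression.split(maxsplit=2)
    # Basic validation, mirroring the idea of rejecting unknown columns.
    if rows and column not in rows[0]:
        raise ValueError(f"unknown column: {column}")
    literal = literal.strip("'")
    return [row for row in rows if OPS[op](str(row[column]), literal)]
```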


* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Support showArchived parameter to support querying both active and archived cleans
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards

)

* [MINOR] Add additional upgrade tests for TestMORDataSource

* address vc comment

* address tim log comment

* Trigger CI/CD pipeline
@rahil-c rahil-c closed this Aug 22, 2025
