…o avoid data duplication (apache#13659)
…k multi-format reading to FileGroupReader (apache#13632)
…d nested columns (apache#13663)
* fix bug and clean up
* refactoring to be more efficient
---------
Co-authored-by: Jonathan Vexler <=> Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…riter (apache#13672)
* fix field renaming in spark projection
* should not use full path so that we match the output of the merger
* update comments that were wrong
* avoid string handling when field is not renamed
---------
Co-authored-by: Jonathan Vexler <=> Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…ed is not set in cleaner output (apache#13660)
…ge handle migration (apache#13670) Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…apache#13600)
* use the BufferedRecordMerger to deduplicate inputs for COW and index write path;
* Add a new sub-merger for MIT expression payload.
---------
Co-authored-by: Lokesh Jain <ljain@192.168.0.234> Co-authored-by: Timothy Brown <tim@onehouse.ai> Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
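The input deduplication in the entry above can be sketched abstractly: group the incoming batch by record key and keep the record with the highest ordering value. This is a minimal illustration, not the actual BufferedRecordMerger API; all names here are made up.

```java
import java.util.*;

// Illustrative sketch (not Hudi's BufferedRecordMerger API): deduplicate a batch
// of incoming records by record key, keeping the record with the highest
// ordering value, the way a COW/index write path would before merging.
public class DedupSketch {
    public static final class Rec {
        public final String key;
        public final long ordering;
        public final String payload;
        public Rec(String key, long ordering, String payload) {
            this.key = key; this.ordering = ordering; this.payload = payload;
        }
    }

    public static Map<String, Rec> deduplicate(List<Rec> incoming) {
        Map<String, Rec> latest = new HashMap<>();
        for (Rec r : incoming) {
            // Map.merge calls the remapper with (existing, new); keep whichever
            // has the larger ordering value, preferring the newer one on ties.
            latest.merge(r.key, r, (a, b) -> b.ordering >= a.ordering ? b : a);
        }
        return latest;
    }
}
```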
…reparation (apache#13649) * Address comments
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
…greater than SIX (apache#13687) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…mor table failure (apache#13667) Co-authored-by: chenxuehai <chenxuehai@bytedance.com>
…e#13677) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Jonathan Vexler <=> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Jonathan Vexler <=>
…apache#13717) * [HUDI-9704] Move remaining APIs from reader context to record context * Fix compilation --------- Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local>
…tion and field renaming (apache#13714)
…13615) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> Co-authored-by: sivabalan <n.siva.b@gmail.com>
…eGroupReader (apache#13699)
The goal of this PR is to ensure consistent behavior while reading and writing data across our Merge-on-Read and Copy-on-Write tables by leveraging the existing HoodieFileGroupReader to manage the merging of records. The FileGroupReaderBasedMergeHandle currently used for compaction is updated to allow merging with an incoming stream of records.
Summary of changes:
- FileGroupReaderBasedMergeHandle.java is updated to accept incoming records directly as an iterator, instead of reading changes exclusively from log files. New callbacks are added to support creating the required outputs for updates to the Record Level and Secondary indexes.
- The merge handle is also updated to preserve the metadata of records that are not updated, while generating metadata for updated records. This does not impact the compaction workflow, which preserves record metadata.
- FileGroupReaderBasedMergeHandle is set as the default merge handle.
- New test cases are added for RLI, including a test where records move between partitions and deletes are sent to partitions that do not contain the original record.
- The delete record ordering value is now converted to the engine-specific type so that comparisons are safe.
Differences between FileGroupReaderBasedMergeHandle and HoodieWriteMergeHandle:
- HoodieWriteMergeHandle can apply a single update to multiple records with the same key; this functionality does not exist in FileGroupReaderBasedMergeHandle.
- FileGroupReaderBasedMergeHandle does not support the shouldFlush functionality in HoodieRecordMerger.
---------
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com> Co-authored-by: Lokesh Jain <ljain@192.168.1.21> Co-authored-by: Lokesh Jain <ljain@192.168.0.234> Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
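The merge flow described above can be sketched abstractly: stream the base-file records, substitute the incoming value for any record that has an update, pass untouched records through with their existing metadata, and append incoming keys that did not exist in the base file. This is a toy sketch under those assumptions; none of these names are Hudi's actual API.

```java
import java.util.*;

// Illustrative sketch (not Hudi's API): merge an iterator of base-file records
// with an incoming set of updates keyed by record key. Records without an
// update pass through untouched, preserving their original metadata; records
// with an update take the incoming value; unmatched incoming keys are inserts.
public class MergeHandleSketch {
    public static List<String> merge(Iterator<Map.Entry<String, String>> baseRecords,
                                     Map<String, String> incomingUpdates) {
        List<String> out = new ArrayList<>();
        while (baseRecords.hasNext()) {
            Map.Entry<String, String> rec = baseRecords.next();
            // Updated record: take the incoming value (metadata regenerated).
            // Untouched record: emit as-is (metadata preserved).
            String merged = incomingUpdates.getOrDefault(rec.getKey(), rec.getValue());
            out.add(rec.getKey() + "=" + merged);
        }
        // Inserts: incoming keys not present in the base file are appended.
        incomingUpdates.keySet().stream()
            .filter(k -> out.stream().noneMatch(s -> s.startsWith(k + "=")))
            .forEach(k -> out.add(k + "=" + incomingUpdates.get(k)));
        return out;
    }
}
```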
…or payload deprecation (apache#13519)
This PR implements the upgrade from table version v8 to v9 and the downgrade from v9 to v8. The main changes: payloads are deprecated, merge modes are derived when migrating from v8 to v9, and in some cases a PartialUpdateMode is added. Pseudocode for the upgrade and downgrade logic:
Upgrade (high level):
- for a table with a payload class defined in RFC-97:
  - remove hoodie.compaction.payload.class from table configs
  - add hoodie.table.legacy.payload.class=payload to table configs
  - set hoodie.table.partial.update.mode based on RFC-97
  - set hoodie.table.merge.properties based on RFC-97
  - set hoodie.record.merge.mode based on RFC-97
  - set hoodie.record.merge.strategy.id based on RFC-97
- for a table with event_time/commit_time merge mode: set hoodie.table.partial.update.mode and hoodie.table.merge.properties to their default values
- for a table with a custom merger or payload: set hoodie.table.partial.update.mode and hoodie.table.merge.properties to their default values
- since hoodie.table.partial.update.mode and hoodie.table.merge.properties have default values, no operations are actually needed for the last two kinds of tables
Downgrade (high level):
- for all tables: remove hoodie.table.partial.update.mode and hoodie.table.merge.properties from table configs
- for a table with a payload class defined in RFC-97:
  - remove hoodie.table.legacy.payload.class from table configs
  - set hoodie.compaction.payload.class=payload
  - set hoodie.record.merge.mode=CUSTOM
  - set hoodie.record.merge.strategy.id accordingly
---------
Co-authored-by: sivabalan <n.siva.b@gmail.com>
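The upgrade/downgrade pseudocode above is essentially a rewrite of the table-config map. A minimal sketch, assuming the config keys from the commit message and stubbing the RFC-97 payload-to-merge-mode lookup out as parameters (the partial-update-mode and merge-properties writes are omitted here since they have defaults):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the v8 -> v9 upgrade and v9 -> v8 downgrade rewrite
// described above. Key names come from the commit message; the RFC-97 mapping
// from payload class to merge mode/strategy is stubbed out as method arguments.
public class PayloadConfigMigration {

    // Upgrade: retire the payload class config in favor of merge-mode configs.
    public static void upgradeV8ToV9(Map<String, String> tableConfigs,
                                     String mergeMode, String strategyId) {
        String payload = tableConfigs.remove("hoodie.compaction.payload.class");
        if (payload != null) {
            tableConfigs.put("hoodie.table.legacy.payload.class", payload);
            // In the real flow these values are derived from the RFC-97 mapping.
            tableConfigs.put("hoodie.record.merge.mode", mergeMode);
            tableConfigs.put("hoodie.record.merge.strategy.id", strategyId);
        }
        // event_time/commit_time tables and custom mergers need no writes here:
        // partial-update mode and merge properties fall back to their defaults.
    }

    // Downgrade: restore the payload class and mark the merge mode as CUSTOM.
    public static void downgradeV9ToV8(Map<String, String> tableConfigs) {
        tableConfigs.remove("hoodie.table.partial.update.mode");
        tableConfigs.remove("hoodie.table.merge.properties");
        String payload = tableConfigs.remove("hoodie.table.legacy.payload.class");
        if (payload != null) {
            tableConfigs.put("hoodie.compaction.payload.class", payload);
            tableConfigs.put("hoodie.record.merge.mode", "CUSTOM");
        }
    }
}
```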
… except version id (apache#13731)
Co-authored-by: Lokesh Jain <ljain@192.168.1.21> Co-authored-by: Lokesh Jain <ljain@192.168.0.234> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
* rename the table option to `hoodie.table.ordering.fields`;
* deprecate the usage of the write option, only use it for table creation;
* rename Spark SQL 'preCombineField' to 'orderingFields', Flink SQL 'precombine.field' to 'ordering.fields';
* keep the old SQL options compatible;
* add upgrade/downgrade of the table option.
---------
Co-authored-by: Lokesh Jain <ljain@192.168.0.234> Co-authored-by: danny0405 <yuzhao.cyz@gmail.com>
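A hedged before/after illustration of the rename, using the option names listed above (the table definition itself is illustrative, and the old spellings remain accepted for compatibility):

```sql
-- Spark SQL: 'orderingFields' is the new spelling of 'preCombineField'
CREATE TABLE t (id INT, ts BIGINT) USING hudi
TBLPROPERTIES (orderingFields = 'ts');   -- previously: preCombineField = 'ts'

-- Flink SQL option: 'precombine.field'  ->  'ordering.fields'
-- Persisted table config: 'hoodie.table.ordering.fields'
```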
…ans (apache#13719)
* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
  - Implement ShowCleansProcedure for displaying completed clean operations
  - Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
  - Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
  - Register new procedures in HoodieProcedures.scala
  - Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
  - Support limit parameter for result pagination
  - Support showArchived parameter to query both active and archived cleans
  - Include proper error handling for non-existent tables
  - Follow existing procedure patterns and coding standards
* [HUDI-9653] Add support for showArchived in the clean procedures
* [HUDI-9653] Add SQL filter expression support for Spark procedures
  This commit introduces SQL filter expression support for Hudi Spark procedures, enabling users to apply standard SQL expressions to filter timeline data.
  Key features:
  - Added a filter parameter to show_clean_plans, show_cleans, and show_cleans_metadata procedures
  - Supports standard SQL expressions for filtering procedure results (e.g., plan_time > '20250101000000')
  - Automatic expression parsing and validation using Spark's SQL parser
  - Proper column binding and type conversion for expression evaluation
  - Comprehensive error handling with descriptive error messages
  Examples:
  - call show_clean_plans(table => 'my_table', filter => 'plan_time > 20250101000000')
  - call show_cleans(table => 'my_table', filter => 'clean_time >= 20250101 AND state = COMPLETED')
  - call show_clean_plans(table => 'my_table', filter => 'total_partitions_to_clean > 0')
  Implementation:
  - Created HoodieProcedureFilterUtils with SQL expression parsing and evaluation
  - Added filter validation to prevent invalid column references and syntax errors
  - Enhanced existing procedures with optional filter parameter (default empty)
  - Added comprehensive test coverage for time-based and state-based filtering
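The shape of the filter evaluation described above, reduced to a toy: bind a parsed comparison against each result row and keep the rows that match. The real implementation parses full SQL expressions with Spark's parser; this hypothetical sketch handles only `<column> > <number>` so the row-filtering step is visible.

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of filter support in the spirit of the
// HoodieProcedureFilterUtils described above. Only "<column> > <number>" is
// handled here; invalid expressions are rejected with a descriptive error.
public class ProcedureFilterSketch {
    public static List<Map<String, Long>> applyFilter(List<Map<String, Long>> rows,
                                                      String filter) {
        // e.g. "plan_time > 20250101000000"
        String[] parts = filter.trim().split("\\s+");
        if (parts.length != 3 || !parts[1].equals(">")) {
            throw new IllegalArgumentException("unsupported filter: " + filter);
        }
        String column = parts[0];
        long bound = Long.parseLong(parts[2]);
        return rows.stream()
            .filter(row -> row.containsKey(column) && row.get(column) > bound)
            .collect(Collectors.toList());
    }
}
```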
…or Flink COW write (apache#13734)
… HoodieTableConfig (apache#13740)
Change Logs
Describe context and summary for this change. Highlight if any code was copied.
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low, medium, or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
Attach the ticket number here and follow the instructions to make
changes to the website.
Contributor's checklist