feat(schema): Add support to write shredded variants for HoodieRecordType.AVRO #18065
- Add adapter pattern for Spark 3 and 4
- Clean up invariant issue in SparkSqlWriter
- Add cross-engine test
- Add backward compatibility test for Spark 3.x
- Add cross-engine read for Flink
Author (Member): Oops, accidentally pushed this to the hudi repo instead of my own fork. Should be fine; I just need to remember to delete the branch after the merge.
- Added support to write shredded types for HoodieRecordType.AVRO
- Added functional tests for the newly added configs
```java
/**
 * Checks if a StructType looks like a variant representation (has value and metadata binary fields).
 * This is a structural check that doesn't rely on metadata, useful during schema reconciliation.
 */
```
Contributor: Is it possible to use the struct's metadata to be more direct about what is or isn't a variant?
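For illustration, a tag-based check along the lines the reviewer suggests might look like the sketch below; the metadata key and helper names are hypothetical, not part of this PR.

```java
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.MetadataBuilder;
import org.apache.spark.sql.types.StructField;

class VariantFieldTag {
  // Purely hypothetical metadata key; this PR does not define one.
  static final String VARIANT_TAG = "hudi.isVariant";

  // Tag the struct field as a variant when Hudi first builds the schema.
  static StructField tagAsVariant(StructField field) {
    Metadata meta = new MetadataBuilder()
        .withMetadata(field.metadata())
        .putBoolean(VARIANT_TAG, true)
        .build();
    return new StructField(field.name(), field.dataType(), field.nullable(), meta);
  }

  // During schema reconciliation, check the tag instead of the field shape.
  static boolean isTaggedVariant(StructField field) {
    return field.metadata().contains(VARIANT_TAG)
        && field.metadata().getBoolean(VARIANT_TAG);
  }
}
```

The trade-off is that a tag only survives where Hudi controls schema construction, whereas the structural check also recognizes schemas produced elsewhere.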
Describe the issue this Pull Request addresses
Closes: #18066
This PR introduces full read and write support for shredded Variant types in Hudi, enabling the storage of Spark 4.0 Variant data in an optimized Parquet format. Previously, Variant types were stored only as unshredded binary blobs. Shredding decomposes the variant data into typed columns (typed_value), allowing for significantly better compression and query performance (column pruning) when querying specific sub-fields of the Variant.
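For orientation, here is a hedged sketch of the two layouts in Spark StructType terms. It follows the general Parquet variant shredding convention rather than code from this PR; the `price` sub-field and the exact nullability are illustrative assumptions.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

class VariantLayouts {
  // Unshredded: the variant is just a pair of binary blobs.
  static final StructType UNSHREDDED = new StructType(new StructField[] {
      new StructField("metadata", DataTypes.BinaryType, false, Metadata.empty()),
      new StructField("value", DataTypes.BinaryType, false, Metadata.empty())
  });

  // Shredded: typed_value additionally holds sub-fields extracted into concrete
  // Parquet types; here a single assumed sub-field "price" stored as a long.
  static final StructType SHREDDED = new StructType(new StructField[] {
      new StructField("metadata", DataTypes.BinaryType, false, Metadata.empty()),
      new StructField("value", DataTypes.BinaryType, true, Metadata.empty()),
      new StructField("typed_value", new StructType(new StructField[] {
          new StructField("price", new StructType(new StructField[] {
              // Raw bytes kept when a value does not conform to the typed column.
              new StructField("value", DataTypes.BinaryType, true, Metadata.empty()),
              new StructField("typed_value", DataTypes.LongType, true, Metadata.empty())
          }), true, Metadata.empty())
      }), true, Metadata.empty())
  });
}
```

Because `price` lives in its own typed column, a query touching only that sub-field can prune the binary blob entirely.

Summary and Changelog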
This PR implements the "Shredding" mechanism at the Avro write layer, allowing Hudi to transform Variant records into the shredded structure required by Parquet. It abstracts the shredding logic behind a provider interface to maintain separation between Hudi's core Avro support and Spark's Variant implementation.

Key Changes:
Configuration:
- Added hoodie.parquet.variant.shredding.provider.class in HoodieStorageConfig to specify the implementation used to parse/shred variants.
- Defaults to the Spark 4 provider (org.apache.hudi.variant.Spark4VariantShreddingProvider) if present on the classpath; a usage sketch follows this list.
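For example, enabling shredded writes from Spark might look like the following. Only the two config keys come from this PR; the option plumbing, table name, and surrounding code (`df`, `basePath`) are assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class ShreddedWriteExample {
  // Assumes `df` has a Spark 4.0 Variant column; config keys are from this PR.
  static void writeShredded(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "variant_tbl") // assumed name
        .option("hoodie.parquet.variant.write.shredding.enabled", "true")
        .option("hoodie.parquet.variant.shredding.provider.class",
            "org.apache.hudi.variant.Spark4VariantShreddingProvider")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```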
Core Write Support (hudi-hadoop-common):
- VariantShreddingProvider: An interface for parsing variant binary data into shredded GenericRecords (see the hedged sketch after this list).
- HoodieAvroWriteSupport: Added logic to identify shredded variant fields in the schema and utilize the provider to populate the typed_value column during the write phase.
- HoodieAvroFileWriterFactory: Added logic to instantiate the shredding provider via reflection.
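A minimal sketch of what such a provider contract could look like; the method name and signature are assumptions, not the PR's actual interface.

```java
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

// Hypothetical shape of the contract: given the raw variant binaries and the
// target shredded Avro schema, return a record with typed_value populated.
public interface VariantShreddingProvider {
  GenericRecord shred(ByteBuffer metadata, ByteBuffer value, Schema shreddedSchema);
}
```

Keeping the interface in hudi-hadoop-common and loading the Spark 4 implementation reflectively is what lets the core write path stay free of Spark dependencies.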
Spark 4 Integration (hudi-spark4-common):
- Spark4VariantShreddingProvider: A concrete implementation utilizing org.apache.spark.types.variant.VariantShreddingWriter to perform the actual shredding.
- BaseSpark4Adapter: Modified isDataTypeEqualForParquet to recognize the shredded physical schema (structs with 3 fields: value, metadata, typed_value) as a valid Variant type (see the sketch after this list).
- HoodieHadoopFsRelationFactory: Propagates the hoodie.parquet.variant.allow.reading.shredded config to Spark's SQLConf to ensure the reader handles the schema correctly.
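The structural recognition can be pictured as below; this paraphrases the check discussed in the review thread and is not the exact code from BaseSpark4Adapter.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

class VariantStructCheck {
  // A struct passes when it has binary `metadata` and `value` fields and no
  // fields other than the optional `typed_value`.
  static boolean looksLikeVariantStruct(StructType struct) {
    boolean hasMetadata = false;
    boolean hasValue = false;
    for (StructField f : struct.fields()) {
      if ("metadata".equals(f.name()) && DataTypes.BinaryType.equals(f.dataType())) {
        hasMetadata = true;
      } else if ("value".equals(f.name()) && DataTypes.BinaryType.equals(f.dataType())) {
        hasValue = true;
      } else if (!"typed_value".equals(f.name())) {
        return false; // any other field means this is not a variant layout
      }
    }
    return hasMetadata && hasValue;
  }
}
```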
Testing:
- TestVariantDataType: Comprehensive tests verifying the newly added configs and the typed column (typed_value); an illustrative round trip follows.
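As a flavor of what such a test exercises, a Spark SQL round trip might look like the following; the table name, DDL, and TBLPROPERTIES plumbing are illustrative assumptions rather than code from TestVariantDataType.

```java
import org.apache.spark.sql.SparkSession;

class VariantRoundTripExample {
  static void roundTrip(SparkSession spark) {
    // Write a variant value with shredding enabled (illustrative DDL).
    spark.sql("CREATE TABLE variant_tbl (id INT, v VARIANT) USING hudi "
        + "TBLPROPERTIES ('hoodie.parquet.variant.write.shredding.enabled' = 'true')");
    spark.sql("INSERT INTO variant_tbl SELECT 1, parse_json('{\"price\": 42}')");

    // With shredding, extracting a typed sub-field can be served from the
    // typed_value column instead of re-parsing the binary blob.
    long price = spark
        .sql("SELECT variant_get(v, '$.price', 'long') FROM variant_tbl")
        .first().getLong(0); // expected: 42
  }
}
```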
Impact
- Users of the Variant type will experience improved query performance on Hudi tables due to the ability to leverage Parquet column shredding.

Risk Level
Low
- The feature is gated behind a config (hoodie.parquet.variant.write.shredding.enabled).

Documentation Update
- hoodie.parquet.variant.shredding.provider.class needs to be documented (marked as Advanced).

Contributor's checklist