feat(schema): Add support to write shredded variants for HoodieRecordType.AVRO #18065

Open
voonhous wants to merge 7 commits into master from variant-intro-shredded-avro-support

Conversation


@voonhous voonhous commented Jan 31, 2026

Describe the issue this Pull Request addresses

Closes: #18066

This PR introduces full read and write support for Shredded Variant types in Hudi, enabling the storage of Spark 4.0 Variant data in an optimized Parquet format.

Previously, Variant types were stored only as unshredded binary blobs. Shredding decomposes the variant data into typed columns (typed_value), allowing for significantly better compression and query performance (column pruning) when querying specific sub-fields of the Variant.
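
The shredded layout described above follows the Parquet Variant shredding convention. The sketch below is illustrative: the field names typed_value, value, and metadata come from the PR description, while the example sub-field a and its type are hypothetical.

```text
-- Unshredded: the variant is an opaque binary pair
variant_col: struct<
  metadata: binary,
  value:    binary
>

-- Shredded (illustrative, for variants that usually hold {"a": <int>}):
variant_col: struct<
  metadata:    binary,           -- variant metadata, unchanged
  value:       binary,           -- residual for data that did not shred
  typed_value: struct<
    a: struct<
      typed_value: int64,        -- typed column: prunable, well compressed
      value:       binary        -- per-field residual when "a" is not an int
    >
  >
>
```

Queries touching only variant_col:a can read the typed int64 column and skip the binary blobs entirely, which is where the compression and pruning gains come from.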

Summary and Changelog

This PR implements the "Shredding" mechanism at the Avro write layer, allowing Hudi to transform Variant records into the shredded structure required by Parquet. It abstracts the shredding logic behind a provider interface to maintain separation between Hudi's core Avro support and Spark's Variant implementation.

Key Changes:

  1. Configuration:

    • Added hoodie.parquet.variant.shredding.provider.class in HoodieStorageConfig to specify the implementation used to parse/shred variants.
    • Added logic to auto-detect the provider (org.apache.hudi.variant.Spark4VariantShreddingProvider) if present on the classpath.
  2. Core Write Support (hudi-hadoop-common):

    • Introduced VariantShreddingProvider: An interface for parsing variant binary data into shredded GenericRecords.
    • Updated HoodieAvroWriteSupport: Added logic to identify shredded variant fields in the schema and utilize the provider to populate the typed_value column during the write phase.
    • Updated HoodieAvroFileWriterFactory: Added logic to instantiate the shredding provider via reflection.
  3. Spark 4 Integration (hudi-spark4-common):

    • Added Spark4VariantShreddingProvider: A concrete implementation utilizing org.apache.spark.types.variant.VariantShreddingWriter to perform the actual shredding.
    • Updated BaseSpark4Adapter: Modified isDataTypeEqualForParquet to recognize the shredded physical schema (structs with 3 fields: value, metadata, typed_value) as a valid Variant type.
    • Updated HoodieHadoopFsRelationFactory: Propagates the hoodie.parquet.variant.allow.reading.shredded config to Spark's SQLConf to ensure the reader handles the schema correctly.
  4. Testing:

    • Added TestVariantDataType: Comprehensive tests verifying:
      • Writing with shredding enabled (verifying Parquet schema contains typed_value).
      • Writing with shredding disabled (verifying standard binary structure).
      • Reading back data in both scenarios.
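
The provider indirection in items 1 and 2 can be pictured with a short sketch. Only the interface name VariantShreddingProvider, the config key hoodie.parquet.variant.shredding.provider.class, and the reflection-based loading come from this PR; the method signature, the fallback behavior, and the placeholder class name below are assumptions for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the provider interface; the real one lives in
// hudi-hadoop-common and returns an Avro GenericRecord, which is elided here.
interface VariantShreddingProvider {
    Object shred(ByteBuffer metadata, ByteBuffer value);
}

public class ProviderLoadingSketch {
    // Mirrors the reflective instantiation: the class name comes from
    // hoodie.parquet.variant.shredding.provider.class (or auto-detection),
    // and is only instantiated if it is actually on the classpath.
    static VariantShreddingProvider load(String className) {
        try {
            return (VariantShreddingProvider) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // No provider available (e.g. no Spark 4 jars): fall back to
            // writing the variant as an unshredded binary blob.
            return null;
        }
    }

    public static void main(String[] args) {
        // A class name that is certainly absent, to show the graceful fallback.
        VariantShreddingProvider p = load("org.example.MissingProvider");
        System.out.println(p == null ? "unshredded fallback" : "shredding enabled");
    }
}
```

This is why the PR can claim no regressions for older Spark versions: when the provider class cannot be loaded, the write path simply behaves as before.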

Impact

  • Performance: Users on Spark 4.0+ using the Variant type will see improved query performance on Hudi tables, since Parquet column shredding enables column pruning and better compression for queried sub-fields.
  • Compatibility: This feature is specific to Spark 4.0+. The implementation uses reflection and separate modules to ensure no regressions for older Spark versions.

Risk Level

Low

  • The changes are isolated to the Variant data type paths.
  • The shredding logic is guarded by configuration flags (hoodie.parquet.variant.write.shredding.enabled).
  • The provider is loaded via reflection, preventing classpath issues in environments without Spark 4 libraries.
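
Putting the flags together, a minimal illustrative configuration using the config names from this PR would look like the following (default values and the exact spelling of the read-side key are as stated in the description above):

```text
# Write side: enable shredding (guard flag, per the Risk Level section)
hoodie.parquet.variant.write.shredding.enabled=true

# Optional: pin the provider instead of relying on classpath auto-detection
hoodie.parquet.variant.shredding.provider.class=org.apache.hudi.variant.Spark4VariantShreddingProvider

# Read side: allow the reader to interpret the shredded physical schema
hoodie.parquet.variant.allow.reading.shredded=true
```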

Documentation Update

  • The new configuration hoodie.parquet.variant.shredding.provider.class needs to be documented (marked as Advanced).
  • Updates to Variant data type documentation to mention support for shredding in Spark 4.0.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

- Add adapter pattern for Spark3 and 4
- Cleanup invariant issue in SparkSqlWriter
- Add cross engine test
- Add backward compatibility test for Spark3.x
- Add cross engine read for Flink
@voonhous voonhous marked this pull request as draft January 31, 2026 06:59
@github-actions github-actions bot added the size:XL (PR with lines of changes > 1000) label on Jan 31, 2026
@voonhous voonhous force-pushed the variant-intro-shredded-avro-support branch from 306da14 to a0622a2 on January 31, 2026 08:05
@voonhous voonhous marked this pull request as ready for review January 31, 2026 16:37
@voonhous voonhous force-pushed the variant-intro-shredded-avro-support branch 4 times, most recently from 6605bcf to c4b0b3c on February 1, 2026 06:53

voonhous commented Feb 1, 2026

Oops, accidentally pushed this to hudi repo instead of my own fork. Should be fine, need to remind myself to delete the branch after merge then.

@voonhous voonhous force-pushed the variant-intro-shredded-avro-support branch 5 times, most recently from fb48ca6 to 48a01e9 on February 2, 2026 09:07
@voonhous voonhous changed the title from "feat(schema): Variant intro shredded avro support" to "feat(schema): Add support to write shredded variants for HoodieRecordType.AVRO" on Feb 3, 2026
- Added support to write shredded types for HoodieRecordType.AVRO
- Added functional tests for testing newly added configs
@voonhous voonhous force-pushed the variant-intro-shredded-avro-support branch from bbbb8e7 to f79a417 on February 3, 2026 02:44

hudi-bot commented Feb 3, 2026

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build


/**
 * Checks if a StructType looks like a variant representation (has value and metadata binary fields).
 * This is a structural check that doesn't rely on metadata, useful during schema reconciliation.
 */
A contributor commented on this snippet:
Is it possible to use the struct's metadata to be more direct about what is or isn't a variant?



Development

Successfully merging this pull request may close these issues.

Add SHREDDED Parquet writer support for Variant for HoodieRecordType.AVRO
