feat(schema): Add VARIANT support to HoodieSchema #17751
base: master
Conversation
      return DATE;
    } else if (logicalType == LogicalTypes.uuid()) {
      return UUID;
    } else if (logicalType instanceof VariantLogicalType) {
Can we follow a similar pattern to the UUID case, where there is a singleton we can compare against instead of using instanceof?
Makes sense; that limits the number of objects created for types. Will eagerly initialize a singleton.
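The eager-singleton idea above can be sketched as follows. This is a minimal illustration with hypothetical names (the real VariantLogicalType in the PR may differ); it mirrors Avro's LogicalTypes.uuid() pattern so the type-mapping dispatch can use a reference-equality check instead of instanceof.

```java
// Sketch (hypothetical names): an eagerly initialized singleton for the
// variant logical type, so callers can compare by reference identity.
class VariantLogicalType {
    // Eagerly initialized singleton: one shared instance per JVM.
    private static final VariantLogicalType INSTANCE = new VariantLogicalType();

    // Private constructor: no other instances can be created.
    private VariantLogicalType() {}

    // Accessor in the style of Avro's LogicalTypes.uuid().
    static VariantLogicalType variant() {
        return INSTANCE;
    }

    String getName() {
        return "variant";
    }
}
```

The dispatch branch can then read `} else if (logicalType == VariantLogicalType.variant()) {`, matching the `LogicalTypes.uuid()` comparison above.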
    }

    /**
     * Creates an unshredded Variant schema.
General question: can you have shredded and unshredded values in the same dataset? If so, it seems like the schema should be the same for both.
I don't quite understand this question. The implementation follows the Parquet schema spec.
Nonetheless, I'd still like to understand what you're pushing towards.
Do you mean whether we can have an unshredded_typed_column and a shredded_typed_column in the same dataset?
Or are you saying that since shredded_variant typed columns can hold unshredded data, we should just maintain the shredded type?
Unshredded:

    optional group variant_unshredded (VARIANT) {
      required binary metadata;
      required binary value;
    }

Shredded:

    optional group variant_shredded (VARIANT) {
      required binary metadata;
      optional binary value;
      optional int64 typed_value;
    }
So, to represent unshredded data with the shredded schema, we can just leave typed_value null and populate value?
I'm wondering if the column can have both shredded and unshredded in the same file.
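The row-level mixing being discussed can be sketched as a tiny model. This is a hypothetical class for illustration (not from the PR), roughly following the Parquet VariantShredding spec: under the shredded schema above, a row carries the full binary-encoded variant in value (unshredded form) or a decoded typed_value (shredded form), so a single column can hold both forms in the same file, row by row.

```java
// Sketch (hypothetical class): one row of a shredded variant column.
// metadata is always required; value and typed_value are each optional.
class ShreddedVariantRow {
    final byte[] metadata;   // required: variant metadata (dictionary)
    final byte[] value;      // optional: full binary-encoded variant value
    final Long typedValue;   // optional: shredded typed_value (int64 here)

    ShreddedVariantRow(byte[] metadata, byte[] value, Long typedValue) {
        this.metadata = metadata;
        this.value = value;
        this.typedValue = typedValue;
    }

    // A row is effectively "unshredded" when typed_value is null and the
    // whole payload lives in the value column.
    boolean isUnshredded() {
        return typedValue == null && value != null;
    }
}
```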
     * @param typedValueSchema the schema for the typed_value field (can be null if typed_value is not needed)
     * @return a new HoodieSchema.Variant representing a shredded variant
     */
    public static HoodieSchema.Variant createVariantShredded(HoodieSchema typedValueSchema) {
Do we need to include shredding information in the type/schema layer? IIUC, it's more of a read/write optimization mechanism, which can be inferred or fetched from configuration during reading or writing.
Yeap, that's my understanding too. Since we are introducing the VARIANT type, I am merely following the spec that Parquet provides.
It's similar to DECIMAL. The same argument applies: do we really need a DECIMAL backed by BYTES or by FIXED? Both are DECIMAL logical types, and each has its own performance trade-offs.
I do not have an answer on whether we need to include shredding information in the type/schema layer. But I think a similar question would be asked if I did not include it. So we might as well include it now and leave it up for discussion by the experts.
From what I read in the Variant shredding spec: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
Shredding is an optional/extended feature, so I am not really sure if ALL engines actually support it.
We might need devs with more domain knowledge to chime in on how different engines support VARIANT and whether there's a risk in supporting only one form and not the other.
Thanks for explaining with the DECIMAL example. The extra type information seems necessary, since HoodieSchema is now a wrapper class for the Avro schema and needs to provide compatibility with Avro.

> Shredding is an optional/extended feature, so I am not really sure if ALL engines actually support it.

AFAIK, Flink does not support shredding yet, and Spark supports shredding. BTW, both engines do not include shredding information in the data type layer.
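To make the design question concrete, the factory-method shape being debated could look roughly like the following. All names here are hypothetical and only modeled on the signatures visible in the diff (createVariantShredded with a nullable typed_value schema); the point is that keeping shredding in the type layer means the variant type itself records whether a typed_value schema was supplied.

```java
// Sketch (hypothetical names): a minimal stand-in for HoodieSchema's variant
// factories, where the type itself carries the shredding information.
class HoodieSchemaSketch {
    static class Variant {
        // Name of the typed_value schema; null means unshredded.
        final String typedValueType;

        private Variant(String typedValueType) {
            this.typedValueType = typedValueType;
        }

        boolean isShredded() {
            return typedValueType != null;
        }
    }

    // Unshredded variant: only metadata + value fields, no typed_value.
    static Variant createVariantUnshredded() {
        return new Variant(null);
    }

    // Shredded variant: also carries the typed_value schema (a type name here).
    static Variant createVariantShredded(String typedValueType) {
        return new Variant(typedValueType);
    }
}
```

The alternative discussed above would drop isShredded() from the type and derive it from configuration at read/write time instead.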
Force-pushed from 0fff72f to ed2bd55
Describe the issue this Pull Request addresses

Introduce the new VARIANT schema type. This will look similar to a record/struct with a logical type associated with it. The implementation is in line with the Parquet spec.

Linked task: #17745

Note: No reader and writer support has been added yet. It will be added in a separate PR.

Summary and Changelog

Add VARIANT support to HoodieSchema.
Add VARIANT type to the HoodieSchemaType enum.

Impact
Support for Variant types in accordance with Parquet's spec.
Risk Level
Low
Documentation Update
None
Contributor's checklist