fix: Fix logical type issue for timestamp columns #17601
base: branch-0.x
Conversation
lokeshj1703 left a comment:
@linliu-code Thanks for working on this! The PR contains a few changes that are not part of https://github.com/apache/hudi/pull/14161/files. Can we add a description of how the fix works for older Hudi tables? The original PR also mentions a limitation:

> However, we used the InternalSchema system to do various operations such as fixing null ordering, reordering, and adding columns. At the time, InternalSchema only had a single Timestamp type. When converting back to Avro, this was assumed to be micros.

Is this limitation fixed for older Hudi tables?
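For context, a minimal sketch of the mislabeling described in the quoted limitation, using plain Avro APIs rather than Hudi's InternalSchema (class and variable names here are illustrative): a field written with the millis logical type could round-trip into a schema declared with the micros logical type, while the stored long values remain epoch milliseconds.

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampLabelSketch {
  public static void main(String[] args) {
    // What the writer intended: a long column storing epoch milliseconds.
    Schema millis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));

    // What a round trip through a single-Timestamp-type intermediate schema
    // could produce: the same long column, now labeled as epoch microseconds.
    Schema micros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

    System.out.println(millis.getLogicalType().getName()); // timestamp-millis
    System.out.println(micros.getLogicalType().getName()); // timestamp-micros
  }
}
```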
```java
// NOTE: Those are not supported in Avro 1.8.2 (used by Spark 2)
// Only add conversions if they're available
```
Should we validate the fix and the added tests with Spark 2? I am not sure CI covers it by default.
Right now we only make the conversion for Spark 3.4+.
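For illustration, a hedged sketch of how a guard like the one in the snippet above could look: the conversion is registered reflectively so the class still loads against Avro 1.8.2, where `TimeConversions.TimestampMillisConversion` does not exist. The helper name and registration point are assumptions, not the PR's actual code.

```java
import org.apache.avro.Conversion;
import org.apache.avro.generic.GenericData;

public class AvroConversionGuard {
  // Register the timestamp-millis conversion only if the Avro version on the
  // classpath provides it (it is absent in Avro 1.8.2, which Spark 2 uses).
  static void maybeAddTimestampMillisConversion(GenericData genericData) {
    try {
      Class<?> conversionClass =
          Class.forName("org.apache.avro.data.TimeConversions$TimestampMillisConversion");
      genericData.addLogicalTypeConversion(
          (Conversion<?>) conversionClass.getDeclaredConstructor().newInstance());
    } catch (ReflectiveOperationException e) {
      // Old Avro: conversion class not found; skip registration and fall back
      // to reading the raw long values.
    }
  }
}
```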
nsivabalan left a comment:
I have started reviewing the patch. Will keep sharing my reviews in smaller chunks so that you can start addressing them.
Can you confirm that we are not fixing nested fields for the logical timestamp issue?
Can we test 1.1.1 as well? We need to understand what it takes to get it fixed, or at least call out in the documentation what the expected behavior is in 1.1.1, 0.15.1, 0.14.2, etc.
```xml
<exclusions>
  <exclusion>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>*</artifactId>
```
Why do we need this change?
Introduced to resolve some conflicts. Will check whether we can avoid this, or whether it was due to some flakiness.
```xml
    <groupId>org.pentaho</groupId>
    <artifactId>*</artifactId>
  </exclusion>
  <exclusion>
```
Can you help me understand the necessity of this code change?
They are due to some dependency conflicts, most likely because we use Spark 3.5 for Azure CI. I can remove these dependency changes to see which compilation steps or tests fail.
nsivabalan left a comment:
Sharing a few more pieces of feedback.
```java
  }
}

public final List<String> getPartitionNames() {
```
Why do we need this? It is not related to the logical timestamp fixes, right?
I probably saw it in some test failures. Will remove it to see if any tests fail.
> Can we test 1.1.1 as well? We need to understand what it takes to get it fixed, or at least call out in the documentation what the expected behavior is in 1.1.1, 0.15.1, 0.14.2, etc.

Sure. This limitation has been fixed.
```java
  requestedSchema = readerSchema;
}
// Set configuration for timestamp_millis type repair.
storage.getConf().set(ENABLE_LOGICAL_TIMESTAMP_REPAIR,
    Boolean.toString(AvroSchemaUtils.hasTimestampMillisField(readerSchema)));
```
We should try to read the config from the driver before doing this. If it is not set, then we can parse the schema and set it.
@nsivabalan, I will focus on other comments first.
```java
this.forceFullScan = forceFullScan;
this.internalSchema = internalSchema == null ? InternalSchema.getEmptyInternalSchema() : internalSchema;
this.enableOptimizedLogBlocksScan = enableOptimizedLogBlocksScan;
this.enableLogicalTimestampFieldRepair = readerSchema != null && AvroSchemaUtils.hasTimestampMillisField(readerSchema);
```
Can we check if hadoopConf already contains the info and fetch it from there? Also, can we populate the value only in hadoopConf and use that to pass it around? I wanted to avoid passing individual boolean flags like we are currently doing in this patch.
```java
// before
this(storage, logFile, readerSchema, bufferSize, reverseReader, enableRecordLookups, keyField, InternalSchema.getEmptyInternalSchema());
// after
this(storage, logFile, readerSchema, bufferSize, reverseReader, enableRecordLookups, keyField,
    InternalSchema.getEmptyInternalSchema(),
    readerSchema != null && AvroSchemaUtils.hasTimestampMillisField(readerSchema));
```
We should try removing this. Let's always look it up in the Hadoop conf/storageConf, and set the value in the driver before invoking these classes.
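A minimal sketch of the suggested pattern, assuming a Hadoop `Configuration` and a hypothetical property name standing in for the `ENABLE_LOGICAL_TIMESTAMP_REPAIR` constant in the diffs above: the driver computes and publishes the flag once, and readers fall back to parsing the schema only when the flag is unset.

```java
import java.util.function.Supplier;

import org.apache.hadoop.conf.Configuration;

public class TimestampRepairFlag {
  // Hypothetical key, standing in for the PR's ENABLE_LOGICAL_TIMESTAMP_REPAIR constant.
  static final String KEY = "hoodie.logical.timestamp.repair.enable";

  // Driver side: compute once (e.g. via AvroSchemaUtils.hasTimestampMillisField)
  // and publish through the conf instead of passing a boolean through constructors.
  static void setOnDriver(Configuration conf, boolean hasTimestampMillisField) {
    conf.set(KEY, Boolean.toString(hasTimestampMillisField));
  }

  // Reader side: prefer the published value; parse the schema only when unset.
  static boolean isEnabled(Configuration conf, Supplier<Boolean> parseSchemaFallback) {
    String value = conf.get(KEY);
    return value != null ? Boolean.parseBoolean(value) : parseSchemaFallback.get();
  }
}
```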
```scala
val shouldReadInMemory = columnStatsIndex.shouldReadInMemory(this, queryReferencedColumns)

// Identify timestamp-millis columns from the Avro schema to skip from filter translation
// (even if they're in the index, they may have been indexed before the fix and should not be used for filtering)
```
Can we move this to a separate method?
Also, did we add UTs or functional tests (not end-to-end) directly against the data skipping layer?
Will add UTs first, and see how to add FTs.
UT added.
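For illustration, a hedged Java sketch of what the suggested separate method might look like (the surrounding PR code is Scala, and the nullable-union handling here is an assumption): it collects the top-level timestamp-millis column names that should be excluded from filter translation.

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DataSkippingColumns {
  // Collect top-level field names whose (possibly nullable) type is long with the
  // timestamp-millis logical type; data skipping should not translate filters on
  // them, since their column stats may predate the fix.
  static Set<String> timestampMillisColumns(Schema tableSchema) {
    Set<String> columns = new HashSet<>();
    for (Schema.Field field : tableSchema.getFields()) {
      Schema fieldSchema = unwrapNullable(field.schema());
      if (fieldSchema.getType() == Schema.Type.LONG
          && fieldSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis) {
        columns.add(field.name());
      }
    }
    return columns;
  }

  // Unwrap the common ["null", T] union shape; anything else is returned as-is.
  static Schema unwrapNullable(Schema schema) {
    if (schema.getType() == Schema.Type.UNION && schema.getTypes().size() == 2) {
      for (Schema branch : schema.getTypes()) {
        if (branch.getType() != Schema.Type.NULL) {
          return branch;
        }
      }
    }
    return schema;
  }
}
```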
```scala
val shouldReadInMemory = columnStatsIndex.shouldReadInMemory(this, queryReferencedColumns)

// Identify timestamp-millis columns from the Avro schema to skip from filter translation
// (even if they're in the index, they may have been indexed before the fix and should not be used for filtering)
```
Also, did we add UTs or functional tests (not end-to-end) directly against the data skipping layer?
We can add some.
Describe the issue this Pull Request addresses
This PR mainly combines two PRs that fix the timestamp_millis logical type issue.
Summary and Changelog
Below is the PR description from #14161.
PR #9743 added more schema evolution functionality and schema processing. However, we used the InternalSchema system to do various operations such as fixing null ordering, reordering, and adding columns. At the time, InternalSchema only had a single Timestamp type. When converting back to Avro, this was assumed to be micros. Therefore, if the schema provider had any millis columns, the processed schema would end up with those columns as micros.
In #13711, which updated column stats with better support for logical types, the schema issues were fixed, along with additional issues in the handling and conversion of timestamps during ingestion.
This PR aims to add functionality to the Spark and Hive readers and writers to automatically repair affected tables.
After switching to the 1.1 binary, the affected columns will undergo evolution from timestamp-micros to timestamp-millis. While this would normally be a lossy, unsupported evolution, it is safe here because the data is actually still timestamp-millis; it is merely mislabeled as micros in the parquet and table schemas.
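A hedged sketch of the relabeling just described, using plain Avro APIs (this is illustrative, not the PR's actual repair code): when the table schema says millis but the file schema says micros for the same long column, the table label wins, and no value conversion is performed because the stored longs are already milliseconds.

```java
import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class RelabelSketch {
  // Decide which logical type to read a long column with. The micros->millis
  // relabel is accepted because the underlying values were never actually micros.
  static LogicalType effectiveTimestampType(Schema fileField, Schema tableField) {
    LogicalType fileType = fileField.getLogicalType();
    LogicalType tableType = tableField.getLogicalType();
    if (fileType instanceof LogicalTypes.TimestampMicros
        && tableType instanceof LogicalTypes.TimestampMillis) {
      return tableType; // trust the table: the column is really timestamp-millis
    }
    return fileType;
  }
}
```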
Impact
When reading from a Hudi table using the Spark or Hive reader, if the table schema has a column as millis but the data schema is micros, we assume that the column is affected and read it as a millis value instead of a micros value. This correction is also applied to all readers used by the default write paths, so the parquet files become correct as the table is rewritten. A table's latest snapshot can be fixed immediately by writing one commit with the 1.1 binary.
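To illustrate why this matters, a small example with a hypothetical value: a long that actually stores epoch milliseconds lands in January 1970 if it is misread as microseconds, since the two units differ by a factor of 1000.

```java
import java.time.Instant;

public class MisreadExample {
  public static void main(String[] args) {
    long stored = 1_700_000_000_000L; // actually epoch millis (Nov 2023)

    // Correct read as millis: 2023-11-14T22:13:20Z
    Instant asMillis = Instant.ofEpochMilli(stored);

    // Misread as micros: 1970-01-20T16:13:20Z
    Instant asMicros = Instant.ofEpochSecond(
        stored / 1_000_000, (stored % 1_000_000) * 1_000);

    System.out.println(asMillis);
    System.out.println(asMicros);
  }
}
```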
Risk Level
High; extensive testing was done and functional tests were added.
Documentation Update
#14100