fix: Fix logical type issue for timestamp columns #17601
base: branch-0.x
Conversation
lokeshj1703 left a comment:
@linliu-code Thanks for working on this! The PR contains a few changes that are not part of https://github.com/apache/hudi/pull/14161/files. Can we add a description of how the fix works for older Hudi tables? The original PR also mentions a limitation:

> However, we used the InternalSchema system to do various operations such as fixing null ordering, reordering, and adding columns. At the time, InternalSchema only had a single Timestamp type. When converting back to Avro, this was assumed to be micros.

Is this limitation fixed for older Hudi tables?
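For context, a minimal sketch of the mislabeling described in the quoted limitation, using plain Avro APIs rather than Hudi's InternalSchema (class and variable names here are illustrative): a field written with the millis logical type could round-trip into a schema declared with the micros logical type, while the stored long values remain epoch milliseconds.

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampLabelSketch {
  public static void main(String[] args) {
    // What the writer intended: a long column storing epoch milliseconds.
    Schema millis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));

    // What a round trip through a single-Timestamp-type intermediate schema
    // could produce: the same long column, now labeled as epoch microseconds.
    Schema micros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

    System.out.println(millis.getLogicalType().getName()); // timestamp-millis
    System.out.println(micros.getLogicalType().getName()); // timestamp-micros
  }
}
```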
```java
// NOTE: Those are not supported in Avro 1.8.2 (used by Spark 2)
// Only add conversions if they're available
```
Should we validate the fix and the added tests with Spark 2? I am not sure CI covers it by default.
Right now we only make the conversion for Spark 3.4+.
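For illustration, a hedged sketch of how a guard like the one in the snippet above could look: the conversion is registered reflectively so the class still loads against Avro 1.8.2, where `TimeConversions.TimestampMillisConversion` does not exist. The helper name and registration point are assumptions, not the PR's actual code.

```java
import org.apache.avro.Conversion;
import org.apache.avro.generic.GenericData;

public class AvroConversionGuard {
  // Register the timestamp-millis conversion only if the Avro version on the
  // classpath provides it (it is absent in Avro 1.8.2, which Spark 2 uses).
  static void maybeAddTimestampMillisConversion(GenericData genericData) {
    try {
      Class<?> conversionClass =
          Class.forName("org.apache.avro.data.TimeConversions$TimestampMillisConversion");
      genericData.addLogicalTypeConversion(
          (Conversion<?>) conversionClass.getDeclaredConstructor().newInstance());
    } catch (ReflectiveOperationException e) {
      // Old Avro: conversion class not found; skip registration and fall back
      // to reading the raw long values.
    }
  }
}
```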
nsivabalan left a comment:
I have started reviewing the patch. Will keep sharing my reviews in smaller chunks so that you can start addressing them.
Can you confirm that we are not fixing nested fields for the logical timestamp issue?
Can we test 1.1.1 as well? We need to understand what it takes to get it fixed, or at least call out in the documentation what the expected behavior is in 1.1.1, 0.15.1, 0.14.2, etc.
```xml
<exclusions>
  <exclusion>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>*</artifactId>
```
Why do we need this change?
Introduced to resolve some conflicts. Will check whether we can avoid this, or whether it was due to some flakiness.
```xml
    <groupId>org.pentaho</groupId>
    <artifactId>*</artifactId>
  </exclusion>
  <exclusion>
```
Can you help me understand the necessity of this code change?
They are due to some dependency conflicts, most likely because we use Spark 3.5 for Azure CI. I can remove these dependency changes to see which compilation steps or tests fail.
nsivabalan left a comment:
Sharing a few more pieces of feedback.
```java
  }
}

public final List<String> getPartitionNames() {
```
Why do we need this? It is not related to the logical timestamp fixes, right?
I probably saw it in some test failures. Will remove it to see if any tests fail.
> Can we test 1.1.1 as well? We need to understand what it takes to get it fixed, or at least call out in the documentation what the expected behavior is in 1.1.1, 0.15.1, 0.14.2, etc.

Sure. This limitation has been fixed.
```java
  requestedSchema = readerSchema;
}
// Set configuration for timestamp_millis type repair.
storage.getConf().set(ENABLE_LOGICAL_TIMESTAMP_REPAIR,
    Boolean.toString(AvroSchemaUtils.hasTimestampMillisField(readerSchema)));
```
We should try to read the config from the driver before doing this. If it is not set, then we can parse the schema and set it.
@nsivabalan, I will focus on other comments first.
```java
this.forceFullScan = forceFullScan;
this.internalSchema = internalSchema == null ? InternalSchema.getEmptyInternalSchema() : internalSchema;
this.enableOptimizedLogBlocksScan = enableOptimizedLogBlocksScan;
this.enableLogicalTimestampFieldRepair = readerSchema != null && AvroSchemaUtils.hasTimestampMillisField(readerSchema);
```
Can we check if hadoopConf already contains the info and fetch it from there? Also, can we populate the value only in hadoopConf and use that to pass it around? I wanted to avoid passing individual boolean flags like we are currently doing in this patch.
```java
// before
this(storage, logFile, readerSchema, bufferSize, reverseReader, enableRecordLookups, keyField, InternalSchema.getEmptyInternalSchema());
// after
this(storage, logFile, readerSchema, bufferSize, reverseReader, enableRecordLookups, keyField,
    InternalSchema.getEmptyInternalSchema(),
    readerSchema != null && AvroSchemaUtils.hasTimestampMillisField(readerSchema));
```
We should try removing this. Let's always look it up in the Hadoop conf/storageConf, and set the value in the driver before invoking these classes.
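A minimal sketch of the suggested pattern, assuming a Hadoop `Configuration` and a hypothetical property name standing in for the `ENABLE_LOGICAL_TIMESTAMP_REPAIR` constant in the diffs above: the driver computes and publishes the flag once, and readers fall back to parsing the schema only when the flag is unset.

```java
import java.util.function.Supplier;

import org.apache.hadoop.conf.Configuration;

public class TimestampRepairFlag {
  // Hypothetical key, standing in for the PR's ENABLE_LOGICAL_TIMESTAMP_REPAIR constant.
  static final String KEY = "hoodie.logical.timestamp.repair.enable";

  // Driver side: compute once (e.g. via AvroSchemaUtils.hasTimestampMillisField)
  // and publish through the conf instead of passing a boolean through constructors.
  static void setOnDriver(Configuration conf, boolean hasTimestampMillisField) {
    conf.set(KEY, Boolean.toString(hasTimestampMillisField));
  }

  // Reader side: prefer the published value; parse the schema only when unset.
  static boolean isEnabled(Configuration conf, Supplier<Boolean> parseSchemaFallback) {
    String value = conf.get(KEY);
    return value != null ? Boolean.parseBoolean(value) : parseSchemaFallback.get();
  }
}
```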
```scala
val shouldReadInMemory = columnStatsIndex.shouldReadInMemory(this, queryReferencedColumns)

// Identify timestamp-millis columns from the Avro schema to skip from filter translation
// (even if they're in the index, they may have been indexed before the fix and should not be used for filtering)
```
Can we move this to a separate method?
Also, did we add UTs or functional tests (not end-to-end) directly against the data skipping layer?
Will add UTs first, and see how to add FTs.
UT added.
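For illustration, a hedged Java sketch of what the suggested separate method might look like (the surrounding PR code is Scala, and the nullable-union handling here is an assumption): it collects the top-level timestamp-millis column names that should be excluded from filter translation.

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DataSkippingColumns {
  // Collect top-level field names whose (possibly nullable) type is long with the
  // timestamp-millis logical type; data skipping should not translate filters on
  // them, since their column stats may predate the fix.
  static Set<String> timestampMillisColumns(Schema tableSchema) {
    Set<String> columns = new HashSet<>();
    for (Schema.Field field : tableSchema.getFields()) {
      Schema fieldSchema = unwrapNullable(field.schema());
      if (fieldSchema.getType() == Schema.Type.LONG
          && fieldSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis) {
        columns.add(field.name());
      }
    }
    return columns;
  }

  // Unwrap the common ["null", T] union shape; anything else is returned as-is.
  static Schema unwrapNullable(Schema schema) {
    if (schema.getType() == Schema.Type.UNION && schema.getTypes().size() == 2) {
      for (Schema branch : schema.getTypes()) {
        if (branch.getType() != Schema.Type.NULL) {
          return branch;
        }
      }
    }
    return schema;
  }
}
```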
```scala
val shouldReadInMemory = columnStatsIndex.shouldReadInMemory(this, queryReferencedColumns)

// Identify timestamp-millis columns from the Avro schema to skip from filter translation
// (even if they're in the index, they may have been indexed before the fix and should not be used for filtering)
```
Also, did we add UTs or functional tests (not end-to-end) directly against the data skipping layer?
We can add some.
Describe the issue this Pull Request addresses
This PR mainly combines two PRs that fix the timestamp_millis logical type issue.
Summary and Changelog
Below is the PR description from #14161.
PR #9743 added more schema evolution functionality and schema processing. However, we used the InternalSchema system to do various operations such as fixing null ordering, reordering, and adding columns. At the time, InternalSchema only had a single Timestamp type. When converting back to Avro, this was assumed to be micros. Therefore, if the schema provider had any millis columns, the processed schema would end up with those columns as micros.
In #13711, which updated column stats with better support for logical types, the schema issues were fixed, along with additional issues in the handling and conversion of timestamps during ingestion.
This PR aims to add functionality to the Spark and Hive readers and writers to automatically repair affected tables.
After switching to the 1.1 binary, the affected columns will undergo evolution from timestamp-micros to timestamp-millis. While this would normally be a lossy, unsupported evolution, it is safe here because the data is actually still timestamp-millis; it is merely mislabeled as micros in the parquet and table schemas.
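A hedged sketch of the relabeling just described, using plain Avro APIs (this is illustrative, not the PR's actual repair code): when the table schema says millis but the file schema says micros for the same long column, the table label wins, and no value conversion is performed because the stored longs are already milliseconds.

```java
import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class RelabelSketch {
  // Decide which logical type to read a long column with. The micros->millis
  // relabel is accepted because the underlying values were never actually micros.
  static LogicalType effectiveTimestampType(Schema fileField, Schema tableField) {
    LogicalType fileType = fileField.getLogicalType();
    LogicalType tableType = tableField.getLogicalType();
    if (fileType instanceof LogicalTypes.TimestampMicros
        && tableType instanceof LogicalTypes.TimestampMillis) {
      return tableType; // trust the table: the column is really timestamp-millis
    }
    return fileType;
  }
}
```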
Impact
When reading from a Hudi table using the Spark or Hive reader, if the table schema has a column as millis but the data schema is micros, we assume that the column is affected and read it as a millis value instead of a micros value. This correction is also applied to all readers used by the default write paths, so the parquet files become correct as the table is rewritten. A table's latest snapshot can be fixed immediately by writing one commit with the 1.1 binary.
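To illustrate why this matters, a small example with a hypothetical value: a long that actually stores epoch milliseconds lands in January 1970 if it is misread as microseconds, since the two units differ by a factor of 1000.

```java
import java.time.Instant;

public class MisreadExample {
  public static void main(String[] args) {
    long stored = 1_700_000_000_000L; // actually epoch millis (Nov 2023)

    // Correct read as millis: 2023-11-14T22:13:20Z
    Instant asMillis = Instant.ofEpochMilli(stored);

    // Misread as micros: 1970-01-20T16:13:20Z
    Instant asMicros = Instant.ofEpochSecond(
        stored / 1_000_000, (stored % 1_000_000) * 1_000);

    System.out.println(asMillis);
    System.out.println(asMicros);
  }
}
```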
Risk Level
High; extensive testing was done and functional tests were added.
Documentation Update
#14100