Add MOR snapshot reads, metadata index pruning, and table statistics to Hudi connector #28511
voonhous wants to merge 6 commits into trinodb:master
Conversation
- HudiFile/HudiBaseFile/HudiLogFile: Trino-native wrappers for Hudi base files and log files
- TupleDomainUtils: predicate helpers for domain-based index lookups
- HudiAvroSerializer: bidirectional Avro <-> Trino type conversion
- HudiTableTypeUtils: COW/MOR input format detection
- InlineSeekableDataInputStream / TrinoHudiInlineStorage: support for reading log files embedded via the InLineFS URI scheme
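The InLineFS idea in the last bullet can be sketched without Trino's filesystem APIs: an embedded log file is just a byte range inside an outer file, exposed as its own seekable stream whose position 0 maps to the range's start offset. Everything below (class name, method names, the byte-array backing) is an illustrative stand-in, not the connector's implementation.

```java
import java.io.InputStream;

// Illustrative sketch of an InLineFS-style stream: reads are confined to a
// [startOffset, startOffset + length) range of an outer file's content, and
// seek() is relative to the embedded range, not the outer file.
public class InlineRangeInputStream extends InputStream
{
    private final byte[] outerContent;   // stand-in for the outer file
    private final int startOffset;
    private final int length;
    private int position;                // position within the inline range

    public InlineRangeInputStream(byte[] outerContent, int startOffset, int length)
    {
        this.outerContent = outerContent;
        this.startOffset = startOffset;
        this.length = length;
    }

    public void seek(long pos)
    {
        position = (int) pos;            // seek is relative to the inline range
    }

    @Override
    public int read()
    {
        if (position >= length) {
            return -1;                   // EOF at the end of the range, not the outer file
        }
        return outerContent[startOffset + position++] & 0xFF;
    }
}
```

The key property is the bounded EOF: a log-block reader handed this stream can never read past the embedded region into unrelated bytes of the outer file.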
Introduces TrinoHudiReaderContext and HudiPageSource to support reading MOR tables by merging base files with delta log files using HoodieFileGroupReader. HudiAvroSerializer bridges the Avro record representation used internally by the merger back to Trino pages.
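The snapshot-read merge described above can be illustrated in miniature. The sketch below is not the HoodieFileGroupReader API; it only shows the merge semantics: log records overlay base-file records by record key, later log entries win, and deletes (modelled here as null payloads) drop rows.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a MOR snapshot read for one file group: start from
// the base Parquet file's records, then apply the delta log entries in order.
public class FileGroupMergeSketch
{
    public static Map<String, String> mergeSnapshot(
            Map<String, String> baseFileRecords,          // recordKey -> payload from the base file
            List<Map.Entry<String, String>> logRecords)   // ordered log entries; null payload = delete
    {
        Map<String, String> merged = new LinkedHashMap<>(baseFileRecords);
        for (Map.Entry<String, String> entry : logRecords) {
            if (entry.getValue() == null) {
                merged.remove(entry.getKey());            // delete marker removes the row
            }
            else {
                merged.put(entry.getKey(), entry.getValue()); // upsert: latest write wins
            }
        }
        return merged;
    }
}
```

A read-optimised query returns only `baseFileRecords`; the snapshot path returns the merged view, which is why it reflects updates and deletes that have not yet been compacted into a new base file.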
Adds an extensible index strategy hierarchy (HudiIndexSupport) backed by four implementations: column stats, partition stats, record-level, and secondary indexes. IndexSupportFactory selects the best applicable strategy per query based on session config. Also adds HudiSplitColumns for materialising virtual columns ($path, $file_size, partition keys).
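The per-query selection done by IndexSupportFactory follows an ordinary strategy pattern; a minimal sketch with invented names and deliberately simplified applicability rules (the real strategies inspect the query's tuple domain and the metadata table's available index partitions):

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of priority-ordered index strategy selection: each
// strategy declares when it applies, and the factory returns the first
// enabled, applicable one.
public class IndexSelectionSketch
{
    record IndexStrategy(String name, boolean enabledInSession, Predicate<Set<String>> applicableTo) {}

    // predicateColumns: columns constrained by the query's predicates
    public static Optional<String> chooseStrategy(
            List<IndexStrategy> strategiesInPriorityOrder,
            Set<String> predicateColumns)
    {
        return strategiesInPriorityOrder.stream()
                .filter(IndexStrategy::enabledInSession)              // session toggles gate each index
                .filter(s -> s.applicableTo().test(predicateColumns)) // e.g. a record index needs the key column
                .map(IndexStrategy::name)
                .findFirst();                                         // empty -> no index pruning, full listing
    }
}
```

Returning `Optional.empty()` here mirrors the graceful fallback described later: when no index applies, split planning proceeds with a full file listing.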
Adds TableStatisticsReader and TableMetadataReader to pull column-level statistics (row counts, null fractions, data sizes) from the Hudi metadata table's column stats partition. HudiExecutorModule provides a dedicated async executor for background statistics refresh.
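The background-refresh behaviour can be sketched as a small cache that never blocks the caller. The names below are hypothetical, not the connector's classes; the point is the shape: planning reads whatever value is cached (possibly none), and a dedicated executor loads fresh statistics off the planning thread.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of non-blocking statistics access: getIfPresent() returns instantly,
// refreshAsync() loads on a background executor (as HudiExecutorModule's
// dedicated executor does in this PR's design).
public class AsyncStatsCache<T>
{
    private final AtomicReference<T> cached = new AtomicReference<>();
    private final ExecutorService executor;
    private final Callable<T> loader;

    public AsyncStatsCache(ExecutorService executor, Callable<T> loader)
    {
        this.executor = executor;
        this.loader = loader;
    }

    // Called from planning: never blocks; empty means "use planner defaults"
    public Optional<T> getIfPresent()
    {
        return Optional.ofNullable(cached.get());
    }

    // Triggers a background load of fresh statistics
    public Future<?> refreshAsync()
    {
        return executor.submit(() -> {
            cached.set(loader.call());
            return null;
        });
    }
}
```
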
- HudiConfig/HudiSessionProperties: add knobs for index types (column stats, partition stats, record-level, secondary), statistics, async timeouts, and column-name casing resolution
- HudiBackgroundSplitLoader: integrate IndexSupportFactory for metadata index-driven file-slice pruning
- HudiSplitFactory: use HudiBaseFile/HudiLogFile abstractions
- HudiSnapshotDirectoryLister: replaces HudiReadOptimizedDirectoryLister
- HudiMetadata: wire async statistics refresh via HudiExecutorModule
- pom.xml: updated Hudi dependency versions
Adds 18 new Hudi test datasets (COW/MOR, partitioned/non-partitioned, multi-field-group, custom keygen, v6/v8 table versions) along with HudiTableUnzipper for loading zip-archived test tables. Expands TestHudiSmokeTest to cover the new table variants including MOR snapshot reads, column-name casing, timestamp keygens, and file-operation counts (alluxio/memory/no-cache).
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla
@ebyhr I tried breaking this PR up into 6 smaller commits, grouping them by the components/features they touch, but they cannot be independent of each other as they are stacked on top of each other. So, I think having them in one PR is the best way to go. Will it be possible for us to iterate on this PR commit-by-commit? I'm open to suggestions. After everything is done, I can squash the commits into one so that the CI doesn't throw any more errors.
Description
This PR significantly expands the capabilities of the Hudi connector in three main areas:
Merge-On-Read (MOR) snapshot reads
Previously the connector could only read the read-optimised view of MOR tables (base Parquet files only). This adds a full snapshot read path using HoodieFileGroupReader that merges base files with delta log files at query time, returning an up-to-date view of the table. TrinoHudiReaderContext and HudiPageSource bridge the Hudi merging infrastructure to Trino pages via bidirectional Avro <-> Trino type conversion (HudiAvroSerializer).

Metadata index-based file and partition pruning
Introduces an extensible HudiIndexSupport strategy hierarchy that uses the Hudi metadata table to skip file slices at split-planning time, without reading data files. Four strategies are implemented (column stats, partition stats, record-level, and secondary indexes) and selected in priority order via IndexSupportFactory. All strategies are independently toggleable via session properties and connector config.
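Whether a given index applies depends on the shape of the query's predicates, which is where the shared TupleDomainUtils helpers come in. Below is an illustrative sketch only: the "tuple domain" is simplified to a map from column to a set of allowed discrete values (null standing in for a range constraint), and the rule shown (key-style lookups need discrete-value constraints) is an assumption for illustration, not the connector's exact applicability logic.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-ins for predicate helpers of the kind TupleDomainUtils
// provides to the index implementations.
public class DomainHelpersSketch
{
    // True if every reference column is constrained at all by the query
    public static boolean areAllFieldsReferenced(Map<String, Set<String>> domains, List<String> referenceColumns)
    {
        return domains.keySet().containsAll(referenceColumns);
    }

    // True if every reference column is constrained to discrete values (= or IN),
    // the shape a key-based index lookup can serve directly; a null entry models
    // a range predicate such as ts > 100, which a key lookup cannot serve
    public static boolean areDomainsInOrEqualOnly(Map<String, Set<String>> domains, List<String> referenceColumns)
    {
        return referenceColumns.stream()
                .allMatch(column -> domains.get(column) != null);
    }
}
```
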
Table statistics for the cost-based optimizer (CBO)
Reads column-level statistics (row counts, null fractions, data sizes) from the Hudi metadata table's COLUMN_STATS partition. Refresh runs asynchronously in the background so statistics never block query planning; a stale-but-close-enough cached value is returned immediately if available.

Additional improvements
- HudiSnapshotDirectoryLister replaces HudiReadOptimizedDirectoryLister, unifying the directory listing path for both COW and MOR tables
- The resolveColumnNameCasing config/session option handles column name mismatches between the Hive metastore schema and the Hudi table schema (common when tables are created by case-sensitive Spark jobs)
- Inline storage support (InlineSeekableDataInputStream, TrinoHudiInlineStorage) enables reading log files embedded via the InLineFS URI scheme
- TupleDomainUtils provides shared predicate-extraction helpers used across index implementations and split generation

Additional context and related issues
Hudi tables come in two storage types: Copy-on-Write (COW) and Merge-on-Read (MOR).
The metadata index pruning strategies require the Hudi metadata table to be enabled on the table (hoodie.metadata.enable=true, the default since Hudi 0.11). When the metadata table or a specific index partition is absent, the connector falls back gracefully to a full listing.

Statistics are similarly gated: if the COLUMN_STATS metadata partition is unavailable, the connector returns empty statistics and the planner uses its defaults.

Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text: