
Conversation

@HuangZhenQiu
Contributor

Describe the issue this Pull Request addresses

Add basic hudi source split reader functionality for MOR

Summary and Changelog

Added HoodieSourceSplitReader, HoodieRecordEmitter, BatchRecords, and related classes
Added test cases for these classes

Impact

None

Risk Level

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Jan 3, 2026
@HuangZhenQiu HuangZhenQiu force-pushed the new-source-split-reader branch 2 times, most recently from c6f4bb2 to faa181f Compare January 3, 2026 17:12
@HuangZhenQiu HuangZhenQiu force-pushed the new-source-split-reader branch from faa181f to 93be7d8 Compare January 3, 2026 17:24
@hudi-bot
Collaborator

hudi-bot commented Jan 3, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@xushiyan xushiyan self-assigned this Jan 5, 2026
@xushiyan xushiyan linked an issue Jan 5, 2026 that may be closed by this pull request
String fileId,
HoodieCDCFileSplit[] changes) {
-    super(splitNum, null, Option.empty(), "", tablePath,
+    super(splitNum, null, Option.empty(), "", tablePath, "",
Member

Is there a constant for the empty-string partition path?

public static MergeOnReadInputSplit singleLogFile2Split(String tablePath, String filePath, long maxCompactionMemoryInBytes) {
return new MergeOnReadInputSplit(0, null, Option.of(Collections.singletonList(filePath)),
-      FSUtils.getDeltaCommitTimeFromLogPath(new StoragePath(filePath)), tablePath, maxCompactionMemoryInBytes,
+      FSUtils.getDeltaCommitTimeFromLogPath(new StoragePath(filePath)), tablePath, "", maxCompactionMemoryInBytes,
Member

using the empty partition path constant makes this more readable
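To illustrate the reviewers' point, here is a minimal, self-contained sketch of using a named empty-partition-path constant instead of a bare `""` literal. The `PartitionPaths` and `Demo` classes are hypothetical; Hudi may already expose a suitable constant (e.g. in its `StringUtils` utility), which should be preferred over introducing a new one.

```java
// Hypothetical holder for the constant; if Hudi already defines an
// empty-string constant, reuse that instead of adding a new one.
public final class PartitionPaths {
  public static final String EMPTY_PARTITION_PATH = "";

  private PartitionPaths() {
  }
}

class Demo {
  // A named constant makes the intent explicit at call sites,
  // compared to passing an anonymous "" argument.
  static String describe(String partitionPath) {
    return PartitionPaths.EMPTY_PARTITION_PATH.equals(partitionPath)
        ? "non-partitioned" : "partitioned";
  }
}
```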

// source merge type
private final String mergeType;
// the latest commit instant time
private final String latestCommit;
Member

Calling it the "latest" commit does not seem appropriate, as there will always be new commits from time to time. This property is more accurately named as_of_instant_time.

* limitations under the License.
*/

package org.apache.hudi.source.reader;
Member

why not under the function subpackage?

// the SourceOperator will stop processing and recycling the fetched batches. This exhausts the
// {@link ArrayPoolDataIteratorBatcher#pool} and the `currentReader.next()` call will be
// blocked even without split-level watermark alignment. Based on this the
// `pauseOrResumeSplits` and the `wakeUp` are left empty.
Member

move this multi-line comment to the method doc and add // no op here


@Override
public Set<String> finishedSplits() {
return finishedSplits;
Member

when will finishedSplits be populated?

Comment on lines +84 to +88
try (HoodieFileGroupReader<RowData> fileGroupReader = createFileGroupReader(split)) {
final ClosableIterator<RowData> recordIterator = fileGroupReader.getClosableIterator();
BatchRecords<RowData> records = BatchRecords.forRecords(splitId, recordIterator, split.getFileOffset(), split.getRecordOffset());
records.seek(split.getRecordOffset());
return records;
Member

fileGroupReader will be closed before the records are consumed by the caller. The fileGroupReader lifecycle should be managed at a higher level.

Collaborator

+1. The fileGroupReader will be closed when closing the ClosableIterator. We can close the iterator in BatchRecords#recycle().
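The fix the reviewers suggest can be sketched as follows. This is a self-contained illustration with hypothetical, simplified types (not Hudi's actual `ClosableIterator` or `BatchRecords`): the iterator stays open until the framework recycles the batch, and closing it in `recycle()` is what releases the underlying file group reader.

```java
import java.util.Iterator;

// Simplified stand-in for Hudi's ClosableIterator: an iterator whose
// close() releases the resources of the reader that produced it.
interface ClosableIterator<T> extends Iterator<T>, AutoCloseable {
  @Override
  void close();
}

class BatchRecords<T> {
  private final ClosableIterator<T> iterator;

  BatchRecords(ClosableIterator<T> iterator) {
    this.iterator = iterator;
  }

  // Returns the next record, or null when the batch is exhausted.
  T nextRecord() {
    return iterator.hasNext() ? iterator.next() : null;
  }

  // Called once the batch is fully consumed; closing the iterator here
  // also releases the underlying file group reader, so the reader is
  // no longer closed before the caller has read the records.
  void recycle() {
    iterator.close();
  }
}
```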

private final HoodieReaderContext<RowData> readerContext;
private final HoodieSchema tableSchema;
private final HoodieSchema requiredSchema;
private final String mergeType;
Collaborator

field not used.

.withShouldUseRecordPosition(true);

// Add schemas if provided
if (tableSchema != null) {
Collaborator

dataSchema and requestedSchema cannot be null for the file group reader.

private final HoodieRecordWithPosition<T> recordAndPosition;

// point to current read position within the records list
private int position;
Collaborator

position value is increased but never accessed.


public void seek(long startingRecordOffset) {
for (long i = 0; i < startingRecordOffset; ++i) {
if (recordIterator.hasNext()) {
Collaborator

If position is necessary, it should also be increased here?
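The reviewer's point can be shown with a small self-contained sketch (the class and field names here are illustrative, not Hudi's): if a `position` field is kept at all, `seek()` must advance it in lockstep with the records it skips, otherwise the bookkeeping silently drifts from the iterator state.

```java
import java.util.Iterator;

class SeekableBatch<T> {
  private final Iterator<T> recordIterator;
  private int position; // index of the next record to be read

  SeekableBatch(Iterator<T> recordIterator) {
    this.recordIterator = recordIterator;
  }

  // Skips up to startingRecordOffset records, advancing position
  // together with the iterator so the two stay consistent.
  void seek(long startingRecordOffset) {
    for (long i = 0; i < startingRecordOffset && recordIterator.hasNext(); ++i) {
      recordIterator.next();
      position++;
    }
  }

  int position() {
    return position;
  }
}
```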

this.logPaths = logPaths;
this.latestCommit = latestCommit;
this.tablePath = tablePath;
this.partitionPath = partitionPath;
Collaborator

partitionPath seems not needed.

Contributor

yes, we can just pass around an empty string when building the file group reader now.

this.fileId = fileId;
}

public String getFileId() {
Collaborator

Unnecessary changes. We use lombok annotations now.

return toString();
}

public String getFileId() {
Collaborator

@cshuo cshuo Jan 6, 2026

Unnecessary changes. We use lombok annotations now.

/**
* Reader function implementation for Merge On Read table.
*/
public class MergeOnReadSplitReaderFunction<I, K, O> implements SplitReaderFunction<RowData> {
Contributor

HoodieSourceSplitReaderFunction ?

Contributor

I didn't see the mini-batch read in this function; is it handled automatically?

// We request a split only if we did not get splits during the checkpoint restore.
// Otherwise, reader restarts will keep requesting more and more splits.
if (getNumberOfCurrentlyAssignedSplits() == 0) {
requestSplit(new ArrayList<>());
Contributor

use Collections.emptyList()
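The rationale behind this suggestion: `Collections.emptyList()` returns a shared immutable singleton, so no list is allocated on each reader restart, and the immutability guards against accidental mutation of the argument. A minimal sketch (the `requestSplit` method here is a hypothetical stand-in mirroring the call shape in the reviewed code):

```java
import java.util.Collections;
import java.util.List;

public class EmptyListDemo {
  static List<String> lastRequest;

  // Hypothetical stand-in for the reader's requestSplit(hostHints) call;
  // it just records the argument it was given.
  static void requestSplit(List<String> hostHints) {
    lastRequest = hostHints;
  }

  public static void main(String[] args) {
    // Collections.emptyList() hands back the same immutable instance
    // every time, avoiding a fresh ArrayList allocation per restart.
    requestSplit(Collections.emptyList());
    System.out.println(lastRequest.isEmpty());
  }
}
```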

try (HoodieFileGroupReader<RowData> fileGroupReader = createFileGroupReader(split)) {
final ClosableIterator<RowData> recordIterator = fileGroupReader.getClosableIterator();
BatchRecords<RowData> records = BatchRecords.forRecords(splitId, recordIterator, split.getFileOffset(), split.getRecordOffset());
records.seek(split.getRecordOffset());
Contributor

Looks like recordOffset and consumed serve the same purpose in HoodieSourceSplit: bookkeeping the offset of the last consumed record. Is it possible to unify these two?


@Override
public RecordsWithSplitIds<HoodieRecordWithPosition<T>> fetch() throws IOException {
HoodieSourceSplit nextSplit = splits.poll();
Member

This removes the split from the queue, right? If the read function errors out, the split will be lost?
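One way to address this concern is to peek at the head of the queue and only remove the split after the read succeeds, so a failure leaves it queued for retry. A self-contained sketch with illustrative names (not Hudi's actual fetch() implementation):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Function;

class SplitQueue<S> {
  private final Queue<S> splits = new ArrayDeque<>();

  void add(S split) {
    splits.add(split);
  }

  // Reads the head split without removing it first; if readFn throws,
  // the split remains in the queue and can be retried.
  <R> R fetchWith(Function<S, R> readFn) {
    S next = splits.peek();        // do not remove yet
    if (next == null) {
      return null;
    }
    R result = readFn.apply(next); // may throw; split stays queued
    splits.poll();                 // success: now it is safe to remove
    return result;
  }

  int size() {
    return splits.size();
  }
}
```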


@Override
public void close() throws Exception {
currentSplitId = null;
Member

@xushiyan xushiyan Jan 6, 2026

How about cleaning up the other properties?
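A self-contained sketch of what a fuller cleanup could look like (all names here are illustrative stand-ins, not Hudi's actual fields): close() releases every resource the reader holds, not just the split id.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Stand-in for the real reader type; close() just records the call.
class StubReader {
  boolean closed = false;

  void close() {
    closed = true;
  }
}

public class HoodieSplitReaderSketch {
  String currentSplitId = "split-0";
  StubReader currentReader = new StubReader();
  Deque<String> splits = new ArrayDeque<>(List.of("s1"));

  void close() {
    currentSplitId = null;
    if (currentReader != null) {
      currentReader.close(); // release the underlying file handles
      currentReader = null;
    }
    splits.clear(); // drop any pending splits
  }
}
```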



Development

Successfully merging this pull request may close these issues.

Create Hudi Source Split Reader

5 participants