feat: the basic new hudi source reader #17773
base: master
Conversation
|     String fileId,
|     HoodieCDCFileSplit[] changes) {
| -   super(splitNum, null, Option.empty(), "", tablePath,
| +   super(splitNum, null, Option.empty(), "", tablePath, "",
Is there a constant for the empty string partition path?
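For example, a minimal sketch assuming org.apache.hudi.common.util.StringUtils.EMPTY_STRING is the constant meant here:

| // assumption: StringUtils.EMPTY_STRING is the intended constant; a named
| // constant reads better than a bare "" for the partition path argument
| super(splitNum, null, Option.empty(), StringUtils.EMPTY_STRING, tablePath, StringUtils.EMPTY_STRING,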
|   public static MergeOnReadInputSplit singleLogFile2Split(String tablePath, String filePath, long maxCompactionMemoryInBytes) {
|     return new MergeOnReadInputSplit(0, null, Option.of(Collections.singletonList(filePath)),
| -       FSUtils.getDeltaCommitTimeFromLogPath(new StoragePath(filePath)), tablePath, maxCompactionMemoryInBytes,
| +       FSUtils.getDeltaCommitTimeFromLogPath(new StoragePath(filePath)), tablePath, "", maxCompactionMemoryInBytes,
Using the empty partition path constant would make this more readable.
| // source merge type
| private final String mergeType;
| // the latest commit instant time
| private final String latestCommit;
Calling it the "latest" commit does not seem appropriate, as there will always be new commits from time to time. This property is more accurately named as_of_instant_time.
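A sketch of the suggested rename (the comment wording is an assumption):

| // the instant time this read is pinned to; commits after this instant
| // are not visible to the reader
| private final String asOfInstantTime;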
|  * limitations under the License.
|  */
|
| package org.apache.hudi.source.reader;
why not under the function subpackage?
| // the SourceOperator will stop processing and recycling the fetched batches. This exhausts the
| // {@link ArrayPoolDataIteratorBatcher#pool} and the `currentReader.next()` call will be
| // blocked even without split-level watermark alignment. Based on this the
| // `pauseOrResumeSplits` and the `wakeUp` are left empty.
Move this multi-line comment to the method doc and add // no-op here.
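A sketch of the suggested shape, assuming the method signatures follow Flink's SplitReader interface:

| /**
|  * The SourceOperator will stop processing and recycling the fetched batches. This
|  * exhausts the {@link ArrayPoolDataIteratorBatcher#pool} and the {@code currentReader.next()}
|  * call will be blocked even without split-level watermark alignment. Based on this,
|  * {@code pauseOrResumeSplits} and {@code wakeUp} are left as no-ops.
|  */
| @Override
| public void pauseOrResumeSplits(Collection<HoodieSourceSplit> splitsToPause,
|                                 Collection<HoodieSourceSplit> splitsToResume) {
|   // no-op: back-pressure from the exhausted batch pool already pauses the reader
| }
|
| @Override
| public void wakeUp() {
|   // no-op
| }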
|
| @Override
| public Set<String> finishedSplits() {
|   return finishedSplits;
when will finishedSplits be populated?
| try (HoodieFileGroupReader<RowData> fileGroupReader = createFileGroupReader(split)) {
|   final ClosableIterator<RowData> recordIterator = fileGroupReader.getClosableIterator();
|   BatchRecords<RowData> records = BatchRecords.forRecords(splitId, recordIterator, split.getFileOffset(), split.getRecordOffset());
|   records.seek(split.getRecordOffset());
|   return records;
fileGroupReader will be closed before the records are consumed by the caller. The fileGroupReader lifecycle should be managed at a higher level.
+1. The fileGroupReader will be closed when closing the ClosableIterator. We can close the iterator in BatchRecords#recycle().
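A sketch of that idea, assuming BatchRecords keeps a reference to the iterator it was created with:

| @Override
| public void recycle() {
|   // closing the iterator also closes the underlying HoodieFileGroupReader, so the
|   // reader stays open until the caller has fully consumed this batch
|   if (recordIterator != null) {
|     recordIterator.close();
|     recordIterator = null;
|   }
| }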
| private final HoodieReaderContext<RowData> readerContext;
| private final HoodieSchema tableSchema;
| private final HoodieSchema requiredSchema;
| private final String mergeType;
field not used.
|   .withShouldUseRecordPosition(true);
|
| // Add schemas if provided
| if (tableSchema != null) {
dataSchema and requestedSchema cannot be null for the file group reader.
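If so, a fail-fast precondition would be clearer than the conditional; a sketch using Hudi's ValidationUtils.checkArgument:

| // both schemas are mandatory for the file group reader, so fail fast instead
| // of silently skipping them when null
| ValidationUtils.checkArgument(tableSchema != null, "tableSchema is required by the file group reader");
| ValidationUtils.checkArgument(requiredSchema != null, "requiredSchema is required by the file group reader");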
| private final HoodieRecordWithPosition<T> recordAndPosition;
|
| // point to current read position within the records list
| private int position;
position value is increased but never accessed.
|
| public void seek(long startingRecordOffset) {
|   for (long i = 0; i < startingRecordOffset; ++i) {
|     if (recordIterator.hasNext()) {
If position is necessary, shouldn't it also be increased here?
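If the field stays, a sketch of seek() that keeps it in step (the handling of an exhausted iterator is an assumption):

| public void seek(long startingRecordOffset) {
|   for (long i = 0; i < startingRecordOffset; ++i) {
|     if (recordIterator.hasNext()) {
|       recordIterator.next();
|       // keep position consistent with the records actually skipped
|       position++;
|     } else {
|       throw new IllegalStateException("Cannot seek to record offset " + startingRecordOffset);
|     }
|   }
| }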
| this.logPaths = logPaths;
| this.latestCommit = latestCommit;
| this.tablePath = tablePath;
| this.partitionPath = partitionPath;
partitionPath does not seem to be needed.
yes, we can just pass around an empty string when building the file group reader now.
| this.fileId = fileId;
| }
|
| public String getFileId() {
Unnecessary changes. We use Lombok annotations now.
|   return toString();
| }
|
| public String getFileId() {
Unnecessary changes. We use Lombok annotations now.
| /**
|  * Reader function implementation for Merge On Read table.
|  */
| public class MergeOnReadSplitReaderFunction<I, K, O> implements SplitReaderFunction<RowData> {
HoodieSourceSplitReaderFunction?
I didn't see the mini-batch read in this function; is it handled automatically?
| // We request a split only if we did not get splits during the checkpoint restore.
| // Otherwise, reader restarts will keep requesting more and more splits.
| if (getNumberOfCurrentlyAssignedSplits() == 0) {
|   requestSplit(new ArrayList<>());
use Collections.emptyList()
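i.e. a one-line sketch, assuming the argument is never mutated downstream:

| requestSplit(Collections.emptyList());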
| try (HoodieFileGroupReader<RowData> fileGroupReader = createFileGroupReader(split)) {
|   final ClosableIterator<RowData> recordIterator = fileGroupReader.getClosableIterator();
|   BatchRecords<RowData> records = BatchRecords.forRecords(splitId, recordIterator, split.getFileOffset(), split.getRecordOffset());
|   records.seek(split.getRecordOffset());
Looks like recordOffset and consumed provide the same functionality in HoodieSourceSplit: bookkeeping the offset of the last consumed record. Is it possible to unify these two?
|
| @Override
| public RecordsWithSplitIds<HoodieRecordWithPosition<T>> fetch() throws IOException {
|   HoodieSourceSplit nextSplit = splits.poll();
This removes the split from the queue, right? If the read function errors out, the split will be lost?
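One way to avoid that: only remove the split after a batch was produced. A sketch, where readSplit is a hypothetical stand-in for the existing read path:

| @Override
| public RecordsWithSplitIds<HoodieRecordWithPosition<T>> fetch() throws IOException {
|   // peek instead of poll so the split survives a failed read
|   HoodieSourceSplit nextSplit = splits.peek();
|   if (nextSplit == null) {
|     throw new IOException("No split remaining to read");
|   }
|   RecordsWithSplitIds<HoodieRecordWithPosition<T>> batch = readSplit(nextSplit); // hypothetical helper
|   // drop the split from the queue only after the read succeeded
|   splits.poll();
|   return batch;
| }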
|
| @Override
| public void close() throws Exception {
|   currentSplitId = null;
How about cleaning up the other properties?
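For instance (field names besides currentSplitId are assumptions about this reader's state):

| @Override
| public void close() throws Exception {
|   currentSplitId = null;
|   // also release resources the reader may still hold
|   if (currentReader != null) {
|     currentReader.close();
|     currentReader = null;
|   }
|   splits.clear();
| }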
Describe the issue this Pull Request addresses
Add basic Hudi source split reader functionality for MOR tables.
Summary and Changelog
Added HoodieSourceSplitReader, HoodieRecordEmitter, BatchRecords, and related classes
Added test cases for these classes
Impact
None
Risk Level
None
Documentation Update
None
Contributor's checklist