-
Notifications
You must be signed in to change notification settings - Fork 70
feat: Introduce ArrowArrayReader factory on FileScanTask #200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
05e9e40 to
cbb923f
Compare
5e4a761 to
cf2a0ea
Compare
|
@wgtmac I've updated the code to address the feedback from the review. Thanks! |
Have you pushed your changes? I don't see any change since last review. |
zeroshade
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one question really
| /// \brief A reader interface that returns ArrowArray in a streaming fashion. | ||
| class ICEBERG_EXPORT ArrowArrayReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ArrowArray instead of RecordBatches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeroshade Could you help review #214 which returns ArrowArrayStream?
|
|
||
| int64_t FileScanTask::estimated_row_count() const { return data_file_->record_count; } | ||
|
|
||
| Result<std::unique_ptr<ArrowArrayReader>> FileScanTask::ToArrowArrayReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't a FileScanTask return a stream of RecordBatch? Not ArrowArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have two libraries, one is libiceberg and the other is libiceberg-bundle. For libiceberg, it only depends on Arrow C ABI so it cannot have access to ::arrow::RecordBatch. An alternative is to use ArrowArrayStream but it requires more work so I'm not sure if it worths the effort at this moment. Since the API is not stable before 1.0.0, we can change our mind at any time for any better idea.
|
Close this PR in favor of #214 |
ArrowArrayReaderInterface: Added a new abstract interface,ArrowArrayReader, to represent a stream of Arrow data. This provides a unified interface for all file format readers.FileScanTask::ToArrowArrayReader()Factory Method: The FileScanTask now has a factory method that creates a file-format-specific reader (e.g., Parquet) and returns it as aunique_ptr<ArrowArrayReader>. This encapsulates the reader instantiation logic.