@HeartLinked
Contributor

@HeartLinked HeartLinked commented Aug 29, 2025

  • Introduced ArrowArrayReader Interface: Added a new abstract interface, ArrowArrayReader, to represent a stream of Arrow data. This provides a unified interface for all file format readers.
  • Added FileScanTask::ToArrowArrayReader() Factory Method: The FileScanTask now has a factory method that creates a file-format-specific reader (e.g., Parquet) and returns it as a unique_ptr<ArrowArrayReader>. This encapsulates the reader instantiation logic.
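A minimal sketch of the pattern described above, for illustration only. The thread shows just the `ArrowArrayReader` class declaration, so the `Next()` method, the `FakeReader` class, the `FileFormat` enum, and the `MakeReader()` factory are all hypothetical names, not the PR's actual API. The `ArrowArray` struct is copied from the Arrow C Data Interface so the example compiles without libarrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Struct layout from the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

// Hypothetical streaming interface: one ArrowArray per call.
// The method name and signature are assumptions, not taken from the PR.
class ArrowArrayReader {
 public:
  virtual ~ArrowArrayReader() = default;
  // Returns the next batch, or std::nullopt once the stream is exhausted.
  virtual std::optional<ArrowArray> Next() = 0;
};

// Toy reader that emits empty-bodied arrays with preset lengths.
// A real reader would fill buffers and children from the file.
class FakeReader : public ArrowArrayReader {
 public:
  explicit FakeReader(std::vector<int64_t> lengths) : lengths_(std::move(lengths)) {}

  std::optional<ArrowArray> Next() override {
    if (pos_ >= lengths_.size()) return std::nullopt;
    ArrowArray array{};  // zero-initialize every field
    array.length = lengths_[pos_++];
    // Per the C Data Interface, release marks the array as released
    // by setting its own release pointer to null.
    array.release = [](ArrowArray* a) { a->release = nullptr; };
    return array;
  }

 private:
  std::vector<int64_t> lengths_;
  std::size_t pos_ = 0;
};

// Hypothetical stand-in for the file format recorded on a data file.
enum class FileFormat { kParquet, kAvro };

// Hypothetical factory: callers only ever see the abstract interface.
std::unique_ptr<ArrowArrayReader> MakeReader(FileFormat format) {
  switch (format) {
    case FileFormat::kParquet:
      return std::make_unique<FakeReader>(std::vector<int64_t>{3, 2});
    case FileFormat::kAvro:
      return std::make_unique<FakeReader>(std::vector<int64_t>{});
  }
  return nullptr;
}

// Consumer code is format-agnostic: it drains batches and releases each one.
int64_t CountRows(ArrowArrayReader& reader) {
  int64_t total = 0;
  while (auto batch = reader.Next()) {
    total += batch->length;
    if (batch->release != nullptr) batch->release(&*batch);
  }
  return total;
}
```

The point of the factory is that the format-specific reader type never leaks into the caller; `CountRows` works unchanged for any reader that implements the interface.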

@HeartLinked HeartLinked marked this pull request as ready for review August 29, 2025 10:04
@HeartLinked
Contributor Author

@wgtmac I've updated the code to address the feedback from the review. Thanks!

@wgtmac
Member

wgtmac commented Sep 2, 2025

> @wgtmac I've updated the code to address the feedback from the review. Thanks!

Have you pushed your changes? I don't see any change since last review.

@wgtmac wgtmac mentioned this pull request Sep 3, 2025
45 tasks
Member

@zeroshade zeroshade left a comment

Just one question really

Comment on lines +30 to +31
/// \brief A reader interface that returns ArrowArray in a streaming fashion.
class ICEBERG_EXPORT ArrowArrayReader {
Member

Why ArrowArray instead of RecordBatches?

Member

@zeroshade Could you help review #214 which returns ArrowArrayStream?


int64_t FileScanTask::estimated_row_count() const { return data_file_->record_count; }

Result<std::unique_ptr<ArrowArrayReader>> FileScanTask::ToArrowArrayReader(
Member

Shouldn't a FileScanTask return a stream of RecordBatch? Not ArrowArray?

Member

We have two libraries: libiceberg and libiceberg-bundle. libiceberg depends only on the Arrow C ABI, so it cannot access ::arrow::RecordBatch. An alternative is to use ArrowArrayStream, but that requires more work, so I'm not sure it is worth the effort at this moment. Since the API is not stable before 1.0.0, we can change our minds at any time if a better idea comes up.
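For context on the ArrowArrayStream alternative mentioned here, the following is a rough sketch of what exposing a stream through the Arrow C stream interface involves, using only the C ABI structs and no libarrow types. The struct layouts follow the Arrow C Data and C Stream interface specifications; `CountdownState`, `InitCountdownStream`, and `DrainStream` are hypothetical names for illustration, and nothing here reflects how #214 actually implements it. The emitted arrays are placeholders with no buffers, so they are not valid Arrow data.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

struct ArrowSchema;  // left opaque; this toy stream never produces a schema

// Struct layout from the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

// Struct layout from the Arrow C Stream Interface specification.
struct ArrowArrayStream {
  int (*get_schema)(ArrowArrayStream*, ArrowSchema* out);
  int (*get_next)(ArrowArrayStream*, ArrowArray* out);
  const char* (*get_last_error)(ArrowArrayStream*);
  void (*release)(ArrowArrayStream*);
  void* private_data;
};

// Hypothetical producer state: how many placeholder batches remain.
struct CountdownState {
  int64_t remaining_batches;
};

static int GetSchema(ArrowArrayStream*, ArrowSchema*) {
  return ENOTSUP;  // a real producer would export its schema here
}

static int GetNext(ArrowArrayStream* stream, ArrowArray* out) {
  auto* state = static_cast<CountdownState*>(stream->private_data);
  *out = ArrowArray{};  // release == nullptr with return 0 means end of stream
  if (state->remaining_batches == 0) return 0;
  --state->remaining_batches;
  out->length = 4;  // placeholder row count; no buffers are attached
  out->release = [](ArrowArray* a) { a->release = nullptr; };
  return 0;
}

static const char* GetLastError(ArrowArrayStream*) { return nullptr; }

static void ReleaseStream(ArrowArrayStream* stream) {
  delete static_cast<CountdownState*>(stream->private_data);
  stream->release = nullptr;  // marks the stream as released
}

// Fills the callbacks of a caller-owned ArrowArrayStream.
void InitCountdownStream(ArrowArrayStream* out, int64_t batches) {
  out->get_schema = GetSchema;
  out->get_next = GetNext;
  out->get_last_error = GetLastError;
  out->release = ReleaseStream;
  out->private_data = new CountdownState{batches};
}

// Consumer side: pull until an array arrives with a null release callback.
int64_t DrainStream(ArrowArrayStream* stream) {
  int64_t rows = 0;
  ArrowArray batch{};
  while (stream->get_next(stream, &batch) == 0 && batch.release != nullptr) {
    rows += batch.length;
    batch.release(&batch);
  }
  stream->release(stream);
  return rows;
}
```

The extra work relative to a plain C++ reader class is visible here: the producer has to manage callbacks, private state, error strings, and release semantics by hand, which is presumably part of the trade-off being weighed in this comment.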

@wgtmac
Copy link
Member

wgtmac commented Sep 9, 2025

Close this PR in favor of #214

@wgtmac wgtmac closed this Sep 9, 2025