-
Notifications
You must be signed in to change notification settings - Fork 70
feat: implement avro file reader #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ee7cdbf to
1f1885e
Compare
|
cc @mapleFU would you like to take another look? |
|
Personally I prefer set a internal |
|
@mapleFU Sounds good. I think the current implementation aims to provide a minimal Avro reader implementation. We can revisit this once we have complete Avro and Parquet reader impls and make them robust. |
|
We can just move forward fast and leave this as a minor todo |
|
|
||
| // Open the input stream and adapt to the avro interface. | ||
| // TODO(gangwu): make this configurable | ||
| constexpr int64_t kDefaultBufferSize = 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the TODO, we can also defer this to a later PR. Implementations aim to create manifest files of ~8MB, and performance-wise wise it is best to read them all the way directly. The manifest list can be unbounded (theoretically).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. I just blindly chose a default value for now.
| HasIdVisitor has_id_visitor; | ||
| ICEBERG_RETURN_UNEXPECTED(has_id_visitor.Visit(file_schema)); | ||
| if (has_id_visitor.HasNoIds()) { | ||
| // TODO(gangwu): support applying field-ids based on name mapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep in mind that name-mapping only applies to data-files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we just need to pass NameMapping object (if available) via ReaderOptions to read data files. For manifest and manifest list files, it should error out when field ids are missing.
| Status DeleteFile(const std::string& file_location) override; | ||
|
|
||
| /// \brief Get the Arrow file system. | ||
| const std::shared_ptr<::arrow::fs::FileSystem>& fs() const { return arrow_fs_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko We have two libraries: libiceberg and libiceberg-bundle:
- libiceberg: it only uses interfaces of
FileIOandFileReader, downstream projects should provide their own implementations. - libiceberg-bundle: it uses
ArrowFileSystemFileIOas the implementation and both Avro and Parquet reader implementations assume that arrow::FileSystem is available.
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Refactor a little bit of Reader and ReaderFactory interfaces. - Implement the skeleton of Avro reader to read data into ArrowArray. - `AppendDatumToBuilder` is not implemented yet.
AppendDatumToBuilderis not implemented yet.