feat: add manifest list reader #143
dongxiao1198
commented
Jul 8, 2025
- Add manifest list reader
- Integrate with avro reader
- Add a simple unit test
lidavidm
left a comment
No major comments
Maybe we want to evaluate Nanoarrow eventually? As it has these helpers
std::string test_dir_prefix = "/tmp/db/db/iceberg_test/metadata/";
for (const auto& file : read_result.value()) {
  auto manifest_path = file.manifest_path.substr(test_dir_prefix.size());
  if (manifest_path == "2bccd69e-d642-4816-bba0-261cd9bd0d93-m0.avro") {
Maybe we also need to assert that we actually see each of these manifests exactly once?
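One possible shape for such a check, as a rough sketch only: count how often each manifest file name appears and require a count of exactly one for every expected name. `ManifestFileStub` and `SeenExactlyOnce` are hypothetical names for illustration and are not part of this PR; only the `manifest_path` field is taken from the test above.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real ManifestFile; only the field used here.
struct ManifestFileStub {
  std::string manifest_path;
};

// Returns true when every expected file name appears exactly once among the
// manifests that were read and no unexpected manifest is present.
bool SeenExactlyOnce(const std::vector<ManifestFileStub>& files,
                     const std::string& dir_prefix,
                     const std::vector<std::string>& expected_names) {
  std::unordered_map<std::string, int> counts;
  for (const auto& file : files) {
    // Strip the directory prefix when present so only file names are compared.
    std::string name = file.manifest_path;
    if (name.compare(0, dir_prefix.size(), dir_prefix) == 0) {
      name = name.substr(dir_prefix.size());
    }
    ++counts[name];
  }
  // A different number of distinct names means something is missing or extra.
  if (counts.size() != expected_names.size()) return false;
  for (const auto& expected : expected_names) {
    auto it = counts.find(expected);
    if (it == counts.end() || it->second != 1) return false;
  }
  return true;
}
```

In the unit test this could replace the per-file `if` chain for the "seen once" part, while the field-level comparisons stay as they are.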
I will implement `operator==` and a validator for the manifest list in my next PR so that the test can check the file contents.
#define PARSE_PRIMITIVE_FIELD(item, type) \
  for (size_t row_idx = 0; row_idx < view_of_column->length; row_idx++) { \
    if (!ArrowArrayViewIsNull(view_of_column, row_idx)) { \
There is an issue with regard to the definition of `ManifestFile`. If a field is optional, we need to choose one of the options below:
- Define it as `std::optional<T>` in the `ManifestFile`, then we don't need to do anything if the read value is null. For example, `added_files_count` is defined in this way.
- Define it as `T`, then we need to assign a default value depending on its meaning when the read value is null. For example, `content` and `sequence_number` are defined in this way.

I think we need to address this to avoid any potential headache in the future. We can fix this in a followup PR.
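To make the two options concrete, here is a minimal sketch. `ManifestFileSketch`, the two assign helpers, and the default of `0` are assumptions for illustration; they are not the actual `ManifestFile` definition or the defaults this PR uses.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, trimmed-down struct; not the real ManifestFile definition.
struct ManifestFileSketch {
  std::optional<int32_t> added_files_count;  // option 1: a null stays null
  int64_t sequence_number = 0;               // option 2: a null becomes a default
};

// Assumed default, chosen for illustration only.
constexpr int64_t kAssumedDefaultSequenceNumber = 0;

// Option 1: the reader copies the possibly-null value through unchanged.
void AssignAddedFilesCount(ManifestFileSketch& m, std::optional<int32_t> read) {
  m.added_files_count = read;
}

// Option 2: the reader must decide what a null means for this field.
void AssignSequenceNumber(ManifestFileSketch& m, std::optional<int64_t> read) {
  m.sequence_number = read.value_or(kAssumedDefaultSequenceNumber);
}
```

With option 1 the caller can still tell "absent" apart from "zero"; with option 2 that distinction is lost at read time, which is why the default has to be chosen per field.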
Maybe it would be better to specify the version of the manifest and manifest list (v1, v2, or v3) so that the schema of the file can be validated.
To address the issue of compatibility between versions, I understand that there are two approaches:
- The io layer is only responsible for executing deserialization accurately without making excessive assumptions. Compatibility is handled by the upper layer, as it has more information to assist in interpreting missing fields, such as the table format version recorded in the table metadata.
- During the underlying deserialization, supplement missing fields so that the difference is transparent to the upper layer. This requires that new fields all have reasonable default values, so that the upper layer always sees the latest version of the metadata.

Both approaches require that the filling in of missing values be concentrated in one place as far as possible, so that each piece of logic that consumes a field does not have to handle compatibility on its own.
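As a rough illustration of keeping that logic in one place: the reader could return a raw record with missing fields left empty, and a single function could turn it into the record that consumers see. Every name below (`RawManifestEntry`, `ResolvedManifestEntry`, `ResolveMissingFields`) and both default values are assumptions for the sketch, not code from this PR.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical raw record as the io layer might return it:
// fields missing from the file stay empty.
struct RawManifestEntry {
  std::optional<int32_t> content;
  std::optional<int64_t> sequence_number;
};

// Hypothetical resolved record that consumers see: no optionals left.
struct ResolvedManifestEntry {
  int32_t content;
  int64_t sequence_number;
};

// The single place where missing values are supplied. The defaults of 0 are
// assumed for illustration; whichever layer owns this function, consumers
// never have to repeat the compatibility handling.
ResolvedManifestEntry ResolveMissingFields(const RawManifestEntry& raw) {
  return ResolvedManifestEntry{raw.content.value_or(0),
                               raw.sequence_number.value_or(0)};
}
```

Under the first approach this function would live in the upper layer and could take the table format version as an extra input; under the second it would be called inside the reader before results are returned.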
Xuanwo
left a comment
Thank you @dongxiao1198 for working on this!