Skip to content

Conversation

@dongxiao1198
Copy link
Contributor

  • Add manifest list reader
  • Integrate with avro reader
  • Add simple ut

xiao.dong added 4 commits July 8, 2025 11:24
- Add manifest list reader
- Integrate with avro reader
- Add simple ut
@wgtmac
Copy link
Member

wgtmac commented Jul 14, 2025

Could you help review this? @lidavidm @zhjwpku

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No major comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we want to evaluate Nanoarrow eventually? As it has these helpers

std::string test_dir_prefix = "/tmp/db/db/iceberg_test/metadata/";
for (const auto& file : read_result.value()) {
auto manifest_path = file.manifest_path.substr(test_dir_prefix.size());
if (manifest_path == "2bccd69e-d642-4816-bba0-261cd9bd0d93-m0.avro") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also need to assert that we actually see each of these manifests exactly once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will implement the operator== and validator of manifest list in my next pr to check the file


#define PARSE_PRIMITIVE_FIELD(item, type) \
for (size_t row_idx = 0; row_idx < view_of_column->length; row_idx++) { \
if (!ArrowArrayViewIsNull(view_of_column, row_idx)) { \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an issue with regard to the definition of ManifestFile. If a field is optional, we need to choose from the below options:

  1. Define it as std::optional<T> in the ManifestFile, then we don't need to do anything if the read value is null. For example, added_files_count is defined in this way.
  2. Define it as T, then we need to assign a default value depending on its meaning when the read value is null. For example, content and sequence_number are defined in this way.

I think we need to address this to avoid any potential headache in the future. We can fix this in a followup PR.

Any suggestion? @lidavidm @zhjwpku @gty404 @mapleFU

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe specific the version of manifest&manifest list in v1|v2|v3 is better to valid the schema of file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address the issue of compatibility between versions, I understand that there are two approaches:

  1. The io layer is only responsible for executing deserialization accurately without making excessive assumptions. Compatibility is handled by the upper layer, as it has more information to assist in interpreting missing fields, such as the table format version existing in the table metadata.

  2. During the underlying deserialization, supplement missing fields to be transparent to the upper layer. This requires that new fields all have reasonable default values, so that the upper layer always sees the latest version of the meta.

Both approaches require that the supplementation of missing values be as converged as possible to the same place, to avoid each logic that consumes the field needing to handle compatibility issues.

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dongxiao1198 for working on this!

@Xuanwo Xuanwo merged commit 84565b5 into apache:main Jul 17, 2025
7 checks passed
@dongxiao1198 dongxiao1198 deleted the support_manifest_list_reader branch July 17, 2025 07:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants